Posted by Anon84 19 hours ago
Some of the claims about models training on their own data, in their enthusiasm to frame it as a failure, went further and suggested that it magnified biases. I had my doubts about those conclusions. If it were true, it would be a much greater breakthrough, because the ability to magnify a property implies a way to detect a weak version of that property. That in turn would mean they had found a way to provide a training signal for avoiding bias. It would be great if that's what they did, but I suspect there would have been more news about it.
Perhaps this paper will put to rest the notion that AI output is useless as training data. It was only ever useless as an indiscriminate source of data.
So you prompt the base model for an answer and then rerun the prompt with the answer from the first run?
They use self-distillation to shift the output distribution of the model towards that of the same model, but running with different temperature/truncation settings in sampling.
This effectively "folds" the logit tail truncation behavior into the model itself.
In what it does, it's not entirely unlike a few "model-controlled sampling settings" approaches I've seen, but different in execution.
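To make "folding the truncation into the model" concrete, here's a minimal NumPy sketch, not the paper's actual method: the `truncated_target` function, the `top_p` value, and the toy logits are all invented for illustration. The teacher distribution is the same model's output at a different temperature with the tail cut off, and self-distillation would pull the raw distribution toward it.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over a logit vector.
    z = logits / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def truncated_target(logits, temperature=0.7, top_p=0.9):
    # Hypothetical teacher: same logits, lower temperature,
    # tail removed by nucleus (top-p) truncation.
    p = softmax(logits, temperature)
    order = np.argsort(p)[::-1]
    cum = np.cumsum(p[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]
    q = np.zeros_like(p)
    q[keep] = p[keep]
    return q / q.sum()

logits = np.array([3.0, 1.0, 0.5, -2.0])  # invented toy logits
student = softmax(logits)                 # model's raw distribution
teacher = truncated_target(logits)        # same model, different sampling settings
# Self-distillation would train the student toward the teacher, e.g. with a
# KL-style loss, so the truncation behavior ends up baked into the weights.
kl = np.sum(teacher[teacher > 0] * np.log(teacher[teacher > 0] / student[teacher > 0]))
```

The point of the sketch: after enough training steps toward such targets, the model's raw distribution already looks truncated, so you no longer need the sampler to do it.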
You use the outputs from the first run (right or wrong) as answers for the second training run, and repeat. Magically it works. That's what's so surprising.
I guess one theory is that there are so many diverse ways to be wrong that the errors don't accumulate... still seems surprising, and it would be interesting to see if it works in other domains.
It's the first thing anyone would think of (like a self-hosted compiler) but everything I've read said "it doesn't work."
EDIT: For context:
> Shumailov et al., "AI models collapse when trained on recursively generated data" (Nature, 2024)

Their hypothesis as to why this works requires a bit more knowledge about model architecture, but basically: when a model generates code, some positions have only one right answer and some have many valid options, yet the model has to use one global confidence setting for both. Sampling with a specific temperature plus a garbage-token filter, then training on those outputs, teaches the model to internalize "be precise where there's one answer, stay open-minded where there are several", without anyone labeling which is which.
Note that there's a lot more nuance to this and I simplified a lot.
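A toy sketch of the "one global setting, two kinds of positions" point, with logits invented purely for illustration: the same nucleus (top-p) filter keeps a single token at a constrained position but several at an open-ended one, which is the distinction the training supposedly teaches the model to internalize.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def nucleus_mask(p, top_p=0.9):
    # Keep the smallest set of tokens whose probability mass reaches top_p.
    order = np.argsort(p)[::-1]
    cum = np.cumsum(p[order])
    k = np.searchsorted(cum, top_p) + 1
    mask = np.zeros_like(p, dtype=bool)
    mask[order[:k]] = True
    return mask

# Position A: syntax-like, one clearly right token (e.g. a closing paren).
constrained = softmax(np.array([6.0, 1.0, 0.5, 0.2]))
# Position B: naming-like, several roughly equally valid tokens.
open_ended = softmax(np.array([2.0, 1.9, 1.8, -3.0]))

nucleus_mask(constrained).sum()  # 1 token survives the filter
nucleus_mask(open_ended).sum()   # 3 tokens survive the filter
```

One fixed `top_p` behaves very differently at the two positions, so training on its outputs carries per-position information that a single global temperature alone can't express.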
You teach the machine by asking it to solve some problems, and then, whatever answer it gives, you say "That's exactly right. Now we train on those answers YOU just gave me" (even if they are wrong) and repeat. Somehow THAT works over time.
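That loop can be sketched as a toy self-reinforcing process (everything here is invented for illustration, nothing from the paper): the "model" is just a weight table, "training" reinforces whatever answer was sampled with no correctness check, and the loop is run repeatedly.

```python
import random

random.seed(0)

def generate(model, prompt):
    # Sample an answer in proportion to the model's current weights.
    answers = list(model[prompt])
    weights = [model[prompt][a] for a in answers]
    return random.choices(answers, weights=weights)[0]

def train(model, pairs, lr=1.0):
    # "Training" just reinforces whatever was sampled -- right or wrong.
    for prompt, answer in pairs:
        model[prompt][answer] += lr
    return model

def self_training_loop(model, prompts, rounds=50):
    for _ in range(rounds):
        pairs = [(p, generate(model, p)) for p in prompts]
        model = train(model, pairs)
    return model

model = {"2+2=": {"4": 2.0, "5": 1.0, "22": 1.0}}  # invented toy "model"
model = self_training_loop(model, ["2+2="])
```

In this toy, the loop tends to amplify whatever the model already preferred, which is exactly why it's surprising that the real version improves rather than collapses.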
You can also generate training answers by exploring variations in the length of the generated code.