Posted by Anon84 19 hours ago
Some of the claims about models training on their own data, in their enthusiasm to frame it as a failure, went further and suggested that it magnified biases. I had my doubts about those conclusions. If it were true, it would be a much greater breakthrough, because the ability to magnify a property implies a way to detect a weak version of that property. That in turn would mean they had found a way to provide a training signal for avoiding bias. It would be great if that's what they did, but I suspect there would have been more news about it.
Perhaps this paper will put to rest the notion that AI output is useless as training data. It was only ever useless as an indiscriminate source of data.
So you prompt the base model for an answer and then rerun the prompt with the answer from the first run?
They use self-distillation to shift the output distribution of the model towards that of the same model, but running with different temperature/truncation settings in sampling.
This effectively "folds" the logit tail truncation behavior into the model itself.
In what it does, it's not entirely unlike a few "model-controlled sampling settings" approaches I've seen, but different in execution.
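To make "folding the truncation into the model" concrete, here's a minimal NumPy sketch, not the paper's actual method: the `truncated_target` function, the `top_p` value, and the toy logits are all invented for illustration. The teacher distribution is the same model's output at a different temperature with the tail cut off, and self-distillation would pull the raw distribution toward it.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over a logit vector.
    z = logits / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def truncated_target(logits, temperature=0.7, top_p=0.9):
    # Hypothetical teacher: same logits, lower temperature,
    # tail removed by nucleus (top-p) truncation.
    p = softmax(logits, temperature)
    order = np.argsort(p)[::-1]
    cum = np.cumsum(p[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]
    q = np.zeros_like(p)
    q[keep] = p[keep]
    return q / q.sum()

logits = np.array([3.0, 1.0, 0.5, -2.0])  # invented toy logits
student = softmax(logits)                 # model's raw distribution
teacher = truncated_target(logits)        # same model, different sampling settings
# Self-distillation would train the student toward the teacher, e.g. with a
# KL-style loss, so the truncation behavior ends up baked into the weights.
kl = np.sum(teacher[teacher > 0] * np.log(teacher[teacher > 0] / student[teacher > 0]))
```

The point of the sketch: after enough training steps toward such targets, the model's raw distribution already looks truncated, so you no longer need the sampler to do it.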
You use the outputs from the first run (right or wrong) as answers for the second training run, and repeat. Magically it works. That's what's so surprising.
I guess one theory is that there are so many diverse ways to be wrong that the errors don't accumulate... still seems surprising, and it would be interesting to see if it works in other domains.
It's the first thing anyone would think of (like a self-hosted compiler) but everything I've read said "it doesn't work."
EDIT: For context:
> Shumailov et al., "AI models collapse when trained on recursively generated data" (Nature, 2024)

Their hypothesis as to why this works requires a bit more knowledge about model architecture, but basically: when a model generates code, some positions have only one right answer and some have many valid options, yet the model has to use one global confidence setting for both. Sampling with a specific temperature plus a garbage-token filter, then training on those outputs, teaches the model to internalize "be precise where there's one answer, stay open-minded where there are several", without anyone labeling which is which.
Note that there's a lot more nuance to this and I simplified a lot.
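A toy sketch of the "one global setting, two kinds of positions" point, with logits invented purely for illustration: the same nucleus (top-p) filter keeps a single token at a constrained position but several at an open-ended one, which is the distinction the training supposedly teaches the model to internalize.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def nucleus_mask(p, top_p=0.9):
    # Keep the smallest set of tokens whose probability mass reaches top_p.
    order = np.argsort(p)[::-1]
    cum = np.cumsum(p[order])
    k = np.searchsorted(cum, top_p) + 1
    mask = np.zeros_like(p, dtype=bool)
    mask[order[:k]] = True
    return mask

# Position A: syntax-like, one clearly right token (e.g. a closing paren).
constrained = softmax(np.array([6.0, 1.0, 0.5, 0.2]))
# Position B: naming-like, several roughly equally valid tokens.
open_ended = softmax(np.array([2.0, 1.9, 1.8, -3.0]))

nucleus_mask(constrained).sum()  # 1 token survives the filter
nucleus_mask(open_ended).sum()   # 3 tokens survive the filter
```

One fixed `top_p` behaves very differently at the two positions, so training on its outputs carries per-position information that a single global temperature alone can't express.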
You teach the machine by asking it to solve some problems, and then, whatever answer it gives, you say "That's exactly right. Now we train on those answers YOU just gave me" (even if they are wrong) and repeat. Somehow THAT works over time.
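That loop can be sketched as a toy self-reinforcing process (everything here is invented for illustration, nothing from the paper): the "model" is just a weight table, "training" reinforces whatever answer was sampled with no correctness check, and the loop is run repeatedly.

```python
import random

random.seed(0)

def generate(model, prompt):
    # Sample an answer in proportion to the model's current weights.
    answers = list(model[prompt])
    weights = [model[prompt][a] for a in answers]
    return random.choices(answers, weights=weights)[0]

def train(model, pairs, lr=1.0):
    # "Training" just reinforces whatever was sampled -- right or wrong.
    for prompt, answer in pairs:
        model[prompt][answer] += lr
    return model

def self_training_loop(model, prompts, rounds=50):
    for _ in range(rounds):
        pairs = [(p, generate(model, p)) for p in prompts]
        model = train(model, pairs)
    return model

model = {"2+2=": {"4": 2.0, "5": 1.0, "22": 1.0}}  # invented toy "model"
model = self_training_loop(model, ["2+2="])
```

In this toy, the loop tends to amplify whatever the model already preferred, which is exactly why it's surprising that the real version improves rather than collapses.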
You can also generate training answers by exploring variations in the length of the generated code.