Embarrassingly simple self-distillation improves code generation

Posted by Anon84 1 day ago

Embarrassingly simple self-distillation improves code generation(arxiv.org)

580 points | 168 commentspage 3

gavinray 16 hours ago|

Why have we been fed the narrative that training models on their own output progressively degrades quality?

It's the first thing anyone would think of (like a self-hosted compiler) but everything I've read said "it doesn't work."

EDIT: For context:

  > Shumailov et al. (2024) — "AI models collapse when trained on recursively generated data" (Nature, 2024)

mickdarling 16 hours ago||

I'm working on a tool to determine which portions of an LLM process can be optimized, and how to measure that optimization and check whether it's optimizable at all. The shaping pattern that they talk about here is directly relevant and makes a whole lot more processes potentially optimizable by looking at the pattern rather than if the metrics just go up or down.

dwa3592 20 hours ago||

Can anyone help clarify these doubts - I didn't see any information about how different the test/benchmark set is from the training set. It feels like an important gap to not fill in a ML paper. What if there is an overlap between the problems in the test set and the training set?? What is the decontamination strategy of going from LCBv5 to LCBv6 ?

crustycoder 19 hours ago||

"SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6"

I know virtually nothing about this area but my naive take is that something that means it still only passes tests around half the time doesn't seem like a particularly big jump forwards.

What am I missing?

SEMW 18 hours ago||

There's no shortage of benchmarks (coding or otherwise) that any competent coding model will now pass with ~100%.

But no-one quotes those any more because if everyone passes them, they don't serve any useful purpose in discriminating between different models or identifying advancements

So people switch to new benchmarks which either have more difficult tasks or some other artificial constraints that make them in some way harder to pass, until the scores are low enough that they're actually discriminating between models. and a 50% score is in some sense ideal for that - there's lots of room for variance around 50%.

(whether the thing they're measuring is something that well correlates to real coding performance is another question)

So you can't infer anything in isolation from a given benchmark score being only 50% other than that benchmarks are calibrated to make such scores the likely outcome

crustycoder 15 hours ago||

So it's the relative and not the absolute diff that matters - thanks.

martinrolph 3 hours ago||

Think of it less like a test suite and more like an exam. If you're trying to differentiate between the performance of different people/systems/models, you need to calibrate the difficulty accordingly.

When designing a benchmark, a pass rate of roughly 50% is useful because it gives you the most information about the relative performance of different models. If the pass rate is 90%+ too often, that means the test is too easy: you're wasting questions asking the model to do things we already know it can do, and getting no extra information. And if it's too low then you're wasting questions at the other end, trying to make it do impossible tasks.

xbmcuser 22 hours ago||

So the chances of Singularity went up.

hu3 19 hours ago|

Or down if this research leads to a local minima.

fooker 21 hours ago||

I'm excited for the long tail of techniques like this that are going to be discovered over the next several decades that's going to make this technology eventually run on a toaster!

drooby 23 hours ago||

Fascinating...

This feels eerily similar to sleep consolidation or synaptic pruning

ACCount37 21 hours ago|

I don't see much similarity? Unless you're looking at self-distillation in general and not just this use of it.

oliver236 19 hours ago||

How not?

I think the analogy is actually pretty specific to this paper, not just self-distillation in general.

During sleep your brain replays experiences but noisy and distorted. The replays are often incoherent as narratives (dreams are weird). But the consolidation still works because the value isn't in the narrative coherence, it's in the activation patterns at each moment. Important pathways get strengthened, weak ones get pruned. Section 4.4 of this paper is what makes the connection click. They cranked training temperature to 2.0 with no truncation. 62% of the sampled outputs had no extractable code. Coherent Python that devolves into multilingual gibberish halfway through. The model still improved (+5.7pp pass@1).

This makes no sense if you think the model is learning from good code examples. But it makes a lot of sense if you think of it as the model replaying its own knowledge back to itself in a noisy/distorted form, and the replay process strengthening what matters (sharp distributions at "lock" positions where one token is correct, broad distributions at "fork" positions where multiple approaches work) while pruning what doesn't (distractor tails). The model doesn't learn anything new. It just wakes up performing better because what it already knew got cleaned up.

How is this comment not at number 1??

ACCount37 16 hours ago||

This is a property of self-distillation.

Self-distillation shifts the behavior of the model towards that of the model + steering. As such, you don't strictly "need" the tokens to be in-domain for it to work. The logits are a vessel for transferring the steering into the model's internals.

The tokens can be gibberish. What transfers isn't whether they're gibberish or not, but how the flavor of model predictions, if given gibberish, differs from that of an unsteered version of itself.

In this specific case, the behavioral difference comes from the "temperature-shifted, truncated samples" in the "teacher" sampling strategy, and it is that difference that is internalized by the "student" model.

drooby 14 hours ago||

I think we’re agreeing. The point of the sleep parallel is exactly that the content doesn’t matter, and it’s the filtering process that does the work. Brains replay noisy, sometimes incoherent patterns during sleep and the value is in how that replay reshapes connection weights, not in whether the replay is accurate. That’s the same principle you’re describing with the steering signal

I.e sleep replays don’t need to replay Tuesday’s meeting accurately. They just need to activate the relevant pathways so that the strong ones fire and the weak ones don’t. The pattern of what fires versus what doesn’t is the signal. The “content” of the dream is basically irrelevant.

hackermeows 12 hours ago||

what is the big deal with obsidian ? I see a lot of people use it but I'm more than happy with giving an LLM a local sqlite table , embedding api and asking the agent to maintain its own memory

naasking 10 hours ago|

It's interesting that LLMs improve skills, especially on harder problems, just by practicing them. That's effectively what's going on.

More comments...