Embarrassingly simple self-distillation improves code generation

Posted by Anon84 4 days ago

Embarrassingly simple self-distillation improves code generation (arxiv.org)

653 points | 200 comments | page 4

try-working 4 days ago|

Most codebases don't have traces to train on. If you use rlm-workflow, you build up rich traceability in the form of requirements, plans, and implementation artifacts, along with worktree diffs. With these, you can then use self-distillation on models or use autoagent to improve your harness. https://github.com/doubleuuser/rlm-workflow

naasking 4 days ago||

It's interesting that LLMs improve skills, especially on harder problems, just by practicing them. That's effectively what's going on.

drooby 4 days ago||

Fascinating...

This feels eerily similar to sleep consolidation or synaptic pruning.

ACCount37 4 days ago|

I don't see much similarity? Unless you're looking at self-distillation in general and not just this use of it.

oliver236 4 days ago||

How not?

I think the analogy is actually pretty specific to this paper, not just self-distillation in general.

During sleep your brain replays experiences, but in a noisy and distorted form. The replays are often incoherent as narratives (dreams are weird). But the consolidation still works because the value isn't in the narrative coherence, it's in the activation patterns at each moment. Important pathways get strengthened, weak ones get pruned.

Section 4.4 of this paper is what makes the connection click. They cranked training temperature to 2.0 with no truncation. 62% of the sampled outputs had no extractable code. Coherent Python that devolves into multilingual gibberish halfway through. The model still improved (+5.7pp pass@1).

This makes no sense if you think the model is learning from good code examples. But it makes a lot of sense if you think of it as the model replaying its own knowledge back to itself in a noisy/distorted form, and the replay process strengthening what matters (sharp distributions at "lock" positions where one token is correct, broad distributions at "fork" positions where multiple approaches work) while pruning what doesn't (distractor tails). The model doesn't learn anything new. It just wakes up performing better because what it already knew got cleaned up.
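To make the "lock" versus "fork" idea concrete, here is a minimal sketch of how temperature plus truncation reshapes a single next-token distribution. The logits, the temperature of 2.0 combined with a top-p of 0.8, and the function names are all assumptions chosen for illustration, not the paper's settings (the Section 4.4 run mentioned above used no truncation at all).

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def truncate_top_p(probs, top_p):
    """Keep the smallest set of top tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]

# Hypothetical logits, made up for this example.
lock_logits = [6.0, 1.0, 0.5, 0.0, -0.5]   # one clearly correct token
fork_logits = [2.0, 1.9, 1.8, -1.0, -1.5]  # three viable tokens plus a distractor tail

for name, logits in [("lock", lock_logits), ("fork", fork_logits)]:
    base = softmax(logits)
    steered = truncate_top_p(softmax(logits, temperature=2.0), top_p=0.8)
    print(name, [round(p, 3) for p in base], "->", [round(p, 3) for p in steered])
```

With these made-up numbers, the lock position ends up with a single surviving token, while the fork position keeps its three viable options at roughly equal weight and loses the tail.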

How is this comment not at number 1??

ACCount37 4 days ago||

This is a property of self-distillation.

Self-distillation shifts the behavior of the model towards that of the model + steering. As such, you don't strictly "need" the tokens to be in-domain for it to work. The logits are a vessel for transferring the steering into the model's internals.

The tokens can be gibberish. What transfers isn't whether they're gibberish or not, but how the flavor of model predictions, if given gibberish, differs from that of an unsteered version of itself.

In this specific case, the behavioral difference comes from the "temperature-shifted, truncated samples" in the "teacher" sampling strategy, and it is that difference that is internalized by the "student" model.
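A toy version of that teacher/student relationship, as a sketch only: the "model" here is a single categorical distribution over five tokens, the teacher is a frozen copy sampled with a shifted temperature and top-p truncation, and the student is updated with plain cross-entropy SGD on the teacher's samples. The logits, hyperparameters, and function names are assumptions for illustration; a real run would fine-tune an LLM across many prompts and positions.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def truncate_top_p(probs, top_p):
    """Keep the smallest set of top tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]

def self_distill(logits, temperature=2.0, top_p=0.8, steps=2000, lr=0.1, seed=0):
    """Train a copy of the model on samples from its own steered distribution."""
    rng = random.Random(seed)
    n = len(logits)
    # Teacher: the same model, frozen, plus the sampling-time steering.
    teacher = truncate_top_p(softmax(logits, temperature), top_p)
    student = list(logits)  # the student starts as an exact copy
    for _ in range(steps):
        token = rng.choices(range(n), weights=teacher)[0]
        probs = softmax(student)
        # Gradient of -log p[token] with respect to the logits is probs - onehot.
        for i in range(n):
            student[i] -= lr * (probs[i] - (1.0 if i == token else 0.0))
    return softmax(student), teacher

# Hypothetical logits: three viable tokens plus a two-token distractor tail.
base_logits = [2.0, 1.9, 1.8, -1.0, -1.5]
before = softmax(base_logits)
after, teacher = self_distill(base_logits)
print("tail mass before:", round(before[3] + before[4], 4))
print("tail mass after: ", round(after[3] + after[4], 4))
```

The student never sees the tail tokens as targets, so their probability shrinks even though no "new" information was added. What gets internalized is the difference between the steered and unsteered sampling behavior.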

drooby 4 days ago||

I think we’re agreeing. The point of the sleep parallel is exactly that the content doesn’t matter, and it’s the filtering process that does the work. Brains replay noisy, sometimes incoherent patterns during sleep and the value is in how that replay reshapes connection weights, not in whether the replay is accurate. That’s the same principle you’re describing with the steering signal.

I.e., sleep replays don’t need to replay Tuesday’s meeting accurately. They just need to activate the relevant pathways so that the strong ones fire and the weak ones don’t. The pattern of what fires versus what doesn’t is the signal. The “content” of the dream is basically irrelevant.

smallerize 4 days ago||

I don't suppose they published the improved models?

augment_me 4 days ago||

Isn't this what DeepSeek + Kimi did to Claude?

hnretards 4 days ago||

I've been doing something even better than this for years using only Mistral 7b.

My locally running Mistral 7b is 100x better at modern JavaScript than any model on the market, mainly just from RAG on my own code samples.

That's basically what they are describing with "post-training". The TLDR is that code, especially of a certain style, is vastly simpler than written language.

You really don't need a huge model or data centers etc. You just need a small but good model like Mistral 7b and literally a few good samples.

But you guys keep doing you lol. A bunch of non-devs trying to solve code is pretty funny to watch.

4b11b4 4 days ago||

Self-consistency meets fine-tuning?

antirez 4 days ago||

Another potentially usable trick is the following: based on the observation that a longer token budget improves model performance, one could generate solutions using a lot of thinking budget, then ask the LLM to turn the trace into a more compact one, and later SFT on that. That said, I have the feeling the result of the paper will likely be hard to apply in practice without affecting other capabilities, and/or not superior to other techniques that provide similar improvement in sampling.
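The generate, compress, then SFT idea could be wired up roughly like this. Everything here is hypothetical: `generate` and `compress` stand in for calls to whatever model you use, `SFTExample` is an invented container, and the "keep only if shorter" filter is an added assumption rather than part of the suggestion above.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SFTExample:
    prompt: str
    target: str

def build_compact_trace_dataset(
    prompts: List[str],
    generate: Callable[[str, int], str],   # hypothetical: (prompt, thinking_budget) -> long trace
    compress: Callable[[str, str], str],   # hypothetical: (prompt, long_trace) -> compact trace
    thinking_budget: int,
) -> List[SFTExample]:
    """Collect (prompt, compact trace) pairs to fine-tune on later."""
    examples = []
    for prompt in prompts:
        long_trace = generate(prompt, thinking_budget)
        compact = compress(prompt, long_trace)
        # Assumption: only keep pairs where compression actually shortened the trace.
        if len(compact) < len(long_trace):
            examples.append(SFTExample(prompt=prompt, target=compact))
    return examples

# Stand-in "models" so the sketch runs without any LLM.
def fake_generate(prompt: str, thinking_budget: int) -> str:
    return "think " * thinking_budget + "answer(" + prompt + ")"

def fake_compress(prompt: str, long_trace: str) -> str:
    return long_trace.split()[-1]  # keep only the final answer token

dataset = build_compact_trace_dataset(
    ["reverse a list", "sum two ints"], fake_generate, fake_compress, thinking_budget=50
)
for example in dataset:
    print(example.prompt, "->", example.target)
```

The resulting list would then be passed to whatever SFT tooling you already have. That step is left out because it depends entirely on the training stack.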

robwwilliams 4 days ago||

Very cool. An evolutionary biologist would say: Welcome to the party!

Mutation rate modulation is the AI engineers’ heat. And selection does the trimming of the outliers.

Some more serious biomorphic thinking and we may get to the next big insight courtesy of 3+ billion years of evolution: evolution that enabled a great ape species to write a paper like this and build LLMs like Gemma4 that totally rock on a 3.5 pound MacBook Pro M5 Max with 128 GB of RAM.

hackermeows 4 days ago|

What is the big deal with Obsidian? I see a lot of people use it, but I'm more than happy with giving an LLM a local SQLite table and an embedding API, and asking the agent to maintain its own memory.
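For anyone curious what that setup might look like, here is a minimal sketch using only the standard library. The `AgentMemory` class, the table schema, and `toy_embed` are all hypothetical; `toy_embed` is a hashed bag-of-words stand-in for a real embedding API, used only so the example runs offline.

```python
import json
import math
import sqlite3
import zlib

DIM = 256  # size of the toy embedding; a real embedding API picks its own

def toy_embed(text):
    """Hashed bag-of-words vector, normalized to unit length (placeholder for an API call)."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class AgentMemory:
    def __init__(self, path=":memory:", embed=toy_embed):
        self.db = sqlite3.connect(path)
        self.embed = embed
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS memory ("
            "id INTEGER PRIMARY KEY, note TEXT NOT NULL, embedding TEXT NOT NULL)"
        )

    def remember(self, note):
        """Store a note together with its embedding (serialized as JSON)."""
        self.db.execute(
            "INSERT INTO memory (note, embedding) VALUES (?, ?)",
            (note, json.dumps(self.embed(note))),
        )
        self.db.commit()

    def recall(self, query, k=3):
        """Return the k notes with the highest cosine similarity to the query."""
        q = self.embed(query)
        rows = self.db.execute("SELECT note, embedding FROM memory").fetchall()
        scored = [(sum(a * b for a, b in zip(q, json.loads(e))), note) for note, e in rows]
        scored.sort(reverse=True)
        return [note for _, note in scored[:k]]

memory = AgentMemory()
memory.remember("user prefers tabs over spaces")
memory.remember("project uses sqlite for storage")
memory.remember("deploy happens on fridays")
print(memory.recall("which storage does the project use", k=1))
```

The brute-force scan over all rows is fine for a few thousand notes. Anything larger would probably want a proper vector index, which is outside this sketch.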
