Posted by jxmorris12 3 days ago
Seems a buried lede is that on-policy RL is unlocked by bitwise-identical results between training and sampling. I'm not an expert here, but my understanding is that this would allow for stronger guarantees that the policy you sample from matches the policy you train, for the RL training the labs already do.
I don't fully understand the BigMath example though. They show that off-policy RLVR requires off-policy correction, which avoids divergence but is suboptimal because it results in noisy rewards. Then they say "we fixed the sampler and trainer numerical mismatch, which allows for on-policy RL, look how much better it is." It's not clear to me whether this is an artificial example that deliberately uses different trainer/sampler setups, or whether it's actually impossible to get the same numerics between trainer and sampler without their fixes (even with the same batch size, no atomics, etc.).
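For anyone trying to picture what the off-policy correction is doing, here's a minimal sketch, assuming a PPO-style clipped importance ratio and made-up numbers (the tensor names and values are illustrative, not from the paper). The point is that any numerical gap between the sampler's log-probs and the trainer's log-probs shows up as a ratio that isn't exactly 1, whereas bitwise-identical numerics make the correction a no-op:

```python
import torch

# Hypothetical log-probs of the sampled tokens, once under the inference
# engine (sampler) and once under the trainer's forward pass.
sampler_logprobs = torch.tensor([-1.20, -0.45, -2.31])
trainer_logprobs = torch.tensor([-1.19, -0.46, -2.35])  # slight numerical mismatch
advantages = torch.tensor([1.0, -0.5, 1.0])             # made-up advantages

# Off-policy correction: importance ratio between the policy being updated
# (trainer) and the policy that actually produced the samples (sampler).
ratio = torch.exp(trainer_logprobs - sampler_logprobs)

# PPO-style clipping bounds how much the mismatch can distort the update.
clipped = torch.clamp(ratio, 0.8, 1.2)
surrogate = torch.minimum(ratio * advantages, clipped * advantages)
loss = -surrogate.mean()

# If sampler and trainer were bitwise identical, ratio would be exactly 1.0
# everywhere, the clipping would never trigger, and this would reduce to a
# plain on-policy policy-gradient update.
print(ratio, loss)
```

This is just to illustrate the mechanism; the real question in the thread is whether the ratio can ever be made exactly 1 without their kernel-level fixes.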
I've seen this play out dozens of times. So many startups that have come and gone in the Bay Area were composed of extremely talented individuals, yet almost all of them failed.
This is literally one of the most knowledgeable people on the topic. I think you are the one who hasn't peeled back enough layers to connect with what they are saying.
If you say so.
> the author has nothing to do with the original comment
Except for the part of the comment that assumed the author had no idea how any of this works, had only used LLMs through an API, and had never run a local model, you mean?
Not really: LLMs give you a distribution over possible next tokens, and you are free to sample from that distribution however you want. There is no need to hack the RNG or anything; for example, you can simply take a greedy approach and always output the most likely token, in which case the LLM becomes deterministic (mathematically). This is equivalent to setting the temperature to 0.
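Concretely, here's a toy sketch (PyTorch, with a hypothetical helper name) of the difference between greedy decoding and temperature sampling over a logit vector:

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
    """Pick the next token id from a vector of logits (illustrative helper)."""
    if temperature == 0.0:
        # Greedy: always take the most likely token, deterministic given the logits.
        return int(torch.argmax(logits))
    # Otherwise scale the logits, softmax into probabilities, and draw a sample.
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

logits = torch.tensor([2.0, 1.0, 0.1])
print(sample_next_token(logits, temperature=0.0))  # always index 0
print(sample_next_token(logits, temperature=1.0))  # random draw from the softmax
```

The caveat, and the whole point upthread, is that this only makes the sampling step deterministic; the logits themselves can still differ run to run depending on batching and kernel choices, which is a separate issue from how you sample.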