Posted by AMavorParker 2 hours ago
So my first impression is that either this is a non-evolutionary algorithm masquerading as one and diluting concepts like mutation and crossover that have well-defined meanings, or it is one but you're abusing terminology from other fields (like RL and "rewards") instead. Either way it's a confusing first impression, and one gets the subtle vibe that the word choices are there more to create "buzz" than to create clarity.
(not trying to be dismissive, I genuinely hope this is useful feedback)
Paper does look interesting, I'll try to read properly when I have time.
Not necessarily. While the held-out downstream evals showed that 1T-1S setups outperformed larger populations like 4T-4S or 8T-8S on some specific benchmarks, that does not invalidate the motivation for population-based training.
The main motivation for larger populations is more diversity in both problems and solutions, which can encourage specialization and broader task coverage. Even if that diversity does not improve on some of the particular benchmarks we used, it is still arguably a desirable property.
Figure 9 in the paper, for example, shows that students trained with larger populations are exposed to a much wider range of tasks than the baseline.
Also, averaged over all the benchmarks we measure, 4v4 (i.e., 4T-4S) comes out best.
The “creating new population members in seconds” comment refers to operating in LoRA space. The mutation and crossover operators are applied to lightweight LoRA adapters rather than full model weights, making the process very fast and memory efficient.
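To make the "seconds" claim more concrete, here is a minimal sketch of what mutation and crossover over adapters could look like. It is an illustration under assumptions, not the paper's exact operators: the adapter layout (a dict from module name to a flat list of weights), the Gaussian-noise mutation, the per-module uniform crossover, and the names `mutate`, `crossover`, and `sigma` are all hypothetical.

```python
import random

# Hypothetical adapter layout: module name -> flat list of LoRA weights.
# Real adapters hold low-rank A/B matrices; flat lists keep the sketch small.
Adapter = dict[str, list[float]]


def mutate(adapter: Adapter, sigma: float = 0.01, rng=random) -> Adapter:
    """Return a new adapter with Gaussian noise added to every weight.

    Assumption: additive Gaussian noise stands in for whatever mutation
    operator is actually used.
    """
    return {
        name: [w + rng.gauss(0.0, sigma) for w in weights]
        for name, weights in adapter.items()
    }


def crossover(parent_a: Adapter, parent_b: Adapter, rng=random) -> Adapter:
    """Return a child that copies each module whole from one of the parents.

    Assumption: uniform crossover at module granularity; both parents must
    share the same module names.
    """
    return {
        name: list(parent_a[name] if rng.random() < 0.5 else parent_b[name])
        for name in parent_a
    }


rng = random.Random(0)  # seeded so the sketch is reproducible
parent_a = {"q_proj": [0.1, 0.2], "v_proj": [0.3, 0.4]}
parent_b = {"q_proj": [-0.1, -0.2], "v_proj": [-0.3, -0.4]}

# A new population member: recombine two parents, then perturb the result.
child = mutate(crossover(parent_a, parent_b, rng), sigma=0.01, rng=rng)
```

Because only the small adapter tensors are copied and perturbed, and the frozen base model is shared by every member, creating a child costs roughly one pass over the adapter weights.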
Regarding the TrueSkill of the teachers: the self-play settings we use in this paper are zero-sum and competitive, which means the skill ratings of the two populations cannot both increase together. The objective of one population is adversarial to the other: teachers try to generate difficult tasks, while students learn to solve those tasks, which makes them easy.
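The zero-sum point can be illustrated with a much simpler rating system. The sketch below uses plain Elo as a stand-in, not TrueSkill, and the function names, the `k` factor, and the match sequence are all made up for illustration. With a symmetric Elo update, whatever one side gains the other loses, so the teacher and student ratings cannot both rise from teacher-versus-student matches alone.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_teacher: float, r_student: float,
               student_solved: bool, k: float = 32.0):
    """One teacher-vs-student match; the teacher wins when the student fails.

    The same delta is added to one side and subtracted from the other,
    so the sum of the two ratings is conserved.
    """
    teacher_score = 0.0 if student_solved else 1.0
    delta = k * (teacher_score - expected_score(r_teacher, r_student))
    return r_teacher + delta, r_student - delta


teacher, student = 1500.0, 1500.0
# Hypothetical outcomes: the student solves three of four generated tasks.
for solved in [True, True, False, True]:
    teacher, student = elo_update(teacher, student, solved)

total = teacher + student  # stays at 3000 up to floating-point error
```

In this toy run the student rating ends above 1500 and the teacher rating below it, while their sum stays fixed, which is the sense in which one population's rating gain shows up as the other's loss.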