Nothing amazing ever happens here. Everything is ordinary.
Read log
Why didn’t RL work before? Why does RL work now? Priors.
Scaling language pre-training gave us powerful priors. Yao mentions how this* may have seemed counterintuitive to a classic RL researcher even just a few years ago. (The whole miracle was empirical anyway.)
*language reasoning as actions
AI’s first half was a search for novel methods to hill-climb harder and harder benchmarks. Now the “recipe” is in place and is scaling well so far.
But “If novel methods are no longer needed and harder benchmarks will just get solved increasingly soon, what should we do?”
The second half of AI will shift focus from solving problems to defining problems. In this new era, evaluation becomes more important than training.
Kaplan et al. trained all models on a fixed amount of data (~130B tokens) and used a learning rate schedule that goes to zero. The former meant big models did not get enough data, and the latter meant models did not train enough.
2406.12907, which tries to reconcile the difference in results between the two scaling law papers, is also inaccurate.
Labs’ equity vortex drying up academia, closed research, and not acknowledging wrong results… is a sad state of affairs.
The indexer becomes the bottleneck in sparse attention; Meituan LSA focuses on this bottleneck and introduces three orthogonal optimizations to the indexer.
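A rough sketch of why the indexer ends up as the bottleneck. This is a generic top-k indexer, not Meituan LSA's actual design; the function names, the low-dimensional "index" vectors, and the dot-product scoring are all assumptions for illustration. The attention step only touches k selected keys, but the indexer still has to score every key in the context for each query.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def indexer_top_k(index_query, index_keys, k):
    # Hypothetical cheap indexer: scores EVERY key, so its cost grows
    # linearly with context length for each query.
    scores = [dot(index_query, ik) for ik in index_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # positions of the k best-scoring keys

def sparse_attention(query, keys, values, selected):
    # Full-precision attention, but only over the selected positions,
    # so its cost depends on k rather than on context length.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum((w / total) * values[i][d] for w, i in zip(weights, selected))
        for d in range(dim)
    ]
```

With k fixed, the attention cost stays flat as the context grows while the indexer's scoring loop keeps growing, which is the part the optimizations target.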
TL;DR Distillation gives a better training signal than hard labels.
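A minimal sketch of that claim, assuming the standard temperature-softened soft-target loss; the logits and function names are made up for illustration. Two students put the same probability on the correct class, so the hard label cannot tell them apart, while the teacher's soft targets can.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, student_logits, temperature=1.0):
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s)
                for t, s in zip(target_probs, student_probs) if t > 0)

def hard_label_loss(label, student_logits):
    # One-hot target: only the correct class contributes to the loss.
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    return cross_entropy(one_hot, student_logits)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # Soft target: the teacher's full distribution, including how
    # plausible it finds each wrong class.
    teacher_probs = softmax(teacher_logits, temperature)
    return cross_entropy(teacher_probs, student_logits, temperature)

# Illustrative logits (assumed). The teacher finds class 1 plausible
# and class 2 unlikely.
teacher = [3.0, 1.5, -2.0]
student_a = [2.0, 1.0, -1.0]   # ranks wrong classes like the teacher
student_b = [2.0, -1.0, 1.0]   # ranks them the other way round

hard_a, hard_b = hard_label_loss(0, student_a), hard_label_loss(0, student_b)
soft_a, soft_b = distillation_loss(teacher, student_a), distillation_loss(teacher, student_b)
```

Here `hard_a` and `hard_b` come out equal, while `soft_a` is lower than `soft_b`: the soft targets penalize student B for its ordering of the wrong classes, which the hard label never sees.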
… line between distillation, supervised fine-tuning, reinforcement learning and synthetic data is getting blurry.