re:ctx

Nothing amazing ever happens here. Everything is ordinary.

Read log

The Second Half - Shunyu Yao - 姚顺雨 *

Why didn’t RL work before? Why does it work now? Priors.
Scaling language pre-training gave us powerful priors. Yao notes that this* would have seemed counterintuitive to a classic RL researcher even just a few years ago. (The whole miracle was empirical anyway.)

*language reasoning as actions

AI’s first half involved a search for novel methods to hillclimb harder and harder benchmarks. Now the “recipe” is in place and is scaling well so far.
But “If novel methods are no longer needed and harder benchmarks will just get solved increasingly soon, what should we do?”

The second half of AI will shift focus from solving problems to defining problems. In this new era, evaluation becomes more important than training.

Scaling Laws, Honestly | Diogo Almeida

Kaplan et al. trained all models on a fixed amount of data (~130B tokens) and used a learning rate schedule that decays to zero. The former left big models without enough data, and the latter left models undertrained.

2406.12907, which tries to reconcile the difference in results between the two scaling laws papers, is also inaccurate.

Labs’ equity vortex drying up academia, closed research, and wrong results going unacknowledged… it is a sad state of affairs.

LSA LongCat Sparse Attention - arjunkocher

The indexer becomes the bottleneck in sparse attention; Meituan's LSA focuses on this bottleneck and introduces three orthogonal optimizations to the indexer.
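
To make the bottleneck concrete for myself, here is a minimal sketch of the generic top-k indexer pattern in sparse attention. It is not LSA's design (none of its three optimizations are reproduced here), and the names `index_top_k` and `sparse_attention` are hypothetical. For simplicity the indexer reuses the attention keys, which is an assumption; real indexers typically use their own cheaper projections.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def index_top_k(query, index_keys, k):
    """Indexer: score every key, keep the k best positions.

    This pass still touches all n keys, so it stays O(n) per query
    even when the attention that follows only sees k of them.
    """
    scores = [dot(query, key) for key in index_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def sparse_attention(query, keys, values, selected):
    """Softmax attention restricted to the selected positions: O(k)."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / total
        for d in range(dim)
    ]

# Toy data (hypothetical): 4 keys, 2-dim, scalar values.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, 2.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
query = [0.0, 1.0]

selected = index_top_k(query, keys, k=2)
output = sparse_attention(query, keys, values, selected)
print(selected, output)
```

Once k is small and fixed, the attention step is cheap and the full scan inside the indexer is what grows with context length, which is why optimizing the indexer matters.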

A brief history of distillation in AI | Sergio Paniego

TL;DR Distillation gives a better training signal than hard labels.

… line between distillation, supervised fine-tuning, reinforcement learning and synthetic data is getting blurry.
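
A small sketch of why the TL;DR holds, assuming the classic setup of matching temperature-softened teacher probabilities with a KL loss. The logits, the temperature, and the function names are made up for illustration. Two students that give the correct class the same probability get the same hard-label loss, but the soft-label loss also tells them how to rank the wrong classes.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_label_loss(student_logits, label):
    """Cross-entropy against a one-hot label: only the true class carries signal."""
    return -math.log(softmax(student_logits)[label])

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over 3 classes; class 0 is the true label.
teacher = [4.0, 3.0, -2.0]    # teacher thinks class 1 is a plausible runner-up
student_a = [2.0, 1.0, -1.0]  # agrees with the teacher's ranking
student_b = [2.0, -1.0, 1.0]  # swaps the two wrong classes

hard_a = hard_label_loss(student_a, 0)
hard_b = hard_label_loss(student_b, 0)
soft_a = distillation_loss(student_a, teacher)
soft_b = distillation_loss(student_b, teacher)
print(hard_a, hard_b, soft_a, soft_b)
```

The hard-label losses come out equal (up to float rounding) while the distillation loss is clearly lower for `student_a`, so the teacher's distribution is carrying information the one-hot label does not.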
