Loss falls in a straight line — on a log plot
Train a family of models at different sizes and plot their final loss against compute. On a log-log scale, the points fall on a remarkably straight line, which is the signature of a power law: each tenfold increase in compute cuts loss by a roughly constant factor. The same pattern holds for parameter count and for dataset size, each taken on its own when the other is not the bottleneck. These are the scaling laws, first characterized at scale by Kaplan and colleagues at OpenAI in 2020.
The practical magic is extrapolation. Fit the curve on small, cheap runs and you can forecast the loss of a model 100× bigger before spending a dollar on it. That turned training frontier models from a gamble into a budgeted engineering decision — you know roughly what you'll get.
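The extrapolation step can be sketched in a few lines. This is a minimal illustration, not a real fit: the compute and loss values are synthetic, generated from a made-up power law (the constants `true_a` and `true_b` are assumptions chosen for the example), and real fits are noisier and often include an irreducible-loss offset.

```python
import numpy as np

# Synthetic "small run" measurements, generated from a made-up power law
# L = a * C**(-b). The constants are illustrative, not from any paper.
true_a, true_b = 30.0, 0.05
compute = np.array([1e17, 1e18, 1e19, 1e20])  # training FLOPs
loss = true_a * compute ** (-true_b)

# A power law is a straight line in log-log space:
#   log10(L) = log10(a) - b * log10(C)
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), deg=1)

def predict_loss(c: float) -> float:
    """Forecast loss at compute c from the fitted log-log line."""
    return float(10 ** (intercept + slope * np.log10(c)))

# Extrapolate to 100x the largest run that was actually "trained".
forecast = predict_loss(1e22)
print(f"fitted exponent: {-slope:.3f}, forecast loss at 1e22 FLOPs: {forecast:.3f}")
```

The fit only ever sees runs up to 1e20 FLOPs; the forecast at 1e22 is the straight line carried two more decades to the right.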
The budget is fixed — so what do you spend it on?
Compute is the binding constraint, and to a good approximation the training compute of a transformer is C ≈ 6 · N · D — six times the number of parameters N times the number of training tokens D. For a fixed budget C, that's a trade-off: a bigger model must be trained on fewer tokens, or a smaller model on more. The question scaling laws really answer is how to split a fixed budget between N and D.
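To make the trade-off concrete, here is a small sketch using only the C ≈ 6 · N · D approximation from the text. The budget of 1e21 FLOPs and the function names are hypothetical, picked for illustration.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute, C ≈ 6 * N * D."""
    return 6.0 * n_params * n_tokens

def tokens_for_budget(budget_flops: float, n_params: float) -> float:
    """Tokens D that a model of size N can be trained on within a budget C."""
    return budget_flops / (6.0 * n_params)

budget = 1e21  # hypothetical fixed budget, in FLOPs

# Same budget, three model sizes: every 10x in parameters costs 10x in tokens.
for n in (1e8, 1e9, 1e10):
    d = tokens_for_budget(budget, n)
    print(f"N = {n:.0e} params  ->  D = {d:.2e} tokens  ({d / n:.1f} tokens/param)")
```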
Kaplan said go big. Chinchilla said go balanced.
The original Kaplan analysis suggested model size was the dominant lever, so the field built ever-larger models trained on comparatively little data — GPT-3 at 175B parameters being the emblem. In 2022, DeepMind's Chinchilla paper re-ran the experiment more carefully and found those giants were badly undertrained: for a given compute budget, you should scale parameters and data roughly equally, landing near a heuristic of about 20 training tokens per parameter.
The proof was dramatic: Chinchilla, at 70B parameters but trained on far more data, beat the 280B Gopher at the same compute. The lesson reshaped the field — models got “smaller” and were fed vastly more data, which is part of why data curation (see FineWeb) and trillion-token corpora became the obsession they are today.
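A quick arithmetic check shows why the two models count as "the same compute". The parameter counts come from the text; the token counts (about 300B for Gopher and about 1.4T for Chinchilla) are the commonly reported training-set sizes and are an assumption added here, as is the use of the 6 · N · D approximation for both.

```python
# (parameters, training tokens). Token counts are commonly reported figures,
# added here as an assumption; they do not appear in the text above.
models = {
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

results = {}
for name, (n_params, n_tokens) in models.items():
    flops = 6.0 * n_params * n_tokens   # C ≈ 6 * N * D
    ratio = n_tokens / n_params         # tokens per parameter
    results[name] = (flops, ratio)
    print(f"{name:10s}  C ≈ {flops:.2e} FLOPs   tokens/param ≈ {ratio:.1f}")
```

Under these assumptions both land in the same ballpark of training compute (a few times 10^23 FLOPs), but Gopher sits near 1 token per parameter while Chinchilla sits at 20.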
Try it: split a compute budget
Slide the training compute up and down. The Chinchilla-optimal recipe scales the model and the data together; the Kaplan-style recipe pours the same budget into a bigger model on thinner data — and ends up undertrained. Watch the tokens-per-parameter ratio and notice that the optimal one stays near 20 no matter the budget.
Uses the standard C ≈ 6·N·D training-FLOP approximation and Chinchilla's ~20-tokens-per-parameter compute-optimal heuristic. Parameter and token figures are directional, not a precise loss prediction; the dollar figure is a rough illustration. The lesson is the split: as compute grows, scale model and data roughly equally.
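The same calculation can be sketched in code. Substituting D = r · N into C ≈ 6 · N · D gives N = sqrt(C / (6r)), so a fixed tokens-per-parameter ratio r means both N and D grow with the square root of compute. The function name `split_budget` is hypothetical, and the "model-heavy" ratio of 2 tokens per parameter is an illustrative stand-in, not Kaplan's actual fitted recipe.

```python
import math

def split_budget(budget_flops: float, tokens_per_param: float = 20.0):
    """Split a budget C into (N, D) with D = r * N and C ≈ 6 * N * D.

    Solving 6 * N * (r * N) = C gives N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(budget_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

for c in (1e20, 1e22, 1e24):
    n_opt, d_opt = split_budget(c)                        # ~20 tokens/param heuristic
    n_big, d_big = split_budget(c, tokens_per_param=2.0)  # illustrative model-heavy split
    print(
        f"C = {c:.0e}:  balanced N = {n_opt:.2e}, D = {d_opt:.2e}   |   "
        f"model-heavy N = {n_big:.2e}, D = {d_big:.2e}"
    )
```

Each 100× step in budget multiplies both N and D by 10 in either recipe; what differs is where the ratio between them is pinned.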
The caveats that matter in 2026
Scaling laws describe pretraining loss, not everything you care about. Two refinements have reshaped how they're used. First, inference cost: if a model will serve billions of queries, it can be worth “over-training” a smaller model past the compute-optimal point, because a smaller model is cheaper to run forever — the optimum for total cost differs from the optimum for training alone. Second, the rise of test-time compute added a whole new scaling axis — spending compute at inference — that the original laws never modeled.
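The inference argument can be turned into a rough break-even calculation. This sketch assumes a forward pass costs about 2 · N FLOPs per token (a standard approximation, not stated in the text above) and that the two hypothetical models reach comparable quality; that second assumption is made up for illustration and is not derived from any loss model.

```python
def lifetime_flops(n_params: float, train_tokens: float, inference_tokens: float) -> float:
    """Training (~6*N*D) plus inference (~2*N per token served)."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * inference_tokens

def break_even_inference_tokens(big, small) -> float:
    """Tokens served at which the smaller, over-trained model becomes cheaper overall.

    Each argument is a (n_params, train_tokens) tuple.
    """
    (n_big, d_big), (n_small, d_small) = big, small
    extra_training = 6.0 * (n_small * d_small - n_big * d_big)
    savings_per_token = 2.0 * (n_big - n_small)
    return extra_training / savings_per_token

# Hypothetical pair, assumed (not shown) to reach similar quality.
big = (70e9, 1.4e12)    # near 20 tokens/param
small = (35e9, 4.0e12)  # half the size, trained well past that ratio

t_star = break_even_inference_tokens(big, small)
print(f"break-even after about {t_star:.2e} inference tokens")
```

Past that volume, every additional token served makes the smaller model the cheaper choice in total compute, which is the sense in which the total-cost optimum differs from the training-only optimum.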
And loss isn't capability. Some abilities appear to emerge fairly suddenly as scale crosses thresholds, which smooth loss curves don't obviously predict. Still, the core insight stands and underwrites the entire industry: more compute, spent in the right proportion, reliably buys a better model — and you can know roughly how much better before you start.