A model learns the world by guessing the next token

Almost everything a frontier model knows is learned in one stage, playing one game: predict the next token, trillions of times, over a large slice of the internet. No human writes the answers. The text is the answer key. Understanding pretraining explains why models hallucinate, why scale works, and why the chat tuning that comes later can only redirect what pretraining already built.

The game: next-token prediction

Pretraining is self-supervised, which is a precise idea worth unpacking. There is no labeled dataset of questions and correct answers. Instead, the training signal comes for free from the text itself: take a passage, hide what comes next, and ask the model to predict it. The true next token is already sitting in the data, so every token position in a passage is its own prediction problem, and a corpus of trillions of tokens supplies trillions of them automatically.
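To make "the text is the answer key" concrete, here is a minimal sketch of how one passage turns into many training examples. The function name next_token_examples and the whitespace split are illustrative assumptions; real pipelines use subword tokenizers and process fixed-length windows in batches.

```python
def next_token_examples(tokens):
    """Return one (context, target) pair for every position after the first."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Toy whitespace "tokenizer" (an assumption for illustration only).
tokens = "the cat sat on the mat".split()
examples = next_token_examples(tokens)

for context, target in examples:
    # The label is simply the token that actually came next in the text.
    print(context, "->", target)
```

A passage of n tokens yields n - 1 examples, and nobody had to write a single label.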

Concretely, the model reads a sequence of tokens and outputs a probability distribution over its entire vocabulary for what comes next. If the sentence so far is "The capital of France is", a well-trained model puts most of its probability mass on "Paris". The training objective compares that predicted distribution to the actual next token and nudges the weights to make the right answer more likely next time.
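The predict-compare-nudge loop can be sketched on a four-word vocabulary. Everything numeric here is made up for illustration: the vocabulary, the logits, and the learning rate are assumptions, and the update is applied directly to the logits, whereas a real model backpropagates the same gradient into its weights.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for the token after "The capital of France is".
vocab = ["Paris", "London", "Lyon", "banana"]
logits = [2.0, 1.0, 0.5, -1.0]
target = vocab.index("Paris")  # the token that actually appears in the text

probs = softmax(logits)
loss = -math.log(probs[target])

# For softmax with cross-entropy, d(loss)/d(logits) = probs - one_hot(target).
grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

lr = 0.5  # illustrative learning rate
new_logits = [x - lr * g for x, g in zip(logits, grad)]
new_probs = softmax(new_logits)
new_loss = -math.log(new_probs[target])

print(f"p(Paris) before: {probs[target]:.3f}, after: {new_probs[target]:.3f}")
print(f"loss before: {loss:.3f}, after: {new_loss:.3f}")
```

After one step the probability on the true token goes up and the loss goes down, which is all that "nudging" means.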

What the loss curve actually measures

The quantity being minimized is cross-entropy loss: roughly, the average surprise of the model when it sees the true next token, measured as the negative log of the probability it assigned to that token. A loss that drops means the model is less surprised by real text, which means it has learned regularities: spelling, then grammar, then facts, then reasoning patterns, then style, roughly in that order. A closely related number, perplexity, is the exponential of that loss (with the loss measured in nats) and is often quoted instead because it has an intuitive reading: “effectively how many tokens the model is choosing between.”
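As a small worked example of both numbers, the sketch below computes cross-entropy and perplexity from the probabilities a model assigned to the true next tokens. The function names and the probability lists are hypothetical; the natural log is used, so the loss is in nats.

```python
import math


def cross_entropy(true_token_probs):
    """Average negative log-probability assigned to the true tokens (nats)."""
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)


def perplexity(true_token_probs):
    """Exponential of the cross-entropy."""
    return math.exp(cross_entropy(true_token_probs))


# A model that guesses uniformly among 4 options at every step.
uniform = [0.25, 0.25, 0.25, 0.25]
# A model that usually puts high probability on the right token.
confident = [0.9, 0.8, 0.95, 0.7]

print(perplexity(uniform))    # approximately 4
print(perplexity(confident))  # between 1 and 4
```

The uniform case shows where the intuitive reading comes from: guessing evenly among four tokens gives a perplexity of about four.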

The curve falls fast at first (learning that text is mostly common words and valid grammar is cheap) and then settles into a long, slow grind where each increment of improvement costs disproportionately more compute. That shape is not a bug; it is the empirical signature behind scaling laws, which are fitted power-law relationships that predict, approximately rather than exactly, how loss falls as you add parameters, data, and compute (see the scaling laws explainer).
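The diminishing-returns shape can be illustrated with a toy power law. This is a sketch under stated assumptions: the functional form (an irreducible floor plus a power-law term in compute) is a common way to write such fits, and the constants below are invented for illustration, not measured values from any model.

```python
def loss_at(compute, irreducible=1.7, scale=10.0, exponent=0.1):
    """Toy scaling curve: loss = floor + scale * compute^(-exponent).

    All three constants are hypothetical.
    """
    return irreducible + scale * compute ** (-exponent)


# Compute budgets growing by 10x each step (units are arbitrary here).
budgets = [10.0 ** k for k in range(18, 25)]
losses = [loss_at(c) for c in budgets]

# Improvement bought by each successive 10x increase in compute.
gains = [before - after for before, after in zip(losses, losses[1:])]

for c, l in zip(budgets, losses):
    print(f"compute={c:.0e}  loss={l:.4f}")
print("gain per 10x:", [round(g, 4) for g in gains])
```

Each tenfold increase in compute still lowers the loss, but by less than the previous one did.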

What the model actually learns

Here is the subtle part. The model is only ever optimized to predict the next token — yet to do that well across the whole internet, it is forced to build internal machinery far richer than autocomplete. To predict the next token in a chess transcript it benefits from tracking the board. To predict the next token after a math problem it benefits from doing the arithmetic. To predict the next token in a story it benefits from modeling characters and intent. Capabilities emerge as instrumental side effects of getting the prediction right.

This is why a raw pretrained model is called a base model: it has absorbed enormous knowledge and competence, but its only drive is to continue text plausibly. Ask it a question and it might answer — or it might continue with three more questions, because in its training data a question is often followed by more questions. It knows a great deal and intends nothing.
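A count-based bigram model is about the simplest next-token predictor there is, and it shows the "continue plausibly, intend nothing" behavior in miniature. This is only an analogy: the corpus, the function names, and the greedy decoding are assumptions for illustration, and a real base model is a neural network conditioned on far more than the previous token.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def continue_text(counts, prompt, max_new_tokens=5):
    """Greedily append the most frequent follower of the last token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # nothing ever followed this token in training
        out.append(followers.most_common(1)[0][0])
    return out


# Tiny toy corpus in which questions are usually followed by more questions.
corpus = "who are you ? who is she ? who are you ? i am here .".split()
counts = train_bigram(corpus)

continuation = continue_text(counts, "who are you ?".split(), max_new_tokens=4)
print(" ".join(continuation))  # who are you ? who are you ?
```

Given a question, the toy model does not answer it; it produces another question, because that is what most often came next in its data.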

The data is the product

Because the model learns the distribution of its training text, the composition of that text is the single biggest lever on what it becomes. Modern pretraining corpora can run to tens of trillions of tokens, assembled from filtered web crawls, code, books, and curated sources, then heavily deduplicated and quality-filtered. The FineWeb work (covered in its own explainer) showed that better filtering can beat simply adding more raw data at the same compute budget. The cleaning pipeline is part of the model, not preprocessing trivia.
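The two cleaning steps named above, deduplication and quality filtering, can be sketched in a few lines. This is a deliberately crude illustration and not the FineWeb pipeline: the exact-match hashing, the thresholds, and the function names are all assumptions, and production pipelines typically add near-duplicate detection and learned or more elaborate heuristic filters.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def looks_reasonable(text, min_words=5, min_alpha_ratio=0.6):
    """Crude quality heuristic with hypothetical thresholds."""
    if len(text.split()) < min_words:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / len(text) >= min_alpha_ratio


def clean_corpus(docs):
    """Drop exact duplicates (after normalization), then low-quality docs."""
    seen = set()
    kept = []
    for doc in docs:
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key in seen:
            continue  # duplicate of a document we already processed
        seen.add(key)
        if looks_reasonable(doc):
            kept.append(doc)
    return kept


docs = [
    "The model reads a sequence of tokens and predicts the next one.",
    "the model reads a sequence   of tokens and predicts the next one.",
    "click here",
    "$$$ 1234 %%% 5678 ### 9012 @@@ 3456 !!!",
]
cleaned = clean_corpus(docs)
print(cleaned)
```

Of the four toy documents, only the first survives: the second is a duplicate after normalization, the third is too short, and the fourth is mostly symbols and digits.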

It also explains the limits. A base model cannot know what was absent from its corpus, knows only shakily what was underrepresented in it, and will faithfully reproduce the biases, errors, and stale facts present in it. Hallucination is the same machinery working as designed: when the most plausible continuation is a confident-sounding fabrication, that is what next-token prediction produces.

Why pretraining sets the ceiling

Pretraining consumes the overwhelming majority of the compute that goes into a model — the later alignment stages are comparatively tiny. That budget asymmetry has a consequence: fine-tuning and RLHF mostly surface and shape capabilities that pretraining already created. They teach the base model to behave like a helpful assistant, to follow instructions, and to refuse harmful requests, but they rarely install a brand-new skill the base model could not already approximate.

That is the mental model to carry forward: pretraining decides what is possible; the alignment stages decide what is convenient and safe. The next two explainers — fine-tuning / SFT and the RLHF pipeline — pick up exactly where this one ends: with a knowledgeable base model that does not yet know it is supposed to help you.