From logits to probabilities
The model's raw output for the next position is a vector of logits: one unnormalized score per token in the vocabulary. Logits can be any real number and don't sum to anything in particular. To turn them into a probability distribution, you apply the softmax function: exponentiate each logit and divide by the sum of all the exponentials. Bigger logits become bigger probabilities, everything is positive, and it all sums to one.
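As a minimal sketch of that step, here is softmax in plain Python. The four logits are made-up illustration values, not output from any real model, and subtracting the maximum is a standard numerical-stability trick that leaves the result unchanged.

```python
import math

def softmax(logits):
    """Turn a list of real-valued logits into probabilities that sum to 1."""
    # Subtracting the max before exponentiating avoids overflow and does
    # not change the result, because softmax is invariant to shifts.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(logits)
print([round(p, 3) for p in probs])
print(sum(probs))  # approximately 1.0, up to floating-point rounding
```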
If you always picked the single highest-probability token you would be doing greedy decoding — fully deterministic, same prompt always producing the same output. That sounds desirable but in practice greedy text is flat, repetitive, and prone to loops. Sampling introduces controlled randomness, and the dials below control how much.
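The contrast between greedy decoding and sampling can be sketched in a few lines. The token strings and logits here are invented for illustration, and `greedy_pick` and `sample_pick` are hypothetical helper names, not functions from any library.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Return the index of the highest logit. Always the same answer."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_pick(logits, rng):
    """Draw one index at random, weighted by its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Made-up vocabulary and logits, for illustration only.
tokens = ["sunny", "cloudy", "rainy", "purple"]
logits = [2.0, 1.5, 1.0, -2.0]

rng = random.Random(0)  # seeded so a rerun repeats the same draws
greedy = [tokens[greedy_pick(logits)] for _ in range(5)]
sampled = [tokens[sample_pick(logits, rng)] for _ in range(5)]
print(greedy)   # the top token, five times
print(sampled)  # a mix, weighted toward the likelier tokens
```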
Temperature: sharpen or flatten the distribution
Temperature is a single number you divide the logits by before softmax. Divide by a small number (temperature < 1) and the gaps between logits grow: the distribution sharpens toward the top token, making the output more predictable and more repetitive. Divide by a large number (temperature > 1) and the gaps shrink: the distribution flattens, giving unlikely tokens a real chance and producing more surprising (and more error-prone) text.
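At temperature 1.0 you sample from the model's raw beliefs untouched. As temperature approaches 0 the distribution collapses onto the single top token, so temperature 0 is treated as greedy decoding (you cannot literally divide by zero). This is the dial people reach for first: low for factual extraction and code, higher for brainstorming and creative writing.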
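A small sketch of temperature scaling, assuming the same made-up logits throughout. The function name `softmax_with_temperature` is hypothetical, and treating temperature 0 as a one-hot pick of the top token is a convention assumed here rather than something softmax defines.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then apply softmax."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Assumed convention: temperature 0 means greedy, i.e. all
        # probability mass on the highest logit.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits; only the temperature changes between rows.
logits = [2.0, 1.0, 0.1, -1.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Reading the printed rows top to bottom, the first entry shrinks and the last entry grows as temperature rises, which is the flattening described above.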
Top-k and top-p: cut the tail
Temperature reshapes the whole distribution, but even a flattened distribution has a long tail of nonsense tokens that you never want chosen. Two truncation rules clip that tail before sampling:
Top-k keeps only the k highest-probability tokens and renormalizes over them — a hard cap on how many candidates survive. It is simple but blunt: k=40 keeps 40 tokens whether the model is supremely confident (where 2 would do) or genuinely uncertain (where 40 is too few).
Top-p (nucleus sampling) is adaptive instead. It keeps the smallest set of top tokens whose probabilities add up to at least p — say 0.9 — and renormalizes over that nucleus. When the model is confident the nucleus is tiny; when it is unsure the nucleus widens automatically. This is why top-p is the more common default. The two are often combined.
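Both truncation rules can be sketched as filters over an already-normalized probability list. The two example distributions are invented to show the confident and uncertain cases, and the function names are hypothetical.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    if k < 1:
        raise ValueError("k must be at least 1")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    """Keep the smallest top-ranked set whose mass reaches p, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]

# Made-up distributions: one confident, one uncertain.
confident = [0.92, 0.04, 0.02, 0.01, 0.01]
unsure = [0.30, 0.25, 0.20, 0.15, 0.10]

# Top-k keeps three candidates in both cases.
print(top_k_filter(confident, 3))
print(top_k_filter(unsure, 3))
# Top-p adapts: one survivor when confident, four when unsure.
print(top_p_filter(confident, 0.8))
print(top_p_filter(unsure, 0.8))
```

Real inference libraries usually apply these filters to logits rather than probabilities and may break ties differently, so treat this as an illustration of the idea rather than a drop-in replacement.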
Try it: move the dials
Below is a fixed set of next-token logits for the prompt “The weather today is…”. The logits never change — only how a token gets selected from them does. Raise temperature and watch the bars even out. Pull top-p down and watch the long tail get struck through (excluded) and the survivors renormalize. Drop temperature to zero to see greedy decoding pick the single top token every time.
Why this matters in production
Sampling settings are the cheapest quality lever you have, and the most commonly misused. A retrieval or extraction task that needs faithful, repeatable output should run near temperature 0 — randomness there just manufactures inconsistency you then have to defend. A creative or ideation task starved at temperature 0 will feel robotic and loop. And reproducibility in evals depends on it: if you benchmark at temperature 1.0 you are measuring a distribution of outputs, not a fixed one, which is why our test methodology pins decoding settings per run.
The mental model to keep: the model supplies the odds; temperature reshapes them, top-k and top-p clip the tail, and only then is a single token drawn. Every quirk of “why did it say that?” starts here.
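That mental model can be put together as one function. This is a sketch under stated assumptions, not how any particular library implements it: the name `sample_next_token`, its parameters, the order of operations (temperature, then top-k, then top-p over the renormalized survivors), and the example tokens and logits are all illustrative choices.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Temperature -> softmax -> top-k -> top-p -> draw one token index."""
    if temperature == 0:
        # Assumed convention: temperature 0 means greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])

    # The model supplies the odds; temperature reshapes them.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank candidates from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k clips the tail to a fixed number of candidates.
    if top_k is not None:
        order = order[:top_k]

    # Top-p clips further, to the smallest set reaching mass top_p
    # (measured over the candidates that survived top-k).
    if top_p is not None:
        kept_mass = sum(probs[i] for i in order)
        nucleus, mass = [], 0.0
        for i in order:
            nucleus.append(i)
            mass += probs[i] / kept_mass
            if mass >= top_p:
                break
        order = nucleus

    # Only now is a single token drawn; choices() renormalizes the weights.
    weights = [probs[i] for i in order]
    return rng.choices(order, weights=weights, k=1)[0]

# Made-up tokens and logits, for illustration only.
tokens = ["sunny", "cloudy", "rainy", "cold", "purple"]
logits = [3.0, 2.0, 1.5, 0.5, -3.0]

rng = random.Random(42)  # a fixed seed keeps the draws repeatable
picks = [
    tokens[sample_next_token(logits, temperature=0.8, top_k=4, top_p=0.9, rng=rng)]
    for _ in range(10)
]
print(picks)
```

With `top_k=4`, the lowest-ranked token ("purple") can never be drawn, no matter how high the temperature is set.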