The bottleneck a single answer hits
A standard model produces its answer in one left-to-right pass: read the prompt, emit the response. Every token of compute it spends is also a token of output. For a hard multi-step problem — a competition math question, a tricky proof, a bug across several files — that is like demanding the final answer with no scratch paper. The model has the knowledge but no room to work.
Chain-of-thought prompting (“think step by step”) was the first crack in this: let the model write its reasoning out loud as tokens, and accuracy on hard problems jumps, because the intermediate tokens are the scratch paper. Test-time compute is the systematic, trained version of that insight.
The core trade: inference tokens for accuracy
The central finding is a scaling law of its own: on reasoning-heavy tasks, accuracy climbs smoothly as you let the model generate more reasoning tokens before committing to an answer. A reasoning model might emit thousands of hidden “thinking” tokens — exploring approaches, checking itself, backtracking — and only then write a short final response. You are quite literally buying correctness with inference compute.
This reframes cost. A reasoning model can be far more expensive per query than its size suggests, because the bill scales with how long it thinks, not just how big it is. It also reframes model choice: for a simple lookup, paying for extended reasoning is pure waste; for a genuinely hard problem, a smaller model that thinks longer can beat a bigger one that answers instantly. That trade-off is exactly what a good routing layer should measure.
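To make the cost point concrete, here is a rough back-of-the-envelope sketch. The prices and token counts are invented for illustration, and the assumption that hidden thinking tokens are billed at the output rate should be checked against your provider's pricing.

```python
def query_cost(input_tokens, thinking_tokens, answer_tokens,
               price_in_per_million, price_out_per_million):
    """Estimate the dollar cost of one query.

    Assumption: thinking tokens are billed like ordinary output tokens.
    """
    output_tokens = thinking_tokens + answer_tokens
    total = (input_tokens * price_in_per_million
             + output_tokens * price_out_per_million)
    return total / 1_000_000


# Hypothetical prices: $1 per million input tokens, $4 per million output tokens.
instant = query_cost(500, 0, 200, 1.0, 4.0)      # answers immediately
thinking = query_cost(500, 8000, 200, 1.0, 4.0)  # thinks for 8,000 tokens first

print(f"instant: ${instant:.4f}  thinking: ${thinking:.4f}  "
      f"ratio: {thinking / instant:.1f}x")
```

Same model, same prompt, same visible answer length: under these made-up numbers the bill differs by more than an order of magnitude, driven entirely by how long the model thinks.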
Process reward models: grading the reasoning, not just the answer
Training a model to reason well needs a signal about how it got there, not only whether the final answer was right. An outcome reward model scores the final answer; a process reward model (PRM) scores each step of the reasoning. The 2023 OpenAI result “Let's Verify Step by Step” showed that supervising every step beats supervising only the outcome: a right answer reached by faulty reasoning teaches the wrong lesson, and a PRM catches it.
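A toy contrast between the two kinds of signal, as a sketch only. A real PRM is a learned model; the arithmetic checker below is a hypothetical stand-in that just verifies each step, and the step format `(a, op, b, claimed)` is invented for this example.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def outcome_reward(final_answer, reference):
    """Outcome-style signal: looks only at the final answer."""
    return 1.0 if final_answer == reference else 0.0


def process_rewards(steps):
    """Process-style signal: one score per step.

    Each step is (a, op, b, claimed). A step scores 1.0 only if the
    claimed result matches the actual arithmetic.
    """
    return [1.0 if OPS[op](a, b) == claimed else 0.0
            for a, op, b, claimed in steps]


# Task: compute (2 + 3) * 4. The reference answer is 20.
# This chain makes two errors that happen to cancel out.
faulty_chain = [
    (2, "+", 3, 6),   # wrong: 2 + 3 is 5
    (6, "*", 4, 20),  # wrong: 6 * 4 is 24
]
final_answer = faulty_chain[-1][3]

print(outcome_reward(final_answer, 20))  # the outcome signal is satisfied
print(process_rewards(faulty_chain))     # the per-step signal flags both steps
```

The outcome score rewards this chain in full, while the per-step scores flag both steps as wrong. That gap is the lesson a step-level signal is meant to teach.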
With a PRM you can do more than train: at inference you can generate several reasoning paths and keep the one the PRM scores highest, or guide a search through the space of reasoning steps. This is where ideas like best-of-N sampling, self-consistency (sample many chains, take the majority answer), and tree/MCTS-style search over reasoning steps come in — all different ways to convert extra inference compute into a better-vetted answer.
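Two of these strategies fit in a few lines each. The sketch below assumes the reasoning chains have already been sampled and that each one carries per-step scores from some PRM; the scores, the dictionary layout, and the choice of `min` as the aggregator (a chain is only as good as its weakest step) are all illustrative assumptions.

```python
from collections import Counter


def best_of_n(candidates, aggregate=min):
    """Keep the chain whose aggregated step scores are highest."""
    return max(candidates, key=lambda c: aggregate(c["step_scores"]))


def self_consistency(candidates):
    """Majority vote over final answers; step scores are ignored."""
    counts = Counter(c["answer"] for c in candidates)
    return counts.most_common(1)[0][0]


# Three hypothetical sampled chains for the same prompt.
candidates = [
    {"answer": "42", "step_scores": [0.9, 0.8, 0.95]},
    {"answer": "41", "step_scores": [0.9, 0.2, 0.9]},
    {"answer": "41", "step_scores": [0.7, 0.3, 0.8]},
]

print(best_of_n(candidates)["answer"])  # picks the chain with no weak step
print(self_consistency(candidates))     # picks the most frequent answer
```

On this invented data the two methods disagree: the majority answer comes from chains that each contain a weak step, while best-of-N prefers the single chain a verifier trusts throughout. Which behaviour you want depends on how much you trust the scorer.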
How o1 and R1 are actually trained
The breakthrough models pushed this further with reinforcement learning directly on reasoning. Rather than imitating human-written reasoning, the model is rewarded for reasoning traces that lead to correct, verifiable answers — math with checkable solutions, code that passes tests. Over many rounds it discovers its own effective reasoning strategies: checking work, considering alternatives, allocating more thinking to harder problems. DeepSeek-R1 notably showed strong reasoning could emerge largely from RL on verifiable rewards, building on the alignment foundations covered in the RLHF explainer.
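What "verifiable" means in practice is that the reward can be computed by a program rather than a human rater. A minimal sketch of such a reward for a math-style task follows. The `Answer:` line convention and the exact-match rule are assumptions made for this example, not a description of how any particular lab scores traces.

```python
def verifiable_reward(trace: str, reference: str) -> float:
    """Return 1.0 if the trace's final 'Answer:' line matches the reference.

    The reasoning above the answer line is never inspected; only the
    checkable result earns reward.
    """
    answer = None
    for line in trace.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("answer:"):
            # Keep the last answer line if the trace contains several.
            answer = stripped.split(":", 1)[1].strip()
    return 1.0 if answer == reference.strip() else 0.0


trace = """Try 6 * 7.
Check: 42 / 7 = 6, which is consistent.
Answer: 42"""

print(verifiable_reward(trace, "42"))
print(verifiable_reward("I am not sure.", "42"))
```

For code tasks the same role would be played by running a test suite and rewarding traces whose final program passes.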
What it changes for you
Three practical consequences. First, latency and cost are now task-dependent in a new way: a reasoning model can take many seconds and many tokens on a hard prompt, so it is not a drop-in for every call. Second, the “thinking” tokens are usually hidden or summarized, which means you are paying for compute you can't fully see; cost and latency benchmarks have to count those tokens too. Third, the right move is often a tiered setup: a fast model for easy traffic, a reasoning model held in reserve for the hard cases that actually benefit.
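A tiered setup can start as something very small. In the sketch below, the model names, the threshold, and the keyword-based difficulty estimate are all hypothetical placeholders; a real router would more likely use a trained classifier or measured accuracy per task type.

```python
HARD_MARKERS = ("prove", "derive", "debug", "step by step")


def toy_difficulty(prompt: str) -> float:
    """Crude placeholder estimate in [0, 1] based on keyword hits."""
    text = prompt.lower()
    hits = sum(marker in text for marker in HARD_MARKERS)
    return min(1.0, hits / 2)


def route(prompt: str, estimate_difficulty=toy_difficulty, threshold=0.7) -> str:
    """Send a prompt to the reasoning tier only when it looks hard enough."""
    if estimate_difficulty(prompt) >= threshold:
        return "reasoning-model"  # hypothetical slow, expensive tier
    return "fast-model"           # hypothetical cheap default tier


print(route("What is the capital of France?"))
print(route("Prove the lemma, then derive the error bound."))
```

The useful design choice is that the estimator is a parameter: you can swap the keyword heuristic for something measured without touching the routing logic.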
The one-line takeaway: scaling didn't stop at training. Test-time compute added a second dial, and knowing when to turn it up — and when turning it up is just burning money — is now part of using models well.