Why generation is slow — and underuses the hardware
Autoregressive decoding is sequential: token N must exist before the model can compute token N+1. Each step is one forward pass through the entire model. The frustrating part is that this is memory-bound, not compute-bound: most of the time is spent reading the model's weights from memory, and the GPU's arithmetic units sit largely idle. Processing one token uses almost the same memory traffic as processing several — so generating tokens one at a time wastes the hardware.
The trick: verify many tokens in one pass
Here is the key fact that makes speculation work. Running the big model over a sequence to verify K tokens at once costs about the same as generating one token, because it hits the same weight-reading bottleneck, just over a slightly longer sequence. So if you already had a guess at the next several tokens, checking all of them is nearly free relative to producing them one by one.
That reframes the problem: instead of asking the expensive model to produce each token, get cheap guesses from somewhere and ask it only to confirm them in bulk.
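The cost claim above can be checked with a back-of-envelope estimate. The sketch below is a simplified roofline model, not a measurement: the parameter count, bandwidth, and peak throughput are hypothetical round numbers chosen for illustration, and it ignores the KV cache and activation traffic.

```python
def step_time(num_tokens, params=7e9, bytes_per_param=2,
              mem_bandwidth=1e12, peak_flops=1e14):
    """Rough estimate of one forward pass, in seconds.

    All defaults are assumed, illustrative values (a 7B-parameter model
    in 16-bit weights on a made-up accelerator), not real hardware specs.
    """
    # Every pass reads all the weights once, however many tokens it scores.
    memory_time = params * bytes_per_param / mem_bandwidth
    # About 2 FLOPs per parameter per token (one multiply, one add).
    compute_time = 2 * params * num_tokens / peak_flops
    # Whichever limit is slower sets the step time.
    return max(memory_time, compute_time)


for n in (1, 8, 1000):
    print(f"{n:>4} tokens per pass: {step_time(n) * 1e3:.1f} ms")
```

Under these assumptions, scoring 1 token and scoring 8 tokens take the same time, because both are limited by reading the weights. Only at much larger token counts does arithmetic become the bottleneck.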
Draft and verify
Speculative decoding uses two models. A small, fast draft model (the same vocabulary, a fraction of the size) quickly generates a run of, say, 4–8 candidate tokens. The large target model then runs a single pass over that run and, for each position, checks whether the draft's token matches what it would have produced.
It accepts the candidate tokens up to the first disagreement, discards the rest, and supplies the correct token at the point of divergence itself. Then the draft model runs ahead again from there. When the draft is good, many tokens are accepted per expensive pass; when it's wrong, you fall back to roughly normal speed. In the worst case you pay a small overhead over standard decoding (the wasted draft work); in the typical case you do much better.
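The draft-and-verify loop can be sketched in a few lines. This is a minimal illustration assuming greedy decoding and toy stand-in "models": `toy_target` and `toy_draft` are hypothetical functions invented for the example, and the per-position loop only emulates what a real system would get from one batched target pass.

```python
def speculative_step(target_next, draft_next, prefix, k=4):
    """One draft-and-verify round with greedy matching.

    target_next / draft_next are hypothetical callables that map a
    token list to the next token each model would pick greedily.
    """
    # 1. The draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2. The target checks each position. A real system gets all of
    #    these predictions from a single forward pass; this loop only
    #    emulates that.
    accepted = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # target's own token at the divergence
            return prefix + accepted
        accepted.append(tok)

    # 3. Everything matched: the same pass also yields one extra token.
    accepted.append(target_next(prefix + accepted))
    return prefix + accepted


def generate(target_next, draft_next, prompt, n_new, k=4):
    """Generate n_new tokens; also count the expensive target passes."""
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < n_new:
        out = speculative_step(target_next, draft_next, out, k)
        passes += 1
    return out[:len(prompt) + n_new], passes


# Toy stand-ins: the target counts upward mod 10; the draft agrees
# except after a 4, where it guesses wrong.
def toy_target(seq):
    return (seq[-1] + 1) % 10

def toy_draft(seq):
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


tokens, passes = generate(toy_target, toy_draft, [0], n_new=12, k=3)
print(tokens, "target passes:", passes)
```

The output matches what the target would have produced alone, but with fewer target passes than tokens generated.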
Why the output is provably identical
This is the part that makes it a true free lunch rather than a quality trade. A careful acceptance rule (rejection sampling against the target model's own probabilities) guarantees the accepted tokens follow exactly the same distribution the target model would have produced on its own. Under greedy decoding, that rule reduces to the exact-match check described above. The draft model only ever proposes; the target model's distribution is what's actually sampled from. So the final text is statistically indistinguishable from running the big model alone: same quality, fewer expensive passes.
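One common form of that acceptance rule can be sketched as follows. A drafted token x is accepted with probability min(1, p(x)/q(x)), where p is the target's distribution and q is the draft's. On rejection, a replacement is drawn from the leftover mass max(0, p - q), renormalized. The function names and the three-token vocabulary here are illustrative assumptions, not any particular library's API.

```python
import random


def accept_or_resample(p, q, drafted, rng=random):
    """Apply the acceptance rule to one drafted token.

    p, q: target and draft probabilities over the vocabulary (lists).
    drafted: index of a token that was sampled from q.
    Returns (token, was_accepted).
    """
    # Accept with probability min(1, p/q) for the drafted token.
    if rng.random() < min(1.0, p[drafted] / q[drafted]):
        return drafted, True
    # Rejected: draw from the residual max(0, p - q); choices() normalizes.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0], False


def output_distribution(p, q):
    """Exact distribution of the token emitted by the rule above."""
    # P(draft picks i and it is accepted) = q_i * min(1, p_i/q_i) = min(p_i, q_i)
    accept_mass = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_prob = 1.0 - sum(accept_mass)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:  # p and q identical: nothing is ever rejected
        return accept_mass
    return [a + reject_prob * r / total for a, r in zip(accept_mass, residual)]


p = [0.5, 0.3, 0.2]   # target distribution (made-up numbers)
q = [0.2, 0.2, 0.6]   # draft distribution (made-up numbers)
print(output_distribution(p, q))
```

The printed distribution equals p up to floating-point rounding, even though q is a poor match for it. A bad draft costs acceptance rate, not correctness.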
The speedup depends mainly on how often the draft agrees with the target, and secondarily on how cheap the draft is and how many tokens it proposes per round. A draft model well-matched to the target on typical text earns high acceptance and a 2–3× win; a poorly matched one earns little. Acceptance rate is the knob that matters most.
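To make the dependence on acceptance rate concrete, here is a small estimate under a simplifying assumption: each drafted token is accepted independently with the same probability `alpha`. With `k` drafted tokens per round, the expected number of tokens per target pass is then (1 - alpha^(k+1)) / (1 - alpha). The `draft_cost` parameter (the cost of one draft step relative to one target pass) is a hypothetical value chosen for illustration.

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens emitted per target pass, assuming each of the k
    drafted tokens is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return k + 1.0  # every draft accepted, plus the extra target token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha, k, draft_cost=0.05):
    """Rough speedup over plain decoding.

    draft_cost is an assumed ratio: one draft step costs this fraction
    of one target pass. A round costs k draft steps plus one target pass.
    """
    return expected_tokens_per_pass(alpha, k) / (draft_cost * k + 1.0)


for alpha in (0.0, 0.5, 0.8, 0.9):
    print(f"alpha={alpha:.1f}: speedup ~ {estimated_speedup(alpha, k=5):.2f}x")
```

With these assumed numbers, an acceptance rate around 0.8 lands near the upper end of the 2–3× range, while an acceptance rate of zero comes out slightly slower than plain decoding because the draft work is wasted.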
Variants and where it shows up
Several flavors avoid needing a separate draft model. Self-speculative methods use a few of the target model's own layers, or extra prediction heads (Medusa), to produce the guesses. Prompt lookup drafts by copying likely continuations straight from the input, which is surprisingly effective for summarization and code, where output often echoes the prompt. All share the same skeleton: cheaply propose, cheaply verify in bulk, accept the agreeing prefix.
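Prompt lookup is simple enough to sketch directly. The version below is one plausible way to do it, not a reference implementation: it finds the most recent earlier occurrence of the last few tokens and proposes whatever followed them. The function name and the `ngram` and `k` parameters are illustrative.

```python
def prompt_lookup_draft(tokens, ngram=2, k=4):
    """Propose up to k draft tokens by copying from earlier context.

    Looks for the most recent earlier occurrence of the last `ngram`
    tokens and returns the tokens that followed it. Returns [] when
    there is no match, in which case the caller decodes normally.
    """
    if len(tokens) < ngram:
        return []
    tail = tokens[-ngram:]
    # Scan backwards, skipping the final position (the tail itself).
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == tail:
            return tokens[start + ngram:start + ngram + k]
    return []


context = ["def", "foo", "(", "x", ")", ":", "return", "foo", "("]
print(prompt_lookup_draft(context))
```

Here the context ends with `foo (`, which appeared earlier, so the draft proposes the tokens that followed it the first time. The target model then verifies that proposal like any other draft.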
Speculative decoding is now standard in production inference stacks because it asks for almost nothing in return: no fine-tuning of the target model, no quality loss, no API change. It simply makes the same model faster by spending the GPU's idle arithmetic on verification instead of leaving it idle. It sits alongside the KV cache in the toolkit of tricks that make large models affordable to serve.