A small model guesses, the big model checks

Generation is slow because a model emits one token at a time, and each token needs a full pass through billions of parameters. Speculative decoding gets a 2–3× speedup with provably identical output by letting a tiny draft model run ahead and having the big model verify several tokens at once. Here is why that's essentially a free lunch.

Why generation is slow — and underuses the hardware

Autoregressive decoding is sequential: token N must exist before the model can compute token N+1. Each step is one forward pass through the entire model. The frustrating part is that this is memory-bound, not compute-bound: most of the time is spent reading the model's weights from memory, and the GPU's arithmetic units sit largely idle. Processing one token uses almost the same memory traffic as processing several — so generating tokens one at a time wastes the hardware.
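A back-of-the-envelope sketch makes the memory-bound claim concrete. Every number here is an illustrative assumption rather than a measurement: a 70-billion-parameter model stored in 2 bytes per weight, 2 TB/s of memory bandwidth, 1 PFLOP/s of peak arithmetic, and the common rule of thumb of about 2 floating-point operations per weight per token. It also ignores KV-cache and activation traffic.

```python
def pass_time(tokens_per_pass, params=70e9, bytes_per_param=2,
              mem_bandwidth=2e12, peak_flops=1e15):
    """Rough lower bound on one forward pass, in seconds (illustrative numbers only)."""
    # Every weight is streamed from memory once per pass, however many tokens ride along.
    weight_read = params * bytes_per_param / mem_bandwidth
    # Arithmetic grows with the number of tokens processed in the pass.
    compute = 2 * params * tokens_per_pass / peak_flops
    # The pass cannot finish faster than the slower of the two.
    return max(weight_read, compute)


for k in (1, 4, 8):
    print(k, "tokens per pass:", round(pass_time(k) * 1000, 1), "ms")
```

Under these assumptions the weight read dominates for 1, 4, and 8 tokens alike, so all three passes take the same time. Only at several hundred tokens per pass does arithmetic become the limit.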

The trick: verify many tokens in one pass

Here is the key fact that makes speculation work. Running the big model on a sequence to verify K tokens at once costs about the same as generating one token, because it's the same weight-reading bottleneck, just over a slightly longer sequence. So if you already had a guess at the next several tokens, checking all of them is nearly free relative to producing them one by one.

That reframes the problem: instead of asking the expensive model to produce each token, get cheap guesses from somewhere and ask it only to confirm them in bulk.

Draft and verify

Speculative decoding uses two models. A small, fast draft model (the same vocabulary, a fraction of the size) quickly generates a run of, say, 4–8 candidate tokens. The large target model then runs a single pass over that run and, for each position, checks whether the draft's token matches what it would have produced.

It accepts the candidate tokens up to the first disagreement, discards the rest, and supplies its own token at the point of divergence. Then the draft model runs ahead again from there. When the draft is good, many tokens are accepted per expensive pass; when it's wrong, you fall back to roughly normal speed. Either way, you are rarely much slower than standard decoding, and usually much faster.
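The draft-and-verify loop can be sketched for the greedy case with two toy stand-ins for the models. Everything named here is hypothetical: `target_next` and `draft_next` are plain functions that map a token list to the next token, and the per-position calls to `target_next` stand in for what a real system would compute in one batched forward pass. The sketch also uses the common refinement that when every drafted token is accepted, the target's prediction after the last one comes along for free.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculative_greedy(target_next: NextToken, draft_next: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> Tuple[List[int], int]:
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < max_new:
        # 1. The cheap draft model runs ahead by k tokens.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. One (simulated) target pass scores all k + 1 positions.
        target_passes += 1
        for i in range(k + 1):
            t = target_next(out + draft[:i])
            out.append(t)  # always the target's own token
            # Stop at the first disagreement, or after the free extra token.
            if i == k or draft[i] != t:
                break
    return out[:len(prompt) + max_new], target_passes


# Toy models: the target counts upward mod 10; the draft agrees except after a 2.
target = lambda seq: (seq[-1] + 1) % 10
draft = lambda seq: 9 if seq[-1] == 2 else (seq[-1] + 1) % 10

spec_out, passes = speculative_greedy(target, draft, prompt=[0], max_new=12)

# Reference: ordinary one-token-at-a-time greedy decoding with the target alone.
plain_out = [0]
for _ in range(12):
    plain_out.append(target(plain_out))

print(spec_out == plain_out, "target passes:", passes, "vs", 12)
```

The point of the comparison at the end is that the speculative output matches plain greedy decoding token for token, while the target is invoked far fewer times than once per token.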

Why the output is provably identical

This is the part that makes it a true free lunch rather than a quality trade. A careful acceptance rule (rejection sampling against the target model's own probabilities) guarantees the accepted tokens follow exactly the same distribution the target model would have produced on its own. The draft model only ever proposes; the target model's distribution is what's actually sampled from. So the final text is statistically indistinguishable from running the big model alone — same quality, fewer expensive passes.
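For sampling rather than greedy decoding, the acceptance rule for a single position can be sketched as below. The distributions `p` (target) and `q` (draft) are made-up three-token examples, and `speculative_sample` is a hypothetical helper name. It follows the standard rejection-sampling recipe: accept the draft's token with probability min(1, p/q), and on rejection resample from the leftover mass max(0, p - q).

```python
import random


def speculative_sample(p, q, rng):
    """Return one token index distributed according to p, using a proposal from q."""
    # The draft proposes a token from its own distribution.
    x = rng.choices(range(len(q)), weights=q)[0]
    # Accept with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # On rejection, resample from the part of p that q under-covers.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]


rng = random.Random(0)
p = [0.6, 0.3, 0.1]  # target model's probabilities (illustrative)
q = [0.2, 0.5, 0.3]  # draft model's probabilities (illustrative)
n = 100_000
counts = [0, 0, 0]
for _ in range(n):
    counts[speculative_sample(p, q, rng)] += 1
freqs = [c / n for c in counts]
print("empirical:", freqs, "target:", p)
```

Although every proposal comes from `q`, the empirical frequencies land on `p`. The algebra behind it is short: the probability of emitting a token is min(p, q) from the accept branch plus max(0, p - q) from the resample branch, which sums to p.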

The speedup depends mainly on how often the draft agrees with the target. A draft model well-matched to the target on typical text earns high acceptance and a 2–3× win; a poorly matched one earns little. Beyond the choice of draft model, the main tuning knob is how many tokens it proposes per round.
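A simplified model shows how acceptance rate turns into speedup. It assumes each drafted token is accepted independently with the same probability `alpha`, that the draft proposes `k` tokens per round, and that one draft step costs a fraction `draft_cost` of one target pass. All three are modelling assumptions, and the function names are hypothetical.

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens produced per target pass, counting the target's own token."""
    if alpha >= 1.0:
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha, k, draft_cost=0.05):
    """Tokens per round divided by the round's cost, in units of one target pass."""
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost + 1)


for alpha in (0.1, 0.5, 0.8, 0.95):
    print(alpha, round(expected_tokens_per_pass(alpha, 4), 2),
          round(estimated_speedup(alpha, 4), 2))
```

With these assumed numbers, an acceptance rate of 0.8 and 4 drafted tokens gives roughly 3.4 tokens per target pass and an estimated speedup of about 2.8×. At an acceptance rate of 0.1, the draft's overhead outweighs its benefit and the estimate drops slightly below 1.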

Variants and where it shows up

Several flavors avoid needing a separate draft model. Self-speculative methods use a few of the target model's own layers, or extra prediction heads (Medusa), to produce the guesses. Prompt lookup drafts by copying likely continuations straight from the input, which is surprisingly effective for summarization and code, where output often echoes the prompt. All share the same skeleton: cheaply propose, cheaply verify in bulk, accept the agreeing prefix.
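Prompt lookup is simple enough to sketch in a few lines. This is a minimal illustration, not any particular library's implementation: `prompt_lookup_draft` is a hypothetical helper that finds the most recent earlier occurrence of the last `ngram` tokens and proposes whatever followed it. Words stand in for token IDs to keep the example readable.

```python
def prompt_lookup_draft(tokens, ngram=2, k=4):
    """Propose up to k tokens by copying what followed an earlier match of the tail."""
    if len(tokens) < ngram:
        return []
    tail = tokens[-ngram:]
    # Scan backwards for an earlier occurrence of the tail (excluding the tail itself).
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == tail:
            return tokens[start + ngram:start + ngram + k]
    return []  # no match: fall back to ordinary decoding for this step


context = "the quick brown fox jumps . the quick".split()
print(prompt_lookup_draft(context))
```

Here the tail "the quick" appeared earlier in the context, so the draft is the four tokens that followed it. The target model then verifies that guess exactly as it would verify a draft model's output.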

Speculative decoding is now standard in production inference stacks because it asks for nothing in return — no fine-tuning, no quality loss, no API change. It simply makes the same model faster by spending the GPU's idle arithmetic on verification instead of leaving it idle. It sits alongside the KV cache in the toolkit of tricks that make large models affordable to serve.

The 2026 race: proprietary vs. open-source frameworks

Because the payoff is “same model, much faster, no quality cost,” speculative decoding has become a competitive battleground in its own right — and it splits along the familiar proprietary-vs-open line. On the proprietary side, NVIDIA markets DFlash as a Blackwell-tuned speculative-decoding framework, claiming up to 15× inference speedups for multi-agent, low-latency workloads — a vendor figure tied to its own silicon, so read it as a headline claim, not an independent measurement.

On the open side, DeepSeek shipped DSpark — a draft module that attaches to existing DeepSeek-V4 weights and is reported to run 57–85% faster per-user than the prior approach with identical output quality, released open-source with checkpoints and training code. The contrast is the point: the same technique arrives both as a closed, hardware-locked framework and as a drop-in, openly-published one any team can deploy on weights it already runs. As always, the claims are vendor claims until independently measured — but the direction is clear: acceleration frameworks, not just the models, are now part of the competitive and cost story this site tracks. (See the managed-inference layer for who turns these into served throughput.)
