The wasteful version, first
Recall from the self-attention explainer that to produce a token, the model compares the current position against every prior position using three projections per token: a query, a key, and a value. The current token's query attends over all previous keys to decide what to pull from all previous values.
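Here is the catch in generation: the keys and values for tokens 1 through 999 don't change when you generate token 1,000. Attention in these models is causal, so an earlier token never sees a later one, and its key and value were fixed the moment it was processed. Recomputing them at every step means step t re-projects all t tokens seen so far, which adds up to on the order of N² projection work for an N-token response, before even counting the attention itself. That is quadratic, and ruinous for long outputs.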
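To put a number on that waste, here is a minimal counting sketch. It tallies key/value projections for a single layer only, and the function names and the 10-token prompt are illustrative assumptions, not measurements from any real engine.

```python
def kv_projections_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Count key/value projections when every step re-projects the whole sequence."""
    total = 0
    for step in range(new_tokens):
        # At this step the sequence holds the prompt plus the tokens generated so far.
        total += prompt_len + step
    return total


def kv_projections_with_cache(prompt_len: int, new_tokens: int) -> int:
    """Count key/value projections when each token is projected exactly once."""
    if new_tokens < 1:
        return 0
    # The prompt is projected once; each later step projects only the newest token.
    return prompt_len + (new_tokens - 1)


wasteful = kv_projections_without_cache(prompt_len=10, new_tokens=1000)
cached = kv_projections_with_cache(prompt_len=10, new_tokens=1000)
print(wasteful, cached)  # 509500 1009
```

The first count grows with the square of the output length; the second grows linearly.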
The cache: compute keys and values once
The fix is simple and is what every inference engine does: after computing the key and value vectors for a token, store them. When generating the next token, you compute only the new token's query, key, and value, append its key and value to the cache, and attend over the whole cache. The projection work per new token is now constant instead of growing with the prefix. The attention step still reads every cached entry, so it grows linearly with context length, but nothing in the prefix is ever recomputed. That stored set of key and value vectors, for every token and every layer, is the KV cache.
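The idea can be sketched in pure Python for a single attention head in a single layer. Everything here (the `KVCache` class, the random weights, the toy dimension of 4) is a hypothetical illustration rather than how any particular engine is written; the point is only that the cached path and the recompute-everything path give the same output.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def attend(query, keys, values):
    """Softmax-weighted sum of values, using scaled query-key dot products."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / norm
            for i in range(len(values[0]))]


class KVCache:
    """Hypothetical cache: one key and one value vector per token seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def decode_step(x, w_q, w_k, w_v, cache):
    """Project only the new token, append its key/value, attend over the cache."""
    cache.append(matvec(w_k, x), matvec(w_v, x))
    return attend(matvec(w_q, x), cache.keys, cache.values)


def recompute_step(xs, w_q, w_k, w_v):
    """Wasteful baseline: re-project keys and values for the whole prefix."""
    keys = [matvec(w_k, x) for x in xs]
    values = [matvec(w_v, x) for x in xs]
    return attend(matvec(w_q, xs[-1]), keys, values)


rng = random.Random(0)
dim = 4
w_q, w_k, w_v = ([[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
                 for _ in range(3))
tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(6)]

cache = KVCache()
for t, x in enumerate(tokens):
    cached_out = decode_step(x, w_q, w_k, w_v, cache)
    full_out = recompute_step(tokens[: t + 1], w_q, w_k, w_v)
    # Same result either way; the cached path just skips the redundant projections.
    assert all(abs(a - b) < 1e-12 for a, b in zip(cached_out, full_out))
```

A real model keeps one such cache per layer and per key/value head, and stores it as tensors on the GPU rather than as Python lists.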
This also explains a timing pattern you may have noticed: a streamed response often has a pause before the first token, then a steady flow. The pause is prefill — computing keys and values for the entire prompt in one big batched pass to populate the cache. After that comes decode — emitting tokens one at a time, each cheap because the cache is already warm.
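The two phases can be written down as a schedule of sequential forward passes. This is a simplified sketch with a hypothetical function name; it assumes the common arrangement in which the prefill pass itself yields the first output token, and it ignores batching across requests.

```python
def generation_passes(prompt_len: int, new_tokens: int) -> list:
    """Return (phase, tokens_processed) for each sequential forward pass."""
    if new_tokens < 1:
        return []
    # Prefill: one pass over the whole prompt fills the cache and yields token 1.
    passes = [("prefill", prompt_len)]
    # Decode: every further token needs its own pass over a single new token.
    passes += [("decode", 1)] * (new_tokens - 1)
    return passes


schedule = generation_passes(prompt_len=2000, new_tokens=4)
print(schedule)
# [('prefill', 2000), ('decode', 1), ('decode', 1), ('decode', 1)]
```

The pause before the first token is that single large pass; the steady flow afterwards is the run of one-token passes.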
Why the cache is big — and grows
The cache stores a key and a value vector per token, per attention layer, per attention head. Multiply that out for a model with dozens of layers and a long context and the KV cache becomes large — often gigabytes, and it grows linearly with every token in the context. This is the real, physical reason long context costs more: not just more compute, but more high-bandwidth GPU memory held hostage for the whole request, which limits how many requests a GPU can serve at once.
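Multiplying it out is straightforward. The sketch below uses illustrative numbers (32 layers, 32 key/value heads, head dimension 128, 16-bit values), roughly the shape of a 7B-parameter model without GQA; treat them as assumptions, and substitute the figures from your own model's configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Size of the KV cache in bytes for the given (assumed) model shape."""
    # Factor of 2: one key vector and one value vector per token, layer, and head.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


per_token = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=1)
full_context = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)

print(per_token / 2**20)     # 0.5  (MiB per token)
print(full_context / 2**30)  # 2.0  (GiB for one 4,096-token request)
```

At half a mebibyte per token under these assumptions, a handful of long concurrent requests is enough to fill a GPU.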
It is also why two requests with the same total tokens can cost differently: a request with a huge prompt and a short answer is mostly prefill, which processes the whole prompt in parallel and is cheap per token; a request that generates a long answer pays for one sequential decode step per token, each reading a steadily growing cache. Prompt caching features (reusing the KV cache for a shared prefix across calls) exist precisely because recomputing prefill for the same system prompt every time is pure waste.
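Prefix reuse can be sketched as a lookup keyed by the exact token prefix. The `PrefixKVStore` class and the `compute_kv` callback are hypothetical stand-ins, not any provider's API. The callback receives the full token list and a position because a token's cached state depends on everything before it, which is why only an exact prefix match can be reused.

```python
class PrefixKVStore:
    """Hypothetical store mapping a token prefix to its computed KV entries."""

    def __init__(self):
        self._store = {}  # tuple of token ids -> list of per-token KV entries

    def save(self, tokens, kv_entries):
        self._store[tuple(tokens)] = list(kv_entries)

    def longest_prefix(self, tokens):
        """Return (matched_length, kv_entries) for the longest stored prefix."""
        best_len, best_kv = 0, []
        for prefix, kv in self._store.items():
            n = len(prefix)
            if n > best_len and tuple(tokens[:n]) == prefix:
                best_len, best_kv = n, kv
        return best_len, best_kv


def prefill_with_reuse(tokens, store, compute_kv):
    """Prefill a request, computing KV only for tokens past the cached prefix."""
    matched, kv = store.longest_prefix(tokens)
    kv = list(kv)  # copy so the stored prefix is not mutated
    for position in range(matched, len(tokens)):
        kv.append(compute_kv(tokens, position))
    store.save(tokens, kv)
    return kv, len(tokens) - matched


def fake_compute_kv(tokens, position):
    # Stand-in for the real projection work: depends on the whole prefix.
    return ("kv", tuple(tokens[: position + 1]))


store = PrefixKVStore()
system_prompt = [101, 7, 7, 42]

_, computed_warmup = prefill_with_reuse(system_prompt, store, fake_compute_kv)
kv_a, computed_a = prefill_with_reuse(system_prompt + [5, 6], store, fake_compute_kv)
kv_b, computed_b = prefill_with_reuse(system_prompt + [9], store, fake_compute_kv)

print(computed_warmup, computed_a, computed_b)  # 4 2 1
```

After the first call warms the store, each later request pays only for the tokens that follow the shared system prompt.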
Squeezing the cache: GQA, MQA, and PagedAttention
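Because the KV cache can dominate inference memory at long contexts and large batch sizes, much of modern model design is about shrinking it. Multi-Query Attention (MQA) has all query heads share a single key/value head. Grouped-Query Attention (GQA) is the middle ground: each group of query heads shares one key/value head. Both cut the cache by a factor equal to the number of query heads per key/value head, usually with little quality loss, which is why GQA is now standard in models like Llama. Rotary position embeddings (RoPE) are a separate matter: they encode position in the queries and keys rather than shrinking the cache, though RoPE-based scaling schemes are commonly used to extend models to longer contexts.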
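The sharing pattern is just an index mapping. The sketch below assumes contiguous grouping (query heads 0 to 3 share key/value head 0, and so on), which is a common convention but not the only possible one; the function names and head counts are illustrative.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Map a query head to the key/value head it reads from (contiguous groups)."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into key/value heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def cache_reduction_factor(num_query_heads, num_kv_heads):
    """How many times smaller the cache is than with one KV head per query head."""
    return num_query_heads // num_kv_heads


# GQA with 32 query heads and 8 key/value heads: groups of 4.
gqa_mapping = [kv_head_for_query_head(h, 32, 8) for h in range(8)]
print(gqa_mapping)                    # [0, 0, 0, 0, 1, 1, 1, 1]
print(cache_reduction_factor(32, 8))  # 4

# MQA is the extreme case: every query head reads key/value head 0.
print(cache_reduction_factor(32, 1))  # 32
```

Only the key/value heads are cached, so the cache shrinks by the group size while the number of query heads stays the same.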
On the serving side, PagedAttention (the idea behind vLLM) borrows from operating-system virtual memory: instead of reserving one big contiguous block per request, sized for the longest output it might produce, it stores the cache in fixed-size pages allocated on demand. That removes most of the memory fragmentation that otherwise wastes much of the GPU, leaving at most one partly filled page per request, and lets a server pack far more concurrent requests onto the same hardware: a direct throughput-and-cost win.
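The bookkeeping behind paging can be sketched as a free list plus a per-request page table. This is a toy model of the idea only: the `PagedKVAllocator` class is hypothetical and is not vLLM's API, and it tracks page numbers rather than holding any actual key/value data.

```python
class PagedKVAllocator:
    """Toy allocator: fixed-size pages handed out on demand, tracked per request."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}  # request id -> list of physical page ids
        self.lengths = {}       # request id -> number of tokens stored

    def append_token(self, request_id):
        """Reserve a slot for one more token; return (physical_page, offset)."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length % self.page_size == 0:
            # Every page held so far is full (or none is held yet): take a new one.
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.lengths[request_id] = length + 1
        return table[length // self.page_size], length % self.page_size

    def release(self, request_id):
        """Return all of a finished request's pages to the free list."""
        self.free_pages.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


allocator = PagedKVAllocator(num_pages=4, page_size=16)
for _ in range(20):
    allocator.append_token("a")  # 20 tokens need 2 pages
for _ in range(5):
    allocator.append_token("b")  # 5 tokens need 1 page

pages_a = len(allocator.block_tables["a"])
free_before = len(allocator.free_pages)
allocator.release("a")
free_after = len(allocator.free_pages)
print(pages_a, free_before, free_after)  # 2 1 3
```

Pages for one request need not be adjacent, so a freed page can serve any other request immediately, and the only unused space is the tail of each request's last page.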
The takeaway
The KV cache is the bridge between an architecture detail and your bill. Long context isn't expensive because the model “reads more” in some vague sense — it's expensive because every token in context occupies a slice of fast GPU memory for the entire request, and that memory is the scarce resource. Understanding it explains prefill-vs-decode latency, why prompt caching saves money, and why so much inference engineering (GQA, PagedAttention, the move toward state space models that avoid a growing cache entirely) is really about one thing: managing the KV cache.