How a model knows which word came first

Self-attention reads every token in parallel and weighs them by content alone — so on its own it has no idea that “dog bites man” differs from “man bites dog.” Something has to inject order back in. The scheme a model picks to do that is easy to overlook, but it quietly decides how far the model can read before it falls apart.

Attention is a bag of words until you tell it otherwise

A transformer block computes attention as a set operation: every token looks at every other token and scores how relevant it is, purely from the two tokens' vectors. There is no built-in notion of “the token three positions back.” Shuffle the input and the raw attention math returns the same per-token outputs, just shuffled the same way: the layer is permutation-equivariant, so it cannot tell one ordering from another. That is fatal for language, where order is meaning. (If the attention step itself is fuzzy, start with the self-attention explainer.)
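The order-blindness described above can be checked numerically. The following is a minimal sketch, assuming a single attention head with random weights, no mask, and no positional signal; the function and variable names are illustrative rather than taken from any library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with no positional signal and no mask."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.normal(size=(seq_len, d_model))          # five made-up token vectors
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = rng.permutation(seq_len)                  # a random reordering of the tokens
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Each token gets the same output vector regardless of where it sits:
# shuffling the input only shuffles the output rows the same way.
rows_follow_tokens = np.allclose(out_shuffled, out[perm])

# Any order-insensitive summary (here, the mean over tokens) is identical.
pooled_identical = np.allclose(out.mean(axis=0), out_shuffled.mean(axis=0))
```

Both flags come out true, which is the sense in which bare attention treats a sentence as a bag of words.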

So every transformer adds a positional signal — information about where each token sits — before or during attention. The three families below differ in what they encode (the absolute slot vs. the relative gap between two tokens) and where they inject it (into the token vector vs. into the attention score itself). That second difference is what governs how gracefully a model handles sequences longer than it was trained on.

1. Absolute: stamp each slot with a number

The original 2017 transformer added a fixed vector to each token that depends only on its position — slot 0 gets one pattern, slot 1 another, and so on. The classic version uses sinusoids of different frequencies, so each position gets a unique fingerprint and nearby positions get similar ones. A common alternative is a learned position embedding table: one trainable vector per slot, up to some maximum length.
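As a rough illustration of the sinusoidal variant, the sketch below builds the table in NumPy. It assumes the commonly used base of 10000 and an even `d_model`; the helper names are made up for this example.

```python
import numpy as np

def sinusoidal_positions(num_positions, d_model, base=10000.0):
    """One fixed vector per slot: sines on even dims, cosines on odd dims."""
    positions = np.arange(num_positions)[:, None]             # (P, 1)
    freqs = base ** (-np.arange(0, d_model, 2) / d_model)     # (d_model / 2,)
    angles = positions * freqs                                # (P, d_model / 2)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

pe = sinusoidal_positions(num_positions=128, d_model=64)

# Neighbouring slots get similar fingerprints; distant slots get less similar ones.
sim_near = cosine(pe[10], pe[11])
sim_far = cosine(pe[10], pe[60])

# In the absolute scheme the position vector is simply added to the token vector.
token_vectors = np.zeros((128, 64))   # placeholder for real token embeddings
model_input = token_vectors + pe
```

Here `sim_near` is larger than `sim_far`, matching the "nearby positions get similar fingerprints" property. A learned table would replace `sinusoidal_positions` with a trainable array of the same shape.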

Both work, and both share one weakness. The model learns what “position 512” feels like by seeing it during training. Ask for position 5,000 and a learned table simply has no row for it; sinusoids technically extrapolate but the model has never practiced attending across that gap, so quality degrades fast. Absolute encoding ties the usable context window to the lengths the model actually trained on — which is why early models had hard 512- or 2,048-token limits.
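The "no row for it" failure is easy to reproduce. The sketch below uses a random array as a stand-in for a trained embedding table with a hypothetical 512-slot limit, and compares it with a sinusoidal formula that can be evaluated at any position.

```python
import numpy as np

max_positions, d_model = 512, 16                 # hypothetical training limit
rng = np.random.default_rng(0)
learned_table = rng.normal(size=(max_positions, d_model))  # stand-in for trained weights

def learned_position(pos):
    # A learned table can only look up slots it was allocated for.
    return learned_table[pos]

def sinusoidal_position(pos, d_model=d_model, base=10000.0):
    # A formula can be evaluated anywhere, trained on or not.
    freqs = base ** (-np.arange(0, d_model, 2) / d_model)
    vec = np.empty(d_model)
    vec[0::2] = np.sin(pos * freqs)
    vec[1::2] = np.cos(pos * freqs)
    return vec

try:
    learned_position(5000)
    learned_has_row = True
except IndexError:
    learned_has_row = False

sin_vec = sinusoidal_position(5000)   # computes fine, but says nothing about model quality
```

The sinusoidal vector exists at position 5,000, yet that only means the input is well defined. Whether the network does anything sensible with it still depends on what it saw in training.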

2. RoPE: rotate the vectors so attention sees the gap

Rotary Position Embedding (RoPE), introduced in the RoFormer paper and now used by Llama, Mistral, Qwen, and most open models, takes a cleverer route. Instead of adding a position vector, it rotates each query and key vector, one pair of dimensions at a time and each pair at its own frequency, by an angle proportional to the token's position. When attention then takes the dot product of a query at position m and a key at position n, the rotations combine so the score depends on m − n, the relative distance between them, rather than on the two absolute slots.
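The relative-distance property can be verified directly. This is a minimal sketch for a single query and key vector, assuming the interleaved pairing of dimensions and a base of 10000; real implementations vary in how they pair dimensions, and `rope_rotate` is a name invented for this example.

```python
import numpy as np

def rope_rotate(x, position, base=10000.0):
    """Rotate consecutive dimension pairs of a 1-D vector by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)    # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin       # standard 2-D rotation
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=16)   # made-up query vector
k = rng.normal(size=16)   # made-up key vector

# Same gap of 3, very different absolute positions.
score_early = rope_rotate(q, 13) @ rope_rotate(k, 10)
score_late = rope_rotate(q, 4013) @ rope_rotate(k, 4010)
same_gap_same_score = bool(np.isclose(score_early, score_late))

# A different gap (8) gives a different score for the same content.
score_other_gap = rope_rotate(q, 13) @ rope_rotate(k, 5)
```

The two same-gap scores agree up to floating-point error, because rotating the query by m and the key by n leaves a net rotation of m − n inside the dot product.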

That is a big deal. A model that reasons in relative terms doesn't care whether a pair of tokens sits at positions 10 and 13 or 4,010 and 4,013; the gap is 3 either way. RoPE therefore generalizes to longer sequences far more gracefully than absolute encoding, and it can be stretched further still by scaling the rotation frequencies (the “NTK” and linear position-interpolation tricks behind most 128K-token context extensions). When you see a model jump from 8K to 128K context with a short fine-tune, it is almost always RoPE being re-scaled.
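To make the re-scaling idea concrete, here is a sketch of linear position interpolation only, using the 8K and 128K figures above as hypothetical training and target lengths. The "NTK" variant adjusts the frequency base instead and is not shown; all names here are illustrative.

```python
import numpy as np

def rope_angles(position, d_head=64, base=10000.0, scale=1.0):
    """Rotation angles for one position; scale < 1 squeezes positions together."""
    freqs = base ** (-np.arange(0, d_head, 2) / d_head)
    return (position * scale) * freqs

train_len, target_len = 8192, 131072     # hypothetical 8K -> 128K extension
scale = train_len / target_len           # 1/16

last_pos = target_len - 1
raw = rope_angles(last_pos)                        # angles at the far end, unscaled
interpolated = rope_angles(last_pos, scale=scale)  # same position, squeezed
trained_limit = rope_angles(train_len)             # exclusive upper bound seen in training

fits_without_scaling = bool(np.all(raw <= trained_limit))
fits_with_scaling = bool(np.all(interpolated <= trained_limit))
```

Without scaling, the last position produces angles far outside what the model practiced on; with scaling, every angle falls back inside the trained span. The cost is that adjacent tokens now sit only 1/16 of a step apart, which is why a short fine-tune usually accompanies the change.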

3. ALiBi: just penalize distance in the score

ALiBi (Attention with Linear Biases) is the most minimal of the three. It adds no position vectors at all. Instead, right before the attention softmax, it subtracts a penalty from each score that grows linearly with how far apart the two tokens are — distant tokens get quietly down-weighted, recent ones stay strong. Each attention head gets a different penalty slope, so some heads look far back and others stay local.
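A small sketch of the bias follows. It assumes a causal model, a power-of-two head count, and the geometric slope schedule from the ALiBi paper (1/2, 1/4, ..., 1/256 for eight heads). Content scores are set to zero so the effect of the bias alone is visible.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def alibi_slopes(num_heads):
    # Geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8).
    return 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)

def alibi_bias(seq_len, num_heads):
    pos = np.arange(seq_len)
    distance = np.abs(pos[:, None] - pos[None, :])               # (L, L) token gaps
    return -alibi_slopes(num_heads)[:, None, None] * distance    # (H, L, L) penalties

seq_len, num_heads = 6, 8
content_scores = np.zeros((num_heads, seq_len, seq_len))  # pretend all tokens match equally
scores = content_scores + alibi_bias(seq_len, num_heads)

# Causal mask: a query may not attend to later positions.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)
weights = softmax(scores)

steep_head_last_query = weights[0, -1]     # slope 1/2: strongly favours recent tokens
shallow_head_last_query = weights[-1, -1]  # slope 1/256: close to uniform
```

With identical content scores, the steep head puts most of its weight on the nearest tokens while the shallow head spreads attention almost evenly, which is the "some heads look far back and others stay local" behaviour.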

Because the bias is a simple function of distance with no learned length limit, ALiBi models extrapolate to sequences much longer than those seen in training with little extra effort; that was the headline result. The trade-off is the built-in recency bias: the mechanism inherently discounts far-away tokens, which can hurt tasks that need to pull a fact from the very start of a long document. RoPE's relative encoding stays more neutral about distance, which is one reason it won the popularity contest among frontier models.

Why the scheme is the context ceiling

Put the three side by side and the pattern is clear. Absolute encoding hard-codes a maximum length and degrades sharply past it. RoPE encodes relative distance and can be re-scaled to extend context with a light fine-tune. ALiBi bakes extrapolation in but pays for it with a recency bias. None of these is the “attention is expensive” problem; that is a separate compute cost addressed by the KV cache and by efficient-attention kernels. Positional encoding is the quality ceiling: it decides whether the model can still make sense of a token once it is far from home.

So when a model card advertises a context window, two different limits hide behind that one number. The memory limit is how many tokens the KV cache and serving stack can hold. The comprehension limit — the one that decides whether the model actually uses what is in that window — is set by the positional scheme and how far it was trained or interpolated to reach. A 1M-token window on paper means little if the position encoding stops generalizing at 100K. That gap is exactly what a measured benchmark catches and a spec sheet hides.