Pick any word below to trace it through the attention pipeline and watch where it looks. The sentence is the classic ambiguity test: what does “it” refer to — the animal, or the street?
Input · 11 tokens
01
The flow, for one token
Every token takes this journey simultaneously — shown here for your selected word. The vectors are schematic (real models use hundreds of dimensions); the scores → softmax step is computed for real from the scores below.
Token
"it" → id
“it”
index 7
→
Embed + Position
lookup + positional encoding
input vector
−
+
+
−
+
+
→
Project → Q K V
× W_q, W_k, W_v
Q (query)
+
+
+
+
−
+
K (key)
+
−
−
+
−
−
V (value)
+
+
−
+
−
+
→
Score
Q·Kᵀ / √dₖ vs every token
The
animal
didn't
cross
the
street
because
it
was
too
tired
→
Softmax
→ weights summing to 1
The0.00
animal0.93
didn't0.00
cross0.00
the0.00
street0.05
because0.00
it0.00
was0.00
too0.00
tired0.00
→
Output
Σ weightᵢ · Vᵢ
context-aware vector
+
+
+
+
+
+
Now carries info mostly from “animal”. Feeds the feed-forward layer, then the next block.
02
The attention map
Each row is a token asking “what should I pay attention to?” Brighter cell = more attention paid to that column’s token. Real transformers run dozens of these heads in parallel, each specializing. Toggle three documented head types:
Resolves the pronoun. Watch row “it” — it reaches all the way back to “animal”, with a weaker pull toward the competing antecedent “street”. This long-range binding is what attention does that older models struggled with.
q \ k
The
animal
didn't
cross
the
street
because
it
was
too
tired
less attentionmore
03
Where it actually looks
The attention distribution for your selected token — the actual weights, summing to 1.0, that the softmax produces. This is the weighted recipe used to blend the other tokens’ Value vectors into a new, context-aware representation.
Token “it” distributes its attention
Head: Coreference head · strongest link → “animal” (93% of its attention)
The
0.003
animal
0.927
didn't
0.003
cross
0.003
the
0.003
street
0.046
because
0.003
it
0.004
was
0.003
too
0.003
tired
0.003
On “real world”:production model weights aren’t loaded here, so the patterns shown are hand-built to mirror what mechanistic-interpretability research repeatedly finds — coreference heads binding pronouns to antecedents, previous-token heads, and syntactic heads linking verbs to their subjects/objects. The math you see (scaled scores → softmax → a probability distribution over tokens) is exactly the operation a real layer performs, and the displayed distributions are genuinely computed via softmax(scores).
Self-attention: softmax( Q·Kᵀ / √dₖ ) · V · Sentence after Vaswani et al. 2017 / Tensor2Tensor coreference example.