How a transformer reads a sentence

Pick any word below to trace it through the attention pipeline and watch where it looks. The sentence is the classic ambiguity test: what does “it” refer to — the animal, or the street?

Input · 11 tokens: The animal didn't cross the street because it was too tired

The flow, for one token

Every token takes this journey simultaneously; it is shown here for the selected word. The vectors are schematic (real models use hundreds to thousands of dimensions), but the scores → softmax step is actually computed from the scores below.
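For reference, the journey of one token can be written compactly. The notation below is introduced here for illustration and is not taken from the page: q_i, k_j and v_j are the query, key and value vectors of tokens i and j, n is the number of tokens (11 here), α_ij is the attention weight and o_i the output vector.

```latex
\[
\alpha_{ij} \;=\; \frac{\exp\!\left(q_i \cdot k_j / \sqrt{d_k}\right)}
                        {\sum_{l=1}^{n} \exp\!\left(q_i \cdot k_l / \sqrt{d_k}\right)},
\qquad
o_i \;=\; \sum_{j=1}^{n} \alpha_{ij}\, v_j
\]
```

Each row of weights α_i1 … α_in is non-negative and sums to 1, which is why the output can be read as a blend of value vectors.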

Token

“it” → id · index 7

→

Embed + Position

lookup + positional encoding → input vector
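The embed-and-position step can be sketched in a few lines. This is a minimal illustration, not the page's implementation: the 4-dimensional table `EMBED`, the vocabulary ids in `VOCAB`, and the choice of the sinusoidal encoding from Vaswani et al. 2017 are all assumptions made for the example (many models learn their position vectors instead).

```python
import math

D_MODEL = 4  # toy width; real models are far wider

# Hypothetical vocabulary and embedding table (values are made up).
VOCAB = {"animal": 0, "it": 1}
EMBED = [
    [0.2, -0.1, 0.4, 0.0],   # "animal"
    [0.1, 0.3, -0.2, 0.5],   # "it"
]

def positional_encoding(pos, d_model=D_MODEL):
    """Sinusoidal encoding: sin on even dimensions, cos on odd ones."""
    pe = []
    for i in range(d_model):
        # Each sin/cos pair shares one frequency, set by the even index.
        angle = pos / (10000 ** ((i - i % 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def input_vector(token, pos):
    """Embedding lookup plus positional encoding, added element-wise."""
    emb = EMBED[VOCAB[token]]
    pe = positional_encoding(pos)
    return [e + p for e, p in zip(emb, pe)]

x_it = input_vector("it", pos=7)  # "it" sits at index 7 of the sentence
```

The same word at a different position gets a different input vector, which is how word order reaches the attention layer.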

→

Project → Q K V

× W_q, W_k, W_v → Q (query) · K (key) · V (value)
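As a rough sketch of the projection step, the same input vector is multiplied by three different matrices. The matrices and the input below are invented toy values (real W_q, W_k and W_v are learned during training), and `matvec` is a helper written for this example.

```python
def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

# Hypothetical 4 -> 2 projection matrices; real ones are learned.
W_q = [[1, 0], [0, 1], [1, 0], [0, 1]]
W_k = [[0, 1], [1, 0], [0, 1], [1, 0]]
W_v = [[1, 1], [0, 0], [0, 0], [1, -1]]

x = [0.5, -1.0, 2.0, 0.0]  # stand-in for the input vector of one token

q = matvec(x, W_q)  # what this token is looking for
k = matvec(x, W_k)  # what this token offers to other tokens
v = matvec(x, W_v)  # the content it contributes when attended to
```

One input, three roles: the query is compared against other tokens' keys, while the value is what actually gets blended into the output.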

→

Score

Q·Kᵀ / √dₖ vs every token

The · animal · didn't · cross · the · street · because · was · too · tired
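The scoring step is a dot product between one query and every key, scaled by √dₖ. The sketch below uses made-up 4-dimensional vectors for three of the tokens; the numbers are assumptions chosen so that "animal" aligns best with the query.

```python
import math

def scaled_scores(q, keys):
    """Dot product of one query with every key, divided by sqrt(d_k)."""
    d_k = len(q)
    return [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]

# Hypothetical query for "it" and keys for three tokens.
q_it = [1.0, 0.0, 1.0, 0.0]
keys = {
    "animal": [2.0, 0.0, 2.0, 0.0],
    "street": [1.0, 1.0, 0.0, 0.0],
    "tired":  [0.0, 1.0, 0.0, 1.0],
}

scores = dict(zip(keys, scaled_scores(q_it, keys.values())))
# scores -> {"animal": 2.0, "street": 0.5, "tired": 0.0}
```

Dividing by √dₖ keeps the scores from growing with the vector width, which would otherwise push the softmax toward near one-hot outputs.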

→

Softmax

→ weights summing to 1

The 0.00 · animal 0.93 · didn't 0.00 · cross 0.00 · the 0.00 · street 0.05 · because 0.00 · was 0.00 · too 0.00 · tired 0.00
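A minimal softmax looks like this. The scores are hypothetical: they were picked by hand so the result lands near the distribution shown above, and they are not the scores the page actually uses.

```python
import math

def softmax(scores):
    """Exponentiate and normalize; subtracting the max avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["The", "animal", "didn't", "cross", "the", "street",
          "because", "it", "was", "too", "tired"]
# Hypothetical scaled scores: high for "animal", moderate for "street".
scores = [0.0, 5.7, 0.0, 0.0, 0.0, 2.7, 0.0, 0.3, 0.0, 0.0, 0.0]

weights = softmax(scores)
for tok, w in zip(tokens, weights):
    print(f"{tok:8s} {w:.3f}")
```

Because of the exponential, a score gap of a few units is enough for one token to take most of the weight while the rest share a small remainder.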

→

Output

Σ weightᵢ · Vᵢ → context-aware vector

The output now carries information mostly from “animal”. It feeds the feed-forward layer, then the next block.
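The output step is a weighted sum of value vectors. In this sketch the weights are the rounded values this page displays for “it” (assigning the 0.004 entry to “it” itself is an assumption), while the 2-dimensional value vectors are invented so the effect is easy to read.

```python
def weighted_sum(weights, values):
    """Blend value vectors: output[d] = sum_i weights[i] * values[i][d]."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Attention weights for "it", rounded as displayed on this page
# (order: The, animal, didn't, cross, the, street, because, it, was, too, tired).
weights = [0.003, 0.927, 0.003, 0.003, 0.003, 0.046,
           0.003, 0.004, 0.003, 0.003, 0.003]

# Hypothetical value vectors: "animal" points along the first axis,
# "street" along the second, and every other token is zero.
values = [[0.0, 0.0] for _ in weights]
values[1] = [1.0, 0.0]   # animal
values[5] = [0.0, 1.0]   # street

output = weighted_sum(weights, values)  # close to the "animal" direction
```

With these toy values the output is roughly 0.927 along the "animal" axis and 0.046 along the "street" axis, which is what "carries information mostly from animal" means in vector terms.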

The attention map

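Each row is a token asking “what should I pay attention to?” A brighter cell means more attention paid to that column’s token. Real transformers run many such heads in parallel in every layer (8 per layer in the original Transformer base model, more in larger models), and some of them develop recognizable specializations. Toggle three documented head types: coreference, previous-token, and syntactic.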

Coreference head: resolves the pronoun. Watch row “it”: it reaches all the way back to “animal”, with a weaker pull toward the competing antecedent “street”. This long-range binding is what attention handles directly and what older recurrent models struggled with.
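The remark that real models run many heads in parallel can be made concrete with a small sketch. It assumes the common implementation in which one wide projection is sliced into equal per-head chunks; `split_heads` and `merge_heads` are names invented for this example.

```python
def split_heads(vec, num_heads):
    """Slice one projected vector into equal per-head chunks."""
    assert len(vec) % num_heads == 0, "width must divide evenly across heads"
    size = len(vec) // num_heads
    return [vec[h * size:(h + 1) * size] for h in range(num_heads)]

def merge_heads(chunks):
    """Concatenate per-head outputs back into one vector."""
    return [x for chunk in chunks for x in chunk]

q = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # toy projected query, width 8
heads = split_heads(q, num_heads=4)            # 4 heads, each of width 2
restored = merge_heads(heads)                  # same layout as q
```

Each head computes its own scores, softmax and weighted sum on its slice, so one head can track coreference while another tracks the previous token. The merged result then passes through a final output projection.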

[Attention map: rows are query tokens (q), columns are key tokens (k); brighter cells mean more attention.]

Where it actually looks

This is the attention distribution for the selected token: the actual weights, summing to 1.0, that the softmax produces. These weights are the recipe used to blend every token’s Value vector (including the selected token’s own) into a new, context-aware representation.

Token “it” distributes its attention

Head: Coreference head · strongest link → “animal” (93% of its attention)

The 0.003 · animal 0.927 · didn't 0.003 · cross 0.003 · the 0.003 · street 0.046 · because 0.003 · it 0.004 · was 0.003 · too 0.003 · tired 0.003
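The "strongest link" summary can be derived from the listed weights. The sketch below copies the displayed values; treating the unlabeled 0.004 entry as the weight “it” places on itself is an assumption based on its position in the sentence.

```python
tokens = ["The", "animal", "didn't", "cross", "the", "street",
          "because", "it", "was", "too", "tired"]
# Weights as displayed on the page, rounded to three decimals.
weights = [0.003, 0.927, 0.003, 0.003, 0.003, 0.046,
           0.003, 0.004, 0.003, 0.003, 0.003]

pairs = list(zip(tokens, weights))
strongest_token, strongest_weight = max(pairs, key=lambda p: p[1])
total = sum(weights)

print(f"strongest link -> {strongest_token} ({strongest_weight:.0%} of attention)")
print(f"sum of displayed weights: {total:.3f}")
```

The displayed values add up to 1.001 rather than exactly 1.0 because each one is rounded to three decimals; the unrounded softmax output sums to 1.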

On “real world”: production model weights aren’t loaded here, so the patterns shown are hand-built to mirror what mechanistic-interpretability research repeatedly finds: coreference heads binding pronouns to antecedents, previous-token heads, and syntactic heads linking verbs to their subjects and objects. The math you see (scaled scores → softmax → a probability distribution over tokens) is exactly the operation a real layer performs, and the displayed distributions are genuinely computed via softmax(scores).
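One way such a hand-built pattern could be produced is sketched below for a previous-token head. This is an illustration under stated assumptions, not the page's code: the function `previous_token_scores` and the `strength` value of 5.0 are invented for the example.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def previous_token_scores(n, strength=5.0):
    """Hand-built scores: row i favors column i-1 (row 0 favors itself)."""
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        scores[i][max(i - 1, 0)] = strength
    return scores

tokens = ["The", "animal", "didn't", "cross"]
# Row-wise softmax turns the score matrix into an attention map.
attn = [softmax(row) for row in previous_token_scores(len(tokens))]
```

Only the scores are hand-written; the attention map itself still comes from a real softmax, so every row is a valid probability distribution.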

Self-attention: softmax( Q·Kᵀ / √dₖ ) · V · Sentence after Vaswani et al. 2017 / Tensor2Tensor coreference example.


Self-attention · one layer · one head at a time
