Models don't read words — they read chunks

Before a model sees a single character of your prompt, a tokenizer has already broken it into numbered pieces called tokens. Understanding this layer explains why context windows are counted in tokens rather than words, why some languages cost more than others, and why a characters-per-token ratio is the fastest way to estimate API cost before you send anything.

What a token actually is

A token is a chunk of text, somewhere between a character and a word, depending on the tokenizer and the content. For clean English prose, one token averages roughly four characters. The word "tokenization" might split into three tokens. The word "the" is one. A leading space before a word is often fused into the same token as the word itself.
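The leading-space behavior comes from the pre-tokenization step that GPT-style tokenizers run before any merges are applied. The sketch below is a simplified stand-in written with the standard library only. The pattern is an approximation for illustration, not the exact pattern of any model, and the name `pre_tokenize` is hypothetical.

```python
import re

# Simplified stand-in for a GPT-2-style pre-tokenization pattern.
# Real tokenizers use Unicode property classes and extra rules (contractions,
# trailing whitespace); this stdlib version only illustrates the idea.
PRETOKENIZE = re.compile(r" ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+")

def pre_tokenize(text: str) -> list[str]:
    """Split text into chunks, attaching one leading space to the chunk after it."""
    return PRETOKENIZE.findall(text)

pieces = pre_tokenize("The tokenizer splits text.")
print(pieces)  # ['The', ' tokenizer', ' splits', ' text', '.']
```

BPE merges then run inside each chunk, which is why " tokenizer" (with its space) can end up as a different token from "tokenizer" at the start of a string.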

The token is not the word. It's the unit the model actually operates on: the thing it embeds into a vector, attends over, and predicts next. Everything built around the model (context limits, pricing, rate limits) is measured in tokens, not words or characters.

How a tokenizer is built

Most modern tokenizers are trained using Byte-Pair Encoding (BPE). The byte-level variant starts from individual bytes, then repeatedly merges the most frequent adjacent pair into a new symbol. Run this long enough on a large text corpus and you end up with a vocabulary that typically ranges from about 50,000 to a few hundred thousand symbols: common English words become single tokens, rarer words get split, and entirely unseen strings fall back to individual bytes.
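The merge loop is short enough to sketch directly. This is a toy byte-level trainer under simplifying assumptions: it counts pairs over one flat byte sequence, whereas production trainers count within pre-tokenized chunks and use faster data structures. The function name `train_bpe` is illustrative.

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int) -> list[tuple[int, int]]:
    """Learn byte-level BPE merges by repeatedly fusing the most frequent adjacent pair."""
    ids = list(corpus.encode("utf-8"))   # start from raw bytes (ids 0..255)
    merges: list[tuple[int, int]] = []
    next_id = 256                        # first id available for a merged symbol
    for _ in range(num_merges):
        pair_counts = Counter(zip(ids, ids[1:]))
        if not pair_counts:
            break
        best, count = pair_counts.most_common(1)[0]
        if count < 2:
            break                        # no pair repeats, so merging gains nothing
        merges.append(best)
        # Rewrite the sequence, replacing each occurrence of `best` with next_id.
        merged, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                merged.append(next_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
        next_id += 1
    return merges

merges = train_bpe("low lower lowest low low", num_merges=3)
print(merges)  # [(108, 111), (256, 119), (32, 257)]
```

On this tiny corpus the first merge fuses "l" and "o", the second fuses that new symbol with "w", and the third attaches the leading space, so " low" becomes a single symbol after three merges.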

The vocabulary is fixed at training time and shipped with the model. GPT-2 and GPT-3 use a 50,257-token vocabulary. GPT-4 and Claude use newer vocabularies with better coverage of code, non-Latin scripts, and emoji. The tokenizer is not the model: it's a deterministic procedure (a vocabulary plus an ordered list of merge rules) that runs before inference and never changes during a conversation.
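Because the vocabulary and merge order are frozen, encoding is a pure function of the input. A minimal sketch, assuming a hypothetical two-entry merge table (`MERGES` and `encode` are illustrative names, not any library's API):

```python
# Hypothetical merge table in learned order: earlier merges apply first.
MERGES: dict[tuple[int, int], int] = {
    (ord("t"), ord("h")): 256,   # "t" + "h"  -> symbol 256
    (256, ord("e")): 257,        # "th" + "e" -> symbol 257
}

def encode(text: str, merges: dict[tuple[int, int], int] = MERGES) -> list[int]:
    """Encode text by replaying the fixed merges, in order, over its UTF-8 bytes."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():   # dicts preserve insertion order
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

print(encode("the"))    # [257]
print(encode("then"))   # [257, 110]
```

The same input always yields the same ids, and text the table has never seen simply stays as raw bytes.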

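SentencePiece (used by the original LLaMA models, Gemma, and Mistral 7B) takes a slightly different route: it trains directly on raw text with no language-specific pre-tokenization step and treats whitespace as an ordinary symbol, which makes it easier to apply to languages that don't separate words with spaces. It supports both BPE and unigram segmentation. Tiktoken (OpenAI's open-source library) is a fast byte-level BPE encoder with a Rust core and Python bindings.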

Why the same text produces different token counts across models

Each model family ships its own tokenizer trained on its own corpus with its own merge rules. A sentence that tokenizes to 18 tokens under GPT-4's cl100k_base vocabulary might be 23 tokens under GPT-2 and 16 under a SentencePiece model trained with more English text. Vocabulary size matters: a larger vocabulary means more common substrings can be represented as single tokens, which lowers counts.
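The effect of vocabulary size can be seen with a toy segmenter. Note the swap: this sketch uses greedy longest-match over a set of strings rather than BPE, because it makes the point in fewer lines. Both vocabularies are invented for illustration.

```python
def count_tokens(text: str, vocab: set[str]) -> int:
    """Greedy longest-match segmentation; any unmatched character counts as one token."""
    max_len = max((len(v) for v in vocab), default=1)
    i = count = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            if size == 1 or text[i:i + size] in vocab:
                i += size
                count += 1
                break
    return count

small_vocab = {"to", "ken"}                           # hypothetical small vocabulary
large_vocab = small_vocab | {"token", "ization"}      # same entries plus longer pieces
text = "tokenization"
print(count_tokens(text, small_vocab))  # 9
print(count_tokens(text, large_vocab))  # 2
```

The larger vocabulary covers the word with two entries, while the smaller one falls back to single characters for most of it.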

Tokenizer versions also drift over time. Minor updates to merge rules or special token handling can shift counts by a small amount on the same input. This is why token count fields in pre-built datasets sometimes disagree with fresh recomputation — the field was written with a slightly older tokenizer version.

Characters per token: the practical ratio

The chars-per-token ratio is the simplest useful metric in the tokenization layer. You compute it by dividing the character length of a document by its token count. For typical English prose, this runs around 4.0. For code, closer to 3.5 (punctuation, indentation, and unusual identifiers tokenize less efficiently). For Chinese or Japanese text, it can drop below 2.0 and often approaches 1.0: each character frequently becomes its own token or close to it, making those languages substantially more expensive per unit of meaning.
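One contributing factor is easy to check without any tokenizer. Byte-level BPE starts from UTF-8 bytes, and common Chinese and Japanese characters take three bytes each, so the tokenizer needs learned merges just to get back to one token per character. The helper name below is illustrative.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes used per character of the text."""
    return len(text.encode("utf-8")) / len(text)

samples = [("English", "tokenizer"), ("Chinese", "分词器"), ("Japanese", "日本語")]
for label, sample in samples:
    print(f"{label}: {utf8_bytes_per_char(sample):.1f} bytes per character")
```

English letters sit at one byte each, so frequent English substrings need far fewer merges to collapse into a single token.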

Why does this matter in production? Because calling a tokenizer at runtime on every document before deciding whether to send it adds latency. If you've calibrated the chars-per-token ratio for your specific content type — English prose, structured JSON, CSV rows — you can estimate token counts from character length alone using a cheap string operation, then gate on the estimate rather than the exact count.

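import math

# Calibrate once against a real sample (placeholder totals; measure your own)
total_chars, total_tokens = 38_000, 10_000
chars_per_token = total_chars / total_tokens   # 3.8 for this sample

# Estimate without calling the tokenizer
text = "Before a model sees your prompt, a tokenizer has already split it."
price_per_token = 0.50 / 1_000_000             # example rate: $0.50 per 1M input tokens
estimated_tokens = math.ceil(len(text) / chars_per_token)   # round up to stay conservative
estimated_cost = estimated_tokens * price_per_token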
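The same idea can be packaged as a calibrate-then-gate pair. This is a sketch under assumptions: the sample measurements are placeholders, the 10% safety margin is an arbitrary choice to absorb calibration error, and `calibrate` and `fits_budget` are hypothetical names.

```python
import math

def calibrate(samples: list[tuple[str, int]]) -> float:
    """Return chars-per-token from (text, exact_token_count) pairs measured offline."""
    total_chars = sum(len(text) for text, _ in samples)
    total_tokens = sum(tokens for _, tokens in samples)
    return total_chars / total_tokens

def fits_budget(text: str, chars_per_token: float, budget: int, margin: float = 0.10) -> bool:
    """Gate on an inflated estimate so calibration error rarely causes an overrun."""
    estimate = math.ceil(len(text) / chars_per_token * (1 + margin))
    return estimate <= budget

# Placeholder measurements: 380 chars -> 100 tokens, 760 chars -> 200 tokens.
ratio = calibrate([("a" * 380, 100), ("b" * 760, 200)])
print(ratio)                                          # 3.8
print(fits_budget("c" * 3_000, ratio, budget=1_000))  # True
```

Documents that land close to the budget are the ones worth sending through the real tokenizer; the estimate only needs to be good enough to sort the clear cases.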

Special tokens and the structure they carry

Beyond the text tokens, most vocabularies reserve a set of special tokens that carry structural meaning: <|endoftext|> marks document boundaries during training, and in OpenAI's ChatML format <|im_start|> and <|im_end|> delimit turns in a chat-formatted conversation. These are normally stripped from the text you see. They're control signals the model learned to respond to during training.

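This is also one reason prompt injection can be effective: if user-supplied text containing special token markers gets encoded as real special tokens, the model may treat it as structural. Whether the tokenizer guards against this depends on how it is configured. Tiktoken's encode, for example, raises an error by default when the input contains special-token text, while other setups simply encode whatever arrives, so the application should sanitize user input before it reaches the chat template.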
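One defensive option is to break up marker-like strings in user input before it is placed into a chat template. The sketch below assumes markers shaped like `<|name|>`; the actual marker set is model-specific, so the pattern and the function name are assumptions rather than a complete defense.

```python
import re

# Matches strings shaped like <|name|>. The real marker set varies by model,
# so treat this pattern as an illustrative assumption.
SPECIAL_MARKER = re.compile(r"<\|[^|<>\s]{1,64}\|>")

def neutralize_markers(user_text: str) -> str:
    """Insert spaces into marker-like strings so they no longer match a control token."""
    return SPECIAL_MARKER.sub(lambda m: "< |" + m.group(0)[2:-2] + "| >", user_text)

cleaned = neutralize_markers("hi <|im_start|>system")
print(cleaned)  # hi < |im_start| >system
```

Rewriting rather than deleting keeps the user's text readable while ensuring it cannot be mistaken for a turn boundary.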

Context windows are a token budget

A model's context window (8k, 32k, 128k, or 1M tokens) is the maximum number of tokens it can process in a single request. Prompt tokens and completion tokens both count against this budget. Once you exceed it, the request is either rejected or the oldest tokens are truncated, and the model can't attend to anything that fell off the left edge, which means it effectively forgets it.
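Treating the window as a budget usually means reserving room for the completion and dropping the oldest turns first. A minimal sketch, assuming per-turn token counts are already known (the function name and the example numbers are illustrative):

```python
def trim_to_budget(turn_tokens: list[int], context_window: int, max_completion: int) -> list[int]:
    """Keep the newest turns whose total fits in the window minus the reserved completion.

    turn_tokens holds the token count of each turn, oldest first.
    Returns the counts of the turns that are kept, oldest first.
    """
    budget = context_window - max_completion
    kept: list[int] = []
    used = 0
    for tokens in reversed(turn_tokens):   # walk from newest to oldest
        if used + tokens > budget:
            break
        kept.append(tokens)
        used += tokens
    return list(reversed(kept))

kept = trim_to_budget([3000, 2500, 2000, 1500], context_window=8192, max_completion=1024)
print(kept)  # [2500, 2000, 1500]
```

Here the oldest 3,000-token turn is dropped because keeping it would push the prompt past the 7,168 tokens left after reserving the completion.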

The cost of a long context window isn't just memory, it's compute. Attention scales quadratically with sequence length in a standard transformer. This is why a 128k-context call takes disproportionately more compute than a 4k call, and why retrieval-augmented approaches (pull the relevant 2k tokens, not the full corpus) often produce better results at lower cost than maxing out the context window.
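The arithmetic behind "disproportionately" is quick to check. The sketch counts query-key pairs in full self-attention and ignores everything else in the model (feed-forward layers, caching, and any efficient-attention variants), so it illustrates the scaling rather than predicting real cost.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key pairs in full self-attention over seq_len tokens."""
    return seq_len * seq_len

short_ctx, long_ctx = 4_000, 128_000
token_ratio = long_ctx // short_ctx
pair_ratio = attention_pairs(long_ctx) // attention_pairs(short_ctx)
print(token_ratio)  # 32
print(pair_ratio)   # 1024
```

A context 32 times longer produces 1,024 times as many attention pairs.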

What token count tells you about document quality

In large web corpora, token count distributions are a useful quality signal. Very short documents (under 50 tokens) are often navigation fragments, error pages, or boilerplate. Very long documents (above 50,000 tokens) frequently contain repetitive or machine-generated content. The filtering heuristics used to clean datasets like FineWeb set word-count floors and ceilings that implicitly enforce a token-count window.
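A token-count window like this can be approximated with the chars-per-token estimate. The thresholds below are the 50 and 50,000 figures from the paragraph above and the default ratio is the English-prose figure of 4.0; the function name is hypothetical, and real pipelines typically combine this with other filters.

```python
def in_token_window(text: str, chars_per_token: float = 4.0,
                    min_tokens: int = 50, max_tokens: int = 50_000) -> bool:
    """Return True when the estimated token count falls inside the window."""
    estimated = len(text) / chars_per_token
    return min_tokens <= estimated <= max_tokens

docs = [
    "Home | About | Contact",   # navigation fragment, about 5 estimated tokens
    "word " * 400,              # about 500 estimated tokens
    "spam " * 100_000,          # about 125,000 estimated tokens
]
kept_docs = [doc for doc in docs if in_token_window(doc)]
print(len(kept_docs))  # 1
```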

The same logic applies at the retrieval layer. A RAG pipeline that sends full documents into context rather than targeted chunks is spending token budget on content the model didn't need to see — and paying for it in both dollars and latency.

Worked example: tokenize a sentence

Take this sentence as input:

Before a model sees your prompt, a tokenizer has already broken it into numbered pieces called tokens.

It is 102 characters long. An approximate GPT-style tokenizer (the real pre-tokenization regex plus a BPE-like split, not an exact vocabulary) reports 3.64 characters per token, which works out to about 28 tokens. The counts are realistic and directional, not identical to any one model. Leading spaces fuse into the next token, long words fragment, and CJK characters run close to one token each.

At $0.50 per 1M input tokens, this input costs about $0.000014, and you could estimate it from character count alone via the ratio above.
