A vector that means something
An embedding is just a list of numbers (a vector), say 768 or 1,536 of them. What makes it special is how those numbers are produced: a model is trained so that the position of the vector encodes meaning. Words, sentences, or documents that are semantically similar get vectors that point in similar directions; unrelated ones point elsewhere. The individual numbers aren't human-interpretable, but their geometry is.
This is distinct from the RAG explainer, which uses embeddings as one step in a retrieval pipeline. Here we look at the embeddings themselves: what they are, how similarity is measured, and why they work.
Cosine similarity: measure the angle, not the distance
Given two embedding vectors, the standard way to score how related they are is cosine similarity — the cosine of the angle between them. It ranges from +1 (pointing the same direction, nearly identical meaning) through 0 (perpendicular, unrelated) to −1 (opposite). Crucially it ignores magnitude and looks only at direction, which is why a short query and a long document can be compared fairly even though their raw vector lengths differ.
Mechanically: take the dot product of the two vectors and divide by the product of their lengths. That normalization is the whole trick: it strips out “how big” and keeps “which way.”
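The recipe above can be sketched in a few lines of plain Python. The function name cosine_similarity and the sample vectors are illustrative choices, not taken from any particular library; production code would more likely use a vectorized numerical library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))       # dot product
    norm_a = math.sqrt(sum(x * x for x in a))    # length of a
    norm_b = math.sqrt(sum(y * y for y in b))    # length of b
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

v = [1.0, 2.0, 3.0]
scaled = [10.0 * x for x in v]  # same direction, ten times longer

same_direction = cosine_similarity(v, scaled)
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, perpendicular, opposite)
```

Scaling a vector by ten leaves the score at (approximately) 1, which is the "ignores magnitude" property in action; the perpendicular and opposite pairs land at 0 and −1.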
Example: compare two words
With illustrative toy vectors (4 dimensions, hand-placed), cat and kitten sit almost on top of each other, king and queen are close, and king and banana score near zero because they point in unrelated directions. A real embedding model uses hundreds to thousands of dimensions learned from data, but the cosine math is exactly the same. Direction, not distance, is what carries the meaning.
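As a rough stand-in for the word comparison, here is a minimal sketch with five hand-placed 4-dimensional vectors. The numbers in toy_vectors are made up for illustration (they are assumptions, not values from a trained model); only the cosine arithmetic is real.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical hand-placed vectors, chosen so related words point the same way.
toy_vectors = {
    "cat":    [0.9, 0.0, 0.05, 0.2],
    "kitten": [0.8, 0.0, 0.05, 0.3],
    "king":   [0.0, 0.9, 0.05, 0.1],
    "queen":  [0.0, 0.8, 0.05, 0.5],
    "banana": [0.0, 0.0, 1.0,  0.1],
}

def compare(word_a, word_b):
    return cosine(toy_vectors[word_a], toy_vectors[word_b])

for pair in [("cat", "kitten"), ("king", "queen"), ("king", "banana")]:
    print(pair, round(compare(*pair), 3))
```

With these particular numbers, cat / kitten scores highest, king / queen somewhat lower, and king / banana close to zero.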
How embeddings get built
Early word embeddings (Word2Vec, GloVe) learned a single fixed vector per word from co-occurrence statistics — the famous result that king − man + woman ≈ queen came from there. Modern text embeddings are contextual and produced by transformer encoders (see encoder vs decoder): they read a whole sentence and emit a vector that reflects meaning in context, so “bank” in a river sentence lands far from “bank” in a finance one.
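The analogy arithmetic can be illustrated with a toy vocabulary. The vectors below are hand-placed assumptions (axes roughly meaning royalty, gender, and fruit), not Word2Vec or GloVe output, and the helper name analogy is hypothetical. Skipping the input words when picking the nearest neighbor is a common convention in analogy evaluation and is assumed here.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical hand-placed vectors; axes are roughly (royalty, gender, fruit).
vocab = {
    "king":   [0.9,  0.8, 0.0],
    "queen":  [0.9, -0.8, 0.0],
    "prince": [0.8,  0.8, 0.0],
    "man":    [0.1,  0.8, 0.0],
    "woman":  [0.1, -0.8, 0.0],
    "banana": [0.0,  0.0, 1.0],
}

def analogy(a, b, c, vectors):
    """Return the word nearest to a - b + c, skipping the three input words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", vocab)
print(result)  # prints: queen
```

In learned embeddings the match is only approximate, which is why the result is read off as the nearest neighbor rather than an exact equality.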
They are trained with objectives that pull related pairs together and push unrelated pairs apart, an approach known as contrastive learning. The same idea extends across modalities: CLIP trains image and text encoders into one shared space so a photo of a dog lands near the words “a dog,” which is the foundation of multimodal models (covered in the vision encoders explainer).
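To make "pull together, push apart" concrete, here is a minimal sketch of one common contrastive formulation, an InfoNCE-style loss. It is an assumption for illustration: the text does not specify the exact objective any given model uses, the temperature of 0.1 is arbitrary, and the 2-dimensional "image" and "caption" vectors are made up. Real training computes this over large batches and backpropagates through the encoders.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def contrastive_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE-style loss. positives[i] is the match for anchors[i];
    every other positives[j] in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # log-sum-exp, subtracting the max for numerical stability
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]  # equals -log softmax(logits)[i]
    return total / len(anchors)

# Hypothetical embeddings: two "images" and their two "captions".
images = [[1.0, 0.1], [0.1, 1.0]]
captions_aligned = [[0.9, 0.2], [0.2, 0.9]]   # each caption near its image
captions_swapped = [[0.2, 0.9], [0.9, 0.2]]   # each caption near the wrong image

loss_aligned = contrastive_loss(images, captions_aligned)
loss_swapped = contrastive_loss(images, captions_swapped)
print(loss_aligned, loss_swapped)
```

The loss is small when each pair already points the same way and large when the pairs are mismatched, so minimizing it moves matching items closer and non-matching items apart.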
Why dimensionality matters
More dimensions give the space more room to separate fine distinctions, up to a point. There's a tradeoff against storage and search speed, since every stored item is a vector and every query compares against many of them. Production systems keep millions to billions of embeddings in a vector database with approximate-nearest-neighbor indexes so that “find the closest vectors to this query” runs in milliseconds rather than scanning everything.
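A back-of-the-envelope calculation shows the storage side of that tradeoff. This sketch assumes each value is stored as a 4-byte float (a common but not universal choice) and counts raw vectors only, ignoring index structures, metadata, and any compression; the function name is hypothetical.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Bytes needed to store the vectors alone (assumes 4-byte floats by default)."""
    return num_vectors * dimensions * bytes_per_value

GB = 1_000_000_000  # decimal gigabyte

million_768 = raw_vector_bytes(1_000_000, 768)
million_1536 = raw_vector_bytes(1_000_000, 1536)
billion_768 = raw_vector_bytes(1_000_000_000, 768)

for label, size in [("1M x 768", million_768),
                    ("1M x 1536", million_1536),
                    ("1B x 768", billion_768)]:
    print(f"{label}: {size / GB:.3f} GB")
```

Doubling the dimension doubles both the storage and the arithmetic per comparison, which is why the choice of dimensionality is a cost decision as well as a quality one.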
That single operation — embed a query, find its nearest neighbors — powers semantic search, recommendation (“items near what you liked”), clustering (group nearby vectors), deduplication (near-identical vectors are duplicates), and the retrieval step of RAG. Different jobs, one geometric primitive.
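The primitive itself can be sketched as an exact linear scan. This is deliberately not an approximate-nearest-neighbor index: it compares the query against every stored vector, which is the slow path those indexes exist to avoid. The document titles, vectors, and the 0.99 duplicate threshold are made-up assumptions for illustration.

```python
import heapq
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest_neighbors(query, items, k=3):
    """Exact search: score every item, return the k best (score, key) pairs."""
    scored = ((cosine(query, vec), key) for key, vec in items.items())
    return heapq.nlargest(k, scored)

def near_duplicates(items, threshold=0.99):
    """Return key pairs whose vectors are nearly parallel (quadratic scan)."""
    keys = sorted(items)
    return [(a, b) for i, a in enumerate(keys) for b in keys[i + 1:]
            if cosine(items[a], items[b]) >= threshold]

# Hypothetical pre-computed embeddings for four documents.
docs = {
    "cat care guide":        [0.9, 0.1, 0.0],
    "cat care guide (copy)": [0.9, 0.1, 0.0],
    "kitten feeding tips":   [0.8, 0.3, 0.0],
    "royal family history":  [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]  # stands in for an embedded question about cats

top = nearest_neighbors(query, docs, k=3)
dupes = near_duplicates(docs)
print([key for _, key in top])
print(dupes)
```

Search and deduplication here share the same cosine call; only the question asked of the scores differs (top k versus above a threshold).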
The takeaway
Embeddings turn the fuzzy notion of “similar meaning” into the precise, cheap operation of “small angle between vectors.” Once meaning lives in a shared geometric space, comparison, search, and grouping become arithmetic — and that is why embeddings show up underneath so much of what models do, even when you never see them directly.