The problem: a model speaks only one language — vectors
A transformer doesn't fundamentally process text; it processes sequences of vectors (embeddings). Text becomes vectors via a tokenizer and an embedding table. The insight behind multimodal models is that an image can be turned into vectors too — and if those vectors live in a compatible space, the same transformer can attend over images and words together without caring which is which.
ViT: an image is a sequence of patches
The Vision Transformer (ViT) made this clean. Instead of processing pixels with convolutions, it cuts an image into a grid of fixed-size patches (say 16×16 pixels), flattens each patch, maps it to an embedding with a learned linear projection, and adds a position embedding so the model knows where each patch sat in the grid. From there the patches are treated exactly like a sequence of tokens: a standard transformer encoder (see encoder vs decoder) attends over them the same way it would attend over words. An image, in other words, becomes a short sequence of “visual tokens”; a 224×224 image at this patch size yields 14 × 14 = 196 of them.
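The patch step is easy to see in code. What follows is a minimal NumPy sketch, not ViT's reference implementation: the function name `patchify`, the embedding width of 64, and the random weights are illustrative assumptions. In a trained model the projection and position embeddings are learned.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("height and width must be multiples of the patch size")
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    # One row per patch, in raster order (left to right, top to bottom).
    return grid.reshape(rows * cols, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))      # stand-in for a real RGB image
patches = patchify(image)              # shape (196, 768)

embed_dim = 64                         # illustrative; real models are wider
# Random stand-ins for parameters that would normally be learned.
w_embed = rng.normal(0, 0.02, (patches.shape[1], embed_dim))
pos_embed = rng.normal(0, 0.02, (patches.shape[0], embed_dim))

# Linear projection plus position embedding gives the visual tokens.
visual_tokens = patches @ w_embed + pos_embed   # shape (196, 64)
```

Each row of `visual_tokens` plays the role that one word embedding plays in a text model, which is why the encoder that follows needs no image-specific changes.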
CLIP: putting images and text in one space
ViT turns an image into vectors, but those vectors aren't automatically aligned with text. CLIP solved the alignment with contrastive training. Take hundreds of millions of image–caption pairs from the web. Run images through an image encoder and captions through a text encoder. Train both so that a matching image and caption land close in the shared space, and mismatched pairs land far apart.
The result is a single embedding space where “a photo of a golden retriever” and an actual photo of one point in nearly the same direction — the multimodal extension of the idea in the embeddings explainer. That alignment is what enables zero-shot image classification (compare an image to text labels) and, more importantly, gives a language model a vision encoder whose output it can already make sense of.
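The contrastive objective can be sketched in a few lines. This is a simplified illustration rather than CLIP's training code: the encoders are replaced by random vectors, the name `contrastive_loss` is made up for this example, and the temperature is held fixed here, whereas CLIP learns it.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy; row i of each array is a matching pair."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (N, N) scaled cosine similarities
    diag = np.arange(len(logits))
    # Each image should pick its own caption, and each caption its own image.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(0)
text = rng.normal(size=(8, 32))                    # stand-in caption embeddings
aligned = text + 0.1 * rng.normal(size=(8, 32))    # images near their captions
shuffled = aligned[::-1]                           # pairs deliberately mismatched

loss_aligned = contrastive_loss(aligned, text)
loss_shuffled = contrastive_loss(shuffled, text)
```

The loss is low when matching pairs sit on the diagonal of the similarity matrix and high when they do not, so minimizing it pulls matched pairs together and pushes mismatched ones apart. Zero-shot classification reuses the same similarity matrix: embed one caption per label and take the most similar one for each image.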
Fusing vision into a language model
A vision-language model (VLM) wires a pretrained vision encoder into a pretrained language model. The image is encoded into visual tokens, a small projection layer maps those tokens into the language model's embedding space, and they are dropped into the prompt right alongside the text tokens. From the language model's perspective the image is just more tokens it can attend to, so “what is in this picture?” becomes an ordinary next-token problem with some of the context coming from pixels.
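A rough sketch of that wiring, under toy assumptions: the dimensions, the token ids, and the helper name `build_input` are all hypothetical, the vision encoder's output is replaced by random numbers, and the projection is a single linear layer (some models use a small MLP instead).

```python
import numpy as np

rng = np.random.default_rng(0)
vision_dim, lm_dim, vocab_size = 32, 64, 1000   # toy sizes, far below real models

token_embedding = rng.normal(size=(vocab_size, lm_dim))  # the LM's embedding table
w_proj = rng.normal(0, 0.02, (vision_dim, lm_dim))       # projection layer weights
b_proj = np.zeros(lm_dim)

def build_input(visual_feats, prefix_ids, suffix_ids):
    """Return the (seq_len, lm_dim) sequence the language model would see."""
    visual_tokens = visual_feats @ w_proj + b_proj   # map into the LM's space
    prefix = token_embedding[prefix_ids]             # ordinary text lookups
    suffix = token_embedding[suffix_ids]
    # The image sits in the prompt as a run of extra "tokens".
    return np.concatenate([prefix, visual_tokens, suffix], axis=0)

visual_feats = rng.normal(size=(196, vision_dim))  # stand-in for encoder output
prefix_ids = np.array([1, 17, 42])                 # hypothetical ids before the image
suffix_ids = np.array([5, 9, 301, 7, 88])          # hypothetical ids for the question
inputs = build_input(visual_feats, prefix_ids, suffix_ids)   # shape (204, 64)
```

Nothing downstream distinguishes the 196 projected rows from the 8 text rows, which is the sense in which the image is "just more tokens."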
Different designs connect the two streams differently. Flamingo inserted gated cross-attention layers and a resampler so that a frozen language model could attend to visual features. LLaVA showed a strikingly simple recipe works well: a CLIP-style vision encoder, a lightweight projection, and instruction tuning on image–question–answer data. The trend has been toward simpler fusion and treating images as tokens.
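For contrast with the tokens-in-the-prompt approach, here is a heavily simplified sketch of the cross-attention route. It is single-head, omits the resampler and feed-forward sublayers, and uses random weights; the function name and dimensions are assumptions made for illustration. The tanh gate follows Flamingo's idea of starting the gate at zero, so the language model initially behaves exactly as it did before any visual input was attached.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(text_h, visual, w_q, w_k, w_v, gate):
    """Single-head cross-attention: text positions query the visual features."""
    q = text_h @ w_q                                    # (T, d)
    k = visual @ w_k                                    # (V, d)
    v = visual @ w_v                                    # (V, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (T, V), rows sum to 1
    attended = weights @ v                              # (T, d)
    # Residual connection; tanh(gate) controls how much vision leaks in.
    return text_h + np.tanh(gate) * attended

rng = np.random.default_rng(0)
d_text, d_vis = 64, 32                       # toy sizes
text_h = rng.normal(size=(5, d_text))        # hidden states for 5 text tokens
visual = rng.normal(size=(196, d_vis))       # stand-in visual features
w_q = rng.normal(0, 0.1, (d_text, d_text))
w_k = rng.normal(0, 0.1, (d_vis, d_text))
w_v = rng.normal(0, 0.1, (d_vis, d_text))

out_closed = gated_cross_attention(text_h, visual, w_q, w_k, w_v, gate=0.0)
out_open = gated_cross_attention(text_h, visual, w_q, w_k, w_v, gate=1.0)
```

With the gate at zero the layer is an identity on the text stream; as the gate moves away from zero, visual information is mixed in. Here the image never enters the prompt as tokens, so the text sequence length is unchanged.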
What this explains in practice
Several familiar behaviors fall out of this design. Images cost tokens — a high-resolution image becomes many visual tokens, which is why vision calls consume more of your context budget and cost more (and why models tile or downsample large images). Models often struggle with dense text in images or precise spatial detail, because a coarse patch grid throws away fine information. And the same machinery generalizes: audio and video are encoded into tokens by their own encoders and fused the same way, which is how “omni” models handle several modalities at once.
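The token cost is simple arithmetic. The helper below is a hypothetical back-of-the-envelope calculation that assumes one token per 16×16 patch with no pooling or compression; real models often reduce the count, so treat the numbers as illustrative rather than as any provider's billing formula.

```python
import math

def visual_token_count(height: int, width: int, patch: int = 16) -> int:
    """Patches needed to cover the image, padding partial patches at the edges."""
    return math.ceil(height / patch) * math.ceil(width / patch)

small = visual_token_count(224, 224)      # 14 * 14 = 196
medium = visual_token_count(448, 448)     # 28 * 28 = 784
large = visual_token_count(1024, 1024)    # 64 * 64 = 4096
```

Doubling each side of the image quadruples the token count, which is the pressure behind tiling and downsampling. It also shows the resolution trade-off: shrinking a 1024×1024 screenshot to 224×224 saves thousands of tokens but leaves each patch covering far more of the original scene, so small text and fine spatial detail are the first things lost.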
The throughline: vision didn't require reinventing the transformer. It required a way to turn pixels into tokens (ViT) and a way to make those tokens speak the same language as text (CLIP) — after which the model you already understand does the rest.