Same transformer, three ways to wire it

BERT, GPT, and T5 are all built from the same attention blocks introduced in “Attention Is All You Need.” What separates them is which half of the original architecture they keep and how attention is allowed to flow. That single choice decides whether a model is best at understanding text, generating it, or transforming one sequence into another.

The original had two halves

The 2017 transformer was designed for translation and had two stacks. An encoder read the whole input sentence at once and built a rich representation of it. A decoder then generated the output one token at a time, attending both to what it had produced so far and, through cross-attention, to the encoder's representation of the input. Three families fall out of keeping the encoder, the decoder, or both. (For how attention itself works, see the self-attention explainer.)

Encoder-only (BERT): built to understand

An encoder-only model uses bidirectional attention: every token can attend to every other token, both left and right. This makes it excellent at building a deep representation of a fixed input — ideal for classification, named-entity recognition, retrieval embeddings, and any task where you have the whole text and want to understand or score it.

Because each token sees the future, an encoder cannot be used to generate text left-to-right — there would be nothing to predict that it hasn't already seen. BERT is therefore pretrained with masked language modeling: randomly hide some tokens and ask the model to fill them in using both sides of context. Models like BERT, RoBERTa, and the embedding models behind most retrieval systems live here.

Decoder-only (GPT): built to generate

A decoder-only model uses causal (masked) attention: each token may attend only to tokens before it, never ahead. That restriction is exactly what makes next-token generation possible and honest — at position nthe model genuinely doesn't see position n+1, so predicting it is a real task. This is the architecture behind GPT, Claude, Llama, Gemini, and essentially every modern chat and reasoning model.

It turns out that this single objective — predict the next token (see pretraining) — scales astonishingly well and subsumes most tasks: classification, translation, and summarization can all be framed as “continue this text appropriately.” That generality is why the field consolidated around decoder-only models even for jobs an encoder once did.

Encoder-decoder (T5): built to transform

The full two-stack design keeps a bidirectional encoder and a causal decoder joined by cross-attention. The encoder fully understands the input; the decoder generates an output conditioned on it. This shines for true sequence-to-sequence transformation where input and output are distinct objects — machine translation, summarization, speech-to-text — because the model can read the entire source freely while still generating the target autoregressively.

T5 (“Text-to-Text Transfer Transformer”) framed every NLP task as text-in, text-out, which made the encoder-decoder shape a clean fit. The tradeoff is added complexity and two attention regimes to train; for open-ended chat, a single decoder stack proved simpler and just as capable, which is why encoder-decoders are now more specialized than dominant.

How to choose

The practical rule: if you have a fixed text and want to understand or embed it (search, classification, similarity), an encoder-only model is cheap and strong. If you want to generate or chat open-endedly, a decoder-only model is the default and what every frontier API gives you. If your task is a clean transformation of one sequence into another and you control the training, an encoder-decoder can be the most data-efficient choice.

In 2026 most people never pick consciously — they call a decoder-only chat model and it handles everything. But knowing why embeddings come from encoders and generation comes from decoders explains a lot of otherwise-mysterious architecture choices, including why a RAG pipeline pairs an encoder embedding model with a decoder generation model.