The cost attention can't escape
In a transformer, every token attends to every other token, so the work grows with the square of the sequence length. At inference, the model also keeps a key/value entry for every past token (the KV cache), which keeps growing linearly for as long as generation continues. That's what makes very long contexts expensive in both compute and memory. The natural question: is there an architecture that handles sequences without comparing every token to every other one?
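To put rough numbers on that scaling, here is a back-of-the-envelope sketch. The model shape it uses (32 layers, 8 key/value heads, head dimension 128, 2-byte values) is a hypothetical configuration chosen for illustration, not any particular model, and the function names are invented for this example.

```python
def attention_pairs(n_tokens: int) -> int:
    """Token-to-token comparisons for full self-attention.

    Causal masking roughly halves this count, but the growth stays quadratic.
    """
    return n_tokens * n_tokens


def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes needed to cache one key and one value vector per token, per layer.

    All default sizes are assumptions made for this illustration.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = key + value
    return n_tokens * per_token


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens: {attention_pairs(n):.1e} pairs, "
          f"{kv_cache_bytes(n) / 2**20:.0f} MiB of KV cache")
```

Under these assumptions, ten times more tokens means a hundred times more comparisons but only ten times more cache: compute is the quadratic term, memory the linear one.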
The state space idea: carry a running summary
State space models borrow from classical signal processing. Instead of looking back over all previous tokens, they maintain a fixed-size hidden state — a compressed running summary of everything seen so far. At each token, the model updates the state from the previous state and the new input, then reads its output from the state. The state never grows; token 1 and token 1,000,000 cost the same.
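The update-then-read loop can be sketched in a few lines. This is a minimal illustration, assuming a scalar input per step and a diagonal state update (each state entry has its own decay `a_i`); the names `a`, `b`, `c`, and `ssm_scan` are chosen for this example and are not taken from any library.

```python
def ssm_scan(xs, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t (elementwise) and y_t = sum(c * h_t).

    xs: sequence of scalar inputs
    a, b, c: lists of equal length, one entry per state dimension (assumed diagonal)
    Returns the outputs and the final state.
    """
    h = [0.0] * len(a)  # fixed-size state: its length never depends on len(xs)
    ys = []
    for x in xs:
        # Update the state from the previous state and the new input.
        h = [a_i * h_i + b_i * x for a_i, b_i, h_i in zip(a, b, h)]
        # Read the output from the state.
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys, h


ys, h = ssm_scan([1.0] * 1000, a=[0.5, 0.9], b=[1.0, 1.0], c=[1.0, 1.0])
print(len(ys), "outputs from a state of size", len(h))
```

Each step does the same amount of work and touches the same two state entries, whether it is the first token or the thousandth.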
This is a recurrence, like an RNN. RNNs had exactly this constant-memory appeal but two fatal problems: they were slow to train (inherently sequential) and bad at long-range dependencies (information decayed). S4 (the Structured State Space sequence model, 2021) contributed a special, mathematically structured way of defining the state update that addressed both: it could be computed in parallel during training (as a convolution), and it preserved long-range information far better.
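The reason a fixed linear recurrence can be trained as a convolution is that unrolling it gives each output as a weighted sum of past inputs, with weights that depend only on how far back the input is. The sketch below checks that equivalence numerically. It assumes a diagonal state update and scalar inputs, and uses a naive quadratic-time convolution for clarity; it is not S4's actual parameterization or its fast kernel computation, and all names are invented for this example.

```python
def ssm_recurrent(xs, a, b, c):
    """Sequential form: h_t = a * h_{t-1} + b * x_t, y_t = sum(c * h_t)."""
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        h = [a_i * h_i + b_i * x for a_i, b_i, h_i in zip(a, b, h)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


def ssm_kernel(a, b, c, length):
    """K[k] = sum_i c_i * a_i**k * b_i: the output k steps after a unit input."""
    return [sum(c_i * (a_i ** k) * b_i for a_i, b_i, c_i in zip(a, b, c))
            for k in range(length)]


def causal_conv(xs, kernel):
    """y[t] = sum over k <= t of kernel[k] * xs[t - k].

    Every t is computed independently of the others, so this form parallelizes.
    """
    return [sum(kernel[k] * xs[t - k] for k in range(t + 1)) for t in range(len(xs))]


xs = [1.0, -2.0, 0.5, 3.0]
a, b, c = [0.5, 0.9], [1.0, 0.3], [0.7, -1.0]
y_rec = ssm_recurrent(xs, a, b, c)
y_conv = causal_conv(xs, ssm_kernel(a, b, c, len(xs)))
assert all(abs(p - q) < 1e-9 for p, q in zip(y_rec, y_conv))
```

The equivalence relies on `a`, `b`, and `c` being the same at every step, which is exactly the property that makes the update static.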
The catch S4 hit — and Mamba's fix
S4 was efficient but its state update was fixed: it processed every token the same way, unable to decide that some tokens matter more than others. Attention's superpower is exactly that selectivity — it dynamically focuses on relevant tokens. A static recurrence can't, which kept early SSMs behind transformers on language.
Mamba's key move (2023) was to make the state update selective: its parameters become functions of the input, so the model can choose, per token, what to keep in the state and what to ignore — effectively a content-aware gate. That recovered much of what made attention strong, while keeping linear scaling. Making the selective recurrence run fast on GPUs required a hardware-aware parallel-scan implementation, which is the other half of the contribution.
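A toy version of a selective update might look like the following. It is a simplified sketch, not Mamba's implementation: it assumes scalar inputs, a diagonal state with negative decay rates `a`, scalar weights in place of learned linear projections, and a plain sequential loop instead of a parallel scan. The names `w_b`, `w_c`, `w_delta`, and `bias_delta` are hypothetical.

```python
import math


def softplus(z: float) -> float:
    """Smooth map from any real number to a positive one."""
    return math.log1p(math.exp(z))


def selective_scan(xs, a, w_b, w_c, w_delta, bias_delta):
    """Linear recurrence whose parameters are recomputed from each input.

    a: negative decay rates, one per state dimension (assumed diagonal)
    w_b, w_c: per-dimension weights that turn the input into B_t and C_t
    w_delta, bias_delta: weights that turn the input into a step size
    """
    h = [0.0] * len(a)
    ys, deltas = [], []
    for x in xs:
        delta = softplus(w_delta * x + bias_delta)  # input-dependent step size > 0
        b_t = [w * x for w in w_b]                  # input-dependent write weights
        c_t = [w * x for w in w_c]                  # input-dependent read weights
        # Small delta: exp(delta * a_i) is near 1 and little is written, so the
        # token is mostly ignored. Large delta: old state decays, new input dominates.
        h = [math.exp(delta * a_i) * h_i + delta * b_i * x
             for a_i, h_i, b_i in zip(a, h, b_t)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c_t, h)))
        deltas.append(delta)
    return ys, h, deltas


ys, h, deltas = selective_scan([2.0, -2.0], a=[-1.0], w_b=[1.0], w_c=[1.0],
                               w_delta=5.0, bias_delta=0.0)
print("step sizes per token:", deltas)
```

Because the decay and write strength now change per token, the fixed-kernel convolution no longer applies, which is why a fast scan implementation matters.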
What you get — and what you give up
The payoff is concrete: linear-time processing of sequences and a constant-size state at inference, with no KV cache that balloons with context length. For very long sequences (genomics, audio, long documents) that's a structural advantage transformers can't match without extra tricks. Mamba was the first SSM to match transformers of similar size on language modeling, which is what made the field take notice.
The trade-off is the flip side of the fixed state. Because everything must be compressed into a bounded summary, tasks that need precise recall of an arbitrary earlier token (“copy this exact string from 50k tokens ago”) are harder for a pure SSM than for attention, which can look directly at any position. In practice the strongest results have come from hybrid models that interleave a few attention layers among many Mamba layers, getting linear scaling for most of the network plus attention's exact recall where it counts.
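As a rough picture of what such an interleaving looks like, the sketch below lays out a layer stack with one attention layer at a fixed interval. The 24-layer depth and the one-in-eight ratio are hypothetical values picked for illustration, and `hybrid_layer_plan` is an invented helper, not part of any framework.

```python
def hybrid_layer_plan(n_layers: int, attention_every: int) -> list[str]:
    """Return layer types, with every `attention_every`-th layer being attention."""
    return ["attention" if (i + 1) % attention_every == 0 else "mamba"
            for i in range(n_layers)]


# Hypothetical sizes, chosen only to show the pattern.
plan = hybrid_layer_plan(n_layers=24, attention_every=8)
print(plan.count("mamba"), "Mamba layers,", plan.count("attention"), "attention layers")
print(plan[:8])
```

Only the attention layers in such a plan still keep a KV cache, so memory grows with context for a small fraction of the stack rather than for all of it.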
Where it stands
Transformers still dominate frontier language models, and the ecosystem — tooling, training recipes, intuition — is built around them. But state space models have moved from curiosity to a serious tool, especially for long-context and non-text modalities, and hybrid architectures are increasingly common. The broader lesson is that attention, for all its success, is not the only way to model sequences — and the pressure of long-context cost is exactly what keeps the alternatives advancing.