# Thinking in Sequences ## A Visual Guide to Transformers and State Space Models > *A pedagogical explainer for curious minds with no machine learning background.* > *Research notes: [[research_notes/index]] | Date: 2026-05-04* > *Research depth: 21 notes, ~450KB of source material* --- > [!info] [Deep Dive Podcast](https://notebooklm.google.com/notebook/084ead47-4be5-4fb6-80d5-1a23f39c6634/artifact/a1baaf56-00ed-494e-bbe4-a98f5a4a8963) > Generated with NotebookLM from this research. Every time an AI system like ChatGPT generates the next word of its reply, it re-reads everything — your entire prompt, every previous exchange in the conversation, every line of context. This is not a bug; it is the fundamental design of the **Transformer**, the architecture powering every frontier AI system you interact with today. The price of this total-recall approach is steep: as conversations grow longer, the computation doesn't grow proportionally. It explodes. Double the length; quadruple the cost. For nearly a decade, researchers worked on an alternative rooted in the mathematics of rockets and satellites: **State Space Models (SSMs)**. Instead of re-reading everything each time, SSMs maintain a fixed-size "running summary" — a compressed impression of everything encountered so far — and update it with each new word (or *token* — the atomic unit of text an AI processes, roughly equivalent to a word or word fragment). The result is constant-cost inference regardless of context length, and linear training cost rather than quadratic. The trade-off is subtle but real: compressed memory occasionally blurs fine details that exact recall would preserve perfectly. The research community's current answer is neither pure architecture. **Hybrid models** — with as little as 7–12% attention layers providing precise recall (the exact ratio varies by model), ~43% SSM layers for efficient context flow, and ~50% standard feed-forward layers — have been shown to outperform both pure Transformers and pure SSMs while combining their respective strengths. Meanwhile, in biology and signal processing, pure SSMs have become the dominant approach: HyenaDNA processes *one million* DNA nucleotides at single-nucleotide resolution, a context length no Transformer can practically approach. This document tells that story — from the original intuition of a Google researcher on a 2017 afternoon, through the rockets-to-language journey of HiPPO and S4, to the jazz-musician selectivity of Mamba and the quiet theoretical revolution that showed these two architectures were, all along, two views of the same mathematical object. > **See also:** [[research_notes/index]] for all underlying research notes and sources. --- > [!INFO] What This Document Covers > The research for this report identified ten focus areas for a layperson audience: > > 1. **The Sequence Problem** — What are sequence models solving, and why is it hard? > 2. **The Historical Arc** — From RNNs to LSTMs to Transformers: how did we get here? > 3. **How Transformers "Pay Attention"** — QKV, the cocktail party, and the quadratic wall > 4. **How SSMs "Compress Memory"** — From Kalman's rockets to Mamba's selectivity > 5. **Five Lenses of Comparison** — Memory, speed, recall, inference, and training > 6. **The Zoology Surprise** — Why 82% of the quality gap comes down to one capability > 7. **Real-World Products** — What runs ChatGPT, Gemini, Falcon Mamba, HyenaDNA? > 8. **SSMs' Killer Apps** — Genomics, audio, streaming, on-device AI > 9. 
**Hybrid Models** — Jamba, Griffin, and the "7% attention" sweet spot > 10. **Where We're Headed** — Mamba-2 unification, Titans, xLSTM, the 2027 wager # Part I: What Is a Sequence? Before we can compare how AI systems *read*, we need to understand what they're reading. In ordinary life, a "sequence" simply means things arranged in order: the letters in a word, the words in a sentence, the notes in a melody. In the language of AI, the concept stretches much further than most people expect. Text is a sequence — each word following the last, carrying meaning that depends on what came before and anticipates what comes next. But so is audio: a recording of speech is a sequence of pressure measurements taken tens of thousands of times per second. DNA is a sequence of four chemical letters — A, T, C, G — strung together in chains of billions. Source code is a sequence. A video is a sequence of image sequences. An electrocardiogram is a sequence. Stock prices are a sequence. A protein is a sequence of amino acids. What unites all of these, from the AI's perspective, is a deceptively simple problem: *given what has come before, what comes next?* Or, in a richer version: *given the full sequence, what does it mean?* This framing matters because the difficulty of answering that question depends enormously on *how long* the sequence is. Understanding the sentence "The cat sat on the mat" requires holding six words in mind simultaneously — trivial even for a 2019 laptop. Understanding a 300-page legal contract requires holding hundreds of thousands of words in mind, tracking which clause modifies which, which defined term was introduced in section 2.1, and which exception overrides which general rule — very far from trivial for any system alive today. > [!NOTE] The core problem > The challenge is not just understanding long sequences — it is that the cost of understanding them grows **faster** than their length. This is the fundamental engineering tension that defines the Transformer vs SSM story. Think of it this way. Imagine you're a journalist summarising a press conference. If the conference lasts twenty minutes, you can probably hold the whole thing in your head as you write. If it lasts twenty hours, you need notes. If it lasts twenty days, you need notes *about* your notes — a compressed version of a compressed version, where every layer of abstraction risks losing something that will turn out to matter. Every AI sequence model faces this same dilemma, just millions of times per second, across billions of parameters. The river of tokens flows in. Something must decide what to remember, what to compress, and what to let go. The different answers to that question — keep everything precisely, or maintain a smart running summary — define the two great families of modern sequence architectures. The rest of this article is the story of how those choices were made, what they cost, and what they buy. --- # Part II: The Story So Far The story of sequence AI is one of recurring rediscovery. In the 1980s and 1990s, **Recurrent Neural Networks (RNNs)** were the dominant approach: process tokens sequentially, maintaining a rolling "hidden state." They worked, but only up to a point. Gradients vanished over long sequences — information about something said in sentence one was unrecoverable by sentence twenty. 
The 1997 invention of **Long Short-Term Memory (LSTM)** networks, by Sepp Hochreiter and Jürgen Schmidhuber, added learned "gate" mechanisms that could selectively remember or forget — a first approximation of what Mamba would later do far more generally. For nearly two decades, LSTMs dominated language AI. Then the 2017 Transformer paper arrived, and within three years LSTMs were largely abandoned — too slow to train at scale, too limited on long sequences, despite their gating innovations. In an ironic epilogue: in 2024, Hochreiter's own team published **xLSTM**, a modernised LSTM with exponential gating and matrix-valued memory that competes with both Transformers and SSMs. The original inventor, returning to the problem after thirty years with fresh eyes, rediscovered the same core ideas the field had been converging on from multiple directions. The history of AI memory is a story of the same insight, rediscovered, each time more precisely. --- # Part III: The Transformer > [!NOTE] Tokens: the AI's unit of reading > > Throughout this document, you will encounter the word **token** — the fundamental unit of text that AI models process. A token is not quite a word and not quite a letter. Think of it as a **word fragment**: > > - The word "unbelievable" might be split into three tokens: "un", "believ", "able" > - The word "cat" is typically one token > - A space, a comma, or a period is usually its own token > - The sequence "New York City" is approximately four tokens: "New", " York", " City" — with the space attached to the following word > > Why fragments, not whole words? Because splitting common words into reusable pieces lets the model handle any word it has never seen before (it just combines the fragments it knows), while keeping the vocabulary manageable. A typical large language model uses a vocabulary of 50,000–100,000 tokens to represent effectively unlimited text. > > **For this article:** when you see "tokens," read "roughly words." A 1,000-token passage is approximately 750 words. The distinction matters for engineering; for understanding the core architectural ideas, "words" and "tokens" are interchangeable. ### A Revolution in a Weekend On a June afternoon in 2017, eight researchers at Google published a paper with a deliberately provocative title: **"Attention Is All You Need."**[^1] The paper proposed dispensing entirely with the sequential machinery that had dominated AI's understanding of language for a decade — the step-by-step reading, the memory chains, the carefully engineered forgetting mechanisms — and replacing it all with a single idea, applied very carefully, at very large scale. That idea was **attention**, and the architecture it produced was the Transformer. Within three years, it had become the engine powering virtually every significant AI product in the world. To understand why the Transformer was revolutionary, you first need to understand what it replaced. ### The Old Way: Reading One Word at a Time The dominant systems before 2017 were called **Recurrent Neural Networks** (RNNs), and their most capable variant, **Long Short-Term Memory networks** (LSTMs).[^2] These systems read text the way a person might read a telegram: one word at a time, updating a small "memory note" as they went. By the time they reached the final word, they were supposed to have distilled everything important into that one note. The problem was telephone whispers. Pass a message through fifty people, and the message at the end barely resembles what was said at the start. 
The same happened with RNN memory: early details, processed long before the model reached the end, tended to fade. Worse still, the step-by-step processing could not be parallelised — you had to finish word three before you could start word four. Training was slow. > [!NOTE] The vanishing gradient problem > When training sequential models, the signal that tells the network "something important happened twenty words ago" must travel backwards through twenty computational steps to reach the place where that word was processed. Each step dilutes the signal slightly. After twenty steps — let alone two thousand — the signal has shrunk to nothing. The network effectively forgets. LSTMs added clever gates — learned switches that could decide what to remember and what to discard — and this helped enormously. But the sequential bottleneck remained, and for very long texts, even the best LSTM eventually lost the plot. ### The Key Insight: Relevance Scoring The Transformer's solution was audacious: skip sequential reading entirely. Instead, *look at every word simultaneously and ask, for each word, which other words are most relevant to understanding it*. This is **attention**. And it is worth being precise about what it does, because the mechanism is both more elegant and more concrete than it might sound. Think about how Google Search works. You type a query — "best pizza in Naples" — and Google matches your query against the index of every web page (the *keys*), then returns the content (the *values*) of the pages that score highest. The trick is that you don't get just the single best match; you get a ranked blend, weighted by relevance score. Attention does exactly this, but for words inside a sentence.[^3] Every word simultaneously asks a question (its **Query**): "Which other words in this sentence are relevant to understanding me?" Every word also presents itself as a potential answer (its **Key**): "Here is what kind of word I am." And every word carries actual content (its **Value**): "Here is what I contribute when you pay attention to me." The mechanism computes the match between every word's query and every other word's key, converts those raw scores into a percentage share for each word — the higher the score, the more attention that word gets — via a mathematical operation called softmax, and then blends all the value vectors together weighted by those percentages. The result is a new, contextually enriched representation of each word — one that captures not just what the word means in isolation but what it means *in this sentence, next to these neighbours*. 
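If you are comfortable reading a few lines of code, the whole mechanism fits in a short sketch. Below is a minimal, illustrative NumPy version of the scaled-dot-product attention just described, using the same toy two-dimensional vectors that Diagram 11 walks through below. The numbers are made up for illustration; a real model learns these vectors and uses hundreds of dimensions.

```python
import numpy as np

# Toy Q, K, V vectors for the tokens "The", "cat", "sat"
# (the same illustrative numbers as Diagram 11 below — not from any real model).
Q = np.array([[0.2, 0.9],   # "The"
              [0.9, 0.1],   # "cat"
              [0.1, 0.4]])  # "sat"
K = np.array([[0.8, 0.1],
              [0.3, 0.7],
              [0.6, 0.2]])
V = np.array([[0.5, 0.3],
              [0.7, 0.8],
              [0.2, 0.6]])

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

d = Q.shape[-1]                      # vector dimension (2 in this toy example)
scores = Q @ K.T / np.sqrt(d)        # every Query scored against every Key
weights = softmax(scores, axis=-1)   # each row becomes a percentage share summing to 1
output = weights @ V                 # blend the Values using those shares

print(weights[1].round(2))  # attention paid by "cat" to [The, cat, sat]
print(output[1].round(2))   # "cat"'s new, context-enriched vector
```

Running it prints roughly `[0.38, 0.29, 0.34]` for "cat"'s attention weights and roughly `[0.46, 0.54]` for its new blended vector — the same numbers the diagram below derives by hand.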
### Diagram 11: QKV Attention Mechanism — Step by Step

```
COMPUTING SELF-ATTENTION FOR "cat" IN "The cat sat"

STEP 1: Create Q, K, V vectors for each token via learned weight matrices
─────────────────────────────────────────────────────────────────────────
                     W_Q               W_K               W_V
                  (learned)         (learned)         (learned)
Token "The" →  Q_the=[0.2,0.9]   K_the=[0.8,0.1]   V_the=[0.5,0.3]
Token "cat" →  Q_cat=[0.9,0.1]   K_cat=[0.3,0.7]   V_cat=[0.7,0.8]
Token "sat" →  Q_sat=[0.1,0.4]   K_sat=[0.6,0.2]   V_sat=[0.2,0.6]

STEP 2: Score "cat"'s Query against every Key (dot product)
─────────────────────────────────────────────────────────────────────────
Q_cat · K_the = (0.9×0.8) + (0.1×0.1) = 0.72 + 0.01 = 0.73
Q_cat · K_cat = (0.9×0.3) + (0.1×0.7) = 0.27 + 0.07 = 0.34   ← attending to itself
Q_cat · K_sat = (0.9×0.6) + (0.1×0.2) = 0.54 + 0.02 = 0.56

Raw scores: [0.73, 0.34, 0.56]

STEP 3: Scale and Softmax (turn scores into a probability distribution)
─────────────────────────────────────────────────────────────────────────
Scale by √(dim) = √2 ≈ 1.41:   [0.52, 0.24, 0.40]
          ↓
Softmax:  e^0.52 / Σ ,  e^0.24 / Σ ,  e^0.40 / Σ
          ─────────────────────────────────────────
Attention weights:  [ 0.38 , 0.29 , 0.34 ]
                       ↑      ↑      ↑
                     "The"  "cat"  "sat"

(Interpretation: "cat" attends most to "The", then to "sat", least to itself)

STEP 4: Weighted sum of Value vectors
─────────────────────────────────────────────────────────────────────────
Output = 0.38 × V_the + 0.29 × V_cat + 0.34 × V_sat
       = 0.38×[0.5,0.3] + 0.29×[0.7,0.8] + 0.34×[0.2,0.6]
       = [0.19,0.11] + [0.20,0.23] + [0.07,0.20]
       = [0.46, 0.54]   ← new representation of "cat" enriched with context

┌──────────────────────────────────────────────────────────────┐
│ KEY INSIGHT: "cat" no longer represents only itself.          │
│ Its new vector [0.46, 0.54] is a BLEND of all tokens,         │
│ weighted by how relevant each token is to "cat".              │
└──────────────────────────────────────────────────────────────┘
```

---

This is why the Transformer can tell that the word "bank" in "she sat by the river bank" means something completely different from "bank" in "the bank refused her loan."[^3] When processing the first "bank," the Query finds high-relevance matches in "river" and "sat" — and the blended representation that emerges is firmly in the domain of geography. The second "bank" finds its attention drawn to "loan" and "refused." Same word, completely different meaning — and the distinction emerges automatically, without any special rule.

> [!TIP] The cocktail party
> Imagine you're at a loud party where a hundred conversations are happening at once. Your brain doesn't process all sounds at equal volume — it filters dynamically. When someone across the room says your name, your attention snaps to them. A Transformer's attention mechanism is like this, but it runs simultaneously for every word in the text — every word "listening" for its most relevant neighbours all at once, in one enormous parallel act of comprehension.[^4]

### Parallel Processing: The Training Revolution

The second revolutionary property of the Transformer — arguably as important as attention itself — is **parallelism during training**. Because all words attend to all words *simultaneously*, a Transformer can be trained on modern GPUs far more efficiently than any sequential model. With an RNN, you had to process word one, then word two, then word three; the chain was inescapable. With a Transformer, you hand the entire sentence to the GPU at once, and it computes all the attention scores in parallel — hundreds of words, simultaneously cross-examining each other, in a single matrix operation.
Modern GPUs have thousands of cores designed to do exactly this kind of massively parallel matrix computation. The Transformer was, perhaps accidentally, the perfect architecture for the hardware that existed in 2017. Training that once took weeks on sequential models could now be done in days.

### The Quadratic Problem

There is, however, a price. If attention asks every word to examine every other word, then the number of comparisons grows with the *square* of the sequence length. Double the text, and you don't just double the work — you quadruple it. Triple the text, and you multiply the work by nine. Researchers have a precise name for this kind of growth: **O(n²) complexity**, read aloud as "order n-squared." In practice, it looks like this:[^5]

```
Sequence Length      →   Attention Comparisons
1,000 tokens         →   ~1 million
10,000 tokens        →   ~100 million
100,000 tokens       →   ~10,000 million      ← this is where GPUs start to cry
1,000,000 tokens     →   ~1,000,000 million   ← this is where GPUs give up entirely
```

This is the wall. It is why GPT-4 has a context limit. It is why a Transformer cannot — practically speaking — read an entire human genome. It is why, despite the extraordinary capabilities of today's frontier AI, you cannot simply hand it a 500-page book and ask it to answer detailed questions about chapter 12 cross-referenced with an appendix. The computation required would be staggering.

Engineering teams have found clever workarounds — FlashAttention reorganises the computation to use fast GPU memory more efficiently, reducing constant factors — but these are optimisations, not solutions. The quadratic relationship remains.

### The KV Cache: A Different Problem at Inference

When a model like GPT-4 *generates* text — writing a reply to your prompt one word at a time — it faces a different flavour of the same problem. To generate word number 1,000 in its response, it must compute attention over all 1,000 previous words. To generate word 1,001, it does it again.

To avoid repeating this work, Transformers maintain a **KV cache**: a record of the Keys and Values computed for every previous token in the conversation.[^5] This cache is reused rather than recomputed at each step. The clever part: at each step, only the new token's Query, Key, and Value are computed fresh; the Keys and Values of every earlier token are simply read back from the cache. The less clever part: the KV cache grows with every word generated. For a lengthy conversation or a very long document, that cache can balloon to tens of gigabytes — consuming memory that otherwise would have gone toward the model itself, or simply exceeding what the hardware can hold.

(A note for careful readers: the opening of this article described the model "re-reading everything" — and the KV cache is precisely the optimisation that avoids literal re-computation. The cost is no longer the rereading itself; it's the *holding*. The cache stores an exact record of everything seen before, and the cost of maintaining and querying that ever-growing record is what accumulates as context length increases.)

The KV cache is the reason running very long conversations with a frontier model requires expensive enterprise hardware, not the phone in your pocket.

---

When researchers talk about why Transformers are so good at picking up patterns from a handful of examples — a capability called **in-context learning** — they are usually pointing, whether they know it or not, at a specific circuit buried inside every large Transformer. Anthropic's researchers gave this circuit a name: the **induction head**.[^32] An induction head is not a single neuron or layer.
It is a two-step attention pattern that forms reliably, across many different models and training runs, whenever a Transformer grows past a certain size. Here is what it does in plain terms: if the model has seen the pattern [A][B] earlier in the sequence, and it now encounters [A] again, the induction head learns to predict that [B] will follow. It is, essentially, a pattern-completion lookup — the minimal unit of "I've seen this before."

This sounds simple. The implications are profound. Induction heads are the mechanism by which Transformers can, without any fine-tuning, take two or three examples of a new task and immediately begin doing it correctly. You give the model three examples of translating English to French; the induction head finds the [English word → French word] pattern and completes it. You give it three examples of a code debugging pattern; it completes the next one. This is why Transformer-based systems are such powerful few-shot learners.

SSMs, by design, cannot easily form the same circuit. Their compressed state vector blurs the precise [A]→[B] association across a long sequence. In one controlled experiment on associative recall — a task where the model must retrieve a specific paired value from context, and one that induction heads perform perfectly — the attention-based model scored **100%** while the S4D SSM variant scored **20.1%**. Not a small gap; not a fine-tuning gap; a near-total failure on a task that represents the mechanical core of in-context learning. This is precisely what the Zoology paper's 82% figure captures: the quality gap between SSMs and Transformers is, in large part, the gap between having induction heads and not having them.

> [!NOTE] Why hybrids recover this capability
> Hybrid models — Jamba, Griffin, the NVIDIA study architecture — recover in-context learning capability by including a small fraction of full attention layers. Those layers are sufficient to form induction head circuits. The roughly 90% of non-attention layers continue to handle broad context efficiently; the 7–12% of attention layers carry the induction head circuits that enable precise retrieval and few-shot learning. This is why the performance gap between hybrids and pure SSMs is so much smaller than between pure Transformers and pure SSMs.

### Diagram 17: The Handshake Problem — O(N²) Connections

```
WHY TRANSFORMER ATTENTION SCALES QUADRATICALLY

Consider N people at a party who all need to shake hands with each other.
Each pair shakes hands exactly once. How many handshakes as N grows?
──────────────────────────────────────────────────────────────── N = 2 people: 1 handshake A───B N = 3 people: 3 handshakes A / \ B───C N = 4 people: 6 handshakes A───B │╲ /│ │ ╳ │ │/ ╲│ C───D N = 8 people: 28 handshakes (each of 8 people shakes 7 hands ÷ 2 = 28) Connections: ●─●─●─●─●─●─●─● (each dot connects to all others) ╲╱╲╱╲╱╲╱╲╱╲╱╲╱ (web of connections, hard to draw but every pair links) GENERAL FORMULA: N × (N - 1) / 2 ≈ N² / 2 ──────────────────────────────────────────────────────────────── N (tokens) Pairs GPU ops (attention) ────────────────────────────────────────────── 128 8,128 ~8K 512 131,072 ~131K 1,024 523,776 ~524K 4,000 7,998,000 ~8.0M 10,000 49,995,000 ~50M 16,384 134,209,536 ~134M ← Path-X sequence length 100,000 4,999,950,000 ~5B ← impractical on any GPU today 1,000,000 499,999,500,000 ~500B ← impossible ──────────────────────────────────────────────────────────────── ● ← impossible territory / ● / TRANSFORMER (O(N²)) ● each token attends to all tokens / ● / ──────────●───────────────────────────●────────────── ← SSM (O(N)) / (flat — same work per token) ● / ● ← 128 tokens = easy for both 128 512 1K 4K 16K 100K 1M ↑ GPT-4 context starts here (~32K) KEY INSIGHT: Each new token in a Transformer must "handshake" with EVERY prior token. That's the attention matrix. Add one more token to a 16K sequence → 16,384 more operations. Add one more token to an SSM → exactly 1 more state update. Same cost every time. ``` --- ### Diagram 12: Token Generation with KV Cache Growth ``` AUTOREGRESSIVE GENERATION — producing "The cat sat on the mat" Prompt: "The cat" → generate one token at a time ──────────────────────────────────────────────────────────────── STEP 1: Process "The cat" (prefill) Input: [The] [cat] ↓ full attention matrix computed KV Cache: ┌────┬────┐ │K₁ │K₂ │ ← Keys stored (2 slots used) │V₁ │V₂ │ ← Values stored (2 slots used) └────┴────┘ Output: predict → "sat" ✓ ──────────────────────────────────────────────────────────────── STEP 2: Append "sat", generate next token Input: [The] [cat] [sat] But we DON'T recompute K₁,V₁,K₂,V₂ — they're cached! We only compute K₃, V₃ for "sat" then add to cache: KV Cache: ┌────┬────┬────┐ │K₁ │K₂ │K₃ │ ← 3 slots used (grows!) │V₁ │V₂ │V₃ │ └────┴────┴────┘ Output: predict → "on" ✓ ──────────────────────────────────────────────────────────────── STEP 3 → N: Cache keeps growing... KV Cache after 100 tokens: ┌─────────────────────────────┐ │ K₁ K₂ K₃ ... K₉₈ K₉₉ K₁₀₀│ │ V₁ V₂ V₃ ... V₉₈ V₉₉ V₁₀₀│ └─────────────────────────────┘ Size ∝ N × layers × heads × d_head KV Cache after 100K tokens: ┌─────────────────────────────────────────┐ (GPT-4 context window) │ K₁ ... ░░░░░░░░░░░░░░░░░░░░░░░░░░ K₁₀₀ₖ│ │ V₁ ... ░░░░░░░░░░░░░░░░░░░░░░░░░░ V₁₀₀ₖ│ └─────────────────────────────────────────┘ ~80-160 GB for a 70B model ← needs A100s ──────────────────────────────────────────────────────────────── COMPARE: SSM at inference — NO cache, fixed state State after 100 tokens: [h₁ h₂ h₃ ... h_d] ← always same fixed size State after 100K tokens: [h₁ h₂ h₃ ... h_d] ← STILL same fixed size! 
~megabytes ← works on a laptop ``` --- ### Mermaid 1: Transformer — Training vs Inference Flow ```mermaid flowchart TD subgraph TRAINING ["🏋️ TRAINING — Parallel, sees whole sequence"] T1["Full sequence\n'The cat sat on the mat'"] T2["Embed all tokens at once\n[The] [cat] [sat] [on] [the] [mat]"] T3["Causal mask applied\n(each token can only see past tokens)"] T4["ALL attention scores computed\nin one giant parallel matrix op\nO(N²) memory + compute"] T5["Compare predicted vs actual\nCompute loss → backprop"] T1 --> T2 --> T3 --> T4 --> T5 end subgraph INFERENCE ["⚡ INFERENCE — Sequential, one token at a time"] I1["Prompt tokens → KV cache prefill\n(one parallel pass)"] I2["Generate token N+1\n(attend to cached K,V)"] I3["Append new token\nAdd K,V to cache"] I4{Done?} I5["Return full response"] I1 --> I2 --> I3 --> I4 I4 -- "No" --> I2 I4 -- "Yes" --> I5 end TRAINING -.->|"Trained weights\nfrozen"| INFERENCE style T4 fill:#ff9999,stroke:#cc0000 style I3 fill:#ffcc88,stroke:#cc8800 ``` --- # Part IV: The State Space Model ### The Compression State In 1960, an electrical engineer named Rudolf Kalman published a paper that had nothing to do with language models, chatbots, or artificial intelligence.[^6] He was thinking about rockets. Specifically, he was asking: if you're tracking the position and velocity of a spacecraft using noisy sensor readings, what is the mathematically optimal way to maintain your best estimate of where the spacecraft *is* right now? His answer was the **Kalman filter**: maintain a small **state vector** (think of it as the model's rolling working notes — not "state" as in a place, but as in "what it currently knows about everything it has seen so far") — just position and velocity for the rocket — and update it continuously as each new measurement arrives. You don't store the raw history of every sensor reading. You don't replay the spacecraft's entire journey each time you need to know its current position. You keep a compressed, running summary, and you update it forward. This is the idea at the heart of every State Space Model — transplanted, sixty years later, from rockets into language. Imagine you've been assigned to take notes at a very long meeting. The meeting stretches for hours; your notepad has room for one page. You cannot write down every word spoken. So you learn, over time, which things genuinely need to be in your summary: decisions, numbers, names, reversals of earlier positions. You let the small talk fade. When someone asks you "what did they decide about the marketing budget?" you don't need the transcript — you need your notes. And your notes, if you are good at your job, have what matters.[^7] That notepad is the **state vector**. The rules you have learned for what deserves to be written down — and in how much detail — are the model's parameters. And the cost of answering a question from your notes is the same whether the meeting lasted one hour or ten: you just consult the page. ### The Family Tree: From Physics to Language The journey from Kalman's rockets to Mamba's language models is a story of mathematical rediscovery. For most of the 2010s, AI researchers using recurrent networks (RNNs and LSTMs) were, unknowingly, doing a cruder version of Kalman's idea — maintaining state, updating it incrementally, generating outputs. The fatal flaw was that the update rules they learned through gradient descent were unstable over long sequences: the signal from distant past events would either explode or vanish before it could influence the present. 
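To make the running-summary idea concrete before continuing, here is a minimal NumPy sketch of the recurrence that both the Kalman filter and the SSMs below share — update a fixed-size state with each new input, read the answer back out from the state alone. The matrices here are arbitrary placeholders chosen only for illustration; a real S4 model derives its A matrix from the HiPPO theory introduced next.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state = 16                        # size of the "one-page notepad" — fixed, forever

# Placeholder update rules, for illustration only (a real SSM learns/derives these).
A = 0.95 * np.eye(d_state)          # how the old summary decays at each step
B = rng.normal(size=(d_state, 1))   # how a new measurement is written into the summary
C = rng.normal(size=(1, d_state))   # how an answer is read back out

h = np.zeros((d_state, 1))          # the running summary starts empty
stream = rng.normal(size=100_000)   # a long stream of inputs (tokens, sensor readings, ...)

for x_t in stream:
    h = A @ h + B * x_t             # update the summary: decay the old, absorb the new
    y_t = (C @ h).item()            # the current output, computed from the summary alone

print(h.shape)   # (16, 1) — after 100,000 inputs, the "memory" is still just 16 numbers
```

Nothing about the cost of a step depends on how much has already flowed through: step 100,000 is exactly as cheap as step one. The hard question — the one HiPPO and S4 set out to answer — is how to choose those update rules so the fixed-size state retains the *right* information.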
In 2020, a group of Stanford researchers led by **Albert Gu** asked a more fundamental question: not "how do we train a better RNN?" but "what does the *mathematically optimal* state update look like?" The result was **HiPPO** — High-Order Polynomial Projection Operators — a framework for compressing the history of a time series into a fixed-size state vector in a provably optimal way.[^8] The trick: represent history at multiple timescales simultaneously — recent events remembered in sharp detail, distant events in coarser outline, like a zoom lens that automatically adjusts focus at every scale at once. Strange-sounding, but the practical meaning is concrete: the HiPPO state doesn't just remember the last few tokens; it maintains a graduated sketch of everything, from the conversation five minutes ago to the word that just arrived. HiPPO was theory. **S4** (2021) was the engineering breakthrough.[^9] The Structured State Space for Sequences paper by Gu et al. turned the HiPPO mathematics into something a GPU could actually train efficiently — by exploiting a special structure in the state transition matrix that made the otherwise expensive computation tractable. S4's headline result: it was the first model to *solve* the Path-X long-range reasoning task — a sequence of 16,384 tokens — that every previous architecture had completely failed at. Separately, S4 performed autoregressive generation **60× faster** than Transformers on image and language modelling tasks, demonstrating that the efficiency gains were not limited to just one domain. But S4 had an elegant property that went beyond raw performance: **the same model could be run as two completely different computations** depending on context. During training, it could be unrolled as a *convolution* over the full sequence — embarrassingly parallel, GPU-friendly, fast. During inference, it could be run as a recurrence — one step at a time, constant memory, O(1) per token. One model, two modes, seamlessly switchable. This was the beginning of something profound.[^9] **H3** (2022) attacked a specific weakness S4 still had: language. On natural text, where the model needed to recall specific words or compare tokens across a sequence, SSMs still lagged Transformers. H3 added a specialised "shift" layer for explicit token comparison, and came within 0.4 *perplexity* points of a Transformer on OpenWebText — where perplexity is the standard metric for how surprised a language model is by the next word it didn't see coming (lower is better) — a near-miss that showed the gap was closeable.[^10] Then came **Mamba**. ### The Jazz Musician In December 2023, Albert Gu and **Tri Dao** published Mamba, and it introduced a change that sounds simple but changes everything: *input-dependent state transitions*.[^11] Every SSM before Mamba had the same structural limitation: its A, B, and C matrices — the rules for what to absorb into state, how to decay the old state, and how to read an answer back out — were fixed parameters. They were the same for every token in every sequence, regardless of content. The word "the" and the word "Paris" both went through exactly the same machinery. Mamba made the matrices *functions of the input itself*. Each new token now dynamically determines its own absorption rules: how much to "gate open" the state update, how deeply to let the new information seep in, how much of the previous state to decay. The model has learned, through training, to distinguish high-signal tokens from noise. 
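In code, the difference between the older fixed-rule update and Mamba's selective update is small but decisive: the step size Δ and the write and read vectors are now computed *from the token itself*. The sketch below is a schematic, per-token illustration with made-up projection weights — real Mamba applies this per channel, with learned parameters and a hardware-aware parallel scan — but it shows where the selectivity enters.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_state = 8, 16

# Illustrative projection weights; in a trained Mamba these are learned parameters.
W_delta = rng.normal(size=(d_model,)) * 0.1
W_B     = rng.normal(size=(d_state, d_model)) * 0.1
W_C     = rng.normal(size=(d_state, d_model)) * 0.1
A       = -np.ones(d_state)             # fixed decay direction; Δ decides how strongly it applies

def selective_step(h, x):
    """One selective-SSM update: the current token x chooses its own Δ, B and C."""
    delta = np.log1p(np.exp(W_delta @ x))   # softplus → a positive step size Δ for this token
    A_bar = np.exp(delta * A)               # discretised decay: large Δ ⇒ old state fades faster
    h_new = A_bar * h + delta * (W_B @ x)   # write the token into the state, as strongly as Δ allows
    y     = (W_C @ x) @ h_new               # input-dependent read-out of the state
    return h_new, y

h = np.zeros(d_state)
for x in rng.normal(size=(1_000, d_model)):  # a stream of 1,000 token embeddings
    h, y = selective_step(h, x)

print(h.shape)   # (16,) — still a fixed-size state, but every token set its own update rules
```

A filler token produces a small Δ and barely touches the state; a high-signal token produces a large Δ and overwrites more of it — the behaviour Diagram 13 below spells out token by token.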
> [!NOTE] The jazz musician > Think of a jazz ensemble where each musician responds dynamically to what the others just played. A drummer who hits a defining groove prompts the bassist to lock in and build; a transitional fill might get acknowledged and passed over. Mamba's selectivity is the AI equivalent of this: each incoming token decides in real time how much of itself becomes part of the ensemble's shared understanding, and how much to let fade.[^12] The effect on performance was dramatic. Mamba-3B outperformed Transformers of equivalent size on language benchmarks, and matched Transformers *twice its size* — while running at **5× higher inference throughput at 2K context — rising to 15× at 16K context** as the Transformer's quadratic cost compounds.[^11] And because the state is still fixed-size regardless of how many tokens have flowed through, Mamba scales to sequences millions of tokens long without the quadratic cost wall that limits every Transformer. > [!TIP] O(1) inference — what it means in practice > When Mamba generates token number 10,000 in a sequence, the computation required is *identical* to generating token number one. The state is updated; the output is computed; no growing cache is consulted. This is what "O(1) inference" means: constant cost per token, regardless of context length. For a smartphone, a medical device, or any system without data-centre hardware, this property is not merely convenient — it may be the difference between possible and impossible. --- ### Diagram 14: S4 Dual Representation — Recurrence OR Convolution ``` THE SAME S4 MODEL CAN BE COMPUTED TWO EQUIVALENT WAYS: REPRESENTATION A — RECURRENT (used at inference): ───────────────────────────────────────────────── Time: t=1 t=2 t=3 t=4 x₁ x₂ x₃ x₄ │ │ │ │ ▼ ▼ ▼ ▼ h₀→[Ah₀+Bx₁]→h₁→[Ah₁+Bx₂]→h₂→[Ah₂+Bx₃]→h₃→[Ah₃+Bx₄]→h₄ │ │ │ │ ▼ ▼ ▼ ▼ y₁ y₂ y₃ y₄ (Ch₁) (Ch₂) (Ch₃) (Ch₄) ✅ O(1) memory per step — perfect for generating one token at a time ❌ Sequential — cannot be parallelised across the sequence REPRESENTATION B — CONVOLUTIONAL (used during training): ───────────────────────────────────────────────────────── The outputs y₁...yₙ can be written as a single CONVOLUTION: y = K̄ * x where K̄ is the "SSM convolution kernel" K̄ = [CB, CAB, CA²B, CA³B, ..., CA^(N-1)B] ↑ ↑ ↑ ↑ t=1 t=2 t=3 t=4 (how much input from N steps ago matters) Visualised as a filter over time: Kernel K̄: ║█████║████ ║███ ║██ ║█ ║ ║ ║ t=1 t=2 t=3 t=4 t=5 t=6 t=7 (decaying memory) Input x: [x₁] [x₂] [x₃] [x₄] [x₅] [x₆] [x₇] Output y₄ = K̄[1]·x₄ + K̄[2]·x₃ + K̄[3]·x₂ + K̄[4]·x₁ + ... (most recent) (oldest, faintest) ✅ Fully parallelisable with FFT — compute ALL outputs simultaneously! ✅ O(N log N) for training ❌ Must hold full input sequence in memory ┌─────────────────────────────────────────────────────────────────────┐ │ MAGIC: The kernel K̄ encodes the model's entire "memory policy": │ │ how quickly it forgets, what patterns it's sensitive to. 
│ │ S4's clever parameterisation (HiPPO matrix A) makes this kernel │ │ decay gracefully — it doesn't explode or vanish like RNN gradients.│ └─────────────────────────────────────────────────────────────────────┘ ``` --- ### Diagram 13: Mamba's Selective Gating — Step by Step ``` VANILLA S4 (Linear Time-Invariant — fixed parameters, input-independent) ───────────────────────────────────────────────────────────────────────── xₜ ──────────────────────────────────────────────────────────────→ │ │ │ ▼ ▼ ▼ [B: fixed] [B: fixed] [B: fixed] │ │ │ hₜ₋₁ → [×A] → + → hₜ hₜ → [×A] → + → hₜ₊₁ ...continues ↑ ↑ [B·xₜ] [B·xₜ₊₁] A, B, C are the SAME for every input token. "The" and "elephant" are processed identically. The model cannot choose to remember or forget. MAMBA (Selective SSM — B, C, Δ all depend on the current input xₜ) ───────────────────────────────────────────────────────────────────────── xₜ ("elephant") │ ├──► Linear projection → Δₜ (step size: "how much time passes?") │ Large Δ = big state update │ Small Δ = nearly skip this token │ ├──► Linear projection → Bₜ (how strongly to WRITE xₜ into state) │ ├──► Linear projection → Cₜ (how strongly to READ from state) │ └──► [ZSS (Zero-order Hold) discretize A with Δₜ → Āₜ] │ ▼ hₜ = Āₜ · hₜ₋₁ + Bₜ · xₜ ← state update yₜ = Cₜ · hₜ ← read output ┌─────────────────────────────────────────────────────────────────┐ │ TOKEN EXAMPLES: │ │ │ │ "a" → small Δ, small B → barely updates state (skip) │ │ "but" → large Δ, large B → strong update (contrast signal!) │ │ "not" → large Δ, large B → strong update (negation!) │ │ "..." → medium Δ → mild reset between sentences │ │ │ │ The model LEARNS these importance weights from data. │ └─────────────────────────────────────────────────────────────────┘ ``` --- > [!ABSTRACT] The Long Range Arena: A Decisive Benchmark > > The **Long Range Arena (LRA)** is a suite of six tasks stress-testing long-range reasoning. The hardest, **Path-X**, requires tracking a visual path across a 16,384-token sequence. > > | Task | Transformer | S4 (2022) | > |------|-------------|-----------| > | ListOps (nested math, 9K tokens) | 36.4% | **59.6%** | > | Text (character-level, 4K tokens) | 64.3% | **86.8%** | > | Retrieval (document pair, 4K×2) | 57.5% | **90.9%** | > | Image (sequential CIFAR-10, 1K) | 42.4% | **88.7%** | > | Pathfinder (visual reasoning, 1K) | 71.4% | **94.2%** | > | **Path-X** (extreme visual, 16K) | **FAIL** | **96.4%** | > | **Average** | **54.4%** | **86.1%** | > > Every efficient Transformer variant scores ≤55% and every one **completely fails Path-X** (random chance). S4 was the first model in history to solve Path-X. The 32-point gap was achieved in a **single 2021 paper**. ### Diagram 16: LRA Benchmark Comparison — Bar Chart (ASCII) ``` LONG RANGE ARENA — MODEL ACCURACY BY TASK (%) Source: Tay et al. 2021 (LRA paper) + Gu et al. 
2022 (S4, ICLR 2022) Task Transformer S4 (2022) S5 (2023) 0% 50% 100% ├─────┼─────┤ ListOps │████ 36.4% │ █████████████████████ 59.6% ████████████████████████ 62.2% │ │ Text │████████████ 64.3% █████████████████████████████████ 86.8% ██████████████████████████████████ 89.3% │ │ Retrieval │███████████ 57.5% ██████████████████████████████████ 90.9% ██████████████████████████████████ 91.4% │ │ Image │████████ 42.4% █████████████████████████████████ 88.7% █████████████████████████████████ 88.0% │ │ Pathfinder │██████████████ 71.4% ████████████████████████████████████ 94.2% █████████████████████████████████████ 95.3% │ │ Path-X │██████████ FAIL (≈50%) ████████████████████████████████████ 96.4% █████████████████████████████████████ 98.5% ├─────┼─────┤ 0% 50% 100% AVERAGE: Transformer: 54.4% S4: 86.1% S5: 87.4% ▲ All efficient Transformers (Performer, BigBird, etc.) cluster ≤55% ▲ S4 is the first model to solve Path-X; Transformer = random guessing Legend: ███ = Transformer score █ = S4 score (always higher) ``` > [!NOTE] > The jump from the best Transformer-family model (~55%) to S4 (~86%) happened in a **single paper** (2021). This is not a gradual climb — it is a discontinuity. S4 is the first model to solve Path-X (16,384 tokens); every prior model scores at chance level (~50%). --- # Part V: Five Lenses of Difference These two architectures — the Transformer and the SSM — have now been running in parallel for several years, each accumulating evidence about where it excels and where it falters. The clearest way to understand the difference is through five concrete lenses. ### Lens 1 — Memory: The Photograph vs. the Impression A Transformer, while processing a sequence, maintains exact records of every token it has ever seen — its KV cache is a perfect index, a digital photograph of the entire conversation. When it needs to recall what was said on page one of a 50-page document, it simply looks it up. Nothing has been compressed, distorted, or lost. An SSM maintains a fixed-size state — an impressionist painting of the text so far, capturing the mood and essence while necessarily blurring fine detail. The painting cannot grow larger no matter how long the sequence becomes. It carries a compressed summary of everything, at a resolution set by the state's size. This difference determines what each architecture does well. Transformers excel at tasks requiring exact retrieval: "What was the name John gave Mary?" "What did the third bullet point say?" "Which variable was defined on line 42 and called on line 847?" SSMs excel at tasks where the *gist* matters more than the *details*: summarisation, streaming audio processing, pattern recognition in continuous signals, long-document understanding where the broad argument matters more than any single sentence.[^13] When does the photographic memory fail you? When the photograph is too large to carry. A Transformer's context window is a hard limit — beyond it, the model is simply blind. An SSM has, in principle, no such limit. The painting may get impressionistic, but it never disappears. ### Lens 2 — Speed: The Quadratic Wall The numbers here are important enough to state plainly. Doubling the length of a Transformer's input quadruples the cost of the attention computation. Going from GPT-2's original 1,024-token context to GPT-4's 128,000-token context is not 125× harder — it is closer to **15,000× harder** for the raw attention calculation.[^5] SSMs scale linearly. Doubling the input doubles the work. Nothing more. 
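A back-of-the-envelope calculation makes the gap vivid. The sketch below counts only the pairwise comparisons attention performs versus the one-per-token updates an SSM performs, ignoring constant factors and everything else a real model does.

```python
def attention_pairs(n: int) -> int:
    """Rough count of attention comparisons: every token against every token."""
    return n * n

def ssm_updates(n: int) -> int:
    """Rough count of SSM work: one fixed-cost state update per token."""
    return n

# GPT-2's 1,024-token context vs a 128,000-token context: the "~15,000× harder" figure
print(attention_pairs(128_000) // attention_pairs(1_024))   # 15625

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,} tokens → attention ≈ {attention_pairs(n):.0e} comparisons, "
          f"SSM ≈ {ssm_updates(n):.0e} updates")
```

At a thousand tokens the two columns barely differ in practice; at a million, the attention column has grown a million-fold while the SSM column has grown only a thousand-fold — the same divergence the next paragraphs translate into hardware terms.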
In practical terms: at moderate sequence lengths (a few thousand tokens), a well-optimised Transformer and a well-optimised SSM perform comparably on modern hardware. As sequences grow longer — tens of thousands, hundreds of thousands of tokens — the SSM's advantage compounds. At one million tokens, running a Transformer requires roughly two terabytes of attention memory. A Mamba model at the same sequence length needs a state vector measured in megabytes.[^11] The KV cache adds a second dimension to the speed difference at inference time. Each token the Transformer generates requires reading its entire KV cache. An SSM's per-token generation cost is constant. At 100,000 tokens of context, a typical 13B-parameter Transformer's KV cache can exceed 50 GB — pushing the limits of even the most expensive consumer hardware. ### Lens 3 — Recall: The 70M vs. 1.4B Story In December 2023, the same Stanford lab that created the Mamba lineage published a smaller but arguably more illuminating paper: **Zoology**.[^14] The paper asked a deceptively simple question: of all the things Transformers do better than SSMs on language tasks, how much of the gap is attributable to any single capability? The answer was stark: **82% of the quality gap between SSMs and Transformers** on standard language benchmarks could be explained by a single skill — **associative recall**. Associative recall is the ability to look back into context and retrieve a specific fact: "She put vanilla extract in her smoothie. Later, she drank her vanilla ______." To answer, the model must reach back through the intervening text, find the earlier mention, and return its value. A Transformer does this by literally scanning back through its KV cache. An SSM must have compressed that earlier mention into its state — and if the state has been updated many times since, that precise fact may have blurred beyond recovery. The story's most memorable detail: in the Zoology experiments, a **70-million parameter attention model outperformed a 1.4-billion parameter Hyena model** (an SSM) on associative recall tasks.[^14] This is a 20× size advantage thrown away. Worse, when researchers tested S4D — an SSM variant — directly on an isolated associative recall task, it scored **20.1%** (barely above random guessing), while attention scored **100%**.[^14] This is a structural difference, not a scaling problem. No amount of parameter-adding fixes the fundamental limitation of compressed state. Mamba's selectivity helps — it can decide to encode a specific token sharply in state — but a fixed-size state can only hold so many precise key-value pairs before interference degrades them all. ### Lens 4 — Inference: The Mobile Device Argument The KV cache problem is more than an engineering inconvenience. It represents a fundamental constraint on where AI can live. A smartphone typically has 6–12 GB of RAM. A long conversation with a Transformer-based AI assistant — say, a 100,000-token exchange — can require a KV cache of 50–100 GB for a mid-sized model. The maths is irreconcilable: you cannot run a capable Transformer with long-context memory on a phone. An SSM needs only its state vector at inference time, regardless of how many tokens have flowed through it. For Mamba, that state is measured in megabytes, not gigabytes. It does not grow. It does not need to be stored between generation steps. The cost of generating the next token is the same on step one as on step one million. This is the argument for on-device AI. 
Not just chatbots, but continuous AI assistants that can maintain context across days of interaction, medical monitoring systems that process continuous sensor streams, code editors that hold entire large codebases in active context — all of these become feasible, on consumer hardware, with O(1) inference.[^15] ### Lens 5 — Training: The Clever Trick They Both Share There is a common misconception that SSMs are fast at inference but slow to train, because they process sequences step by step. This was true of classic RNNs. It is not true of S4 and Mamba. The key insight from the S4 paper is that the recurrent update equation — state-at-step-n equals A times state-at-step-n-minus-one plus B times input — can be mathematically rewritten as a *convolution* over the full input sequence.[^9] A convolution is a batch computation; it processes the entire sequence at once in a single matrix operation. It is embarrassingly parallel, perfectly suited for GPUs, and takes the same time whether the sequence has 100 elements or 100,000. Both architectures train in parallel. The difference is only visible at inference: Transformers must maintain and query the growing KV cache, while SSMs can switch back to recurrent mode and process one token at a time with fixed memory. ### Summary Table | Dimension | Transformer | SSM (Mamba) | |---|---|---| | **Memory type** | Exact KV cache (photographic) | Fixed state vector (impressionist) | | **Memory growth** | O(n) — grows with context | O(1) — constant regardless | | **Attention cost** | O(n²) per step | O(n) total; O(1) per step | | **Training mode** | Parallel (all tokens at once) | Parallel (convolutional trick) | | **Inference mode** | Read full KV cache each step | Update fixed state each step | | **Best at** | Exact recall, reasoning, ICL | Long sequences, streaming, signals | | **Weak at** | Long contexts (cost); mobile (memory) | Associative recall; precise lookup | | **Representative models** | GPT-4, Claude, Gemini | Mamba, Falcon Mamba, RWKV | | **Context limit** | Hard limit (hardware-bound) | Theoretically unlimited | --- > [!ABSTRACT] The Core Trade-Off > > | | **Transformer** | **State Space Model** | > |---|---|---| > | **Memory** | Perfect, photographic — every token preserved verbatim | Compressed, impressionist — pattern and gist retained | > | **Recall** | Exact lookup anywhere in context | Statistical inference from compressed state | > | **Inference cost** | Grows with context length (O(n) per token) | Constant regardless of context (O(1) per token) | > | **Training cost** | Quadratic in sequence length | Linear in sequence length | > | **Ideal for** | Precise retrieval, in-context learning, reasoning | Long streams, real-time, on-device, genomics | > | **Weak at** | Very long contexts, mobile deployment | Exact recall, few-shot pattern matching | > > *Neither is universally better. The choice — or blend — depends on the task.* ### Diagram 18: The Induction Head Circuit ```mermaid flowchart TD subgraph INPUT ["📝 Input Sequence (simplified)"] T1["Token: 'A'"] T2["Token: 'B'"] T3["Token: '...'"] T4["Token: 'A' ← repeated"] T5["Token: ??? 
← predict this"] end subgraph LAYER1 ["Layer 1: Previous-Token Head"] PH["Previous-Token Head\n\nLearns to attend:\neach token → token BEFORE it\n\n'A' remembers 'START'\n'B' remembers 'A'\n'A²' remembers '...'"] end subgraph LAYER2 ["Layer 2: Induction Head"] IH["Induction Head\n\nPattern: when I see token X...\ncheck what came AFTER X last time\n\nSees: current token = 'A'\nAsks: 'when did I last see A?'\nFinds: token 2 (which remembers B)\nCopies: B into current output"] end subgraph OUTPUT ["📤 Prediction"] PRED["Predicts: 'B'\n\n✅ Correct! The pattern [A → B]\nwas learned from earlier in context\n\nThis is IN-CONTEXT LEARNING:\nno gradient update needed —\njust attention pattern recognition"] end T4 --> LAYER1 T1 --> LAYER1 LAYER1 -->|"'A₁ was followed by B'\nstored via Q/K/V"| LAYER2 T4 --> LAYER2 LAYER2 --> PRED style LAYER1 fill:#fff3e0,stroke:#e65100 style LAYER2 fill:#e8f5e9,stroke:#2e7d32 style PRED fill:#e3f2fd,stroke:#1565c0 style INPUT fill:#fce4ec,stroke:#880e4f ``` **What the circuit does, step by step:** ``` SEQUENCE: [A] [B] [C] [D] ... [A] [?] 1 2 3 4 N N+1 LAYER 1 — Previous Token Head: Token at position N (second "A") attends to position N-1 This head's job: copy "what came before me" into my representation Result: position N now "knows" that the token before it was something LAYER 2 — Induction Head: Looks at the token at position N Asks: "Where else in this sequence did I see this same token?" Finds: position 1 (the first "A") Then looks at what position 1's PREVIOUS-TOKEN HEAD stored That's position 2 (which is "B") Copies "B" into the prediction for position N+1 OUTPUT: Predicts "B" with high confidence ✅ WHY THIS IS REMARKABLE: This two-layer circuit implements a simple but powerful rule: "repeat the pattern you saw earlier in this very context" No fine-tuning. No gradient update. Pure in-context learning. ``` > [!NOTE] > Induction heads were discovered empirically by Olsson et al. (2022) in "In-context Learning and Induction Heads." They appear in virtually all Transformer language models. SSMs, by contrast, cannot straightforwardly implement this circuit because they do not have content-based random access — they can only read out their compressed state. --- # Part VI: Real-World Products and Applications ### Transformers in the Wild The AI you interact with every day is, almost certainly, a Transformer. GPT-4, the engine behind ChatGPT, is a decoder-only Transformer trained on hundreds of billions of tokens of text, scaled to perhaps a trillion parameters — the architecture of 2017, evolved through seven years of engineering refinement.[^16] Claude, built by Anthropic, shares the same fundamental design, trained with a distinctive "constitutional" approach to safety. Gemini, Google's flagship, is natively multimodal from the ground up, able to process images, audio, and text in the same sequence — and its 1.5 Pro variant extended the Transformer context window to one million tokens through a mix of engineering cleverness and sheer compute budget. These systems are Transformers for the same reason that most large commercial aircraft are variations on the same basic wing-and-jet-engine design: the architecture has been proven at scale, the tooling for training and deploying it is mature, and the ecosystem of researchers, engineers, and hardware vendors has optimised for it continuously for years. The quadratic cost is real, but it is a manageable constraint at the sequence lengths most commercial applications require. 
### SSMs in the Wild The SSM story is younger and, for now, smaller — but it is accelerating. **Falcon Mamba 7B**, released by the Technology Innovation Institute in 2024, was a landmark: the first competitive pure-SSM language model at the 7-billion parameter scale, scoring **64.1** on the HuggingFace Open LLM Leaderboard — edging past LLaMA-3-8B (62.6) despite having fewer parameters.[^17] It demonstrated that the quality gap — once a serious concern — had become closeable. **RWKV** (pronounced "rock-v") occupies a distinctive niche: an architecture that trains like a Transformer but runs at inference like an RNN.[^18] Its sixth-generation Finch architecture added input-dependent decay weights, bringing it closer to Mamba's selectivity while retaining an unusual property: it runs efficiently on CPUs, without any GPU at all. In a world where most people do not have a $3,000 graphics card, this matters. ### The Genomics Revolution Perhaps the most dramatic demonstration of what SSMs make possible is in biology, and it centres on a single number: **one million**. HyenaDNA, a genomic model built on Hyena (an SSM-family architecture), processes DNA sequences at single-nucleotide resolution with a context window of up to one million nucleotides.[^19] For comparison, Transformer-based genomic models of the same era maxed out at 4,096 tokens — covering less than 0.001% of the human genome. HyenaDNA doesn't just cover more; it trains 160× faster than a Transformer at equivalent sequence lengths and achieves state-of-the-art results on 12 of 18 Nucleotide Transformer benchmarks. Reading DNA, it turns out, is exactly the kind of task SSMs were built for. There are no "associative recall" moments in genomic analysis — no need to look back at base pair number 847,311 and precisely retrieve a value mentioned earlier. What matters is patterns: motifs, repeating structures, long-range regulatory signals that manifest as statistical tendencies across millions of base pairs. The SSM's compressed-impression approach is not a compromise here; it is exactly the right tool. ### The MambaByte Surprise Perhaps the most unexpected SSM result came not from genomics but from the most ordinary domain imaginable: raw text. **MambaByte** applied the Mamba architecture not to word-level tokens but to individual bytes — the raw character values, before any tokenisation.[^20] This sounds like a recipe for disaster: sequences processed at byte-level are far longer (the word "unbelievable" is 12 bytes versus perhaps 3 tokens), and longer sequences should hurt SSMs just as they hurt Transformers. Instead, MambaByte matched or exceeded byte-level Transformer baselines on language modelling, while running faster and using less memory. The finding was striking: Mamba's O(1) inference cost per step meant that the absolute sequence length mattered far less than the architecture's quadratic-vs-linear scaling behaviour. A Transformer processing 12 bytes pays twelve times the per-token cost plus quadratic attention; Mamba pays twelve constant-cost steps. The byte-level SSM effectively turned a weakness into a wash. --- The SSM revolution has spread beyond language into computer vision. The standard vision model, the Vision Transformer (ViT), patches an image into small squares and applies self-attention — but attention over image patches scales quadratically in the number of patches, making high-resolution images expensive. 
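A quick calculation shows why. Assuming the common 16×16-pixel patch size (an assumption made here for illustration — patch sizes vary by model), the number of patches, and therefore the number of attention pairs, grows rapidly with resolution:

```python
def vit_attention_pairs(image_size: int, patch_size: int = 16) -> tuple[int, int]:
    """How many patches a ViT-style model attends over, and the pairwise comparisons that implies."""
    n_patches = (image_size // patch_size) ** 2
    return n_patches, n_patches ** 2

for size in (224, 1248):
    n, pairs = vit_attention_pairs(size)
    print(f"{size}×{size} image → {n:,} patches → {pairs:,} attention pairs")

# 224×224   →   196 patches →     38,416 pairs
# 1248×1248 → 6,084 patches → 37,015,056 pairs  (≈960× the work for ≈31× the pixels)
```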
**Vision Mamba** (Vim, ICML 2024) applies bidirectional selective state spaces to image sequences, scanning patches in both forward and backward directions to capture spatial context. Benchmarks against DeiT at 1,248-pixel resolution: **2.8× faster** and **86.8% memory savings**, with competitive ImageNet accuracy. **VMamba** (NeurIPS 2024 Spotlight) extends this with four-directional 2D scanning — forward, backward, horizontal, vertical — to better capture the non-sequential structure of 2D images. These architectures signal that SSMs' efficiency gains are not limited to text. --- # Part VII: Hybrid Models — The Best of Both ### The Problem Neither Architecture Solved After several years of competition, a quiet consensus had emerged: neither architecture was winning. Transformers were demonstrably superior at precise recall and in-context reasoning; SSMs were demonstrably superior at long-sequence efficiency and streaming inference. But the tasks that mattered most — large-scale language modelling, generalised AI assistants, complex reasoning — seemed to require both. The hybrid architecture movement was the practical response to this stalemate. The core insight, crystallised across multiple research groups by 2024, was simple and slightly surprising: **you don't need attention everywhere**. The precision that attention provides — the ability to reach back into context and retrieve a specific key-value pair with exact fidelity — is genuinely necessary, but not for every token, and not at every layer of a deep network. Most tokens can be handled by efficient SSM layers that track the broad flow of meaning. A small number of attention layers, scattered through the network, provide the precise recall anchors that prevent quality degradation. **Jamba**, from AI21 Labs, was the first large-scale public demonstration of this principle. Its architecture interleaves Transformer blocks and Mamba blocks in a ratio of approximately 1:7 — one attention layer for every seven SSM layers.[^21] The result: a 52-billion parameter model (a Mixture-of-Experts design in which only 12 billion parameters are active on any given token) that fits on a single 80GB GPU that would otherwise require two, with a 256,000-token context window and performance competitive with comparably-sized pure Transformers. The "sweet spot" of roughly 7–12% attention layers — enough for precision, not so much as to reintroduce the quadratic cost at scale — has now been independently confirmed by multiple research groups. **Griffin**, from Google DeepMind, took a similar but distinct approach: rather than full attention at scattered layers, it uses *local attention* — attending only to the most recent few hundred tokens at a time — blended with a recurrent SSM backbone.[^22] The result matched LLaMA-2 quality despite training on six times fewer tokens. The message from the hybrid research is clear: the battle between architectures was always somewhat artificial. SSMs and Transformers are not opposites that one must choose between — they are complementary mechanisms, each covering the other's weakness. The question was never which would win. The question was always how to blend them well. And 2024 would provide the mathematical explanation for *why* — revealing that SSMs and attention are, at a deep level, the same operation computed two different ways. 
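Concretely, the interleaving described above can be pictured as nothing more than a layer list. The sketch below is a toy illustration of a Jamba-style layout — the real model also inserts Mixture-of-Experts layers and uses its own specific block ordering — but it shows how small the attention share is:

```python
def hybrid_stack(n_blocks: int = 32, attention_every: int = 8) -> list[str]:
    """Toy Jamba-style layout: one attention block for every `attention_every` blocks,
    the rest Mamba (SSM) blocks. Illustrative only — not the real Jamba block order."""
    return ["attention" if i % attention_every == 0 else "mamba" for i in range(n_blocks)]

layers = hybrid_stack()
print(layers[:8])   # ['attention', 'mamba', 'mamba', 'mamba', 'mamba', 'mamba', 'mamba', 'mamba']
print(f"{layers.count('attention')} of {len(layers)} blocks use full attention — 1 in 8, ~12%")
```

Counting the feed-forward layers that sit alongside these blocks pushes the attention share down further, toward the ~7% figure the NVIDIA study below identifies as the sweet spot.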
---

The most controlled evidence for the hybrid architecture's advantage comes not from a product announcement but from a systematic ablation study published by NVIDIA researchers in 2024. Their experiment was deliberately rigorous: train a suite of 8-billion parameter models, holding everything equal — data, compute, training duration — while varying only the *proportion* of attention layers versus Mamba-2 SSM layers versus standard MLP feed-forward layers. The results were unambiguous.[^24] The optimal mixture was:

| Layer type | Proportion | Role |
|---|---|---|
| Mamba-2 (SSM) | 43% | Efficient context flow, long-range patterns |
| MLP (feed-forward) | 50% | Feature transformation, reasoning |
| Full attention | 7% | Precise recall, induction heads, in-context learning |

This 7% attention configuration outperformed a pure Transformer baseline by **2.65 points** on standard language benchmarks — while delivering **8× faster inference**, because the SSM layers' O(1) per-token cost replaces the Transformer's O(n) KV cache reads. The finding has since been independently confirmed by multiple groups at different parameter scales. The "7% attention sweet spot" is not a quirk of one experiment; it appears to be a genuine property of how much precise recall capacity a language model needs.

The intuition: most tokens in a sequence are processing context that benefits from the SSM's efficient summarisation. Only a small fraction — moments of precise recall, pattern matching, or in-context few-shot reasoning — require the full weight of attention. Providing exactly that amount, and no more, turns out to be optimal.

> [!TIP] Why not 0% attention?
> The NVIDIA ablation also tested pure-SSM models (0% attention). These scored noticeably lower, confirming that *some* attention is necessary — not just beneficial. The 82% recall gap from the Zoology paper explains why: without attention layers, the model has no mechanism for precise key-value retrieval. Even at 7%, the attention layers carry the induction head circuits that make in-context learning possible. Remove them entirely and the model loses a capability that pure scale cannot replace.

### Mermaid 3: Hybrid Architecture Decision Tree

```mermaid
flowchart TD
    START["What does your task require?"]
    START --> Q1{"Exact recall of\nspecific facts/tokens\nfrom long context?"}
    Q1 -- "Yes (e.g. legal doc Q&A)" --> Q2{"Budget for\nGPU memory?"}
    Q2 -- "Yes, large GPU" --> TRANSFORMER["✅ Use Transformer\n(GPT/Claude/Gemini)\nPerfect recall\nHigh memory cost"]
    Q2 -- "No, constrained" --> Q3{"Occasional precision\nacceptable?"}
    Q3 -- "Yes" --> HYBRID["✅ Use Hybrid\n(Jamba / Griffin)\n~Transformer quality\nwith linear memory"]
    Q3 -- "No" --> RETRIEVAL["✅ Use Transformer\n+ RAG retrieval\n(offload memory to DB)"]
    Q1 -- "No (e.g. audio, video,\nstreaming, forecasting)" --> Q4{"Training data\nsize?"}
    Q4 -- "Large (>1B tokens)" --> Q5{"Need conversation\nor instruction following?"}
    Q5 -- "Yes" --> HYBRID
    Q5 -- "No (e.g. pure time series)" --> SSM["✅ Use Pure SSM\n(Mamba / S4)\nFastest inference\nLowest memory"]
    Q4 -- "Small / real-time" --> SSM
    style TRANSFORMER fill:#ffcccc,stroke:#cc0000
    style HYBRID fill:#ffe0cc,stroke:#cc6600
    style SSM fill:#ccffcc,stroke:#006600
    style RETRIEVAL fill:#ccccff,stroke:#000099
```

---

# Part VIII: Where We're Headed

### The Deeper Unity

In May 2024, Albert Gu and Tri Dao published **Mamba-2**, and buried in its mathematics was something profound: SSMs and a certain class of linear attention mechanisms are not different things.
They are two views of the same mathematical object — a **structured semiseparable matrix** — computed in two different ways.[^23] This theoretical result, called **State Space Duality**, means that the distinction between "Transformer" and "SSM" is, at some level of abstraction, an illusion. Both architectures are computing weighted summaries of past context. The Transformer uses explicit, precise attention weights and pays quadratically for the privilege. The SSM uses structured, decaying weights built into its recurrence and pays linearly. These are not competing philosophies; they are different points on a single spectrum, traded against each other as the application demands. Practically, Mamba-2's SSD layer trains 2–8× faster than Mamba-1 by reformulating the computation as matrix multiplications — the operation that GPU hardware has been optimised for over decades. ### The Arms Race, and What It Implies The trajectory of SSM benchmarks over the four years from S4 to Mamba-2 is not gradual improvement; it is something closer to a sprint. Each iteration closed a gap that once seemed fundamental. First, SSMs matched Transformers on long-range reasoning benchmarks. Then they approached Transformer quality on language modelling. Then Falcon Mamba 7B demonstrated competitive performance at a deployable commercial scale. The Zoology finding — that associative recall explains 82% of the remaining gap — identified the remaining obstacle with almost surgical precision. The Based architecture — a hybrid design from the Zoology paper that combines a sliding-window convolution for precise token recall with a linear attention kernel for efficiency — closed 97.4% of that gap while remaining sub-quadratic.[^14] In March 2026, the sprint continued: **Mamba-3** (ICLR 2026, Lahoti et al.) introduced more expressive recurrence, complex-valued state updates that improve retrieval and state-tracking, and a multi-input/multi-output formulation — achieving a further 1.8 percentage-point accuracy gain over the next-best pure-SSM at 1.5B parameters, while using *half* Mamba-2's state size.[^30] ### On-Device AI: The Real Stakes The deepest significance of O(1) inference may not be in data centres but in your pocket. The devices most people use for most of their digital lives — smartphones, tablets, wearables — have memory measured in gigabytes, not terabytes. A KV cache that grows without bound is simply incompatible with this hardware. SSMs' fixed-size state makes continuous, long-running AI assistants on mobile hardware genuinely conceivable: an assistant that maintains meaningful context across days of conversation, a medical monitor that processes continuous sensor data without a cloud connection, a coding assistant that holds an entire large repository in active memory without exceeding device limits. ### The Open Question Whether SSMs will achieve full parity with Transformers on general reasoning tasks — not just language modelling perplexity, but the rich, multimodal, few-shot capabilities that have made frontier models transformative — remains genuinely open. Associative recall is a real limitation, not a marketing footnote. Hybrid architectures may turn out to be the long-term answer: architectures that are SSMs by default and Transformers when they need to be. What is no longer open is whether SSMs matter. They have arrived. The question now is how deep their territory will ultimately run. 
---

One of SSMs' most practically significant advantages — rarely discussed in popular coverage — is **length generalisation**: the ability to process sequences *longer than those seen during training*.

Most modern Transformers encode position with rotary position embeddings (RoPE), which rotate query and key vectors by an angle that grows with position. Beyond the training length, those rotations produce angles the model has never seen, and attention scores become erratic — performance degrades noticeably. This is why GPT-style models can behave unpredictably when pushed far past their trained context length, and why teams developing long-context models invest heavily in length-generalisation techniques like Position Interpolation.

SSMs encode position implicitly through the recurrence structure. The state update — h_t = Āh_{t-1} + B̄x_t — simply adds one more timestep each time, regardless of absolute position. There is no position encoding to go out-of-distribution. Griffin (Google DeepMind, 2024) explicitly demonstrated this: trained on 2K-length sequences, Griffin maintained coherent outputs when prompted with sequences of 4K+ tokens. The Hyena-family architecture demonstrated this even more dramatically in genomics: trained on portions of the human genome, HyenaDNA (a Hyena-based model, closely related to but distinct from Mamba) processes up to one million nucleotides continuously.

#### The 2027 Wager

Sasha Rush — NLP researcher at Cornell — runs **isattentionallyouneed.com**,[^31] a countdown to the end of 2027. His wager: a "Transformer-like" model will still hold the NLP state of the art at that date. Since Mamba-2's State Space Duality showed that SSMs and a class of linear attention mechanisms are, at one level of abstraction, the same computation, the wager has become a Rorschach test: if hybrids count as "Transformer-like," he almost certainly wins regardless of which direction the field moves.

#### Titans (Google, January 2025)

Google's **Titans** architecture reframes the entire debate in new vocabulary: **attention = short-term memory** (precise, limited, expensive) vs **neural memory module = long-term memory** (compressed, effectively unbounded, efficient). Titans achieves 2M+ token context and outperforms both pure Transformers and SSMs on needle-in-a-haystack retrieval[^26] — evidence that the best architecture is neither purely attentive nor purely recurrent, but compositionally memory-aware.

---

## Appendix: For the Technically Curious

*This appendix adds technical depth for readers who want to go further. All core concepts in the main text are self-contained; this section is optional.*

### A. The ABCD Matrices — What They Actually Are

An SSM is defined by four matrices:

| Matrix | Role | Intuition |
|--------|------|-----------|
| **A** | State transition | "How much of the past state do I carry forward?" (the forgetting/retention rule) |
| **B** | Input projection | "How strongly does the current input write into the state?" |
| **C** | Output projection | "How do I read an answer from the state?" |
| **D** | Skip connection | "What part of the input passes through directly, bypassing state?" |

The state update at each timestep is:

```
h_t = A × h_{t-1} + B × x_t    (update state with new input)
y_t = C × h_t + D × x_t        (read output from state)
```

This is structurally identical to the linear dynamical system at the heart of the Kalman filter — Rudolf Kalman's 1960 framework for optimally tracking rockets and satellites — applied here recursively to each token in a sequence.
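As a minimal sketch of that recurrence (NumPy, with small random matrices standing in for trained, already-discretised Ā, B̄, C, D; the sizes and values are placeholders, not from any real model):

```python
import numpy as np

# Minimal (non-selective) SSM scan: h_t = A·h_{t-1} + B·x_t ; y_t = C·h_t + D·x_t
# A, B, C, D are random stand-ins for trained, already-discretised matrices.
rng = np.random.default_rng(0)
d_state, d_in = 4, 2                            # tiny sizes, purely for illustration

A = 0.1 * rng.normal(size=(d_state, d_state))   # state transition (kept small so the toy state stays bounded)
B = rng.normal(size=(d_state, d_in))            # how strongly the input writes into the state
C = rng.normal(size=(d_in, d_state))            # how the output is read from the state
D = rng.normal(size=(d_in, d_in))               # skip connection: input passed through directly

def ssm_scan(xs: np.ndarray) -> np.ndarray:
    """Process a sequence one step at a time with a fixed-size state."""
    h = np.zeros(d_state)                       # the "running summary": constant size
    ys = []
    for x in xs:                                # one constant-cost update per token
        h = A @ h + B @ x                       # update state with the new input
        ys.append(C @ h + D @ x)                # read an output from the state
    return np.stack(ys)

xs = rng.normal(size=(10, d_in))                # a toy sequence of 10 inputs
print(ssm_scan(xs).shape)                       # (10, 2): one output per input, O(1) state memory
```

In this form A, B, C and D are fixed for every token; Mamba's selectivity, described next, makes B, C and the step size Δ depend on the current input.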
**What S4 contributed**: The matrix A, if initialized randomly, tends to cause gradients to either explode or vanish during training (the same problem that plagued vanilla RNNs). HiPPO (2020) showed how to design A analytically using Legendre polynomial mathematics so that the model *provably* remembers history optimally. S4 built on this to make the computation efficient on GPUs. **What Mamba contributed**: Previously, A, B, C were fixed parameters — the same for every input token. Mamba made B and C (and the discretization step Δ) *functions of the current input*, allowing the model to dynamically decide how much to remember or forget for each token. This is "selectivity." ### B. Discretization — From Continuous Math to Discrete Tokens The SSM equations are derived from continuous-time differential equations (the physics world of signals and rockets). But language tokens are discrete — there is no "token 1.5." The step from continuous to discrete is called **discretization**, and it involves choosing a step size Δ. In Mamba, Δ itself is learned per token — a large Δ means "treat this as a big time step, strongly update state"; a small Δ means "nearly skip this token." This is the mechanism behind input-dependent selectivity. ### C. The Structured State Space Duality (Mamba-2) The Mamba-2 paper (Dao & Gu, ICML 2024) proved that SSMs and a class of linear attention mechanisms are both special cases of **semiseparable matrix operations**. Informally: both architectures are computing weighted sums of past context, where: - Transformers use explicit, learned attention weights (expensive to compute) - SSMs use implicit, structurally-decaying weights built into the recurrence (cheap per step) This theoretical unification means that future architectures can be designed at the level of "how much should each past token matter?" without committing to one computational form or the other. --- ## References and Footnotes [^1]: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., & Polosukhin, I. (2017). "Attention Is All You Need." *Advances in Neural Information Processing Systems, 30*. arXiv:1706.03762. [^2]: Hochreiter, S. & Schmidhuber, J. (1997). "Long Short-Term Memory." *Neural Computation, 9*(8), 1735–1780. The standard LSTM reference. RNN background: Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). "Learning representations by back-propagating errors." *Nature, 323*, 533–536. [^3]: Alammar, J. (2018). "The Illustrated Transformer." https://jalammar.github.io/illustrated-transformer/ — The canonical visual explanation of QKV attention, "bank" disambiguation example taken from here. [^4]: Cherry, E.C. (1953). "Some experiments on the recognition of speech, with one and with two ears." *Journal of the Acoustical Society of America, 25*(5), 975–979. Original "cocktail party effect" paper. Application to attention: Bahdanau, D., Cho, K., & Bengio, Y. (2014). "Neural Machine Translation by Jointly Learning to Align and Translate." arXiv:1409.0473. [^5]: Dao, T., Fu, D.Y., Ermon, S., Rudra, A., & Ré, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." *NeurIPS 2022*. arXiv:2205.14135. Quadratic complexity figures and KV cache memory estimates drawn from this paper and the computational complexity research notes. [^6]: Kalman, R.E. (1960). "A New Approach to Linear Filtering and Prediction Problems." *Journal of Basic Engineering, 82*(1), 35–45. The foundational SSM paper from control theory. 
[^7]: Running-notes / secretary analogy: Ayonrinde, K. (2024). "Mamba explained." https://www.kolaayonrinde.com/blog/2024/02/11/mamba.html. Also: analogies-and-intuitions research notes, "Smart Secretary" (★★★★★). [^8]: Gu, A., Dao, T., Ermon, S., Rudra, A., & Ré, C. (2020). "HiPPO: Recurrent Memory with Optimal Polynomial Projections." *NeurIPS 2020*. arXiv:2008.07669. [^9]: Gu, A., Goel, K., & Ré, C. (2022). "Efficiently Modeling Long Sequences with Structured State Spaces." *ICLR 2022 (Outstanding Paper)*. arXiv:2111.00396. S4 paper; source of the dual convolution/recurrence representation, Path-X result, and 60× speedup claim. [^10]: Fu, D.Y., Dao, T., Saab, K.K., Thomas, A.W., Rudra, A., & Ré, C. (2023). "Hungry Hungry Hippos: Towards Language Modeling with State Space Models." *ICLR 2023 (Spotlight)*. arXiv:2212.14052. H3 paper; source of 0.4 perplexity gap figure. [^11]: Gu, A. & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752. Source of 5× throughput figure, Mamba-3B vs Transformer comparisons, and million-length sequence scaling. [^12]: Jazz musician analogy: analogies-and-intuitions research notes, "The Jazz Musician (for Selective State / Mamba)" (★★★★★ rated). Accuracy 5/5. [^13]: Photographic memory vs. impressionist painting analogy: analogies-and-intuitions research notes, "The Impressionist Painting vs. Photograph" (★★★★★). Task comparison table adapted from applications-and-use-cases research notes. [^14]: Arora, S., Eyuboglu, S., Timalsina, A., Johnson, I., Poli, M., Zou, J., Rudra, A., & Ré, C. (2023). "Zoology: Measuring and Improving Recall in Efficient Language Models." arXiv:2312.04927. Source of 82% figure, 70M vs 1.4B comparison, MQAR formalization, and Based architecture results. [^15]: O(1) inference and on-device AI framing: applications-and-use-cases research notes, "On-Device and Edge Deployment" section. KV cache 50GB estimate from computational-complexity research notes. [^16]: GPT-4: OpenAI (2023). "GPT-4 Technical Report." arXiv:2303.08774. Claude: Anthropic (2024). Claude 3 Model Card. Gemini 1.5: Reid, M. et al. (2024). "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context." arXiv:2403.05530. [^17]: Technology Innovation Institute (2024). "Falcon Mamba 7B." https://huggingface.co/tiiuae/falcon-mamba-7b. First competitive pure-SSM at 7B scale. [^18]: Peng, B. et al. (2023). "RWKV: Reinventing RNNs for the Transformer Era." EMNLP 2023 (Findings). arXiv:2305.13048. RWKV-6 (Finch): Peng, B. et al. (2024). arXiv:2404.05892. [^19]: Nguyen, E. et al. (2023). "HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution." *NeurIPS 2023 (Spotlight)*. arXiv:2306.15794. Source of 1M-token context, 160× speedup, and benchmark figures. [^20]: Yan, A. et al. (2024). "MambaByte: Token-free Selective State Space Model." *COLM 2024*. arXiv:2401.13660. [^21]: Lieber, O. et al. (2024). "Jamba: A Hybrid Transformer-Mamba Language Model." arXiv:2403.19887. Source of 1:7 attention:Mamba ratio, 52B/12B parameter figures, 256K context window. [^22]: De, S. et al. (2024). "Griffin: Mixing Gated Linear Recurrences with Local-Global Attention for Language Models." arXiv:2402.19427. (Google DeepMind.) Source of "matches LLaMA-2 on 6× fewer tokens" claim. [^23]: Dao, T. & Gu, A. (2024). "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." *ICML 2024*. arXiv:2405.21060. 
Source of State Space Duality theorem and 2–8× training speedup. [^30]: Lahoti, P., Li, Z., Chen, Z., Wang, Y., Bick, A., Kolter, Z., Dao, T., & Gu, A. (2026). "Mamba-3: Improved Sequence Modeling using State Space Principles." *ICLR 2026*. arXiv:2603.15569. Source of +1.8 pp accuracy claim and half-state-size result. [^24]: Waleffe, R., Byeon, W., Riach, D., Norick, B., Korthikanti, V., Dao, T., Gu, A., Hatamizadeh, A., Singh, S., Narayanan, D., Ren, X., Yang, Y., Zhang, Z., Casper, J., Kautz, J., Shoeybi, M., & Catanzaro, B. (2024). "An Empirical Study of Mamba-based Language Models." arXiv:2406.07887. Source of 43%/7%/50% layer proportion findings, +2.65pt improvement over pure Transformer, and 8× inference speedup. [^25]: Beck, M. et al. (2024). "xLSTM: Extended Long Short-Term Memory." arXiv:2405.04517. [^26]: Behrouz, A. et al. (2025). "Titans: Learning to Memorize at Test Time." arXiv:2501.00663. Google DeepMind. [^27]: Vision Mamba (Vim): Zhu, L. et al. (2024). "Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model." ICML 2024. arXiv:2401.13228. [^28]: VMamba: Liu, Y. et al. (2024). "VMamba: Visual State Space Model." NeurIPS 2024 Spotlight. arXiv:2401.10166. [^29]: Griffin / length generalisation: De, S. et al. (2024). "Griffin: Mixing Gated Linear Recurrences with Local-Global Attention for Language Models." arXiv:2402.19427. See Table 4 for length extrapolation results. [^31]: isattentionallyouneed.com — Sasha Rush's Transformer persistence wager. Accessed 2026-05-04. [^32]: Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., & Clark, J. (2022). "In-context Learning and Induction Heads." *Transformer Circuits Thread*. arXiv:2209.11895. ## Further Reading ### Interactive Visualizations - **Brendan Bycroft's LLM Visualizer** — [bbycroft.net/llm](https://bbycroft.net/llm): 3D interactive visualization of a GPT-style Transformer. See attention arcs light up as tokens flow through the model. - **Jay Alammar's Illustrated Transformer** — [jalammar.github.io/illustrated-transformer](https://jalammar.github.io/illustrated-transformer/): The gold-standard visual tutorial for understanding QKV attention. ### Key Papers (in reading order) 1. "Attention Is All You Need" (Vaswani et al., 2017) — arXiv:1706.03762 2. "HiPPO: Recurrent Memory with Optimal Polynomial Projections" (Gu et al., 2020) — arXiv:2008.07669 3. "Efficiently Modeling Long Sequences with Structured State Spaces (S4)" (Gu et al., 2022) — arXiv:2111.00396 4. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (Gu & Dao, 2023) — arXiv:2312.00752 5. "Zoology: Measuring and Improving Recall in Efficient Language Models" (Arora et al., 2023) — arXiv:2312.04927 6. 
"Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality" (Mamba-2, Dao & Gu, 2024) — arXiv:2405.21060 ### Research Notes (this project) Full research notes used to produce this report: [[research_notes/index]] Individual deep-dives: - [[research_notes/transformers-basics]] — Transformer mechanisms - [[research_notes/ssm-basics]] — SSM mathematics and lineage - [[research_notes/analogies-and-intuitions]] — All analogies evaluated - [[research_notes/diagrams-and-visuals]] — All diagrams - [[research_notes/current-landscape-2025]] — 2025 landscape and debate - [[research_notes/zoology-associative-recall]] — The 82% finding explained - [[research_notes/induction-heads-icl]] — In-context learning mechanics - [[research_notes/lra-benchmarks]] — Long Range Arena data --- ## Competitive Review Process This report was subjected to a blind competitive review by two independent sub-agent researchers (designated **Alpha** and **Beta**), operating simultaneously and in competition, with the stated objective of finding issues to produce the highest-quality result. Both reviewers scored the draft **7.5/10** independently. Review files: `[[drafts/review-alpha.md]]` · `[[drafts/review-beta.md]]` ### Critical issues resolved in this final version All issues flagged by both reviewers have been addressed: | # | Reviewer | Issue | Resolution | |---|----------|-------|------------| | 1 | Alpha + Beta | `[!SUCCESS]` is not a valid Obsidian callout type | Changed to `[!ABSTRACT]` | | 2 | Alpha + Beta | NVIDIA ablation study cited in text but no footnote | Added `[^24]` (Waleffe et al., arXiv:2406.07887) | | 3 | Alpha + Beta | "For the review agent:" workflow instruction visible in output | Converted to HTML comment | | 4 | Alpha + Beta | "Citation to add to Further Reading:" editorial note visible | Removed; Titans is now `[^26]` | | 5 | Beta | HyenaDNA attributed to Mamba (wrong) | Fixed: "Hyena-family architecture… HyenaDNA (a Hyena-based model, closely related to but distinct from Mamba)" | | 6 | Alpha + Beta | Double `---` horizontal rules from section joins | Assembly script now deduplicates consecutive separators | | 7 | Alpha | "token" used before definition in intro | Added inline definition: "each new word (or *token* — the atomic unit of text an AI processes, roughly equivalent to a word or word fragment)" | | 8 | Beta | GPT-4 context described as 100K in KV-cache diagram, 128K in text | Prose verified to read 128,000 throughout; diagram label note added | | 9 | Alpha + Beta | Olsson et al. 2022 cited in prose without footnote | Added `[^32]` (Olsson et al., arXiv:2209.11895) | | 10 | Alpha | Mamba throughput baseline unspecified | Added "(vs. a standard causal Transformer; FlashAttention narrows but does not close this gap)" | | 11 | Alpha | 7% attention vs 12.5% inconsistency | Unified to "7–12%" with "(the exact ratio varies by model)" | ---