# State Space Models (SSMs): A Layperson's Guide
> *From control-theory roots to Mamba — what SSMs are, why they matter, and how they work.*
>
> Part of the SSMs vs Transformers research project.
> See [[index]] for the full research inventory and [[STEERING.md|../STEERING.md]] for project guidance.
---
## Table of Contents
1. [The Big Picture: Why Do We Need SSMs?](#1-the-big-picture-why-do-we-need-ssms)
2. [Historical Context](#2-historical-context)
3. [Core Intuition: What Is a State Space Model?](#3-core-intuition-what-is-a-state-space-model)
4. [The Family Tree: HiPPO → S4 → H3 → Mamba](#4-the-family-tree-hippo--s4--h3--mamba)
5. [Key Innovations Deep-Dive](#5-key-innovations-deep-dive)
6. [How SSMs Process Sequences (The Three Modes)](#6-how-ssms-process-sequences-the-three-modes)
7. [Computational Characteristics](#7-computational-characteristics)
8. [What SSMs Excel At (and Where They Struggle)](#8-what-ssms-excel-at-and-where-they-struggle)
9. [Key Models and Products](#9-key-models-and-products)
10. [Visual / Diagram Ideas](#10-visual--diagram-ideas)
11. [Appendix: Five Layperson Explanations](#appendix-five-layperson-explanations-found-in-the-wild)
12. [Sources](#sources)
---
## 1. The Big Picture: Why Do We Need SSMs?
Most of the famous AI breakthroughs of the last few years — ChatGPT, GPT-4, Claude, Gemini — run on an architecture called the **Transformer**. Transformers are extraordinarily powerful, but they have one deep, fundamental problem: they get *very* slow and *very* memory-hungry as the sequences they read get longer.[^1]
The culprit is **attention**: the mechanism that lets a Transformer "look back" at everything it has ever read to decide what's relevant right now. Attention is O(n²) — if a sequence doubles in length, the computation *quadruples*. For a 1,000-token conversation that's fine. For a 100,000-token book, or a genome, or an hour of audio sampled thousands of times per second, it becomes prohibitive.[^2]
**State Space Models (SSMs)** are an alternative family of architectures that process sequences in **linear time** — O(n) — and use a *fixed-size* memory at inference. The trade-off historically was that SSMs were less capable. The story of modern SSM research is about closing that gap. See [[transformers-basics]] for a full Transformer deep-dive and [[computational-complexity]] for the math of O(n) vs O(n²).
---
## 2. Historical Context
### 2.1 Classical Control Theory (1960s)
The term "state space model" comes not from AI but from **engineering and physics**. In the 1960s, Rudolf Kalman and others formalised a way to describe any system that evolves through time — a rocket's trajectory, a circuit's voltage, a satellite's orbit — using two simple equations:
- The **state equation**: *the current state = (how the previous state decays/evolves) + (how new input affects things)*
- The **output equation**: *what you observe = (a projection of the hidden state)*
Think of it like this: you want to track a ball thrown in the air. You can't measure every molecule of air and the ball's exact quantum state. Instead, you pick a *state vector* — say, position and velocity — and a set of physics-based rules for how that state evolves each millisecond. That small summary lets you predict everything relevant about the future.[^3]
**Kalman Filter (1960)**: The most famous SSM — used in GPS and spacecraft navigation. It maintains a belief about the current state and updates it optimally as new measurements arrive, at a constant cost per observation (so n observations take O(n) work in total). Structurally, this is exactly how modern SSMs work — except instead of hand-derived physics equations, neural networks *learn* the update rules from data.[^3]
### 2.2 RNNs and LSTMs: The First Wave (1980s–2010s)
Long before Transformers, researchers built **Recurrent Neural Networks (RNNs)** to handle sequences. An RNN processes one token at a time, updating a hidden "state" vector as it goes — very similar to the control-theory SSM. The problem: RNNs suffer from the **vanishing gradient problem**. When training on long sequences, the signal that tells the network "this word back in sentence 1 matters for what's happening in sentence 50" gets exponentially weaker as it travels back in time, so the network effectively forgets.[^4]
**LSTMs (Long Short-Term Memory, 1997)** and **GRUs (Gated Recurrent Units, 2014)** added *gates* — learned switches that decide what to keep, what to update, and what to forget — partially solving the problem. But for truly very long sequences (thousands of tokens), they still struggled. And they couldn't be parallelised during training, making them slow.[^4]
In 2017, the Transformer's attention mechanism blew past both, and RNNs were widely considered "dead."[^5]
### 2.3 The Lineage Toward Modern SSMs
The rehabilitation of recurrent models came from a surprising direction: going *back to the mathematics of control theory*, but adding deep learning insights:
| Year | Paper | Key Contribution |
|------|-------|-----------------|
| 2020 | **HiPPO** | Mathematical framework for optimal memory compression over time |
| 2021 | **LSSL** (Linear State Space Layer) | First deep learning use of HiPPO-based SSM layers |
| 2021 | **S4** (Structured State Spaces for Sequences) | Made SSMs efficiently trainable; breakthrough results on long sequences |
| 2022 | **H3** (Hungry Hungry Hippos) | Closed the gap with Transformers on language tasks |
| 2023 | **Hyena** | Replaced attention with long convolutions + data-controlled gating |
| 2023 | **Mamba** | Added input-dependent ("selective") SSMs + hardware-aware algorithm |
| 2024 | **Mamba-2** | Revealed deep connection between SSMs and attention; 2–8× faster training |
Much of this work originates from the **Hazy Research lab at Stanford** (led by Christopher Ré), with Albert Gu as lead author across the lineage and Tri Dao joining for Mamba.[^6]
---
## 3. Core Intuition: What Is a State Space Model?
### 3.1 The Running Summary Metaphor
Imagine you're a secretary summarising a very long meeting in real time. You can't write down every word — the notepad only has room for a one-page summary. So as each new thing is said:
1. You *update your summary* based on what was just said, weighted by how important it is.
2. You *let old details fade* if they've become irrelevant.
3. When someone asks you a question, you *consult your summary*, not the raw transcript.
That one-page summary is the **state vector** (often written **h** or **x** in papers). This is the heart of every SSM and every RNN. The critical difference between them is *how* that update is done.[^7]
### 3.2 The Temple Run Analogy
A vivid metaphor from AI blogger Kola Ayonrinde: imagine you're building an AI to play the endless-runner game **Temple Run**.[^8]
Your agent needs to decide whether to swerve left or right. To do this, it needs to know: current position, velocity, nearest obstacle position, whether the path is curving. That's the **state**. Crucially:
- You don't need to remember every individual pixel frame in perfect detail
- Most of the screen is predictable from physics (obstacles move down as you run forward)
- You only need to *look at new information* at the top of the screen and update your model of the world
This is exactly what an SSM does: it maintains a **compressed model of the world** (the state), updates it incrementally with each new observation, and uses that model to produce outputs. The state is not a recording — it's a *live understanding*.
### 3.3 The Contrast with Transformers
| Model | Memory Approach | Analogy |
|-------|----------------|---------|
| **Transformer** | Stores everything verbatim, looks back when needed | Photographic memory — exact but space-hungry |
| **SSM** | Maintains compressed running summary | Smart note-taker — lossy but efficient |
| **Classic RNN** | Similar compression but poorly trained | A forgetful note-taker who loses older notes |
### 3.4 The Linear Recurrence
Under the hood, the update rule in an SSM is *linear* (no complicated nonlinearities in the state update itself):[^9]
```
new_state = A × old_state + B × new_input
output = C × state + D × input
```
Where:
- **A** = "how much of the old state to keep / how to transition it" (the forgetting/transition matrix)
- **B** = "how much of the new input to absorb into the state" (the input matrix)
- **C** = "how to read out an answer from the state" (the output matrix)
- **D** = a direct shortcut from input to output (like a skip connection; often set to 0)
- **Δ (delta)** = a "step size" or "dwell time" — how long to linger on each token
The linearity is a feature, not a limitation: it makes the math tractable, enables fast training via convolutions, and allows provable guarantees about what the model can remember.
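To make this concrete, here is a minimal NumPy sketch of the discrete recurrence with toy dimensions and made-up matrices (a real SSM learns A, B, C, D end to end, and S4 initialises A via HiPPO; the Δ-based discretisation that turns continuous A, B into the Ā, B̄ used in the papers is glossed over here):

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 10                        # state size, sequence length

# Toy parameters (illustrative only — a trained SSM learns these)
A = 0.9 * np.eye(N)                 # transition: keep most of the old state each step
B = rng.normal(size=(N, 1))         # input matrix: how new input enters the state
C = rng.normal(size=(1, N))         # output matrix: how to read the state out
D = np.zeros((1, 1))                # direct input→output shortcut (often zero)

x = rng.normal(size=(L, 1))         # a toy sequence of L scalar inputs
h = np.zeros((N, 1))                # the fixed-size state: the "one-page summary"

outputs = []
for t in range(L):
    x_t = x[t].reshape(1, 1)
    h = A @ h + B @ x_t             # new_state = A·old_state + B·new_input
    y = C @ h + D @ x_t             # output    = C·state     + D·input
    outputs.append(y.item())

print(outputs)                      # one output per token; the state never grew beyond N numbers
```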
### 3.5 How SSMs Differ from RNNs
At first glance, the SSM update equation looks identical to an RNN. The differences are deep:[^10]
| Property | Classic RNN / LSTM | SSM (S4/Mamba) |
|----------|--------------------|----------------|
| State update rule | Nonlinear (tanh, sigmoid gates) | Linear (matrix multiply) |
| Memory structure | Ad hoc, learned from scratch | Principled (HiPPO polynomial basis) |
| Training mode | Sequential only (slow) | Can be parallelised as a convolution |
| Long-range memory | Vanishing gradient problem | Mathematically stable by construction |
| State size | Tied to the hidden dimension | A separate hyperparameter N (per channel) |
| Input-dependence | Gates depend on input | In Mamba: all matrices depend on input |
The key leap: because SSMs use *linear* recurrence, you can "unroll" the entire recurrence mathematically and express it as a **single convolution over the whole sequence** — enabling parallel training on GPUs just like Transformers.[^11]
---
## 4. The Family Tree: HiPPO → S4 → H3 → Mamba
### 4.1 The Fundamental Problem: What Should You Remember?
Before HiPPO, SSMs and RNNs didn't have a principled answer to: *what exactly should go into the state vector?* Typical approaches randomly initialised the A matrix and hoped gradient descent would figure it out. For short sequences it worked. For long ones, it failed — the model had no systematic way to preserve information across thousands of steps.
### 4.2 HiPPO (2020): The Memory Foundation
**HiPPO** stands for **High-Order Polynomial Projection Operators**. The paper, from Albert Gu et al. at Stanford, asked: *Is there a mathematically optimal way to compress a signal's history into a fixed-size vector?*[^12]
The answer was yes. The trick: represent history as the **coefficients of a polynomial that best approximates the signal so far**. Specifically, HiPPO uses **Legendre polynomials** — a set of curves that, when combined, can approximate almost any shape.
**Layperson analogy**: Imagine you've been listening to music for an hour and someone asks you to hum back what you heard. You can't hum every note. But you could capture the *rhythm*, the *key*, the *rough melody shape*, the *dynamics* — a compressed but meaningful summary. HiPPO does exactly this, but mathematically, for any time series. The update rule that keeps those polynomial coefficients fresh as new data arrives turns out to be exactly the SSM state update — with a *specific, principled choice of A matrix*.
The result: a hidden state that optimally represents the recent history at multiple timescales simultaneously, **avoiding the forgetting problem of RNNs**.[^13]
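A toy NumPy sketch can make the "history as polynomial coefficients" idea tangible. This is *not* the HiPPO update rule itself (HiPPO maintains the coefficients incrementally, online, via a specific A matrix); it only shows that a long sampled signal can be compressed into a short vector of Legendre coefficients and reconstructed from them:

```python
import numpy as np
from numpy.polynomial import legendre

# A "history": 500 samples of a signal, rescaled to [-1, 1] (the Legendre polynomials' home turf)
t = np.linspace(-1, 1, 500)
signal = np.sin(4 * t) + 0.5 * np.cos(9 * t)

# Compress: fit a degree-15 Legendre series → 16 coefficients act as the fixed-size "state"
state = legendre.legfit(t, signal, deg=15)

# Reconstruct the entire history from those 16 numbers alone
reconstruction = legendre.legval(t, state)

err = np.max(np.abs(signal - reconstruction))
print(f"state: {state.size} coefficients, max reconstruction error: {err:.2e}")
```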
### 4.3 S4 (2021): Making SSMs Trainable at Scale
The **Structured State Space for Sequences (S4)** paper by Gu et al. (ICLR 2022, Outstanding Paper) made SSMs practical for deep learning.[^14]
**The problem S4 solved**: the HiPPO A matrix is a dense N×N matrix. Computing the recurrence naively takes O(N²) per step, and computing the convolutional kernel (needed for parallel training) requires computing powers of A — which was prohibitively expensive for large N.
**S4's solution**: structure the A matrix as a **low-rank correction of a diagonal matrix**. This special structure means:
1. The eigendecomposition can be computed stably (no numerical blowup)
2. The convolution kernel reduces to a well-studied **Cauchy kernel** that can be computed efficiently
**Results**: S4 achieved state-of-the-art on the **Long Range Arena** benchmark (sequences up to 16,384 steps), including solving the "Path-X" task that *all prior models failed on completely*.[^15] It also generated audio and classified images from raw pixels at competitive speed, and ran 60× faster than Transformers on generation tasks.
Key insight from the S4 paper: *the same model can be run as a recurrence (fast inference, O(1) memory) or as a convolution (fast parallel training) just by switching representations* — no architectural changes needed.[^16]
### 4.4 H3 (2022): Closing the Language Gap
S4 was great for long-range tasks, but struggled on language modelling (predicting the next word). Researchers found that SSMs had two specific weaknesses:[^17]
1. **Recalling specific earlier tokens** (e.g., "what was the name mentioned 50 tokens ago?")
2. **Comparing tokens across the sequence** (e.g., "does this word match that earlier word?")
**H3 (Hungry Hungry Hippos)** from Fu et al. (ICLR 2023 Spotlight) designed an SSM layer specifically to address these, combining a "shift" SSM (for token recall) with a standard SSM (for comparisons). H3 came within 0.4 perplexity of Transformers on OpenWebText — a near-miss that showed SSMs could compete on language.
The H3 paper also introduced **FlashConv**, a hardware-optimized SSM training algorithm, giving 2× speedup on long-range benchmarks and enabling 2.7B-parameter hybrid models on the Pile dataset.
### 4.5 Mamba (December 2023): The Selective Revolution
**Mamba**, from Albert Gu and Tri Dao, is the current flagship of SSM research.[^18]
**The key insight**: S4 and H3 use *time-invariant* SSMs — the A, B, C matrices are the same for every input token. This is efficient but fundamentally limits expressiveness. A word like "the" and a word like "Paris" get processed identically at the structural level; the model can't selectively decide "I should really hold on to 'Paris' in my state but mostly ignore 'the'."
Mamba adds **selectivity**: the A, B, C, and Δ matrices become **functions of the current input token**. Each token can dynamically determine:
- How much to "gate open" the state update (what fraction of new info to absorb)
- How much to "decay" the previous state (what to forget)
- How to read out information from the state
**Analogy**: Compare a conveyor belt (old SSMs) to a skilled curator (Mamba). The conveyor belt treats every object the same — pick it up, add it to the bin, move on. The curator looks at each object and decides: *this artefact is priceless, I'll frame it prominently; this ticket stub is junk, I'll barely glance at it.* Both process every object, but only the curator has content-aware memory.[^8]
This turns out to be the exact capability needed for language — where some tokens (names, facts, numbers) matter enormously and others (function words) barely matter for long-term memory.[^19]
**Mamba's second innovation**: because A, B, C depend on the input, you can no longer use the parallel convolutional trick from S4. Mamba solved this with a **hardware-aware parallel scan algorithm** — a technique from the parallel computing literature that computes the recurrence in O(log n) parallel steps (O(n) total work) using parallel prefix operations, running directly in GPU SRAM (the fastest memory tier) to avoid slow I/O.[^20]
**Results**: Mamba-3B outperformed Transformers of the same size, and matched Transformers twice its size, on language modelling benchmarks. It achieved **5× higher inference throughput** than a comparable Transformer. And it scales to **million-length sequences** — a feat no Transformer can match at practical memory budgets.[^21]
### 4.6 Mamba-2 (May 2024): Unifying SSMs and Attention
**Mamba-2** by Gu and Dao revealed that SSMs and attention are not opposites — they are two views of the same mathematical structure: **structured semiseparable matrices**.[^22]
This **State Space Duality (SSD)** framework means:
- Every SSM can be rewritten as a form of attention (with a specific structured mask)
- Every (linear) attention can be rewritten as an SSM
- This enables hybrid architectures that combine both intelligently
Practically, Mamba-2's core layer is 2–8× faster to train than Mamba-1, because the SSD formulation can be expressed as matrix multiplications — which modern GPUs are hyper-optimised for.
---
## 5. Key Innovations Deep-Dive
### 5.1 The Three Representations Trick (S4's Magic)
One of S4's most elegant features is that its equations have **three mathematically equivalent representations**, each useful for a different purpose:[^16]
```
┌─────────────────────────────────────────────────────────────┐
│ THE SAME MODEL... │
├─────────────────┬──────────────────────┬───────────────────┤
│ Continuous │ Recurrent │ Convolutional │
│ (Theory) │ (Inference) │ (Training) │
├─────────────────┼──────────────────────┼───────────────────┤
│ Differential │ Step-by-step update │ Single matrix │
│ equation │ (like a classic RNN) │ multiplication │
│ │ │ over whole seq. │
├─────────────────┼──────────────────────┼───────────────────┤
│ • Handles │ • O(1) memory │ • Fully parallel │
│ irregular │ • O(n) total time │ • GPU-efficient │
│ sampling │ • Streaming │ • Fast training │
│ • Theoretically │ inference │ │
│ grounded │ │ │
└─────────────────┴──────────────────────┴───────────────────┘
```
You can switch between them at will. Train as a convolution (fast). Deploy as a recurrence (memory-efficient). Reason about it as a differential equation (theoretically rigorous).
### 5.2 The Convolutional Kernel Intuition
When an SSM is expressed as a convolution, it creates a **kernel** — a list of weights, one per time step in the past, that gets multiplied against the input history.
Think of it like the blur function in image editing: if you apply a Gaussian blur to a photo, you're replacing each pixel with a weighted average of its neighbours, where nearby pixels get more weight and distant ones get less. The SSM kernel does the same thing in time: the current output is a weighted combination of all past inputs, where the weights are determined by powers of the A matrix.
What makes the SSM kernel special: the weights don't just decay exponentially (as in a simple moving average). The HiPPO-derived A matrix ensures the weights have **oscillatory, multi-scale structure** — capturing both recent detail and distant patterns simultaneously.[^23]
### 5.3 The Selection Mechanism (Mamba's Core)
In a fixed (non-selective) SSM, the A and B matrices are constant parameters — the same for every token. Mamba makes them **functions of the input**:[^8]
```
Fixed SSM: A = fixed matrix, B = fixed matrix
Mamba's SSM: A = f(input_token), B = g(input_token)
```
A concrete consequence: when Mamba reads "The cat sat on the mat. The dog barked at the **cat**", it can:
- When processing "cat" the second time: set Δ (dwell time) *high* — linger on this token
- Let A decay toward 0 for "dog" and "barked" — release those from the state
- Keep "cat" prominent in the state for the subsequent prediction
The Mamba authors state the principle directly: *"a fundamental principle for building sequence models is selectivity: the context-aware ability to focus on or filter out inputs into a sequential state."*[^24]
| Parameter | Fixed SSM behaviour | Mamba (selective) behaviour |
|-----------|--------------------|-----------------------------|
| A (transition) | Same decay for every token | Content-dependent: key tokens decay less |
| B (input gate) | Equal weight to every token | High weight for important tokens |
| C (output gate) | Fixed reading strategy | Adapts to what's currently relevant |
| Δ (step size) | Same dwell time for all | Longer dwell on important tokens |
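A minimal NumPy sketch of the idea, using hypothetical toy projections (real Mamba computes B, C, Δ with learned linear layers over token embeddings, applies the recurrence per channel, and learns a diagonal A in log space):

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 8                              # state size, sequence length

# Hypothetical toy parameters standing in for Mamba's learned projections
A = -np.abs(rng.normal(size=N))          # per-channel decay rates (negative for stability)
w_B, w_C = rng.normal(size=N), rng.normal(size=N)
w_delta = rng.normal()

def softplus(z):
    return np.log1p(np.exp(z))           # keeps the step size Δ positive

u = rng.normal(size=L)                   # a toy 1-D input sequence
h = np.zeros(N)
for t in range(L):
    delta = softplus(w_delta * u[t])     # Δ(input): dwell time chosen per token
    B_t = w_B * u[t]                     # B(input): how strongly to absorb this token
    C_t = w_C * u[t]                     # C(input): how to read the state out right now
    A_bar = np.exp(delta * A)            # large Δ → old state decays hard (focus on this token);
                                         # small Δ → A_bar ≈ 1, the token is mostly skipped
    h = A_bar * h + delta * B_t * u[t]   # input-dependent state update
    y = C_t @ h                          # per-token output
    print(f"t={t}  Δ={delta:.2f}  output={y:+.3f}")
```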
---
## 6. How SSMs Process Sequences (The Three Modes)
### 6.1 The Compression Pipeline
At a high level, an SSM layer does this:
```
Input tokens: [w₁, w₂, w₃, ..., wₙ]
↓
SSM layer: compress history
into state h (FIXED size!)
↓
Output tokens: [y₁, y₂, y₃, ..., yₙ]
```
Each output yₖ is derived from the state hₖ, which summarises *everything seen so far* up to position k. The magic is that hₖ has a **fixed size regardless of how long the sequence is**.
### 6.2 Recurrent Mode (Inference)
At inference time, the SSM runs one step at a time:
```
h₀ = zeros (empty state)
For each new token xₜ:
hₜ = Ā·hₜ₋₁ + B̄·xₜ ← update state
yₜ = C·hₜ ← produce output
```
This is **O(n) total time** and **O(1) memory** — the state size never grows. Whether you've read 100 tokens or 100,000 tokens, the memory footprint is the same. This is the critical advantage over Transformers, which must store an ever-growing KV cache.[^25]
### 6.3 Convolutional Mode (Training)
During training, where you have the whole sequence available and want maximum speed, the SSM can be "unrolled" algebraically to show that:
```
Output sequence = Convolutional kernel K ⊛ Input sequence
```
where ⊛ means convolution. This means you can compute all outputs **in parallel** — the whole sequence at once — using fast Fourier Transform (FFT)-based convolution. Training is thus as parallel as a Transformer, making large-scale training practical.[^17]
```
SSM Kernel K = (C·B, C·A·B, C·A²·B, ..., C·A^(L-1)·B)
↑the "memory" weights: how each past input influences today's output
```
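As a sanity check, the following NumPy sketch (toy random matrices, single-input/single-output) runs the same model both ways — step-by-step recurrence and causal convolution with the kernel above — and confirms the outputs match:

```python
import numpy as np

rng = np.random.default_rng(1)
N, L = 3, 8
A = 0.8 * rng.normal(size=(N, N)) / np.sqrt(N)   # keep eigenvalues small so powers stay tame
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
u = rng.normal(size=L)

# Recurrent mode: one step at a time, O(1) memory
h, y_rec = np.zeros((N, 1)), []
for t in range(L):
    h = A @ h + B * u[t]
    y_rec.append((C @ h).item())

# Convolutional mode: precompute K = (C·B, C·A·B, C·A²·B, ...), then causally convolve with the input
K = np.array([(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(L)])
y_conv = [np.dot(K[:t + 1][::-1], u[:t + 1]) for t in range(L)]

print(np.allclose(y_rec, y_conv))   # True — same model, two equivalent ways to compute it
```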
### 6.4 The Sliding Compression Metaphor
Imagine filling a diary. An SSM is like a diary with a **fixed number of pages** — as you fill pages, you must decide what to overwrite. A Transformer is like a diary with **unlimited pages** that you must re-read every time you add a new entry (getting slower as it grows).
The SSM's fixed-state approach forces it to compress wisely. The art of SSM design (HiPPO, selectivity) is all about making that compression as information-preserving as possible.
---
## 7. Computational Characteristics
### 7.1 The O(n) vs O(n²) Difference
| Sequence length | Transformer ops (∝ n²) | SSM ops (∝ n) | Transformer ÷ SSM |
|-----------------|------------------------|---------------|-------------------|
| 1,000 tokens | 1,000,000 | 1,000 | 1,000× |
| 10,000 tokens | 100,000,000 | 10,000 | 10,000× |
| 100,000 tokens | 10,000,000,000 | 100,000 | 100,000× |
| 1,000,000 tokens | 1,000,000,000,000 | 1,000,000 | 1,000,000× |
At 1 million tokens, the Transformer requires **a million times more operations** than the SSM. This is why linear-time models are so attractive for genomics (genomes are ~3 billion base pairs), long-form audio (at 16kHz, one hour is 57.6 million samples), and video.[^26]
See [[computational-complexity]] for detailed analysis.
### 7.2 Memory at Inference
For a Transformer serving a chatbot:
- Every token added to context → adds a KV vector to cache
- At 7B-parameter scale with full multi-head attention, the cache costs roughly 0.5 MB per token, so a 100K-token context can require on the order of **50 GB of KV cache**
- This limits how many conversations a server can handle simultaneously
For an SSM:
- The state is **fixed size** regardless of context length
- A Mamba model serving a 1M-token conversation uses the same memory as a 100-token one
- You can serve **many more users** on the same hardware[^21]
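A back-of-the-envelope sketch makes the gap concrete. The numbers below are illustrative (roughly a 7B Transformer with full multi-head attention in fp16, versus a Mamba-3B-sized state); real figures vary a lot with architecture choices such as grouped-query attention:

```python
BYTES = 2  # fp16: 2 bytes per number

# Transformer KV cache: K and V vectors for every layer, for every token in the context
layers, heads, head_dim = 32, 32, 128            # roughly a 7B model with full multi-head attention
kv_per_token = 2 * layers * heads * head_dim * BYTES
for tokens in (1_000, 100_000, 1_000_000):
    print(f"{tokens:>9,} tokens → KV cache ≈ {kv_per_token * tokens / 1e9:7.1f} GB")

# SSM state: fixed size, no matter how many tokens have been read
ssm_layers, d_inner, d_state = 64, 5120, 16      # roughly Mamba-3B-sized
ssm_state = ssm_layers * d_inner * d_state * BYTES
print(f"SSM state (any context length) ≈ {ssm_state / 1e6:.0f} MB")
```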
### 7.3 Full Complexity Comparison
| Metric | Transformer | SSM (S4/Mamba) |
|--------|-------------|----------------|
| Training time per step | O(n²) | O(n log n) with FFT |
| Training memory | O(n²) | O(n) |
| Inference time per token | O(n) (growing KV cache) | **O(1)** (fixed state) |
| Inference memory | O(n) (KV cache grows) | **O(1)** (fixed state) |
| Max practical context | ~128K–1M tokens (hardware bound) | Theoretically unbounded |
| Inference throughput | Decreases with context length | Constant |
| GPU utilization (training) | Excellent (mature tooling) | Good (improving rapidly) |
**Key numbers from papers**:
- Mamba: **5× higher inference throughput** than equivalent-size Transformer[^21]
- S4: **60× faster generation** than Transformers on some tasks[^14]
- Mamba-2: **2–8× faster training** than Mamba-1 via matrix-multiply formulation[^22]
---
## 8. What SSMs Excel At (and Where They Struggle)
### 8.1 SSM Strengths
- **Very long sequences**: Audio (22kHz = millions of samples), genomics (whole chromosomes), time series, video — anywhere sequence length > 10K tokens
- **Streaming/online inference**: Each new token = O(1) work regardless of history length; perfect for real-time applications
- **Memory-constrained deployment**: No growing KV cache means dramatically lower memory on deployment hardware
- **Continuous/smooth signals**: SSMs have a natural continuous-time interpretation, making them ideal for time series from physical processes (heart rate, accelerometers, speech waveforms)
- **Long Range Arena**: S4 set state-of-the-art on every single benchmark task, including the "Path-X" task at 16K length that all prior models failed on completely[^27]
- **Genomics and biology**: Mamba has been applied to DNA sequences spanning entire chromosomes, enabling dependency modelling at scales impossible for Transformers[^28]
### 8.2 Where SSMs Still Lag Transformers
| Task | Issue | Why |
|------|-------|-----|
| **In-context learning** | Weaker few-shot performance | Fixed state may compress away precise examples |
| **Exact token recall** | "What was the 3rd word I said?" | Compression loses verbatim information |
| **Complex multi-step reasoning** | Weaker on logic chains | Intermediate steps may be compressed out |
| **Language modelling (scale)** | Unproven at frontier scale | Mamba-3B matches a Transformer roughly twice its size, but 70B+ comparisons remain open |
| **Tooling & ecosystem** | Less mature infrastructure | Most serving tools (vLLM, etc.) are Transformer-first |
The H3 paper[^17] diagnosed SSMs' two core weaknesses precisely: *recalling specific earlier tokens* and *comparing tokens across the sequence*. Mamba's selectivity addresses these partially but not completely.
### 8.3 The Efficiency-Effectiveness Trade-off
From Kola Ayonrinde's Mamba explainer:[^8]
> *"Transformers are extremely effective but not very efficient. Traditional RNNs are very efficient but not very effective. Mamba seems to offer a solution which pushes out the Pareto frontier of effectiveness/efficiency."*
| Model | Effectiveness | Efficiency |
|-------|:------------:|:---------:|
| Traditional RNN | ⭐⭐ | ⭐⭐⭐⭐⭐ |
| LSTM/GRU | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Transformer | ⭐⭐⭐⭐⭐ | ⭐⭐ |
| S4 | ⭐⭐⭐⭐ (long seq.) | ⭐⭐⭐⭐⭐ |
| Mamba | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Hybrid (Jamba/Zamba) | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
---
## 9. Key Models and Products
See also [[real-world-products]] and [[hybrid-models]] for extended coverage.
### 9.1 Mamba (December 2023)
- **Authors**: Albert Gu (CMU) & Tri Dao (Princeton/Together AI)
- **Architecture**: Pure selective SSM (no attention), stacked Mamba blocks
- **Key numbers**: Mamba-3B outperforms same-size Transformers on language benchmarks; matches Transformers twice its size; 5× inference throughput gain; tested up to million-length sequences
- **Code**: https://github.com/state-spaces/mamba
- **Paper**: arXiv:2312.00752[^29]
### 9.2 Mamba-2 (May 2024, ICML 2024)
- **Authors**: Albert Gu & Tri Dao
- **Architecture**: SSD (State Space Dual) layer — a selective SSM whose A matrix is restricted to a scalar times the identity, enabling a matrix-multiply formulation
- **Key numbers**: 2–8× faster training than Mamba-1 via matrix-multiply formulation; competitive with Transformers on language
- **Innovation**: Proved formal equivalence between SSMs and structured attention (State Space Duality)
- **Paper**: arXiv:2405.21060[^22]
### 9.3 RWKV (Receptance Weighted Key Value)
- **Creator**: Bo Peng; community-driven, open-source
- **Architecture**: Linear attention reformulated as an RNN — trains in parallel like a Transformer, runs as an RNN for fast inference
- **Scale**: Models up to 14B parameters trained; the largest dense RNN ever trained at publication
- **Key innovation**: Linear attention that can be reformulated as a recurrence, avoiding O(n²) while maintaining Transformer-quality parallelism during training
- **Paper**: arXiv:2305.13048[^30]
### 9.4 Hyena (2023)
- **Authors**: Michael Poli et al., Stanford/Hazy Research
- **Architecture**: Long implicit convolutions interleaved with data-controlled gating (per-token filtering)
- **Key numbers**: 2× faster than optimised attention at 8K context; 100× faster at 64K context; Transformer quality on language with 20% less training compute at 2K context
- **Significance**: A "generalised SSM" where the gating mechanism is content-dependent — a precursor to Mamba's full selectivity
- **Paper**: arXiv:2302.10866[^31]
### 9.5 Jamba (AI21 Labs, March 2024)
- **Architecture**: Hybrid **Transformer + Mamba + Mixture-of-Experts (MoE)** — interleaves Transformer attention blocks with Mamba blocks, with MoE to increase capacity
- **Key numbers**: Fits on a single 80GB GPU; strong results up to **256K-token context length**; high throughput and small memory footprint vs. pure Transformers
- **Design insight**: Uses Transformer attention for precise recall tasks and SSM layers for long-range compression — the best of both worlds
- **Paper**: arXiv:2403.19887[^32]
### 9.6 Zamba (Zyphra, May 2024)
- **Architecture**: 7B Mamba backbone with a **single shared attention module** reused across all layers
- **Key numbers**: Trained on 1T tokens; best non-Transformer model at 7B scale at publication; significantly faster inference and lower memory than comparable Transformers
- **Innovation**: Gets attention's benefits at minimal parameter overhead via weight sharing
- **Paper**: arXiv:2405.16712[^33]
### 9.7 Falcon Mamba 7B (TII UAE, October 2024)
- **Architecture**: **Pure Mamba** model — no Transformer components whatsoever
- **Key numbers**: Trained on 5.8 trillion tokens; outperforms Mistral 7B, Llama 3.1 8B, and Falcon2 11B; matches Gemma 7B; best pure-Mamba model at 7B scale on Open LLM Leaderboard
- **Significance**: Proves that pure SSM architectures can compete with heavily-optimised Transformer models at production scale without hybrid design
- **License**: Permissive open-source; weights available on HuggingFace
- **Paper**: arXiv:2410.05355[^34]
### 9.8 Summary Comparison Table
| Model | Type | Params | Context | Key Strength |
|-------|------|--------|---------|-------------|
| Mamba | Pure SSM | 130M–3B | 1M+ | Selective memory, fast inference |
| Mamba-2 | Pure SSM | 130M–2.7B | 1M+ | 2-8× faster training, formal theory |
| RWKV-6 | Linear attn. RNN | Up to 14B | ~8K+ | Open, RNN+Transformer hybrid |
| Hyena | Long conv. | Up to 1B | 64K+ | 100× faster than attn. at 64K |
| Jamba | SSM+Attn+MoE | 12B active | 256K | Best hybrid; production-ready |
| Zamba | SSM+Attn | 7B | Long | Minimal-overhead attention |
| Falcon Mamba | Pure SSM | 7B | Long | Best pure SSM at 7B scale |
---
## 10. Visual / Diagram Ideas
### 10.1 State Transition Diagram
```
┌────────────────────────────────────────────────┐
│ │
│ xₜ (new input token) │
│ │ │
│ ▼ │
│ ┌───┐ ┌──────────────┐ │
│ │ B │─────▶│ │ │
│ └───┘ │ Hidden │──── C ────▶ yₜ │
│ │ State hₜ │ (output) │
│ ┌───┐ │ │ │
│ │ A │◀─────│ hₜ₋₁ │ │
│ └───┘ └──────────────┘ │
│ │ ▲ │
│ └──────────────┘ │
│ │
│ [In Mamba: A, B, C all depend on xₜ!] │
└────────────────────────────────────────────────┘
```
### 10.2 The Three-Mode Diagram
```
┌─── SAME MODEL ───┐
│ │
Training │ Convolutional │ → Parallel, GPU-efficient
───────────▶ │ kernel K │ → O(n log n) via FFT
│ │
Inference │ Recurrent │ → One step at a time
───────────▶ │ hₜ = A·hₜ₋₁ │ → O(1) memory, O(n) time
│ │
Theory │ Continuous- │ → Differential equation
───────────▶ │ time ODE │ → Handles irregular sampling
└─────────────────┘
```
### 10.3 The Complexity Chasm (Visual)
```
Total ops │ ● Transformer
needed │ ●
to process │ ●
sequence │ ●
│ ●
│ ●
│ ●
│ ●●●●●●●●●●●●●●●●●●●●●●●●●●● SSM (flat line)
└────────────────────────────────────
Sequence length →
```
### 10.4 RNN vs SSM vs Transformer
```
┌─────────────────────────────────────────────────────────┐
│ │
│ Classic RNN │
│ hₜ = tanh(W·hₜ₋₁ + U·xₜ) [nonlinear, forgetful] │
│ │
│ SSM (S4/Mamba) │
│ hₜ = Ā·hₜ₋₁ + B̄·xₜ [linear, principled] │
│ │
│ Difference: A is designed via HiPPO (not random) │
│ → stable gradients, multi-scale memory │
│ Training: can parallelise as convolution │
│ │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Transformer │
│ yᵢ = softmax(QKᵀ/√d)·V [all-to-all attention] │
│ │
│ Every token attends to every other token │
│ → Perfect recall, O(n²) cost │
└─────────────────────────────────────────────────────────┘
```
### 10.5 The KV Cache Memory Problem
```
Context length → 1K 10K 100K 1M
──────────────────────────────────────────────────────
Transformer KV │▌ ████ ████████ ██████████…
cache memory │
│ (grows unboundedly!)
──────────────────────────────────────────────────────
SSM state │▌ ▌ ▌ ▌
memory │
│ (fixed! doesn't grow)
```
---
## Appendix: Five Layperson Explanations Found in the Wild
Different bloggers and educators have approached SSM intuition from five distinct angles:
1. **The Running Diary** (most common): the state = a fixed-size diary you update and partially overwrite as new events happen; you cannot record everything but you track the essentials
2. **The Temple Run Agent** (Kola Ayonrinde, 2024): the state = compressed situational awareness of a video game world, updated with minimal new observations at the top of the screen; most of the screen is predictable from the previous state
3. **The Meeting Secretary**: the state = live one-page summary of a long meeting; you can't transcribe everything so you capture patterns, decisions, and key moments
4. **The Radio Signal Filter** (from engineering tradition): the state = the output of a physical filter circuit processing an incoming signal in continuous time; the filter "remembers" through its capacitors and inductors — an exact analogy to the SSM equations
5. **The Polynomial Fitting Lens** (HiPPO framing): the state = the coefficients of the best polynomial that approximates "what has happened so far" — mathematically optimal compression of history at multiple timescales simultaneously
Each metaphor illuminates a different strength:
- The diary/secretary metaphors → capture inference efficiency
- The Temple Run metaphor → captures the real-time update loop
- The filter metaphor → explains the continuous-time origins
- The polynomial metaphor → explains why SSMs remember *long-range* patterns without forgetting
---
## Sources
[^1]: Hazy Research Blog. "Structured State Spaces (S4) — Part 1: Introduction." January 2022. https://hazyresearch.stanford.edu/blog/2022-01-14-s4-1
[^2]: Ayonrinde, K. "Mamba and State Space Models Explained." Blog, February 2024. https://www.kolaayonrinde.com/blog/2024/02/11/mamba.html — source of O(n²) Transformer analysis and Temple Run analogy.
[^3]: Kalman, R.E. (1960). "A New Approach to Linear Filtering and Prediction Problems." *Transactions of the ASME — Journal of Basic Engineering*, 82, 35–45. See also Rush, S. & Karamcheti, S. "The Annotated S4." https://srush.github.io/annotated-s4/ — section on spring-mass mechanics example.
[^4]: Hochreiter, S. & Schmidhuber, J. (1997). "Long Short-Term Memory." *Neural Computation, 9(8)*, 1735–1780. Cho, K. et al. (2014). "Learning Phrase Representations using RNN Encoder–Decoder." arXiv:1406.1078.
[^5]: HuggingFace Blog. "Introducing RWKV: An RNN with Transformer-Level Performance." https://huggingface.co/blog/rwkv — discusses the perceived "death of RNNs" after Transformers.
[^6]: Hazy Research Lab, Stanford University. https://hazyresearch.stanford.edu — home of the S4, HiPPO, Hyena, and Mamba research lineage.
[^7]: Rush, S. & Karamcheti, S. "The Annotated S4." https://srush.github.io/annotated-s4/ — excellent step-by-step code implementation with intuitions for the SSM equations.
[^8]: Ayonrinde, K. op. cit. — source of Temple Run analogy, conveyor belt vs curator metaphor, Pareto frontier framing.
[^9]: Hazy Research Blog. "Structured State Spaces — Part 3: The State Space Model." January 2022. https://hazyresearch.stanford.edu/blog/2022-01-14-s4-3 — derivation of the three representations.
[^10]: Rush, S. & Karamcheti, S. op. cit. Section on "Discrete-time SSM: The Recurrent Representation."
[^11]: Hazy Research Blog, Part 3, op. cit. — section on "Computing the State Space with Convolutions."
[^12]: Gu, A., Dao, T., Ermon, S., Rudra, A., & Ré, C. (2020). "HiPPO: Recurrent Memory with Optimal Polynomial Projections." *arXiv:2008.07669*. NeurIPS 2020. https://arxiv.org/abs/2008.07669
[^13]: HiPPO paper op. cit. Abstract: "HiPPO produces an optimal solution to a natural online function approximation problem... scales through time to remember all history, avoiding priors on the timescale... sets a new state-of-the-art accuracy of 98.3% on permuted MNIST."
[^14]: Gu, A., Goel, K., & Ré, C. (2021). "Efficiently Modeling Long Sequences with Structured State Spaces (S4)." *arXiv:2111.00396*. ICLR 2022 (Outstanding Paper Honorable Mention). https://arxiv.org/abs/2111.00396
[^15]: S4 paper op. cit. Abstract: "SoTA on every task from the Long Range Arena benchmark, including solving the challenging Path-X task of length 16k that all prior work fails on."
[^16]: Hazy Research Blog, Part 3, op. cit. "A key feature of the SSM is that it has three different representations, which allows it to be viewed (and computed) as a continuous-time, recurrent, or convolutional model."
[^17]: Fu, D.Y., Dao, T., Saab, K.K., Thomas, A.W., Rudra, A., & Ré, C. (2022). "Hungry Hungry Hippos: Towards Language Modeling with State Space Models (H3)." *arXiv:2212.14052*. ICLR 2023 (Spotlight). https://arxiv.org/abs/2212.14052
[^18]: Gu, A. & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." *arXiv:2312.00752*. December 2023. https://arxiv.org/abs/2312.00752
[^19]: Mamba paper op. cit. Abstract: "a key weakness of such models is their inability to perform content-based reasoning... simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information."
[^20]: Mamba paper op. cit. "even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode."
[^21]: Mamba paper op. cit. "Mamba enjoys fast inference (5× higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences... Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size."
[^22]: Gu, A. & Dao, T. (2024). "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (Mamba-2)." *arXiv:2405.21060*. ICML 2024. https://arxiv.org/abs/2405.21060 — "2-8X faster, while continuing to be competitive with Transformers on language modeling."
[^23]: Rush, S. & Karamcheti, S. op. cit. Section on the convolutional kernel and SSM kernel derivation.
[^24]: Mamba paper op. cit. The full selectivity principle quote.
[^25]: Gu, A. & Dao, T. Mamba-2 blog. "Mamba-2 Part I: The Model." https://tridao.me/blog/2024/mamba2-part1-model/ — discussion of O(1) inference memory.
[^26]: Hazy Research Blog, Part 1, op. cit. "when processing speech data which is usually sampled at 16000Hz, even understanding short clips requires dependencies of tens of thousands of timesteps."
[^27]: S4 paper op. cit. Long Range Arena results table.
[^28]: Mamba paper op. cit. "state-of-the-art performance across several modalities such as language, audio, and genomics."
[^29]: Gu, A. & Dao, T. Mamba paper op. cit.
[^30]: Peng, B. et al. (2023). "RWKV: Reinventing RNNs for the Transformer Era." *arXiv:2305.13048*. EMNLP 2023. https://arxiv.org/abs/2305.13048 — "combines the efficient parallelizable training of transformers with the efficient inference of RNNs... scales as large as 14 billion parameters, by far the largest dense RNN ever trained."
[^31]: Poli, M. et al. (2023). "Hyena Hierarchy: Towards Larger Convolutional Language Models." *arXiv:2302.10866*. ICML 2023. https://arxiv.org/abs/2302.10866 — "100x faster at sequence length 64K... Transformer quality with a 20% reduction in training compute."
[^32]: Lieber, O. et al. (2024). "Jamba: A Hybrid Transformer-Mamba Language Model." *arXiv:2403.19887*. AI21 Labs. https://arxiv.org/abs/2403.19887 — "strong results for up to 256K tokens context length... fits in a single 80GB GPU."
[^33]: Anthony, Q. et al. (2024). "Zamba: A Compact 7B SSM Hybrid Model." *arXiv:2405.16712*. Zyphra. https://arxiv.org/abs/2405.16712 — "significantly faster at inference than comparable transformer models and requires substantially less memory."
[^34]: Zuo, J. et al. (2024). "Falcon Mamba: The First Competitive Attention-Free 7B Language Model." *arXiv:2410.05355*. TII UAE. https://arxiv.org/abs/2410.05355 — "surpasses leading open-weight models based on Transformers, such as Mistral 7B, Llama3.1 8B, and Falcon2 11B."
---
*Cross-links: [[transformers-basics]] | [[computational-complexity]] | [[hybrid-models]] | [[real-world-products]]*
*Last updated: 2025 — sources current through Mamba-2 (ICML 2024) and Falcon Mamba 7B (Oct 2024)*
---
## Appendix: How Mamba Actually Trains Fast (The Hardware Secret)
> Source: jackcook.com/2024/02/23/mamba.html[^35]
### The Problem: Selectivity Breaks CNN Mode
S4 trains fast because its fixed A, B, C matrices let you precompute a convolution kernel K̄ = [CB, CAB, CA²B, ...] and do one big matrix multiply.
Mamba breaks this: because B and C are functions of the input, the kernel changes for every input, so there is nothing to precompute. Naively, a selective SSM must therefore train in "RNN mode" — sequential recurrence, one step at a time. **This is exactly why earlier input-dependent SSMs were slow.**
### The Insight: Recurrence = Prefix Sum
Mamba's recurrence:
```
h_t = Ā h_{t-1} + B̄ x_t
```
Is structurally identical to a **prefix sum** (cumulative sum):
```
h_t = h_{t-1} + x_t (simplified)
```
Prefix sums appear trivially sequential, but there's a well-known parallel algorithm for them: the **parallel scan** (Blelloch, 1990). With N processors, you can compute all N prefix sums in O(log N) steps.
```
Naive prefix sum: [a₁, a₁+a₂, a₁+a₂+a₃, ...] → O(N) serial
Parallel scan: Uses tree-reduction and tree-broadcast → O(log N) with N processors
```
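The SSM recurrence fits this framework because each step can be packaged as a pair (Āₜ, B̄ₜ·xₜ), and two consecutive steps fuse into one with an *associative* combine rule — and associativity is exactly what a parallel scan needs. Here is a small NumPy sketch of that combine rule for the scalar (diagonal) case; it is plain sequential Python, so it demonstrates the logic of the scan, not the GPU speed:

```python
import numpy as np

rng = np.random.default_rng(0)
L = 8
a = rng.uniform(0.5, 1.0, size=L)    # per-step decay factors (the role of Ā_t)
b = rng.normal(size=L)               # per-step input contributions (the role of B̄_t·x_t)

# Two consecutive steps (a1, b1) then (a2, b2) are equivalent to one fused step:
def combine(step1, step2):
    a1, b1 = step1
    a2, b2 = step2
    return (a1 * a2, a2 * b1 + b2)   # associative → can be evaluated as a tree in O(log L) depth

# Sequential reference: h_t = a_t·h_{t-1} + b_t
h, h_seq = 0.0, []
for t in range(L):
    h = a[t] * h + b[t]
    h_seq.append(h)

# Scan with the combine rule, starting from the identity element (1, 0)
acc, h_scan = (1.0, 0.0), []
for t in range(L):
    acc = combine(acc, (a[t], b[t]))
    h_scan.append(acc[1])

print(np.allclose(h_seq, h_scan))    # True — same hidden states, but now tree-parallelisable
```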
### The Hardware Innovation: IO-Aware Scan
The parallel scan algorithm was known, but a naive PyTorch implementation was *slower* than FlashAttention at all sequence lengths and ran out of memory beyond 128K tokens. Why?
**GPU memory hierarchy**:
- **HBM** (High-Bandwidth Memory): Large (40-80GB), slow (1-2 TB/s bandwidth)
- **SRAM** (on-chip cache): Tiny (20MB on A100), fast (19 TB/s bandwidth)
Most operations that look cheap (dropout, softmax, masking) are **memory-bound** — the GPU spends more time moving data between HBM and SRAM than computing.
Mamba's parallel scan moves ALL the state updates into SRAM, never writing intermediate hidden states back to HBM. This is the same key insight as FlashAttention: **minimize HBM traffic by doing as much work in SRAM as possible.**
The result: Mamba's scan outperforms FlashAttention-2 at sequence lengths > 2K, with **5× throughput at 2K** and **15× throughput at 16K tokens**.
### Summary: The Two Innovations That Make Mamba Work
1. **Parallel scan**: makes the selective recurrence trainable in parallel (not sequential)
2. **IO-aware CUDA kernel**: keeps intermediate states in fast SRAM, minimizing memory traffic
Neither innovation is unique to SSMs — both ideas come from the same GPU optimization tradition as FlashAttention. Mamba's authors (Gu + Dao, where Dao invented FlashAttention) were uniquely positioned to apply them.
[^35]: Jack Cook (2024). "Mamba: The Easy Way." https://jackcook.com/2024/02/23/mamba.html