# Sequence Processing: How Transformers and SSMs Handle Information Differently
> **Part of:** [[index]] | **See also:** [[transformers-basics]], [[ssm-basics]], [[computational-complexity]], [[strengths-and-weaknesses]], [[real-world-products]]
---
## The Central Contrast
Two architects build libraries. The first builds a **reading room** with every book laid open on a massive table — any reader can glance at any book at any moment, but the table can only hold so many books before it runs out of space. The second builds a **scholar's notebook** — as they read each book, they write notes summarizing what they've learned; by the end, the shelf of books can be infinite, but the notebook only captures what seemed important.
This is the essential difference between Transformers and State Space Models.
---
## Part 1 — Parallel vs. Sequential Processing
### Transformers: The Simultaneous Reader
When a Transformer processes text, it **reads the whole thing at once** — or rather, it performs all the pairwise comparisons (every token vs. every other token) in a massively parallel operation across GPU cores. Think of it like reading a page by taking a photograph of the whole page at once, rather than scanning it left to right.
```
Transformer processing "The cat sat on the mat":
The cat sat on the mat
│ │ │ │ │ │
└────┴────┴────┴────┴────┘ ← Every word "attends" to every other word
simultaneously
```
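To make "every pairwise comparison at once" concrete, here is a minimal single-head self-attention sketch in NumPy. It is illustrative only: the dimensions are toy-sized, and real models add multiple heads, causal masking, and much larger learned projections.
```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a whole sequence in one shot.

    X: (n, d) token embeddings -- every token is available up front.
    Returns an (n, d) matrix in which each row mixes information from ALL tokens.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # (n, n): every token scored against every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum over all positions, one matmul

# Toy usage: 6 tokens ("The cat sat on the mat"), embedding dimension 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)  # shape (6, 8); the (6, 6) score matrix is the n x n cost
```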
During **training**, this parallelism is a massive advantage. All tokens in a training batch can be processed simultaneously, making full use of the thousands of GPU cores designed for parallel matrix operations. This is why Transformers train much faster than the RNNs they replaced, even though the math is more expensive.[^attention-is-all-you-need]
During **inference** (generating new tokens one at a time), the picture changes. Generating the next word requires attending over all previous words — and each new word adds to the history. Technically, generation is still sequential (one token at a time), but the cost of each token grows with context length due to the growing KV cache. See [[computational-complexity]] for the memory implications.
### SSMs: The Sequential Note-Taker
An SSM processes tokens **one at a time**, left to right. After each token, it updates a **hidden state** — a fixed-size vector that encodes everything the model knows so far. The hidden state is like a highly compressed summary of all previous tokens.
```
SSM processing "The cat sat on the mat":
The → [state₁]
cat → [state₂] (state₁ + "cat")
sat → [state₃] (state₂ + "sat")
on → [state₄] ...
the → [state₅] ...
mat → [state₆] ← final state summarizes entire sequence
```
**During training**, this sequential dependency seems like a problem — you can't process token 5 until you've processed tokens 1–4. Modern SSMs sidestep this: S4 computes the entire output sequence at once as a global convolution, and Mamba (whose parameters depend on the input, so no fixed convolution kernel can be precomputed) uses a **parallel scan** that computes all state updates with a clever tree-structured computation.[^s4][^mamba] Both formulations are mathematically equivalent to the sequential recurrence but run in parallel on a GPU.
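As a toy illustration of the scan idea (not Mamba's actual fused kernel), the sketch below computes the scalar recurrence h_t = a_t · h_{t-1} + b_t two ways: the obvious sequential loop, and a prefix scan whose outer loop runs only O(log n) times, each iteration touching the whole array at once. All names and sizes here are illustrative.
```python
import numpy as np

def sequential_states(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t, one step at a time (h_0 = 0)."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return np.array(out)

def scan_states(a, b):
    """Same recurrence via an associative prefix scan.

    Each position carries a pair (A, B) meaning "h = A * h_start + B over this segment".
    Merging a left segment (A1, B1) with a right segment (A2, B2) gives
    (A1 * A2, A2 * B1 + B2). Only O(log n) merge rounds are needed, and each round
    is a whole-array operation -- this is what makes it GPU-friendly.
    """
    A, B = a.astype(float).copy(), b.astype(float).copy()
    d = 1
    while d < len(a):
        A_new, B_new = A.copy(), B.copy()
        A_new[d:] = A[:-d] * A[d:]
        B_new[d:] = A[d:] * B[:-d] + B[d:]
        A, B = A_new, B_new
        d *= 2
    return B  # with h_0 = 0, the accumulated B is exactly h_t

rng = np.random.default_rng(0)
a, b = rng.uniform(0.5, 1.0, size=64), rng.normal(size=64)
assert np.allclose(sequential_states(a, b), scan_states(a, b))
```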
**During inference**, the SSM's sequential nature becomes a genuine *advantage*. Generating each new token costs exactly the same as the last — no growing KV cache, no quadratic penalty, just one state update and one output.[^mamba]
### The Parallelism Paradox
| Phase | Transformer | SSM |
|-------|-------------|-----|
| **Training** | Fully parallel across all tokens ✓ | Parallel via scan algorithm ✓ |
| **Inference (generation)** | Sequential + growing KV cache ✗ | Sequential + constant state ✓ |
| **Inference (full sequence / prefill)** | Parallel (one forward pass) ✓ | Parallel via scan/convolution ✓ |
| **Streaming new tokens** | Expensive (re-attend over all context) ✗ | Cheap (one state update) ✓ |
---
## Part 2 — Context Window vs. Infinite Context
### The Transformer's Hard Limit: Context Window
Every Transformer has a **context window** — a hard upper bound on how many tokens it can "see" at once. This isn't a bug; it's a consequence of the architecture: attention computes an n × n matrix of pairwise scores, so compute and memory grow quadratically with length, and the model is only trained on positions up to a fixed maximum. Exceed the context window, and the model literally has no mechanism to include earlier tokens.[^transformer-efficiency]
Context windows in major models (as of 2024–2025):
| Model | Context Window |
|-------|---------------|
| GPT-2 (2019) | 1,024 tokens |
| GPT-3 (2020) | 2,048 tokens |
| GPT-4 (2023) | 8K → 128K tokens |
| Claude 3 (2024) | 200K tokens |
| Gemini 1.5 Pro (2024) | 1M tokens |
| GPT-4o (2024) | 128K tokens |
Extending context windows is extremely expensive. Going from 8K to 128K context is a 16× increase in length, and therefore a 256× increase in attention computation. Gemini 1.5's 1M-token context required substantial architectural work (Google's report describes a sparse mixture-of-experts design; other efficiency techniques are not publicly detailed) on top of enormous hardware investment.[^transformer-efficiency]
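A quick back-of-envelope calculation makes the scaling concrete. The model dimensions below (32 layers, a 4096-dimensional model, fp16 values) are assumed round numbers for illustration, not the specification of any particular production model:
```python
short_ctx, long_ctx = 8_000, 128_000

# Attention scores are pairwise, so compute scales with the square of length:
print((long_ctx / short_ctx) ** 2)       # 256.0 -> the 256x figure above

# The KV cache for generation grows linearly with context. Assumed dimensions:
# 32 layers, d_model = 4096, keys + values, 2 bytes per value (fp16).
layers, d_model, bytes_per_value = 32, 4096, 2
kv_bytes = long_ctx * layers * d_model * 2 * bytes_per_value
print(kv_bytes / 2**30)                  # ~62.5 GiB for a single 128K-token sequence
```
Production systems shrink this with techniques such as grouped-query attention and cache quantization, but the growth with context length remains linear.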
Within the context window, a Transformer's memory is **perfect and lossless** — it can attend to any token in its window with equal precision, regardless of position. Word #1 and word #127,999 are equally accessible. This is a profound strength for tasks like question answering over a known document.
Beyond the context window: **absolute amnesia.** Earlier tokens do not exist. The model cannot be aware of what it cannot see.
### The SSM's Alternative: Theoretically Infinite but Compressed
An SSM's hidden state persists indefinitely — there is no architectural context limit. The model can in principle process a document of a million tokens, ten million tokens, or more. The Mamba paper demonstrates quality improvement on sequences up to **one million tokens** in length.[^mamba]
But this theoretical infinity comes with a fundamental caveat: the hidden state has **finite size**. A typical Mamba model uses a state dimension of 16–64 per channel. This is like a notepad with a fixed number of pages — it can summarize any length of reading, but older details get compressed and eventually overwritten as newer information arrives.
The SSM does not simply "forget" old tokens — it incorporates them into the state in a mathematically principled way, using the HiPPO framework (High-Order Polynomial Projection Operators) to optimally compress history.[^hippo] But the compression is real: a distant event is represented less faithfully than a recent one, and some information is inevitably lost.
```
SSM memory over time (conceptual; distance measured back from the current token):
1 token back       ██████████ (recent, so high fidelity)
100 tokens back    ████░░░░░░ (compressed)
1,000 tokens back  ██░░░░░░░░ (further compressed)
9,000 tokens back  █░░░░░░░░░ (faint echo)
Transformer memory (assuming a 10K-token window):
1 token back       ██████████ (perfect recall within context window)
1,000 tokens back  ██████████ (still perfect)
10,001 tokens back ✗✗✗✗✗✗✗✗ (OUTSIDE WINDOW — doesn't exist)
```
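A toy calculation shows why the brushstrokes fade. If a (non-selective) SSM updated a scalar state as h_t = a · h_{t-1} + u_t with a fixed decay a slightly below 1, the contribution of a token seen k steps ago would be weighted by a^k. Real SSMs use vector states with many channels and learned (or, in Mamba, input-dependent) dynamics, so the single decay value below is purely illustrative:
```python
a = 0.995  # illustrative fixed decay per step
for k in [1, 100, 1_000, 9_000]:
    print(f"{k:>5} steps back: relative weight {a**k:.2e}")
# 1 step back ~ 1.0, 100 steps ~ 0.61, 1,000 steps ~ 6.7e-03, 9,000 steps ~ 2.6e-20
```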
---
## Part 3 — What Gets Remembered, What Gets Forgotten
### The "Filing Cabinet vs. Notebook" Metaphor
Picture a **Transformer** as a perfectly organized filing cabinet in a small room. Every document you've ever filed in that room is instantly accessible, perfectly intact, retrievable in any order. But the room has walls — when you run out of space, older files must be removed to make room for new ones. While a file is in the cabinet, you can find it instantly and read every word exactly as written.
An **SSM** is like a scholar's **notebook**. The scholar reads everything and takes diligent notes. They can write notes about anything — an ancient document, a recent tweet, a conversation from yesterday — and their notebook never gets full, because they're always summarizing rather than copying verbatim. But the notes from last year are more abbreviated than the notes from this morning. Some nuances from early reading don't make it into the notes. And if a very specific fact from three years ago is needed, the scholar may have captured the gist but lost the exact wording.
| Dimension | Transformer (Filing Cabinet) | SSM (Notebook) |
|-----------|------------------------------|----------------|
| **Memory fidelity** | Perfect within window | Compressed/lossy |
| **Memory capacity** | Hard limit (context window) | Soft limit (state size) |
| **Access pattern** | Random access — any token equally | Recency-weighted — recent > old |
| **What you lose** | Everything outside window | Exact details from distant past |
| **What you keep** | Verbatim copy of everything inside window | Approximated summary of all history |
### The "Photography vs. Impressionist Painting" Metaphor
Another way to think about it:
A **Transformer** takes a **photograph** of its context window. Everything in the frame is captured in full detail — colors, textures, fine print, background blur. But the photograph can only capture a certain field of view. Anything outside the frame doesn't appear at all.
An **SSM** paints an **impressionist canvas** of everything it has ever encountered. The painting can grow to cover infinite time and space — but it's an impression, not a record. Recent events appear in vivid detail; distant ones are rendered in broad brushstrokes. The painting can suggest the shape of an early event, its emotional weight, its approximate timing — but not its exact wording.
This is not a pure weakness. For tasks where **approximate long-range awareness** is more valuable than **exact short-range recall**, the impressionist painting wins. For tasks where a specific detail must be retrieved exactly, the photograph wins.
### Mamba's Innovation: Selective State Spaces
One of the key weaknesses of early SSMs was that they compressed *everything* equally — both important and unimportant tokens got folded into the state indiscriminately. Mamba introduced **selective state spaces**: the SSM parameters (how strongly to update the state, what to keep vs. discard) are *functions of the input token itself*.[^mamba]
This means: when a very important token arrives, Mamba can "pay attention" to it and update the state more strongly. When a stop word or filler token arrives, Mamba can mostly ignore it. This makes the impressionist painting much more accurate — the important details come through clearly, even from long ago.
> [!important] Selective vs. Non-Selective SSMs
> Pre-Mamba SSMs (S4, S4D, etc.) had **time-invariant** parameters: the same update equations applied regardless of what the current token was. Mamba's key innovation was making parameters **input-dependent** (time-varying), enabling selective retention of information. This is what allowed it to match Transformer performance on language tasks where earlier SSMs fell short.
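To illustrate the *idea* of selectivity, here is a deliberately simplified, gated state update in NumPy. This is not Mamba's actual parameterization (Mamba derives an input-dependent step size Δ and input-dependent B and C from each token); the gate below is a hypothetical stand-in that captures the "write important tokens strongly, mostly ignore filler" behavior:
```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_in = 16, 32
W_gate = 0.1 * rng.normal(size=d_in)            # hypothetical gate weights (illustration only)
W_in = 0.1 * rng.normal(size=(d_state, d_in))   # hypothetical input projection

def selective_step(h, u):
    """One input-dependent state update (simplified, not Mamba's exact equations).

    The gate is computed from the current token u itself: a large gate writes the
    token strongly into the state, a small gate leaves the state almost untouched.
    """
    gate = 1.0 / (1.0 + np.exp(-(W_gate @ u)))  # scalar in (0, 1), a function of the input
    return (1.0 - gate) * h + gate * (W_in @ u)

h = np.zeros(d_state)
for token_vec in rng.normal(size=(6, d_in)):    # a toy 6-token sequence
    h = selective_step(h, token_vec)            # state size never grows
```
A filler token that yields a small gate barely disturbs the state, while a salient token with a large gate overwrites part of it; that is the selective retention described in the callout above.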
---
## Part 4 — Long-Range Dependencies
### What Is a Long-Range Dependency?
A long-range dependency is when understanding a word or phrase later in a sequence requires remembering something said much earlier. Examples:
- *"The trophy didn't fit in the suitcase because **it** was too big."* — "it" refers to "trophy," not "suitcase." Resolving this requires remembering both nouns.
- *"She left for Paris in January. By March, **she** had settled into the apartment."* — resolving "she" and "apartment" requires memory of January's sentence.
- In code: *`int x = 5; ... (200 lines) ... return x + 1;`* — understanding the return value requires remembering the initialization.
- In legal text: *"Notwithstanding subsection 4(a)(ii)..."* — requires knowing exactly what subsection 4(a)(ii) said.
### How Transformers Handle Long-Range Dependencies
Transformers handle long-range dependencies essentially **perfectly** within their context window, because attention gives every token a **direct connection** to every other token. The attention head resolving "it → trophy" can look directly at the trophy token regardless of how many tokens are between them. Distance is irrelevant — the attention matrix has n² entries, and any of them can be non-zero.
This direct access is the Transformer's greatest strength: tasks that hinge on recalling one specific earlier token (coreference, quoting a clause, looking up a variable's definition) are a single attention lookup away, provided that token is still inside the window.
The failure mode: if the dependency spans more tokens than the context window, the Transformer literally cannot resolve it. The old token has been evicted from the window and no longer exists from the model's perspective.
### How SSMs Handle Long-Range Dependencies
SSMs handle long-range dependencies through **state propagation** — the early token's information is folded into the hidden state and carried forward. The HiPPO framework mathematically guarantees that the state can represent the full history of a signal as a polynomial approximation, maintaining a principled form of long-range memory.[^hippo]
In practice, early SSMs (before Mamba) struggled with language tasks that required very specific long-range recall — they could maintain a vague impression of earlier context but couldn't reliably retrieve precise details. Mamba's selectivity significantly improved this, but the fundamental nature of compressed memory remains.
Long Range Arena benchmark results (approximate)[^lra]:
| Task | Best Transformer | Best SSM (S4) |
|------|-----------------|---------------|
| ListOps | 37.1% | **59.6%** |
| Text classification | 65.0% | **86.8%** |
| Retrieval | 81.6% | **90.9%** |
| Path-X (16K tokens) | fails | **61.4%** |
| Pathfinder | 74.2% | **96.4%** |
> [!note]
> SSMs can actually *outperform* Transformers on Long Range Arena tasks. When the dependency truly spans thousands of tokens, quadratic attention becomes both expensive and difficult to train effectively at that length, while the SSM's compressed but unbounded memory handles it well.
---
## Part 5 — In-Context Learning
### What Is In-Context Learning?
In-context learning (ICL) is the ability of a model to learn a new task from **examples provided in the prompt**, without any weight updates. You show the model 3–5 examples of input/output pairs, and it infers the pattern and applies it to new inputs.
Example:
```
Translate English to French:
"Hello" → "Bonjour"
"Goodbye" → "Au revoir"
"Thank you" → ???
```
The model should answer "Merci" — having learned the translation task from the few examples, entirely within its context window.
### Why Transformers Excel at In-Context Learning
Transformers are remarkably good at ICL for a reason rooted in their attention mechanism: they can **directly compare** the new query against all provided examples. When processing the final "Thank you →", the attention mechanism can look directly at every worked example in the prompt and extract the pattern "English word maps to French word."
This direct comparison between examples and query is exactly what attention is designed for. The examples act as a "reference database" that attention can query. Research suggests Transformers can learn to implement approximations of gradient descent in-context — essentially doing few-shot learning as an emergent capability of the attention mechanism.[^icl]
### Why SSMs Struggle More with In-Context Learning
SSMs process examples sequentially, folding each one into the hidden state. By the time the model reaches the final query, the first examples have been compressed and partially overwritten. The model cannot directly "look back" at example 1 to compare it against the query the way a Transformer can.
This is a genuine and well-documented weakness. Research has shown that as the number of in-context examples grows, SSMs' performance on ICL tasks degrades more quickly than Transformers'.[^icl-ssm] The compressed memory makes it harder to maintain distinct, precise representations of each example.
**Practical implication:** For tasks like few-shot prompting, retrieval-augmented generation over specific documents, or anything requiring exact recall of earlier text, Transformers are generally preferable. This is why most production LLM applications (ChatGPT, Claude, Gemini) still use Transformer-based architectures.
---
## Part 6 — Streaming and Real-Time Use Cases
### The Transformer's Streaming Problem
For a Transformer to respond to a new message in a long conversation, it must:
1. Collect the entire conversation history
2. Run a full forward pass over all tokens (O(n²) attention)
3. Generate a response
For a short conversation (say, 2K tokens), step 2 is fast. For a long one (50K tokens), it becomes expensive enough to noticeably slow response times; even with KV caching, per-token cost and cache size keep growing with the length of the history. For a system processing **streaming audio** (audio arriving as a continuous real-time stream) or **live sensor data** (stock prices, network packets, heart rate), it's worse still: attention over an ever-growing history must be recomputed or cached for every new data point, which quickly becomes untenable.
### Why SSMs Excel at Streaming
SSMs are **naturally streaming architectures**. When a new token arrives:
1. Compute the state update: new_state = A × old_state + B × new_token (one small matrix multiply)
2. Compute the output: y = C × new_state (one small matrix multiply)
3. Done.
This is O(1) per new token — constant time regardless of how long the stream has been running. There is no "replay the entire history" step. The SSM's hidden state encodes everything it needs to know, and it's updated incrementally.
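A minimal recurrent-mode sketch of those three steps (NumPy; the A, B, C values and shapes are illustrative placeholders, and a real Mamba layer would additionally make them input-dependent):
```python
import numpy as np

class StreamingSSM:
    """Recurrent-mode SSM: constant work and constant memory per incoming token."""

    def __init__(self, A, B, C):
        self.A, self.B, self.C = A, B, C
        self.h = np.zeros(A.shape[0])            # fixed-size hidden state

    def step(self, u):
        self.h = self.A @ self.h + self.B * u    # 1. state update (one small matrix-vector product)
        return self.C @ self.h                   # 2. readout; cost is independent of stream length

# Toy usage on a scalar stream (e.g. one sensor channel), illustrative values:
rng = np.random.default_rng(0)
n = 16
ssm = StreamingSSM(A=0.9 * np.eye(n), B=rng.normal(size=n), C=rng.normal(size=n))
for u in rng.normal(size=1_000):                 # 1,000 samples, one O(1) update each
    y = ssm.step(u)
```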
This makes SSMs ideal for:
- **Real-time speech recognition** — process audio frame by frame[^mamba]
- **Live log analysis** — process log lines as they arrive, maintain running anomaly detection
- **Genomics** — process DNA sequences that can be millions of base pairs long[^mamba]
- **Financial time series** — process tick data with no lookback limit
- **Wearable health sensors** — continuous monitoring without batch accumulation
- **Robotics/control systems** — react to continuous sensory inputs in real time
### The Mamba Paper's Genomics Result
One of the most striking demonstrations in the Mamba paper is on genomic sequences, where SSMs have a natural advantage: genomes are millions of base pairs long, with functional dependencies that span thousands to millions of positions. Mamba's quality was shown to keep improving as context grew, all the way up to sequences **1 million tokens** long, far beyond what dense quadratic attention can practically process.[^mamba]
### Streaming Comparison
| Use Case | Transformer | SSM |
|----------|-------------|-----|
| **Chat (short context)** | ✓ Fast, excellent quality | ✓ Comparable quality |
| **Chat (long context)** | Slow, expensive | ✓ Fast, constant cost |
| **Real-time audio processing** | ✗ Not suitable | ✓ Ideal |
| **Live sensor streams** | ✗ Not suitable | ✓ Ideal |
| **Genomics (1M+ tokens)** | ✗ Infeasible | ✓ Demonstrated[^mamba] |
| **Few-shot prompting** | ✓ Strong in-context learning | Weaker at ICL |
| **Exact document recall** | ✓ Perfect within window | Lossy beyond a point |
| **Code completion (short files)** | ✓ Excellent | ✓ Competitive |
| **Code analysis (full codebases)** | Limited by window | ✓ Can process more |
---
## Part 7 — The Hybrid Resolution
Given that each architecture excels at different things, recent work has combined them. The **Jamba** architecture from AI21 Labs interleaves Transformer and Mamba layers, enabling:
- **Transformer layers** for precise, short-range attention and in-context learning
- **Mamba layers** for efficient long-range propagation and constant-cost streaming
- **Mixture-of-Experts (MoE)** for additional capacity without proportional compute cost
The result: 256K context length on a **single 80GB GPU** — something no pure Transformer achieves.[^jamba] The hybrid approach essentially uses each architecture where it is strongest.
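As an illustration of the interleaving, the sketch below loosely follows the Jamba paper's description of its blocks (mostly Mamba layers, an occasional attention layer, MoE on some of the MLPs); the exact counts and ordering should be treated as illustrative, and the layer bodies are identity placeholders rather than a real implementation:
```python
# Purely illustrative stand-ins; a trained model would use real layers here.
def attention_layer(x): return x   # placeholder: full self-attention (precise in-window lookups)
def mamba_layer(x):     return x   # placeholder: selective SSM layer (cheap long-range carry)
def moe_mlp(x):         return x   # placeholder: sparse mixture-of-experts MLP
def dense_mlp(x):       return x   # placeholder: ordinary MLP

def hybrid_block(x):
    """One hybrid block: several Mamba layers, one attention layer, MoE on alternating layers."""
    for i in range(8):
        x = attention_layer(x) if i == 0 else mamba_layer(x)
        x = moe_mlp(x) if i % 2 == 1 else dense_mlp(x)
    return x

hidden = [0.0] * 4            # stand-in for a hidden representation
hidden = hybrid_block(hidden)
```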
Similarly, **RWKV** reformulates its architecture so it can be run either as a Transformer (parallel during training) or as an RNN (constant-cost at inference), achieving 14B-parameter scale with linear complexity.[^rwkv]
This suggests the future is likely **not** "pure Transformers vs. pure SSMs" but rather hybrid architectures that preserve the best properties of both. See [[hybrid-models]] for more on this.
---
## Summary: The Two Architectures' Fundamental Tradeoff
| Dimension | Transformer | SSM |
|-----------|-------------|-----|
| **Memory model** | Exact photography of context window | Impressionist painting of all history |
| **Context limit** | Hard ceiling (context window) | Soft ceiling (state dimension) |
| **What's lost** | Everything outside window | Exact details from distant past |
| **Long-range within window** | Perfect, direct attention | Compressed propagation |
| **Streaming data** | Expensive (re-compute attention) | Native, O(1) per token |
| **In-context learning** | Strong (direct example comparison) | Weaker (compressed examples) |
| **Best for** | Precise reasoning over known context | Long/infinite sequences, streaming, real-time |
| **Analogy** | Filing cabinet in a small room | Scholar's ever-growing notebook |
The deep insight from the Mamba-2 paper is that these architectures are not as different as they appear — both can be understood through the lens of **structured matrix transformations** on sequence data.[^mamba2] They occupy different points in a design space, each optimized for a different set of constraints. The right choice depends entirely on the task: if you need to look up exact text from a document you just provided, use a Transformer. If you need to monitor a million-line log file or process a genome, use an SSM.
---
## Footnotes
[^mamba]: Gu, A. & Dao, T. (2023). *Mamba: Linear-Time Sequence Modeling with Selective State Spaces.* arXiv:2312.00752. https://arxiv.org/abs/2312.00752
[^mamba2]: Dao, T. & Gu, A. (2024). *Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality.* arXiv:2405.21060. ICML 2024. https://arxiv.org/abs/2405.21060
[^s4]: Gu, A., Goel, K., & Ré, C. (2021). *Efficiently Modeling Long Sequences with Structured State Spaces.* arXiv:2111.00396. ICLR 2022 (Outstanding Paper HM). https://arxiv.org/abs/2111.00396
[^hippo]: Gu, A. et al. (2020). *HiPPO: Recurrent Memory with Optimal Polynomial Projections.* NeurIPS 2020. The mathematical framework underlying SSMs' ability to approximate sequence history.
[^flashattn]: Dao, T. et al. (2022). *FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.* arXiv:2205.14135. https://arxiv.org/abs/2205.14135
[^jamba]: Lieber, O. et al. (2024). *Jamba: A Hybrid Transformer-Mamba Language Model.* arXiv:2403.19887. https://arxiv.org/abs/2403.19887
[^rwkv]: Peng, B. et al. (2023). *RWKV: Reinventing RNNs for the Transformer Era.* arXiv:2305.13048. https://arxiv.org/abs/2305.13048
[^attention-is-all-you-need]: Vaswani, A. et al. (2017). *Attention Is All You Need.* NeurIPS 2017. The foundational paper introducing the Transformer architecture.
[^transformer-efficiency]: Zhuang, B. et al. (2023). *A Survey on Efficient Training of Transformers.* arXiv:2302.01107. IJCAI 2023. https://arxiv.org/abs/2302.01107
[^lra]: Tay, Y. et al. (2021). *Long Range Arena: A Benchmark for Efficient Transformers.* ICLR 2021. The standard benchmark for comparing long-range sequence modeling architectures.
[^icl]: Akyürek, E. et al. (2022). *What Learning Algorithm Is In-Context Learning? Investigations with Linear Models.* Shows how Transformer attention can implement gradient descent in-context.
[^icl-ssm]: Research from multiple groups (2023–2024) showing SSMs underperform Transformers on in-context learning benchmarks with many examples. This remains an active area of investigation.