# Computational Complexity: Transformers vs SSMs

> **Part of:** [[index]] | **See also:** [[ssm-basics]], [[transformers-basics]], [[sequence-processing-comparison]], [[strengths-and-weaknesses]], [[real-world-products]]

---

## The Core Problem in Plain English

When a language model reads text, it needs to figure out which words are related to which other words. The way a Transformer does this — comparing every word to every other word — turns out to be extraordinarily expensive. The way an SSM does this — keeping a running "summary" that updates one step at a time — is much cheaper. This difference in cost is the central engineering tension in modern AI.

---

## Part 1 — Transformer Complexity: O(n²)

### The "Everyone Shakes Hands with Everyone" Analogy

Imagine you're at a party with 10 guests. To make sure everyone meets everyone else, you need 45 handshakes (10×9÷2). Now double the guest list to 20 people: you need 190 handshakes — more than four times as many, even though you only doubled the guests.

This is exactly what a Transformer's attention mechanism does. Every single token (word, subword, or character) in the input must be compared to every other token. With a 1,000-token passage, that's roughly **1 million comparisons**. With a 10,000-token document, it's roughly **100 million comparisons** — not 10× more, but **100× more**.

```
10 tokens    →             100 comparisons
100 tokens   →          10,000 comparisons
1K tokens    →       1,000,000 comparisons
10K tokens   →     100,000,000 comparisons
100K tokens  →  10,000,000,000 comparisons   ← this is where GPUs cry
```

Mathematically, this relationship is written **O(n²)** — spoken aloud as "order n-squared" or "quadratic complexity." It means: *if you double the input length, you quadruple the work.*

### Why Doubling Context = 4× More Work

Here's the doubling rule made concrete:

| Context Length | Attention Operations (relative) | Memory Needed (relative) |
|----------------|---------------------------------|--------------------------|
| 1× (baseline)  | 1×                              | 1×                       |
| 2×             | **4×**                          | **4×**                   |
| 4×             | 16×                             | 16×                      |
| 8×             | 64×                             | 64×                      |
| 10×            | 100×                            | 100×                     |

This means going from GPT-2's 1,024-token context to GPT-4's 128,000-token context isn't "125× harder" — it's closer to **15,000× harder** for the raw attention computation. That's why so much engineering effort (FlashAttention, sparse attention, sliding windows) has gone into taming this quadratic growth.[^flashattn]

### Memory: Why 100K Tokens Won't Fit in RAM Easily

The attention mechanism requires storing a matrix of size n × n (the "attention map") during computation. Here's what that looks like in real GPU memory (fp16 precision, one attention head):

| Sequence Length  | Attention Map Size (1 head, fp16) |
|------------------|-----------------------------------|
| 1,024 tokens     | ~2 MB                             |
| 8,192 tokens     | ~128 MB                           |
| 32,768 tokens    | ~2 GB                             |
| 131,072 tokens   | ~32 GB                            |
| 1,000,000 tokens | **~2 TB** — no GPU has this       |

Modern LLMs have **dozens of attention heads across dozens of layers**. A GPT-3–sized model (96 heads per layer, 96 layers) processing 8K tokens materializes on the order of **12 GB of attention maps per layer** — before even accounting for weights, activations, optimizer states, or the KV cache.[^transformer-efficiency]
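The table above can be reproduced with a quick back-of-the-envelope calculation. Below is a minimal sketch, assuming fp16 storage (2 bytes per score) and a single attention head; the function name is purely illustrative:

```python
def attention_map_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    # One n × n attention map, one head, one layer; fp16 = 2 bytes per score.
    return seq_len * seq_len * bytes_per_element

for n in (1_024, 8_192, 32_768, 131_072, 1_000_000):
    mib = attention_map_bytes(n) / 2**20
    print(f"{n:>9} tokens -> {mib:>12,.0f} MiB per head, per layer")
```

Multiply by the number of attention heads (and, during training, by however many layers' maps must be kept for the backward pass) to see how quickly this adds up.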
> [!tip] The KV Cache Problem
> During inference (when the model generates text), Transformers cache the "Key" and "Value" computations for every past token in a structure called the **KV cache**. This cache grows linearly with context length — but for each new token generated, the model must read the *entire* cache to compute attention. At 100K tokens, the KV cache for a typical 13B model can exceed **50 GB**, far more than any consumer GPU can hold.[^mamba]

### FlashAttention: A Clever Workaround (Not a Cure)

FlashAttention (Dao et al., 2022) was a breakthrough in making the quadratic computation *faster in practice* by exploiting GPU memory hierarchy — keeping computations in fast on-chip SRAM rather than shipping data back and forth to slower HBM (High-Bandwidth Memory). This achieved:

- **15% end-to-end speedup** on BERT-large (sequence length 512)
- **3× speedup** on GPT-2 (sequence length 1,024)
- **2.4× speedup** on long-range tasks (sequence length 1K–4K)

FlashAttention-2 (2023) pushed further, yielding ~2× improvement over FlashAttention-1.[^flashattn2]

**Crucially: FlashAttention does not change the O(n²) complexity.** It reduces constant factors and memory I/O costs, but the quadratic relationship remains. Double the context, and you still quadruple the work. It makes the wall less painful to hit — it doesn't move the wall.

---

## Part 2 — SSM Complexity: O(n)

### The "One Person Takes Notes for Everyone" Analogy

Now imagine a different party strategy. Instead of having everyone shake hands with everyone else, you hire **one very good secretary** who circulates through the room. They talk to each guest briefly, update their notepad with the most important things they've learned, and move on. By the end of the party, they have a single notepad with a compressed summary of everything that happened.

This is how a State Space Model works. Rather than comparing every token to every other token, the SSM maintains a **hidden state** — a fixed-size "notepad" — that gets updated as each new token arrives. The notepad size stays constant. Processing the 1,000th token takes exactly the same amount of computation as processing the 1st token.

```
10 tokens    →      10 notepad updates
100 tokens   →     100 notepad updates
1K tokens    →   1,000 notepad updates
10K tokens   →  10,000 notepad updates    ← 10× the tokens = 10× the work
100K tokens  → 100,000 notepad updates    ← exactly as expected
```

Mathematically: **O(n)** — "linear complexity." *Double the input, double the work. Nothing more.*
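To make the "notepad" concrete, here is a toy diagonal linear SSM recurrence in Python. This is a minimal sketch with made-up dynamics (it is not Mamba's selective, hardware-aware kernel): the state has a fixed size no matter how long the input is, and each token costs the same small amount of work.

```python
import numpy as np

def ssm_scan(inputs, A, B, C):
    """Toy diagonal linear SSM: h_t = A*h_{t-1} + B*x_t,  y_t = C·h_t."""
    h = np.zeros_like(A)           # the fixed-size "notepad" (d_state values)
    outputs = []
    for x_t in inputs:             # one cheap update per token → O(n) total
        h = A * h + B * x_t        # elementwise update; h never grows
        outputs.append(C @ h)      # read a summary value off the notepad
    return np.array(outputs)

# 100K scalar tokens, a 16-dimensional state: work grows linearly with length,
# while the memory for the state stays constant.
d_state = 16
A = np.full(d_state, 0.9)          # per-channel decay (made-up values)
B = np.ones(d_state)
C = np.ones(d_state) / d_state
y = ssm_scan(np.random.randn(100_000), A, B, C)
print(y.shape)                     # (100000,)
```

Real SSMs such as S4 and Mamba use learned (and, in Mamba, input-dependent) versions of `A`, `B`, and `C`, but the fixed-size-state property is exactly the same.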
### Why Linear Scaling Is Transformative

The practical consequences are profound. Below is a qualitative summary of throughput results reported in the Mamba paper[^mamba] at different sequence lengths, comparing a 1.4B-parameter Mamba model to a comparable Transformer:

| Sequence Length | Transformer Throughput | Mamba Throughput | Mamba Speedup |
|-----------------|------------------------|------------------|---------------|
| 2,048 tokens    | ~baseline              | ~2×              | 2×            |
| 8,192 tokens    | drops significantly    | stays flat       | ~3–4×         |
| 16,384 tokens   | drops further          | stays flat       | ~5×           |
| 131,072 tokens  | near-unusable          | stays flat       | **>5×**       |

> [!quote] From the Mamba Paper
> "Mamba enjoys fast inference (5× higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences."[^mamba]

The S4 model (an earlier SSM from the same lineage) demonstrated similar advantages on generation tasks, achieving **60× faster generation** than Transformer baselines on certain tasks while maintaining comparable accuracy.[^s4]

---

## Part 3 — Side-by-Side Comparison Table

| Dimension | Transformer | SSM (e.g., Mamba) |
|-----------|-------------|-------------------|
| **Complexity (time)** | O(n²) — quadratic | O(n) — linear |
| **Complexity (memory)** | O(n²) for attention maps | O(1) for state, O(n) for sequence |
| **Training speed** | Highly parallelizable (fast for short–medium sequences) | Parallelizable via scan algorithms (fast across all lengths; see the sketch below) |
| **Inference speed** | Slows with context length; KV cache grows unboundedly | Constant per-step cost; no accumulating cache |
| **Max practical context** | ~128K tokens (GPT-4), ~2M (Gemini 1.5) with enormous hardware | Theoretically unbounded; demonstrated to 1M+ tokens[^mamba] |
| **Memory at 100K tokens** | Tens of GB just for attention + KV cache | Gigabytes for weights only; state is fixed-size |
| **Throughput gain vs baseline** | Baseline | 2–5× at long sequences[^mamba]; 60× on some tasks[^s4] |
| **Scaling with sequence length** | Quadratically worse | Linear, as expected |
| **Hardware requirements** | Needs large GPU memory for long contexts | Fits longer contexts in the same hardware |
| **Short sequences** | Effectively tied (quadratic is fine at small n) | Slight overhead from the scan algorithm |
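The "parallelizable via scan algorithms" row deserves a concrete illustration, since a recurrence sounds inherently sequential. The trick is that each per-token update is an affine map on the state, and composing affine maps is associative, so the running composition can be computed with a parallel (tree-shaped) prefix scan instead of a token-by-token loop. A minimal sketch, written sequentially for clarity (the associative operator, not the loop, is the point):

```python
import numpy as np

def combine(f1, f2):
    # Compose two affine steps h -> a*h + b (apply f1 first, then f2).
    a1, b1 = f1
    a2, b2 = f2
    return a1 * a2, a2 * b1 + b2

def recurrence_via_scan(a, b, h0=0.0):
    """Compute h_t = a_t*h_{t-1} + b_t by prefix-scanning affine maps.

    Because `combine` is associative, the same prefix scan can be evaluated
    as a balanced tree on a GPU (a parallel scan), which is how SSM training
    avoids a sequential bottleneck despite the recurrent formulation."""
    cum = (1.0, 0.0)                  # identity map
    states = []
    for step in zip(a, b):
        cum = combine(cum, step)      # running composition of all steps so far
        A, B = cum
        states.append(A * h0 + B)     # h_t = composed map applied to h_0
    return np.array(states)

# Sanity check against the naive recurrent loop.
a = np.random.rand(1_000) * 0.99
b = np.random.randn(1_000)
h, naive = 0.0, []
for a_t, b_t in zip(a, b):
    h = a_t * h + b_t
    naive.append(h)
assert np.allclose(recurrence_via_scan(a, b), naive)
```

Mamba's hardware-aware kernel builds on essentially this idea, fusing a parallel scan so the state stays in fast SRAM.[^mamba]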
---

## Part 4 — Real Benchmark Numbers

### Benchmark 1: Mamba vs. Transformer Throughput (Gu & Dao, 2023)[^mamba]

- **Models compared:** 1.4B Mamba vs 1.4B Transformer
- **Result:** At sequence length 16K, Mamba achieves **~5× higher throughput** (tokens/second) than the Transformer.
- **Memory:** Mamba's memory usage grows linearly with sequence length; the Transformer's grows quadratically, leading to OOM (out-of-memory) errors at lengths where Mamba still runs comfortably.

### Benchmark 2: Mamba-2 Core Layer Speed (Gu & Dao, 2024)[^mamba2]

- **Models compared:** Mamba-2 SSM layer vs Mamba-1 SSM layer
- **Result:** The Mamba-2 core layer is **2–8× faster** than Mamba-1's selective scan, achieved by reformulating the problem in terms of structured semiseparable matrices with better GPU kernel utilization.
- **Context:** Mamba-2 also introduces "state space duality" — showing SSMs and attention are related through matrix decompositions, which helps both architectures learn from each other.

### Benchmark 3: S4 Generation Speed (Gu et al., 2021)[^s4]

- **Task:** Long-range sequence generation
- **Result:** S4 generates sequences **60× faster** than Transformer baselines while achieving comparable perplexity.
- **Context:** This was among the first demonstrations that SSMs could match Transformer quality at dramatically lower cost.

### Benchmark 4: FlashAttention Speedups (Dao et al., 2022)[^flashattn]

- **Result:** 15% speedup on BERT-large (seq 512), 3× on GPT-2 (seq 1K), 2.4× on long-range arena (seq 1K–4K).
- **Key insight:** Even FlashAttention — one of the most effective Transformer optimizations — gains only 2–3×, and only through IO optimization rather than an algorithmic change in complexity. SSMs gain 5–60× through fundamentally different complexity.

### Benchmark 5: Jamba Hybrid — Long-Context at Scale (AI21, 2024)[^jamba]

- **Model:** Jamba (interleaved Transformer + Mamba + MoE)
- **Result:** Supports a 256K-token context, and fits contexts of up to ~140K tokens on a **single 80GB GPU** — far beyond what a pure Transformer of comparable size can manage.
- **Throughput:** 3× higher throughput than a comparable Transformer for long-context inputs.

### Benchmark 6: RWKV Linear Inference (Peng et al., 2023)[^rwkv]

- **Model:** RWKV-14B (14 billion parameters)
- **Inference:** Constant memory per step regardless of context length — the largest dense RNN-style (linear-time) model trained as of its publication.
- **Performance:** Competitive with similarly sized Transformers on NLP benchmarks, demonstrating that linear complexity can scale to frontier model sizes.

---

## Part 5 — The Quadratic Wall

The "quadratic wall" is the point at which Transformer complexity goes from being a manageable tax to being an impossible barrier. It's not a single fixed threshold — it depends on hardware, model size, and use case — but it manifests in several recognizable ways:

### In Production AI Systems

1. **Context limits are artificially low.** GPT-4 (128K context) and Claude 3 (~200K context) are impressive, but researchers and engineers regularly wish for million-token contexts for scientific papers, codebases, long novels, and legal documents. Quadratic complexity is the primary reason these aren't yet practical at consumer cost.[^transformer-efficiency]
2. **Cost grows super-linearly with use case.** A company using an LLM to analyze 50-page documents pays roughly 4× the attention compute of one analyzing 25-page documents — even though the documents are only 2× longer. This makes billing models and cost estimation difficult and can make certain applications economically unviable.
3. **Latency spikes with context.** Every token generated requires attending over the entire context window. A chatbot handling a long conversation doesn't just get slower by the same amount — it gets slower *faster*, with later turns taking disproportionately longer than earlier ones.
4. **Fine-tuning long-context models is extremely expensive.** Training on 32K-token examples costs 1,024× more attention compute than training on 1K-token examples. This is why most fine-tuned models are still trained at short context lengths even if inference supports longer ones.

### A Concrete Production Scenario

> Imagine a customer service chatbot with a 10,000-message conversation history that must be re-read for every new response. With a Transformer:
> - Message 1,000: fast response (~1 second)
> - Message 5,000: slow response (~25 seconds)
> - Message 10,000: extremely slow (~100 seconds, likely times out)
>
> With an SSM: each response takes roughly the same time regardless of history length.
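A toy latency model makes the shape of that curve obvious. The constants below are made up, chosen only to match the illustrative numbers in the scenario above; real latencies depend on hardware, batching, and caching, but the quadratic-versus-flat contrast is the point.

```python
def transformer_latency_s(history_len: int, k: float = 1e-6) -> float:
    # Re-encoding the whole history costs ~n² attention work, so ~k·n² seconds.
    return k * history_len ** 2

def ssm_latency_s(history_len: int, per_response: float = 1.0) -> float:
    # Fixed-size state: every response costs roughly the same, regardless of history.
    return per_response

for n in (1_000, 5_000, 10_000):
    print(f"history={n:>6}: transformer ≈ {transformer_latency_s(n):5.0f}s, "
          f"ssm ≈ {ssm_latency_s(n):.0f}s")
```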
---

## Part 6 — Hardware Considerations

### GPU Memory Bandwidth vs. Compute

The Transformer and SSM architectures have different **hardware bottlenecks** — the resource that limits their speed. Understanding this explains why FlashAttention only partially solves the problem and why SSMs require different GPU optimizations.

#### Transformers: Compute-Bound (with Memory Bandwidth Pressure)

The dominant cost in attention is the matrix-matrix multiplications forming the attention scores (Q×Kᵀ and score×V). These are **compute-bound** operations — the GPU's arithmetic units (CUDA cores, tensor cores) are the bottleneck, not data movement.

However, the attention map itself (the n×n matrix) is too large to fit in fast on-chip SRAM, so it must be written to and read from the GPU's slower HBM (High-Bandwidth Memory). This **memory bandwidth bottleneck** is what FlashAttention addresses — it tiles the computation to minimize trips between HBM and SRAM.

```
GPU Architecture (simplified):
┌────────────────────────────────────────────────────┐
│  Compute Cores (TFLOPS of arithmetic)              │
│         ↕  (fast, ~19 TB/s)                        │
│  On-chip SRAM (~20 MB on A100)                     │
│         ↕  (slower, ~2 TB/s)                       │
│  HBM (80 GB on A100)  ← attention maps live here   │
└────────────────────────────────────────────────────┘
```

At long sequences, attention maps don't fit in SRAM, so HBM bandwidth becomes the bottleneck. This is the specific inefficiency FlashAttention solves by **fusing** the attention computation to avoid materializing the full n×n matrix.[^flashattn]

#### SSMs: Memory-Bound

SSMs work quite differently. Their core computation — a **scan** operation that propagates the hidden state step by step — involves small, sequential memory accesses rather than large matrix multiplications. This makes SSMs **memory-bandwidth bound** rather than compute-bound.

The implication: to fully utilize modern GPUs (which are optimized for massive parallel matrix ops), SSMs need specialized CUDA kernels that schedule the scan efficiently. The Mamba paper includes a **hardware-aware parallel scan algorithm** specifically designed for this. Without it, a naive SSM implementation would actually be *slower* than a Transformer despite lower theoretical complexity.[^mamba]

| Characteristic | Transformer (Attention) | SSM (e.g., Mamba) |
|----------------|-------------------------|-------------------|
| **Primary bottleneck** | Compute (matrix multiply) + HBM bandwidth at long seqs | Memory bandwidth (sequential scan) |
| **GPU utilization** | High on tensor cores (short seqs) → drops at long seqs | Moderate but constant |
| **SRAM fit** | Attention map often too large → HBM spill | Hidden state always fits in SRAM |
| **Best hardware** | A100/H100 with large HBM and fast tensor cores | Any GPU; benefits from high memory bandwidth |
| **Parallelism type** | Data parallel + model parallel, easy to scale | Requires careful scan parallelization |
| **KV cache** | Grows with sequence length → memory wall | No KV cache; fixed state size |

### Practical Implications for GPU Memory

For running inference on a model with a 128K-token context (a calculation sketch follows this list):

- **Pure Transformer (e.g., Llama-3 70B at 128K context):** KV cache ≈ 128K tokens × 80 layers × 8 KV heads × 128 head_dim × 2 (keys and values) × 2 bytes (fp16) ≈ **~40 GB** just for the cache, on top of ~140 GB for weights. This requires **multiple A100/H100 GPUs** and sophisticated tensor parallelism.
- **SSM (e.g., Mamba at 128K context):** Hidden state size is fixed — typically no more than a few hundred MB regardless of sequence length. Weights for a 3B Mamba model fit in ~6 GB. Running 128K-context inference fits on **a single consumer GPU**.
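As a sanity check on the Transformer bullet above, here is a minimal KV-cache calculator. It assumes batch size 1, fp16, grouped-query attention with the listed head counts, and no cache quantization or offloading:

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_element: int = 2) -> int:
    # Keys AND values (factor of 2) for every layer, every KV head, every past token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_element

# Llama-3-70B-like configuration: 80 layers, 8 KV heads (GQA), head_dim 128.
gib = kv_cache_bytes(seq_len=131_072, n_layers=80, n_kv_heads=8, head_dim=128) / 2**30
print(f"KV cache at 128K tokens: ~{gib:.0f} GiB (batch size 1)")   # ~40 GiB
```

Every extra sequence in the batch multiplies this figure again, which is why long-context Transformer serving is so memory-hungry; an SSM's fixed-size state has no comparable term.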
---

## Summary

The quadratic wall is not a theoretical curiosity — it is the central practical constraint shaping the landscape of modern AI. Every long-context feature in GPT-4, every engineering decision at Anthropic about Claude's context window, every discussion about AI for scientific research runs into this wall.

SSMs offer a principled escape route: by compressing sequence history into a fixed-size state rather than attending over everything, they keep compute growing only linearly as the sequence grows. The cost is a form of lossy compression — the state can't remember everything with perfect fidelity. How much that matters depends on the task. See [[sequence-processing-comparison]] for a deep dive on exactly this tradeoff.

---

## Footnotes

[^mamba]: Gu, A. & Dao, T. (2023). *Mamba: Linear-Time Sequence Modeling with Selective State Spaces.* arXiv:2312.00752. https://arxiv.org/abs/2312.00752
[^mamba2]: Dao, T. & Gu, A. (2024). *Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality.* arXiv:2405.21060. ICML 2024. https://arxiv.org/abs/2405.21060
[^s4]: Gu, A., Goel, K., & Ré, C. (2021). *Efficiently Modeling Long Sequences with Structured State Spaces.* arXiv:2111.00396. ICLR 2022 (Outstanding Paper HM). https://arxiv.org/abs/2111.00396
[^flashattn]: Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). *FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.* arXiv:2205.14135. https://arxiv.org/abs/2205.14135
[^flashattn2]: Dao, T. (2023). *FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.* https://tridao.me/publications/flash2/flash2.pdf
[^jamba]: Lieber, O. et al. (2024). *Jamba: A Hybrid Transformer-Mamba Language Model.* arXiv:2403.19887. https://arxiv.org/abs/2403.19887
[^rwkv]: Peng, B. et al. (2023). *RWKV: Reinventing RNNs for the Transformer Era.* arXiv:2305.13048. https://arxiv.org/abs/2305.13048
[^transformer-efficiency]: Zhuang, B. et al. (2023). *A Survey on Efficient Training of Transformers.* arXiv:2302.01107. IJCAI 2023. https://arxiv.org/abs/2302.01107