# Key Facts, Quotes, and Numbers for the Report
> [[index]] — all verified figures with sources for direct use in the report
## Milestone Numbers (Verified)
### S4 Benchmarks
- Sequential CIFAR-10: **91% accuracy** (no augmentation) — matches larger 2D ResNet[^1]
- Generation: **60× faster** than Transformers (at equal quality)[^1]
- Long Range Arena Path-X (16K length): **96.35% accuracy**; first model to score above chance, where all prior Transformer variants fail[^1]
- For comparison, FlashAttention later reached 61.4% on Path-X, and block-sparse FlashAttention reached 63.1% on Path-256 (64K length)[^2]
### Mamba Benchmarks
- At 1.4B parameters: matches GPT-3-style models in perplexity[^3]
- Inference throughput: **5× faster** than Transformer at 2K context[^3]
- Inference throughput: **15× faster** than Transformer at 16K context[^3]
- Memory: constant regardless of sequence length (vs a Transformer, whose KV cache grows linearly with sequence length)
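A back-of-the-envelope sketch of that contrast, using hypothetical hyperparameters and fp16 storage (none of these exact figures are from the Mamba paper; they only show the scaling behavior):

```python
def kv_cache_bytes(seq_len, n_layers=48, n_kv_heads=8, head_dim=128, bytes_per=2):
    # Transformer: one K and one V vector per token, per layer -> linear in seq_len
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per

def ssm_state_bytes(n_layers=48, d_inner=4096, d_state=16, bytes_per=2):
    # Mamba: a fixed-size recurrent state per layer -> independent of seq_len
    return n_layers * d_inner * d_state * bytes_per

for seq_len in (2_048, 16_384, 131_072):
    print(f"{seq_len:>7} tokens: "
          f"KV cache {kv_cache_bytes(seq_len) / 2**20:7.0f} MiB vs "
          f"SSM state {ssm_state_bytes() / 2**20:4.0f} MiB")
```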
### Mamba-2 (State Space Duality)
- Core layer: **2–8× faster** than Mamba-1[^4]
- Theoretical unification: shows SSMs and attention are special cases of semiseparable matrix operations
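A tiny numerical illustration of the matrix view behind that claim (scalar state per channel, hypothetical values): running the SSM recurrence gives the same output as multiplying the input by a lower-triangular 1-semiseparable matrix, the structure through which SSD relates SSMs and attention.

```python
import numpy as np

def ssm_scan(a, b, c, x):
    """Recurrence view: h_t = a_t*h_{t-1} + b_t*x_t,  y_t = c_t*h_t (scalar state)."""
    h, ys = 0.0, []
    for t in range(len(x)):
        h = a[t] * h + b[t] * x[t]
        ys.append(c[t] * h)
    return np.array(ys)

def ssm_as_matrix(a, b, c):
    """Matrix view: M[t, s] = c_t * (a_{s+1} * ... * a_t) * b_s for s <= t, else 0.
    M is 1-semiseparable, the class of structured matrices SSD builds on."""
    T = len(a)
    M = np.zeros((T, T))
    for t in range(T):
        for s in range(t + 1):
            M[t, s] = c[t] * np.prod(a[s + 1:t + 1]) * b[s]
    return M

rng = np.random.default_rng(0)
T = 6
a, b, c, x = (rng.uniform(0.5, 1.0, T), rng.normal(size=T),
              rng.normal(size=T), rng.normal(size=T))
assert np.allclose(ssm_scan(a, b, c, x), ssm_as_matrix(a, b, c) @ x)
```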
### Jamba
- Context: up to **256K tokens**; contexts of roughly 140K tokens fit on a single 80GB GPU[^5]
- Architecture: 52B total / 12B active (mixture of experts)
- A comparable pure Transformer would need 2 GPUs for the same context; Jamba runs it on 1
### Falcon Mamba 7B
- HuggingFace LLM Leaderboard v1 average: **64.1**
- LLaMA-3-8B (Transformer): **62.6** (Mamba wins despite being smaller!)
- Mistral-7B (Transformer): **61.0**
- Falcon2-11B (Transformer, larger): 64.3[^6]
### NVIDIA Hybrid Study (8B scale)
- **Mamba-2-Hybrid** (43% Mamba-2 + 7% attention + 50% MLP): **+2.65 pts** over pure Transformer[^7]
- Predicted inference: **8× faster** than pure Transformer at long contexts[^7]
- Trains at same speed as Transformer
### Zoology / Associative Recall
- **82% of the Transformer–SSM quality gap** explained by a single task: associative recall[^8]
- 70M attention model outperforms **1.4B** Hyena model on associative recall[^8]
- Hybrids with input-dependent attention: close **97.4% of the attention quality gap** sub-quadratically[^8]
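For reference, associative recall is a synthetic probe: key-value pairs appear earlier in the sequence, and the model must emit the value paired with a key that is queried later. A toy data generator (illustrative only, not the exact Zoology setup):

```python
import random

def make_ar_example(n_pairs=4, seed=0):
    """One associative-recall prompt: key-value pairs, then a repeated key.
    The correct continuation is the value originally paired with that key."""
    rng = random.Random(seed)
    keys = rng.sample("abcdefghij", n_pairs)
    vals = rng.sample("0123456789", n_pairs)
    query = rng.choice(keys)
    prompt = " ".join(f"{k} {v}" for k, v in zip(keys, vals)) + f" {query}"
    return prompt, vals[keys.index(query)]

prompt, answer = make_ar_example()
print(prompt, "->", answer)   # the model should output `answer` after seeing `prompt`
```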
### HyenaDNA (Genomics)
- Context: up to **1 million nucleotides** at single-nucleotide resolution[^9]
- Previous Transformer-based genomic models: max 4,096 tokens
- Training speed: **160× faster** than a Transformer at sequence length 1M[^9]
- Benchmark: SotA on **12/18 Nucleotide Transformer datasets**; beats SotA on 7/8 GenomicBenchmarks by +10 accuracy points[^9]
### Griffin (DeepMind)
- Matches **Llama-2 performance** on downstream tasks[^10]
- Trained on **6× fewer tokens** than Llama-2 (more data-efficient)[^10]
- Can extrapolate to sequences **longer than seen during training**[^10]
- Scaled to 14B parameters
### FlashAttention
- BERT-large training: **15% faster** than MLPerf 1.1 record[^2]
- GPT-2 training: **3× faster** than HuggingFace baseline[^2]
- **Important**: Still O(N²) compute; the speedup comes from IO-aware tiling that avoids reading and writing the N×N attention matrix to slow GPU memory
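A minimal NumPy sketch of the underlying trick (online softmax over key/value blocks), just to show why exact attention never needs the full N×N matrix in memory. This only illustrates the idea; the real kernel additionally tiles the queries and stages everything through on-chip SRAM.

```python
import numpy as np

def blockwise_attention(Q, K, V, block=128):
    """Exact softmax attention computed one key/value block at a time
    (online softmax), so the full N x N score matrix is never materialized."""
    N, d = Q.shape
    out = np.zeros_like(V, dtype=float)
    m = np.full(N, -np.inf)          # running row-wise max (for a stable softmax)
    l = np.zeros(N)                  # running softmax normalizer
    for s in range(0, K.shape[0], block):
        Kb, Vb = K[s:s + block], V[s:s + block]
        S = Q @ Kb.T / np.sqrt(d)                 # scores against this key block only
        m_new = np.maximum(m, S.max(axis=1))
        p = np.exp(S - m_new[:, None])            # this block's softmax numerators
        scale = np.exp(m - m_new)                 # rescale previously accumulated results
        l = l * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ Vb
        m = m_new
    return out / l[:, None]

# sanity check against the naive version that materializes the full score matrix
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(512, 64)) for _ in range(3))
P = np.exp(Q @ K.T / np.sqrt(64))
naive = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(blockwise_attention(Q, K, V), naive)
```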
### RWKV (14B)
- Linear scaling in both memory and compute during inference[^11]
- Performs "on par with similarly-sized Transformers"[^11]
- Trains with Transformer-style parallelism, runs as a constant-memory RNN at inference
## Memorable Quotes
> "Quadratic attention has been indispensable for information-dense modalities such as language... until now." — Albert Gu, announcing Mamba (Dec 2023)[^12]
> "On January 1, 2027, a Transformer-like model will continue to hold the state-of-the-art position in most benchmarked NLP tasks." — Sasha Rush's "Is Attention All You Need?" wager[^13]
> "We show that SSMs and variants of attention are connected through various decompositions of a class of structured semiseparable matrices." — Mamba-2 paper[^4]
### Long Range Arena (LRA) Benchmark Scores
Official scores from the [LRA GitHub repository](https://github.com/google-research/long-range-arena), Nov 2020 baseline + external entries:
| Model | ListOps | Text | Retrieval | Image | Path | Path-X | **Avg** |
|-------|---------|------|-----------|-------|------|--------|---------|
| Transformer (baseline) | 36.37 | 64.27 | 57.46 | 42.44 | 71.40 | **FAIL** | 54.39 |
| BigBird (best efficient Transformer) | 36.05 | 64.02 | 59.29 | 40.83 | 74.87 | **FAIL** | **55.01** |
| S4 (ICLR 2022) | 59.60 | 86.82 | 90.90 | 88.65 | 94.20 | **96.35** | **86.09** |
| S5 (2023) | 62.15 | 89.31 | 91.40 | 88.00 | 95.33 | **98.58** | **87.46** |
Key notes:
- All Transformer variants in the LRA suite **FAIL** Path-X (16K tokens), scoring at chance (50%)
- S4 is the **first model ever** to solve Path-X (96.35%) — massive jump from FAIL
- S4 achieves SotA on **every single LRA task** with comparable computational cost
- The gap from the Transformer average (54.39) to the S4 average (86.09) is nearly **32 points**
### The Three Representations of SSMs
An SSM has THREE equivalent forms (huge pedagogical point):
1. **Continuous-time ODE**: h'(t) = Ah(t) + Bx(t), y(t) = Ch(t) (the physics/control-theory view)
2. **Discrete recurrence**: h_t = Āh_{t-1} + B̄x_t, y_t = Ch_t (use at inference: O(1) per step)
3. **Convolution**: y = K̄ * x with kernel K̄ = (CB̄, CĀB̄, CĀ²B̄, ...) (use at training: parallelizable)
These are **mathematically identical** — same model, different computation.
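A quick numerical check of forms 2 and 3 on a toy, already-discretized SSM (random diagonal Ā, random B̄ and C; nothing here is tied to the S4 parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 4, 64                                    # state size, sequence length
A_bar = np.diag(rng.uniform(0.1, 0.9, N))       # discretized state matrix (kept stable)
B_bar = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
x = rng.normal(size=T)

# Form 2: recurrence, O(1) state per step (inference mode)
h, y_rec = np.zeros((N, 1)), []
for t in range(T):
    h = A_bar @ h + B_bar * x[t]
    y_rec.append((C @ h).item())

# Form 3: convolution with kernel K = (CB, CAB, CA^2B, ...) (training mode)
K = np.array([(C @ np.linalg.matrix_power(A_bar, k) @ B_bar).item() for k in range(T)])
y_conv = [np.dot(K[:t + 1][::-1], x[:t + 1]) for t in range(T)]

assert np.allclose(y_rec, y_conv)               # identical outputs, two computation paths
```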
### Mamba's Three Innovations
1. **Selectivity**: Δ, B, and C become functions of the input, so the model decides per token what to remember and what to forget
2. **Parallel scan**: selectivity breaks the time-invariant convolution form, but the recurrence can be computed as a prefix scan of an associative operator, which maps to an efficient GPU parallel scan
3. **IO-aware CUDA kernel**: keeps intermediate states in fast on-chip SRAM (same insight as FlashAttention)
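A sequential reference sketch of a selective SSM, with simplified shapes and discretization (the projection names W_B, W_C, W_dt are placeholders, not the paper's parameter names); the production kernel evaluates the same recurrence as a parallel scan held in SRAM:

```python
import numpy as np

def selective_scan(x, A, W_B, W_C, W_dt):
    """Sequential reference for a selective SSM (simplified shapes, placeholder names).
    x: (T, D) inputs; A: (D, N) fixed negative decay parameters;
    W_B, W_C: (N, D) input projections; W_dt: (D, D) step-size projection."""
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                        # recurrent state, fixed size
    y = np.zeros((T, D))
    for t in range(T):
        xt = x[t]
        dt = np.logaddexp(0.0, W_dt @ xt)       # softplus: input-dependent step size (D,)
        B_t = W_B @ xt                          # input-dependent B  (N,)
        C_t = W_C @ xt                          # input-dependent C  (N,)
        A_bar = np.exp(dt[:, None] * A)         # discretized state transition (D, N)
        B_bar = dt[:, None] * B_t[None, :]      # discretized input map        (D, N)
        h = A_bar * h + B_bar * xt[:, None]     # selection: keep vs overwrite per channel
        y[t] = h @ C_t                          # readout
    return y

# toy usage with hypothetical sizes
rng = np.random.default_rng(0)
T, D, N = 32, 8, 4
y = selective_scan(rng.normal(size=(T, D)), -rng.uniform(0.5, 1.5, (D, N)),
                   rng.normal(size=(N, D)), rng.normal(size=(N, D)),
                   rng.normal(size=(D, D)))
```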
### Why Hybrids Win
The NVIDIA study shows a hybrid with only ~7% attention layers:
- Handles associative recall (attention's specialty) with just 7% attention
- Handles everything else (SSM's specialty) with 93% SSM layers
- Achieves better quality than pure Transformer AND faster inference
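One concrete stack consistent with those percentages (24 Mamba-2, 4 attention, and 28 MLP layers out of 56 gives roughly 43% / 7% / 50%); the interleaving below is illustrative, not necessarily the paper's ordering:

```python
# 14-layer repeating block: 6 (Mamba-2, MLP) pairs followed by 1 (attention, MLP) pair
BLOCK = ["mamba2", "mlp"] * 6 + ["attention", "mlp"]
STACK = BLOCK * 4                                   # 56 layers total

counts = {kind: STACK.count(kind) for kind in ("mamba2", "attention", "mlp")}
ratios = {kind: round(n / len(STACK), 2) for kind, n in counts.items()}
print(counts)   # {'mamba2': 24, 'attention': 4, 'mlp': 28}
print(ratios)   # {'mamba2': 0.43, 'attention': 0.07, 'mlp': 0.5}
```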
### The 2027 Wager Context
Sasha Rush's bet is on "Transformer-like" models — which includes hybrids. The bet isn't that SSMs are bad — it's that the Transformer architecture or something closely related will remain SotA. Many researchers expect hybrids (which are "Transformer-like" in the broad sense) to win by 2027.
## Key Papers at a Glance
| Paper | Year | Contribution |
|-------|------|-------------|
| "Attention is All You Need" | 2017 | Transformer architecture |
| HiPPO | 2020 | Polynomial projection for optimal state compression |
| S4 | 2021 | Structured SSMs, 3 representations, LRA SotA |
| FlashAttention | 2022 | IO-aware attention, 3× faster training |
| Mamba | 2023 | Selective SSMs, parallel scan, 5-15× faster inference |
| Zoology | 2023 | 82% quality gap = associative recall |
| Griffin/Hawk | 2024 | RNNs can match Transformers, linear recurrence + local attention |
| Jamba | 2024 | 256K context on single GPU |
| Mamba-2/SSD | 2024 | SSMs ↔ attention duality, 2-8× faster layer |
| Falcon Mamba 7B | 2024 | Pure SSM beats LLaMA-3-8B on benchmarks |
| HyenaDNA | 2023 | 1M-nucleotide genomic context |
[^1]: Gu et al. (2021). "Efficiently Modeling Long Sequences with Structured State Spaces (S4)." ICLR 2022. arXiv:2111.00396.
[^2]: Dao et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. arXiv:2205.14135.
[^3]: Gu & Dao (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752.
[^4]: Dao & Gu (2024). "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." ICML 2024. arXiv:2405.21060.
[^5]: Lieber et al. (2024). "Jamba: A Hybrid Transformer-Mamba Language Model." arXiv:2403.19887.
[^6]: TII (2024). "Falcon Mamba 7B." Technical report. HuggingFace model card.
[^7]: Waleffe et al. (2024). "An Empirical Study of Mamba-based Language Models." arXiv:2406.07887.
[^8]: Arora et al. (2023). "Zoology: Measuring and Improving Recall in Efficient Language Models." arXiv:2312.04927.
[^9]: Nguyen et al. (2023). "HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution." NeurIPS 2023. arXiv:2306.15794.
[^10]: De et al. (2024). "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models." arXiv:2402.19427.
[^11]: Peng et al. (2023). "RWKV: Reinventing RNNs for the Transformer Era." EMNLP 2023. arXiv:2305.13048.
[^12]: Albert Gu (2023). Tweet announcing Mamba. https://twitter.com/_albertgu/status/1731727672286294400
[^13]: Sasha Rush (2022-). "Is Attention All You Need?" https://www.isattentionallyouneed.com/