key-facts-for-report - ramblings from the whirligig void

# Key Facts, Quotes, and Numbers for the Report > [[index]] — all verified figures with sources for direct use in the report ## Milestone Numbers (Verified) ### S4 Benchmarks - Sequential CIFAR-10: **91% accuracy** (no augmentation) — matches larger 2D ResNet[^1] - Generation: **60× faster** than Transformers (at equal quality)[^1] - Long Range Arena Path-X (16K length): **61.4% accuracy** — only model to beat chance; all prior work fails[^1][^2] - Path-256 (64K with FlashAttention): **63.1% accuracy** ### Mamba Benchmarks - At 1.4B parameters: matches GPT-3-style models in perplexity[^3] - Inference throughput: **5× faster** than Transformer at 2K context[^3] - Inference throughput: **15× faster** than Transformer at 16K context[^3] - Memory: constant regardless of sequence length (vs Transformer: grows linearly with KV cache) ### Mamba-2 (State Space Duality) - Core layer: **2–8× faster** than Mamba-1[^4] - Theoretical unification: shows SSMs and attention are special cases of semiseparable matrix operations ### Jamba - Context: **256K tokens on a single 80GB GPU**[^5] - Architecture: 52B total / 12B active (mixture of experts) - Required: 2 GPUs for equivalent pure Transformer; Jamba does it on 1 ### Falcon Mamba 7B - HuggingFace LLM Leaderboard v1 average: **64.1** - LLaMA-3-8B (Transformer): **62.6** (Mamba wins despite being smaller!) - Mistral-7B (Transformer): **61.0** - Falcon2-11B (Transformer, larger): 64.3[^6] ### NVIDIA Hybrid Study (8B scale) - **Mamba-2-Hybrid** (43% Mamba-2 + 7% attention + 50% MLP): **+2.65 pts** over pure Transformer[^7] - Predicted inference: **8× faster** than pure Transformer at long contexts[^7] - Trains at same speed as Transformer ### Zoology / Associative Recall - **82% of the Transformer–SSM quality gap** explained by a single task: associative recall[^8] - 70M attention model outperforms **1.4B** Hyena model on associative recall[^8] - Hybrids with input-dependent attention: close **97.4% of the attention quality gap** sub-quadratically[^8] ### HyenaDNA (Genomics) - Context: up to **1 million nucleotides** at single-nucleotide resolution[^9] - Previous Transformer-based genomic models: max 4,096 tokens - Training speed: **160× faster** than Transformer at equivalent sequence length[^9] - Benchmark: SotA on **12/18 Nucleotide Transformer datasets**; beats SotA on 7/8 GenomicBenchmarks by +10 accuracy points[^9] ### Griffin (DeepMind) - Matches **Llama-2 performance** on downstream tasks[^10] - Trained on **6× fewer tokens** than Llama-2 (more data-efficient)[^10] - Can extrapolate to sequences **longer than seen during training**[^10] - Scaled to 14B parameters ### FlashAttention - BERT-large training: **15% faster** than MLPerf 1.1 record[^2] - GPT-2 training: **3× faster** than HuggingFace baseline[^2] - **Important**: Still O(N²) computationally — only reduces constant by IO optimization ### RWKV (14B) - Linear scaling in both memory and compute during inference[^11] - Performs "on par with similarly-sized Transformers"[^11] - Can be formulated as either Transformer (training) or RNN (inference) ## Memorable Quotes > "Quadratic attention has been indispensable for information-dense modalities such as language... until now." — Albert Gu, announcing Mamba (Dec 2023)[^12] > "On January 1, 2027, a Transformer-like model will continue to hold the state-of-the-art position in most benchmarked NLP tasks." — Sasha Rush's "Is Attention All You Need?" wager[^13] > "We show that SSMs and variants of attention are connected through various decompositions of a class of structured semiseparable matrices." — Mamba-2 paper[^4] ### Long Range Arena (LRA) Benchmark Scores Official scores from the [LRA GitHub repository](https://github.com/google-research/long-range-arena), Nov 2020 baseline + external entries: | Model | ListOps | Text | Retrieval | Image | Path | Path-X | **Avg** | |-------|---------|------|-----------|-------|------|--------|---------| | Transformer (baseline) | 36.37 | 64.27 | 57.46 | 42.44 | 71.40 | **FAIL** | 54.39 | | BigBird (best efficient Transformer) | 36.05 | 64.02 | 59.29 | 40.83 | 74.87 | **FAIL** | **55.01** | | S4 (ICLR 2022) | 59.60 | 86.82 | 90.90 | 88.65 | 94.20 | **96.35** | **86.09** | | S5 (2023) | ~62.15 | ~89.31 | ~91.40 | ~88.00 | ~95.33 | **98.50** | **87.40** | Key notes: - All Transformers **FAIL** on Path-X (16K sequence) — score random (50%) - S4 is the **first model ever** to solve Path-X (96.35%) — massive jump from FAIL - S4 achieves SotA on **every single LRA task** with comparable computational cost - The Transformer avg (54.39) to S4 avg (86.09) is a **32-point gap** ### The Three Representations of SSMs An SSM has THREE equivalent forms (huge pedagogical point): 1. **Continuous-time ODE**: x'(t) = Ax(t) + Bu(t) — the physics/control theory view 2. **Discrete recurrence**: h_t = Āh_{t-1} + B̄x_t — use at inference (O(1) per step) 3. **Convolution**: y = K̄ * x where K̄ = [CB̄, CĀB̄, CĀ²B̄, ...] — use at training (parallelizable) These are **mathematically identical** — same model, different computation. ### Mamba's Three Innovations 1. **Selectivity**: B and C become functions of input (input-dependent forgetting/remembering) 2. **Parallel scan**: selectivity breaks CNN mode, but the recurrence = prefix sum → efficient GPU parallel scan 3. **IO-aware CUDA kernel**: keeps intermediate states in fast on-chip SRAM (same insight as FlashAttention) ### Why Hybrids Win The NVIDIA study shows a hybrid with only ~7% attention layers: - Handles associative recall (attention's specialty) with just 7% attention - Handles everything else (SSM's specialty) with 93% SSM layers - Achieves better quality than pure Transformer AND faster inference ### The 2027 Wager Context Sasha Rush's bet is on "Transformer-like" models — which includes hybrids. The bet isn't that SSMs are bad — it's that the Transformer architecture or something closely related will remain SotA. Many researchers expect hybrids (which are "Transformer-like" in the broad sense) to win by 2027. ## Key Papers at a Glance | Paper | Year | Contribution | |-------|------|-------------| | "Attention is All You Need" | 2017 | Transformer architecture | | HiPPO | 2020 | Polynomial projection for optimal state compression | | S4 | 2021 | Structured SSMs, 3 representations, LRA SotA | | FlashAttention | 2022 | IO-aware attention, 3× faster training | | Mamba | 2023 | Selective SSMs, parallel scan, 5-15× faster inference | | Zoology | 2023 | 82% quality gap = associative recall | | Griffin/Hawk | 2024 | RNNs can match Transformers, linear recurrence + local attention | | Jamba | 2024 | 256K context on single GPU | | Mamba-2/SSD | 2024 | SSMs ↔ attention duality, 2-8× faster layer | | Falcon Mamba 7B | 2024 | Pure SSM beats LLaMA-3-8B on benchmarks | | HyenaDNA | 2023 | 1M-nucleotide genomic context | [^1]: Gu et al. (2021). "Efficiently Modeling Long Sequences with Structured State Spaces (S4)." ICLR 2022. arXiv:2111.00396. [^2]: Dao et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. arXiv:2205.14135. [^3]: Gu & Dao (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752. [^4]: Dao & Gu (2024). "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." ICML 2024. arXiv:2405.21060. [^5]: Lieber et al. (2024). "Jamba: A Hybrid Transformer-Mamba Language Model." arXiv:2403.19887. [^6]: TII (2024). "Falcon Mamba 7B." Technical report. HuggingFace model card. [^7]: Waleffe et al. (2024). "An Empirical Study of Mamba-based Language Models." arXiv:2406.07887. [^8]: Arora et al. (2023). "Zoology: Measuring and Improving Recall in Efficient Language Models." arXiv:2312.04927. [^9]: Nguyen et al. (2023). "HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution." NeurIPS 2023. arXiv:2306.15794. [^10]: De et al. (2024). "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models." arXiv:2402.19427. [^11]: Peng et al. (2023). "RWKV: Reinventing RNNs for the Transformer Era." EMNLP 2023. arXiv:2305.13048. [^12]: Albert Gu (2023). Tweet announcing Mamba. https://twitter.com/_albertgu/status/1731727672286294400 [^13]: Sasha Rush (2022-). "Is Attention All You Need?" https://www.isattentionallyouneed.com/