# Real-World Products: Transformers and SSMs in the Wild

> See [[index]], [[transformers-basics]], [[ssm-basics]], [[hybrid-models]], [[computational-complexity]], [[strengths-and-weaknesses]], [[current-landscape-2025]]

> [!NOTE]
> Last updated: 2025. This note covers the state of the field through mid-2025, including all major 2024 releases and benchmark results.

---

## Transformer-Based Models

### The Major Language Models (2024–2025)

| Model | Creator | Context Window | Key Strength | Architecture Notes |
|-------|---------|---------------|-------------|-------------------|
| **GPT-4 / GPT-4o** | OpenAI | 128k tokens | General reasoning, multimodal | Dense Transformer + RLHF; ~1.8T params (rumored) |
| **GPT-4.1 / o3** | OpenAI | 1M tokens | Reasoning, coding | Chain-of-thought reasoning Transformer |
| **Claude 3.5 / 3.7 Sonnet** | Anthropic | 200k tokens | Long-form reasoning, careful responses | Trained with Constitutional AI |
| **Gemini 1.5 / 2.0 Pro** | Google | 1M–2M tokens | Extreme long context | Mixture-of-Experts Transformer |
| **LLaMA 3.1 / 3.3 (70B)** | Meta | 128k tokens | Open weights, high capability | Standard Transformer, openly released |
| **Mistral 7B / Mistral Large** | Mistral AI | 32k–128k tokens | Efficiency, sliding window | Grouped Query Attention + Sliding Window |
| **Mixtral 8×7B** | Mistral AI | 32k tokens | Efficiency at scale | Sparse Mixture-of-Experts |
| **Qwen2.5 / QwQ** | Alibaba | 128k tokens | Multilingual, reasoning | Standard Transformer |
| **DeepSeek-V3 / R1** | DeepSeek | 128k tokens | Efficient MoE, reasoning | MoE Transformer; trained cheaply |

### Encoder-Only Models (Classification / Embeddings)

| Model | Use Case | Context |
|-------|----------|---------|
| **BERT** (Google, 2018) | Classification, NER, QA | 512 tokens |
| **RoBERTa** (Meta, 2019) | Better-trained BERT | 512 tokens |
| **DeBERTa** (Microsoft) | NLU tasks | 512 tokens |
| **BGE / E5** embeddings | Semantic search | 512–8k tokens |

### Vision Transformers

| Model | Task | Notes |
|-------|------|-------|
| **ViT** (Google, 2020) | Image classification | "An Image is Worth 16×16 Words" |
| **DINO v2** (Meta) | Self-supervised vision | Strong visual features |
| **Swin Transformer** (Microsoft) | Hierarchical vision | Shifted window attention |
| **GPT-4V / GPT-4o** (OpenAI) | Vision + language | Multimodal Transformer |

---

## SSM-Based Models

### The S4 Lineage (Academic → Applied)

| Model | Year | Creator | Key Contribution |
|-------|------|---------|-----------------|
| **S4** | 2021 | Gu et al. (Stanford/CMU) | First practical deep SSM; Long Range Arena SOTA |
| **DSS** | 2022 | Gupta et al. | Diagonal S4 simplification |
| **S5** | 2022 | Smith et al. | Simplified S4 with parallel scan |
| **H3** | 2022 | Fu et al. (Stanford) | SSM + attention hybrid for language |
| **Hyena** | 2023 | Poli et al. (Stanford) | Long convolutions, subquadratic |
| **HyenaDNA** | 2023 | Nguyen et al. (Hazy Research) | Genomics foundation model; 1M token context, single-nucleotide resolution |

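Every model in this lineage is built on the same discretized linear state-space layer, which can be run either as a recurrence or as a long convolution. The sketch below is a toy NumPy check of that equivalence (illustrative only; it is not the actual S4 parameterization, which uses structured, HiPPO-initialized matrices and FFT-based convolution).

```python
import numpy as np

# Toy discretized LTI SSM: x_k = A x_{k-1} + B u_k,  y_k = C x_k
# (Illustrative only; real S4 uses structured A and a learned discretization.)
rng = np.random.default_rng(0)
N, L = 8, 64                                  # state size, sequence length
A = np.diag(rng.uniform(0.1, 0.9, N))         # stable diagonal A (DSS/S4D-style)
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
u = rng.normal(size=L)

# 1) Recurrent mode: O(1) state per step (how SSMs run at inference time).
x = np.zeros((N, 1))
y_rec = []
for k in range(L):
    x = A @ x + B * u[k]
    y_rec.append((C @ x).item())

# 2) Convolutional mode: y = K * u with kernel K_j = C A^j B
#    (how S4 trains in parallel; in practice computed via FFT).
K = np.array([(C @ np.linalg.matrix_power(A, j) @ B).item() for j in range(L)])
y_conv = [np.dot(K[: k + 1][::-1], u[: k + 1]) for k in range(L)]

assert np.allclose(y_rec, y_conv)
print("recurrent and convolutional modes agree")
```
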
### Mamba Family and Related Models (2023–2025)

| Model | Year | Params | Notes |
|-------|------|--------|-------|
| **Mamba** | Dec 2023 | 130M–3B | Selective SSM; 5× faster inference vs Transformer[^2] |
| **Mamba-2** | May 2024 | 130M–2.8B | State Space Duality; 2–8× faster than Mamba-1; ICML 2024[^7] |
| **Codestral Mamba** | Jul 2024 | 7.3B | Mistral; code generation SSM, Apache 2.0 license[^5] |
| **Falcon Mamba 7B** | Jul 2024 | 7B | TII; first production-scale pure SSM competitive with 7B Transformers[^4] |
| **Zamba** | May 2024 | 7B | Zyphra; hybrid Mamba + shared Transformer block[^4b] |
| **Zamba2-7B** | Oct 2024 | 7B | Zyphra; improved second-generation hybrid |
| **RWKV-7 (Goose)** | 2025 | Up to 14B | Continued RWKV development; improved gating |

#### Falcon Mamba 7B — Detailed Profile

**Creator**: Technology Innovation Institute (TII), Abu Dhabi
**Released**: July 2024
**License**: TII Falcon Mamba License 2.0 (open access)
**Architecture**: Pure Mamba with additional RMS normalization layers for stable training at scale[^4]

**Training**:
- ~5,500 GT (≈5.5 trillion tokens) of data
- Data: RefinedWeb-English, FineWeb-edu, high-quality technical and code data
- Tokenizer: Shared with Falcon-7B/11B
- Hardware: 256 × H100 80GB GPUs, ~2 months of training
- Multi-stage strategy: context length expanded from 2,048 → 8,192 during training[^4]

**Key capability**: Fits on a single **A10 24GB GPU** (no KV cache growth). Generates tokens at constant throughput regardless of context length.[^4]

#### Codestral Mamba — Detailed Profile

**Creator**: Mistral AI (with Albert Gu and Tri Dao)
**Released**: July 2024
**License**: Apache 2.0
**Parameters**: 7,285,403,648 (~7.3B)
**Architecture**: Pure Mamba, trained for code generation
**Context**: Tested on in-context retrieval up to 256k tokens[^5]
**Deployment**: Available via the `mistral-inference` SDK, TensorRT-LLM, and on la Plateforme as `codestral-mamba-2407`

### RWKV Family

**RWKV** (Receptance Weighted Key Value) is a community-driven model family that formulates attention-like operations as recurrent computations — making it an "RNN that feels like a Transformer".[^1]

| Version | Params | Notes |
|---------|--------|-------|
| RWKV-4 | 1.5B–14B | First widely used version |
| RWKV-5 | Up to 7B | Eagle architecture |
| RWKV-6 | Up to 14B | Finch architecture; improved gating; competitive with Mamba |
| RWKV-7 (Goose) | Up to 14B | 2025; further improved; open-source, Apache 2.0 |

**Strengths**: True RNN inference (O(1) per token), fully open-source, runs on CPU reasonably well.

**Unique property**: The only major model family explicitly designed to be usable without GPUs for inference.

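The open-weight SSMs above ship as ordinary Hugging Face checkpoints, so trying one needs no special tooling. A minimal sketch for Falcon Mamba 7B, assuming a `transformers` release recent enough to include the FalconMamba architecture and a GPU with roughly 16 GB free for bf16 weights (the optional `mamba-ssm` and `causal-conv1d` packages enable the fast kernels); this is an illustration, not TII's official serving recipe.

```python
# Minimal sketch: generate with Falcon Mamba 7B via Hugging Face transformers.
# Assumes a recent transformers release with FalconMamba support; not an
# official deployment recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-mamba-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # fits comfortably on a 24GB A10 in bf16
    device_map="auto",
)

prompt = "State space models are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Unlike a Transformer, generation memory does not grow with output length:
# the model carries a fixed-size recurrent state instead of a KV cache.
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
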
---

## Application Domains: Where SSMs Are Deployed

> [!NOTE]
> SSMs shine in "non-language-like" domains — long continuous sequences where attention's quadratic cost is most painful and where exact recall is less critical.

### Genomics

SSMs are a natural fit for DNA/RNA sequences: extremely long (billions of nucleotides), continuous signals, and sequential structure.

| Model | Year | Architecture | Key Claim |
|-------|------|-------------|-----------|
| **HyenaDNA** | 2023 | Hyena (subquadratic conv) | 1M token context at single-nucleotide resolution; NeurIPS 2023 Spotlight[^8] |
| **Caduceus** | 2024 | Mamba-based | Bidirectional DNA foundation model; reverse-complement (RC) equivariant |
| **Evo** | 2024 | StripedHyena hybrid | Arc Institute; 7B params; trained on 2.7M prokaryotic genomes; generates novel DNA sequences |

**Why SSMs win in genomics**: Transformer context is limited to ~4K nucleotides (due to quadratic cost), which is <0.001% of the human genome. HyenaDNA extends this to 1M nucleotides — a 250× increase — by using sub-quadratic convolutions, and trains up to 160× faster than an equivalent attention-based Transformer at that sequence length.[^8]

### Audio

| Model | Task | Notes |
|-------|------|------|
| **SaShiMi** | Audio generation | S4-based; outperforms WaveNet on speech synthesis |
| **DiffWave + S4** | Text-to-speech | SSM diffusion model |
| **Mamba for EEG** | Brain signal processing | SSM processes very long physiological time series |
| **Audio Mamba** | Audio classification | ViM-style SSM for spectrogram processing |

### Time Series

SSMs are among the best architectures for multivariate time series due to their natural sequential inductive bias.

| Model | Task | Notes |
|-------|------|------|
| **SSSD** | Time series imputation + generation | S4D-based diffusion |
| **Spacetimeformer** | Spatio-temporal forecasting | Hybrid SSM+Transformer |
| **S-Mamba / TimeMamba** | Long-horizon forecasting | Mamba for weather, energy, traffic |

**Key advantage**: At 8,192-step forecasting horizons, SSM-based forecasters have been reported to be ~10× faster than full Transformers while matching or exceeding their accuracy.

### Video

Video is a natural SSM domain: extremely long token sequences (frames × patches), strong temporal locality, and high redundancy.

| Model | Task | Notes |
|-------|------|------|
| **VideoMamba** | Video understanding | Vision Mamba applied to video; linear temporal complexity |
| **Video-SSM** | Video generation | Mamba-based diffusion for long video |
| **Vim (Vision Mamba)** | Image + video classification | 2D visual SSM; ViT alternative |

**Why SSMs matter for video**: A 4-hour 720p video at 30 fps contains ~432,000 frames. Even at 196 tokens/frame (ViT-style), that's ~85M tokens — impossible for full attention. SSMs process this linearly.

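The context-length claims in the genomics and video subsections are easy to sanity-check with back-of-envelope arithmetic. The constants below (genome size, patch count per frame) are rough illustrative assumptions, not figures taken from the cited papers.

```python
# Back-of-envelope sequence lengths for the domains above (rough assumptions,
# not numbers taken from the cited papers).

# Genomics: fraction of the human genome visible in a 4K-token context window.
human_genome_bp = 3.1e9                 # ~3.1 billion base pairs (approximate)
ctx_transformer = 4_000                 # typical genomic-Transformer context
ctx_hyenadna = 1_000_000                # HyenaDNA context
print(f"4K window covers {ctx_transformer / human_genome_bp:.6%} of the genome")
print(f"HyenaDNA window is {ctx_hyenadna // ctx_transformer}x longer")

# Video: token count for a long clip with ViT-style patching.
fps, hours = 30, 4
tokens_per_frame = 196                  # 14x14 patches per frame (ViT-style)
frames = fps * 3600 * hours
tokens = frames * tokens_per_frame
print(f"{hours}h of video -> {frames:,} frames -> {tokens / 1e6:.0f}M tokens")
```
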
### On-Device / Edge Deployment

> [!NOTE]
> This is the most commercially significant application for SSMs in 2025.

**The problem**: Running LLMs on phones, laptops, and edge devices requires constant memory, low power, and no KV-cache growth.

| Property | Transformer | SSM |
|----------|-------------|-----|
| Memory per token | Grows (KV cache) | Fixed (recurrent state) |
| Multi-turn conversations | KV cache fills memory | No memory growth |
| CPU inference | Slow (attention ops) | Feasible (RNN-style) |
| Power usage (long context) | High | Low |

**Examples**:
- **RWKV-6**: The only major 7B-class model that runs well on CPU without quantization
- **Falcon Mamba 7B**: Runs on a single 24GB A10 GPU with arbitrarily long context
- **Codestral Mamba**: Serves code completions in embedded IDEs with constant memory

### Medical / Scientific

| Application | Why SSMs Help |
|-------------|--------------|
| EHR (Electronic Health Records) | Very long patient timelines; sequential events |
| Protein folding support | Extremely long amino acid sequences |
| Brain signals (EEG/ECoG) | High-rate time series, continuous processing |
| Genomic variant calling | Million-nucleotide contexts |

---

## Hybrid Models

See [[hybrid-models]] for full details.

| Model | Creator | Architecture | Key Innovation |
|-------|---------|-------------|----------------|
| **Jamba** | AI21 Labs | Transformer + Mamba + MoE | 256K context on 1 GPU; 52B total / 12B active |
| **Jamba 1.5** | AI21 Labs | Improved Jamba | 256K context; production-quality instruct model |
| **Zamba** | Zyphra | Mamba + shared Transformer | 7B params, shared attention layer |
| **Zamba2-7B** | Zyphra | Improved hybrid | Better benchmarks than original Zamba |
| **Griffin** | Google DeepMind | Linear recurrence + local attention | Matches Llama-2 while trained on ~6× fewer tokens |
| **Hawk** | Google DeepMind | Pure gated linear recurrence | Exceeds Mamba on benchmarks |
| **RecurrentGemma** | Google DeepMind | Griffin-based | Production-scale (2B and 9B variants) |
| **RetNet** | Microsoft | Retention mechanism | "Transformer with recurrent mode" |
| **Mamba-2-Hybrid-8B** | NVIDIA | 43% Mamba-2 + 7% attn + 50% MLP | +2.65 pts over pure Transformer on 12 tasks[^6] |

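Architecturally, these hybrids differ mainly in how often an attention layer is interleaved into the SSM stack (roughly one layer in eight for Jamba, about 7% of layers for NVIDIA's hybrid). The toy schedule below is purely illustrative and does not reproduce any of these models' actual layer layouts, which also place MoE/MLP layers and tune positions empirically.

```python
# Toy hybrid layer schedule: one attention layer per block of Mamba layers.
# Illustrative only; real Jamba/NVIDIA configs differ in detail.
def hybrid_schedule(n_layers: int, attn_every: int) -> list[str]:
    """Return a layer-type list with one 'attention' layer per `attn_every` layers."""
    return [
        "attention" if (i % attn_every) == attn_every - 1 else "mamba"
        for i in range(n_layers)
    ]

layers = hybrid_schedule(n_layers=32, attn_every=8)   # ~ Jamba-style 1:7 ratio
print(layers)
print(f"attention fraction: {layers.count('attention') / len(layers):.1%}")
```
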
---

## Performance Benchmarks

### Language Quality: HuggingFace Open LLM Leaderboard v1 (2024)

Classic benchmark suite (ARC, HellaSwag, MMLU, Winogrande, TruthfulQA, GSM8K). Evaluated with `lighteval`.[^4]

| Model | ARC | HellaSwag | MMLU | Winogrande | TruthfulQA | GSM8K | **Average** |
|-------|-----|-----------|------|-----------|-----------|-------|------------|
| **Falcon Mamba-7B** (pure SSM) | 62.0 | 80.8 | 62.1 | 73.6 | 53.4 | 52.5 | **64.1** |
| Zamba-7B-v1 (hybrid SSM) | 56.1 | 82.2 | 58.1 | 79.9 | 52.9 | 30.8 | 60.0 |
| RecurrentGemma-9B (hybrid) | 52.0 | 80.4 | 60.5 | 73.6 | 38.6 | 42.6 | 58.0 |
| Mamba-7B-rw (pure SSM) | 51.3 | 80.9 | 33.4 | 71.1 | 32.1 | 4.7 | 45.5 |
| Mistral-7B-v0.1 (Transformer) | 60.0 | 83.3 | 64.2 | 78.4 | 42.2 | 37.8 | 61.0 |
| Meta-Llama-3-8B (Transformer) | 60.2 | 82.2 | 66.7 | 78.5 | 42.9 | 45.2 | 62.6 |
| Falcon2-11B (Transformer) | 59.7 | 82.9 | 58.4 | 78.3 | 52.6 | 53.8 | 64.3 |
| gemma-7B (Transformer) | 61.1 | 82.2 | 64.6 | 79.0 | 44.8 | 50.9 | 63.8 |

**Key takeaway**: Falcon Mamba-7B achieves competitive average scores against 7–8B Transformer models on this suite, despite being a *pure* SSM — the first to do so at this scale.[^4]

### HuggingFace Open LLM Leaderboard v2 (2024)

Harder benchmarks: IFEval, BBH, MATH Level 5, GPQA, MUSR, MMLU-PRO.[^4]

| Model | IFEval | BBH | MATH Lvl5 | GPQA | MUSR | MMLU-PRO | **Avg** |
|-------|--------|-----|-----------|------|------|----------|---------|
| **Falcon Mamba-7B** (pure SSM) | 33.4 | 19.9 | 3.6 | 8.1 | 10.9 | 14.5 | **15.0** |
| Zamba-7B-v1 (hybrid SSM) | 24.1 | 21.1 | 3.3 | 3.0 | 7.7 | 16.0 | 12.6 |
| RecurrentGemma-9B (hybrid) | 30.8 | 14.8 | 4.8 | 4.7 | 6.6 | 17.9 | 13.2 |
| Falcon2-11B (Transformer) | 32.6 | 21.9 | 2.3 | 2.8 | 7.5 | 15.4 | 13.8 |
| Meta-Llama-3-8B (Transformer) | 14.6 | 24.5 | 3.3 | 7.4 | 6.2 | 24.6 | 13.4 |
| Meta-Llama-3.1-8B (Transformer) | 12.7 | 25.3 | 4.6 | 6.2 | 9.0 | 25.0 | 13.8 |
| Mistral-7B-v0.1 (Transformer) | 23.9 | 22.0 | 2.5 | 5.6 | 10.7 | 22.4 | 14.5 |
| gemma-7B (Transformer) | 26.6 | 21.1 | 6.4 | 4.9 | 11.0 | 21.6 | **15.3** |

### NVIDIA Mamba-2-Hybrid at 8B Scale (2024)[^6]

NVIDIA's controlled study trained 8B Mamba, Mamba-2, Transformer, and a Hybrid (43% Mamba-2 + 7% attention + 50% MLP) on **identical datasets up to 3.5T tokens**.

| Model | Architecture | Avg score (12 standard tasks) | Notes |
|-------|-------------|-------------------------------|-------|
| Transformer-8B | Pure Transformer | baseline | Strong ICL & copying |
| Mamba-8B | Pure SSM | ≈baseline on many tasks | Lags on 5-shot MMLU, phonebook lookup |
| Mamba-2-8B | Pure SSM | ≈baseline on many tasks | Faster than Mamba-1 |
| **Mamba-2-Hybrid-8B** | 43% Mamba-2 + 7% attn + 50% MLP | **+2.65 pts vs Transformer** | Best of both |

The hybrid **exceeded** the pure Transformer on all 12 standard tasks and matched or exceeded it on average across 23 long-context tasks (16K–128K tokens).[^6]

### Language Quality (Perplexity on The Pile, lower = better)

| Model | Params | Perplexity |
|-------|--------|-----------|
| GPT-Neo | 2.7B | 13.0 |
| **Mamba** | 2.8B | **10.0** |
| Pythia | 2.8B | 10.7 |
| RWKV-4 | 3B | 11.0 |

Source: Mamba paper[^2]

### Inference Throughput (tokens/second, A100 GPU)

| Model | Context | Throughput |
|-------|---------|-----------|
| Transformer 2.8B | 2K | ~1,500 t/s |
| **Mamba 2.8B** | 2K | ~**7,500 t/s** (5×) |
| Transformer 2.8B | 16K | ~500 t/s |
| **Mamba 2.8B** | 16K | ~**7,500 t/s** (15×!) |

SSM throughput is *constant* regardless of sequence length.[^2]

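The constant-throughput behaviour falls out of the per-token decode cost: attention must re-read its entire KV cache for every generated token, while an SSM only updates a fixed-size state. The cost model below uses made-up illustrative constants, so the ratios show the trend rather than the measured numbers above.

```python
# Rough per-token decode cost model (illustrative constants; real throughput
# also depends on batching, kernels, and memory bandwidth, so the ratios only
# show the trend, not the measured numbers above).
def per_token_cost(context_len: int, d_model: int, n_layers: int, mixer: str) -> int:
    mlp_and_proj = 8 * d_model**2            # roughly shared by both designs
    if mixer == "attention":
        mixing = context_len * d_model       # read the whole KV cache: grows with L
    else:
        mixing = 16 * d_model                # update a fixed-size SSM state
    return n_layers * (mlp_and_proj + mixing)

d, layers = 2560, 64                         # toy 2.8B-ish configuration
for L in (2_000, 16_000, 128_000):
    attn = per_token_cost(L, d, layers, "attention")
    ssm = per_token_cost(L, d, layers, "ssm")
    print(f"context {L:>7,}: attention/SSM per-token cost ratio ~{attn / ssm:.1f}x")
```
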
### Throughput at Scale: Falcon Mamba vs LLaMA-3.1-8B (H100)

Falcon Mamba was benchmarked against LLaMA-3.1-8B on an H100 GPU generating up to 130K tokens (batch size 1, float32).[^4]

| Metric | Falcon Mamba-7B | LLaMA-3.1-8B |
|--------|----------------|--------------|
| Peak memory (128k context) | **Constant** (recurrent state) | Grows linearly (KV cache) |
| Generation throughput (long ctx) | **Constant** t/s | Degrades as context grows |
| Max sequence on 24GB A10 | Arbitrary* | ~16–24K tokens |

\* With sequential (token-by-token) prefill, Falcon Mamba can process sequences of arbitrary length on a 24GB A10. With parallel prefill (batched), memory scales with prompt length, but is still lower than a Transformer KV cache.

### Long Range Arena (Higher = Better)

The LRA benchmark tests models on tasks requiring dependencies across 1,000–16,000 token sequences.

| Model | Average Score | Path-X (hardest) |
|-------|--------------|-----------------|
| Transformer | 53.4% | Failed |
| Linformer | 53.9% | Failed |
| **S4** | **86.8%** | 88.1% (first to solve!) |
| Mamba | ~88% | Solves easily |

Source: S4 paper, Hazy Research blog[^3]

---

## Why Does This Matter?

*A layperson summary of what these benchmarks mean in practice.*

### The speed story

When you chat with an AI, there are two phases: **reading** (processing your prompt) and **writing** (generating the response). Transformers are fast at reading but slow at writing long responses because they must look back at everything they've "said" so far. With 128K tokens of context, that lookback cost is enormous.

SSMs are different: once they've read your prompt, they "compress" everything into a small fixed-size state (like working memory), and each new token costs the same regardless of how long the conversation has been. **This is why Mamba runs at 15× the throughput of a Transformer at long contexts** — it simply doesn't slow down.

### The quality story

For years, SSMs were faster but noticeably worse at hard reasoning tasks. Falcon Mamba-7B changed that in 2024: it's the first pure SSM to score competitively with 7–8B Transformers across standard benchmarks (64.1 average vs 62.6 for LLaMA-3-8B).

However, SSMs still lag on tasks that require **exact lookup** (e.g., "what was said on line 2 of my 50,000-word document?") and **few-shot in-context learning** — because their compressed state can lose fine-grained details. Hybrids (Jamba, Griffin, Mamba-2-Hybrid) resolve this by adding a small number of attention layers for precision.

### The hardware story

- A pure SSM like Falcon Mamba-7B fits on a **single 24GB A10 GPU** for unlimited-length generation.
- An equivalent Transformer would require a growing KV cache, eventually exceeding GPU memory (a back-of-envelope estimate follows this list).
- For cloud deployments with many concurrent users, lower memory = **more users per GPU = lower cost**.

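A back-of-envelope memory estimate makes the hardware point concrete. The configuration below is a rough stand-in for an 8B-class Transformer with grouped-query attention and an fp16 cache, not the published LLaMA-3.1-8B or Falcon Mamba configurations.

```python
# Back-of-envelope KV-cache memory for an 8B-class Transformer vs the fixed
# recurrent state of an SSM (rough stand-in config, not the published models).
def kv_cache_gb(tokens: int, n_layers=32, n_kv_heads=8, head_dim=128,
                bytes_per_elem=2) -> float:
    # 2 tensors (K and V) per layer, each of shape [tokens, n_kv_heads, head_dim].
    return 2 * n_layers * tokens * n_kv_heads * head_dim * bytes_per_elem / 1e9

for ctx in (8_000, 32_000, 128_000):
    print(f"{ctx:>7,} tokens -> ~{kv_cache_gb(ctx):.1f} GB of KV cache")

# An SSM keeps a fixed-size state per layer instead, independent of context:
ssm_state_gb = 64 * 16 * 4096 * 2 / 1e9   # layers * state size * d_model * fp16 bytes
print(f"SSM recurrent state: ~{ssm_state_gb * 1000:.0f} MB, regardless of context")
```
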
### Bottom line

| What you care about | Use | Why |
|--------------------|-----|-----|
| Maximum quality at <8K tokens | Transformer (GPT-4, Claude, LLaMA) | Best reasoning, ICL |
| Long documents, low latency | Hybrid (Jamba, Griffin) | Quality + efficiency |
| Streaming / real-time / edge | Pure SSM (Mamba, Falcon Mamba) | Constant memory & time |
| Code generation, long context | Codestral Mamba | Trained for code, 256K context |
| Genomics / DNA | HyenaDNA, Evo, Caduceus | Million-token DNA context |
| Time series / audio | SaShiMi, TimeMamba | Long continuous signals |

---

## State of the Field (2025)

> [!NOTE]
> This section summarizes where the research stands as of early 2025. For a deeper analysis see [[current-landscape-2025]].

### What is now settled

1. **SSMs can match Transformers on standard language benchmarks** at 7B scale. Falcon Mamba-7B was the first to demonstrate this (July 2024). The quality gap that existed in 2022–2023 has largely closed at this scale.
2. **Hybrids consistently outperform both** pure SSMs and pure Transformers on the standard benchmark suites. NVIDIA's controlled 8B study showed their Mamba-2-Hybrid exceeded the pure Transformer on *all* 12 standard tasks (+2.65 average) and matched or exceeded it on average across 23 long-context tasks.[^6]
3. **SSMs and attention are mathematically related**. Mamba-2's State Space Duality (SSD) paper (ICML 2024) showed that a broad class of SSMs is equivalent to a structured, decaying form of attention — they compute the same semiseparable matrix product (a numerical sketch of this equivalence follows this section). This is not just theoretical: it enables borrowing FlashAttention-style hardware kernels for SSMs.[^7]
4. **The weakness is specific: associative recall**. The Zoology paper (Arora et al., 2023) showed 82% of the SSM–Transformer perplexity gap is explained by exact in-context retrieval. A 70M attention model outperforms a 1.4B SSM on this specific task. See [[zoology-associative-recall]].[^9]
5. **Hybrids solve the recall weakness cheaply**. Adding just 1 attention layer per 7 SSM layers (Jamba ratio) or ~7% attention (NVIDIA hybrid) is enough to recover recall performance. The cost is minimal; the quality benefit is large.

### What remains contested

1. **Long-context reasoning**: SSMs process long contexts but their compressed state means subtle details can be lost. Transformers with sliding-window or sparse attention may still be better at needle-in-a-haystack retrieval.
2. **5-shot in-context learning**: NVIDIA's study showed pure Mamba lags on 5-shot MMLU and phonebook lookup. Hybrids recover this, but the fundamental architecture difference remains.
3. **Frontier scaling**: All public data on SSMs "at scale" tops out around 7B–8B parameters. We don't yet know if SSMs continue to close the gap at 70B+ parameters. The frontier Transformer labs (OpenAI, Anthropic, Google) have not publicly deployed SSM-based models.
4. **Mamba-2 improvements over Mamba-1**: Mamba-2 is 2–8× faster than Mamba-1 due to SSD-optimized kernels, but the *quality* improvement is modest. The main win is efficiency, not accuracy.

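To make the duality in point 3 above concrete, the sketch below works the scalar-decay case by hand: the recurrent SSM output equals multiplying the inputs by a lower-triangular, decay-masked, attention-like matrix. This is a toy numerical check, not the Mamba-2 algorithm, which never materializes that matrix.

```python
import numpy as np

# Toy check of State Space Duality: a scalar-decay selective SSM
#   h_t = a_t * h_{t-1} + B_t * u_t,   y_t = C_t . h_t
# produces the same outputs as a masked, decaying "attention" matrix
#   M[t, s] = (C_t . B_s) * a_{s+1} * ... * a_t   (lower-triangular).
rng = np.random.default_rng(0)
L, N = 6, 4                              # sequence length, state size
a = rng.uniform(0.5, 1.0, size=L)        # per-step decay (the "selective" part)
B = rng.normal(size=(L, N))              # per-step input projections
C = rng.normal(size=(L, N))              # per-step output projections
u = rng.normal(size=L)

# 1) Recurrent form.
h = np.zeros(N)
y_rec = []
for t in range(L):
    h = a[t] * h + B[t] * u[t]
    y_rec.append(C[t] @ h)

# 2) Attention-like form: y = M @ u with a 1-semiseparable decay mask.
M = np.zeros((L, L))
for t in range(L):
    for s in range(t + 1):
        decay = np.prod(a[s + 1 : t + 1]) if s < t else 1.0
        M[t, s] = (C[t] @ B[s]) * decay
y_attn = M @ u

assert np.allclose(y_rec, y_attn)
print("recurrent SSM and masked-attention form agree")
```
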
### Key papers from 2023–2024 (must-reads)

| Paper | Venue | Key Contribution |
|-------|-------|-----------------|
| Mamba-2 / SSD | ICML 2024 | SSMs = structured attention; 2–8× faster[^7] |
| Jamba | arXiv Mar 2024 | First large hybrid; 256K context on 1 GPU[^1a] |
| Griffin / Hawk | arXiv Feb 2024 | Google's hybrid beats LLaMA-2 training efficiency[^10] |
| NVIDIA Empirical Study | arXiv Jun 2024 | Hybrid beats pure Transformer at 8B scale[^6] |
| Falcon Mamba 7B | HF Jul 2024 | First production-scale competitive pure SSM[^4] |
| Zoology | arXiv Dec 2023 | Formalizes AR weakness; motivates hybrids[^9] |
| MambaByte | COLM 2024 | Token-free SSM; competitive with subword Transformers[^11] |
| minLSTM/minGRU | arXiv Oct 2024 | Simplified LSTMs fully parallelizable; competitive with Transformers[^12] |

---

## Sources

[^1]: RWKV-LM. https://www.rwkv.com/
[^1a]: Lieber, O. et al. (2024). "Jamba: A Hybrid Transformer-Mamba Language Model." *arXiv:2403.19887*. https://arxiv.org/abs/2403.19887
[^2]: Gu, A. & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." *arXiv:2312.00752*. https://arxiv.org/abs/2312.00752
[^3]: Gu, A. et al. (2021). "Efficiently Modeling Long Sequences with Structured State Spaces." *arXiv:2111.00396*. See also: https://hazyresearch.stanford.edu/blog/2022-01-14-s4-1
[^4]: Zuo, J. et al. (2024). "Falcon Mamba 7B." TII / HuggingFace. https://huggingface.co/blog/falconmamba; model card: https://huggingface.co/tiiuae/falcon-mamba-7b
[^4b]: Glorioso, P. et al. (2024). "Zamba: A Compact 7B SSM Hybrid Model." Zyphra. *arXiv:2405.16712*. https://arxiv.org/abs/2405.16712
[^5]: Mistral AI (2024). "Codestral Mamba." https://mistral.ai/news/codestral-mamba/
[^6]: Waleffe, R. et al. (2024). "An Empirical Study of Mamba-based Language Models." NVIDIA / Megatron-LM. *arXiv:2406.07887*. https://arxiv.org/abs/2406.07887
[^7]: Dao, T. & Gu, A. (2024). "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." ICML 2024. *arXiv:2405.21060*. https://arxiv.org/abs/2405.21060
[^8]: Nguyen, E. et al. (2023). "HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution." NeurIPS 2023 Spotlight. *arXiv:2306.15794*. https://arxiv.org/abs/2306.15794
[^9]: Arora, S. et al. (2023). "Zoology: Measuring and Improving Recall in Efficient Language Models." *arXiv:2312.04927*. https://arxiv.org/abs/2312.04927
[^10]: De, S. et al. (2024). "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models." *arXiv:2402.19427*. https://arxiv.org/abs/2402.19427
[^11]: Yan, J.N. et al. (2024). "MambaByte: Token-free Selective State Space Model." COLM 2024. *arXiv:2401.13660*. https://arxiv.org/abs/2401.13660
[^12]: Feng, L. et al. (2024). "Were RNNs All We Needed?" *arXiv:2410.01201*. https://arxiv.org/abs/2410.01201