# Applications and Use Cases: Where Each Architecture Wins

> [[index]] · [[ssm-basics]] · [[transformers-basics]] · [[strengths-and-weaknesses]] · [[real-world-products]]

## The Key Insight

The architecture you choose should match the **nature of your data and task**:

- Does your task require exact recall of specific facts? → Transformer
- Does your task require processing very long sequences efficiently? → SSM
- Does your task require streaming/real-time inference? → SSM
- Both? → Hybrid

## Language Modeling

### Where Transformers Excel

| Task | Why Transformer Wins | Example |
|------|---------------------|---------|
| Fact lookup | Exact associative recall | "What is the capital of France?" |
| Code with distant references | Tracks a variable defined 1,000 lines ago | GitHub Copilot |
| Document QA | "What did paragraph 3 say about X?" | RAG systems |
| Few-shot learning | Reads and reuses examples in context | GPT-4 demos |
| Reasoning chains | Each step must attend to prior steps | Chain-of-thought |

### Where SSMs Are Competitive

| Task | Why SSM Is Good Enough | Example |
|------|----------------------|---------|
| Long-context summarization | Gist is enough; no exact lookup needed | "What was this 100K-word document about?" |
| Chat with long history | Conversational gist > exact recall | Casual chat assistants |
| Token-free (byte) LMs | SSMs scale to byte-level 1M+ token contexts | MambaByte[^1] |
| Streaming chat | Low latency, O(1) per token | Edge deployment |

## Genomics and Biology: SSM's Killer Application

> [!NOTE] Why Genomics Is Perfect for SSMs
> DNA sequences are **enormously long** (the human genome is ~3 billion base pairs) and call for **local pattern matching** more than long-range exact recall. Genomic modeling rarely poses an associative-recall problem: the important signals are local motifs, not precise cross-references.

### HyenaDNA[^2]

- **Context**: Up to **1 million nucleotides** at single-nucleotide resolution
- **Compare**: Earlier Transformer-based genomic models maxed out around 4,096 tokens (<0.001% of the human genome)
- **Speed**: Trains 160× faster than a Transformer at the same sequence length
- **Quality**: SotA on 12/18 Nucleotide Transformer benchmarks; 7/8 GenomicBenchmarks (+10 pts avg)
- **Architecture**: Hyena (implicit-convolution SSM)

### Evo[^3]

- Multi-modal DNA/RNA/protein model
- 7B parameters, trained on 300B genomic tokens
- Can generate functional gene sequences from scratch

### Caduceus

- Bidirectional Mamba for genomics
- Exploits SSM efficiency for very-long-sequence genomic modeling

> [!TIP] The Genomics Metaphor for the Report
> Reading a DNA sequence is like reading a novel in a language you're learning — you need to track patterns and motifs as they flow past, not look up what was on page 42. SSMs are like a very attentive reader taking notes; Transformers would be taking a photo of every page.
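To make the context-length gap concrete, here is a quick back-of-the-envelope check of the figures quoted above. It is a sketch only: the ~3-billion-base-pair genome length is an approximation, and the 4,096-token figure stands in for earlier Transformer genomic models.

```python
# How much of the human genome fits in one context window?
# Figures match those quoted above; the genome length is approximate.

GENOME_BP = 3_000_000_000   # ~3 billion base pairs (human genome, approximate)
TRANSFORMER_CTX = 4_096     # typical context of earlier Transformer genomic models
HYENADNA_CTX = 1_000_000    # HyenaDNA context at single-nucleotide resolution

for name, ctx in [("Transformer (4K)", TRANSFORMER_CTX), ("HyenaDNA (1M)", HYENADNA_CTX)]:
    print(f"{name:16s}: {ctx:>9,} nt = {ctx / GENOME_BP:.5%} of the genome")

# Transformer (4K):     4,096 nt = 0.00014% of the genome  (the "<0.001%" quoted above)
# HyenaDNA (1M)   : 1,000,000 nt = 0.03333% of the genome
```

Even HyenaDNA's million-nucleotide window covers well under 0.1% of the genome; the point is that it sees roughly 250× more of it per window than a 4K-token Transformer does.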
## Audio and Speech

### Why Audio Suits SSMs

- Audio is a **continuous time-series signal**
- One second of audio at 44.1 kHz = 44,100 samples
- Relevant patterns are local (phonemes) or medium-range (words)
- Perfect for state compression — no need for exact lookup

### Examples

| Model | Architecture | Application |
|-------|-------------|-------------|
| MambaAudio | Mamba | Audio classification, separation |
| Mamba TTS | Mamba | Text-to-speech synthesis |
| AudioMamba | Mamba | Audio event detection |
| S4 (original) | S4 | Speech Commands, Pathfinder |

> [!NOTE] The S4 Audio Achievement
> S4 was the **first model to solve the Speech Commands and long-range audio tasks** with high accuracy. These tasks require modeling raw audio sequences of thousands of samples — this is where SSMs first showed their dominance over Transformers.

## Video Understanding

### VideoMamba[^4]

- SSM-based video understanding
- O(n) complexity over spatial + temporal dimensions
- Unlike ViT-based video models: no quadratic attention over all frame patches

### Why Video Suits SSMs

- Video = sequences of image frames = extremely long sequences
- A 1-minute video at 30 fps = 1,800 frames
- Full Transformer attention over all patches of all frames: *catastrophically expensive*
- SSMs track temporal state efficiently

## Time Series and Signal Processing

| Domain | Model | Key Advantage |
|--------|-------|--------------|
| Forecasting | Mamba time-series variants | Long horizons, efficient training |
| Medical signals (ECG, EEG) | SSM variants | Continuous signal modeling |
| Industrial IoT | Mamba | Streaming, low-latency prediction |
| Finance | SSMs (experimental) | Long history, efficient rolling window |

## On-Device and Edge Deployment

> [!TIP] Why SSMs Matter for Edge
> The key property: **O(1) memory per generation step** vs O(n) KV-cache growth for Transformers.
>
> A smartphone can't hold a 100K-token KV cache (multiple GB). An SSM needs only its fixed recurrent state (constant in size, megabytes at most) regardless of context length. This is why SSMs may power **truly long-running AI assistants** on mobile devices. A rough sizing sketch follows below.
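To put rough numbers on the memory argument, here is a minimal sizing sketch. The layer counts, head counts, and state dimensions below are illustrative assumptions for a roughly 7B-parameter model, not the published configuration of any specific system; the point is the growth pattern, not the exact values.

```python
# Back-of-the-envelope memory comparison for long-context generation on device.
# All model dimensions are illustrative assumptions (roughly 7B scale), fp16 weights.

BYTES_FP16 = 2

def transformer_kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128):
    """KV cache grows linearly with context: K and V tensors per layer, per token."""
    return 2 * layers * kv_heads * head_dim * BYTES_FP16 * tokens

def ssm_state_bytes(layers=64, d_inner=8192, d_state=16):
    """Recurrent state is fixed-size: one (d_inner x d_state) matrix per layer."""
    return layers * d_inner * d_state * BYTES_FP16

for n in (4_096, 32_768, 100_000):
    kv = transformer_kv_cache_bytes(n)
    print(f"{n:>7,} tokens: KV cache ≈ {kv / 1e9:5.2f} GB, SSM state ≈ {ssm_state_bytes() / 1e6:.0f} MB")

# Example output (with these assumed dimensions):
#   4,096 tokens: KV cache ≈  0.54 GB, SSM state ≈ 17 MB
#  32,768 tokens: KV cache ≈  4.29 GB, SSM state ≈ 17 MB
# 100,000 tokens: KV cache ≈ 13.11 GB, SSM state ≈ 17 MB
```

Under these assumptions the KV cache passes 10 GB by 100K tokens while the recurrent state stays at a constant ~17 MB, which is the asymmetry the callout above describes.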
## Scientific and Specialized Domains

| Domain | SSM Application | Why |
|--------|----------------|-----|
| Drug discovery | Molecular sequence modeling | Very long SMILES strings |
| Protein folding | Sequence + structure SSMs | Long amino acid chains |
| Climate modeling | Long time-series SSMs | Multi-year temporal data |
| Code generation (streaming) | Mamba for code | Token-by-token, long files |

## Where Transformers Will Likely Remain Dominant

1. **Complex reasoning tasks** — require exact tracking of logical steps
2. **Multi-document QA** — needs to reference specific passages
3. **Code with complex dependencies** — function calls, imports, distant variables
4. **Few-shot learning / in-context learning** — requires reading and reusing examples from context
5. **Large-scale pre-training** — the Transformer stacks behind GPT-4 and Claude are proven and well-tooled

## The Competitive Landscape (2025)

```
GPT-4o, Claude 3.5, Gemini 1.5  → Pure Transformer (with efficient attention variants)
Falcon Mamba 7B, Mamba-2        → Pure SSM (competitive at 7B+ scale)
Jamba, Griffin, Zamba           → Hybrid (SSM + attention blocks)
HyenaDNA, Evo                   → SSM for genomics
AudioMamba, VideoMamba          → SSM for audio/video
RWKV-6                          → Linear Transformer / SSM hybrid
```

[^1]: Wang et al. (2024). "MambaByte: Token-free Selective State Space Model." COLM 2024. arXiv:2401.13660.
[^2]: Nguyen et al. (2023). "HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution." NeurIPS 2023 (Spotlight). arXiv:2306.15794.
[^3]: Arc Institute (2024). "Sequence Modeling and Design from Molecular to Genome Scale with Evo." bioRxiv.
[^4]: Chen et al. (2024). "VideoMamba: State Space Model for Efficient Video Understanding." arXiv.