# Hybrid Models: The Best of Both Worlds

> See [[index]], [[transformers-basics]], [[ssm-basics]], [[real-world-products]], [[computational-complexity]], [[strengths-and-weaknesses]]

---

## Why Hybrids Exist

Pure Transformers: excellent reasoning and exact recall, but quadratic cost. Pure SSMs: fast and efficient, but weaker at precise recall and in-context learning.

**Hybrids** interleave attention layers (precision) with SSM layers (efficiency) — getting most of the reasoning quality at a fraction of the cost.[^1]

The key insight from the Jamba paper: **you don't need attention everywhere**. A few attention layers scattered among many SSM layers maintain quality while drastically reducing memory and compute requirements.[^1]

---

## Key Hybrid Architectures

### Jamba (AI21 Labs, March 2024)

**Architecture**: Interleaved Transformer blocks + Mamba blocks + Mixture-of-Experts (MoE)

| Property | Value |
|----------|-------|
| Total params | 52B |
| Active params (per token) | 12B |
| Context window | 256K tokens |
| GPU requirement | 1 × 80GB A100 |
| Comparable Transformer | Requires ~160GB (2 GPUs) |

**Key ratio**: 1 Transformer block for every 7 Mamba blocks. This is the "sweet spot" AI21 found experimentally.[^1]

**MoE integration**: Mixture-of-Experts adds capacity without a proportional compute cost — each token activates only a subset of experts.

**Why it matters**: Jamba was the first large hybrid to publicly demonstrate that a 256K-token context can be served on a single 80GB GPU.

---

### Griffin + Hawk (Google DeepMind, February 2024)

**Hawk**: Pure gated linear recurrence (SSM family). Exceeds Mamba on downstream tasks.[^2]

**Griffin**: Mixes Hawk's recurrent layers with **local attention** (attention over small windows, not the full context). Matches Llama-2 quality despite training on 6× fewer tokens.[^2]

| Model | Architecture | Params | Key Result |
|-------|-------------|--------|-----------|
| Hawk | Pure recurrent | Up to 14B | Beats Mamba |
| Griffin | Recurrent + local attention | Up to 14B | Matches Llama-2 |

**Hardware efficiency**: Both Hawk and Griffin match Transformer training hardware efficiency (same throughput on the same GPUs), while being significantly cheaper at inference.[^2]

---

### RWKV (Peng et al., 2023–2024)

**Philosophy**: A true RNN that can be trained with Transformer-like parallelism.

RWKV (Receptance Weighted Key Value) reformulates attention so that the same model:

- During training: runs like a Transformer (parallelizable over the sequence)
- During inference: runs like an RNN (O(1) per token)

#### How RWKV Works

RWKV replaces the attention mechanism with a **linear attention** variant that can be expressed as a recurrence:

```
At each timestep t:
    r_t = sigmoid(W_r · x_t + b_r)    # Receptance gate
    k_t = W_k · x_t                   # Key (like an attention key)
    v_t = W_v · x_t                   # Value
    w   = exp(-exp(w_param))          # Time-decay factor in (0, 1), learned per channel

    # Update state:
    num_t = exp(k_t) · v_t + w · num_{t-1}
    den_t = exp(k_t)       + w · den_{t-1}

    output_t = r_t ⊙ (num_t / den_t)  # Receptance-gated output
```

The key innovation: because the weights decay *exponentially* with distance (via `w`), the same computation can be:

- **Trained in parallel** (the weighted sums over past tokens can be computed for all positions at once, as in linear attention)
- **Run as an RNN** (the state is updated one token at a time)
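To make the two modes concrete, here is a minimal, runnable sketch of the recurrent (inference-time) form mirroring the pseudocode above. It is a simplification under stated assumptions (random toy weights, no token-shift, no bonus term for the current token), and the names `rwkv_step` and `decay` are illustrative, not taken from the RWKV codebase:

```python
import numpy as np

def rwkv_step(x_t, state, params):
    """One recurrent step of a simplified RWKV time-mixing layer.

    The state holds (num, den) from the previous step, so each new token costs
    O(d) work and O(d) memory, independent of how many tokens came before.
    """
    W_r, W_k, W_v, decay = params            # decay plays the role of exp(-exp(w)), in (0, 1)
    num, den = state

    r = 1.0 / (1.0 + np.exp(-(W_r @ x_t)))   # receptance gate
    k = W_k @ x_t
    v = W_v @ x_t

    num = np.exp(k) * v + decay * num        # decayed running numerator
    den = np.exp(k) + decay * den            # decayed running denominator

    out = r * (num / (den + 1e-8))           # receptance-gated output
    return out, (num, den)

# Toy usage: process a sequence token by token with constant memory.
d = 8
rng = np.random.default_rng(0)
params = (rng.normal(size=(d, d)) * 0.1,        # W_r
          rng.normal(size=(d, d)) * 0.1,        # W_k
          rng.normal(size=(d, d)) * 0.1,        # W_v
          np.exp(-np.exp(rng.normal(size=d))))  # per-channel decay in (0, 1)
state = (np.zeros(d), np.zeros(d))
for x_t in rng.normal(size=(16, d)):            # 16 toy "tokens"
    out, state = rwkv_step(x_t, state, params)
```

Because the loop only carries `(num, den)` forward, memory stays constant no matter how long the sequence gets, which is the property that makes RWKV attractive for CPU and edge inference.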
#### RWKV-6 (Finch Architecture, 2024)

RWKV-6 introduced **data-dependent (input-dependent) decay weights** — closer in spirit to Mamba's selective state spaces. Where RWKV-4/5 used fixed exponential decay, RWKV-6 gates the decay on the current input token, giving the model more control over what to remember vs. forget and closing much of the remaining quality gap to Transformers on language tasks.[^3]

| Version | Architecture Name | Key Improvement |
|---------|------------------|-----------------|
| RWKV-4 | Original RWKV | Fixed time decay |
| RWKV-5 | Eagle | Token-shift improvements |
| RWKV-6 | Finch | Data-dependent decay (like Mamba selectivity) |

**What makes RWKV different from Mamba**:

- RWKV is explicitly designed to look like a Transformer at training time, which makes Transformer-style fine-tuning tooling such as LoRA adapters easy to reuse
- RWKV alternates **time-mixing** and **channel-mixing** layers, vs. Mamba's single repeated SSM block
- RWKV runs well on **CPU** — important for edge deployment without GPUs[^3]

**Unique value**: Runs effectively on CPU, making it accessible without GPUs.

---

### Zamba (Zyphra, May 2024)

**Architecture**: Mamba backbone with a single shared Transformer attention layer, reused at multiple depths.

**Key innovation**: The attention layer is "shared" (the same weights are applied at several positions in the stack), saving memory compared to giving each depth its own attention layer. This lets a 7B model gain hybrid-style precision while staying cheap to run.[^4]

**Training**: 1T tokens from openly available datasets, using a two-phase pretraining approach:

1. **Phase 1**: Standard pretraining on web datasets
2. **Phase 2**: Annealing over high-quality instruct and synthetic datasets with rapid learning-rate decay.[^4]

**Architecture diagram** (conceptual):

```
Input tokens
     │
[Mamba block] ×N     ← bulk processing (cheap, linear)
     │
[Shared Attention]   ← precision anchor (reuses same weights)
     │
[Mamba block] ×N
     │
[Shared Attention]   ← same weights again (minimal overhead)
     │
[Mamba block] ×N
     │
Output
```

The single shared attention block is inserted at regular intervals through the Mamba backbone. Because its weights are *shared* (not duplicated), it adds minimal parameter overhead while giving the model precise in-context retrieval capability.

**Benchmark results** (HuggingFace Open LLM Leaderboard v1):[^7]

| Benchmark | Zamba-7B-v1 | Mamba-7B-rw | Mistral-7B-v0.1 | LLaMA-3-8B |
|-----------|------------|------------|----------------|-----------|
| ARC | 56.1 | 51.3 | 60.0 | 60.2 |
| HellaSwag | 82.2 | 80.9 | 83.3 | 82.2 |
| MMLU | 58.1 | 33.4 | 64.2 | 66.7 |
| Winogrande | 79.9 | 71.1 | 78.4 | 78.5 |
| TruthfulQA | 52.9 | 32.1 | 42.2 | 42.9 |
| GSM8K | 30.8 | 4.7 | 37.8 | 45.2 |
| **Average** | **60.0** | **45.5** | **61.0** | **62.6** |

**Why it matters**: Zamba-7B outperforms the pure Mamba-7B by +14.5 points on average, getting close to Mistral-7B-v0.1 with a much simpler hybrid architecture. The shared attention layer is the key differentiator.

---

### RetNet (Microsoft, 2023)

**Philosophy**: Generalizes the Transformer to support three computation modes:

- Parallel training (like a Transformer)
- Recurrent inference (like an RNN/SSM)
- Chunkwise processing (a hybrid of the two)

RetNet introduces a "retention" mechanism that exponentially decays attention weights with distance — mathematically equivalent to an SSM in certain formulations.[^5]

---

### Mamba-2 and State Space Duality (ICML 2024)

**Paper**: "Transformers are SSMs: Generalized Models and Efficient Algorithms through Structured State Space Duality" (Dao & Gu, 2024)[^8]

The central contribution of Mamba-2 is a theoretical framework called **State Space Duality (SSD)** — a proof that SSMs and certain forms of attention are two ways of computing the same mathematical operation.

#### Plain-English Explanation of State Space Duality

Imagine you're summarizing a book:

- **The SSM approach**: Read page by page, updating a running summary (your "state"). Fast, but details can blur.
- **The attention approach**: At any point, look back at all pages and weigh which ones matter. Precise, but slow.

The SSD theorem says: *these two strategies are mathematically dual*. For a special class of matrices (semiseparable structured matrices), you can compute the same result either way — as a recurrence *or* as a kind of attention.

The key insight:

> **SSMs ≈ attention with structured (low-rank) weights, operating under an exponential-decay bias**

This means:

1. You can borrow algorithmic tricks from attention (like FlashAttention-style hardware-aware kernels) to speed up SSMs
2. Hybrid models can smoothly interpolate between the two modes
3. The "weakness" of SSMs (forgetting) and the "weakness" of attention (quadratic cost) are dual constraints — no free lunch, but you can choose your tradeoff

**Practical result**: Mamba-2's core SSD layer is **2–8× faster** than Mamba-1's selective scan on modern hardware, while matching Mamba-1's language-modeling quality.[^8]
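The duality itself can be checked numerically in a few lines. The sketch below is a toy (one channel, scalar per-step decay, variable names of my own choosing rather than the paper's notation): it computes the same outputs once as an SSM-style recurrence and once as a masked, decay-weighted, attention-style matrix product:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 6                                  # toy sequence length
k = rng.normal(size=T)                 # "key"-like scalar input per step
v = rng.normal(size=T)                 # "value"-like scalar input per step
a = rng.uniform(0.5, 0.99, size=T)     # per-step decay factors (the SSM's A_t, scalars here)

# View 1 (SSM): recurrence h_t = a_t * h_{t-1} + k_t * v_t, output y_t = h_t.
h = 0.0
y_recurrent = np.zeros(T)
for t in range(T):
    h = a[t] * h + k[t] * v[t]
    y_recurrent[t] = h

# View 2 (attention): y = M @ v with M[t, s] = k_s * a_{s+1} * ... * a_t for s <= t.
# M is a lower-triangular (1-semiseparable) matrix: "attention weights" with built-in decay.
M = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        M[t, s] = k[s] * np.prod(a[s + 1 : t + 1])

y_attention = M @ v

print(np.allclose(y_recurrent, y_attention))   # True: both views compute the same outputs
```

The attention view materializes a T×T matrix (quadratic, but matmul-friendly); the recurrence is linear but sequential. Mamba-2's SSD algorithm processes the sequence in chunks to combine the two, which is what enables the reported 2–8× speedup.[^8]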
---

### MambaFormer / Mamba-in-the-Middle

Various research papers have explored systematically mixing Mamba and Transformer layers. Key findings:

1. **Attention at every 4th layer** is enough to maintain strong in-context learning
2. SSM layers handle bulk processing efficiently
3. Attention layers provide "anchor points" for precise retrieval
4. The optimal ratio varies by task domain

---

## The Architecture Spectrum

```
Pure RNN        SSM           Hybrid            Sparse Attn        Dense Transformer
    │            │               │                   │                     │
  LSTM         Mamba       Jamba/Griffin       Mistral/Llama             GPT-4
  RWKV          S4             RetNet            Longformer              Claude
(recurrent)   (linear)     (interleaved)      (local windows)       (full attention)

◄── More efficient inference                          More precise recall ──►
◄── Constant memory                                     Growing KV cache ──►
◄── Weaker in-context learning                              Stronger ICL ──►
```

---

## When to Use Which

| Use Case | Recommended Architecture | Why |
|----------|--------------------------|-----|
| Long documents (>100K tokens) | Hybrid or SSM | Quadratic cost prohibitive |
| Short-form chat (<4K tokens) | Transformer | SSM advantage negligible |
| Streaming/real-time inference | SSM | O(1) per token |
| Few-shot prompting | Transformer | Better in-context learning |
| Audio/genomics/time series | SSM | Long sequences, continuous data |
| General LLM deployment | Hybrid | Best trade-off |
| Edge/CPU deployment | RWKV | CPU-friendly RNN inference |

---

## Future Directions

- **Mamba-2** (2024): Established State Space Duality — SSMs are a restricted form of attention, enabling new hybrid designs
- **Research direction**: How many attention layers are actually needed? Early results suggest very few (1-in-8) suffice
- **Hardware co-design**: Custom chips optimized for SSM recurrence (vs. GPU attention)
- **Multimodal hybrids**: Applying SSM efficiency to vision, audio, and video alongside language

---

## Quality Comparison: Hybrid vs Pure (MMLU Scores, ~7–9B Scale)

All models evaluated on MMLU (5-shot).
Sources: HuggingFace Open LLM Leaderboard and individual papers.[^7][^6][^2]

| Model | Type | MMLU | Avg (v1 leaderboard) | Notes |
|-------|------|------|---------------------|-------|
| **Mamba-2-Hybrid-8B** | Hybrid (43% SSM + 7% attn) | ~66–68* | +2.65 vs Transformer | NVIDIA study, 3.5T tokens |
| **Falcon Mamba-7B** | Pure SSM | 62.1 | 64.1 | First competitive pure SSM at 7B |
| **Zamba-7B-v1** | Hybrid (SSM + shared attn) | 58.1 | 60.0 | 1T tokens, 2-phase training |
| **RecurrentGemma-9B** | Hybrid (Griffin-style) | 60.5 | 58.0 | Google DeepMind |
| Mistral-7B-v0.1 | Transformer | 64.2 | 61.0 | Reference Transformer |
| LLaMA-3-8B | Transformer | 66.7 | 62.6 | Meta, strong reference |
| gemma-7B | Transformer | 64.6 | 63.8 | Google |
| Mamba-7B-rw | Pure SSM | 33.4 | 45.5 | Early large Mamba, poor MMLU |

\* Exact MMLU not reported in the NVIDIA paper; derived from the aggregate +2.65-point gain over the Transformer baseline.

**Pattern**: Pure SSMs have historically struggled with MMLU (knowledge retrieval); Falcon Mamba-7B broke this pattern. Hybrids with attention layers consistently close the gap to the Transformer baseline on knowledge tasks, or exceed it.

---

## How Do You Decide?

*When to use each architecture in practice.* See also [[strengths-and-weaknesses]].

### Decision flowchart

```
Does your use case require >32K tokens of context?
├── Yes → Is quality critical (legal, medical, research)?
│         ├── Yes → Hybrid (Jamba, Griffin, Mamba-2-Hybrid)
│         └── No  → Pure SSM (Falcon Mamba, Codestral Mamba)
└── No  → Is latency / cost critical?
          ├── Yes → Hybrid or Transformer-MoE (Mixtral, DeepSeek-V3)
          └── No  → Dense Transformer (GPT-4, Claude, LLaMA-3)
```

### Practical decision table

| Situation | Best Choice | Key Reason |
|-----------|------------|------------|
| Summarizing long documents (>50K words) | Hybrid or SSM | Quadratic attention prohibitive |
| Real-time chat / streaming | SSM (Mamba, Falcon) | Constant-time generation |
| Few-shot learning from examples in the prompt | Transformer | Better in-context learning |
| Phonebook lookup / exact retrieval in context | Transformer or Hybrid | SSMs can lose precise details |
| Code completion (autocomplete) | Codestral Mamba or Transformer | SSM handles long files well |
| Production API at scale (cost per token) | Hybrid | 8× inference speedup, good quality |
| Edge / on-device (no GPU) | RWKV-6 | CPU-efficient recurrent inference |
| Domain fine-tuning with Transformer-style tooling | RWKV | LoRA workflows carry over |
| Genomics / audio / long time series | SSM | Natural fit for continuous sequences |
| Research into architecture tradeoffs | Mamba-2 or Hybrid | Best-studied design space (2024) |

### The attention ratio question

Multiple papers have converged on a similar answer: **you only need attention at roughly 1 in 8 layers** to recover most of the quality of a pure Transformer. The rest can be SSM layers, dramatically reducing memory and compute:

| Architecture | Attention ratio | Quality vs Transformer |
|-------------|-----------------|----------------------|
| Pure Transformer | 100% attention | Baseline |
| Jamba | ~12.5% (1:7) | Competitive at 256K ctx |
| Mamba-2-Hybrid | ~7% attention | **+2.65 pts** over Transformer |
| Zamba | ~5% (1 shared block) | -2.6 pts vs LLaMA-3-8B |
| Pure Mamba | 0% attention | -4 to -14 pts on MMLU |

**Insight**: A small attention budget buys most of the quality. The crossover point where adding more attention stops helping appears to be around 10–15% of layers.[^6][^4]
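To see why a small attention budget matters so much at serving time, here is a back-of-the-envelope KV-cache calculation. The dimensions (32 layers, 8 KV heads of size 128, 16-bit cache entries) are assumptions chosen to resemble a 7–8B model, not the published configuration of any system above:

```python
def kv_cache_gib(n_attn_layers, seq_len, n_kv_heads=8, head_dim=128, bytes_per_val=2):
    """KV-cache size for one sequence: 2 (K and V) * layers * heads * head_dim * tokens."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_val / 1024**3

TOTAL_LAYERS = 32            # assumed depth of a ~7-8B model
SEQ_LEN = 256_000            # Jamba-scale context length

for name, attn_fraction in [("Pure Transformer", 1.0),
                            ("1-in-8 hybrid", 1 / 8),
                            ("Pure SSM", 0.0)]:
    attn_layers = round(TOTAL_LAYERS * attn_fraction)
    print(f"{name:16s}: {attn_layers:2d} attention layers -> "
          f"{kv_cache_gib(attn_layers, SEQ_LEN):5.1f} GiB KV cache at 256K tokens")
```

Under these assumptions the 1-in-8 hybrid carries roughly 4 GiB of KV cache at 256K tokens versus roughly 31 GiB for the pure Transformer, while SSM layers add only a fixed-size state that does not grow with context length. This is roughly the same order of effect the Jamba paper reports for its memory footprint.[^1]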
"Jamba: A Hybrid Transformer-Mamba Language Model." *arXiv:2403.19887*. https://arxiv.org/abs/2403.19887 [^2]: De, S. et al. (2024). "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models." *arXiv:2402.19427*. https://arxiv.org/abs/2402.19427 [^3]: Peng, B. et al. (2023). "RWKV: Reinventing RNNs for the Transformer Era." *arXiv:2305.13048*. https://arxiv.org/abs/2305.13048 [^4]: Glorioso, P. et al. (2024). "Zamba: A Compact 7B SSM Hybrid Model." Zyphra. (Original footnote — note: correct arXiv ID is 2405.16712) [^4b]: Anthony, Q. et al. (2024). "Zamba: A Compact 7B SSM Hybrid Model." *arXiv:2405.16712*. https://arxiv.org/abs/2405.16712 [^5]: Sun, Y. et al. (2023). "Retentive Network: A Successor to Transformer for Large Language Models." *arXiv:2307.08621*. [^6]: Waleffe, R. et al. (2024). "An Empirical Study of Mamba-based Language Models." *arXiv:2406.07887*. NVIDIA / Megatron-LM. https://arxiv.org/abs/2406.07887 [^7]: HuggingFace Open LLM Leaderboard v1 & v2 results. https://huggingface.co/blog/falconmamba (aggregated from lm-evaluation-harness) [^8]: Gu, A. & Dao, T. (2024). "Transformers are SSMs: Generalized Models and Efficient Algorithms through Structured State Space Duality." *arXiv:2405.21060*. ICML 2024. https://arxiv.org/abs/2405.21060 [^9]: De, S. et al. (2024). "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models." *arXiv:2402.19427*.