# Diagrams and Visuals for the Final Report

> Reference for all diagrams to include in the final report.
> See [[index]], [[analogies-and-intuitions]], [[transformers-basics]], [[ssm-basics]]

---

## Diagram 1: The Architecture Spectrum

```
╔════════════════════════════════════════════════════════════════════╗
║                     THE SEQUENCE MODEL SPECTRUM                    ║
╠══════════════════╦═══════════════╦═══════════════╦════════════════╣
║    RECURRENT     ║      SSM      ║    HYBRID     ║  TRANSFORMER   ║
║   (RNN/LSTM)     ║    (Mamba)    ║(Jamba/Griffin)║  (GPT/Claude)  ║
╠══════════════════╬═══════════════╬═══════════════╬════════════════╣
║ Sequential       ║ Sequential    ║ Mixed         ║ Parallel       ║
║ O(1) inference   ║ O(1) inference║ O(k) inference║ O(N) inference ║
║ O(N) training    ║ O(N) training ║ O(N) training ║ O(N²) training ║
║ Forgets easily   ║ Smart compress║ Balanced      ║ Perfect recall ║
║ Hard to train    ║ Easy to train ║ Easy to train ║ Easy to train  ║
╚══════════════════╩═══════════════╩═══════════════╩════════════════╝
 ◄─────────── more efficient inference            more precise ─►
```

---

## Diagram 2: Attention Visualization (ASCII Heat Map)

```
SELF-ATTENTION: "The animal didn't cross the street because IT was too tired"

Tokens:   The  animal didn't cross  the  street because  it   was  too  tired
        ┌─────────────────────────────────────────────────────────────────┐
The     │ ■■■  □□□   □□□   □□□   □□□   □□□    □□□    □□□   □□□  □□□  □□□ │
animal  │ □□□  ■■■   □□□   □□□   □□□   □□□    □□□    □■□   □□□  □□□  □□□ │
didn't  │ □□□  □□□   ■■■   □□□   □□□   □□□    □□□    □□□   □□□  □□□  □□□ │
cross   │ □□□  □■□   □□□   ■■■   □□□   □□□    □□□    □□□   □□□  □□□  □□□ │
the     │ □■□  □□□   □□□   □□□   ■■■   □□□    □□□    □□□   □□□  □□□  □□□ │
street  │ □□□  □□□   □□□   □□□   □■□   ■■■    □□□    □□□   □□□  □□□  □□□ │
because │ □□□  □□□   □□□   □□□   □□□   □□□    ■■■    □□□   □□□  □□□  □□□ │
IT      │ □□□  ■■■   □□□   □□□   □□□   □□□    □□□    ■■■   □□□  □□□  □□□ │
        │       ↑ high attention                                          │
        │  "it" attends strongly to "animal"                              │
        └─────────────────────────────────────────────────────────────────┘
■■■ = high attention   □■□ = medium   □□□ = low
```

---

## Diagram 3: Quadratic Explosion (Mermaid Bar Chart Concept)

```mermaid
xychart-beta
    title "Attention Operations vs Sequence Length (O(n²))"
    x-axis ["100 tokens", "1K tokens", "10K tokens", "100K tokens"]
    y-axis "Operations (millions)" 0 --> 10000
    bar [0.01, 1, 100, 10000]
```

*Or as an ASCII chart:*

```
Operations vs Context Length (Transformer Attention)

10B ┤                                        █
    │                                        █
 1B ┤                                        █
    │                                        █
100M┤                               █        █
    │                               █        █
10M ┤                               █        █
    │                      █        █        █
 1M ┤                      █        █        █
    │             █        █        █        █
100k┤    █        █        █        █        █
      100       1K       10K      100K       1M
     tokens   tokens   tokens   tokens    tokens

Same chart for SSM (linear):

10B ┤
    │
 1B ┤
    │
100M┤
    │
10M ┤
    │
 1M ┤                                        █
100k┤    █        █        █        █        █
      100       1K       10K      100K       1M      10M
     tokens                      (flat, linear growth)
```

---

## Diagram 4: Transformer Block (Mermaid)

```mermaid
graph TD
    A["🔤 Input Tokens\n'The cat sat...'"] --> B["📊 Token Embeddings\n(numbers representing each word)"]
    B --> C["📍 + Positional Encoding\n(numbered seats in a theater)"]
    C --> D["🔍 Multi-Head Self-Attention\n(every word looks at every other word)"]
    D --> E["➕ Add & Normalize"]
    E --> F["🧮 Feed-Forward Network\n(per-token computation)"]
    F --> G["➕ Add & Normalize"]
    G --> H["🔄 Repeat N times\n(12-96 layers in large models)"]
    H --> I["📤 Output Predictions"]

    style D fill:#ff9999
    style H fill:#99ccff
```

---

## Diagram 5: SSM State Update

```mermaid
graph LR
    A["📝 Previous State\n(compressed memory)"] --> C["🔄 State Update\nA × state + B × input"]
    B["📥 New Token\n'elephant'"] --> C
    C --> D["📝 New State\n(updated memory)"]
    D --> E["📤 Output\nC × state"]
    D --> F["➡️ Pass to next step"]

    style C fill:#99ff99
    style D fill:#ffff99
```

---
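To make Diagram 5 concrete, here is the update rule as a few lines of NumPy — a minimal sketch with tiny random matrices (sizes and values are illustrative only, not from any trained model):

```python
# Diagram 5 in code: new_state = A·state + B·input, output = C·state.
import numpy as np

rng = np.random.default_rng(0)
d_state = 4
A = rng.normal(size=(d_state, d_state)) * 0.3   # state transition ("forget/mix")
B = rng.normal(size=d_state)                    # how the input enters the state
C = rng.normal(size=d_state)                    # how the state is read out

state = np.zeros(d_state)
for token_value in [0.5, -1.2, 0.8]:            # stand-ins for embedded tokens
    state = A @ state + B * token_value         # Diagram 5's "State Update"
    output = C @ state                          # Diagram 5's "Output"
    print(output.round(3))                      # the state never changes size
```

---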
## Diagram 6: The Three Modes of an SSM

```
╔══════════════════════════════════════════════════════════════╗
║                    THREE WAYS TO USE AN SSM                  ║
╠══════════════════╦════════════════════╦═════════════════════╣
║  RECURRENT MODE  ║   CONVOLUTIONAL    ║    PARALLEL SCAN    ║
║   (Inference)    ║  MODE (Training)   ║   (Fast Training)   ║
╠══════════════════╬════════════════════╬═════════════════════╣
║ x₁→h₁→x₂→h₂→     ║ Output = Input *   ║ Compute all states  ║
║ x₃→h₃→...        ║ learned_kernel     ║ in parallel via     ║
║                  ║ (like image filter)║ tree structure      ║
╠══════════════════╬════════════════════╬═════════════════════╣
║ ✅ O(1) memory   ║ ✅ Fully parallel  ║ ✅ O(N) work        ║
║ ✅ Fast inference║ ✅ Fast training   ║ ✅ Best of both     ║
║ ❌ Sequential    ║ ❌ Fixed kernel    ║ ✅ Used in practice ║
╚══════════════════╩════════════════════╩═════════════════════╝
```

---

## Diagram 7: Memory Trade-off Illustration

```
TRANSFORMER MEMORY (KV Cache grows with context):

Context: 100 tokens     ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░  (small)
Context: 1000 tokens    ████████████████████░░░░░░░░░░░░  (medium)
Context: 8000 tokens    ████████████████████████████████  (full!)
Context: 32000 tokens   ████████████████████████████████  (overflow!)
                        → Need bigger GPU

SSM MEMORY (State stays the same size):

Context: 100 tokens     ████   ← fixed state size
Context: 1000 tokens    ████   ← same
Context: 8000 tokens    ████   ← same
Context: ∞ tokens       ████   ← STILL the same!
                        → Same GPU works forever
```

---

## Diagram 8: The Pareto Frontier

```
 Quality
 (accuracy)
    ↑
High│   ⭐ Transformer (GPT-4)
    │          ★ Hybrid (Jamba)
    │                 🐍 Mamba
    │                       ⚡ RWKV
    │                            🔁 LSTM
Low │─────────────────────────────────→
      Low            High      Very High
             Efficiency (speed, memory)
```

---

## Diagram 9: Selective State Spaces (Mamba's Key Innovation)

```
VANILLA SSM: Every token treated the same

 "the" → [same A, same B] → state update
 "cat" → [same A, same B] → state update
 "sat" → [same A, same B] → state update
 ". "  → [same A, same B] → state update  ← Punctuation gets the same
                                             weight as "cat"!

MAMBA (Selective SSM): State update depends on the token itself

 "the" → [A(the), B(the)] → state update  ← "forget quickly, minor word"
 "cat" → [A(cat), B(cat)] → state update  ← "remember well, key noun!"
 "sat" → [A(sat), B(sat)] → state update  ← "keep in state, action word"
 ". "  → [A(.), B(.)]     → state update  ← "sentence boundary, moderate"

The model LEARNS what kinds of tokens deserve big B (write to memory)
vs small B (skip past quickly).
```

---

## Diagram 10: Hybrid Model Architecture (Jamba)

```
Jamba Layer Stack:

┌─────────────────────┐
│  Mamba Block        │  ← efficient, O(1) inference
│  Mamba Block        │
│  Mamba Block        │
│  Mamba Block        │
│  Mamba Block        │
│  Mamba Block        │
│  Mamba Block        │
│ ┌───────────────┐   │
│ │Transformer    │   │  ← precise attention for complex reasoning
│ │Attention Block│   │    (only 1 in every 7-8 blocks!)
│ └───────────────┘   │
│  Mamba Block        │
│  Mamba Block        │
│  ...                │
└─────────────────────┘

= 256K context window, with ~140K tokens of context fitting on a single 80GB GPU
```

---

## Notes for Report Formatting

- Diagrams 2 (attention heat map) and 9 (selective state spaces) are the MOST IMPORTANT for intuition
- Diagram 7 (memory trade-off) is highly accessible to any reader
- Mermaid diagrams will render in Obsidian natively
- ASCII art diagrams render in any monospace font in Obsidian
- For callouts, use Obsidian `> [!NOTE]` syntax around key insights from diagrams
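A back-of-envelope script for the memory claims in Diagram 7 (and Diagram 12 below). The model shape here (80 layers, 64 heads, head dim 128) is an illustrative 70B-class configuration, not any specific model's config, and real deployments shrink the cache further with grouped-query attention:

```python
# KV-cache bytes grow linearly with context; an SSM's state bytes do not.
def kv_cache_bytes(n_tokens, n_layers=80, n_heads=64, d_head=128, fp_bytes=2):
    # one K and one V tensor per token, per layer, per head (full multi-head
    # cache; GQA/MQA would divide the head count here)
    return 2 * n_tokens * n_layers * n_heads * d_head * fp_bytes

def ssm_state_bytes(n_layers=80, d_model=8192, d_state=16, fp_bytes=2):
    # fixed-size state per layer — independent of how many tokens were read
    return n_layers * d_model * d_state * fp_bytes

for n in [100, 8_000, 100_000]:
    print(f"{n:>7,} tokens | KV cache ≈ {kv_cache_bytes(n) / 1e9:7.2f} GB "
          f"| SSM state ≈ {ssm_state_bytes() / 1e9:.3f} GB")
```

The last row makes Diagram 7's point numerically: the cache crosses into hundreds of gigabytes while the state stays at ~0.02 GB.

---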
**The "black box → open box" reveal**: Alammar starts with a transformer as a solid black rectangle, then progressively opens it to reveal encoders and decoders. This scaffolded reveal prevents overwhelming the reader. 2. **Colour-coded Q/K/V matrices**: Queries are shown in orange, Keys in red/pink, Values in purple — consistently across all diagrams. Colour consistency reduces cognitive load as complexity builds. 3. **The dot-product attention step-by-step strip**: Shows `q₁ · k₁`, `q₁ · k₂`, etc. as individual multiplication operations laid out horizontally before softmax is applied. Makes the "scoring" step concrete. 4. **Multi-head attention recap diagram** (`transformer_multi-headed_self-attention-recap.png`): All 8 heads shown with their separate W^Q/W^K/W^V matrices, the 8 separate Z outputs, the concat step, and the final W^O projection — all in a single visual. This is the best single-diagram summary of multi-head attention in existence. 5. **Positional encoding as a colour grid**: 20 words × 512 dimensions shown as a heatmap. The sine/cosine pattern appears as a visible gradient split down the centre — the pattern itself becomes the intuition. 6. **Residual ("skip") connection arrows**: Drawn as curved arrows bypassing the sub-layer box. Immediately conveys "information can flow around, not just through." **What makes Alammar's style work**: Every new abstraction is introduced with a "zoom out" view first (boxes with labels), then a "zoom in" (actual numbers/vectors). He never introduces matrix math without first showing a 2×2 toy example. --- ### Jay Alammar — Seq2Seq with Attention[^2] The precursor to the Transformer post. Key visual lessons: 1. **The context vector bottleneck**: A single small circle compressing an entire sentence — immediately shows why attention was needed. 2. **"Unrolled" decoder animation concept**: Instead of one recurrent box, show N copies of the decoder side-by-side (one per timestep). This "unrolling" trick is essential for making sequential models comprehensible statically. 3. **Attention scoring animation**: At each decoder step, the encoder hidden states glow with different intensities matching their attention score. This is the canonical "heat" metaphor for attention. 4. **French→English alignment matrix**: A real attention weight grid showing "European Economic Area" ↔ "zone économique européenne" with reversed word order highlighted. Proves the model learned *linguistic structure*, not just word position. **Key pedagogical insight from this post**: Show the *problem* (bottleneck) visually before showing the *solution* (attention). The contrast makes the solution obvious. --- ### Brendan Bycroft — LLM Visualizer (bbycroft.net/llm)[^3] An interactive 3D WebGL visualizer of a small GPT-style transformer. Since it's a JavaScript SPA it cannot be fetched as text, but based on its well-documented design: **What it shows**: - The entire model as a 3D column of "blocks" — you can literally see the depth of layers as physical height - Individual neurons as coloured cubes that activate/deactivate in real time as tokens flow through - The attention pattern rendered as a glowing arc overlay connecting token positions - Residual stream shown as a persistent "spine" running through the whole model **Why it's uniquely effective**: - **Spatial metaphor for abstraction**: Depth in 3D = number of layers = model "depth". Readers who've never thought about neural network layers *physically feel* the difference between a 12-layer and 96-layer model. 
- **Activations are colours, not numbers**: Red = highly active, grey = inactive. No math required to understand "this neuron fired."
- **Token-by-token stepping**: Users click "next token" and watch the entire 3D structure light up sequentially, making causality (input → computation → output) visceral.

**Recommended use in the report**: Reference it as a "living diagram" — include a screenshot of the attention arcs view with a caption pointing readers to the interactive version.

---

### Lilian Weng — Attention? Attention![^4]

Weng's post excels at the *mathematical-visual bridge*. Key approaches:

1. **Formula + diagram side-by-side**: Every equation is immediately followed by a diagram where each variable in the equation is colour-labelled to a box in the diagram.
2. **The attention "family tree"**: A taxonomy diagram showing self-attention, cross-attention, soft vs hard attention as a branching tree — excellent for showing the field's conceptual landscape.

---

## Added Section: New ASCII Diagrams

### Diagram 11: QKV Attention Mechanism — Step by Step

```
COMPUTING SELF-ATTENTION FOR "cat" IN "The cat sat"

STEP 1: Create Q, K, V vectors for each token via learned weight matrices
─────────────────────────────────────────────────────────────────────────
               W_Q             W_K             W_V
            (learned)       (learned)       (learned)

Token "The" → Q_the=[0.2,0.9]  K_the=[0.8,0.1]  V_the=[0.5,0.3]
Token "cat" → Q_cat=[0.9,0.1]  K_cat=[0.3,0.7]  V_cat=[0.7,0.8]
Token "sat" → Q_sat=[0.1,0.4]  K_sat=[0.6,0.2]  V_sat=[0.2,0.6]

STEP 2: Score "cat"'s Query against every Key (dot product)
─────────────────────────────────────────────────────────────────────────
Q_cat · K_the = (0.9×0.8) + (0.1×0.1) = 0.72 + 0.01 = 0.73
Q_cat · K_cat = (0.9×0.3) + (0.1×0.7) = 0.27 + 0.07 = 0.34  ← attending to itself
Q_cat · K_sat = (0.9×0.6) + (0.1×0.2) = 0.54 + 0.02 = 0.56

Raw scores: [0.73, 0.34, 0.56]

STEP 3: Scale and Softmax (turn scores into a probability distribution)
─────────────────────────────────────────────────────────────────────────
Scale by √(dim) = √2 ≈ 1.41:  [0.52, 0.24, 0.40]
        ↓
Softmax:  e^0.52 / Σ ,  e^0.24 / Σ ,  e^0.40 / Σ
─────────────────────────────────────────
Attention weights: [ 0.38 ,  0.29 ,  0.34 ]
                      ↑       ↑       ↑
                    "The"   "cat"   "sat"

(Interpretation: "cat" attends most to "The", then "sat", least to itself)

STEP 4: Weighted sum of Value vectors
─────────────────────────────────────────────────────────────────────────
Output = 0.38 × V_the + 0.29 × V_cat + 0.34 × V_sat
       = 0.38×[0.5,0.3] + 0.29×[0.7,0.8] + 0.34×[0.2,0.6]
       = [0.19,0.11] + [0.20,0.23] + [0.07,0.20]
       = [0.46, 0.54]  ← new representation of "cat" enriched with context

┌─────────────────────────────────────────────────────────────┐
│ KEY INSIGHT: "cat" no longer represents only itself.        │
│ Its new vector [0.46, 0.54] is a BLEND of all tokens,       │
│ weighted by how relevant each token is to "cat".            │
└─────────────────────────────────────────────────────────────┘
```
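The same arithmetic as Diagram 11, executable. Only the standard library is used, and the Q/K/V values are copied from the diagram:

```python
# Numeric check of Diagram 11: attention weights for "cat" in "The cat sat".
import math

Q_cat = [0.9, 0.1]
K = {"The": [0.8, 0.1], "cat": [0.3, 0.7], "sat": [0.6, 0.2]}
V = {"The": [0.5, 0.3], "cat": [0.7, 0.8], "sat": [0.2, 0.6]}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

scores = {t: dot(Q_cat, k) for t, k in K.items()}      # STEP 2: raw scores
scaled = {t: s / math.sqrt(2) for t, s in scores.items()}  # STEP 3: scale
exp = {t: math.exp(s) for t, s in scaled.items()}
Z = sum(exp.values())
weights = {t: e / Z for t, e in exp.items()}           # softmax ≈ 0.38/0.29/0.34

# STEP 4: weighted sum of the value vectors
output = [sum(weights[t] * V[t][d] for t in V) for d in range(2)]
print(weights)   # attention distribution over the three tokens
print(output)    # ≈ [0.46, 0.54] — the context-enriched "cat" vector
```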
---

### Diagram 12: Token Generation with KV Cache Growth

```
AUTOREGRESSIVE GENERATION — producing "The cat sat on the mat"

Prompt: "The cat"  → generate one token at a time
────────────────────────────────────────────────────────────────

STEP 1: Process "The cat" (prefill)

   Input:  [The] [cat]
             ↓ full attention matrix computed
   KV Cache: ┌────┬────┐
             │K₁  │K₂  │   ← Keys stored   (2 slots used)
             │V₁  │V₂  │   ← Values stored (2 slots used)
             └────┴────┘
   Output: predict → "sat" ✓

────────────────────────────────────────────────────────────────

STEP 2: Append "sat", generate next token

   Input:  [The] [cat] [sat]
   But we DON'T recompute K₁,V₁,K₂,V₂ — they're cached!
   We only compute K₃, V₃ for "sat" then add to cache:

   KV Cache: ┌────┬────┬────┐
             │K₁  │K₂  │K₃  │   ← 3 slots used (grows!)
             │V₁  │V₂  │V₃  │
             └────┴────┴────┘
   Output: predict → "on" ✓

────────────────────────────────────────────────────────────────

STEP 3 → N: Cache keeps growing...

   KV Cache after 100 tokens:
   ┌─────────────────────────────┐
   │ K₁ K₂ K₃ ... K₉₈ K₉₉ K₁₀₀  │
   │ V₁ V₂ V₃ ... V₉₈ V₉₉ V₁₀₀  │
   └─────────────────────────────┘
   Size ∝ N × layers × heads × d_head

   KV Cache after 100K tokens:
   ┌─────────────────────────────────────────┐   (GPT-4 context window)
   │ K₁ ... ░░░░░░░░░░░░░░░░░░░░░░░░░ K₁₀₀ₖ  │
   │ V₁ ... ░░░░░░░░░░░░░░░░░░░░░░░░░ V₁₀₀ₖ  │
   └─────────────────────────────────────────┘
   ~80-160 GB for a 70B model ← needs A100s

────────────────────────────────────────────────────────────────

COMPARE: SSM at inference — NO cache, fixed state

   State after 100 tokens:   [h₁ h₂ h₃ ... h_d]  ← always same fixed size
   State after 100K tokens:  [h₁ h₂ h₃ ... h_d]  ← STILL same fixed size!
                             ~megabytes ← works on a laptop
```

---

### Diagram 13: Mamba's Selective Gating — Step by Step

```
VANILLA S4 (Linear Time-Invariant — fixed parameters, input-independent)
─────────────────────────────────────────────────────────────────────────

 xₜ ──────────────────────────────────────────────────────────────→
  │                    │                    │
  ▼                    ▼                    ▼
[B: fixed]          [B: fixed]          [B: fixed]
  │                    │                    │
hₜ₋₁ → [×A] → + → hₜ     hₜ → [×A] → + → hₜ₊₁    ...continues
          ↑                    ↑
       [B·xₜ]              [B·xₜ₊₁]

A, B, C are the SAME for every input token.
"The" and "elephant" are processed identically.
The model cannot choose to remember or forget.


MAMBA (Selective SSM — B, C, Δ all depend on the current input xₜ)
─────────────────────────────────────────────────────────────────────────

 xₜ ("elephant")
  │
  ├──► Linear projection → Δₜ  (step size: "how much time passes?")
  │         Large Δ = big state update
  │         Small Δ = nearly skip this token
  │
  ├──► Linear projection → Bₜ  (how strongly to WRITE xₜ into state)
  │
  ├──► Linear projection → Cₜ  (how strongly to READ from state)
  │
  └──► [ZOH (zero-order hold): discretise A with Δₜ → Āₜ]
            │
            ▼
   hₜ = Āₜ · hₜ₋₁ + Bₜ · xₜ     ← state update
   yₜ = Cₜ · hₜ                 ← read output

┌─────────────────────────────────────────────────────────────────┐
│ TOKEN EXAMPLES:                                                 │
│                                                                 │
│ "a"    → small Δ, small B → barely updates state (skip)         │
│ "but"  → large Δ, large B → strong update (contrast signal!)    │
│ "not"  → large Δ, large B → strong update (negation!)           │
│ "..."  → medium Δ → mild reset between sentences                │
│                                                                 │
│ The model LEARNS these importance weights from data.            │
└─────────────────────────────────────────────────────────────────┘
```
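A minimal sketch of Diagram 13's selective update, assuming a diagonal A and illustrative projection names (`W_delta`, `W_B`, `W_C`, and the scalar channel input `u` are toy choices, not the paper's exact layout); the real Mamba block fuses this loop into a hardware-aware parallel scan:

```python
# One Mamba-style recurrent step where Δ, B and C depend on the current token.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_state = 3, 4
A = -np.abs(rng.normal(size=d_state))        # diagonal A, negative → decay
W_delta = rng.normal(size=d_in)              # input-dependent projections
W_B = rng.normal(size=(d_in, d_state))
W_C = rng.normal(size=(d_in, d_state))

def selective_step(h, x):
    delta = np.log1p(np.exp(W_delta @ x))    # Δₜ: softplus keeps step positive
    A_bar = np.exp(delta * A)                # Āₜ: zero-order-hold discretisation
    B_t = W_B.T @ x                          # Bₜ: how strongly to write
    C_t = W_C.T @ x                          # Cₜ: how strongly to read
    u = x.sum()                              # scalar channel input (toy choice)
    h = A_bar * h + delta * B_t * u          # hₜ = Āₜ·hₜ₋₁ + B̄ₜ·xₜ
    return h, C_t @ h                        # yₜ = Cₜ·hₜ

h = np.zeros(d_state)
for x in rng.normal(size=(5, d_in)):         # five toy "tokens"
    h, y = selective_step(h, x)
    print(f"y = {y:+.3f}   state size = {h.size}")   # the state never grows
```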
---

### Diagram 14: S4 Dual Representation — Recurrence OR Convolution

```
THE SAME S4 MODEL CAN BE COMPUTED TWO EQUIVALENT WAYS:

REPRESENTATION A — RECURRENT (used at inference):
─────────────────────────────────────────────────
 Time:   t=1      t=2      t=3      t=4
          x₁       x₂       x₃       x₄
          │        │        │        │
          ▼        ▼        ▼        ▼
 h₀→[Ah₀+Bx₁]→h₁→[Ah₁+Bx₂]→h₂→[Ah₂+Bx₃]→h₃→[Ah₃+Bx₄]→h₄
          │        │        │        │
          ▼        ▼        ▼        ▼
          y₁       y₂       y₃       y₄
        (Ch₁)    (Ch₂)    (Ch₃)    (Ch₄)

 ✅ O(1) memory per step — perfect for generating one token at a time
 ❌ Sequential — cannot be parallelised across the sequence

REPRESENTATION B — CONVOLUTIONAL (used during training):
─────────────────────────────────────────────────────────
 The outputs y₁...yₙ can be written as a single CONVOLUTION:

    y = K̄ * x     where K̄ is the "SSM convolution kernel"

    K̄ = [CB, CAB, CA²B, CA³B, ..., CA^(N-1)B]
          ↑    ↑     ↑      ↑
         t=1  t=2   t=3    t=4   (how much input from N steps ago matters)

 Visualised as a filter over time:

 Kernel K̄:  ║█████║████ ║███  ║██   ║█    ║     ║     ║
              t=1   t=2   t=3   t=4   t=5   t=6   t=7
              (decaying memory)

 Input x:   [x₁]  [x₂]  [x₃]  [x₄]  [x₅]  [x₆]  [x₇]

 Output y₄ = K̄[1]·x₄ + K̄[2]·x₃ + K̄[3]·x₂ + K̄[4]·x₁
             (most recent)              (oldest, faintest)

 ✅ Fully parallelisable with FFT — compute ALL outputs simultaneously!
 ✅ O(N log N) for training
 ❌ Must hold full input sequence in memory

┌─────────────────────────────────────────────────────────────────────┐
│ MAGIC: The kernel K̄ encodes the model's entire "memory policy":     │
│ how quickly it forgets, what patterns it's sensitive to.            │
│ S4's clever parameterisation (HiPPO matrix A) makes this kernel     │
│ decay gracefully — it doesn't explode or vanish like RNN gradients. │
└─────────────────────────────────────────────────────────────────────┘
```

---

### Diagram 15: HiPPO — Compressing History into Polynomial Coefficients

```
PROBLEM: How do you summarise "everything that happened so far"
         in a fixed-size vector?

HIPPO'S ANSWER: Fit a polynomial to your signal history,
                update the fit at each step.

Signal history (past inputs over time):

 Value │      ·            ← the actual signal, token by token
       │  ·      ·    ·
       │·    ·           ·
       │          ·   ·
       └─────────────────────────→ time
        (past ←—————→ present)

HIPPO continuously fits a polynomial p(t) to this history:

 Value │      /\
       │  /      \    /
       │ /         \_/    ← p(t): a polynomial approximation of the signal
       │/
       └─────────────────────────→ time

 The COEFFICIENTS of p(t) = [c₀, c₁, c₂, ..., cₙ]
 are the STATE VECTOR hₜ

HiPPO UPDATE RULE (each time a new input xₜ arrives):
─────────────────────────────────────────────────────
 Old coefficients: [c₀, c₁, c₂, c₃, ...]
         │  new xₜ arrives
         │
         ▼
 A_HiPPO × h + B_HiPPO × xₜ      ← standard SSM form!
         │
         ▼
 New coefficients: [c₀', c₁', c₂', c₃', ...]   ← updated polynomial fit

KEY INSIGHT: The HiPPO matrix A is carefully designed so:
 - Recent inputs are approximated at HIGH resolution (many polynomial terms)
 - Distant inputs are approximated at LOW resolution (coarser terms)
 - This mimics how human memory works: clear yesterday, blurry last year

WHAT THE STATE "KNOWS":
┌───────────────────────────────────────────────────────┐
│ State hₜ (N numbers) ≈ "the shape of the past signal" │
│                                                       │
│ Low-frequency trend:      captured in c₀, c₁          │
│ Medium-frequency wiggles: captured in c₂, c₃, c₄      │
│ Fine recent detail:       captured in cₙ₋₂, cₙ₋₁, cₙ  │
└───────────────────────────────────────────────────────┘
```
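Diagram 14's dual-representation claim can be checked in a few lines: run the same random linear SSM both ways and compare the outputs. (Toy sizes and weights; a real S4 layer uses the HiPPO parameterisation and FFT rather than this naive convolution loop.)

```python
# Recurrent mode vs convolutional mode of the same linear SSM.
import numpy as np

rng = np.random.default_rng(1)
N, L = 4, 8                                   # state size, sequence length
A = rng.normal(size=(N, N)) * 0.3             # scaled so powers of A decay
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
x = rng.normal(size=L)

# Recurrent mode: hₜ = A·hₜ₋₁ + B·xₜ,  yₜ = C·hₜ
h = np.zeros((N, 1))
y_rec = []
for t in range(L):
    h = A @ h + B * x[t]
    y_rec.append((C @ h).item())

# Convolutional mode: K̄[k] = C·Aᵏ·B, then y = K̄ * x (causal convolution)
K = np.array([(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(L)])
y_conv = [sum(K[k] * x[t - k] for k in range(t + 1)) for t in range(L)]

print(np.allclose(y_rec, y_conv))             # True — same model, two modes
```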
---

## Added Section: Mermaid Diagrams for the Report

### Mermaid 1: Transformer — Training vs Inference Flow

```mermaid
flowchart TD
    subgraph TRAINING ["🏋️ TRAINING — Parallel, sees whole sequence"]
        T1["Full sequence\n'The cat sat on the mat'"]
        T2["Embed all tokens at once\n[The] [cat] [sat] [on] [the] [mat]"]
        T3["Causal mask applied\n(each token can only see past tokens)"]
        T4["ALL attention scores computed\nin one giant parallel matrix op\nO(N²) memory + compute"]
        T5["Compare predicted vs actual\nCompute loss → backprop"]
        T1 --> T2 --> T3 --> T4 --> T5
    end

    subgraph INFERENCE ["⚡ INFERENCE — Sequential, one token at a time"]
        I1["Prompt tokens → KV cache prefill\n(one parallel pass)"]
        I2["Generate token N+1\n(attend to cached K,V)"]
        I3["Append new token\nAdd K,V to cache"]
        I4{Done?}
        I5["Return full response"]
        I1 --> I2 --> I3 --> I4
        I4 -- "No" --> I2
        I4 -- "Yes" --> I5
    end

    TRAINING -.->|"Trained weights\nfrozen"| INFERENCE

    style T4 fill:#ff9999,stroke:#cc0000
    style I3 fill:#ffcc88,stroke:#cc8800
```

---

### Mermaid 2: SSM — Recurrent vs Convolutional Mode

```mermaid
flowchart LR
    subgraph PARAMS ["Shared Parameters: A, B, C"]
        P["A matrix (state transitions)\nB matrix (input projection)\nC matrix (output projection)"]
    end

    subgraph RECURRENT ["🔁 Recurrent Mode\n(O(1) memory — inference)"]
        R1["hₜ = A·hₜ₋₁ + B·xₜ"]
        R2["yₜ = C·hₜ"]
        R3["Fixed-size state\n→ next step"]
        R1 --> R2 --> R3 --> R1
    end

    subgraph CONV ["🌊 Convolutional Mode\n(O(N log N) — training)"]
        C1["Compute kernel K̄\n= [CB, CAB, CA²B, ...]"]
        C2["Apply FFT convolution\ny = K̄ * x"]
        C3["All outputs computed\nin parallel"]
        C1 --> C2 --> C3
    end

    PARAMS -->|"Use same weights"| RECURRENT
    PARAMS -->|"Use same weights"| CONV

    style RECURRENT fill:#e8f5e9,stroke:#388e3c
    style CONV fill:#e3f2fd,stroke:#1976d2
    style PARAMS fill:#fff9c4,stroke:#f57f17
```

---

### Mermaid 3: Hybrid Architecture Decision Tree

```mermaid
flowchart TD
    START["What does your task require?"]

    START --> Q1{"Exact recall of\nspecific facts/tokens\nfrom long context?"}

    Q1 -- "Yes (e.g. legal doc Q&A)" --> Q2{"Large GPU memory\nbudget?"}
    Q2 -- "Yes" --> TRANSFORMER["✅ Use Transformer\n(GPT/Claude/Gemini)\nPerfect recall\nHigh memory cost"]
    Q2 -- "No, constrained" --> Q3{"Occasional recall\nmisses acceptable?"}
    Q3 -- "Yes" --> HYBRID["✅ Use Hybrid\n(Jamba / Griffin)\n~Transformer quality\nwith linear memory"]
    Q3 -- "No" --> RETRIEVAL["✅ Use Transformer\n+ RAG retrieval\n(offload memory to DB)"]

    Q1 -- "No (e.g. audio, video,\nstreaming, forecasting)" --> Q4{"Training data\nsize?"}
    Q4 -- "Large (>1B tokens)" --> Q5{"Need conversation\nor instruction following?"}
    Q5 -- "Yes" --> HYBRID
    Q5 -- "No (e.g. pure time series)" --> SSM["✅ Use Pure SSM\n(Mamba / S4)\nFastest inference\nLowest memory"]
    Q4 -- "Small / real-time" --> SSM

    style TRANSFORMER fill:#ffcccc,stroke:#cc0000
    style HYBRID fill:#ffe0cc,stroke:#cc6600
    style SSM fill:#ccffcc,stroke:#006600
    style RETRIEVAL fill:#ccccff,stroke:#000099
```
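Mermaid 3 restated as code, in case the report wants a textual version of the decision logic. The branch conditions mirror the diagram; they are heuristics, not hard rules:

```python
# Sketch of the architecture decision tree from Mermaid 3.
def pick_architecture(needs_exact_recall: bool,
                      large_gpu_budget: bool = False,
                      recall_misses_ok: bool = False,
                      large_training_data: bool = False,
                      needs_instruction_following: bool = False) -> str:
    if needs_exact_recall:                      # e.g. legal doc Q&A
        if large_gpu_budget:
            return "Transformer (GPT/Claude/Gemini): perfect recall, high memory"
        if recall_misses_ok:
            return "Hybrid (Jamba/Griffin): near-Transformer quality, linear memory"
        return "Transformer + RAG: offload memory to a retrieval DB"
    # streaming / audio / video / forecasting branch
    if large_training_data and needs_instruction_following:
        return "Hybrid (Jamba/Griffin)"
    return "Pure SSM (Mamba/S4): fastest inference, lowest memory"

print(pick_architecture(needs_exact_recall=True, recall_misses_ok=True))
```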
---

## Added Section: Ideal Visuals for Designer Implementation

> These descriptions are for a visual designer or illustrator to implement as polished graphics for the final report. See [[analogies-and-intuitions]] for the conceptual framings that should accompany each visual.

---

### Visual Spec A: Attention Heat Map

**Concept**: Show that "attention" means different tokens matter differently when computing the meaning of a word.

**Layout**: A square grid, 11×11, representing the sentence:

> "The bank can guarantee deposits will eventually cover future tuition costs"

- **X-axis** (top): source tokens (all 11 words)
- **Y-axis** (left): query tokens (each word computing its attention)
- **Cells**: colour-coded from white (0% attention) → deep blue (100% attention)
- **Highlight**: The row for "bank" should show high attention to "deposits", "guarantee", "cover" — demonstrating disambiguation. A second small heat map for "bank" in a financial vs. river context should show *different* patterns — proving attention captures meaning, not just position.
- **Annotations**: Small arrows pointing from the "bank" cell to its high-attention cells labelled "bank is attending to these words"

**Colours**: White → light blue → medium blue → navy. Avoid red (it implies error).

---

### Visual Spec B: The Quadratic Explosion — Growth Curve

**Concept**: Transformer memory/compute grows with the square of context length; SSM grows linearly. (Underlying data points can be generated with the snippet after this spec.)

**Layout**: A 2D line chart, x-axis = "Context window (tokens)", y-axis = "GPU Memory Required"

- **X-axis ticks**: 1K, 4K, 16K, 64K, 128K, 1M tokens (log scale recommended)
- **Two lines**:
  - **Red curve** (Transformer, O(N²)): starts slow, accelerates dramatically, shoots off the top of the chart around 128K tokens. Label: "Transformer — runs out of GPU"
  - **Green line** (SSM, O(N)): gentle upward slope, stays comfortably in frame even at 1M tokens. Label: "SSM — still fits on one GPU"
- **Shaded region**: between 32K and 128K tokens, shade in red labelled "❌ Requires multiple high-end GPUs"
- **Icon**: At 128K token mark on the Transformer line, place a 🔥 "out of memory" icon

---

### Visual Spec C: KV Cache Memory Growth Over Time

**Concept**: During inference, a Transformer's memory requirement grows token-by-token; an SSM's stays flat.

**Layout**: A side-by-side "container fill" animation concept (static version: two columns of tanks)

**Left column — Transformer**:

- A tall vertical tank, initially empty
- At t=100: 5% full
- At t=1000: 50% full
- At t=8000: 100% full — **overflow indicator** (water splashing out) labelled "Out of Memory"
- Label on tank: "KV Cache — grows every token"

**Right column — SSM**:

- A small, fixed-size tank
- At t=100, t=1000, t=8000, t=∞: always showing the same fill level (~70%)
- Label on tank: "State — fixed size forever"
- Small annotation: "Same tank works for 100 tokens OR 1 million"

**Colours**: Tank fill = blue water; overflow = red splash; background = light grey.
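For whoever implements Spec B, this snippet generates the underlying data points — pure arithmetic (token-pair counts versus linear state updates), no plotting library assumed:

```python
# Data behind Spec B: attention pair count (≈ N²/2) vs linear SSM work (≈ N).
for n in [1_000, 4_000, 16_000, 64_000, 128_000, 1_000_000]:
    attn_ops = n * (n - 1) // 2      # one score per token pair
    ssm_ops = n                      # one state update per token
    print(f"{n:>9,} tokens | attention ≈ {attn_ops:>18,} | SSM ≈ {ssm_ops:>9,}")
```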
---

### Visual Spec D: SSM State as a Compression Vessel

**Concept**: The SSM state is a fixed-size "memory crystal" that holds a compressed summary of all past inputs.

**Layout**: A flowing river metaphor

- **Left side**: A wide river of data flowing right — individual droplets labelled with words from a story ("Once", "upon", "a", "time", "there", "was", "a", "giant", "who", ...)
- **Centre**: A funnel/compression chamber labelled "SSM State Update (A, B, C)"
- **Right side**: A small glowing gem/crystal — the state vector. Fixed size no matter how long the river is.
- **Below the gem**: A "read head" labelled "Output layer C" that can query the gem to produce answers

**Visual metaphor**: The gem *changes colour and texture* as new information flows in, but never *grows in size*. Recent information makes it glow brighter in certain spectra; old information fades into background colour patterns.

**Contrast inset**: A second panel showing the Transformer: no funnel, instead *every single droplet* is stored in a giant warehouse (KV cache) that keeps expanding rightward off the page.

---

## Added Section: LRA Benchmarks and New Conceptual Diagrams

> See [[lra-benchmarks]] for full data and source citations.

---

### Diagram 16: LRA Benchmark Comparison — Bar Chart (ASCII)

```
LONG RANGE ARENA — MODEL ACCURACY BY TASK (%)
Source: Tay et al. 2021 (LRA paper) + Gu et al. 2022 (S4, ICLR 2022)

Task         Transformer        S4 (2022)           S5 (2023)
──────────────────────────────────────────────────────────────────
ListOps      ████ 36.4%         ██████ 59.6%        ██████ 62.2%
Text         ██████ 64.3%       █████████ 86.8%     █████████ 89.3%
Retrieval    ██████ 57.5%       █████████ 90.9%     █████████ 91.4%
Image        ████ 42.4%         █████████ 88.7%     █████████ 88.0%
Pathfinder   ███████ 71.4%      █████████ 94.2%     ██████████ 95.3%
Path-X       FAIL (≈50%)        ██████████ 96.4%    ██████████ 98.5%
──────────────────────────────────────────────────────────────────
AVERAGE      54.4%*             86.1%               87.4%
             (*mean over the 5 tasks it completes; Path-X = chance level)

▲ All efficient Transformers (Performer, BigBird, etc.) cluster ≤55%
▲ S4 is the first model to solve Path-X; Transformer = random guessing

Legend: bar length ∝ accuracy — S4 and S5 lead on every task
```

> [!NOTE]
> The jump from the best Transformer-family model (~55%) to S4 (~86%) happened in a **single paper** (2021). This is not a gradual climb — it is a discontinuity. S4 is the first model to solve Path-X (16,384 tokens); every prior model scores at chance level (~50%).
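A quick sanity check on Diagram 16's AVERAGE row. Note the convention, taken from the LRA reporting style: the vanilla Transformer's mean is over the five tasks it completes (it scores at chance on Path-X), while the S4 and S5 means include Path-X:

```python
# Reproducing the AVERAGE row of Diagram 16 from the per-task scores.
scores = {
    "Transformer": [36.4, 64.3, 57.5, 42.4, 71.4],        # Path-X excluded (FAIL)
    "S4":          [59.6, 86.8, 90.9, 88.7, 94.2, 96.4],  # Path-X included
    "S5":          [62.2, 89.3, 91.4, 88.0, 95.3, 98.5],
}
for model, s in scores.items():
    print(f"{model:>11}: {sum(s) / len(s):.2f}%")   # 54.40 / 86.10 / 87.45
```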
---

### Diagram 17: The Handshake Problem — O(N²) Connections

```
WHY TRANSFORMER ATTENTION SCALES QUADRATICALLY

Consider N people at a party who all need to shake hands with each other.
Each pair shakes hands exactly once. How many handshakes as N grows?
────────────────────────────────────────────────────────────────

N = 2 people:   1 handshake        A───B

N = 3 people:   3 handshakes        A
                                   / \
                                  B───C

N = 4 people:   6 handshakes       A───B
                                   │╲ ╱│
                                   │ ╳ │
                                   │╱ ╲│
                                   C───D

N = 8 people:   28 handshakes   (each of 8 people shakes 7 hands ÷ 2 = 28)

   Connections:  ●─●─●─●─●─●─●─●   (each dot connects to all others)
                  ╲╱╲╱╲╱╲╱╲╱╲╱╲╱   (web of connections, hard to draw
                                    but every pair links)

GENERAL FORMULA:  N × (N - 1) / 2  ≈  N² / 2

────────────────────────────────────────────────────────────────

 N (tokens)      Pairs               GPU ops (attention)
──────────────────────────────────────────────
 128             8,128               ~8K
 512             130,816             ~131K
 1,024           523,776             ~524K
 4,000           7,998,000           ~8.0M
 10,000          49,995,000          ~50M
 16,384          134,209,536         ~134M   ← Path-X sequence length
 100,000         4,999,950,000       ~5B     ← impractical on any GPU today
 1,000,000       499,999,500,000     ~500B   ← impossible
────────────────────────────────────────────────────────────────

                                              ● ← impossible territory
                                             /
                                            ●
                                           /      TRANSFORMER (O(N²))
                                          ●       each token attends
                                         /        to all tokens
                                        ●
                                       /
 ──────────●───────────────────────────●────────────── ← SSM (O(N))
          /                               (flat — same work per token)
         ●
        /
       ● ← 128 tokens = easy for both

     128   512   1K    4K    16K   100K    1M
                              ↑
                  GPT-4 context starts here (~32K)

KEY INSIGHT: Each new token in a Transformer must "handshake" with
EVERY prior token. That's the attention matrix.
Add one more token to a 16K sequence → 16,384 more operations.
Add one more token to an SSM → exactly 1 more state update.
Same cost every time.
```

---

### Diagram 18: The Induction Head Circuit

```mermaid
flowchart TD
    subgraph INPUT ["📝 Input Sequence (simplified)"]
        T1["Token: 'A'"]
        T2["Token: 'B'"]
        T3["Token: '...'"]
        T4["Token: 'A' ← repeated"]
        T5["Token: ??? ← predict this"]
    end

    subgraph LAYER1 ["Layer 1: Previous-Token Head"]
        PH["Previous-Token Head\n\nLearns to attend:\neach token → token BEFORE it\n\n'A' remembers 'START'\n'B' remembers 'A'\n'A²' remembers '...'"]
    end

    subgraph LAYER2 ["Layer 2: Induction Head"]
        IH["Induction Head\n\nPattern: when I see token X,\nfind where X appeared before\nand copy what FOLLOWED it\n\nSees: current token = 'A'\nFinds: position 2 stored\n'A came just before me'\nPosition 2's token is 'B'\nCopies: B into current output"]
    end

    subgraph OUTPUT ["📤 Prediction"]
        PRED["Predicts: 'B'\n\n✅ Correct! The pattern [A → B]\nwas learned from earlier in context\n\nThis is IN-CONTEXT LEARNING:\nno gradient update needed —\njust attention pattern recognition"]
    end

    T4 --> LAYER1
    T1 --> LAYER1
    LAYER1 -->|"'A₁ was followed by B'\nstored via Q/K/V"| LAYER2
    T4 --> LAYER2
    LAYER2 --> PRED

    style LAYER1 fill:#fff3e0,stroke:#e65100
    style LAYER2 fill:#e8f5e9,stroke:#2e7d32
    style PRED fill:#e3f2fd,stroke:#1565c0
    style INPUT fill:#fce4ec,stroke:#880e4f
```

**What the circuit does, step by step:**

```
SEQUENCE:  [A] [B] [C] [D] ... [A] [?]
            1   2   3   4      N  N+1

LAYER 1 — Previous Token Head:
   Each token attends to the position directly before it
   This head's job: copy "what came before me" into my representation
   Result: every position now carries a tag naming its predecessor —
   in particular, position 2 ("B") carries the tag "A came before me"

LAYER 2 — Induction Head:
   Looks at the token at position N ("A" again)
   Asks: "Which earlier position has 'A' stored as its previous token?"
   Finds: position 2 — its Layer-1 tag reads "the token before me was A"
   Position 2's own token is "B"
   Copies "B" into the prediction for position N+1

OUTPUT: Predicts "B" with high confidence ✅

WHY THIS IS REMARKABLE:
   This two-layer circuit implements a simple but powerful rule:
   "repeat the pattern you saw earlier in this very context"
   No fine-tuning. No gradient update. Pure in-context learning.
```
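Diagram 18's circuit, reduced to its behavioural rule. A real induction head implements this with two attention layers rather than an explicit backwards scan; the function below only mimics the input/output behaviour:

```python
# The induction rule: find the previous occurrence of the current token and
# predict whatever followed it last time.
def induction_predict(tokens):
    current = tokens[-1]
    # scan backwards for an earlier occurrence of the current token
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]         # copy the token that followed it
    return None                          # pattern never seen in this context

print(induction_predict(["A", "B", "C", "D", "A"]))   # → "B"
```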
> [!NOTE]
> Induction heads were discovered empirically by Olsson et al. (2022) in "In-context Learning and Induction Heads." They appear in virtually all Transformer language models. SSMs, by contrast, cannot straightforwardly implement this circuit because they do not have content-based random access — they can only read out their compressed state.

---

### Diagram 19: SSM Memory — "Photographic" vs "Impressionist Sketch"

```
SAME SOURCE DOCUMENT:

"The defendant, Marcus Webb, a 42-year-old software engineer from
Austin, was acquitted on all three counts of fraud on Tuesday, with
the court citing lack of evidence. The jury deliberated for six hours.
Webb's attorney, Sarah Chen, called the verdict a 'complete
vindication.'"

════════════════════════════════════════════════════════════════════════

TRANSFORMER KV CACHE — "Photographic Memory" (exact, but expensive)
────────────────────────────────────────────────────────────────────────
Stored verbatim:
┌──────────────────────────────────────────────────────────────────┐
│ K₁:"The"  K₂:"defendant"  K₃:","  K₄:"Marcus"  K₅:"Webb"  K₆:"," │
│ K₇:"a"  K₈:"42-year-old"  K₉:"software"  K₁₀:"engineer"  ...     │
│ ...  K₄₇:"deliberated"  K₄₈:"for"  K₄₉:"six"  K₅₀:"hours"  ...   │
│ ...  K₆₃:"complete"  K₆₄:"vindication"  K₆₅:"."                  │
└──────────────────────────────────────────────────────────────────┘

Query: "Who was acquitted?"
→ Scan ALL 65 key-value pairs, compute 65 dot-products
→ Find K₄-K₅ ("Marcus Webb") and K₁₆ ("acquitted")
→ Return exact answer ✅

Cost: 65 keys stored, 65 comparisons per query
Scales to 100K tokens? → 100K comparisons per query 😰

════════════════════════════════════════════════════════════════════════

SSM STATE VECTOR — "Impressionist Sketch" (compressed, fixed cost)
────────────────────────────────────────────────────────────────────────
After reading the same paragraph, state vector h (fixed size: N=256):
┌───────────────────────────────────────────────────────────────────┐
│ h = [ 0.87, -0.23, 0.61, 0.44, -0.91, 0.33, 0.72, -0.18, ... ]    │
│                                                                   │
│ ≈ "adult male professional, legal exoneration, positive outcome,  │
│    short deliberation, named parties (blurry specifics)"          │
│                                                                   │
│ High activation dims:     PERSON • LEGAL • ACQUITTAL • VINDICATION│
│ Medium activation dims:   ENGINEER • AUSTIN • ATTORNEY            │
│ Low/zero activation dims: exact ages, exact durations             │
└───────────────────────────────────────────────────────────────────┘

Query: "Who was acquitted?"
→ Read output: C · h   ← single matrix multiply, O(N) not O(seq_len)
→ Returns: "Marcus Webb" (if B matrix encoded the name strongly enough)
→ OR returns: "a defendant" (if name was compressed away) ⚠️

Cost: SAME whether document was 65 tokens or 65,000 tokens ✅
Trade-off: May lose exact details (exact age "42", exact hours "six")

════════════════════════════════════════════════════════════════════════

THE CORE TRADE-OFF:

Transformer: "I remember every word. Ask me anything precisely."
             But: memory bill grows with every new word.

SSM:         "I hold a compressed impression of everything I've read."
             But: exact verbatim recall may degrade for long docs.

Hybrid:      "I use SSM layers for flowing context, attention layers
             for precision recall of key facts."
             → Best of both worlds at moderate cost.
```

---
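A toy version of Diagram 19's contrast. The "compression" here is a deliberately crude exponential moving average over hashed tokens — just enough to show a memory that stays fixed-size while the verbatim store grows; nothing below is a real SSM:

```python
# Verbatim store (grows, exact) vs fixed-size decaying state (flat, lossy).
import numpy as np

tokens = "The defendant Marcus Webb was acquitted on Tuesday".split()

# Transformer-style: keep every token — memory grows with the sequence
kv_store = list(tokens)
names = [t for t in kv_store if t.istitle()]  # exact scans stay possible

# SSM-style: fold each token into one fixed-size vector, fading old info
d = 8
state = np.zeros(d)
for t in tokens:
    vec = np.array([(hash(t) >> k) % 7 for k in range(d)], dtype=float)
    state = 0.8 * state + 0.2 * vec           # write new, decay old

print(len(kv_store), names)    # store grew to 8 entries; exact recall works
print(state.round(2))          # always 8 numbers; specifics are blurred
```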
### Diagram 20: Hybrid Architecture "Sweet Spot"

```mermaid
flowchart TD
    subgraph PURE_XFMR ["🔴 Pure Transformer\n(e.g. GPT-4, Claude)"]
        X1["Attention Block"]
        X2["MLP Block"]
        X3["Attention Block"]
        X4["MLP Block"]
        X5["Attention Block"]
        X6["MLP Block"]
        X1 --> X2 --> X3 --> X4 --> X5 --> X6
        X6 --->|"...×96 layers"| XOUT["Output"]
        XNOTE["⚠️ O(N²) compute\n⚠️ KV cache grows unboundedly\n✅ Perfect recall\n✅ Strong reasoning"]
    end

    subgraph PURE_SSM ["🟢 Pure SSM\n(e.g. Mamba, S4)"]
        S1["SSM Block"]
        S2["SSM Block"]
        S3["SSM Block"]
        S4b["SSM Block"]
        S5b["SSM Block"]
        S6["SSM Block"]
        S1 --> S2 --> S3 --> S4b --> S5b --> S6
        S6 --->|"...×48 layers"| SOUT["Output"]
        SNOTE["✅ O(N) compute\n✅ Fixed-size state\n⚠️ Weaker exact recall\n⚠️ Struggles with in-context lookup"]
    end

    subgraph HYBRID ["⭐ Hybrid Model\n(e.g. Jamba, Griffin, Zamba)"]
        H1["SSM Block"]
        H2["SSM Block"]
        H3["SSM Block"]
        H4["SSM Block"]
        H5["SSM Block"]
        H6["SSM Block"]
        H7["SSM Block"]
        HATTN["🔍 Attention Block\n(1 per 7–8 SSM blocks)"]
        H8["SSM Block"]
        H9["SSM Block"]
        H10["SSM Block"]
        H1 --> H2 --> H3 --> H4 --> H5 --> H6 --> H7
        H7 --> HATTN
        HATTN --> H8 --> H9 --> H10
        H10 --->|"pattern repeats"| HOUT["Output"]
        HNOTE["✅ Near-O(N) compute\n✅ Near-fixed memory\n✅ Good recall (attention layers)\n✅ 256K window; ~140K ctx\nfits on 1×80GB GPU"]
    end

    subgraph SWEET_SPOT ["📊 Performance vs. Efficiency"]
        CHART["Quality ↑ High ┤ ★ Transformer (best recall) │ ⭐ Hybrid (sweet spot) │ 🐍 Mamba (strong language) │ S4 (best long-range) ↓ Low ─┼────────────────────────────→ Low Efficiency High"]
    end

    PURE_XFMR -.->|"Add linear layers\nreduce attention frequency"| HYBRID
    PURE_SSM -.->|"Add sparse attention\nfor precision tasks"| HYBRID
    HYBRID --> SWEET_SPOT

    style PURE_XFMR fill:#ffebee,stroke:#c62828
    style PURE_SSM fill:#e8f5e9,stroke:#2e7d32
    style HYBRID fill:#fff9c4,stroke:#f9a825
    style SWEET_SPOT fill:#e3f2fd,stroke:#1565c0
    style HATTN fill:#fff3e0,stroke:#e65100
```

**The key numbers (Jamba, AI21 Labs, 2024):**

- Roughly 1 in every 8 layers is an attention block (≈12% of layers); the rest are Mamba layers
- 256K context window, with roughly 140K tokens of context fitting on a **single 80GB A100 GPU**
- Quality competitive with Transformer models of similar parameter count
- Inference speed ~2–3× faster than an equivalent pure Transformer on long contexts

[^1]: Jay Alammar, "The Illustrated Transformer" (2018). https://jalammar.github.io/illustrated-transformer/ — Widely translated (20+ languages); the canonical visual transformer explainer.
[^2]: Jay Alammar, "Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)" (2018). https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
[^3]: Brendan Bycroft, "LLM Visualization" (interactive). https://bbycroft.net/llm — 3D WebGL interactive transformer visualiser. Best experienced live; shows a nano-GPT (85K parameters) with animated activations.
[^4]: Lilian Weng, "Attention? Attention!" (2018). https://lilianweng.github.io/posts/2018-06-24-attention/ — Comprehensive mathematical treatment with visual/formula side-by-side presentation.