# Advanced Technical Topics: Four Deep Dives
> *This note covers four technical topics that add depth to the core Transformers vs SSMs comparison. Each section is self-contained but cross-references other notes for context. See [[ssm-basics]], [[transformers-basics]], [[induction-heads-icl]], and [[applications-and-use-cases]] for foundations.*
---
## Table of Contents
1. [Position Encodings — Transformers Need Them, SSMs Don't](#1-position-encodings--transformers-need-them-ssms-dont)
2. [In-Context Learning in SSMs — A Fundamental Weakness](#2-in-context-learning-in-ssms--a-fundamental-weakness)
3. [Length Generalization — Can SSMs Extrapolate?](#3-length-generalization--can-ssms-extrapolate)
4. [Vision and Image SSMs — Mamba for Computer Vision](#4-vision-and-image-ssms--mamba-for-computer-vision)
---
## 1. Position Encodings — Transformers Need Them, SSMs Don't
### 1.1 The Fundamental Problem: Transformers Are Deaf to Order
Imagine reading a sentence whose words have been shuffled into a random order before you start. For most purposes, that would be disastrous. But a Transformer's attention mechanism, in its raw form, does exactly this — it computes relationships between all pairs of tokens *without any notion of their order*.
This is because attention is mathematically **permutation-equivariant**: swapping two tokens in the input produces output that is identically swapped, with no other changes. The content matters; the positions do not. If you asked a bare Transformer without positional encoding to read "The dog bit the man" versus "The man bit the dog," it would produce *identical representations* — just with the token embeddings for "dog" and "man" exchanged.[^1]
> [!NOTE] Analogy: the bag-of-words problem
> A Transformer without position encoding has the same fundamental limitation as the old "bag of words" approach in NLP — it treats a sentence as an unordered collection of tokens. The attention mechanism is an enormously powerful way to mix bag-of-words token representations, but the order information has to come from somewhere else. That somewhere else is positional encoding.
This is why every practical Transformer adds some form of **positional encoding** — an additional signal injected into each token's representation that tells the model where in the sequence that token lives.
---
### 1.2 Sinusoidal Encodings — The Original Solution
The original Transformer paper (Vaswani et al., 2017) proposed a clever solution: add a fixed, deterministic vector to each token's embedding that encodes its position as a pattern of sine and cosine waves at different frequencies.[^1]
For a token at position `p` in a sequence, and for the `i`-th dimension of the embedding, the encoding is:
```
PE(p, 2i) = sin(p / 10000^(2i/d))
PE(p, 2i+1) = cos(p / 10000^(2i/d))
```
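As a quick illustration, here is a minimal NumPy sketch of that formula (the function name and shapes are mine, not from the original paper's code):

```python
import numpy as np

def sinusoidal_pe(num_positions: int, d_model: int) -> np.ndarray:
    """Return a (num_positions, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(num_positions)[:, None]     # (P, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # even dimension indices 2i
    angles = positions / (10000 ** (dims / d_model))  # p / 10000^(2i/d)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dims get sine
    pe[:, 1::2] = np.cos(angles)                      # odd dims get cosine
    return pe

pe = sinusoidal_pe(num_positions=128, d_model=64)
print(pe.shape)   # (128, 64): each row is a unique positional "fingerprint"
```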
**Why this works (intuition):**
- Different dimensions oscillate at vastly different rates — some like a rapidly blinking light (high frequency), others like a slow pendulum (low frequency).
- Each position produces a unique pattern across all dimensions — a "fingerprint" the model can learn to read.
- Crucially, relative positions are encoded too: the difference between position 5 and position 7 produces a consistent pattern regardless of where they appear in the sequence, because trigonometric identities relate `sin(p+k)` to `sin(p)` and `cos(p)`.
- The encoding can in principle extend to any length — you just evaluate the formulas at larger `p`.
**The limitation:** While sinusoidal encodings are mathematically neat and work for lengths close to what was seen during training, they do not guarantee that the model can *generalise* to much longer sequences. The model learns to interpret specific position patterns; extrapolating far beyond training length leads to unexpected and often broken behaviour.[^2]
---
### 1.3 RoPE — The Modern Standard
**Rotary Position Embedding (RoPE)**, introduced in 2021 and now used in virtually every major language model (LLaMA, Mistral, Gemma, Qwen, and more), takes a fundamentally different approach.[^3]
Instead of *adding* a position vector to token embeddings, RoPE *rotates* the query and key vectors before computing attention. Specifically, the query at position `p` and the key at position `m` are each rotated by an angle proportional to their position:
```
q'_p = Rotate(q_p, p·θ)
k'_m = Rotate(k_m, m·θ)
```
where `θ` is a set of frequencies (one per pair of embedding dimensions).
**Why rotation is clever:**
When you compute the dot product between a rotated query and a rotated key (which is what attention does), the rotation angles partially cancel:
```
q'_p · k'_m = q_p · R(p-m) · k_m
```
The dot product naturally depends on the *relative distance* `(p - m)` between query position `p` and key position `m`. RoPE thus gives attention a built-in sense of "how far apart are these two tokens?" without any explicit relative position computation. It also has a desirable **decay property**: tokens far apart tend to produce smaller dot products than nearby tokens, which makes intuitive sense.
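A small sketch of this property, treating RoPE as independent 2D rotations over dimension pairs (my own minimal implementation, not the RoFormer reference code); it checks numerically that the rotated dot product depends only on the offset `p − m`:

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: float, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive dimension pairs of x by angles pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per dimension pair
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin             # standard 2D rotation, applied pairwise
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# Same relative offset (3), very different absolute positions -> identical score:
s_near = rope_rotate(q, 10) @ rope_rotate(k, 7)
s_far  = rope_rotate(q, 110) @ rope_rotate(k, 107)
print(np.allclose(s_near, s_far))   # True
```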
> [!TIP] Why RoPE is better than sinusoidal
> Sinusoidal encodings inject absolute position into the representation before attention; this information can be "washed out" as it passes through layers. RoPE injects relative position *directly into the attention score computation*, so the position signal is always present wherever attention is computed. This is why RoPE has become the standard.
**RoPE's Achilles heel — length generalisation failure:**
RoPE works beautifully within the trained context window. But when sequences are *longer* than anything seen during training, RoPE breaks down severely. The position angles that the model learned to interpret have never been applied at these large values; the attention scores become unpredictably large or small, effectively destroying the attention mechanism.[^2]
The Position Interpolation paper (2023) demonstrated this clearly: naively applying a LLaMA model trained on 2K context to 32K sequences produces catastrophic failure. Their fix — linearly scaling down position indices to fit within the original range — works, but requires additional fine-tuning and still doesn't give you true open-ended generalisation.[^2]
---
### 1.4 Why SSMs Don't Have This Problem
SSMs are fundamentally different: they process tokens **one at a time, in order**, updating a hidden state at each step. There is no attention matrix to be position-agnostic; the architecture itself is inherently sequential.
In the SSM equations:
```
h(t) = A·h(t-1) + B·x(t) [update hidden state]
y(t) = C·h(t) [produce output]
```
The index `t` *is* the position — it is baked into the computational structure, not an add-on.[^4] Token 42 is processed after token 41 by definition. The hidden state `h(t)` is a compressed summary of everything seen at positions 1 through t, ordered by when it arrived.
In Mamba specifically, the time-step parameter `Δ` (delta) modulates how much the state transitions for each input. This is an input-dependent notion of "how long ago was that?" — a form of relative temporal reasoning that emerges naturally from the recurrent structure, without any external positional signal.[^4]
**The practical consequence:**
An SSM trained on sequences of length 4,096 and then asked to process a sequence of length 32,768 is not doing anything architecturally different — it simply keeps updating its state for more steps. The *math doesn't change*. This is a profound practical advantage for applications where sequence length varies dramatically (legal documents, genomics, audio).
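A toy version of that recurrence makes the point concrete; this is a plain Python loop with random matrices standing in for learned parameters (real Mamba implementations use a parallel scan, not this loop):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run h(t) = A·h(t-1) + B·x(t), y(t) = C·h(t) over a sequence of any length."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                      # identical update at t = 5 and t = 50,000
        h = A @ h + B * x_t
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
d_state = 8
A = np.diag(rng.uniform(0.9, 0.99, d_state))   # stable diagonal state transition
B = rng.normal(size=d_state)
C = rng.normal(size=d_state)

y_short = ssm_scan(rng.normal(size=4_096), A, B, C)    # "training" length
y_long  = ssm_scan(rng.normal(size=32_768), A, B, C)   # 8x longer: same code, same weights
print(y_short.shape, y_long.shape)   # (4096,) (32768,)
```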
> [!NOTE] The analogy: humans vs. a camera
> A Transformer reading a document is like someone who receives all the words printed on cards, shuffled into a pile, and must read them in any order while adding position labels. An SSM is like someone reading a scroll left-to-right — they never needed position labels because the act of reading *is* sequential. Longer scrolls are just more scrolling.
---
### 1.5 Summary Table
| Feature | Sinusoidal | RoPE | SSM |
|---|---|---|---|
| How position is encoded | Added vector per position | Rotation of Q/K vectors | Implicit in recurrent structure |
| Relative position? | Indirect | Direct (via angle difference) | Intrinsic |
| Length generalisation | Weak | Very weak without fine-tuning | Strong (by design) |
| Used in | Original Transformer, BERT | LLaMA, Mistral, Gemma, GPT-NeoX | Mamba, Griffin, S4, S6 |
| Can extend to arbitrary length? | In principle, but breaks | No (Position Interpolation needed) | Yes |
---
## 2. In-Context Learning in SSMs — A Fundamental Weakness
### 2.1 What In-Context Learning Is (Brief Recap)
**In-context learning (ICL)** is the ability of a model to learn a new task from examples provided directly in the prompt, without any weight updates. See [[induction-heads-icl]] for a deep treatment; this section focuses specifically on *how SSMs handle ICL* and *why they struggle*.
The short version: ICL requires the model to see a pattern like `[A][B] ... [A][B] ... [A]→?` and complete it with `[B]`. Anthropic's mechanistic interpretability research showed this is implemented in Transformers via **induction heads** — two-layer attention circuits that can look back into the full context, find the previous occurrence of any token, and return whatever followed it.[^5]
This capability is fundamentally an **associative recall** operation: "given a key, look up its associated value." See [[induction-heads-icl]] for how exactly this circuit works.
---
### 2.2 Why Pure SSMs Struggle at ICL
The core problem is that induction-head-style recall requires **exact, content-addressable lookup**: given the *value* of a token, find its *position* in history and return the *next* token. This is what attention was built for — it can look at any position in the full KV cache and retrieve exactly what it needs.
SSMs, by contrast, compress everything into a **fixed-size hidden state**. Once a token is in the past, it exists only as a smeared component of the state vector — mixed together with everything else that came after it. You cannot "query the state for a specific key" the way attention queries the KV cache.
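A toy comparison (my own construction, not an experiment from the H3 or Zoology papers) illustrates the difference: softmax attention over a stored list of key/value vectors retrieves the exact match, while a single fixed-size matrix memory, used here as a crude stand-in for a compressed recurrent state, blurs all the pairs together:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 64, 256
keys   = rng.normal(size=(n_pairs, d))
values = rng.normal(size=(n_pairs, d))
query  = keys[17]                        # ask for the value stored under key 17

# Attention-style lookup: score every stored key, softmax picks out the exact match.
scores  = keys @ query
weights = np.exp(scores - scores.max())
weights /= weights.sum()
attn_answer = weights @ values           # ~= values[17]

# Fixed-size memory: every pair summed into one d-by-d matrix (a crude stand-in
# for a compressed recurrent state), so every other pair leaks into the readout.
state = sum(np.outer(v, k) for k, v in zip(keys, values))
state_answer = state @ query

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("attention lookup: ", round(cosine(attn_answer, values[17]), 3))   # close to 1.0
print("fixed-size memory:", round(cosine(state_answer, values[17]), 3))  # well below 1.0; worse with more pairs
```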
**The H3 paper (2022) quantified this gap directly:**[^6]
| Model | Associative Recall Accuracy |
|---|---|
| S4D (a standard SSM) | 20.1% |
| GSS (gated SSM) | 27.1% |
| Standard Attention | **100%** |
Later work showed that even a 20× parameter advantage (a 1.4-billion-parameter gated-convolution SSM vs. a 70-million-parameter attention model) did not close this gap. This is not a scaling problem; it is a *structural* limitation.[^7]
> [!NOTE] Analogy: a notebook versus an index card
> Attention is like having a perfectly indexed notebook — you can flip to any page, look up any entry, and find it instantly. An SSM is like trying to remember everything from a conversation after compressing your notes to a single index card. You can capture the gist well, but you cannot reliably answer "what exact word did they say at minute 23?"
---
### 2.3 The Zoology Paper — Quantifying How Much This Matters
The **Zoology paper** (Arora et al., 2023) took a systematic approach to measuring this gap.[^7] They pretrained 17 models — attention and gated-convolution SSMs — and compared performance on language modeling (measured by perplexity on The Pile).
Key findings:
- SoTA gated-convolution models underperform attention by up to **2.1 perplexity points**
- **82% of this gap** is explained by a single capability: associative recall — the ability to recall information mentioned earlier in the context
- They formalised this as **Multi-Query Associative Recall (MQAR)**: a realistic synthetic task where a sequence contains many `(key, value)` pairs, followed by queries for random keys
The MQAR task was important because earlier synthetic tests had shown SSMs could *perfectly solve* simpler associative recall variants. The Zoology paper demonstrated these synthetic tasks were too easy — they involved single queries, not multiple, and didn't reflect the density of recall required in real language.[^7]
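For intuition, a rough sketch of what one MQAR-style example looks like (a simplification of the Zoology setup; token ids and layout here are illustrative, not the paper's exact format):

```python
import numpy as np

def make_mqar_example(n_pairs: int, n_queries: int, vocab: int, seed: int = 0):
    """One MQAR-style sequence: [k1 v1 k2 v2 ...] then several queried keys; targets are their values."""
    rng = np.random.default_rng(seed)
    keys   = rng.choice(vocab, size=n_pairs, replace=False)   # distinct keys
    values = rng.choice(vocab, size=n_pairs)                   # values may repeat
    context = np.stack([keys, values], axis=1).reshape(-1)     # interleave: k1 v1 k2 v2 ...
    query_idx = rng.choice(n_pairs, size=n_queries, replace=False)
    return np.concatenate([context, keys[query_idx]]), values[query_idx]

tokens, targets = make_mqar_example(n_pairs=16, n_queries=4, vocab=1024)
print(tokens.shape, targets)   # the model must produce each target right after its queried key
```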
---
### 2.4 How Hybrids Recover ICL Ability
The solution is to give the model *some* attention — specifically, attention layers that can do exact lookup when needed.
**Two hybrid strategies:**
**1. Fixed attention layers at key depths**
Insert full attention at specific positions in an otherwise SSM stack. Models like Jamba and Zamba use ratios like "one attention layer per 8 SSM layers." These provide the induction-head-style recall circuits while keeping most of the sequence processing in SSM layers for efficiency.[^8] See [[hybrid-models]] for details.
**2. Input-dependent sparse attention (the Based architecture)**
The Zoology paper's own solution was more surgical: allow the model to use **sparse attention** triggered by input content — only attending over specific tokens when needed, not over the full sequence. This closed **97.4% of the quality gap** relative to full attention while remaining sub-quadratic.[^7]
The intuition: most tokens don't need full backward lookup — only a few per sequence need the associative-recall superpower. If you give those tokens a way to do that lookup efficiently, you get nearly all the ICL benefit at a fraction of the cost.
> [!TIP] Practical implication
> If you're deploying an SSM-based model and need it to follow in-context instructions reliably — "here are 5 examples of the format I want, now do it for this new case" — you want a **hybrid model** with at least some attention layers. Pure SSMs (Mamba, S4, H3) will struggle on this task at scale. Hybrid models like Jamba, Zamba, and Mamba-2 with MLA achieve near-Transformer ICL performance.
---
### 2.5 What Pure SSMs *Can* Do Well In-Context
It's important not to overstate the limitation:
- SSMs are still capable of **statistical pattern recognition** across context — learning a "topic" or "register" or "style" from examples
- They perform well on tasks where ICL doesn't require exact recall of specific (key, value) pairs — e.g., adapting to a consistent translation style, or inferring the genre of text to continue
- The gap is specifically severe for **exact retrieval tasks**: "what was the phone number mentioned at the start?" or "what did the user say the password was?"
The distinction is between *statistical* ICL (SSMs can do) and *exact retrieval* ICL (SSMs struggle with).
---
## 3. Length Generalization — Can SSMs Extrapolate?
### 3.1 The Core Question
You train a model on sequences of length 2,048 tokens. At inference time, a user sends a 32,768-token document. Can the model handle it? This is the **length generalisation** problem, and the answer differs dramatically between Transformers and SSMs.
---
### 3.2 Why Transformers Struggle with Length Generalisation
The problem for Transformers is twofold:
**Problem 1: Position encoding out-of-distribution**
As discussed in Section 1, RoPE assigns rotational angles based on absolute position. During training, positions 1–2048 are seen; at inference, positions 2049–32768 appear. These rotation angles have never been encountered during training. The attention computation receives inputs in a regime it has never learned to handle, producing garbled results.[^2]
**Problem 2: Attention score statistics shift**
Even if position encoding were perfect, the *distribution* of attention scores changes with sequence length. A model trained to attend over 2,048 tokens has softmax computations calibrated for that scale. When presented with 32,768 tokens, the softmax normalises over 16× as many terms; the "sharpness" of attention patterns changes in ways the model never experienced. This produces degraded quality even if the positional encoding is solved.[^2]
**The research response:** Position Interpolation (2023) showed that fine-tuning with linearly compressed position indices (squeezing 32K positions into the original 2K range) allows LLaMA to extend to 32K context with just 1,000 fine-tuning steps. But this is still a **fine-tuning requirement** — it's not zero-cost generalisation.[^2]
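The core trick is small enough to sketch: scale the position indices down so that RoPE angles at inference never exceed what was seen in training (a simplified illustration of the idea; the paper still fine-tunes after applying it):

```python
import numpy as np

def interpolated_positions(seq_len: int, trained_len: int) -> np.ndarray:
    """Linearly squeeze inference positions into the position range seen during training."""
    positions = np.arange(seq_len, dtype=np.float64)
    if seq_len <= trained_len:
        return positions                        # inside the training window: unchanged
    return positions * (trained_len / seq_len)  # e.g. 32K positions mapped into [0, 2048)

pos = interpolated_positions(seq_len=32_768, trained_len=2_048)
print(pos.max())   # ~2047.9 -> RoPE angles stay inside the regime the model has actually seen
```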
> [!NOTE] The practical consequence
> When you see a Transformer model advertise "128K context window," that capability usually required additional training or fine-tuning specifically for long contexts — it didn't come for free from the base model. Claude 3's 200K context, GPT-4's 128K context — these all involved specific engineering work to extend context length beyond the base training window.
---
### 3.3 Why SSMs Naturally Generalise to Longer Lengths
An SSM's recurrence does not depend on absolute position. The update rule:
```
h(t) = A·h(t-1) + B·x(t)
```
...is the same mathematical operation at timestep `t = 5` as at timestep `t = 50,000`. The model never directly "sees" the position number `t`; it only sees the current token `x(t)` and the accumulated state `h(t-1)`. There is no clock that starts blinking at position 2049.[^4]
This means an SSM trained on 4,096-token sequences, when applied to a 65,536-token sequence, runs *the same computation for each token* — just for more steps. The extrapolation is free.
**The important caveat:** While the *architecture* generalises, the *learned representations* may not be perfectly calibrated for very long sequences. If the training data never had long-range dependencies beyond 4K tokens, the model may not have learned the *capacity* to use 64K worth of state effectively. This is a learning / data problem, not an architectural one.
---
### 3.4 Griffin — Direct Evidence of Extrapolation
The **Griffin paper** (Google DeepMind, 2024) provides the clearest empirical demonstration of SSM length generalisation.[^8]
Griffin is a hybrid architecture combining:
- **Hawk**: a pure RNN using gated linear recurrences (no attention)
- **Griffin**: Hawk plus local attention windows (hybrid)
Key finding: **"Griffin can extrapolate on sequences significantly longer than those seen during training."**[^8]
Concretely, Griffin models trained on fixed-length sequences were evaluated on sequences substantially longer without any additional fine-tuning. Performance remained competitive, demonstrating that the recurrent structure transfers naturally to longer sequences.
Additional Griffin results that contextualise this:
- Griffin-14B matches Llama-2 performance despite training on **6× fewer tokens** — suggesting the architecture is more data-efficient, not just longer-capable
- At inference, Griffin has lower latency and higher throughput than equivalently sized Transformers, because inference uses the recurrent form (constant memory) rather than the KV-cache-growing form[^8]
> [!TIP] The training cost vs. inference benefit tradeoff
> Griffin demonstrates that SSMs have a particularly attractive property for deployment: **train on short sequences, deploy on long sequences**. Training on short sequences is cheaper (less memory, faster throughput); deploying on long sequences doesn't require retraining. For applications like legal document analysis, genomics, or long-conversation assistants, this could substantially reduce the total cost of a model deployment.
---
### 3.5 Mamba's Length Generalisation Properties
Mamba (Gu & Dao, 2023) is designed around selective state spaces — SSMs where the `B`, `C`, and `Δ` parameters are *input-dependent* rather than fixed.[^4] This selectivity is the key to Mamba's quality advantage over fixed SSMs.
Regarding length generalisation:
- Mamba's recurrent structure is architecturally positioned for the same free extrapolation as other SSMs
- The selective mechanism `Δ` (which controls how much state is updated at each step) can in principle learn to manage longer-range dependencies if exposed to them in training
- The Mamba paper reports strong performance on sequences up to a million tokens long on genomics data — the model processes these efficiently without architectural modification[^4]
**Mamba's length story in practice:** The Mamba paper demonstrated training and inference on sequences up to 1 million tokens in genomics, making it a practical choice for ultra-long-sequence domains that are simply inaccessible to Transformers without extraordinary infrastructure.
---
### 3.6 Comparison Summary
| Model Family | Length Generalisation | Why |
|---|---|---|
| Transformer (RoPE) | Poor — breaks beyond training length | Position angles out of distribution |
| Transformer (RoPE + fine-tuning) | Moderate — needs explicit extension | Position Interpolation re-trains distribution |
| Pure SSM (S4, Mamba, Hawk) | Strong — architectural free lunch | No position encoding; same math at any `t` |
| Hybrid (Griffin, Jamba) | Strong for SSM layers; limited for attention layers | Recurrent parts extrapolate; attention parts don't |
> [!NOTE] Takeaway for the document
> The length generalisation story is one of the clearest practical advantages of SSMs over Transformers. An SSM trained cheaply on short sequences can be deployed on long sequences without retraining. This is not just a theoretical property — Griffin demonstrated it empirically. The Transformer community's response (Position Interpolation, ALiBi, etc.) involves substantial additional work that SSMs simply do not need.
---
## 4. Vision and Image SSMs — Mamba for Computer Vision
### 4.1 The Vision Transformer Problem
Before asking how SSMs handle images, it's worth understanding how Vision Transformers (ViTs) do — and what the cost is.
**ViT's approach (Dosovitskiy et al., 2020):** Divide an image into a grid of non-overlapping patches (e.g., 16×16 pixels each). Flatten each patch into a vector. Treat each patch as a "token." Run standard Transformer attention over all patch tokens.
For a 224×224 image with 16×16 patches, that gives you a sequence of **(224/16)² = 196 tokens**. Standard attention is O(N²) — so ViT's cost is O(196²) ≈ 38,000 operations just for the attention computation.
For higher resolution images — say, 1024×1024 with 16×16 patches — that's (64)² = 4,096 tokens, and O(4096²) ≈ 16 million attention operations. The quadratic scaling makes ViT very expensive for high-resolution images or dense prediction tasks (segmentation, detection) where fine-grained spatial understanding is needed.[^9]
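The arithmetic from the two paragraphs above, as a quick sketch:

```python
def vit_attention_cost(image_size: int, patch_size: int = 16) -> tuple[int, int]:
    """Number of patch tokens and pairwise attention interactions for a square image."""
    n_tokens = (image_size // patch_size) ** 2
    return n_tokens, n_tokens ** 2

for size in (224, 1024):
    tokens, pairs = vit_attention_cost(size)
    print(f"{size}x{size}: {tokens} tokens, {pairs:,} attention pairs")
# 224x224:   196 tokens,     38,416 attention pairs
# 1024x1024: 4096 tokens, 16,777,216 attention pairs
```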
---
### 4.2 The Core Challenge: Images Are 2D, SSMs Are 1D
Mamba processes **sequences** — a 1D ordered list of tokens. Images are inherently **2D grids**. A pixel at position (row 3, col 7) has spatial relationships with its neighbours in all four directions, not just "what came before it in a left-to-right reading."
This mismatch is the central engineering challenge for vision SSMs. Two papers released in January 2024 — Vision Mamba and VMamba — proposed different solutions.
---
### 4.3 Vision Mamba (Vim) — Bidirectional Scanning
**Vision Mamba** (Zhu et al., 2024, ICML 2024) adapts Mamba for images with a bidirectional approach:[^9]
1. **Flatten**: Divide the image into patches (same as ViT). Flatten into a 1D sequence.
2. **Mark positions**: Add position embeddings to the flattened patch tokens. (Note: unlike text SSMs, Vim *does* use position embeddings — because the order of a flattened image doesn't inherently encode spatial relationships. If you flatten row-by-row, the last patch of one row and the first patch of the next row are adjacent in the sequence but lie on opposite sides of the image.)
3. **Bidirectional Mamba**: Run Mamba forwards *and* backwards over the sequence — a concatenation that ensures each patch can "see" all other patches, not just earlier ones.
The result: **O(N) complexity over patches** instead of ViT's O(N²).
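A heavily simplified sketch of the bidirectional idea, reusing the toy recurrence from Section 1.4 (the shapes, the sum-merge, and the random parameters are my simplifications; the actual Vim block also includes convolutions and gating):

```python
import numpy as np

def scan(x, A, B, C):
    """Linear recurrence over a sequence of patch features: h = A·h + B·x, y = C·h."""
    h = np.zeros(A.shape[0])
    out = []
    for x_t in x:
        h = A @ h + B @ x_t
        out.append(C @ h)
    return np.array(out)

rng = np.random.default_rng(0)
n_patches, d_patch, d_state = 196, 32, 16               # 14x14 grid of patch embeddings
patches = rng.normal(size=(n_patches, d_patch))
A = np.diag(rng.uniform(0.9, 0.99, d_state))
B = rng.normal(size=(d_state, d_patch)) * 0.1
C = rng.normal(size=(d_patch, d_state)) * 0.1

forward  = scan(patches, A, B, C)                        # scan the flattened patches left-to-right
backward = scan(patches[::-1], A, B, C)[::-1]            # scan right-to-left, then re-align
merged   = forward + backward                            # every patch now has context from both directions
print(merged.shape)   # (196, 32)
```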
**Vim-base (98M parameters) vs. DeiT-base (86M parameters) on ImageNet-1K:**
| Model | Top-1 Accuracy | Inference Speed (1248×1248) | GPU Memory (1248×1248) |
|---|---|---|---|
| DeiT-Base (ViT) | 81.8% | 1× | 1× |
| Vim-Base | **81.9%** | **2.8× faster** | **86.8% less** |
Vim matches DeiT's accuracy while being dramatically faster and more memory-efficient at high resolution — a consequence of the O(N) vs O(N²) scaling.[^9]
> [!NOTE] The 1248×1248 comparison
> At 1248×1248 resolution (roughly 31× the pixels of a standard 224×224 benchmark image), attention's memory cost grows quadratically with the number of patches while Vim's grows linearly, which is how Vim ends up using 86.8% less GPU memory than DeiT at that resolution. The divergence grows with resolution, and that matters enormously for medical imaging, satellite imagery, and high-resolution photography.
---
### 4.4 VMamba — 2D-Aware Scanning with Four Directions
**VMamba** (Liu et al., 2024, NeurIPS 2024 Spotlight) takes a different approach to the 2D problem:[^10]
Instead of flattening and scanning once or twice, VMamba's core module — **2D Selective Scan (SS2D)** — scans the image in *four* directions:
- Top-left → bottom-right
- Bottom-right → top-left
- Top-right → bottom-left
- Bottom-left → top-right
Each scan produces a separate sequence of hidden states; the four are merged to produce the output.
**Why four directions?** Any single 1D scan creates a "seam" — a spatial discontinuity between the end of one row and the start of the next. By scanning in multiple directions, the seams in different scans fall in different places, and the merge step can fill in the gaps. A pixel in the middle of the image gets good context from all four diagonal directions, approximating the all-to-all connectivity that 2D convolutions or attention would provide.
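A sketch of just the traversal-order bookkeeping (not the full SS2D block), using row-major and column-major sweeps in both directions as a stand-in for the four scans described above:

```python
import numpy as np

def four_scan_orders(height: int, width: int) -> list[np.ndarray]:
    """Four flattening orders of an H×W patch grid: row- and column-major, forward and reversed."""
    idx = np.arange(height * width).reshape(height, width)
    row_major = idx.reshape(-1)
    col_major = idx.T.reshape(-1)
    return [row_major, row_major[::-1], col_major, col_major[::-1]]

for order in four_scan_orders(3, 3):
    print(order)
# Each ordering is scanned by a 1D selective SSM; each result is un-permuted
# with np.argsort(order) back into the 2D grid, and the four outputs are merged,
# so every patch receives context from all four sweeps.
```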
VMamba achieves linear time complexity while maintaining competitive performance with Swin Transformer and ViT variants across ImageNet classification, COCO detection, and ADE20k segmentation.[^10]
---
### 4.5 The Scanning Problem — A Key Research Theme
Both papers highlight a fundamental tension that text SSMs don't have: **images don't have a natural "forward" direction**.
In text, causality is clear: earlier words come first, later words come after. An SSM scanning left-to-right processes information in the same order a human would read it. But images have no such arrow — spatial context is needed in all directions simultaneously.
The solutions explored so far:
- **Bidirectional scan** (Vim): forward + backward over the flattened sequence
- **Multi-directional scan** (VMamba): four diagonal sweeps
- **Zigzag scan** (other variants): serpentine path through the image grid
- **Hilbert curve scan** (research): space-filling curve that preserves 2D locality better than row-major order
This is an open research area — the "best" scanning strategy for different vision tasks is still being worked out.
> [!TIP] Practical implication
> For standard image classification benchmarks at 224×224, ViT and Mamba-based models perform comparably, making the choice a wash. The SSM advantage becomes meaningful at **high resolution** and in **dense prediction** tasks (detection, segmentation) where the quadratic cost of attention over many patches becomes the bottleneck. Medical imaging is a particularly promising application: high-resolution scans with fine spatial structure, where linear scaling over patches could enable models that ViT cannot afford.
---
### 4.6 Performance Benchmarks
**ImageNet-1K Top-1 Accuracy** (selected models, as of early 2024):
| Model | Type | Params | Top-1 Acc. |
|---|---|---|---|
| DeiT-Tiny | ViT | 5M | 72.2% |
| Vim-Tiny | SSM | 7M | 76.1% |
| DeiT-Small | ViT | 22M | 79.8% |
| Vim-Small | SSM | 26M | 80.5% |
| DeiT-Base | ViT | 86M | 81.8% |
| Vim-Base | SSM | 98M | **81.9%** |
| Swin-T | Hierarchical ViT | 28M | 81.3% |
| VMamba-T | SSM (4-dir scan) | 31M | **82.2%** |
The trend: SSM-based vision models are **competitive with ViT at similar parameter counts**, and increasingly exceed ViT accuracy at higher resolutions or parameter budgets.[^9][^10]
---
### 4.7 What This Means for the Broader Story
The success of vision SSMs is significant beyond the accuracy numbers:
1. **It confirms SSMs aren't just a language trick.** Mamba's efficiency advantages transfer to computer vision — the same O(N) vs O(N²) story applies equally to patch tokens as to word tokens.
2. **It opens up high-resolution vision AI.** Applications like whole-slide pathology images (gigapixel scans), satellite imagery, and 4K video analysis become tractable with linear-scaling architectures in ways they are not with attention.
3. **The 2D scanning challenge will converge.** Just as text Transformers eventually found optimal positional encodings (RoPE), vision SSMs will likely converge on best-practice scanning strategies. Early results suggest the specific scanning method matters less than the linear scaling itself.
> [!NOTE] Cross-domain unification
> One of the most interesting directions in 2024–2025 research is **multimodal SSMs**: architectures that handle text, images, and audio within a single SSM backbone, exploiting the fact that all three are ultimately sequences that benefit from O(N) scaling. This would be architecturally difficult for pure ViT-style Transformers (different quadratic costs for different modalities) but is a natural fit for SSMs.
---
## Footnotes
[^1]: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). "Attention Is All You Need." *NeurIPS 2017*. arXiv:1706.03762. Original Transformer paper; introduces sinusoidal position encoding and demonstrates permutation-equivariance of attention without it.
[^2]: Chen, S., Wong, S., Chen, L., & Tian, Y. (2023). "Extending Context Window of Large Language Models via Positional Interpolation." arXiv:2306.15595. Demonstrates catastrophic failure of RoPE-based LLMs when extrapolating beyond trained context length; proposes Position Interpolation as a fine-tuning fix. Upper bound of interpolation error is ~600× smaller than extrapolation error.
[^3]: Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864. Introduces RoPE; shows rotation matrix encoding of absolute position produces relative position dependency in attention dot products, with desirable decaying inter-token dependency with increasing distance.
[^4]: Gu, A., & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752. Introduces selective SSMs where B, C, Δ are input-dependent. Reports 5× throughput over Transformers; linear scaling to million-length sequences for genomics; Mamba-3B outperforms Transformers twice its size.
[^5]: Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., et al. (2022). "In-context Learning and Induction Heads." *Transformer Circuits Thread*. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html. Identifies induction heads as primary mechanism for in-context learning; documents phase change coinciding with ICL acquisition.
[^6]: Fu, D.Y., Dao, T., Saab, K.K., Thomas, A.W., Rudra, A., & Ré, C. (2023). "Hungry Hungry Hippos: Towards Language Modeling with State Space Models." *ICLR 2023 (Spotlight)*. arXiv:2212.14052. Measures associative recall accuracy: S4D 20.1%, GSS 27.1%, Attention 100%. Proposes H3 layer to close gap; achieves within 0.4 PPL of Transformers on OpenWebText. Hybrid H3-attention outperforms Transformers by 1.0 PPL.
[^7]: Arora, S., Eyuboglu, S., Timalsina, A., Johnson, I., Poli, M., Zou, J., Rudra, A., & Ré, C. (2023). "Zoology: Measuring and Improving Recall in Efficient Language Models." arXiv:2312.04927. Pretrained 17 models; found 82% of Transformer–SSM quality gap explained by associative recall; 70M attention outperforms 1.4B gated-convolution; formalised MQAR task; Based architecture closes 97.4% of gap with sub-quadratic scaling.
[^8]: De, S., Smith, S.L., Fernando, A., Botev, A., Muraru, G.-C., Gu, A., Haroun, R., Berrada, L., Chen, Y., Srinivasan, S., Desjardins, G., Doucet, A., Budden, D., Teh, Y.W., Pascanu, R., De Freitas, N., & Gulcehre, C. (2024). "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models." arXiv:2402.19427. Introduces Hawk (pure gated-recurrent) and Griffin (gated-recurrent + local attention hybrid). Griffin matches Llama-2 on 6× fewer tokens; explicitly demonstrates extrapolation to sequences longer than training; scales to 14B parameters.
[^9]: Zhu, L., Liao, B., Zhang, Q., Wang, X., Liu, W., & Wang, X. (2024). "Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model." *ICML 2024*. arXiv:2401.09417. Vim-Base achieves 81.9% top-1 on ImageNet-1K (vs. DeiT-Base 81.8%); 2.8× faster inference; 86.8% GPU memory reduction at 1248×1248 resolution relative to DeiT.
[^10]: Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., Jiao, J., & Liu, Y. (2024). "VMamba: Visual State Space Model." *NeurIPS 2024 (Spotlight)*. arXiv:2401.10166. Introduces Visual State Space (VSS) blocks with 2D Selective Scan (SS2D) traversing four scanning routes to bridge 1D selective scan and 2D vision data structure; VMamba-T achieves 82.2% top-1 on ImageNet-1K.