# Efficient Transformers, RWKV, and Linear Recurrent Units
> **Part of:** SSMs vs Transformers research project
> **See:** [[index]] for full research inventory | [[../STEERING.md]] for project guidance
> **Cross-links:** [[transformers-basics]] | [[ssm-basics]] | [[hybrid-models]] | [[computational-complexity]] | [[real-world-products]]
>
> **Status:** Research complete ✅
---
## TL;DR
Researchers have spent years trying to make Transformers faster. Some approaches (FlashAttention) make the *hardware* faster without changing the underlying math. Others (sparse attention, linear attention) actually change the math — but each shortcut comes with a tradeoff. Meanwhile, a parallel universe of architectures — RWKV, Griffin, Linear Recurrent Units — converges on similar ideas from a different direction. By the end, you'll see why everyone is building toward the same destination.
---
## 1. FlashAttention: Engineering Genius, Not Mathematical Magic
### What is FlashAttention?
Standard attention does something that seems obvious but turns out to be catastrophically slow: it computes the full **N×N attention matrix** and writes it to GPU memory (called HBM — High Bandwidth Memory), then reads it back for the next step. For a sequence of 8,000 tokens, that's 64 million numbers traveling back and forth.
FlashAttention, introduced by Tri Dao et al. (2022), asks: *what if we never wrote that matrix to slow memory at all?*[^1]
The key insight is **tiling**: instead of computing the entire attention matrix at once, FlashAttention computes small tiles that fit inside the GPU's ultra-fast on-chip SRAM (like L1 cache on a CPU). It accumulates the final result as it goes, using a mathematical trick to merge partial softmax computations correctly. The full N×N matrix is never materialized in slow memory.
> [!important] The Critical Distinction
> FlashAttention does **not** reduce the number of floating-point operations. It still performs O(N²) multiplications — the same math as before. What it reduces is **memory reads and writes** (IO complexity). This matters enormously in practice because GPU memory bandwidth is often the real bottleneck, not raw computation.
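To make "merge partial softmax computations" concrete, here is a minimal NumPy sketch of the streaming (online-softmax) idea for a single query vector, processing keys and values tile by tile. It is purely illustrative: the function and variable names are ours, and the real algorithm is a fused CUDA kernel working on SRAM tiles, not a Python loop.
```python
import numpy as np

def online_softmax_attention(q, K, V, tile=4):
    """Toy sketch of FlashAttention's streaming idea for ONE query vector.

    Processes K/V in tiles, keeping only a running max `m`, a running
    normalizer `l`, and a running weighted sum `acc` -- the full row of
    attention scores is never stored.
    """
    d = q.shape[0]
    m = -np.inf          # running max of scores (numerical stability)
    l = 0.0              # running softmax denominator
    acc = np.zeros(d)    # running numerator: softmax-weighted sum of values

    for start in range(0, K.shape[0], tile):
        Kb, Vb = K[start:start + tile], V[start:start + tile]
        s = Kb @ q / np.sqrt(d)        # scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)      # rescale previously accumulated results
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ Vb
        m = m_new
    return acc / l

# Sanity check against ordinary softmax attention:
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(32, 8)), rng.normal(size=(32, 8))
scores = K @ q / np.sqrt(8)
w = np.exp(scores - scores.max())
assert np.allclose(online_softmax_attention(q, K, V), (w / w.sum()) @ V)
```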
### What Does FlashAttention Actually Achieve?
| Version | Speedup vs baseline | Peak FLOPs/s utilization | Notes |
|---------|--------------------|--------------------------|----|
| FlashAttention v1 (2022) | 2–4× | ~25–40% of peak | Linear memory (no N² matrix stored) [^1] |
| FlashAttention v2 (2023) | ~2× over v1 | 50–73% on A100 | Better thread/warp work partitioning [^2] |
FlashAttention-2 achieved 225 TFLOPs/s per A100 GPU during end-to-end GPT training — approaching the theoretical efficiency of plain matrix multiplication.[^2]
### What FlashAttention Does NOT Change
- The **O(N²) time complexity** remains. Double the sequence length → 4× more computation.
- The **fundamental capability** — every token still attends to every other token — remains identical.
- It is **exact attention**, not an approximation. You get the same outputs as standard attention.
> [!quote] Analogy
> FlashAttention is like reorganizing a library so the most-used books are on your desk instead of in the basement. You still read every book needed for the task — you just stop sprinting to the basement and back between each one.
FlashAttention also enables a **block-sparse** variant, where you skip certain tiles entirely (zeroing out long-range connections you've decided aren't needed). This is approximate but much faster for very long sequences.[^1]
---
## 2. The Efficiency Landscape: A Taxonomy of Approaches
Every approach to making attention faster sits somewhere on a **complexity vs. quality tradeoff** spectrum:
```
QUALITY (task performance)
▲
Full Attention ● │ ← Gold standard, O(N²), slow
(exact) │
FlashAttention ● │ ← Same quality, faster hardware use
│
Sliding Window ● │ ← Near-full quality, O(N·W)
(Mistral) │
Sparse Attn. ● │ ← Good quality, O(N·√N) or O(N log N)
(BigBird) │
Linear Attn. ● │ ← Approximate, O(N), quality gap
(Performer) │
SSMs (Mamba) ● │ ← Different formulation, O(N), inference = O(1)
│
RNNs (vanilla) ● │ ← O(N) but train slowly, poor quality
└──────────────────────────────────►
EFFICIENCY (N-scaling)
```
### The Big Comparison Table
```
╔══════════════════════╦═══════════════╦══════════════╦═════════════════════════════════════════╗
║ Approach ║ Time (train) ║ Mem (train) ║ Notes & Key Limitation ║
╠══════════════════════╬═══════════════╬══════════════╬═════════════════════════════════════════╣
║ Standard Attention ║ O(N²) ║ O(N²) ║ Exact; memory wall hits ~4K tokens ║
║ FlashAttention ║ O(N²) FLOPs ║ O(N) ║ IO-efficient; same quality, faster HW ║
║ Sliding Window (SWA) ║ O(N·W) ║ O(N·W) ║ W=window size; misses very long deps ║
║ Longformer ║ O(N·W + N·G) ║ O(N) ║ Local + global tokens; task-specific ║
║ BigBird ║ O(N) ║ O(N) ║ Random + window + global; universal approx║
║ Linformer ║ O(N) ║ O(N) ║ Low-rank approx; fixed seq lengths only ║
║ Performer (FAVOR+) ║ O(N) ║ O(N) ║ Kernel approx; variance in quality ║
║ RetNet ║ O(N) parallel ║ O(N) ║ Recurrence-based; no softmax ║
║ SSM / Mamba ║ O(N) parallel ║ O(N) ║ Selective state; O(1) inference memory ║
║ RWKV ║ O(N) parallel ║ O(N) ║ Linear attn + RNN; O(1) inference ║
╚══════════════════════╩═══════════════╩══════════════╩═════════════════════════════════════════╝
```
*N = sequence length, W = window size, G = number of global tokens*
---
## 3. Sparse Attention: Skipping the Parts You Don't Need
### Core Idea
Instead of letting every token attend to every other token (N×N = dense), sparse attention architectures only allow each token to attend to a *selected subset* of other tokens. The resulting attention matrix is *sparse* — mostly zeros.
This can bring complexity from O(N²) down to O(N·√N), O(N log N), or even O(N).
### Longformer (Allen AI, 2020)[^3]
Longformer uses **two types of attention simultaneously**:
1. **Local sliding window attention**: Each token attends to a window of ±W neighboring tokens (like a context window around each word). This is O(N·W).
2. **Global attention**: A small set of special tokens (like the `[CLS]` classification token, or task-specific tokens) attends to *the entire sequence*. This anchors global understanding.
> [!example] Think of it Like a Newspaper Editor
> Most reporters only talk to their colleagues on the same story (local). But the editor-in-chief (global token) talks to everyone. The overall picture is built from both.
**Result**: Handles documents of ~4,096 tokens (8× BERT's 512-token limit). Used in scientific paper summarization (arXiv dataset), long-document QA, etc.[^3]
### BigBird (Google, 2020)[^4]
BigBird extends sparse attention to three components per head:
1. **Local windowed** — same as Longformer
2. **Global tokens** — O(1) designated tokens that see everything
3. **Random attention** — each token also attends to a few randomly selected tokens
The random component is theoretically important: the paper proves BigBird is a **universal approximator** of sequence functions and is **Turing complete** — meaning it can compute anything full attention can, even with sparse connections. Handles sequences up to 8× longer than standard Transformers on the same hardware.[^4]
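As a rough illustration of how these three components combine, here is a toy NumPy sketch that builds a BigBird-style boolean attention mask (local window + global tokens + random links). All parameter names are ours, and real implementations use block-sparse layouts for GPU efficiency rather than a dense boolean matrix.
```python
import numpy as np

def bigbird_style_mask(n, window=3, n_global=2, n_random=2, seed=0):
    """Toy BigBird-style attention mask (True = position may be attended to)."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)

    # 1) local sliding window: each token sees +/- `window` neighbours
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True

    # 2) global tokens: the first `n_global` positions attend everywhere
    #    and every token attends to them
    mask[:n_global, :] = True
    mask[:, :n_global] = True

    # 3) random links: each token also attends to a few random positions
    for i in range(n):
        mask[i, rng.choice(n, size=n_random, replace=False)] = True
    return mask

print(bigbird_style_mask(8).astype(int))
```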
### Limitations of Sparse Attention
- The sparsity pattern must be **designed in advance** (which tokens can talk to which). This is inflexible — what if the important relationship is between token 3 and token 3,841?
- Random attention helps but is not the same as the model *choosing* which tokens to attend to.
- Still O(N) in memory for moderate window sizes, but the constant factor is large.
- Hard to implement efficiently on GPUs, which are optimized for dense matrix operations.
---
## 4. Sliding Window Attention: The Mistral Trick
### What Is It?
Sliding window attention (SWA) is the simplest version of sparse attention: each token attends only to the *W* most recent tokens in its local window. No global tokens, no random jumps.
In **Mistral 7B** (2023), SWA with window W=4,096 is combined with a **rolling buffer KV-cache** — a fixed-size cache that stores only the last W keys and values, overwriting old ones.[^5]
> [!info] How Mistral Gets "Infinite" Context
> Mistral can *generate* sequences of arbitrary length because its KV-cache doesn't grow. But each layer can only attend W positions back; stacking layers lets information propagate somewhat further indirectly, yet anything that has fallen out of the rolling buffer can no longer be retrieved exactly.
**The result**: Mistral 7B uses grouped-query attention (GQA) + sliding window to achieve significantly faster inference than Llama-2 13B while using less memory, beating Llama-2 13B on all benchmarks with fewer parameters.[^5]
**Tradeoff**: Information from >4K tokens ago is effectively forgotten during generation. Works surprisingly well for conversations and code, which tend to be locally coherent.
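A minimal sketch of the rolling-buffer idea, assuming a fixed window `W`: position `i` is written to slot `i % W`, so memory never grows and anything older than the window is overwritten. The class and method names below are ours, not Mistral's.
```python
import numpy as np

class RollingKVCache:
    """Toy rolling-buffer KV cache: fixed memory, at most `window` entries."""

    def __init__(self, window, d):
        self.window = window
        self.keys = np.zeros((window, d))
        self.values = np.zeros((window, d))
        self.t = 0  # number of tokens seen so far

    def append(self, k, v):
        slot = self.t % self.window        # overwrite the oldest entry
        self.keys[slot], self.values[slot] = k, v
        self.t += 1

    def visible(self):
        """Keys/values the current token may attend to (at most W of them)."""
        n = min(self.t, self.window)
        return self.keys[:n], self.values[:n]

cache = RollingKVCache(window=4, d=2)
for i in range(10):                        # 10 tokens, but memory stays at 4 slots
    cache.append(np.full(2, i), np.full(2, i))
print(cache.visible()[0])                  # only the last 4 keys survive (in slot order)
```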
---
## 5. Linear Attention: The Dream That Almost Works
### The Big Idea
Standard attention computes:
```
Attention(Q, K, V) = softmax(QKᵀ / √d) · V
```
The softmax creates the N×N matrix. What if we used a different similarity function — one we could factor into: `kernel(Q, K) = φ(Q) · φ(K)ᵀ` where φ is some feature map?
If we can do that, we can reorder the matrix multiplication:
```
Instead of: (φ(Q) · φ(K)ᵀ) · V ← O(N²) in the middle
Compute: φ(Q) · (φ(K)ᵀ · V) ← O(N) because inner product first
```
This is the idea behind **linear attention** — compute the key-value interaction first, accumulate a compact *context matrix*, then apply queries against that matrix.
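Here is the reordering written out in NumPy for the simplest (non-causal, unnormalized) case, using `elu(x) + 1` as one common choice of φ from the linear-attention literature; the causal version replaces the single context matrix with a running prefix sum. The variable names are ours.
```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 512, 64
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
phi = lambda X: np.where(X > 0, X + 1.0, np.exp(X))   # elu(x) + 1, stays positive
fQ, fK = phi(Q), phi(K)

out_quadratic = (fQ @ fK.T) @ V   # builds an N x N matrix: O(N^2 d) time, O(N^2) memory
context = fK.T @ V                # d x d "context matrix": O(N d^2)
out_linear = fQ @ context         # O(N d^2), the N x N matrix is never formed

assert np.allclose(out_quadratic, out_linear)   # identical result, different cost
```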
### Linformer (Facebook, 2020)[^6]
Linformer projects the N-length K and V matrices down to a fixed rank r < N using a learned projection. The attention matrix becomes (N × r) instead of (N × N).
**Works well when**: The attention matrix is approximately low-rank (much of the useful information is concentrated in a small number of directions). The paper shows this holds empirically for NLP tasks.
**Fails when**: Sequences vary in length (the projection is length-dependent), or when sharp, precise attention is needed (the low-rank approximation smooths out spiky attention patterns).
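A bare-bones sketch of the projection trick, with a random matrix standing in for the learned projection `E` (names and shapes are ours for illustration):
```python
import numpy as np

rng = np.random.default_rng(0)
N, d, r = 1024, 64, 128
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
E = rng.normal(size=(r, N)) / np.sqrt(N)     # stands in for the learned projection

K_proj, V_proj = E @ K, E @ V                # shapes (r, d): length axis compressed
scores = Q @ K_proj.T / np.sqrt(d)           # (N, r) instead of (N, N)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V_proj                       # (N, d)
```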
### Performer / FAVOR+ (Google, 2020)[^7]
Performers use **random feature maps** to approximate the softmax kernel. The key insight: `exp(qᵀk)` can be approximated by the expectation of random features `φ(q)ᵀφ(k)` using orthogonal random projections.
**Strengths**: Unbiased or nearly-unbiased estimates; provably approximates softmax; O(N) in time and memory; competitive with sparse methods on benchmarks.[^7]
**Weaknesses**: The variance of the approximation can be high, especially for large softmax values (sharp attention peaks). Performance degrades on tasks requiring **precise, narrow** attention to specific tokens — like copying exact strings or precise in-context learning.
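For intuition, here is a minimal sketch of the positive-random-feature estimator, ignoring the orthogonality and numerical-stability tricks from the paper: with Gaussian projections, the expectation of φ(q)·φ(k) equals exp(q·k), so averaging over many random features approximates the softmax kernel. Function and variable names are ours.
```python
import numpy as np

def positive_random_features(X, W):
    """Simplified FAVOR+-style positive random features: phi(x) = exp(Wx - ||x||^2/2)/sqrt(m)."""
    m = W.shape[0]
    return np.exp(X @ W.T - 0.5 * np.sum(X**2, axis=-1, keepdims=True)) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 16, 20000                      # more features -> lower variance
q, k = rng.normal(size=d) * 0.3, rng.normal(size=d) * 0.3
W = rng.normal(size=(m, d))           # i.i.d. Gaussian projections
approx = positive_random_features(q[None], W) @ positive_random_features(k[None], W).T
print(approx.item(), np.exp(q @ k))   # the two numbers should be close
```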
### Why Linear Attention Doesn't Fully Replace Quadratic Attention
> [!warning] The Fundamental Tension
> Linear attention methods compress all past context into a **fixed-size matrix** (the accumulated key-value outer product). This is mathematically equivalent to a **state** — like an RNN's hidden state. But unlike exact attention, you cannot *look up* a specific past token with precision.
This is the core tradeoff:
- **Exact attention**: Any token can precisely query any past token. Great for tasks needing exact retrieval (e.g., "what was the user's name 2,000 tokens ago?")
- **Linear attention**: All past information is *blended* into a fixed-size summary. Retrieval is approximate. Sharp attention patterns are smoothed over.
The research paper "Zoology" (2023) showed that **associative recall** — the ability to retrieve a specific key-value pair from context — is where linear methods most consistently fail vs. full attention. See also [[zoology-associative-recall]] for the full analysis.
---
## 6. Why O(N²) Persists: The Fundamental Limit
Even with FlashAttention making hardware more efficient, the underlying O(N²) computation remains. Here's why it's genuinely unavoidable in standard attention:
**Step 1**: For each of N tokens, compute how much it should attend to each of the N other tokens → N² dot products.
**Step 2**: Apply softmax → N² values.
**Step 3**: Weight-sum the V matrix → N² multiplications.
You can make each step faster (FlashAttention does), but you can't eliminate the steps without changing the mathematical model.
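The three steps written out as naive NumPy, just to make the N² cost visible (a sketch, not how production kernels are written):
```python
import numpy as np

def naive_attention(Q, K, V):
    """The three steps above; every line touches N^2 numbers."""
    N, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                            # Step 1: N^2 dot products
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = scores / scores.sum(axis=-1, keepdims=True)    # Step 2: softmax over N^2 values
    return weights @ V                                       # Step 3: N^2 weighted sums

N, d = 4096, 64
Q = K = V = np.zeros((N, d))
print(f"attention matrix entries: {N*N:,}")                  # ~16.8 million at N=4096
_ = naive_attention(Q, K, V)
```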
> [!important] The Scaling Problem in Practice
> - At N=512: ~260K attention pairs → fast
> - At N=4,096: ~16M attention pairs → manageable
> - At N=32,768: ~1B attention pairs → slow
> - At N=131,072 (GPT-4 context): ~17B attention pairs → requires FlashAttention + special hardware
This is why models with very long context windows (Claude 200K, Gemini 1M) require enormous GPU resources, even with FlashAttention. The O(N²) wall is real.
**SSMs and linear recurrent models solve this differently**: rather than attending to all past tokens, they compress past information into a fixed-size **state vector** that's updated step-by-step. Inference cost is O(1) per token regardless of context length. The cost: you must decide what to remember and what to forget.
See [[computational-complexity]] for a deeper treatment of O() analysis.
---
## 7. RWKV Architecture: The Transformer That Thinks Like an RNN
### What Does RWKV Stand For?
**R**eceptance, **W**eight, **K**ey, **V**alue — the four components of RWKV's attention-like mechanism (the W is a learned, per-channel time-decay weight). Pronounced "RWaKuV."
Created by Bo Peng and the RWKV open-source community; joined the Linux Foundation in 2023. Scaled to 14B parameters — the largest dense RNN ever trained.[^8]
### The Core Idea: Dual-Mode Operation
RWKV's fundamental innovation is that **the same model can be run in two equivalent modes**:
```
┌─────────────────────────────────────────────────┐
│ RWKV MODEL │
│ │
│ TRAINING MODE INFERENCE MODE │
│ (Transformer-like) (RNN-like) │
│ │
│ Sees all tokens Sees one token at a time │
│ simultaneously with a hidden state │
│ → O(N) parallel → O(1) per step │
│ scan constant memory │
└─────────────────────────────────────────────────┘
```
During training, you feed the whole sequence at once (like a Transformer). During inference, you run token-by-token, carrying a fixed-size "state" (like an RNN). Same weights, same results — just different execution patterns.
### How RWKV's "Attention" Works
RWKV uses **linear attention** — not the softmax-based attention of standard Transformers. Here's an intuition:
Standard Transformer attention for position t:
```
output[t] = Σ softmax(q[t]·k[i]) · v[i] for all i ≤ t
```
(Sum over all past tokens — O(N) operations per position → O(N²) total)
RWKV's time-mixing (attention analog):
```
output[t] = sigmoid(r[t]) × (numerator[t] / denominator[t])
where:
  numerator[t]   = e^(u+k[t]) · v[t]  +  Σ_{i<t} e^(-(t-i)·w + k[i]) · v[i]
  denominator[t] = e^(u+k[t])         +  Σ_{i<t} e^(-(t-i)·w + k[i])
```
The key is the **exponential decay** term `e^(-(t-i)·w)`: older tokens contribute exponentially less. This is a **learned decay rate** — the model learns how fast to "forget." This sum can be computed **incrementally** (each step updates from the previous state), enabling the RNN-mode execution.
> [!info] Analogy: Exponentially Weighted Moving Average
> Imagine you're reading a news feed. RWKV remembers recent headlines vividly and older ones faintly — how faint is controlled by the learned decay `w`. You can compute this running average without going back to every past headline; you just multiply the old average by a factor and add the new item.
**Receptance (R)**: A gate that controls how much of the past to incorporate vs. ignore. Acts like the gates in LSTMs.
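To see both claims at once (that the sum can be carried as a small running state, and that the RNN-style and summation-style computations give the same answer), here is a toy single-channel sketch. This is our simplification: real RWKV uses per-channel vectors for `w` and `u`, token-shift mixing, and a numerical-stability trick for the exponentials, all omitted here.
```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def wkv_recurrent(r, k, v, w, u):
    """Single-channel WKV, run RNN-style: a/b carry the decayed sums, so each step is O(1)."""
    a = b = 0.0
    out = np.zeros(len(k))
    for t in range(len(k)):
        num = np.exp(u + k[t]) * v[t] + a            # current token gets the `u` bonus
        den = np.exp(u + k[t]) + b
        out[t] = sigmoid(r[t]) * num / den           # receptance gates the output
        a = np.exp(-w) * (a + np.exp(k[t]) * v[t])   # fold token t in, then decay by e^(-w)
        b = np.exp(-w) * (b + np.exp(k[t]))
    return out

def wkv_direct(r, k, v, w, u):
    """Same quantity computed from the summation form (Transformer-style)."""
    out = np.zeros(len(k))
    for t in range(len(k)):
        past = np.array([np.exp(-(t - i) * w + k[i]) for i in range(t)])
        num = np.exp(u + k[t]) * v[t] + (past * v[:t]).sum()
        den = np.exp(u + k[t]) + past.sum()
        out[t] = sigmoid(r[t]) * num / den
    return out

rng = np.random.default_rng(0)
r, k, v = rng.normal(size=(3, 32))
assert np.allclose(wkv_recurrent(r, k, v, 0.5, 0.3), wkv_direct(r, k, v, 0.5, 0.3))
```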
### How RWKV Differs from Mamba (SSM)
This is the most important distinction to nail down:
| Property | RWKV | Mamba (SSM) |
|----------|------|-------------|
| **Mathematical basis** | Linear attention (reformulated as RNN) | State space model (differential equations → discrete) |
| **How context is stored** | Accumulated weighted key-value outer product | Hidden state vector updated by learned matrices |
| **Selectivity** | Fixed decay schedule per channel (learned but input-independent in v4/v5) | **Input-dependent** — each token decides how to update the state |
| **Expressive power** | Limited by linear attention approximation | Selective scan can focus sharply on certain inputs |
| **Training parallelism** | Parallel prefix scan (O(N)) | Parallel scan (CUDA-level optimization) |
| **Origin** | Reformulated Transformer → RNN | Control theory / signal processing → neural net |
| **Architecture family** | Linear Transformer | SSM |
| **Key limitation** | Fixed decay = cannot dynamically choose to remember/forget | Selectivity adds complexity; harder to implement efficiently |
| **Max scale (as of 2024)** | 14B parameters | ~7B parameters (Mamba-2 exploring larger) |
> [!important] The Selectivity Gap
> In RWKV v4 and v5, the time decay rate `w` is **fixed per channel** — it doesn't change based on the input content. This means RWKV can't dynamically decide "this is an important token, I should remember it longer." Mamba's selective scan does exactly this. RWKV-6 (Finch) introduced dynamic recurrence to partially address this.[^9]
### RWKV Version History
| Version | Codename | Key Innovation |
|---------|----------|---------------|
| RWKV-4 | (original) | Dual-mode Transformer/RNN; linear attention; 14B scale[^8] |
| RWKV-5 | Eagle | Multi-headed matrix-valued states (richer state representation)[^9] |
| RWKV-6 | Finch | Dynamic recurrence mechanism (partial selectivity)[^9] |
| RWKV-7 | Goose | Dynamic State Evolution; surpasses TC0 expressive power limits[^10] |
TC0 (the class of constant-depth threshold circuits) is a formal complexity class. The claim in RWKV-7 is that its Dynamic State Evolution mechanism enables computations beyond what standard linear attention / Transformer-like architectures can do.[^10]
### RWKV vs Mamba: Performance
Both architectures perform **on par with comparably-sized Transformers** on standard NLP benchmarks. The differences show up at extremes:
- **Very long context**: RWKV and Mamba both excel (O(1) inference memory), Transformers struggle.
- **Precise retrieval tasks** (find exact token from 5K tokens ago): RWKV and Mamba both weaker than full attention.
- **Throughput**: RWKV can be faster per token at inference due to simpler state update.
- **Quality at equal scale**: Mamba's selectivity gives it an edge on tasks requiring dynamic memory management; RWKV's linear attention is more predictable.
See [[real-world-products]] for deployment context.
---
## 8. Linear Recurrent Units (LRU): Bridging RNNs and SSMs
### Background: "Resurrecting RNNs for the Transformer Era" (2023)[^11]
The LRU paper (Orvieto et al., Google DeepMind) asks a pointed question: **why do deep SSMs like S4 outperform classical RNNs, when both process sequences step-by-step?** What's the magic ingredient?
The answer turned out to be surprisingly mundane: it's not the differential equations or the HiPPO theory (see [[ssm-basics]]) — it's **careful engineering choices** applied to standard RNNs:
1. **Linearize the recurrence**: Remove the nonlinearity inside the state update (use `h[t] = A·h[t-1] + B·x[t]` instead of `h[t] = tanh(A·h[t-1] + B·x[t])`)
2. **Diagonalize A**: Instead of a full dense state-transition matrix, use a diagonal complex matrix (one complex scalar per state dimension). Dramatically reduces parameters and compute.
3. **Use complex numbers**: Complex eigenvalues allow oscillatory dynamics that real numbers can't express.
4. **Careful initialization**: Initialize the diagonal values near the unit circle (magnitude ~1) to enable long-range gradient flow.
5. **Proper normalization**: Normalize inputs to ensure stable forward propagation.
These changes recover **SSM-level performance** with **standard RNN components**.
### What Is an LRU?
An LRU (Linear Recurrent Unit) applies these principles in a clean block:
```
State update: h[t] = A · h[t-1] + B · x[t]
(A is complex diagonal, learned)
Output: y[t] = Re(C · h[t]) + D · x[t]
(real part of complex state projection)
```
- **A** (diagonal complex): the "forgetting" matrix. The magnitude of each eigenvalue controls how long information persists. |λ| → 1 means long memory; |λ| → 0 means quick forgetting.
- **B, C**: input and output projections (learned linear maps).
- No nonlinearity inside the recurrence → the whole sequence can be computed with a parallel scan at training time, and with eigenvalues initialized near the unit circle, gradients flow across long ranges without vanishing.
The LRU **matches Long Range Arena (LRA) benchmark scores** of SSMs like S4, while being conceptually simpler.[^11]
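A minimal NumPy sketch of the block above, run in its recurrent form. The parameterization is simplified (the paper's exponential parameterization of the eigenvalues and its normalization term are omitted), and the random initialization below only illustrates eigenvalues placed just inside the unit circle; all names are ours.
```python
import numpy as np

def lru_forward(x, lam, B, C, D):
    """Toy LRU: h[t] = A h[t-1] + B x[t], y[t] = Re(C h[t]) + D x[t], with A = diag(lam)."""
    d_state, d_in = B.shape
    h = np.zeros(d_state, dtype=complex)
    ys = []
    for x_t in x:                                   # x: (T, d_in)
        h = lam * h + B @ x_t                       # complex diagonal recurrence
        ys.append((C @ h).real + D @ x_t)           # real part of the projected state
    return np.stack(ys)

rng = np.random.default_rng(0)
d_in, d_state, T = 4, 8, 32
# eigenvalues just inside the unit circle: long memory, stable dynamics
mag = rng.uniform(0.9, 0.999, d_state)
phase = rng.uniform(0, 2 * np.pi, d_state)
lam = mag * np.exp(1j * phase)
B = rng.normal(size=(d_state, d_in)) + 1j * rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_in, d_state)) + 1j * rng.normal(size=(d_in, d_state))
D = rng.normal(size=(d_in, d_in))
y = lru_forward(rng.normal(size=(T, d_in)), lam, B, C, D)
print(y.shape)   # (32, 4)
```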
### Griffin: LRU + Local Attention (Google DeepMind, 2024)[^12]
Griffin interleaves two kinds of temporal-mixing blocks across its layers (each followed by a gated MLP), rather than running both in every layer:
```
Input sequence
      │
Gated linear recurrence block   ← RG-LRU, an LRU variant (the "Hawk" block)
      │
   Gated MLP
      │
Gated linear recurrence block
      │
   Gated MLP
      │
Local sliding-window attention  ← precise recall of the last ~2K tokens
      │
   Gated MLP
      │
 ... (pattern repeats) ...
      │
    Output
```
**Hawk** is the pure-recurrence variant (no attention) — it already exceeds Mamba on downstream tasks.[^12]
**Griffin** (hybrid) matches Llama-2 performance despite being trained on **6× fewer tokens** — a striking result suggesting the architecture is sample-efficient. Scales to 14B parameters.[^12]
The local attention window is the key addition over pure LRU: it lets Griffin recall precise recent context (the last 2K tokens) while the recurrence handles everything older as a compressed state.
> [!info] Throughput vs. Latency
> During training, Griffin matches Transformer hardware efficiency (FLOPs utilization).
> During inference, Griffin has:
> - **Lower latency**: The KV-cache is bounded by the local window, so it does not grow with context length
> - **Higher throughput**: Can process more sequences per second
> This is the practical advantage that makes hybrid architectures compelling for deployment.
---
## 9. The Big Picture: Convergence of Ideas
All these architectures — FlashAttention optimizations, sparse attention, linear attention, SSMs, RWKV, LRU, Griffin — are circling the same fundamental problem from different directions:
> **How do you give a model long memory without paying quadratic cost?**
```
STARTING POINTS:
Transformers ──────────────────────────────────► SSMs / RNNs
(exact attention, (O(1) inference,
quadratic cost) approximate memory)
APPROACHES FROM TRANSFORMER SIDE:
→ FlashAttention: make hardware faster, keep math same
→ Sparse attention: skip unimportant connections
→ Linear attention: approximate softmax with kernel
→ RetNet: use recurrence-style accumulation in attention
→ Sliding window: local context only
APPROACHES FROM SSM/RNN SIDE:
→ Mamba: add selectivity to SSM
→ RWKV: reformulate attention as RNN
→ LRU: recover SSM power using simple linear RNN
→ Griffin: add local attention to LRU
CONVERGENCE ZONE:
→ Hybrid models (Jamba, Zamba, Griffin)
→ "Gated linear attention" looks identical from both perspectives
→ The distinction between "linear attention" and "SSM" is increasingly formal
```
> [!important] The Key Insight
> Linear attention (from the Transformer side) and state space models (from the RNN side) are **mathematically equivalent** for a certain class of models. RWKV is simultaneously described as "linear Transformer" and "RNN" — and both descriptions are correct. Griffin uses LRU (an RNN concept) with local attention (a Transformer concept). Mamba uses SSM math but trains with parallel scans just like Transformers.
>
> The field is converging on: **a fast, trainable recurrence for long context, combined with precise local attention for recent context**. This is the Griffin / Jamba recipe. See [[hybrid-models]] for the specific products.
### Why Does This Matter for Our Report?
The narrative for a layperson report is:
1. Transformers are powerful but expensive (quadratic)
2. People tried to fix this from inside the Transformer (FlashAttention, sparse, linear) — partial success
3. SSMs attacked from the other side — good, but different tradeoffs
4. The best current models blend both — you get the efficiency of recurrence + the precision of local attention
This creates a clean story arc. See [[diagrams-and-visuals]] for diagram ideas.
---
## Footnotes
[^1]: Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135. "FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU HBM and GPU on-chip SRAM... 15% end-to-end wall-clock speedup on BERT-large."
[^2]: Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv:2307.08691. "Around 2× speedup compared to FlashAttention, reaching 50–73% of the theoretical maximum FLOPs/s on A100... 225 TFLOPs/s per A100 GPU."
[^3]: Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The Long-Document Transformer. arXiv:2004.05150. "An attention mechanism that scales linearly with sequence length, combining local windowed attention with task motivated global attention."
[^4]: Zaheer, M., Guruganesh, G., et al. (2020). Big Bird: Transformers for Longer Sequences. arXiv:2007.14062 (NeurIPS 2020). "A sparse attention mechanism that reduces quadratic dependency to linear... BigBird is a universal approximator of sequence functions and is Turing complete."
[^5]: Jiang, A. Q., et al. (2023). Mistral 7B. arXiv:2310.06825. "Grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost."
[^6]: Wang, S., et al. (2020). Linformer: Self-Attention with Linear Complexity. arXiv:2006.04768. "The self-attention mechanism can be approximated by a low-rank matrix... reduces overall self-attention complexity from O(n²) to O(n) in both time and space."
[^7]: Choromanski, K., et al. (2020). Rethinking Attention with Performers. arXiv:2009.14794 (ICLR 2021). "Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+)... linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness."
[^8]: Peng, B., et al. (2023). RWKV: Reinventing RNNs for the Transformer Era. arXiv:2305.13048 (EMNLP 2023). "Combines the efficient parallelizable training of transformers with the efficient inference of RNNs... scaled as large as 14 billion parameters, by far the largest dense RNN ever trained."
[^9]: Peng, B., et al. (2024). Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence. arXiv:2404.05892. "Multi-headed matrix-valued states and a dynamic recurrence mechanism that improve expressivity while maintaining the inference efficiency characteristics of RNNs."
[^10]: Peng, B., et al. (2025). RWKV-7 Goose with Expressive Dynamic State Evolution. arXiv:2503.14456. "RWKV-7 adopts Dynamic State Evolution, surpassing the fundamental limitations of the TC0 expressive power of the attention/linear attention paradigm."
[^11]: Orvieto, A., et al. (2023). Resurrecting Recurrent Neural Networks for Long Sequences. arXiv:2303.06349. "Careful design of deep RNNs using standard signal propagation arguments can recover the impressive performance of deep SSMs on long-range reasoning tasks, while also matching their training speed. We introduce an RNN block called the Linear Recurrent Unit."
[^12]: De, S., Smith, S. L., et al. (2024). Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models. arXiv:2402.19427. "Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens."