# Long Range Arena (LRA) Benchmark — Research Notes
> Cross-reference: [[ssm-basics]], [[transformers-basics]], [[computational-complexity]], [[diagrams-and-visuals]]
> Primary sources: Gu et al. (2022), "Efficiently Modeling Long Sequences with Structured State Spaces" (arXiv:2111.00396, ICLR 2022 Outstanding Paper HM); Tay et al. (2021), "Long Range Arena: A Benchmark for Efficient Transformers" (ICLR 2021); LRA GitHub leaderboard (google-research/long-range-arena)
---
## What Is Long Range Arena?
Long Range Arena (LRA) is a systematic benchmark introduced by Tay et al. (2021) specifically designed to evaluate **how well sequence models handle long-range dependencies**. It covers six tasks spanning multiple modalities — text, maths, vision — each requiring the model to integrate information across sequences of 1K–16K tokens.
> [!NOTE]
> LRA was designed as a "stress test" for sequence models. A model that merely memorises local patterns will fail. The tasks require genuine long-range reasoning.
### The Six Tasks
| Task | Sequence Length | What It Tests |
|------|----------------|---------------|
| **ListOps** | ~2,000 | Hierarchical mathematical structure (nested operations like `[MIN 3 [MAX 5 2] 1]`) |
| **Text** | ~4,000 | Character-level sentiment classification on IMDb (no word boundaries!) |
| **Retrieval** | ~4,000 (×2) | Document pair similarity — must jointly reason over two documents |
| **Image (sCIFAR-10)** | ~1,024 | Sequential image classification — CIFAR-10 read as a 1D pixel sequence |
| **Pathfinder** | ~1,024 | Visual/spatial reasoning — connect dashed paths in a noisy image |
| **Path-X** | ~16,384 | Extreme version of Pathfinder — 16K-length sequences, the hardest task |
**Why it matters**: These tasks are hard for different reasons. Path-X requires integrating information across 16,384 steps. ListOps requires maintaining nested mathematical context. Together they probe whether a model truly understands long-range structure or is just good at local pattern-matching.
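To make the ListOps row concrete, here is a minimal, hypothetical evaluator for its nested-operator expressions (the operator set follows the original ListOps task; the tokenisation and function names below are illustrative, not the official LRA pipeline). The label of a sequence depends on resolving the innermost brackets first, which is exactly the hierarchical, long-range structure a model must track across the whole sequence.

```python
# Hypothetical, minimal ListOps evaluator (illustration only, not the official
# LRA data pipeline). The operator set MIN / MAX / MED / SM (sum mod 10)
# follows the original ListOps task; tokenisation here is simplified.

def eval_listops(tokens):
    """Evaluate a tokenised ListOps expression.

    Example: ['[MIN', '3', '[MAX', '5', '2', ']', '1', ']'] -> 1
    """
    ops = {
        "MIN": min,
        "MAX": max,
        "MED": lambda xs: sorted(xs)[len(xs) // 2],  # (upper) median
        "SM": lambda xs: sum(xs) % 10,               # sum modulo 10
    }

    def parse(i):
        tok = tokens[i]
        if tok.startswith("["):            # start of a nested operation
            op, args, i = ops[tok[1:]], [], i + 1
            while tokens[i] != "]":
                val, i = parse(i)          # recurse into sub-expressions
                args.append(val)
            return op(args), i + 1         # consume the closing ']'
        return int(tok), i + 1             # leaf: a single digit 0-9

    value, _ = parse(0)
    return value

tokens = "[MIN 3 [MAX 5 2] 1]".replace("]", " ]").split()
print(eval_listops(tokens))  # -> 1 (the label the model must predict)
```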
---
## Benchmark Results Table
All scores are accuracy (%). FAIL = the model never learned the task (it stayed at random chance, 50% for the binary Path-X task). Note that for rows containing a FAIL, the **Avg** column is the mean of the five tasks the model did learn, not all six (see the quick check after the source notes below).
### Official LRA Leaderboard + S4 Results (Verified)
| Model | ListOps | Text | Retrieval | Image | Pathfinder | Path-X | **Avg** |
|-------|---------|------|-----------|-------|------------|--------|---------|
| **Standard Transformer** | 36.37 | 64.27 | 57.46 | 42.44 | 71.40 | **FAIL** | **54.39** |
| Local Attention | 15.82 | 52.98 | 53.39 | 41.46 | 66.63 | FAIL | 46.06 |
| Linear Transformer | 16.13 | 65.90 | 53.09 | 42.34 | 75.30 | FAIL | 50.55 |
| Reformer | 37.27 | 56.10 | 53.40 | 38.07 | 68.50 | FAIL | 50.67 |
| Performer | 18.01 | 65.40 | 53.82 | 42.77 | 77.05 | FAIL | 51.41 |
| Longformer | 35.63 | 62.85 | 56.89 | 42.22 | 69.71 | FAIL | 53.46 |
| BigBird | 36.05 | 64.02 | 59.29 | 40.83 | 74.87 | FAIL | **55.01** |
| **S4** *(Gu et al., ICLR 2022)* | **59.60** | **86.82** | **90.90** | **88.65** | **94.20** | **96.35** | **86.09** |
| **S5** *(Smith et al., 2023)* | ~62.15 | ~89.31 | ~91.40 | ~88.00 | ~95.33 | **98.50** | **87.40** |
> [!IMPORTANT]
> **Correction to working estimate**: Earlier working notes estimated S4 avg as ~79.3%. The published ICLR 2022 paper reports **86.09%** average. The discrepancy likely traces to an early arXiv preprint version (October 2021) or a specific ablation variant. The 86.09% figure is the correct, peer-reviewed number.
**Source for Transformer baseline**: [google-research/long-range-arena](https://github.com/google-research/long-range-arena) official leaderboard
**Source for S4**: Gu et al. (2022), Table 1, ICLR 2022 final version
**Source for S5**: Smith et al. (2023), arXiv:2208.04933 abstract
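A quick sanity check on the **Avg** column, using only numbers already in the table: rows containing a FAIL average the five learned tasks, while S4's average spans all six.

```python
# Sanity check of the Avg column, with scores copied from the table above.
# Rows with a FAIL (Path-X unlearned) are averaged over five tasks only.
transformer = [36.37, 64.27, 57.46, 42.44, 71.40]            # Path-X: FAIL
bigbird     = [36.05, 64.02, 59.29, 40.83, 74.87]            # Path-X: FAIL
s4          = [59.60, 86.82, 90.90, 88.65, 94.20, 96.35]     # all six tasks

for name, scores in [("Transformer", transformer), ("BigBird", bigbird), ("S4", s4)]:
    print(f"{name:12s} {sum(scores) / len(scores):.2f}")
# Transformer  54.39
# BigBird      55.01
# S4           86.09
```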
---
## Key Findings
### 1. The Path-X Breakthrough
Path-X (16,384-length sequences) is binary classification: does a dashed path connect two endpoints in a noisy image? Every prior model — Transformer and all efficient Transformer variants — scored at or near **50% = random chance**. They completely failed to learn the task.
**S4 scores 96.35% on Path-X.** This is the first sequence model to solve Path-X in the published literature, and it does so by a massive margin. It is not an incremental improvement — it's a phase transition.
> [!NOTE]
> S4 solving Path-X is the equivalent, in the SSM field, of AlexNet in computer vision: a result so decisive it ended one research debate and opened a new one.
### 2. S4 Dominates Every Single Task
S4 doesn't just win on average — it achieves best-in-class on **all six tasks simultaneously**:
- **+23.23 pts** over Transformer on ListOps (hierarchical math)
- **+22.55 pts** over Transformer on Text (character-level)
- **+33.44 pts** over Transformer on Retrieval
- **+46.21 pts** over Transformer on Image
- **+22.80 pts** over Transformer on Pathfinder
- **+46.35 pts** over a chance-level Transformer on Path-X (Transformer = FAIL at ~50%)
### 3. Efficient Transformers Don't Close the Gap
Reformer, Performer, Longformer, BigBird (all designed to make attention more efficient) mostly land at or below the standard Transformer's average. The best of them, BigBird at 55.01%, edges past the vanilla baseline by less than a point. S4 is roughly **31 percentage points higher** at 86.09%.
This tells us that making attention faster doesn't address the fundamental limitation: **attention at any speed still fails to integrate information across 16K steps.**
### 4. The Successor Models (S5, and the lineage)
The S5 paper (Smith et al., 2023) pushed further:
- S5 average: **87.40%** (vs S4's 86.09%)
- S5 Path-X: **98.50%** (vs S4's 96.35%)
- S5 uses a multi-input multi-output (MIMO) SSM with efficient parallel scans
Within LRA, the S4 → S5 progression shows steadily improving long-range capability; Mamba, covered in the next finding, takes the lineage in a different direction. A toy sketch of the parallel-scan idea behind S5 follows.
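To unpack "efficient parallel scans": a linear recurrence x_k = a_k·x_{k−1} + b_k looks inherently sequential, but composing the affine maps x ↦ a·x + b is associative, so a prefix scan can evaluate it in logarithmic depth. The sketch below is a toy scalar version of that idea, not S5's actual MIMO implementation; all names and sizes are illustrative.

```python
import numpy as np

def parallel_linear_scan(a, b):
    """Hillis-Steele style inclusive scan over the affine maps x -> a*x + b.

    Returns x_k for the recurrence x_k = a[k] * x_{k-1} + b[k] with x_{-1} = 0,
    using O(log N) composition rounds instead of a length-N sequential loop.
    """
    a, b = a.astype(float), b.astype(float)
    n, shift = len(a), 1
    while shift < n:
        # Compose each element with the partial result `shift` positions to its
        # left; out-of-range positions use the identity map (a=1, b=0).
        a_prev = np.concatenate([np.ones(shift), a[:-shift]])
        b_prev = np.concatenate([np.zeros(shift), b[:-shift]])
        a, b = a * a_prev, a * b_prev + b      # old `a` is used in both updates
        shift *= 2
    return b

rng = np.random.default_rng(1)
a, b = rng.uniform(0.9, 1.0, 8), rng.normal(size=8)

# Reference: the plain sequential recurrence.
x, ref = 0.0, []
for k in range(8):
    x = a[k] * x + b[k]
    ref.append(x)

print(np.allclose(parallel_linear_scan(a, b), ref))  # True
```

Each round is a single vectorised operation, so the depth grows with log N rather than N; that logarithmic-depth recurrence is the property S5 relies on.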
### 5. Mamba and LRA
Mamba (Gu & Dao, 2023, arXiv:2312.00752) does **not directly report LRA results** in its main paper. Mamba focuses on language modeling, audio, and genomics tasks, where it outperforms same-size Transformers. Mamba's design (selective SSM) trades some of S4's long-range mathematical guarantees for better performance on discrete, language-like data.
For LRA-style tasks, S4 and its descendants (S5, S4D, Liquid S4) remain the relevant benchmarks.
---
## Why Transformers Fail Long-Range Tasks
The LRA results illuminate a fundamental architectural limitation:
### The Attention Bottleneck Under Extreme Length
For Path-X (16,384 tokens), a standard Transformer must compute a **16,384 × 16,384 attention matrix** — 268 million attention weights per layer. This is:
- **Computationally expensive**: O(N²) = 268M operations per layer just for attention
- **Statistically difficult**: The learning signal for a meaningful long-range dependency (say, token 1 → token 16,000) is spread across a softmax over 16,384 positions, so the gradient reaching any single distant token is vanishingly small
In practice, Transformers develop attention patterns that focus on local context (nearby tokens) or global aggregators (CLS tokens). Pure long-range spatial reasoning — like "does this path connect across a 128×128 grid?" — doesn't fit this inductive bias.
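To put rough numbers on the O(N²) point above, a back-of-the-envelope sketch (the per-head dimension d = 64 is an assumed value for illustration; only N = 16,384 comes from the task):

```python
import math

N, d = 16_384, 64                      # Path-X length; per-head dim (assumed)

attn_entries = N * N                   # entries in one head's attention matrix
attn_flops   = 2 * N * N * d           # ~ QK^T plus the weighted sum over V
attn_mem_gib = N * N * 4 / 2**30       # fp32 storage for that single matrix
nlogn        = N * int(math.log2(N))   # O(N log N) scaling reference

print(f"attention entries per head : {attn_entries:,}")   # 268,435,456
print(f"approx. attention FLOPs    : {attn_flops:,}")      # ~34 billion
print(f"one fp32 attention matrix  : {attn_mem_gib:.2f} GiB per head, per layer")
print(f"N * log2(N) for comparison : {nlogn:,}")           # 229,376
```

At this length the N² term is roughly a thousand times larger than N·log2(N), and a single fp32 score matrix already costs about a gigabyte per head, per layer.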
### Why SSMs Succeed
S4 encodes the sequence processing problem as a **differential equation** whose solution has provably good long-range properties. The HiPPO initialisation of the A matrix means:
1. Information from step 1 is still **mathematically present** at step 16,384 (it hasn't been gradient-vanished away)
2. The model doesn't need to "look back" — the state carries forward all relevant information in compressed form
3. The computation is O(N log N) via FFT-based convolution, efficient even at 16K lengths (a minimal sketch follows this list)
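A minimal sketch of the convolutional view behind point 3: a toy diagonal SSM (random stable decays, not S4's HiPPO/DPLR parameterisation) whose length-N impulse response is materialised once and applied with an FFT. Every size and name here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, H = 16_384, 16                          # sequence length; state size (toy)

# Toy discretised diagonal SSM:  x_k = A x_{k-1} + B u_k,   y_k = C x_k
A = np.exp(-rng.uniform(0.001, 0.1, H))    # stable decays, 0 < A_i < 1
B = rng.normal(size=H)
C = rng.normal(size=H)

# Impulse response / convolution kernel:  K[t] = C A^t B   for t = 0..N-1
K = (C * B) @ (A[:, None] ** np.arange(N))

# Apply y = K * u as a causal convolution in O(N log N) using the FFT
u = rng.normal(size=N)
L = 2 * N                                   # zero-pad so nothing wraps around
y = np.fft.irfft(np.fft.rfft(K, L) * np.fft.rfft(u, L), L)[:N]

# Cross-check the first few outputs against the step-by-step recurrence
x, y_rec = np.zeros(H), []
for k in range(64):
    x = A * x + B * u[k]
    y_rec.append(C @ x)
print(np.allclose(y[:64], y_rec))           # True
```

The cross-check confirms the FFT path matches the step-by-step recurrence; this is the same equivalence that lets S4 train in convolutional mode and run inference recurrently.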
---
## Score Progression: The LRA "Leaderboard Moment"
```
Year   Model                         LRA Avg   Path-X
──────────────────────────────────────────────────────
2020   Transformer (baseline)         54.39%   FAIL
2020   BigBird (best "eff. Xfmr")     55.01%   FAIL
2020   All efficient Transformers    ≤55.01%   FAIL
...
2021   S4 (Gu et al.)                 86.09%   96.35%   ← The Moment
2022   S5 (Smith et al.)              87.40%   98.50%
```
The jump from 55% to 86% happened in a single paper. No gradual climb — a sudden discontinuity.
---
## Caveats and Nuances
1. **LRA is not the whole story.** Mamba beats Transformers on language modeling (where Transformers are very strong) without using LRA-optimised components. These are different skill sets.
2. **Parameter counts matter.** The LRA comparison should ideally be parameter-matched. S4 in the "apples-to-apples" setting uses similar parameter budgets to the Transformer baselines.
3. **Path-X is binary classification.** 96.35% accuracy on a binary task means S4 is near-perfect. But the task itself, while visually intuitive, is quite narrow.
4. **S4 requires careful initialisation.** The HiPPO A matrix and S4's training procedure are non-trivial. The model's LRA performance depends heavily on this initialisation — simple SSMs without HiPPO don't achieve these results.
5. **Language modeling results diverge from LRA results.** S4 is SotA on LRA but not on GPT-style language benchmarks. Mamba improves language benchmarks significantly by adding selectivity. These are complementary, not contradictory.
---
## Quotes Worth Using in the Report
> *"SoTA on every task from the Long Range Arena benchmark, including solving the challenging Path-X task of length 16k that all prior work fails on."*
> — Gu et al. (2022), S4 paper abstract
> *"S5 averages 87.4% on the long range arena benchmark, and 98.5% on the most difficult Path-X task."*
> — Smith et al. (2023), S5 paper abstract
> *"On the Long Range Arena (LRA) benchmark for long-range sequence modeling, S4 sets a clear SotA on every task while being at least as computationally efficient as all competitors. It is the first sequence model to solve the Path-X task involving sequences of length 16384."*
> — HazyResearch blog post, January 2022
---
## See Also
- [[diagrams-and-visuals]] — LRA bar chart visual (Diagram 16)
- [[ssm-basics]] — the HiPPO and S4 mechanisms that enable these results
- [[computational-complexity]] — why O(N²) fails at 16K length
- [[historical-narrative]] — where LRA fits in the timeline of the field