# Edge Research Findings

---

## T1: S4 Speed Claims

**Short answer: the "60×" figure is real and appears verbatim in the paper's abstract, but the draft's framing of its context is inaccurate.**

### What the S4 paper actually says

From the published abstract of Gu et al. (2022), arXiv:2111.00396 (ICLR 2022 Outstanding Paper Honorable Mention):

> "(ii) substantially closing the gap to Transformers on **image and language modeling tasks**, while performing generation **60× faster** (iii) SoTA on every task from the Long Range Arena benchmark, including solving the challenging Path-X task of length 16k that all prior work fails on"

Note that these are **two separate claims** in the abstract:

- **(ii)** The 60× generation speed-up is stated in the context of **image and language modeling tasks** (i.e., sequential CIFAR-10 image generation and language modelling)
- **(iii)** The 16,384-token (16k) sequence length is cited separately, as the length of the Path-X task in the **Long Range Arena (LRA)** benchmark

### The problem with the draft's phrasing

The pre-skeleton currently says:

> "It also ran up to 60× faster than Transformers on long autoregressive generation tasks **at the 16,000-token sequence lengths tested in the LRA benchmarks**"

This conflates two separate results. The paper does **not** state that the 60× figure applies specifically to 16k-length LRA tasks. It presents the generation speed-up in the context of image/language modelling; Path-X's 16k length is the sequence length of that LRA task, not the sequence length at which the 60× benchmark was measured.

### What to say instead

Recommended correction:

> "S4 performed autoregressive generation **60× faster** than Transformers on image and language modelling tasks, while also solving the Long Range Arena's hardest challenge, Path-X, which operates on sequences of 16,384 tokens and which every prior model, including Transformers, had failed to crack."

### Confidence and caveats

- The 60× figure is confirmed directly from the paper abstract; it is not a secondary-source claim.
- The specific sequence length(s) at which the 60× benchmark was measured are in the body of the paper (Section 4.3, "Generation Speed"), which was not fetchable as HTML. The abstract does not specify them. Based on the LM experiments in the paper (typically run at 1K–4K tokens), the 60× likely applies to those ranges, not 16k.
- **Action for final draft**: separate the 60× speed claim from the Path-X/16k claim; they are distinct achievements.

**Source**: Gu, A., Goel, K., & Ré, C. (2022). "Efficiently Modeling Long Sequences with Structured State Spaces." ICLR 2022. arXiv:2111.00396.

---

## T2: LRA Text Score Verification

**Confirmed score: 64.27%, which rounds to 64.3%.**

From `research_notes/lra-benchmarks.md` (cross-referenced to S4 paper Table 1, ICLR 2022 final version, and the google-research/long-range-arena official leaderboard):

| Model | Text |
|-------|------|
| **Standard Transformer** | **64.27%** |
| S4 | 86.82% |

- **64.3% is correct** (it is 64.27 rounded to one decimal place).
- **65.2% is incorrect**: this figure does not appear in the verified LRA table. It may come from a different model (Linear Transformer scores 65.90%, Performer scores 65.40%) or from an early or misread source.
- The averages in the notes are confirmed: Transformer avg **54.39%**, S4 avg **86.09%**; Path-X is recorded as a FAIL for the Transformer (the sketch below illustrates how a reported average depends on whether a FAIL is counted as chance-level ≈50% or excluded).

**Verdict for the draft**: use **64.27%** or **~64.3%**. Reject the 65.2% figure.
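To make the FAIL-counting point above concrete, here is a minimal, illustrative Python sketch. It is not taken from the research notes: the helper name `lra_average`, its signature, and the placeholder scores are assumptions for illustration; only the Text score (64.27) and the "FAIL ≈ chance-level 50%" convention come from the verification above.

```python
# Minimal sketch (illustrative only): how a reported LRA average depends on the
# convention used for tasks a model FAILs (e.g., Path-X for the standard Transformer).
from typing import Optional


def lra_average(scores: dict[str, Optional[float]], fail_as: Optional[float] = 50.0) -> float:
    """Average LRA accuracy; tasks scored None are FAILs.

    fail_as=50.0 -> count a FAIL as chance-level 50%
    fail_as=None -> drop FAILed tasks from the average
    """
    vals = [fail_as if v is None else v
            for v in scores.values()
            if v is not None or fail_as is not None]
    return sum(vals) / len(vals)


# Rounding check for the verified Text score discussed above:
assert round(64.27, 1) == 64.3

# Hypothetical usage: every score except Text is a dummy 50.0 placeholder.
# Substitute the verified per-task table from research_notes/lra-benchmarks.md.
transformer = {"ListOps": 50.0, "Text": 64.27, "Retrieval": 50.0,
               "Image": 50.0, "Pathfinder": 50.0, "Path-X": None}  # Path-X = FAIL
print(lra_average(transformer, fail_as=50.0))  # FAIL counted as 50%
print(lra_average(transformer, fail_as=None))  # FAIL excluded from the average
```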
---

## T3: Mamba-3 Status

**Mamba-3 exists and was published in March 2026.**

- **Paper**: "Mamba-3: Improved Sequence Modeling using State Space Principles"
- **Authors**: Lahoti, Li, Chen, Wang, Bick, Kolter, **Tri Dao**, **Albert Gu** (the original Mamba team)
- **Venue**: ICLR 2026
- **arXiv**: 2603.15569, submitted 16 March 2026

### What Mamba-3 does

Three core improvements over Mamba-2, all motivated by an inference-first perspective:

1. **More expressive recurrence** derived from SSM discretization
2. **Complex-valued state update rule** enabling richer state tracking (addresses a known Mamba-2 weakness on retrieval and state-tracking tasks)
3. **Multi-input, multi-output (MIMO) formulation** for better model quality without increasing decode latency

(For orientation, a generic, illustrative sketch of the discretized SSM recurrence these changes build on appears at the end of these notes.)

### Key results (at 1.5B scale)

- +0.6 pp average downstream accuracy vs. the next-best competitor (Gated DeltaNet)
- MIMO variant: +1.8 pp total gain over Gated DeltaNet
- Achieves comparable perplexity to Mamba-2 with **half the state size**
- Advances the performance-efficiency Pareto frontier

### Draft implications

The report can be updated to note Mamba-3 as the current (as of early 2026) state of the art in pure-SSM architectures. The key narrative remains the same: SSMs continue to close the quality gap with Transformers while retaining efficiency advantages. The hybrid approach (SSM + attention) remains the frontier at scale, but Mamba-3 shows pure SSMs are still progressing.

---

## T4: Pedagogy Gaps

The `teaching-best-practices.md` file contains several well-developed principles. The draft already uses multiple analogies well and broadly follows the intuition-first progression. Three gaps stand out:

**Gap 1: No single running example traced all the way through.** The pedagogy notes (Jay Alammar principle) stress picking *one sentence* and following it through both architectures. The draft uses many analogies but switches examples frequently. Adding a consistent "anchor sentence" (e.g., "The bank was steep") used in both the attention section and the SSM section would let readers see the contrast directly and concretely. (~40 words)

**Gap 2: Missing explicit tradeoff callout box.** The notes specifically flag: "Make tradeoffs explicit — Transformers = perfect memory, expensive; SSMs = lossy memory, cheap." The draft describes both architectures' properties but never places the core tradeoff side by side in a single referenceable callout. A summary box (e.g., a two-column "Perfect but Costly / Efficient but Lossy") would give readers a mental anchor they can carry forward. (~45 words)

**Gap 3: "State" is flagged as a dangerous word; the draft uses it without anchoring.** The pedagogy notes list "state" as a term with "wildly different connotations in CS, physics, and everyday language." The draft introduces "hidden state" in the SSM sections without a brief parenthetical like "(think: the model's rolling notes; not 'state' as in a place, but as in 'what it currently knows')." A single sentence of anchoring would prevent the word from sliding past readers. (~50 words)
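---

The sketch below, referenced from T3, is the generic discretized (diagonal) SSM recurrence that S4/Mamba-family models build on. It is illustrative only: it is **not** Mamba-3's actual update rule (which the paper describes as more expressive and complex-valued), and the function name, shapes, and parameters are assumptions for illustration. It also shows why SSM generation stays cheap: the hidden state has a fixed size, so per-token work is constant rather than growing with the history.

```python
# Illustrative sketch only: a generic diagonal SSM recurrence, not Mamba-3's update rule.
import numpy as np


def ssm_generate(x, A, B, C, delta):
    """Run a 1-D input sequence through a diagonal SSM, one token at a time.

    x:     (L,)  input sequence
    A:     (N,)  diagonal of the state matrix (negative reals for stability)
    B, C:  (N,)  input / output projections
    delta: float step size used to discretize the continuous-time system
    """
    # Discretize: exact (zero-order-hold) for A, Euler-style for B (a common simplification).
    A_bar = np.exp(delta * A)           # (N,)
    B_bar = delta * B                   # (N,)

    h = np.zeros_like(A)                # hidden state: fixed size N, regardless of L
    ys = []
    for x_t in x:                       # constant work per token: no attention over a
        h = A_bar * h + B_bar * x_t     # growing history, unlike a Transformer's KV cache
        ys.append(C @ h)
    return np.array(ys)


# Tiny usage example with made-up parameters:
L, N = 16, 4
y = ssm_generate(np.random.randn(L),
                 A=-np.arange(1, N + 1, dtype=float),
                 B=np.ones(N), C=np.ones(N), delta=0.1)
print(y.shape)  # (16,)
```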