# The Historical Narrative: From RNNs to Transformers to SSMs
> [[index]] · [[transformers-basics]] · [[ssm-basics]] · [[sequence-processing-comparison]]
## The Three Ages of Sequence Modeling
### Age 1: The Sequential Machines (1980s–2016)
The original approach to teaching machines to read was deeply human: **read left to right, remember what you've seen**.
Recurrent Neural Networks (RNNs) did exactly this. They processed tokens one by one, passing a "hidden state" — a vector of numbers encoding "what I've learned so far" — from position to position. At each step, they'd update this state with the new input.
> [!NOTE] The RNN Mental Model
> Imagine reading a book while only keeping one notecard of summary notes. After every sentence, you must fit everything you know onto that one notecard. The problem: older information gets overwritten by newer information.
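To make the notecard concrete, here is a minimal vanilla-RNN forward pass (an illustrative numpy sketch; the weight names are placeholders, not any particular library's API). At every step the single hidden vector `h` is rewritten from the new token plus the old state:

```python
import numpy as np

def rnn_forward(tokens, W_xh, W_hh, b_h):
    """Vanilla RNN over a list of token vectors: one 'notecard' h, rewritten each step."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in tokens:                              # strictly one token at a time
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)    # new h mixes the input into (and overwrites) the old h
        states.append(h)
    return states
```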
**The critical failure**: vanilla RNNs couldn't handle *long-term dependencies*. By the time the model needed to produce "French" in "I grew up in France... I speak fluent ___", the information about France had been overwritten by everything in between.[^1]
**LSTMs (1997) partly solved this**[^2]: Long Short-Term Memory networks added a "cell state" — a dedicated highway for important information that could be maintained across long distances. Think of a "conveyor belt" running through the sequence, with gates controlling what gets added, what gets preserved, and what gets discarded.
#### The LSTM Gates
- **Forget gate**: "Should I discard what I knew before?"
- **Input gate**: "Should I add this new information?"
- **Output gate**: "What should I output right now?"
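In code, one LSTM step tying the three gates to the cell-state conveyor belt looks roughly like this (a minimal numpy sketch; the per-gate parameter dictionaries and their key names are just a convention for this example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b are dicts of per-gate weights keyed 'f', 'i', 'o', 'g'."""
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate: keep or discard old cell contents
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate: admit new information
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate: what to reveal right now
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # candidate new content
    c = f * c_prev + i * g        # the "conveyor belt": cell state carried across the sequence
    h = o * np.tanh(c)            # hidden state exposed at this position
    return h, c
```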
This worked well enough for medium-range dependencies. But LSTMs had a killer weakness: **the sequence must still be processed step by step**. Token 1 → Token 2 → Token 3. You cannot parallelize across the sequence. In the GPU age (when parallel processing is everything), this was catastrophic for scaling.
By 2016, RNNs/LSTMs were hitting a wall:
- Training was slow (sequential by design)
- Long-range dependencies still caused degradation
- The field was desperate for something better
---
### Age 2: The Attention Revolution (2017–2022)
In June 2017, a team at Google published "Attention is All You Need"[^3]. The title was a declaration. The proposal was audacious: throw out the sequential processing entirely.
**The key insight**: instead of passing information through a chain of sequential steps, let every position directly *attend* to every other position. No chain. No bottleneck. Direct connections.
```
RNN (Sequential)                          Transformer (Parallel)

Token 1 → h1 → Token 2 → h2 → Token 3     Token 1 ←→ Token 2
                                          Token 1 ←→ Token 3
                                          Token 2 ←→ Token 3
                                          (all pairs at once)
```
> [!TIP] The Cocktail Party Insight
> At a loud party, you can focus on one voice by *paying attention* to its pitch, cadence, and position. You don't need to process every conversation sequentially. Transformers do the same thing with text — they attend to relevant parts of the sequence directly, in parallel.
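A minimal single-head self-attention in numpy shows the shift (weight names are placeholders; the causal mask is omitted for brevity): the whole length-L sequence is handled with a few matrix multiplies instead of a step-by-step loop.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """X: (L, d) token vectors. Returns one context vector per position, all at once."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # (L, L): every position scores every other
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over each row
    return weights @ V                                  # each output is a weighted mix of all values
```

Note the (L, L) score matrix: that is exactly the quadratic cost discussed below.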
**What this enabled**:
- Parallelization: train on all positions simultaneously → huge GPU utilization
- Direct long-range dependencies: token 1 can directly attend to token 10,000
- Scale: with efficient hardware utilization, you could train on internet-scale data
The result? GPT-1 (2018), BERT (2018), GPT-3 (2020), ChatGPT (2022). The Transformer era.
#### The Cost: Quadratic Attention
But attention between N tokens requires N² comparisons. Every token must compare against every other. For 1,000 tokens: 1,000,000 comparisons. For 10,000 tokens: 100,000,000 comparisons.
| Sequence length (N) | Pairwise comparisons (N², ∝ memory & time) |
|---------------------|--------------------------------------------|
| 100                 | 10,000                                     |
| 1,000               | 1,000,000                                  |
| 10,000              | 100,000,000                                |
| 100,000             | 10,000,000,000                             |
By 2022, researchers were hitting the "quadratic wall":
- GPT-3's 2,048-token context felt tiny
- Processing long documents required chunking
- Real-time streaming was prohibitively expensive
- Mobile/edge deployment was impossible at scale
**The field needed something that had the quality of Transformers without the quadratic scaling.**
---
### Age 3: The Return of State — SSMs (2021–present)
The solution came from an unlikely place: control theory, the 60-year-old mathematical framework for modeling physical systems like aircraft autopilots and signal filters.
**The key insight**: what if we could design a "state" that *intelligently compresses* a sequence, rather than naively trying to remember everything?
**HiPPO (2020)**[^4]: Albert Gu et al. discovered that if you design the state matrix A using specific polynomial mathematics, the model naturally remembers recent events more precisely than distant ones — just like humans do. This wasn't a hack; it was derived from first principles of approximation theory.
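As a concrete taste of that math, the HiPPO-LegS state matrix (in the form written in the S4 paper) can be built in a few lines. This sketch only constructs the matrix; it is not the full HiPPO machinery:

```python
import numpy as np

def hippo_legs_A(N: int) -> np.ndarray:
    """HiPPO-LegS state matrix A (as given in the S4 paper): lower triangular,
    derived from Legendre polynomial projections rather than learned from scratch."""
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n, k] = np.sqrt(2 * n + 1) * np.sqrt(2 * k + 1)
            elif n == k:
                A[n, k] = n + 1
    return -A   # the continuous-time SSM x'(t) = A x(t) + B u(t) uses this negated matrix
```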
**S4 (2021)**[^5]: Building on HiPPO, S4 showed that an SSM with a special "structured" matrix could be computed as a *convolution* during training (fast, parallel) while still running as a recurrence during inference (constant memory). One model, two modes. Revolutionary.
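The "one model, two modes" idea can be seen in a toy discretized SSM: the recurrence and the convolution below produce the same outputs, so training can use the parallel convolutional form while inference uses the constant-memory recurrent form. (This naive sketch materializes the kernel with matrix powers; S4's actual contribution is computing that kernel efficiently for its structured A.)

```python
import numpy as np

def ssm_recurrent(Ab, Bb, C, u):
    """Step-by-step recurrence: x_k = Ab x_{k-1} + Bb u_k,  y_k = C x_k."""
    x = np.zeros(Ab.shape[0])
    ys = []
    for u_k in u:                  # u: 1-D array of scalar inputs
        x = Ab @ x + Bb * u_k
        ys.append(C @ x)
    return np.array(ys)

def ssm_convolutional(Ab, Bb, C, u):
    """Same outputs via one long convolution with kernel K_i = C Ab^i Bb."""
    L = len(u)
    K = np.array([C @ np.linalg.matrix_power(Ab, i) @ Bb for i in range(L)])
    return np.array([np.dot(K[:k + 1][::-1], u[:k + 1]) for k in range(L)])
```

For any choice of (Ab, Bb, C), the two functions agree up to floating-point error.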
**Mamba (2023)**[^6]: The final breakthrough. Gu and Dao made the state matrices *input-dependent* — the model learns to decide what to remember and what to forget based on what it's currently reading. This "selectivity" was what prior SSMs lacked, and it finally brought SSM quality to par with Transformers on language tasks.
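The gist of selectivity, in a deliberately simplified single-channel sketch (diagonal A; the projection names `W_B`, `W_C`, `w_delta` are hypothetical; real Mamba runs many channels through a hardware-aware parallel scan): the quantities that decide what enters and leaves the state are recomputed from every input.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_scan_toy(h, A, W_B, W_C, w_delta, channel=0):
    """h: (L, D) input vectors. One SSM channel driven by h[:, channel].
    B_t, C_t and delta_t are functions of the current input, so the model can
    choose, token by token, what to write into its state and what to ignore."""
    x = np.zeros(A.shape[0])        # A: (N,) diagonal of negative values
    ys = []
    for h_t in h:
        delta_t = softplus(w_delta @ h_t)      # input-dependent step size (how strongly to update)
        B_t = W_B @ h_t                        # input-dependent "write" vector, (N,)
        C_t = W_C @ h_t                        # input-dependent "read" vector, (N,)
        A_bar = np.exp(delta_t * A)            # discretize the diagonal A
        x = A_bar * x + delta_t * B_t * h_t[channel]   # selective state update for this channel
        ys.append(C_t @ x)
    return np.array(ys)
```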
> [!NOTE] The Three Ages in One Table
>
> | Era | Model | Strength | Weakness |
> |-----|-------|---------|---------|
> | 1980s–2016 | RNN/LSTM | Sequential state | Slow training, long-range failure |
> | 2017–2022 | Transformer | Parallel, exact recall | Quadratic scaling |
> | 2021–now | SSM/Mamba | Efficient, streaming | Weaker associative recall |
---
## Why the 2022–2024 Race Happened
By 2023, GPT-4 was clearly a Transformer, and it worked wonderfully, at enormous expense. But:
- Processing a 100-page document: required chunking and approximate handling
- Genomics/biology applications: DNA sequences of millions of bases were inaccessible
- Real-time streaming: KV cache grew without bound
- Mobile/edge AI: couldn't fit in a phone's memory
The combination of HiPPO's math, S4's training trick, and Mamba's selectivity offered a new path. And it arrived just as the field had matured enough to properly evaluate it.
By 2024, the best SSMs (Mamba-2, Falcon Mamba 7B) were approaching Transformer quality at 7B+ parameters while maintaining their efficiency advantages. The Jamba and Griffin hybrids suggested a synthesis: why not use SSMs for most of the processing and add attention layers only where exact recall is needed?
---
## The Timeline at a Glance
```
1986 Backprop + RNNs — sequential processing begins
1997 LSTMs — cell state "conveyor belt" for long range
2014 seq2seq + attention — first use of attention for alignment
2017 Transformer — "Attention is All You Need"
2018 BERT, GPT-1 — bidirectional and autoregressive Transformers
2020 GPT-3 — large-scale Transformers show emergent capabilities
2020 HiPPO — principled polynomial state compression
2021 S4 — structured SSMs, dual conv/recurrent modes, LRA breakthrough
2022 ChatGPT — Transformer dominance confirmed
2023 Mamba — selective SSMs, O(1) inference, competitive with Transformers
2024 Jamba, Griffin — hybrids; Mamba-2 SSD; Falcon Mamba 7B
2025 SSMs closing quality gap; hybrid models emerging as best-of-both
```
[^1]: Olah (2015). "Understanding LSTMs." http://colah.github.io/posts/2015-08-Understanding-LSTMs/
[^2]: Hochreiter & Schmidhuber (1997). "Long Short-Term Memory." Neural Computation.
[^3]: Vaswani et al. (2017). "Attention is All You Need." NeurIPS 2017. arXiv:1706.03762.
[^4]: Gu et al. (2020). "HiPPO: Recurrent Memory with Optimal Polynomial Projections." NeurIPS 2020. arXiv:2008.07669.
[^5]: Gu et al. (2021). "Efficiently Modeling Long Sequences with Structured State Spaces." ICLR 2022. arXiv:2111.00396.
[^6]: Gu & Dao (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752.