# Teaching Best Practices — Explaining Transformers & SSMs to Laypeople

> **Cross-links:** [[analogies-and-intuitions]] · [[diagrams-and-visuals]] · [[transformers-basics]] · [[ssm-basics]] · [[computational-complexity]]
> **Related:** [[anti-patterns]]

---

## What Makes a Great ML Explainer?

The best ML explainers in the world share a specific grammar: *intuition first, formalism never (or last)*. Below are profiles of four of the most influential pedagogical styles in the field, followed by the principles they share.

---

### 3Blue1Brown — Visual Intuition Before Formalism

Grant Sanderson (3Blue1Brown) sets what is perhaps the gold standard for mathematical visual explanation. His neural network series[^1] has been watched by tens of millions of people, most of whom have no formal math background.

**What he does:**

- Opens with a *concrete, surprising phenomenon* — "a network can recognize handwritten digits" — before introducing any mechanism
- Uses animation to show *structure emerging* rather than presenting it statically
- Treats equations as *compressed summaries* of visualizations, not as starting points
- Explicitly names the gap between formula and intuition: "The formula looks intimidating, but here is what it actually means..."
- Builds *one concept at a time*, testing it against a minimal example before layering on complexity

**Key pedagogical moves:**

- The "zooming out" shot: show the big picture, then zoom into one component
- Dual representation: show a diagram and its algebraic equivalent side by side
- Color encoding: each concept gets a persistent color throughout the video
- Rhetorical questions: "Why would you even want to do this?" before every new idea

**Why it works:** Viewers build a *mental model before labels are attached to it*. This means when the label arrives, it slots into an existing structure rather than floating free.

---

### Jay Alammar — Step-by-Step Traced Visualization

Jay Alammar's "The Illustrated Transformer"[^2] is probably the single most-linked explanation of transformers on the internet. It has been translated into 15+ languages and is cited in ML courses worldwide.

**What he does:**

- Picks *one concrete input* ("The animal didn't cross the street because it was too tired") and traces it through every layer
- Shows *actual dimensions*: "this vector has 512 numbers"
- Presents the *same information in multiple formats*: color heat-map, arrow diagram, grid of numbers
- Uses animated GIFs for things that are inherently sequential
- Introduces each sub-mechanism *in isolation* with its own concrete worked example before assembling the full system

**Key pedagogical moves:**

- The "zoom level" toggle: whole model → encoder stack → single encoder → attention heads → one attention head
- Color-coded attention weights: visible relevance connections drawn as lines
- "Let's set these aside for now" — temporarily ignores complexity to focus on core mechanism
- Explicit labeling: "this is the Query vector," "this is the Key vector," with arrows pointing to physical locations in a diagram

**Why it works:** Concrete grounding at every scale means a reader can pause at any level of abstraction and understand what they see. Nothing is assumed to be "obvious."[^3]

---

### Andrej Karpathy — Build It Yourself

Karpathy's pedagogical style[^4] is radically hands-on: *the code is the explanation*. His "makemore" and "nanoGPT" tutorials walk students through building a character-level language model from nothing.

**What he does:**

- Grounds everything in runnable Python — "let's just write it"
- Uses *experiments as demonstrations*: runs the model on Shakespeare, shows it generating plausible-looking but nonsensical text
- Explicitly says what surprised him when he was learning
- Narrates his *debugging process*, not just the clean solution
- Connects each mathematical operation to its code equivalent

**Key pedagogical moves:**

- The "surprised scientist" persona: "I was shocked this actually works"
- Code before math: write `W @ x` before writing the matrix equation
- Incremental complexity: start with a bigram model, layer on MLP, then attention (see the sketch below)
- Intentional mistakes: write a "bad" version first, show why it fails, fix it

**Why it works:** Learners build *causal understanding* — they see that changing one line changes behavior. Formulas become descriptions of things they already built.
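To make "the code is the explanation" concrete, here is a minimal sketch in the spirit of the bigram model that makemore opens with; a toy reconstruction under our own naming, not Karpathy's actual code:

```python
import random

# Count how often each character follows each other character.
names = ["emma", "olivia", "ava", "isabella", "sophia"]   # toy corpus
counts = {}
for name in names:
    chars = ["<"] + list(name) + [">"]                    # start/end markers
    for a, b in zip(chars, chars[1:]):
        counts.setdefault(a, {})
        counts[a][b] = counts[a].get(b, 0) + 1

# Generate a new "name" by repeatedly sampling the next character
# in proportion to how often it followed the current one.
def sample(seed):
    rng = random.Random(seed)
    out, ch = [], "<"
    while True:
        nxt = counts[ch]
        ch = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        if ch == ">":
            return "".join(out)
        out.append(ch)

print(sample(0))   # something name-shaped: plausible-looking, but invented
```

Once this runs, "language model" stops being an abstraction: it is a table of counts the learner built themselves, and every later model (MLP, then attention) arrives as a smarter replacement for that table.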
---

### Distill.pub — Interactive Explorable Explanations

Distill.pub[^5] (2016–2021) pioneered "explorable explanations" in ML research: articles where the reader can manipulate parameters and see outputs change in real-time.

**What they did:**

- Hover over a word → see which other words it attends to (attention visualization)
- Drag a slider → watch loss landscapes morph
- Click on a neuron → see which inputs activate it most
- Animations that respond to scroll position (parallax explanation)

**Key pedagogical principles:**

- *Discoverability*: readers find truths themselves rather than being told them
- *Minimal toy examples*: use the simplest possible case that exhibits the phenomenon
- *Remove irrelevant complexity*: a visualization about attention doesn't need to show the full transformer stack

**Why it works:** Active manipulation creates *episodic memory* — "I dragged that slider and the pattern changed." This is far more memorable than being told the same information.

---

### Shared Pedagogical Principles

Across all four styles, the same principles emerge:

| Principle | 3B1B | Alammar | Karpathy | Distill |
|-----------|------|---------|----------|---------|
| Intuition before formalism | ✅ | ✅ | ✅ | ✅ |
| Concrete example before abstraction | ✅ | ✅ | ✅ | ✅ |
| Multiple representations of same concept | ✅ | ✅ | ⚠️ | ✅ |
| Learner builds the mental model | ✅ | ⚠️ | ✅ | ✅ |
| Explicit naming of confusion points | ✅ | ✅ | ✅ | ⚠️ |
| One new concept at a time | ✅ | ✅ | ✅ | ✅ |
| Connects to prior knowledge | ✅ | ✅ | ⚠️ | ✅ |

**The deepest principle:** The best explainers know that *understanding is not transmission*. You cannot pour knowledge into a reader's head. You can only create conditions where their brain constructs the right model. Everything else is decoration.

---

## Learning Progressions

### What Prerequisite Knowledge Helps?

**Genuinely useful prior knowledge:**

- The idea of a *score* or *similarity* between two things (you don't need cosine similarity, just the concept — see the sketch after this list)
- The idea that words have *context-dependent meaning* (intuitive for any native speaker)
- The experience of using a *search engine* (maps directly to query/key attention)
- Basic familiarity with sequences (reading a sentence left-to-right)
- The concept of *compression* (summarizing a long meeting into bullet points)
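The first item on that list is small enough to show directly. A minimal sketch of a similarity score, with feature values invented for illustration:

```python
# A similarity "score": multiply matching features, add them up.
# Higher score means the two things share more features. (Toy numbers.)
cat = [0.9, 0.1, 0.8]   # features: furry, metallic, alive
car = [0.0, 0.9, 0.0]
dog = [0.8, 0.0, 0.9]

def score(a, b):
    return sum(x * y for x, y in zip(a, b))

print(score(cat, dog))  # 1.44: many shared features, high score
print(score(cat, car))  # 0.09: few shared features, low score
```

This is the entire mathematical prerequisite for attention: the "how relevant is word A to word B?" score in the teaching order below is exactly this kind of number.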
**What can safely be skipped:**

- Backpropagation / gradients
- Embeddings and vector spaces (can be introduced as a black box: "words become numbers")
- Recurrent neural networks (you can skip straight to SSMs)
- Convolutional neural networks
- The specific mathematics of softmax, layer normalization, etc.
- The original "Attention is All You Need" paper structure

**What sounds simple but causes problems:**

- The word "attention" — people assume it means something cognitive/conscious
- The word "state" — has wildly different connotations in CS, physics, and everyday language
- "Parameters" vs "hyperparameters" — an unnecessary distinction for laypeople
- "Training" vs "inference" — important but often introduced too early

---

### Recommended Teaching Order

For a complete layperson, build understanding in this sequence (steps 3–8 are sketched in code after the list):

1. **The problem**: Why do some words' meanings depend on other words? ("Bank" = river or money depending on context)
2. **The idea of relevance**: When reading a word, some earlier words are more useful than others
3. **Query-key matching**: Introduce the idea that we can compute "how relevant is word A to word B?" as a score
4. **Weighted averaging**: Show that we can blend information from multiple words based on relevance scores
5. **Self-attention**: Every word simultaneously asks: "which other words should I pay attention to?"
6. **Why this is slow**: Doing this for all pairs → scales quadratically (the handshake problem)
7. **The SSM idea**: What if instead of looking at everything at once, we maintained a running *summary*?
8. **State compression**: The hidden state is a lossy summary — like a zip file of everything so far
9. **The tradeoff**: Transformers = perfect memory, expensive; SSMs = lossy memory, cheap
10. **Mamba/selective SSMs**: What if the compression is smart? Remembering only what matters

This progression respects two key constraints:

- Each step builds on the previous one (no orphaned concepts)
- Complexity of mechanism is introduced *after* motivation for that mechanism
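For readers who code, steps 3–8 fit in a dozen lines. A hedged sketch: no learned projections, no scaling, no real SSM dynamics; just the skeleton of "score everything" versus "keep a running summary," with random vectors standing in for words:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                       # 6 words, 4 numbers per word (toy sizes)
words = rng.normal(size=(n, d))   # stand-ins for word vectors

# Steps 3-5: every word scores every other word, the scores become
# weights that sum to 1, and each word is re-blended by those weights.
scores = words @ words.T                                        # n x n relevance scores
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
blended = weights @ words                                       # context-mixed words

# Step 6: the all-pairs score table is the quadratic cost.
print(scores.shape)               # (6, 6), grows as n x n

# Steps 7-8: an SSM keeps one lossy running summary instead.
decay = 0.9                       # how quickly old information fades
state = np.zeros(d)
for w in words:                   # one fixed-size update per word
    state = decay * state + (1 - decay) * w
print(state.shape)                # (4,), fixed size however long the input
```

The contrast is the lesson of steps 6–9 in miniature: the attention table grows with the square of the sequence length, while the state never grows at all; it just forgets.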
---

### The Curse of Knowledge

The "curse of knowledge"[^6] is a cognitive bias where knowing something makes it harder to imagine not knowing it. It is rampant in ML explanations.

**How it manifests in transformer explanations:**

- An expert writes "we compute Q, K, V matrices" assuming the reader knows what a matrix is, what "compute" means in this context, and why we need three of them
- An expert says "the attention is scaled by √d_k to prevent vanishing gradients" — five concepts a novice has never encountered
- A tutorial says "simply apply a linear transformation" — "simply" is the tell; nothing is simple to someone who hasn't seen it before
- An expert introduces multi-head attention before single-head attention has been internalized

**Diagnostic question:** Can you explain the core concept using *only words a 12-year-old knows*? If not, you're probably under the curse.

**The antidote:**

- Use the *Feynman Technique*: explain it to an imaginary child; wherever you reach for jargon, that's a gap
- Have a non-technical person read a draft and mark every word they don't understand
- Count the number of new concepts introduced per paragraph; one is good, three is probably too many

---

## Anti-Patterns (What NOT To Do)

### 1. Starting With the Math

The most common error in transformer tutorials: opening with the attention formula.

$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

**Why it fails:** A reader who doesn't know what Q, K, V are yet will experience this as symbol soup. They'll lose confidence before any understanding forms. The formula is the *compressed result* of an idea, not a starting point.

**Real example of this pattern:** Most Wikipedia articles on transformers and attention lead with notation.[^7] The Wikipedia article on "Attention (machine learning)" opens with a definition involving probability distributions and weighted sums before any motivating example.

**The fix:** Lead with *why*, then *what*, then (if at all) *how mathematically*.

---

### 2. Explaining by Listing Components

An explanation that enumerates components without describing mechanism creates an *illusion of understanding*.

**Real example (from a popular ML blog):**

> "The Transformer consists of an encoder and a decoder. The encoder has a multi-head attention layer, a feed-forward network, and layer normalization with residual connections. The decoder has two attention layers..."

This is a *parts list*, not an explanation. Reading it leaves you knowing what's in the box but not how the box works or why anyone would want it.

**Why it persists:** Writing a parts list is easy. Explaining mechanism requires understanding it deeply. Many tutorial writers list components because that's what they found in the original paper.

**The fix:** For each component, ask: *"What problem does this solve? What would go wrong without it?"*

---

### 3. Jargon Without Anchoring

**Examples from Stack Overflow and forums:**[^8]

> "Transformers use positional encodings added to the token embeddings to preserve sequential information since attention is permutation-invariant."

Every phrase here is opaque to a newcomer: positional encodings, token embeddings, sequential information, permutation-invariant. The sentence is correct but teaches nothing without prior context.

> "The key insight of self-attention is that it allows the model to attend to itself."

"Attend to itself" is not an explanation. It restates the term being explained.

> "SSMs compress the sequence into a hidden state vector through a recurrent transition."

"Hidden state vector" and "recurrent transition" are jargon. A layperson reads this and nods, having learned nothing.

**The fix:** When introducing any technical term for the first time, anchor it immediately with either (a) an analogy or (b) a concrete example. Never rely on the term to be self-explanatory.
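That fix can even be a runnable anchor. Take "permutation-invariant" from the first quote: a toy sketch (ours, not from the quoted forums) shows that self-attention without positional information cannot tell word order apart. Shuffle the input, and the outputs simply shuffle along:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                            # 5 tokens, 8 numbers each (toy sizes)
X = rng.normal(size=(n, d))            # embeddings with NO positional info
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V

perm = rng.permutation(n)              # read the "sentence" in shuffled order
out = self_attention(X)
out_shuffled = self_attention(X[perm])

# Each token gets the same answer either way; only its position moved.
print(np.allclose(out_shuffled, out[perm]))   # True
```

A dozen lines anchor the jargon: the mechanism genuinely cannot see order, which is exactly why positional encodings must be added.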
---

### 4. Not Connecting to Prior Knowledge

A common tutorial structure:

1. Here's the new architecture
2. Here's how it works
3. Here's what it's good at

Missing: *Why should I care? What problem does this solve? What was the world like before it?*

**Real complaint from a Reddit thread on ML explanations:**[^9]

> "I've read 10 articles about transformers and I understand the mechanics, but I still don't understand WHY attention works. Why does it help to weight information? What was wrong before?"

The answer (RNNs forget long sequences; attention lets you look back directly) is the whole point, but many tutorials bury it or omit it entirely.

**The fix:** Begin with the *problem in the learner's language*. "Imagine you're summarizing a book, but you can only remember the last sentence you read. That's the problem attention was invented to solve."

---

### 5. The Premature Generalization

**Pattern:**

> "Attention mechanisms can be applied to any sequence-to-sequence problem, including NLP, computer vision, protein folding, music generation..."

This is technically true but cognitively harmful for beginners. Introducing transformer applications before the mechanism is understood dilutes focus and creates cognitive overload.

**Why it happens:** Writers want to demonstrate scope and importance upfront. The result is that beginners get excited about applications but have no framework to understand them.

**The fix:** Pick *one* application (usually language translation or next-word prediction) and stick with it for the entire explanation. Breadth can wait until the core is solid.

---

## Best Explanation Resources Online

### Top 5 Layperson Explanations of Transformers & SSMs

| Resource | Author | What Makes It Good |
|----------|--------|--------------------|
| **The Illustrated Transformer**[^2] | Jay Alammar | Color-coded diagrams, traces one sentence through the entire model, multiple zoom levels, animated GIFs for sequential operations |
| **Attention? Attention!** (Lilian Weng's blog)[^10] | Lilian Weng | Excellent historical narrative — shows *why* attention was invented by first showing what RNNs failed at; connects mechanism to biological visual attention |
| **What is ChatGPT Doing and Why Does It Work?**[^11] | Stephen Wolfram | Written for a totally non-technical audience, extremely patient pacing, connects to familiar search/autocomplete experience |
| **Visual Guide to Mamba and State Space Models**[^12] | Maarten Grootendorst | Best SSM explanation; uses maze analogy for state space; shows transformer flaw → RNN flaw → SSM solution as a progression |
| **Mamba: The Easy Way**[^13] | Jack Cook | Excellent for SSMs specifically; explains S4 and Mamba with clear diagrams; good on the RNN↔CNN duality of SSMs |

### Top 5 for SSMs and Mamba

| Resource | Author | Highlight |
|----------|--------|-----------|
| Visual Guide to Mamba (above)[^12] | Maarten Grootendorst | Best intro for laypeople |
| Get on the SSM Train[^14] | Loïck Bourdois | Comprehensive coverage of the S4, S5, Mamba progression |
| Mamba: The Easy Way[^13] | Jack Cook | Clean explanation of the recurrence ↔ convolution duality |
| Annotated Mamba (Hard Way)[^15] | Sasha Rush | Not for laypeople, but the gold standard for technical depth |
| Colah's Understanding LSTMs[^16] | Chris Olah | Essential for understanding why SSMs improve on RNNs; the conveyor belt analogy for cell state is canonical |

### What All the Best Resources Have in Common

1. **They start with a problem.** The problem is stated in plain language before any solution is offered.
2. **They use a consistent running example.** One sentence, one scenario, traced all the way through.
3. **They use color.** Consistently colored diagrams where color = concept.
4. **They move from concrete to abstract.** Specific numbers → general pattern → formal notation.
5. **They acknowledge complexity without hiding behind it.** "This looks scary but here's all you need..."
6. **They make tradeoffs explicit.** Transformers are powerful but expensive. SSMs are efficient but lossy. The best explanations make this concrete (see the arithmetic sketch below).
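Point 6 can be made concrete with nothing but arithmetic. A back-of-envelope sketch, counting only the dominant term and ignoring constant factors:

```python
# "Compare every pair" (attention) vs "one update per token" (SSM),
# as the sequence grows. Constant factors deliberately ignored.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens: attention ~{n * n:>14,} ops | ssm ~{n:>7,} ops")
```

At 100,000 tokens the two differ by five orders of magnitude: the whole powerful-but-expensive versus efficient-but-lossy tradeoff in one line of output.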
---

## Visual Learning

### Diagrams That Consistently Appear in Good Explanations

| Diagram | What It Shows | Why It Works |
|---------|--------------|--------------|
| **Attention heatmap** | Which words attend to which (color intensity = attention weight) | Makes an abstract mathematical operation concretely visible |
| **Encoder-decoder flow diagram** | Data flow from input to output | Establishes the big picture before zooming in |
| **Single attention head step-by-step** | Q, K, V creation → dot products → softmax → weighted sum | The mechanism as a sequence of simple operations |
| **Unrolled RNN** | Same network repeated across time steps | Makes recurrence visual and destroys the "magical loop" illusion |
| **SSM state transition** | Hidden state updated by each input token | Shows the compression/forgetting dynamic |
| **O(n²) vs O(n) comparison** | Attention matrix (full grid) vs SSM (linear chain) | Makes computational complexity viscerally apparent |
| **Context window visualization** | Fixed-width sliding window of tokens | Concretizes the concept of limited memory |

### Most Frequently Appearing Analogies

(See [[analogies-and-intuitions]] for full development)

1. **Cocktail party** — attention as selective listening
2. **Search engine (Query/Key/Value)** — most accurate to mechanism
3. **Bank (river vs money)** — self-attention disambiguates meaning
4. **Conveyor belt** — LSTM cell state / SSM hidden state
5. **Working memory vs long-term memory** — context window limit
6. **Handshake problem** — O(n²) complexity

### Minimal Viable Diagram for Attention

The simplest diagram that conveys the attention mechanism:

```
Input words:  [The] [animal] [didn't] [cross] [street] [it]
                                                         ↑
                                         "it" asks: "who am I?"

Attention weights for "it":
    The:     0.02  ░░
    animal:  0.48  ████████████
    didn't:  0.04  █
    cross:   0.05  █
    street:  0.10  ██
    it:      0.31  ███████

→ "it" = mostly "animal" + a bit of itself
```

This diagram needs no math. It shows:

- That each word produces a distribution over all other words
- That the distribution is *content-driven* (not positional)
- The result: a context-enriched representation

A heat-map version of this (with darker blue = more attention) is even better visually and is the signature diagram of Alammar's Illustrated Transformer.
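A diagram this simple is also cheap to regenerate for any sentence. A sketch that prints the bars above; the weights are hand-picked for illustration, not taken from a trained model:

```python
# Hand-picked, illustrative attention weights for the word "it".
weights = {"The": 0.02, "animal": 0.48, "didn't": 0.04,
           "cross": 0.05, "street": 0.10, "it": 0.31}

print('Attention weights for "it":')
for word, w in weights.items():
    bar = "█" * max(1, int(w * 25))    # scale to bar length, minimum one block
    print(f"  {word:>7}: {w:.2f}  {bar}")
```

The same loop, fed real weights from a trained model, produces the data behind the heat-map version.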
---

## Sources

[^1]: 3Blue1Brown, "Neural Networks" series. https://www.3blue1brown.com/topics/neural-networks
[^2]: Jay Alammar, "The Illustrated Transformer" (2018). https://jalammar.github.io/illustrated-transformer/
[^3]: Jay Alammar, "Visualizing Neural Machine Translation / Seq2Seq with Attention" (2018). https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
[^4]: Andrej Karpathy, "The Unreasonable Effectiveness of Recurrent Neural Networks" (2015). https://karpathy.github.io/2015/05/21/rnn-effectiveness/
[^5]: Distill.pub, "Memorization in RNNs" and the interactive format generally. https://distill.pub/2019/memorization-in-rnns/
[^6]: The "curse of knowledge" was named by Camerer et al. (1989) and popularized by Chip & Dan Heath in "Made to Stick" (2007).
[^7]: Wikipedia, "Attention (machine learning)." https://en.wikipedia.org/wiki/Attention_(machine_learning)
[^8]: Examples drawn from Stack Overflow threads on "explain transformer" and "attention mechanism explained" (2020–2024).
[^9]: Reddit r/MachineLearning, "Why are transformer explanations so bad?" (2021). Thread unavailable due to Reddit bot-blocking, but the theme is widely documented.
[^10]: Lilian Weng, "Attention? Attention!" (2018). https://lilianweng.github.io/posts/2018-06-24-attention/
[^11]: Stephen Wolfram, "What Is ChatGPT Doing and Why Does It Work?" (2023). https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/
[^12]: Maarten Grootendorst, "A Visual Guide to Mamba and State Space Models" (2024). https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mamba-and-state
[^13]: Jack Cook, "Mamba: The Easy Way" (2024). https://jackcook.com/2024/02/23/mamba.html
[^14]: Loïck Bourdois, "Get on the SSM Train" (2023–2024). https://huggingface.co/blog/lbourdois/get-on-the-ssm-train
[^15]: Sasha Rush, "Mamba: The Hard Way" (Annotated Mamba). https://srush.github.io/annotated-mamba/hard.html
[^16]: Chris Olah, "Understanding LSTM Networks" (2015). https://colah.github.io/posts/2015-08-Understanding-LSTMs/