# Anti-Patterns: How NOT to Explain Transformers and SSMs
> See [[index]], [[teaching-best-practices]], [[analogies-and-intuitions]]
---
## Why Anti-Patterns Matter
A bad explanation is worse than no explanation. It installs wrong mental models that are hard to dislodge. This note documents observed failure modes in ML education — collected from Stack Overflow questions, Reddit complaints, YouTube comment sections, and direct observation.
---
## Anti-Pattern 1: Leading with Math
### The Pattern
Opening with formulas before establishing intuition.
**Example of BAD explanation**:
> "Attention computes: Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V, where Q ∈ ℝ^{n×d_k}, K ∈ ℝ^{n×d_k}, V ∈ ℝ^{n×d_v}..."
**Why it fails**:
- Most of the target audience exits immediately
- Those who stay spend cognitive energy parsing notation instead of building intuition
- Creates illusion of understanding: readers think they "get it" because they can repeat the formula
**What people say**: "I've read 10 explanations of attention and still don't actually understand what it DOES."
**Better approach**: "Attention is like a Google search within the sentence — each word searches for relevant other words, retrieves their meaning, and blends it in."
---
## Anti-Pattern 2: The Components List (No Mechanism)
### The Pattern
Explaining what components exist without explaining what they DO.
**Example of BAD explanation**:
> "A Transformer has: token embeddings, positional encoding, multi-head self-attention, layer normalization, feed-forward networks, residual connections, and an output projection layer."
**Why it fails**:
- This is a parts manifest, not an explanation
- Tells the reader NOTHING about why it works or what any part is for
- Equivalent to explaining a car as: "It has wheels, an engine, a steering wheel, seats, and doors"
**Better approach**: Explain the ONE core idea (every word looks at every other word) and let architecture details follow naturally.
---
## Anti-Pattern 3: Unexplained Jargon
### The Pattern
Using technical terms without grounding them in everyday concepts.
**Observed examples of unexplained jargon**:
- "The model learns contextual embeddings" (What's an embedding? What's contextual mean here?)
- "Attention is a soft alignment mechanism" (What's alignment? Why soft?)
- "The SSM maintains a latent state that recurrently updates" (three technical terms in a row)
- "Gradient vanishing is solved by skip connections" (means nothing to most people)
**Why it fails**: Each unexplained term requires the reader to context-switch to look it up. Most don't. They give up or fake understanding.
**Better approach**: Start in plain everyday English. Introduce ONE technical term at a time, immediately anchored to an everyday analogy.
---
## Anti-Pattern 4: False Precision vs Comprehension
### The Pattern
Giving technically precise statements that mean nothing without context.
**Bad**:
> "Transformers have O(n²) complexity while SSMs have O(n) complexity."
**Why it fails**: "O(n²)" is meaningless to most people. Even many CS graduates don't immediately feel the intuition.
**Better**:
> "Double the length of text → 4× the work for a Transformer, but only 2× for an SSM. At 10× the text, a Transformer needs 100× the work — the SSM only needs 10×."
---
## Anti-Pattern 5: Making One Approach Sound Like "The Winner"
### The Pattern
Presenting the topic as: "Transformers bad, SSMs good" (or vice versa).
**Bad examples**:
- "Mamba will replace Transformers"
- "SSMs can't compete with GPT-4's reasoning"
- "Attention is all you need (and nothing else ever will be)"
**Why it fails**:
- Factually incomplete
- Creates an audience that will be blindsided when they encounter the other architecture's strengths
- Misses the real insight: different architectures for different tasks
**Better**: Embrace the trade-off honestly. "Transformers have perfect memory but pay for it. SSMs have efficient memory but compress. Hybrids try to get both."
---
## Anti-Pattern 6: The Motivationless Architecture Tour
### The Pattern
Walking through architecture details without first establishing WHY any of it matters.
**Bad**:
> "So first the input goes through the embedding layer, then we add positional encodings, then we have multi-head attention with 8 heads, each computing Q, K, V projections..."
**Why it fails**: The reader has no framework for deciding what's essential and what's incidental. Everything seems equally (un)important.
**Better approach**:
1. First: "Here's the problem we're solving"
2. Second: "Here's the key insight that solves it"
3. Third: "Here's how the architecture implements that insight"
4. Fourth: "These other details support the key insight"
---
## Anti-Pattern 7: Ignoring the Why of Positional Encoding
### The Pattern
Mentioning positional encoding but not explaining why it's needed.
**Bad**:
> "We add positional encodings to give the model information about token positions."
**Why it fails**: The reader doesn't understand WHY the model would need to be told about positions; word order feels like information the model must obviously already have.
**Better**:
> "Here's a surprising fact: the attention mechanism is completely blind to word order. 'The cat chases the dog' and 'The dog chases the cat' look identical to self-attention. We have to manually inject position information — that's what positional encoding does."
---
## Anti-Pattern 8: Overextended Analogies
### The Pattern
Taking an analogy too far until it breaks.
**Example**: Using the "Google Search" analogy for attention, then saying "so the model 'Googles' the answer"
**Break point**: Google returns ranked web pages. Attention returns a weighted blend of vectors. Very different outcome.
**Better approach**: State the analogy, note the mapping, note where the analogy breaks. "This is like a Google search, except instead of returning one result, you get a weighted blend of ALL the results."
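The break point itself can be shown in two lines (toy numbers, NumPy; the vectors are invented for illustration):

```python
import numpy as np

results = np.array([[1.0, 0.0],      # three candidate "meanings" a query could retrieve
                    [0.0, 1.0],
                    [0.5, 0.5]])
weights = np.array([0.7, 0.2, 0.1])  # attention weights, already normalized to sum to 1

print(results[np.argmax(weights)])   # [1. 0.]      -- what a search engine returns: the top hit
print(weights @ results)             # [0.75 0.25]  -- what attention returns: a blend of ALL hits
```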
---
## Anti-Pattern 9: Context Window Confusion
### The Pattern
Conflating "context window" with "memory" in a misleading way.
**Bad**:
> "Transformers have a context window of 128K tokens, meaning they can 'remember' 128K tokens."
**Why it fails**: People think of memory as something that persists. The context window is a SLIDING WINDOW — old context falls off the edge. The model doesn't "remember" anything outside the window.
**Better**:
> "The context window is like a whiteboard that's always being erased at one end as you write more at the other. Everything on the whiteboard right now, the model can see perfectly. Anything that got erased — it's as if it never existed."
---
## Anti-Pattern 10: "Simple" Without Foundation
### The Pattern
Saying "it's actually quite simple" to reassure readers, without actually simplifying.
**Bad**:
> "Don't worry, self-attention is actually quite simple. You just compute Q = XWᵀ_Q, K = XWᵀ_K, V = XWᵀ_V, then..."
**Why it fails**: The reader feels stupid for being confused by something billed as "simple". That breeds learned helplessness.
**Better**: Either actually simplify (use the analogy) or acknowledge the complexity honestly without false reassurance.
---
## Observed Real-World Complaints
From Reddit r/MachineLearning, Stack Overflow, and YouTube comments:
- "Every explanation of transformers just describes components. None of them explain WHY it works."
- "I get that attention is QKV but I still don't know what that MEANS"
- "Why do I need positional encoding? The positions are already there aren't they?"
- "Everyone explains the math but nobody explains the intuition"
- "What does the state in Mamba actually contain? Nobody says."
- "How is Mamba different from an LSTM? Papers just say 'selective state spaces'"
---
## The Gold Standard: What Good Explanations Do Instead
| Anti-Pattern | Gold Standard Alternative |
|-------------|--------------------------|
| Lead with math | Lead with the problem it solves |
| Component list | One core mechanism, then supporting parts |
| Unexplained jargon | Everyday word first, then technical term |
| False precision | Concrete numbers and everyday comparisons |
| "X replaces Y" | Honest trade-off analysis |
| Architecture tour | Problem → Insight → Architecture |
| No motivation | Always answer "Why do we need this?" |
| Overextended analogy | State where the analogy breaks |
| Context = memory | Explain the sliding window |
| "It's simple" | Either simplify or acknowledge difficulty |
---
## Key Takeaway for This Report
Every section of the final report should pass the "So what?" test:
- Does the reader know WHY this matters?
- Do they have a concrete mental model?
- Can they explain it to a friend?
If not, the section needs an analogy, a diagram, or a concrete example before any technical detail.