# Transformers: A Layperson's Deep Dive

> **Part of:** SSMs vs Transformers research project
> **See:** [[index]] for full research inventory | [[../STEERING.md]] for project guidance
> **Cross-links:** [[ssm-basics]] | [[computational-complexity]] | [[analogies-and-intuitions]] | [[real-world-products]] | [[sequence-processing-comparison]] | [[anti-patterns]] | [[diagrams-and-visuals]]
>
> **Status:** Research complete ✅

---

## TL;DR

A **Transformer** is a type of neural network that can read a chunk of text — or an image, or audio — all at once, and figure out how every part relates to every other part simultaneously. The "trick" is called **attention**, and it's what makes GPT-4, Claude, Gemini, and virtually every cutting-edge AI system work.[^1]

Unlike older systems that read word-by-word like a person, a Transformer is like **photographing the entire page and analyzing all words at the same moment**. This parallelism is why modern AI can train in days rather than months.

---

## 1. Historical Context: What Came Before?

### The Old Guard: RNNs and LSTMs

Before Transformers, the dominant approach for understanding sequences (sentences, time-series, audio) was the **Recurrent Neural Network (RNN)**.[^2]

Think of an RNN as a reader who can only see one word at a time, reading left to right. After reading each word, they update a little "memory sticky note." By the time they reach the end of a sentence, that single sticky note is supposed to contain everything important about what they read.

**The problem:** By the time you get to the end of a long sentence, the sticky note has been overwritten so many times that early details get blurry. The training-time counterpart is the **vanishing gradient problem**: the learning signal from early words fades away before it can teach the network what to remember.[^3]

> **Analogy:** Imagine playing "telephone" (Chinese whispers) with 50 people. The message at the end barely resembles what was whispered at the start.

**Long Short-Term Memory (LSTM)** networks (invented in 1997 by Hochreiter & Schmidhuber) were a major improvement.[^4] They introduced a "cell state" — like a conveyor belt running through the chain — that could carry information over long distances with less corruption. Chris Olah's famous 2015 blog post describes it: *"The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates."*[^5]

LSTMs helped enormously. But they still had two big problems:

1. **Sequential processing:** You had to process word 1 before word 2, word 2 before word 3. You couldn't parallelize. Training was slow.
2. **Fixed bottleneck:** For translation tasks, the encoder still had to compress an entire sentence into one vector before the decoder could start. Long sentences suffered.

### Attention Added to RNNs (2014–2015)

Researchers discovered a fix: let the decoder "look back" at all the encoder's hidden states simultaneously, instead of just the final one. This was called an **attention mechanism**, introduced by Bahdanau et al. (2014) for machine translation.[^6]

Instead of only seeing one compressed summary, the decoder could look at the entire source sentence and decide which words to pay attention to at each step. This was a huge win — but it was bolted onto the side of an RNN; the sequential bottleneck remained.

### "Attention is All You Need" — The 2017 Revolution

In June 2017, eight researchers at Google Brain, Google Research, and the University of Toronto published **"Attention is All You Need."**[^1] Their radical claim: throw out the recurrence entirely.
No more LSTM conveyors, no more sequential reading. **Use attention, and only attention.** The result — the **Transformer** — was: - Faster to train (fully parallelizable across all tokens) - Better at capturing long-range dependencies - More scalable (just add more compute and data) From the abstract: *"We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely... Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results... after training for 3.5 days on eight GPUs."*[^1] **The authors:** Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin. --- ## 2. Core Intuition: What Is Attention? This is the conceptual heart of everything. Here are **five different ways** educators have explained it to laypeople: ### Analogy 1: The Cocktail Party 🍸 Imagine you're at a loud cocktail party. Dozens of conversations happening around you. Your brain doesn't process all sounds equally — it **filters and focuses**. When someone says your name, your attention snaps toward them. You're dynamically weighting the inputs around you based on relevance to your current "query" (what are you interested in right now?).[^7] A Transformer does something similar with words. When processing the word "it" in the sentence *"The animal didn't cross the street because it was too tired"* — the model needs to figure out what "it" refers to. It looks at all the other words and asks: "Which words are most relevant to understanding 'it'?" Answer: "animal" scores high; "street" scores low.[^8] ### Analogy 2: The Spotlight 🔦 From the TDS "Illustrated Self-Attention" article: imagine a stage with many actors. Each actor (word) has a spotlight. The spotlight can shine on other actors to ask: "Are you relevant to my scene right now?" Some actors illuminate brightly (high relevance); others dimly (low relevance).[^9] - The **Query** is the actor asking: "Who should I pay attention to?" - The **Keys** are the labels each other actor holds up: "Here's what I represent." - The **Values** are the actual information each actor provides when lit up. *Specific subanalogy from TDS using Macbeth:* Macbeth on stage asks "Should I seize the crown?" (Query). Lady Macbeth and other characters respond based on their relevance (Keys). Their actual actions and motivations (Values) then influence Macbeth's updated understanding.[^9] ### Analogy 3: The Library Search 📚 Think of a **library** with a retrieval system: - You walk in with a **query** ("I want books about ancient Rome") - Every book has a **key** (its catalog entry — "Roman History, 200 BC–400 AD") - The **value** is the actual book content The attention mechanism computes how well your query matches each key, then blends together the values of the best-matching books — weighted by match quality. The output is a custom "synthesis" weighted toward the most relevant sources.[^10] ### Analogy 4: The Detective 🔍 Tim Lee and Sean Trott (Understanding AI, 2023) frame it this way: a Transformer is like a detective who can simultaneously cross-reference every clue against every other clue, rather than working sequentially. 
The attention mechanism is the detective's ability to ask: "Given this clue, which other clues are most relevant to resolving its meaning?"[^11]

This is why a Transformer can resolve "it" → "animal" in a complex sentence: it's not reading left-to-right hoping to remember, it's running every word against every other word in one big cross-examination.

### Analogy 5: The Voting Panel 🗳️

Each word gets to "vote" on how much every other word matters to its own interpretation. The votes are tallied (via a softmax function — just turning raw scores into percentages that sum to 100%), and each word's final meaning is updated based on a weighted average of all other words' contributions. Words that "win" more votes from a given query word contribute more to that word's contextual meaning.

### Self-Attention vs. Cross-Attention

| Type | What Queries What | When Used |
|------|-------------------|-----------|
| **Self-attention** | Sequence queries itself | Encoder processing input; decoder processing its output-so-far |
| **Cross-attention** | Output queries input | Decoder attending to encoder output (e.g., translation: French output queries English source) |

Self-attention = *"How does every word in THIS sentence relate to every other word in THIS same sentence?"*

Cross-attention = *"Given what I've generated so far, which parts of the ORIGINAL input are most relevant for my next word?"*

### The Query-Key-Value Mechanism: Step by Step

In plain English, for each word during self-attention:[^8]

1. **Create three representations** of each word: a *Question* (Query), a *Label* (Key), and *Content* (Value).
2. **For each word**, compare its Question against every other word's Label — high match → high relevance score.
3. **Normalize** the scores so they add up to 1 (softmax — raw scores become percentages).
4. **Blend** all words' Content together, weighted by their relevance percentages → context-enriched representation.

(These four steps are shown as a short, runnable sketch in §3 below.)

> **Key insight:** After self-attention, the representation of "bank" in "river bank" is mathematically different from "bank" in "bank account," because the surrounding context words have different relevance scores.[^11]

---

## 3. How Transformers Process Sequences

### Parallel Processing: The Big Win

The single most revolutionary aspect of the Transformer is **parallelism**. RNNs had to process word-by-word, sequentially. Transformers process all tokens simultaneously.

```
RNN (sequential — slow, hard to parallelize):
  token1 → token2 → token3 → token4 → token5
    ↓        ↓        ↓        ↓        ↓
  state1 → state2 → state3 → state4 → state5
  (can't start token2 until token1 is done)

Transformer (parallel — fast, GPU-friendly):
  token1   token2   token3   token4   token5
    ↕        ↕        ↕        ↕        ↕
  [EVERY TOKEN ATTENDS TO EVERY TOKEN SIMULTANEOUSLY]
```

This matters enormously during **training**: modern GPUs are massively parallel processors. With RNNs, you couldn't exploit this parallelism. With Transformers, you can. Training that might have taken weeks with LSTMs can be done in days.[^8]

### Tokens vs. Words

Note: Transformers don't actually operate on words — they operate on **tokens**, typically 3–4 characters of text. The word "unbelievable" might be 3 tokens: "un", "believ", "able." This allows the model to handle rare words and new coinages gracefully. A typical sentence of ~10 words ≈ ~13–15 tokens.
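To make this concrete, here is a minimal NumPy sketch of the four-step Query-Key-Value recipe from §2, computed for all tokens at once. Everything here (the dimensions, the random weights, the `self_attention` name) is illustrative, not from any real model; a real Transformer learns these weight matrices during training.

```python
import numpy as np

def softmax(x, axis=-1):
    """Turn raw scores into percentages that sum to 1 along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a whole sequence at once.
    X: (n_tokens, d_model) array, one embedding per token."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # step 1: Question / Label / Content
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # step 2: every Query vs. every Key
    weights = softmax(scores, axis=-1)         # step 3: rows become percentages
    return weights @ V                         # step 4: blend Content by relevance

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 16                      # tiny illustrative sizes
X = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)    # (5, 16): one context-enriched vector per token
```

Notice that the whole computation is a handful of matrix multiplications over the full sequence. That is exactly the kind of operation GPUs execute in parallel, which is why the speedup over token-by-token RNNs is so large.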
### Positional Encoding: Adding a Sense of Order

Here's a subtle problem: because Transformers process all tokens simultaneously — not in sequence — they have no inherent sense of word *order*. The sentences "dog bites man" and "man bites dog" would look identical without something extra.

The solution is **positional encoding**: before feeding words into the Transformer, you add a unique "position tag" to each word's representation. In the original paper, this is done using sine and cosine waves of different frequencies.[^1]

```
Word embedding for "cat":      [0.2, -0.5, 0.8,  0.1, ...]   (meaning)
Position encoding for pos 3:   [0.0,  0.9, 0.4, -0.7, ...]   (position tag)
                               ─────────────────────────────
Combined input to Transformer: [0.2,  0.4, 1.2, -0.6, ...]   (meaning + position)
```

> **Intuition:** Think of it like giving each audience member a unique seat number, printed on their name tag. The performer (model) knows "the person in seat 5 asked a question" even though they're all physically present at the same time.[^8]

The patterns are designed so:

- Each position has a unique signature
- The signature smoothly varies — nearby positions are similar, distant positions are different
- The model can generalize to sequence lengths it hasn't seen in training

Modern models (GPT-4, LLaMA, Mistral) use improved schemes like **Rotary Position Embeddings (RoPE)**, which encode relative position between every pair of tokens rather than absolute position of each token.[^12]
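For the curious, here is a minimal sketch of the original sinusoidal scheme, following the sine/cosine formulas in the 2017 paper.[^1] The sizes are illustrative, and real implementations differ in how they interleave the sine and cosine channels.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """Position tags from the original paper:
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))"""
    pos = np.arange(n_positions)[:, None]        # column of positions 0..n-1
    i = np.arange(d_model // 2)[None, :]         # one frequency per pair of dims
    angles = pos / 10000 ** (2 * i / d_model)    # longer wavelengths as i grows
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sine wave
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cosine wave
    return pe

embeddings = np.random.default_rng(0).normal(size=(8, 16))   # 8 tokens, d_model = 16
inputs = embeddings + sinusoidal_positions(8, 16)             # meaning + position tag
```

Each embedding dimension oscillates at a different wavelength, so every position gets a unique fingerprint while nearby positions stay similar: exactly the properties listed above.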
### All Tokens Attend to All Tokens

In the encoder, every single token can directly "see" every other token. There's no distance penalty. The word at position 1 can influence the word at position 500 just as easily as position 2. This is the **"global receptive field"** — why Transformers excel at resolving long-range dependencies.[^1]

> **Contrast with CNNs:** Convolutional neural networks can only see a local neighborhood at each layer. You need many stacked layers to "see" far away. Transformers see everything at once, in one shot.

### Context Windows: What They Are and Why They Matter

The **context window** is the maximum number of tokens the Transformer can consider at once. Everything outside this window is invisible to the model.

| Model | Context Window (approx.) | ~Words |
|-------|--------------------------|--------|
| GPT-2 (2019) | 1,024 tokens | ~750 words |
| GPT-3 (2020) | 2,048 tokens | ~1,500 words |
| GPT-4 (2023) | 8,192–128,000 tokens | ~6K–96K words |
| Claude 3 (2024) | 200,000 tokens | ~150,000 words |
| Gemini 1.5 Pro (2024) | 1,000,000 tokens | ~750,000 words |

Why does this matter? The model can only "remember" what's in the window. For a 500-page book with a 32K-token window (~25 pages), the model can only see 25 pages at once. It's like reading through a narrow slit.

Expanding context windows is a major research frontier — directly limited by the quadratic complexity discussed in §4.[^13]

---

## 4. Computational Characteristics

### O(n²) Complexity: Plain English

The notation **O(n²)** means: if the input length doubles, computation doesn't double — it *quadruples*. If it triples, computation multiplies by *nine*. Why? Because in self-attention, every token must compare itself to every other token:

```
Handshake analogy:
    5 guests →    10 handshakes
   10 guests →    45 handshakes
   50 guests → 1,225 handshakes
  100 guests → 4,950 handshakes
```

> **The pattern:** n guests → n×(n-1)/2 handshakes ≈ n²/2. If every word must "shake hands" with every other word, and your document has 100,000 words... that's 5 billion handshakes.[^14]

With n tokens in the sequence:

- 1,000 tokens → ~1 million comparisons
- 10,000 tokens → ~100 million comparisons
- 100,000 tokens → ~10 billion comparisons
- 1,000,000 tokens → ~1 trillion comparisons

### The Attention Matrix

The attention mechanism produces an n×n matrix (every token vs. every other token). For n=128,000 tokens with 32-bit floats, the raw attention matrix is ~65 GB — exceeding most GPU memory.

```
Attention Matrix (4 tokens: "The cat sat down"):

         "The"  "cat"  "sat"  "down"
"The"  [  0.6    0.2    0.1    0.1  ]  ← "The" mostly attends to itself + "cat"
"cat"  [  0.1    0.5    0.3    0.1  ]  ← "cat" attends to itself + "sat" (who sat?)
"sat"  [  0.1    0.3    0.4    0.2  ]  ← "sat" looks back at "cat"
"down" [  0.1    0.1    0.3    0.5  ]  ← "down" attends to itself + "sat"

(rows sum to ~1.0 after softmax normalization)
Bright cells = strong attention; dim cells = weak attention
```

### Memory Requirements

The O(n²) complexity isn't just about speed — it's about **memory**. This is why techniques like **Flash Attention** (Dao et al., 2022) are critical engineering achievements: Flash Attention computes the attention matrix in tiles that fit in fast SRAM rather than loading the whole n×n matrix into HBM (slow GPU memory), making long contexts feasible.[^15]
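A quick sanity check of those numbers, as a back-of-envelope Python sketch. It assumes one dense fp32 attention matrix for a single head in a single layer, which is precisely the thing Flash Attention avoids materializing:

```python
def attention_matrix_gb(n_tokens, bytes_per_float=4):
    """Size of one dense n × n attention matrix, in decimal gigabytes."""
    return n_tokens ** 2 * bytes_per_float / 1e9

for n in (1_000, 32_000, 128_000):
    print(f"{n:>9,} tokens → {attention_matrix_gb(n):10.3f} GB")
# output:
#     1,000 tokens →      0.004 GB
#    32,000 tokens →      4.096 GB
#   128,000 tokens →     65.536 GB   ← the ~65 GB quoted above
```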
### Why This Is a Problem for Long Sequences

For short sequences (sentences, paragraphs), quadratic cost is fine. For long sequences:

- Entire novels (100,000+ words)
- Full codebases
- Long audio recordings (transcribed to tokens)
- Genomic sequences

...the cost becomes prohibitive. This is a core motivation for exploring alternatives like **State Space Models (SSMs)** — see [[ssm-basics]] and [[computational-complexity]].

### Multi-Head Attention: Multiple Perspectives

Rather than doing attention once, the Transformer runs it multiple times in **parallel** — typically 8 or 16 "heads." Each head learns to detect different relationship types:[^8]

- Head 1 might specialize in subject-verb agreement
- Head 2 might track pronoun coreference ("it" → "animal")
- Head 3 might catch semantic similarity
- Head 8 might detect positional relationships

The outputs of all heads are concatenated and linearly mixed.

> **Analogy:** Rather than one detective reading evidence, you assign a specialist team — each looking at the same text from a different angle — then pool their reports.

---

## 5. The Architecture In Detail

### The Encoder-Decoder Stack

The original Transformer had two parts:[^1]

```
        INPUT TOKENS ("I love Paris")
                    │
                    ▼
┌─────────────────────────────────┐
│          ENCODER STACK          │
│                                 │
│  Layer 6  [Multi-Head Attn]     │
│           [Feed-Forward]        │
│           [LayerNorm + Skip]    │
│                ↑                │
│        ×6 identical layers      │
│                ↑                │
│  Layer 1  [Multi-Head Attn]     │
│           [Feed-Forward]        │
│           [LayerNorm + Skip]    │
│                ↑                │
│        Token Embeddings         │
│      + Positional Encoding      │
└─────────────────────────────────┘
                    │
         ENCODED REPRESENTATION
                    │
┌─────────────────────────────────┐
│          DECODER STACK          │
│                                 │
│  Layer 6  [Masked Self-Attn]    │
│           [Cross-Attn] ─────────┼─── (looks at encoder output)
│           [Feed-Forward]        │
│                ↑                │
│        ×6 identical layers      │
└─────────────────────────────────┘
                    │
        OUTPUT TOKENS ("J'aime Paris")
```

**Encoder:** Reads and "understands" the input. Bidirectional — every token sees every other. Used in BERT-style models for comprehension/classification tasks.

**Decoder:** Generates output one token at a time. Has **masked** self-attention (can only see previous tokens, not future ones — otherwise it's "cheating"). Used in GPT-style models for generation.

**Encoder-Decoder:** Used in translation, summarization. Encoder processes source; decoder generates target while attending to source via cross-attention.

### Residual Connections: Why They Matter

Each sub-layer has a "skip connection" — the input is added directly to the output. This allows gradients to flow easily during training through very deep networks:[^8]

```
Input ──→ [Self-Attention] ──→ (+) ──→ [LayerNorm] ──→ next layer
   └────────────────────────────┘
         (residual / skip)
```

Without these, training networks deeper than ~10 layers was practically impossible. With them, modern Transformers use 96+ layers.

### Feed-Forward Layers

After attention, each token is independently processed by a small neural network (the same network applied to each position). Think of attention as *gathering information* from context, and the feed-forward as *processing and transforming that information*.[^16]
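Putting the pieces of §5 together, here is a minimal sketch of one encoder layer: attention, then add & norm, then feed-forward, then add & norm, in the original post-norm arrangement. All weights, sizes, and the `encoder_layer` name are illustrative stand-ins for what a real model learns.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Rescale each token's vector to zero mean and unit variance."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_layer(x, Wq, Wk, Wv, W1, b1, W2, b2):
    """One post-norm encoder layer: attention → add & norm → FFN → add & norm."""
    # --- self-attention sub-layer (same Query/Key/Value recipe as before) ---
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # a DECODER layer would mask future positions right here, e.g.:
    # scores += np.triu(np.full_like(scores, -np.inf), k=1)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    attended = (weights / weights.sum(-1, keepdims=True)) @ V
    x = layer_norm(x + attended)                 # residual ("skip") + LayerNorm
    # --- feed-forward sub-layer, applied to each position independently ---
    hidden = np.maximum(0, x @ W1 + b1)          # expand + ReLU
    return layer_norm(x + hidden @ W2 + b2)      # project back, residual + LayerNorm

rng = np.random.default_rng(0)
n, d, d_ff = 5, 16, 64
shapes = [(d, d), (d, d), (d, d), (d, d_ff), (d_ff,), (d_ff, d), (d,)]
params = [rng.normal(size=s) * 0.1 for s in shapes]
out = encoder_layer(rng.normal(size=(n, d)), *params)
print(out.shape)   # (5, 16): same shape out as in, so layers stack cleanly
```

The commented-out mask line is the only change a decoder layer's self-attention needs: adding −∞ above the diagonal stops each position from seeing later tokens.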
---

## 6. What Transformers Excel At

### Language Tasks

| Task | Example | Why Transformers Win |
|------|---------|---------------------|
| **Language modeling** | "What word comes next?" | Global attention captures all context |
| **Translation** | English → French | Cross-attention naturally models alignment |
| **Summarization** | Condense a document | Encoder captures full document meaning |
| **Question answering** | "Who wrote Hamlet?" | Can find span anywhere in context |
| **Text generation** | Write a poem | Autoregressive generation token-by-token |
| **Classification** | Sentiment analysis | BERT-style classification from [CLS] token |
| **Code generation** | Python function from description | Code treated as another language |

### Vision Transformers (ViT)

In 2020, Google researchers showed you could apply the Transformer directly to images.[^17] The trick: split an image into a grid of patches (e.g., 16×16 pixel squares), treat each patch as a "token," and run attention over patches.

> **Analogy:** Instead of reading a sentence word-by-word, read an image patch-by-patch. "What does the patch showing a wheel tell me about the patch showing a windshield?" (Answer: probably a car.)

ViT abstract: *"a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data... Vision Transformer attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train."*[^17]

ViT-style architectures now power:

- Google's image search
- DALL-E, Stable Diffusion (image encoder/decoder portions)
- Medical imaging AI (pathology slides, radiology)
- AlphaFold 2 (protein structure prediction)

### Why Transformers Dominated 2018–2024

1. **BERT (2018):** Google's bidirectional encoder pre-trained on Wikipedia + BooksCorpus. Fine-tuned for any NLP task. Set records across the board.[^18]
2. **GPT-2/3 (2019–2020):** OpenAI showed that scale + Transformers = emergent capabilities nobody expected.
3. **Hardware alignment:** GPUs are optimized for the matrix multiplications attention requires. Architecture and hardware co-evolved.
4. **Scaling laws (Kaplan et al., 2020):** Reliable empirical rules show performance improves predictably with more data, parameters, and compute — all of which Transformers can exploit.
5. **Transfer learning:** Pre-train once on massive data → fine-tune cheaply for any task. Democratized AI.

---

## 7. Key Products Built on Transformers

| Product | Creator | Architecture | Context | Notable Feature |
|---------|---------|--------------|---------|-----------------|
| **GPT-4/4o** | OpenAI | Decoder-only | 128K | Multimodal; powers ChatGPT |
| **Claude 3.5/4** | Anthropic | Decoder-only | 200K | Strong reasoning, Constitutional AI |
| **Gemini 1.5 Pro** | Google DeepMind | Sparse MoE Transformer | 1M | Natively multimodal; extreme long context |
| **LLaMA 3** | Meta | Decoder-only | 128K | Open weights, runs locally |
| **Mistral/Mixtral** | Mistral AI | Decoder + MoE | 32K | Efficient; sparse expert routing |
| **BERT** | Google | Encoder-only | 512 | Bidirectional; search/classification |
| **Whisper** | OpenAI | Encoder-decoder | 30s audio windows | Speech → text transcription |
| **DALL-E 3** | OpenAI | Transformer + diffusion | — | Text → image |
| **AlphaFold 2** | DeepMind | Transformer-based | — | Protein structure prediction |
| **GitHub Copilot** | GitHub/OpenAI | GPT-based | 8K+ | Code generation in-editor |

> See [[real-world-products]] for deeper dives.

### Brief Product Sketches

**GPT Family (OpenAI):** "Generative Pre-trained Transformer." Decoder-only architecture trained to predict the next token. GPT-3 (2020, 175B parameters) was the first to show "emergent" few-shot learning — give it 3 examples and it figures out the pattern without further training. GPT-4 is multimodal (can see images) and powers ChatGPT.[^19]

**Claude (Anthropic):** Built by former OpenAI researchers. Emphasizes safety via "Constitutional AI" training. Known for 200K+ context windows and strong long-document analysis.[^20]

**Gemini (Google DeepMind):** Natively multimodal from the ground up. Gemini 1.5 Pro extended context to 1M tokens. Powers Google Search AI features.[^21]

**LLaMA (Meta):** Open-weights Transformer family. LLaMA 3 (2024) enables researchers to run capable models locally. A 70B parameter model fits on a single high-end workstation.[^22]

**BERT (Google, 2018):** Unlike GPT (left-to-right generation), BERT reads the whole sentence at once (bidirectionally). Excellent for understanding tasks (QA, sentiment, named entity recognition). Still widely deployed in Google Search ranking.[^18]

---

## 8. Visual Diagram Concepts

### ASCII: The Attention Matrix (Heat Map)

```
          "The" "animal" "didn't" "cross" "street" "because" "it"  "was" "tired"
"The"  [  0.4    0.1      0.1      0.1     0.1      0.05     0.05  0.05  0.05 ]
"it"   [  0.05   0.45     0.05     0.05    0.1      0.1      0.1   0.05  0.05 ]
                 ^^^^
          "it" strongly attends to "animal" (resolving the reference)

Color coding: ████ = strong attention; ░░░░ = weak attention
```

### Mermaid: Encoder-Decoder Architecture

```mermaid
graph TD
    A[Input Tokens] --> B[Embeddings + Positional Encoding]
    B --> C[Multi-Head Self-Attention]
    C --> D[Add & LayerNorm]
    D --> E[Feed-Forward Network]
    E --> F[Add & LayerNorm]
    F --> G[×N encoder layers]
    G --> H[Encoded Representation]
    H --> I[Cross-Attention in Decoder]
    I --> J[Output Token Probabilities]
```

### ASCII: O(n²) Visual Explosion

```
Sequence length vs. attention operations:

 100 tokens:  ████                               10K ops
 300 tokens:  ████████████████████████████       90K ops  (9× more!)
1000 tokens:  ████...████████████████████████     1M ops  (100× more!)
3000 tokens:  ███...a very long bar...████        9M ops  (900× more!)

Each doubling of length = 4× the computation
```
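If you want to generate heat maps like the one above from actual attention weights, a tiny hypothetical helper along these lines works; block characters stand in for color:

```python
import numpy as np

SHADES = " ░▒▓█"   # blank → solid block, five intensity levels

def ascii_heatmap(weights, labels):
    """Print an attention matrix using block characters (darker = stronger)."""
    for label, row in zip(labels, weights):
        cells = "".join(SHADES[min(int(w * len(SHADES)), len(SHADES) - 1)] * 2
                        for w in row)
        print(f"{label:>6} |{cells}|")

tokens = ["The", "cat", "sat", "down"]
attn = np.array([[0.6, 0.2, 0.1, 0.1],     # the example matrix from §4
                 [0.1, 0.5, 0.3, 0.1],
                 [0.1, 0.3, 0.4, 0.2],
                 [0.1, 0.1, 0.3, 0.5]])
ascii_heatmap(attn, tokens)
```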
---

## 9. Common Explanatory Anti-Patterns

> See also: [[anti-patterns]]

Pedagogical failures most commonly observed when teaching attention to laypeople:

| Anti-pattern | Why It Fails | Better Approach |
|---|---|---|
| **Start with the math** (show Q·Kᵀ/√d first) | Loses non-technical readers immediately | Lead with analogy; math (if at all) comes last |
| **"Like human attention"** without nuance | Implies volitional consciousness; misleads | Use it as an entry point, then clarify it's computed |
| **"The model understands language"** | Opens philosophical rabbit hole | Say "processes patterns in language" instead |
| **Blur training vs. inference** | Training = all tokens at once; inference = one token at a time | Explicitly distinguish both modes |
| **Skip positional encoding** | Leaves reader unable to understand why order matters | Always explain it — the analogy is easy |
| **"Just predicts the next word"** | Undersells sophistication; leads to dismissal | Add: "...and to do that accurately, it builds rich contextual understanding of everything that came before" |
| **Black box dismissal** ("too complex to explain") | There ARE accessible explanations — Jay Alammar proves it[^8] | Commit to a good analogy |

---

## 10. Emergent Capabilities at Scale

Several remarkable capabilities arise from scale + Transformer architecture that weren't explicitly programmed:

**In-context learning:** Give GPT-3 a few examples in the prompt ("cat → cats, dog → dogs, mouse → ?") and it generalizes without any weight update. This "few-shot learning" emerged from scale — nobody programmed it in.[^19]

**Chain-of-thought reasoning:** Larger Transformers can be prompted to "think step by step," dramatically improving performance on math and logic. The ability seems to emerge around a certain parameter scale threshold.

**Instruction following:** RLHF (Reinforcement Learning from Human Feedback) fine-tuning shifts a Transformer from "complete the sentence" to "follow instructions helpfully." The base architecture is unchanged; only the training signal differs.

**Cross-modal transfer:** A Transformer trained on text develops representations that transfer usefully to images, audio, and code — suggesting it learns some domain-general structure of information.

---

## Footnotes

[^1]: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). "Attention is All You Need." *arXiv:1706.03762*. https://arxiv.org/abs/1706.03762
[^2]: Wikipedia contributors. "Transformer (deep learning architecture)." Wikipedia. https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)
[^3]: Bengio, Y., Simard, P., & Frasconi, P. (1994). "Learning Long-Term Dependencies with Gradient Descent is Difficult." *IEEE Transactions on Neural Networks, 5*(2), 157–166.
[^4]: Hochreiter, S. & Schmidhuber, J. (1997). "Long Short-Term Memory." *Neural Computation, 9*(8), 1735–1780.
[^5]: Olah, C. (2015). "Understanding LSTM Networks." colah.github.io. https://colah.github.io/posts/2015-08-Understanding-LSTMs/
[^6]: Bahdanau, D., Cho, K., & Bengio, Y. (2015). "Neural Machine Translation by Jointly Learning to Align and Translate." *arXiv:1409.0473*.
[^7]: The "cocktail party" metaphor for selective attention traces to Cherry, E.C. (1953). "Some experiments on the recognition of speech, with one and with two ears." *J. Acoustical Society of America, 25*(5), 975–979. Widely adopted in ML pedagogy.
[^8]: Alammar, J. (2018). "The Illustrated Transformer." jalammar.github.io.
Referenced in Stanford CS224N, Harvard, MIT, Princeton, CMU courses. https://jalammar.github.io/illustrated-transformer/ [^9]: "Illustrated Self-Attention" (2024). *Towards Data Science*. Based on Prof. Tom Yeh's AI by Hand series. https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a [^10]: The library/retrieval analogy for QKV is widely used in ML pedagogy; formalized in: Weng, L. (2023). "The Transformer Family Version 2.0." https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/ [^11]: Lee, T. & Trott, S. (2023). "Large Language Models Explained with a Minimum of Math and Jargon." *Understanding AI* (Substack). https://www.understandingai.org/p/large-language-models-explained-with [^12]: Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." *arXiv:2104.09864*. Referenced in Lilian Weng's Transformer Family v2. [^13]: Weng, L. (2023). "The Transformer Family Version 2.0." Section on longer context. https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/ [^14]: The n² "handshake problem" is a standard combinatorics illustration applied to attention complexity. Explained in Wikipedia's Transformer article and many ML course notes. [^15]: Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." *arXiv:2205.14135*. [^16]: Wolfram, S. (2023). "What Is ChatGPT Doing... and Why Does It Work?" stephenwolfram.com. https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/ [^17]: Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." *arXiv:2010.11929*. https://arxiv.org/abs/2010.11929 [^18]: Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." *arXiv:1810.04805*. Summarized at: https://huggingface.co/blog/bert-101 [^19]: Brown, T., Mann, B., Ryder, N., et al. (2020). "Language Models are Few-Shot Learners." (GPT-3 paper.) *arXiv:2005.14165*. [^20]: Anthropic. (2024). Claude 3 model family technical documentation. https://www.anthropic.com/claude [^21]: Google DeepMind. (2024). "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context." Technical Report. https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf [^22]: Meta AI. (2024). "Introducing Meta Llama 3." Meta AI Blog. https://ai.meta.com/blog/meta-llama-3/ --- ## Sources | # | Source | URL | Type | |---|--------|-----|------| | 1 | Vaswani et al. (2017) — Attention is All You Need | https://arxiv.org/abs/1706.03762 | Academic paper | | 2 | Jay Alammar — The Illustrated Transformer | https://jalammar.github.io/illustrated-transformer/ | Blog/Tutorial | | 3 | Lilian Weng — The Transformer Family v2 | https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/ | Blog/Tutorial | | 4 | Tim Lee & Sean Trott — LLMs Explained | https://www.understandingai.org/p/large-language-models-explained-with | Journalism | | 5 | Stephen Wolfram — What Is ChatGPT Doing? 
| https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/ | Expert essay | | 6 | Chris Olah — Understanding LSTMs | https://colah.github.io/posts/2015-08-Understanding-LSTMs/ | Blog | | 7 | Wikipedia — Transformer (deep learning) | https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture) | Encyclopedia | | 8 | Dosovitskiy et al. (2020) — An Image is Worth 16x16 Words | https://arxiv.org/abs/2010.11929 | Academic paper | | 9 | HuggingFace — BERT 101 | https://huggingface.co/blog/bert-101 | Tutorial | | 10 | TDS — Illustrated Self-Attention | https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a | Tutorial | | 11 | Hochreiter & Schmidhuber (1997) — LSTM | Neural Computation 9(8) | Academic paper | | 12 | Devlin et al. (2018) — BERT | arXiv:1810.04805 | Academic paper | | 13 | Su et al. (2021) — RoPE | arXiv:2104.09864 | Academic paper | | 14 | Dao et al. (2022) — FlashAttention | arXiv:2205.14135 | Academic paper | | 15 | Brown et al. (2020) — GPT-3 | arXiv:2005.14165 | Academic paper | --- *Research note for the SSMs vs Transformers project. Feeds into [[../drafts/]] and the final report.*