Lesson 5 | 12 min read

Transformers & Attention

The architecture that changed everything -- and why.

What you'll understand

Why transformers replaced RNNs (parallel vs sequential processing)
Self-attention and the Query-Key-Value mechanism at a conceptual level
What tokens are and why tokenization creates surprising limitations
The scaling hypothesis and why "bigger = better" (mostly) worked

In 2017, eight Google researchers wrote a paper with a Beatles-inspired title. It now has over 173,000 citations, making it one of the ten most-cited papers of the 21st century. Every single one of those researchers left Google to start or join AI companies. The paper introduced an architecture called the Transformer -- and it's the reason you can talk to ChatGPT, Claude, Gemini, or any other large language model today.

The original 'Attention Is All You Need' paper from 2017

The Problem with RNNs

Before transformers, the dominant architecture for processing language was the Recurrent Neural Network (RNN), a neural network that processes a sequence one element at a time, left to right, passing a hidden state forward through each step. Each step depends on the previous step's output, like reading a book one word at a time and never looking back.

This created three brutal bottlenecks. First, you can't parallelize. Step 47 has to wait for step 46, which has to wait for step 45, all the way back to step 1. Training was painfully slow. Second, the vanishing gradient problem meant that information from early tokens degraded through many sequential steps. An RNN reading a 500-word paragraph is essentially playing a game of telephone -- the message gets garbled over distance. Third, the best available fix only went so far: LSTMs (Long Short-Term Memory networks, introduced by Hochreiter and Schmidhuber in 1997) eased the gradient problem with gating mechanisms, but still struggled with very long sequences.
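The telephone-game effect can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that every recurrent step scales the signal from an earlier token by the same constant factor; the factors 0.99 and 0.9 are made-up values, not measurements from any real RNN.

```python
def surviving_signal(per_step_factor: float, steps: int) -> float:
    """Fraction of an early token's signal left after `steps` sequential updates.

    Assumes (hypothetically) that every step multiplies the signal by the
    same constant factor; real RNNs vary from step to step.
    """
    return per_step_factor ** steps


if __name__ == "__main__":
    for factor in (0.99, 0.9):
        for steps in (10, 100, 500):
            print(f"factor={factor}, steps={steps}: "
                  f"{surviving_signal(factor, steps):.2e}")
```

Even a factor close to 1 shrinks toward zero once it is applied hundreds of times in a row, which is why distance hurts so much in a purely sequential model.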

Noam Shazeer, one of the Transformer co-authors, put it bluntly at NVIDIA GTC 2024:

"We could have done the industrial revolution on the steam engine, but it would just have been a pain. Things went way, way better with internal combustion."
-- Noam Shazeer

The transformer was that internal combustion engine. Instead of processing tokens sequentially, it processes the entire sequence at once. Self-attention lets any token attend to any other token directly -- no telephone game. Residual connections help gradients flow through the model's many stacked layers. The result: dramatically faster training and better handling of long-range dependencies.

"Attention Is All You Need"

The paper was posted to arXiv on June 12, 2017, and presented that December at the 31st Conference on Neural Information Processing Systems (NeurIPS, then abbreviated NIPS). The title is a reference to the Beatles song "All You Need Is Love." Jakob Uszkoreit, one of the authors, picked the name "Transformer" simply because he liked how the word sounded. An early internal design document was actually titled "Transformers: Iterative Self-Attention and Processing for Various Tasks" and included illustrations from the Transformers franchise.

173K+ citations (as of 2025)

65M parameters (base model)

8 P100 GPUs to train

12 hrs training time (base)

The paper was notably understated. The authors weren't claiming to have solved AI; they were trying to build a better machine translation system. The base model had just 65 million parameters and trained in 12 hours on 8 GPUs. That is roughly 27,000 times smaller than GPT-4, if the widely reported but unconfirmed estimate of about 1.8 trillion parameters is right. The paper's headline scores, 28.4 BLEU on English-to-German and 41.0 BLEU on English-to-French (41.8 in later revisions of the paper), were both new state-of-the-art results; they came from the larger "big" configuration, which trained for 3.5 days. All eight authors were credited as equal contributors; the listed order was randomized.

Jensen Huang, speaking at GTC 2024 with the original authors, said simply: "Everything that we're enjoying today can be traced back to that moment."

Where the 8 Authors Went

Every single author left Google. They collectively went on to shape the AI industry:

Ashish Vaswani: Co-founder & CEO, Essential AI

Noam Shazeer: Co-founder & CEO, Character.AI

Niki Parmar: Co-founder, Essential AI

Jakob Uszkoreit: Co-founder & CEO, Inceptive

Llion Jones: Co-founder & CTO, Sakana AI

Aidan Gomez: Co-founder & CEO, Cohere

Lukasz Kaiser: Researcher, OpenAI

Illia Polosukhin: Co-founder, NEAR Protocol

As Llion Jones recalled: "We had very recently started throwing bits of the model away, just to see how much worse it would get. And to our surprise it started getting better." And Aidan Gomez offered a striking thought: "I think the world needs something better than the transformer."

Self-Attention: How the Model Decides What to Focus On

Self-attention is a mechanism that lets each token attend to all other tokens in parallel, computing relevance scores between every pair. Each word in a sentence looks at every other word and decides how much to "pay attention" to each one, based on context. The word "bank" in "river bank" should attend heavily to "river"; in "bank account" it should attend to "account." This is how transformers resolve ambiguity.

The mechanism works through three vectors per token, each produced by multiplying the token's embedding by a learned weight matrix:

Query (Q): "What am I looking for?" -- represents the information this token is seeking.

Key (K): "What do I contain?" -- represents the information this token offers.

Value (V): "What information do I pass along?" -- the actual content to be aggregated.

Think of it like a library search. The Query is your search term. The Keys are the index cards for each book. The Values are the actual book contents. You match your search against all index cards, then pull content from the best matches.

The math is elegant: for each token, compute how similar its Query is to every token's Key (via dot product), divide by the square root of the key dimension d_k so the scores don't grow with vector size and saturate the softmax, apply softmax to turn each token's scores into weights that sum to 1, then take a weighted sum of all Values. The formula: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
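The formula can be traced step by step in a few lines of plain Python. This is a minimal sketch for reading, not how production systems compute attention (they use batched matrix operations on GPUs); the function names and the toy vectors are invented for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, divided by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs


if __name__ == "__main__":
    # Toy 2-dimensional vectors for three tokens (made-up numbers).
    Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
    for row in attention(Q, K, V):
        print([round(x, 3) for x in row])
```

Each output row is a blend of all the Value vectors, weighted toward the tokens whose Keys best match that row's Query.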

Multi-Head Attention

Instead of running one attention computation, transformers run multiple attention heads in parallel -- each with its own learned Q, K, V projections. Why? Language has multiple simultaneous relationships. One head might learn to track syntax (subject-verb agreement), another semantics (meaning), another positional patterns (nearby words), another coreference (which noun a pronoun refers to). The original Transformer used 8 heads. GPT-3 uses 96.
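The bookkeeping behind multiple heads can be sketched as split, attend, concatenate. In the original paper's base configuration, a 512-dimensional model with 8 heads gives each head 64 dimensions. The code below is a deliberate simplification: it slices each token vector into chunks and uses the same chunk as Query, Key, and Value, whereas a real Transformer gives every head its own learned projection matrices. All function names are hypothetical.

```python
import math


def attend(q_rows, k_rows, v_rows):
    """Single-head scaled dot-product attention on lists of vectors."""
    d_k = len(k_rows[0])
    out = []
    for q in q_rows:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in k_rows]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, v_rows))
                    for j in range(len(v_rows[0]))])
    return out


def split_heads(rows, num_heads):
    """Slice each vector into `num_heads` equal chunks, one list of rows per head."""
    size = len(rows[0]) // num_heads
    return [[row[h * size:(h + 1) * size] for row in rows]
            for h in range(num_heads)]


def multi_head_attention(x, num_heads):
    """Run attention independently per head, then concatenate the results.

    Simplification: slicing stands in for the learned per-head projections.
    """
    per_head = [attend(chunk, chunk, chunk)
                for chunk in split_heads(x, num_heads)]
    # Stitch the head outputs back into one vector per token.
    return [[value for head in per_head for value in head[t]]
            for t in range(len(x))]


if __name__ == "__main__":
    tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
    for row in multi_head_attention(tokens, num_heads=2):
        print([round(v, 3) for v in row])
```

Because each head only sees its own slice, the heads can weight the same pair of tokens differently, which is the property the paragraph above describes.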

[Interactive figure: attention weights, shaded as strong, moderate, or weak attention]

Tokens and Tokenization

Language models don't see words. They see tokens: the atomic units a model processes, usually subword fragments but sometimes whole words or single characters, each mapped to an integer ID. The dominant method is Byte Pair Encoding (BPE): start with individual characters, count all adjacent pairs in the training data, merge the most frequent pair into a single token, and repeat until you reach a target vocabulary size.
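The BPE loop described above fits in a short function. This is a toy sketch under simplifying assumptions: it trains on a small list of words rather than raw text, works on characters rather than bytes, and breaks ties by taking the first pair encountered. Real tokenizers differ in all three respects, and the function name is invented.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Learn BPE merges from a list of words (a toy corpus).

    Returns the ordered merge list and the final segmentation counts.
    """
    # Start with every word as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # ties: first pair encountered wins
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


if __name__ == "__main__":
    toy_words = ["low", "low", "lower", "newest", "newest", "widest"]
    learned, segmented = train_bpe(toy_words, num_merges=5)
    print(learned)
    print(dict(segmented))
```

Frequent character sequences end up as single tokens, while rare words stay split into several pieces.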

GPT-1 vocabulary: 40K tokens

GPT-2 vocabulary: 50K tokens

GPT-4 vocabulary: 100K tokens

This creates some surprising consequences. The word "strawberry" gets tokenized as something like:

"str" "aw" "berry"

The model never sees individual letters. It cannot directly count the r's because "r" is not a discrete unit in its input; it can only answer from whatever spelling knowledge it absorbed during training. Numbers get split unpredictably too -- "123456" might become ["123", "456"] or ["12", "345", "6"]. Non-English languages often require more tokens per word, consuming more of the context window and costing more per interaction.
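A small sketch shows what "never sees individual letters" means in practice. The vocabulary below is hypothetical: the three pieces match the example split above, but the integer IDs are invented and do not come from any real tokenizer.

```python
# Hypothetical vocabulary: pieces and IDs are invented for illustration.
VOCAB = {"str": 1001, "aw": 1002, "berry": 1003}
ID_TO_PIECE = {token_id: piece for piece, token_id in VOCAB.items()}


def encode(pieces):
    """Map subword pieces to the integer IDs a model actually receives."""
    return [VOCAB[piece] for piece in pieces]


def count_letter(token_ids, letter):
    """Count a letter by decoding IDs back to text.

    This decoding happens in the tokenizer, outside the model.
    """
    text = "".join(ID_TO_PIECE[token_id] for token_id in token_ids)
    return text.count(letter)


if __name__ == "__main__":
    ids = encode(["str", "aw", "berry"])
    print(ids)                     # the model's view: three integers
    print(count_letter(ids, "r"))  # the tokenizer's view: letters are visible
```

Nothing in the list of integers says how many r's each ID contains; that information only exists once the IDs are turned back into text.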

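SentencePiece, developed by Google, takes a different approach: it treats input as a raw stream of Unicode characters, with whitespace handled as just another symbol, rather than assuming spaces separate words. That's critical for languages like Chinese, Japanese, and Thai, which are written without spaces between words. It's used by T5, the original LLaMA models, and many multilingual models.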

Context Windows

The context window is the maximum number of tokens a model can process at once: its working memory. Everything the model reads and writes must fit inside it. Tokens outside the window simply don't exist to the model.

The growth has been staggering -- roughly 2,000x in seven years:

2018 GPT-1: 512 tokens (~400 words)

2019 GPT-2: 1,024 tokens

2020 GPT-3: 2,048 tokens

2023 GPT-4: 8K and 32K tokens at launch, 128K with GPT-4 Turbo; Claude 2: 100K tokens

2024 Gemini 1.5 Pro: 1M tokens; Claude 3: 200K tokens

2025 Claude Sonnet 4: 1M tokens in beta (~750,000 words -- 4-5 full novels)

There's a catch: attention is O(n²) in sequence length. Double the context, quadruple the compute. And research shows models tend to perform best on information at the beginning and end of the context -- the "lost in the middle" problem. Recall sags for material buried in the middle of long inputs, like a human reader's concentration on page 300 of a dense novel.
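The quadratic growth is easy to see by counting. The sketch below counts the query-key scores in one full self-attention pass, for one head in one layer; it ignores everything else a model computes, and the function name is invented.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key scores in one full self-attention pass (n * n)."""
    return num_tokens * num_tokens


if __name__ == "__main__":
    for n in (512, 2_048, 128_000, 1_000_000):
        print(f"{n:>9,} tokens -> {attention_pairs(n):,} pairwise scores")
```

Going from 1,024 to 2,048 tokens doubles the input but quadruples the number of scores, which is the "double the context, quadruple the compute" rule in miniature.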

The Scaling Hypothesis

In March 2019, reinforcement learning pioneer Richard Sutton published a short, devastating essay called "The Bitter Lesson":

"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."
-- Richard Sutton, The Bitter Lesson (2019)

He called it "bitter" because researchers keep trying to build clever, hand-crafted solutions -- and they keep getting beaten by simpler methods that just use more compute. Chess (Deep Blue), speech recognition, Go (AlphaGo) -- the pattern repeats relentlessly.

This became the guiding principle of the AI boom. Ilya Sutskever, OpenAI co-founder, was its most prominent advocate: just make the model bigger, train it on more data, use more compute. In January 2020, Kaplan et al. at OpenAI quantified this. They found that model performance scales as a smooth power law with model size, data, and compute -- predictable across seven orders of magnitude. This led directly to GPT-3 at 175 billion parameters.
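A power law has a simple signature: every tenfold increase in model size multiplies the loss by the same constant factor. The sketch below uses the functional form L(N) = (N_c / N)^alpha for loss as a function of parameter count N. The constants are illustrative placeholders chosen for this example, not the fitted values from Kaplan et al.

```python
def power_law_loss(n_params: float, n_c: float, alpha: float) -> float:
    """Loss predicted by a power law of the form L(N) = (n_c / N) ** alpha."""
    return (n_c / n_params) ** alpha


if __name__ == "__main__":
    # Illustrative constants, NOT the fitted values from the paper.
    N_C, ALPHA = 1e14, 0.08
    for n in (1e8, 1e9, 1e10, 1e11):
        print(f"{n:.0e} params -> loss {power_law_loss(n, N_C, ALPHA):.3f}")
    # Every 10x in parameters multiplies the loss by the same factor.
    print(f"per-decade factor: {10 ** -ALPHA:.3f}")
```

On a log-log plot this relationship is a straight line, which is what makes extrapolating from small models to large ones possible.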

Then in March 2022, DeepMind's Chinchilla paper corrected the balance. Kaplan had overweighted model size; Chinchilla showed that model size and training data should scale equally. The practical rule: roughly 20 tokens per parameter. Their 70-billion parameter Chinchilla model, trained on 1.4 trillion tokens, outperformed 280-billion parameter Gopher trained on just 300 billion tokens -- despite being 4x smaller.
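The 20-tokens-per-parameter rule can be checked against the numbers quoted above. This sketch also estimates training compute with the commonly used approximation of about 6 FLOPs per parameter per token; that approximation is an assumption added here, not a figure from the text, and the function names are invented.

```python
TOKENS_PER_PARAM = 20  # rule of thumb quoted in the text


def compute_optimal_tokens(n_params: float) -> float:
    """Training tokens suggested by the ~20-tokens-per-parameter rule."""
    return TOKENS_PER_PARAM * n_params


def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough training cost via the approximation C ~ 6 * N * D (an assumption)."""
    return 6 * n_params * n_tokens


if __name__ == "__main__":
    # (parameters, training tokens) as quoted in the text.
    models = {"Chinchilla": (70e9, 1.4e12), "Gopher": (280e9, 300e9)}
    for name, (params, tokens) in models.items():
        print(f"{name}: {tokens / params:.1f} tokens/param, "
              f"~{training_flops(params, tokens):.2e} FLOPs")
    print(f"Rule-of-thumb tokens for 70B params: {compute_optimal_tokens(70e9):.2e}")
```

Under this approximation the two models land at a similar order of compute, so the comparison is less about spending more and more about spending the same budget on a smaller model and more data.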

Why this matters: The scaling laws turned AI from an art into something closer to engineering. You can forecast how well a model will perform before spending millions training it. This predictability enabled companies to confidently invest billions in training runs.

By 2025, the picture shifted again. Sutskever himself acknowledged that the original pre-training scaling recipe is reaching its limits because internet data is finite: "The age of scaling is giving way to the age of research."

History Thread

The transformer didn't appear from nowhere. Attention mechanisms existed before 2017 -- Bahdanau et al. (2014) introduced attention for machine translation. But those systems still used recurrent networks as the backbone. What Uszkoreit proposed, and the team proved, was that you could throw away the recurrence entirely and build a model from pure attention. Recurrent architectures, which date back to the 1980s, had been the default for neural NLP. Transformers displaced them in under two years. BERT (encoder-only, 2018), GPT-2 (decoder-only, 2019), and T5 (encoder-decoder, 2019) all demonstrated the architecture's versatility.

3Blue1Brown -- "Attention in Transformers, Visually Explained." The best visual walkthrough of the QKV mechanism, multi-head attention, and why it all works.

Pop Culture Connection

The Transformer shares its name with the toy franchise (an early design document even used its artwork), and the parallel is hard to resist. Optimus Prime transforms between a truck and a robot. The Transformer architecture transforms sequences of numbers into meaning. And like the franchise itself, it spawned an entire cinematic universe of derivatives: BERT, GPT, T5, PaLM, LLaMA, Claude, Gemini. If "Attention Is All You Need" is the original film, we're deep into the expanded universe.

Andrej Karpathy -- "Let's build GPT: from scratch, in code, spelled out." For learners who understand by seeing code. Nearly 2 hours of building a working Transformer from the ground up.

Key Terms

Transformer: Neural network architecture using self-attention instead of sequential processing, introduced in 2017.
Self-attention: Mechanism allowing each token to attend to all other tokens in parallel, computing relevance scores between every pair.
Query / Key / Value: Three vectors each token produces to compute attention. Query = what I'm looking for, Key = what I offer, Value = the info I pass along.
Token: The atomic unit a language model processes -- a word, subword, or character mapped to an integer ID.
Context window: Maximum number of tokens a model can process at once -- its working memory.
BPE (Byte Pair Encoding): The dominant tokenization algorithm. It iteratively merges the most frequent pair of adjacent symbols until reaching a target vocabulary size.
Scaling law: Mathematical relationship showing model performance improves predictably as a power law with more compute, data, or parameters.

Did This Land?

Why was the shift from sequential (RNN) to parallel (Transformer) processing so important?

Because no position has to wait for the previous one, a transformer can compute every token in a sequence at the same time, which is exactly the kind of work GPUs do well and makes training dramatically faster. It also eliminated the "telephone game" problem -- any token can attend directly to any other token, so information doesn't degrade over long distances.

What does "attention" actually compute?

For each token, attention computes a relevance score to every other token (via Query-Key dot products), normalizes these into a probability distribution (softmax), then produces a weighted sum of all tokens' Value vectors. The output is a context-aware representation of each token.

Why can't LLMs reliably count the letters in "strawberry"?

LLMs don't see individual characters. The tokenizer splits "strawberry" into subword pieces like ["str", "aw", "berry"]. The model processes these tokens, not letters, so it has no way to directly inspect or count characters within a token.

Lesson Summary

RNNs processed tokens sequentially and suffered from vanishing gradients. Transformers process entire sequences in parallel, enabling massive speedups and better long-range understanding.
"Attention Is All You Need" (2017) introduced the Transformer with 65M parameters. All 8 authors left Google; the architecture now underpins every major language model.
Self-attention uses Query, Key, and Value vectors to let each token decide how much to focus on every other token. Multi-head attention runs multiple attention patterns simultaneously.
Tokenization (BPE) splits text into subword fragments, which is why models struggle with character-level tasks like counting letters or handling non-English scripts efficiently.
Scaling laws showed AI performance is surprisingly predictable -- but by 2025, the era of pure "just make it bigger" is giving way to new research approaches.