Lesson 6 · 14 min read

From Model to Product

How ChatGPT actually works, end-to-end.

What you'll understand

The full pipeline from pre-training to deployed product (pre-training, SFT, RLHF)
What happens during inference: tokenize, forward pass, sample, detokenize
Why RLHF matters and the human labor behind it
AlphaGo's Move 37 and why it still matters for understanding AI capability

In March 2016, an AI played a move in a board game that it estimated a human would choose only 1 time in 10,000. The world champion walked out of the room for 15 minutes. He went on to lose the game. That moment -- Move 37 in Game 2 of the AlphaGo vs Lee Sedol match -- was the first time many people realized AI could do something genuinely creative. Not just fast. Not just accurate. Novel.

But a beautiful move in Go is one thing. How do you get from a pile of numbers (that's what a model is, remember, from Lesson 1) to a product that 800 million people use every week? That's this lesson.

The Full Pipeline

ChatGPT follows a three-stage pipeline laid out in the InstructGPT paper (Ouyang et al., 2022). Each stage transforms the model into something more useful.

Pre-training

Learn patterns from massive text. GPT-3: 175 billion parameters, ~300 billion tokens, 3.14 x 10^23 FLOPs. GPT-4 training cost: estimated $78-100 million. The model learns grammar, facts, reasoning patterns -- but has no concept of "being helpful."
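The GPT-3 compute figure can be roughly reproduced from the parameter and token counts. The sketch below assumes the common rule of thumb of about 6 floating-point operations per parameter per training token; that rule is not stated in this lesson, and the function name is hypothetical.

```python
def training_flops(params: float, tokens: float, flops_per_param_token: float = 6.0) -> float:
    """Rough training compute estimate: ~6 FLOPs per parameter per token (assumed rule of thumb)."""
    return flops_per_param_token * params * tokens


# GPT-3 figures quoted above: 175 billion parameters, ~300 billion tokens.
gpt3_flops = training_flops(params=175e9, tokens=300e9)
print(f"Estimated GPT-3 training compute: {gpt3_flops:.2e} FLOPs")
```

The estimate lands within about half a percent of the 3.14 x 10^23 figure, which suggests the quoted number comes from this kind of back-of-the-envelope calculation.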

Supervised Fine-Tuning (SFT)

Human labelers (~40 contractors) write ideal responses to ~13,000 prompts. The model is fine-tuned on these demonstrations. This teaches the format of being helpful -- but it still makes mistakes about what is good.

RLHF (Reinforcement Learning from Human Feedback)

Human raters rank multiple model outputs. A reward model learns these preferences (~33,000 prompts). PPO optimizes the language model to maximize the reward model's score. A KL divergence penalty prevents the model from straying too far from the SFT model it started as.

The stunning result from the InstructGPT paper: a 1.3 billion parameter model trained with RLHF was preferred by human evaluators over the 175 billion parameter GPT-3 -- a model more than 100x larger. Alignment techniques can matter more than raw scale for making models useful.

13K: SFT training prompts
33K: Reward model prompts
~40: Human labelers
100x: Smaller model preferred

Inference: What Happens When You Type a Prompt

When you type a message into ChatGPT, four things happen in rapid sequence. This is inference: running a trained model to generate output, as opposed to training, which updates the model's weights. The model is producing output, not learning.

Step 1 -- Tokenize: Your text is split into tokens (subword units). "What is the capital of France?" becomes something like ["What", " is", " the", " capital", " of", " France", "?"] -- each mapped to an integer ID.

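Step 2 -- Forward pass: Token IDs become high-dimensional vectors (embeddings) and flow through the transformer layers, with all prompt tokens processed in parallel. Attention mechanisms determine which parts of the input matter most. This pass also populates the KV cache, which stores each token's attention keys and values so they aren't recomputed for every new token.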

Step 3 -- Sample: The model outputs a probability distribution over ~100,000 possible next tokens. A sampling strategy picks one. That token gets appended, and the process repeats -- one token at a time, autoregressively.

Step 4 -- Detokenize: Token IDs are mapped back to human-readable text. The response streams to your screen as tokens are generated -- that's why you see it "typing."
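The four steps above can be sketched as a toy loop. Everything here is a stand-in chosen for illustration: the ten-entry vocabulary, the hand-written `forward_pass` rule (a real model computes these probabilities with learned weights), and all function names are hypothetical.

```python
import random

# Hypothetical toy vocabulary; production tokenizers have on the order of 100,000 entries.
VOCAB = ["<eos>", "What", " is", " the", " capital", " of", " France", "?", " Paris", "."]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}
EOS_ID = TOKEN_TO_ID["<eos>"]


def tokenize(text):
    """Step 1: greedy longest-match split of text into token IDs."""
    ids, pos = [], 0
    while pos < len(text):
        candidates = [t for t in VOCAB[1:] if text.startswith(t, pos)]
        if not candidates:
            raise ValueError(f"no token matches text at position {pos}")
        best = max(candidates, key=len)
        ids.append(TOKEN_TO_ID[best])
        pos += len(best)
    return ids


def forward_pass(ids):
    """Step 2: stand-in for the transformer. Returns next-token probabilities."""
    # Hand-written rule instead of learned weights: favor one continuation.
    favored_next = {"?": " Paris", " Paris": "."}
    favored = TOKEN_TO_ID[favored_next.get(VOCAB[ids[-1]], "<eos>")]
    probs = [0.01] * len(VOCAB)
    probs[favored] = 1.0 - 0.01 * (len(VOCAB) - 1)
    return probs


def sample(probs, rng):
    """Step 3: draw one token ID in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs)[0]


def detokenize(ids):
    """Step 4: map token IDs back to text."""
    return "".join(VOCAB[i] for i in ids)


def generate(prompt, max_new_tokens=8, greedy=False, seed=0):
    """Autoregressive loop: predict, pick, append, repeat until <eos>."""
    rng = random.Random(seed)
    ids = tokenize(prompt)
    n_prompt = len(ids)
    for _ in range(max_new_tokens):
        probs = forward_pass(ids)
        next_id = probs.index(max(probs)) if greedy else sample(probs, rng)
        if next_id == EOS_ID:
            break
        ids.append(next_id)
    return detokenize(ids[n_prompt:])


print(generate("What is the capital of France?", greedy=True))
```

With `greedy=True` the loop always takes the most likely token; with the default sampling path, the same prompt can occasionally wander onto a different continuation, which is the behavior the next section explains.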

At scale, this is extraordinary. OpenAI reportedly uses ~28,936 GPUs (3,617 HGX A100 servers) to serve ChatGPT. As of 2025, it processes roughly 29,000 prompts per second. A single 30-word response costs OpenAI about 1 cent to generate.

Inference pipeline: Input ("Capital of France?") → Tokenize → Forward Pass → Sample → Output

Temperature and Sampling

LLMs don't "decide" on an answer. They produce a probability distribution over all possible next tokens. The temperature parameter controls how that distribution is converted into a choice: 0 means always pick the most likely token, and 1 means sample proportionally from the distribution.

0 (Deterministic): Always pick the highest-probability token. Same prompt = same answer every time. Good for factual queries.

0.7 (Balanced): Moderate randomness. Coherent but varied. The typical default for conversational AI.

1.0+ (Creative): Sample broadly from the distribution. More surprising, more unpredictable. Above 1.0, things get weird fast.

Mechanically, temperature divides the logits (raw scores) before the softmax function. Lower temperature sharpens the distribution; higher temperature flattens it. Since dividing by zero is undefined, temperature 0 is treated as a special case: always take the top token. Think of it like a loaded die: temperature 0 = always roll a 6. Temperature 0.7 = you mostly roll 5s and 6s. Temperature 2.0 = the die is almost fair.
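A minimal sketch of that mechanism, using only the standard library. The three logit values are made up for illustration, and treating temperature 0 as "pick the top token" is a convention assumed here, not a statement about any particular product's implementation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Assumed convention: temperature 0 means all probability on the top token.
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exponentiating for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical raw scores for three candidate tokens
for t in (0, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the top token's share shrinking as the temperature rises, which is the sharpening and flattening described above.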

Top-p (nucleus) sampling, introduced by Holtzman et al. (2020), adds another layer: instead of considering all ~100K tokens, only consider the smallest set whose cumulative probability exceeds a threshold (e.g., p=0.95). If the model is confident, that might be 2-3 tokens. If uncertain, hundreds. This is why the same prompt gives different answers -- you're sampling from a distribution, not looking up an answer.
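The nucleus cut-off can be sketched in a few lines. The two probability lists are invented examples of a confident and an uncertain model, and `top_p_filter` is a hypothetical helper name; the sketch keeps tokens until the running total reaches the threshold, then renormalizes what is left.

```python
def top_p_filter(probs, p=0.95):
    """Keep the smallest set of most-likely tokens whose total probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalize so the surviving tokens form a proper distribution again.
    return {i: probs[i] / total for i in kept}


confident = [0.90, 0.06, 0.02, 0.01, 0.01]  # hypothetical: model is fairly sure
uncertain = [0.20, 0.20, 0.20, 0.20, 0.20]  # hypothetical: model has no favorite

print(len(top_p_filter(confident)), "tokens survive when confident")
print(len(top_p_filter(uncertain)), "tokens survive when uncertain")
```

Sampling then proceeds over the surviving tokens only, so the long tail of unlikely tokens never gets picked.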

RLHF: Teaching AI to Be Helpful

The InstructGPT paper is the blueprint for how base models become products. Human raters compare model outputs -- "Which response is better?" -- and a reward model learns to predict their preferences. PPO (Proximal Policy Optimization) then adjusts the language model's weights to maximize the reward model's score, while a KL divergence penalty prevents it from going off the rails.
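Two pieces of that recipe reduce to short formulas. The sketch below assumes the standard pairwise formulation, where the reward model is trained so that the preferred response scores higher, and a simple per-sample KL estimate based on log-probabilities. The function names, the example numbers, and the `beta` value are illustrative assumptions, and the PPO update itself is not shown.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Reward-model loss for one comparison: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))  # algebraically equal to -log(sigmoid(margin))


def kl_penalized_reward(reward, logprob_policy, logprob_reference, beta=0.1):
    """Reward actually optimized: reward-model score minus a penalty for drifting.

    logprob_policy - logprob_reference is a per-sample estimate of how far the
    tuned model has moved from the reference (SFT) model. beta is a hypothetical value.
    """
    return reward - beta * (logprob_policy - logprob_reference)


# The loss is small when the reward model already ranks the chosen response higher.
print(pairwise_preference_loss(2.0, 0.0))
print(pairwise_preference_loss(0.0, 2.0))

# A response the tuned model likes far more than the reference model gets docked.
print(kl_penalized_reward(reward=1.0, logprob_policy=-1.0, logprob_reference=-3.0))
```

The penalty term is what "prevents it from going off the rails": a response that scores well with the reward model but is wildly unlikely under the original model earns less than its raw score.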

Anthropic's Constitutional AI takes a different approach: replace most human raters with AI self-critique, guided by a written set of principles (the "constitution"). The AI generates a response, critiques it against a principle, and revises it; the revised responses are used for fine-tuning. AI-generated comparisons between responses then train the reward model. It scales better -- you don't need thousands of human labels per iteration -- and the principles are explicit and auditable.

The uncomfortable story: A TIME investigation (January 2023) revealed that OpenAI outsourced content labeling to Sama, a San Francisco-based firm that employs workers in Kenya, Uganda, and India. Workers were paid $1.32-$2 per hour to label graphic content including descriptions of child abuse, violence, and self-harm. Workers reported panic attacks, insomnia, and PTSD-like symptoms. When exposed, Sama laid off over 200 Kenyan workers. This is the hidden human cost of "aligned AI" -- someone had to read the worst of the internet so the model could learn to avoid it.

AlphaGo and Move 37

Go has more possible board positions (~10^170) than atoms in the observable universe (~10^80). Unlike chess, brute-force search is impossible -- you need intuition. In 2015, experts believed AI was "10 years away" from defeating top professionals. It happened in March 2016.

The match: AlphaGo vs Lee Sedol, 18-time world champion, at the Four Seasons Hotel in Seoul. Five games. $1 million prize (donated to UNICEF). AlphaGo won 4-1.

The moment that changed everything was Game 2, Move 37. With its 19th stone, AlphaGo played on the fifth line -- a shoulder hit where no human professional would have played. AlphaGo had calculated there was a 1-in-10,000 chance a human would make this move. But when its search evaluated the likely continuations from this position, it determined the move was correct.

"That's a very strange move."
-- Michael Redmond (9-dan commentator), WIRED

Lee Sedol left the room for 15 minutes. Fan Hui, the European champion who had earlier lost 5-0 to AlphaGo and was watching as commentator, kept saying: "So beautiful." After the game, Lee Sedol said simply: "Yesterday I was surprised, but today I am quite speechless."

But the story has a human counterpoint. In Game 4, Lee Sedol played White 78 -- a move Gu Li (9-dan) called a "divine move." It was the only game a human ever won against AlphaGo in a formal match. Three years later, Lee Sedol retired from professional Go with the words: "Even if I become the number one, there is an entity that cannot be defeated."

Fan Hui's arc is perhaps the most beautiful. After training with AlphaGo, his world ranking improved from 633rd to approximately 300th. He didn't crumble. He learned.

Pop Culture Connection

In the 1983 film WarGames, the AI WOPR concludes that the only winning move in nuclear war is not to play. Lee Sedol's retirement carries a similar melancholy -- except he found that the only winning move for humans in Go is to play differently, inspired by what the machine revealed. The AlphaGo documentary (100% on Rotten Tomatoes) captures this tension with extraordinary intimacy.

Watch: AlphaGo -- The Movie (2017). The Move 37 sequence begins at approximately 49:00. The full documentary is one of the best films about artificial intelligence ever made.

Emergent Abilities

Emergence is when models do things nobody explicitly trained them to do: capabilities that are absent in smaller models and present in larger ones. Wei et al. (2022) cataloged 137 emergent abilities across different benchmarks. The pattern: below a certain scale, performance is near-random. Above it, the ability appears to "switch on" -- like water freezing at 0°C.

The most striking example: chain-of-thought reasoning. Kojima et al. (2022) discovered that simply adding "Let's think step by step" to a prompt boosted accuracy on MultiArith from 17.7% to 78.7% -- a 61 percentage point improvement from five words. They tried alternatives ("Let's solve this by splitting it into steps," "Let's think logically"), but "Let's think step by step" was consistently the most effective.
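In practice the technique is just string construction. The sketch below assumes a simple "Q: ... A: ..." template with the trigger phrase placed at the start of the answer; the template layout, the function name, and the sample question are illustrative assumptions, and no model call is shown.

```python
COT_TRIGGER = "Let's think step by step."


def zero_shot_cot_prompt(question: str) -> str:
    """Append the trigger phrase so the model begins its answer by reasoning aloud."""
    return f"Q: {question}\nA: {COT_TRIGGER}"


# Hypothetical arithmetic word problem, in the spirit of MultiArith.
question = "A baker makes 4 trays of 12 rolls and sells 30. How many rolls are left?"
print(zero_shot_cot_prompt(question))
```

The prompt ends mid-answer on purpose: the model continues from the trigger phrase, writing out intermediate steps before it commits to a number.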

137: Documented emergent abilities
17.7% → 78.7%: "Let's think step by step" on MultiArith

But is emergence real? A Stanford paper (Schaeffer et al., 2023) argued many "emergent" abilities are measurement artifacts: when you use discontinuous metrics (like exact string match), abilities look like they switch on. When you use continuous metrics, performance improves smoothly. Of 29 metrics examined, 25 showed no emergent properties with continuous measurement. The debate is unresolved -- some abilities genuinely seem to phase-transition at scale, others may just look that way because of how we measure.
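The measurement argument is easy to see with arithmetic. The sketch below assumes a hypothetical task whose answer is 10 tokens long and assumes each token is right or wrong independently; both are simplifications chosen for illustration, not figures from the paper.

```python
def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Chance that every token in the answer is right, assuming independent tokens."""
    return per_token_accuracy ** answer_length


ANSWER_LENGTH = 10  # hypothetical answer length in tokens

# Per-token accuracy (a continuous metric) climbs steadily; exact match (a
# discontinuous metric) stays near zero and then rises steeply.
for accuracy in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99):
    match = exact_match_rate(accuracy, ANSWER_LENGTH)
    print(f"per-token accuracy {accuracy:.2f} -> exact match {match:.3f}")
```

The underlying skill improves gradually in every row, yet the exact-match column looks like something switching on. That is the sense in which a metric can manufacture an apparent jump.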

There's also the Reversal Curse: a model trained on "A is B" often can't answer the reversed question, "Who is B?" This hints at a possible limit -- these models may not be building the kind of flexible knowledge structures humans use.

History Thread

ChatGPT launched on November 30, 2022 as a "research preview" with zero fanfare. Sam Altman's tweet five days later: "ChatGPT launched on wednesday. today it crossed 1 million users!" By January 2023 it had 100 million monthly users -- the fastest-growing consumer app in history at that point. TikTok took about 9 months to reach the same mark. Instagram took about 2.5 years. By November 2025: 800 million weekly users. The team never saw it coming. As John Schulman said: "I expected it to be intuitive for people... but I didn't expect it to reach this level of mainstream popularity."

Key Terms

Inference: Running a trained model to generate output -- as opposed to training, which updates the model's weights.
Fine-tuning: Additional training on curated data to adapt a pre-trained model for specific tasks or behaviors.
RLHF: Reinforcement Learning from Human Feedback -- aligning models to human preferences by training a reward model on human comparisons.
Temperature: Parameter controlling randomness in generation. 0 = deterministic (always pick the most likely token). 1 = proportional sampling.
Emergence: Capabilities that appear at scale without being explicitly trained -- absent in smaller models, present in larger ones.
Chain-of-thought: Prompting technique where asking the model to "think step by step" dramatically improves reasoning performance.
Alignment: The challenge of getting AI systems to reliably pursue the goals humans actually intend, not unintended objectives.

Did This Land?

What's the difference between pre-training and fine-tuning?

Pre-training is the initial phase where the model learns broad language patterns from massive text data (predict the next token, billions of examples, months of compute). Fine-tuning is the subsequent phase where the model is adapted for specific behavior using curated examples -- teaching it to follow instructions, be helpful, and avoid harmful outputs. Pre-training gives knowledge; fine-tuning gives manners.

Why did Move 37 shock the Go community?

Move 37 had a 1-in-10,000 probability from the perspective of human play -- no professional would have considered it. But AlphaGo's search evaluated where the move was likely to lead and determined it was correct. It was a genuinely novel move in a game with more than 2,500 years of human analysis, contradicting centuries of accumulated wisdom. It wasn't just good -- it was creative in a way nobody expected from a machine.

Why does the same prompt sometimes give different answers?

Because the model samples from a probability distribution, not a lookup table. At temperature > 0, there's randomness in which token gets selected at each step. "Paris" might be 94% likely as the next token, but 6% of the time the model picks something else -- and once it's on a different path, the entire response diverges. Same starting point, different random rolls.

Lesson Summary

The path from base model to product is pre-training (broad knowledge), SFT (format of helpfulness), and RLHF (actual preferences). A 1.3B model with RLHF beat 175B GPT-3 without it.
Inference is a four-step loop: tokenize your input, run a forward pass through transformer layers, sample from a probability distribution, detokenize to text. Repeat for each token.
Temperature controls how adventurous the sampling is. Top-p limits which tokens are even considered. This is why the same prompt gives different answers.
RLHF aligns models to human preferences, but the human cost is real -- Kenyan workers earning $1.32-$2/hr labeled the internet's worst content to make these models safe.
AlphaGo's Move 37 remains a landmark: the moment many people first saw AI produce something genuinely novel in a domain with millennia of human expertise. Emergent abilities suggest this novelty extends to language models too -- though whether emergence is real or a measurement artifact is still debated.