Lesson 3 (12 min read)

Training -- Teaching Machines to Learn

How does a network that starts with random noise learn to recognize cats, translate languages, or write code? The answer involves a mountain, some fog, and an idea that had to be invented at least four times before it changed the world.

What you'll understand

What training actually means -- adjusting weights to minimize loss
Gradient descent and backpropagation, intuitively
The practical hyperparameters: learning rate, epochs, batch size
Why the ImageNet moment of 2012 mattered so much

The entire field of deep learning was saved by an idea that was invented, ignored, and reinvented at least four separate times. The core idea -- that you can work backwards through a neural network to figure out which weights caused the error -- appeared in its modern form in a Finnish master's thesis in 1970, where it stayed buried in obscurity. It was reinvented by an American PhD student in 1974 and ignored again. It took until 1986, when Geoffrey Hinton and colleagues published it in Nature, for the idea to finally stick.

That idea is called backpropagation. Without it, there is no ChatGPT, no image generation, no modern AI. And the story of how neural networks learn is, in large part, the story of this one algorithm.

What Training Actually Means

Training a neural network is a five-step loop, repeated millions of times:

Feed data in -- run an input through the network (the forward pass)
Measure how wrong it is -- compare the prediction to the right answer
Figure out who's to blame -- which of the millions of weights caused the error?
Nudge those weights -- adjust them slightly in the right direction
Repeat -- do this for every example in the dataset, many times over

A network starts with random weights -- its predictions are garbage. Through training, it gradually improves, like a student learning from practice problems. The "textbook" is the training data. The "grade" is the loss function: a formula measuring how wrong the model's prediction was, a single number that captures the gap between what the model predicted and the correct answer. The "study strategy" is gradient descent: an optimization algorithm that adjusts weights in the direction of steepest error reduction, step by step.
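To make the five steps concrete, here is a minimal sketch of the loop for a "network" with a single weight. The toy dataset, the starting weight of 0.5, and the function name train are all assumptions chosen for illustration; a real network has millions of weights and relies on backpropagation for step 3.

```python
# Toy dataset (an assumption for this sketch): the right answer is always 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def train(data, epochs=50, learning_rate=0.05):
    weight = 0.5  # stand-in for a random starting weight
    for _ in range(epochs):                            # 5. repeat over the dataset
        for x, target in data:
            prediction = weight * x                    # 1. forward pass
            loss = (prediction - target) ** 2          # 2. measure how wrong it is
            gradient = 2 * (prediction - target) * x   # 3. blame: d(loss)/d(weight)
            weight -= learning_rate * gradient         # 4. nudge the weight
    return weight

trained_weight = train(data)
print(trained_weight)  # ends up very close to 2.0, the pattern hidden in the data
```

Nothing in the loop is told that the answer is "multiply by 2". The weight drifts there because every nudge reduces the error a little.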

Loss Functions: Your Model's Report Card

A loss function boils down the model's performance to a single number -- how wrong was that prediction? The entire goal of training is to make this number as small as possible.

MSE

Mean Squared Error -- for regression. Off by 10 = 100x the penalty of off by 1.

Cross-Entropy

For classification. Bet 99% "cat" when it's a dog? You lose big.

The choice of loss function defines what "good" means for a given task. Without a clear measure of "how wrong am I?", there's no way to improve. It shapes what the network optimizes for -- and by extension, what the network becomes.
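The two loss functions above can be sketched in a few lines. This is a simplified illustration, not a library implementation: the function names are made up for this example, and cross_entropy handles a single prediction given as a dictionary of class probabilities.

```python
import math

def mean_squared_error(predictions, targets):
    # Average of the squared gaps between predictions and correct answers.
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(predicted_probs, true_class):
    # Penalty is -log of the probability the model gave to the correct class.
    return -math.log(predicted_probs[true_class])

off_by_1 = mean_squared_error([11.0], [10.0])    # squared error of 1
off_by_10 = mean_squared_error([20.0], [10.0])   # squared error of 100

# The picture is a dog. One model bets 99% on "cat", the other 99% on "dog".
confident_wrong = cross_entropy({"cat": 0.99, "dog": 0.01}, "dog")  # about 4.6
confident_right = cross_entropy({"cat": 0.01, "dog": 0.99}, "dog")  # about 0.01

print(off_by_10 / off_by_1)
print(confident_wrong, confident_right)
```

The confident wrong answer is penalized hundreds of times more heavily than the confident right one, which is exactly the "you lose big" behavior described above.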

Gradient Descent: Hiking Down a Mountain in Fog

Imagine you're on a mountain in thick fog. You can't see the valley floor. You can only feel the slope beneath your feet. Your strategy: take a step in the steepest downhill direction. Repeat. Eventually you reach the bottom.

That's gradient descent.

The gradient is a vector that answers, for each of the millions of weights: "If I nudge this weight by a tiny amount, how much does the loss change?" It points in the direction of steepest ascent -- so you move in the opposite direction to reduce the loss.

new_weight = old_weight - learning_rate * gradient

That's the entire update rule. Every modern AI system -- from GPT-3 to image generators to self-driving cars -- is trained with some variant of this formula.
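Here is the update rule running on the simplest possible "mountain": a one-dimensional loss curve whose lowest point sits at weight = 3. The loss function, starting point, and learning rate are assumptions picked for this sketch.

```python
def loss(weight):
    # A bowl-shaped curve with its minimum at weight = 3.
    return (weight - 3.0) ** 2

def gradient(weight):
    # Slope of the loss at this weight: positive means "uphill to the right".
    return 2.0 * (weight - 3.0)

def gradient_descent(start, learning_rate=0.1, steps=100):
    weight = start
    for _ in range(steps):
        # new_weight = old_weight - learning_rate * gradient
        weight = weight - learning_rate * gradient(weight)
    return weight

final_weight = gradient_descent(start=-4.0)
print(final_weight, loss(final_weight))  # weight ends near 3, loss near 0
```

The hiker never sees the valley. Each step uses only the local slope, yet the weight still ends up at the bottom of the bowl.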

[Interactive visualization: loss plotted over weight space. Gradient descent: the ball follows the steepest downhill path, and the loss value drops as it finds the minimum.]

In practice, three flavors exist. Batch gradient descent computes the gradient using all training data -- accurate but slow. Stochastic gradient descent (SGD) uses one random sample at a time -- noisy but fast. Mini-batch (typically 32-256 samples) is the practical sweet spot that balances accuracy and speed. Most modern training uses mini-batch SGD or its descendants.
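The three flavors differ only in how many samples feed each gradient estimate. A rough sketch, assuming a toy dataset of 8 samples and a hypothetical helper named iterate_batches:

```python
import random

def iterate_batches(samples, batch_size, seed=0):
    # Shuffle a copy of the data, then hand it out in chunks of batch_size.
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

def batch_gradient(weight, batch):
    # Mean gradient of the squared error (weight * x - y) ** 2 over one batch.
    return sum(2 * (weight * x - y) * x for x, y in batch) / len(batch)

samples = [(float(x), 2.0 * x) for x in range(1, 9)]  # 8 toy samples, y = 2x

full_batch = list(iterate_batches(samples, batch_size=len(samples)))  # batch GD
mini_batches = list(iterate_batches(samples, batch_size=4))           # mini-batch
single_samples = list(iterate_batches(samples, batch_size=1))         # classic SGD

# Weight updates per pass over the data: 1, 2, and 8 respectively.
print(len(full_batch), len(mini_batches), len(single_samples))
```

Smaller batches give more updates per pass, but each gradient is estimated from fewer samples, so it is noisier.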

3Blue1Brown -- "Gradient descent, how neural networks learn." The best visual explanation of the core algorithm behind all of deep learning.

Backpropagation: The Breakthrough That Changed Everything

Here's the problem gradient descent leaves you with: the network has millions of weights, and they all contributed to the error. Which ones deserve the blame, and how much? This is called the credit assignment problem, and it's the hard part.

Backpropagation solves it by working backwards. It is a method for computing each weight's contribution to the error using the chain rule of calculus: starting from the output, it propagates the error signal through each layer to figure out exactly how much each weight contributed to the final mistake. Then gradient descent uses that information to update every weight proportionally.
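A hand-written example shows the chain rule at work. This sketch assumes a tiny network with one hidden unit and two weights (w1 and w2), a tanh activation, and a squared-error loss. None of these choices come from a specific library, and this is not micrograd's code. The backward function walks from the loss back to each weight, and numerical_gradient checks the result by nudging each weight directly.

```python
import math

def forward(w1, w2, x, target):
    hidden = math.tanh(w1 * x)       # layer 1
    output = w2 * hidden             # layer 2
    loss = (output - target) ** 2    # how wrong the prediction was
    return hidden, output, loss

def backward(w1, w2, x, target):
    hidden, output, _ = forward(w1, w2, x, target)
    d_output = 2.0 * (output - target)          # d(loss)/d(output)
    d_w2 = d_output * hidden                    # blame for w2
    d_hidden = d_output * w2                    # error passed back to layer 1
    d_w1 = d_hidden * (1.0 - hidden ** 2) * x   # through tanh, then to w1
    return d_w1, d_w2

def numerical_gradient(w1, w2, x, target, eps=1e-6):
    # Brute-force check: nudge each weight and watch how the loss changes.
    d_w1 = (forward(w1 + eps, w2, x, target)[2]
            - forward(w1 - eps, w2, x, target)[2]) / (2 * eps)
    d_w2 = (forward(w1, w2 + eps, x, target)[2]
            - forward(w1, w2 - eps, x, target)[2]) / (2 * eps)
    return d_w1, d_w2

analytic = backward(0.5, -1.5, 2.0, 1.0)
numeric = numerical_gradient(0.5, -1.5, 2.0, 1.0)
print(analytic, numeric)  # the two pairs should agree to several decimal places
```

The brute-force check needs two extra forward passes per weight, which is hopeless with millions of weights. Backpropagation gets every gradient from one forward pass and one backward pass.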

Andrej Karpathy spent eight years figuring out the best way to explain this, culminating in his micrograd project: just 94 lines of Python that implement everything needed to train a neural network.

"These 94 lines of code are everything that is needed to train a neural network. Everything else is just efficiency."
-- Andrej Karpathy, on micrograd

History Thread

Backpropagation was invented at least four times before it stuck:

1960

Henry J. Kelley publishes a precursor in control theory.

1970

Seppo Linnainmaa, a Finnish master's student, invents the modern version (reverse-mode automatic differentiation) in his thesis. Published in Finnish, it goes largely unnoticed. As of 2020, all major frameworks, including TensorFlow and PyTorch, are based on his method.

1974

Paul Werbos programs backpropagation into his Harvard PhD thesis. Also largely ignored.

1986

Rumelhart, Hinton & Williams publish "Learning representations by back-propagating errors" in Nature. This paper demonstrated that backprop could train multi-layer networks, learn useful internal representations, and scale to practical problems. It "saved neural networks from extinction and became the foundation for modern AI."

The Practical Knobs

Training involves three critical hyperparameters. Getting them right is, as practitioners often say, "part science, part art" -- the "dark arts" of deep learning.

Learning Rate

How much weights change per update -- the step size. Too high and the network overshoots the minimum, bouncing wildly or diverging. Too low and it crawls toward the minimum, taking forever to converge. Typical values: 0.001 to 0.01.
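The overshoot-versus-crawl tradeoff is easy to see on a toy problem. This sketch assumes the loss is simply weight squared (minimum at 0). The three learning rates are chosen to exaggerate the effect on this toy curve and are not recommendations for real networks.

```python
def run_descent(learning_rate, start=1.0, steps=20):
    # Minimize loss = weight ** 2, whose gradient is 2 * weight.
    weight = start
    for _ in range(steps):
        weight = weight - learning_rate * (2.0 * weight)
    return weight

too_high = run_descent(learning_rate=1.1)    # each step overshoots and grows
too_low = run_descent(learning_rate=0.001)   # barely moves from the start
reasonable = run_descent(learning_rate=0.1)  # close to the minimum at 0

print(too_high, too_low, reasonable)
```

With the high rate, every step lands further from the minimum than the last, on alternating sides. With the low rate, 20 steps barely leave the starting point.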

Epochs

One complete pass through the entire training dataset. Too few: underfitting -- the model hasn't learned the patterns yet. Too many: overfitting -- the model memorizes the training data instead of learning general patterns, like a student who memorizes specific test answers instead of understanding the concepts, and then fails on anything new. Like re-reading a textbook: the first pass gets broad strokes, the tenth catches nuances, but the thousandth just memorizes the words.

Batch Size

Number of training samples processed before updating weights. Small batches (32) mean more frequent updates and often better generalization. Large batches (1024+) are faster computationally but may converge to sharper minima. The sweet spot for most tasks: 32-256.

The number of epochs sets the duration ("how long do we train?"). Batch size determines frequency ("how often do we update?"). Learning rate controls magnitude ("how much do we change?"). Together they largely determine the practical dynamics of training.
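The way these knobs interact comes down to simple arithmetic: the total number of weight updates depends on dataset size, batch size, and epochs. A small sketch, assuming a hypothetical dataset of 50,000 samples and a helper named count_weight_updates:

```python
import math

def count_weight_updates(num_samples, batch_size, epochs):
    # One update per batch; the last batch of an epoch may be smaller.
    updates_per_epoch = math.ceil(num_samples / batch_size)
    return updates_per_epoch * epochs

small_batch_updates = count_weight_updates(50_000, batch_size=32, epochs=10)
large_batch_updates = count_weight_updates(50_000, batch_size=256, epochs=10)

print(small_batch_updates)  # 15630
print(large_batch_updates)  # 1960
```

Same data, same number of epochs, but the smaller batch size produces roughly eight times as many weight updates. That is why batch size and learning rate are usually tuned together rather than in isolation.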

The ImageNet Moment (2012)

Fei-Fei Li saw something in 2006 that almost no one else did. While the entire AI research community was focused on building better algorithms, she focused on data.

"The paradigm shift of the ImageNet thinking is that while a lot of people are paying attention to models, let's pay attention to data. Data will redefine how we think about models."
-- Fei-Fei Li, Quartz interview

Her colleagues were skeptical. Spending years collecting data instead of publishing algorithm papers seemed like career suicide for a junior professor. She did it anyway.

14.2M images in ImageNet

21,841 categories (synsets)

The Godfathers and Their Long Bet

Geoffrey Hinton, Yann LeCun, and Yoshua Bengio kept faith in neural networks through decades when the approach was considered a dead end.

"There was a dark period between the mid-90s and early-to-mid-2000s when it was impossible to publish research on neural nets, because the community had lost interest in it."
-- Yann LeCun, Understanding AI

"The three of us all knew that it would be the ultimate answer."
-- Geoffrey Hinton, Neural Buddies

Their persistence was rewarded with the 2018 Turing Award (the "Nobel Prize of Computing"). And in 2024, Hinton went further -- winning the Nobel Prize in Physics alongside John Hopfield for "foundational discoveries and inventions that enable machine learning with artificial neural networks."

"If you believe in something, don't give up on it until you understand why that belief is wrong."
-- Geoffrey Hinton, Nobel Prize speech, 2024

The Cost of Training

Training costs have skyrocketed as models scale:

Model	Year	Parameters	Training Cost
Original Transformer	2017	~65M	~$900
GPT-3	2020	175B	$4.6M+
GPT-4	2023	~1.8T (rumored)	~$78M
Llama 3.1 405B	2024	405B	~$170M
Gemini Ultra	2024	Unknown	~$191M
DeepSeek V3	2024	~671B (MoE)	$5.6M

Notice the outlier: DeepSeek V3 achieved competitive results for $5.6M -- about 30x cheaper than Llama 3.1. Training efficiency matters as much as raw scale. Hardware improvements (H200/B200 GPUs) have also driven a 45% cost reduction for standard model sizes. But frontier models keep getting bigger faster than hardware gets cheaper.

Pop Culture Connection

In the 2014 film The Imitation Game, Alan Turing tells his team: "Sometimes it is the people no one can imagine anything of who do the things no one can imagine." The parallel to backpropagation's history is hard to miss -- a Finnish student, an ignored Harvard thesis, decades in the wilderness. The breakthrough wasn't a flash of genius. It was the same idea, persisting until the world was ready for it.

Key Terms

Training: The process of adjusting a model's weights to minimize prediction error, through repeated exposure to labeled data.
Loss function: A formula measuring how wrong the model's prediction was -- the single number training tries to minimize.
Gradient descent: The optimization algorithm that adjusts weights in the direction of steepest error reduction, step by step.
Backpropagation: A method for computing each weight's contribution to the error by working backwards through the network using the chain rule of calculus.
Learning rate: How much weights change per update -- the step size in gradient descent. Too high = overshoot; too low = crawl.
Epoch: One complete pass through the entire training dataset.
Overfitting: When a model memorizes training data instead of learning general patterns -- performing well on training data but poorly on new data.

Did This Land?

What does the loss function measure?

How wrong the model's prediction was -- a single number that captures the gap between what the model predicted and the correct answer. The entire goal of training is to make this number as small as possible.

Why was the 1986 Nature paper so important?

Rumelhart, Hinton, and Williams demonstrated that backpropagation could train multi-layer networks, learn useful internal representations automatically, and scale to practical problems. The same core idea had been published before (1960, 1970, 1974), but the 1986 paper brought it to widespread attention and revived the neural network field from near-extinction.

What made AlexNet different from previous ImageNet entries?

Previous entries used hand-engineered features (designed by humans) fed into traditional classifiers. AlexNet used a deep neural network trained end-to-end on raw pixels -- letting the network learn its own features. It also pioneered GPU training (two GTX 580s), ReLU activation (6x faster training), and dropout regularization. The result: a 15.3% top-5 error rate vs. 26.2% for the runner-up.

Lesson Summary

Training is a loop: forward pass, measure loss, backpropagate blame, update weights, repeat millions of times.
Gradient descent navigates the loss landscape by stepping in the direction of steepest descent -- like hiking down a mountain in fog.
Backpropagation solves the credit assignment problem by working backwards through the network with the chain rule -- and it took four inventions across 26 years before it stuck.
Learning rate, epochs, and batch size are the three knobs that control training dynamics. Getting them right is part science, part dark art.
The ImageNet moment (AlexNet, 2012) proved that deep learning plus GPUs could demolish hand-engineered approaches -- and launched the revolution we're living through now.