Training -- Teaching Machines to Learn
How does a network that starts with random noise learn to recognize cats, translate languages, or write code? The answer involves a mountain, some fog, and an idea that had to be invented four times before it changed the world.
What you'll understand
- What training actually means -- adjusting weights to minimize loss
- Gradient descent and backpropagation, intuitively
- The practical hyperparameters: learning rate, epochs, batch size
- Why the ImageNet moment of 2012 mattered so much
The entire field of deep learning was saved by an idea that was ignored and reinvented four separate times. The core idea -- that you can work backwards through a neural network to figure out which weights caused the error -- was first published in 1970 in a Finnish master's thesis by Seppo Linnainmaa, where it sat in obscurity. It was reinvented in 1974 by an American PhD student, Paul Werbos, and ignored again. It took until 1986, when David Rumelhart, Geoffrey Hinton, and Ronald Williams published it in Nature, for the idea to finally stick.
That idea is called backpropagation. Without it, there is no ChatGPT, no image generation, no modern AI. And the story of how neural networks learn is, in large part, the story of this one algorithm.
What Training Actually Means
Training a neural network is a five-step loop, repeated millions of times:
- Feed data in -- run an input through the network (the forward pass)
- Measure how wrong it is -- compare the prediction to the right answer
- Figure out who's to blame -- which of the millions of weights caused the error?
- Nudge those weights -- adjust them slightly in the right direction
- Repeat -- do this for every example in the dataset, many times over
A network starts with random weights -- its predictions are garbage. Through training, it gradually improves, like a student learning from practice problems. The "textbook" is the training data. The "grade" is the loss function: a formula that captures the gap between what the model predicted and the correct answer as a single number. The "study strategy" is gradient descent: an optimization algorithm that adjusts weights in the direction of steepest error reduction, step by step.
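The five-step loop above can be sketched in a few lines. This is a minimal illustration, not how real networks are trained: it assumes a "network" with a single weight `w`, a model `y = w * x`, and squared error as the loss. The function name `train` and the toy data are made up for the example.

```python
import random

def train(data, learning_rate=0.05, epochs=200, seed=0):
    """Fit y ~ w * x with the five-step loop, using squared error as the loss."""
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)          # start from a random weight
    for _ in range(epochs):             # 5. repeat over the dataset many times
        for x, y_true in data:
            y_pred = w * x              # 1. forward pass
            error = y_pred - y_true     # 2. measure how wrong it is
            grad = 2 * error * x        # 3. blame: derivative of error**2 w.r.t. w
            w -= learning_rate * grad   # 4. nudge the weight downhill
    return w

data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]   # toy data generated by y = 3x
w = train(data)
print(round(w, 3))                             # ends up close to 3.0
```

With one weight, "figuring out who's to blame" is a one-line derivative. With millions of weights, that step is the hard part.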
Loss Functions: Your Model's Report Card
A loss function boils down the model's performance to a single number -- how wrong was that prediction? The entire goal of training is to make this number as small as possible.
The choice of loss function defines what "good" means for a given task. Without a clear measure of "how wrong am I?", there's no way to improve. It shapes what the network optimizes for -- and by extension, what the network becomes.
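As a concrete illustration, here are two widely used loss functions. The text above does not name them, so treat the choice as an assumption made for the example: mean squared error is common for predicting numbers, and cross-entropy for picking a class.

```python
import math

def mean_squared_error(predictions, targets):
    """Average squared gap between predictions and targets (regression)."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(predicted_probs, true_class):
    """Negative log of the probability assigned to the correct class (classification)."""
    return -math.log(predicted_probs[true_class])

print(mean_squared_error([2.5, 0.0], [3.0, -1.0]))  # (0.25 + 1.0) / 2 = 0.625
print(cross_entropy([0.7, 0.2, 0.1], 0))            # mostly right: small loss
print(cross_entropy([0.1, 0.2, 0.7], 0))            # confidently wrong: large loss
```

Both reduce a prediction to one number, but they punish different mistakes. Cross-entropy grows sharply when the model is confidently wrong, which is one way the loss shapes what the network becomes.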
Gradient Descent: Hiking Down a Mountain in Fog
Imagine you're on a mountain in thick fog. You can't see the valley floor. You can only feel the slope beneath your feet. Your strategy: take a step in the steepest downhill direction. Repeat. Eventually you reach the bottom.
That's gradient descent.
The gradient is a vector that answers, for each of the millions of weights: "If I nudge this weight by a tiny amount, how much does the loss change?" It points in the direction of steepest ascent -- so you move in the opposite direction to reduce the loss.
new_weight = old_weight - learning_rate * gradient
That's the entire update rule. Every modern AI system -- from GPT-3 to image generators to self-driving cars -- is trained with some variant of this formula.
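A minimal sketch of that rule in action, assuming a toy loss `(w - 4)**2` whose minimum we already know sits at `w = 4`. The function names are hypothetical. Real losses depend on millions of weights, but the update line is the same.

```python
def loss(w):
    return (w - 4.0) ** 2          # toy loss with its minimum at w = 4

def gradient(w):
    return 2.0 * (w - 4.0)         # derivative of the loss with respect to w

def gradient_descent(w, learning_rate=0.1, steps=100):
    for _ in range(steps):
        # new_weight = old_weight - learning_rate * gradient
        w = w - learning_rate * gradient(w)
    return w

w_final = gradient_descent(w=0.0)
print(w_final, loss(w_final))      # w is very close to 4, loss very close to 0
```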
Figure: Gradient descent. The ball follows the steepest downhill path, and the loss value drops as it finds the minimum.
In practice, three flavors exist. Batch gradient descent computes the gradient using all training data -- accurate but slow. Stochastic gradient descent (SGD) uses one random sample at a time -- noisy but fast. Mini-batch (typically 32-256 samples) is the practical sweet spot that balances accuracy and speed. Most modern training uses mini-batch SGD or its descendants.
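The three flavors differ only in how many samples feed each gradient estimate. The sketch below makes that explicit with a hypothetical `minibatches` helper and a one-weight toy model `y = w * x`. Setting `batch_size` to the dataset size gives batch gradient descent, and setting it to 1 gives SGD.

```python
import random

def minibatches(data, batch_size, seed=0):
    """Shuffle a copy of the data, then yield it in chunks of batch_size."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

def batch_gradient(w, batch):
    """Average squared-error gradient for the toy model y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train_one_epoch(w, data, batch_size, learning_rate=0.005):
    """Run one pass over the data; return the new weight and the update count."""
    updates = 0
    for batch in minibatches(data, batch_size):
        w -= learning_rate * batch_gradient(w, batch)
        updates += 1
    return w, updates

data = [(k / 10, 3 * k / 10) for k in range(1, 101)]   # 100 points from y = 3x

results = {}
for name, size in [("batch", len(data)), ("stochastic", 1), ("mini-batch", 32)]:
    results[name] = train_one_epoch(0.0, data, size)
    w, updates = results[name]
    print(f"{name:>10}: batch_size={size:<3} updates={updates:<3} w={w:.3f}")
```

On 100 samples, one epoch yields 1 update for batch, 100 for stochastic, and 4 for mini-batches of 32.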
Backpropagation: The Breakthrough That Changed Everything
Here's the problem gradient descent leaves you with: the network has millions of weights, and they all contributed to the error. Which ones deserve the blame, and how much? This is called the credit assignment problem, and it's the hard part.
Backpropagation solves it by working backwards. Starting from the output, it propagates the error signal through each layer, using the chain rule of calculus to figure out exactly how much each weight contributed to the final mistake. Then gradient descent uses that information to update every weight proportionally.
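To make the chain rule concrete, here is a hand-written backward pass for the smallest network that has a "backwards" at all: one hidden unit and one output. The architecture (`tanh` hidden unit, squared-error loss) and all names are assumptions chosen for illustration. The finite-difference check at the end is a common way to confirm that the analytic gradients are right.

```python
import math

def forward(w1, w2, x, target):
    h = math.tanh(w1 * x)          # hidden activation
    y = w2 * h                     # network output
    loss = (y - target) ** 2
    return h, y, loss

def backward(w1, w2, x, target):
    """Apply the chain rule from the loss back to each weight."""
    h, y, _ = forward(w1, w2, x, target)
    dloss_dy = 2 * (y - target)                # how the loss changes with the output
    dloss_dw2 = dloss_dy * h                   # y = w2 * h, so dy/dw2 = h
    dloss_dh = dloss_dy * w2                   # error signal passed back to the hidden unit
    dloss_dw1 = dloss_dh * (1 - h ** 2) * x    # tanh'(z) = 1 - tanh(z)**2, and dz/dw1 = x
    return dloss_dw1, dloss_dw2

def numeric_gradient(w1, w2, x, target, eps=1e-6):
    """Finite-difference check: nudge each weight and watch the loss."""
    g1 = (forward(w1 + eps, w2, x, target)[2] - forward(w1 - eps, w2, x, target)[2]) / (2 * eps)
    g2 = (forward(w1, w2 + eps, x, target)[2] - forward(w1, w2 - eps, x, target)[2]) / (2 * eps)
    return g1, g2

analytic = backward(0.5, -0.3, x=1.5, target=1.0)
numeric = numeric_gradient(0.5, -0.3, x=1.5, target=1.0)
print(analytic)
print(numeric)                                 # should agree to several decimal places
```

The numeric check needs two extra forward passes per weight, which is hopeless at scale. Backpropagation gets every gradient from a single backward sweep.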
Andrej Karpathy spent eight years figuring out the best way to explain this, culminating in his micrograd project: just 94 lines of Python that implement everything needed to train a neural network.
"These 94 lines of code are everything that is needed to train a neural network. Everything else is just efficiency."
-- Andrej Karpathy, on micrograd
Backpropagation was invented at least four times before it stuck.
The Practical Knobs
Training involves three critical hyperparameters. Getting them right is, as practitioners often say, "part science, part art" -- the "dark arts" of deep learning.
Learning Rate
How much weights change per update -- the step size. Too high and the network overshoots the minimum, bouncing wildly or diverging. Too low and it crawls toward the minimum, taking forever to converge. Typical starting values range from 0.001 to 0.01, though the best value depends on the optimizer and the model.
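The overshoot-versus-crawl trade-off is easy to see on a toy problem. The sketch below assumes the loss `w**2` (gradient `2 * w`, minimum at 0). The three learning rates are chosen to show each regime on this particular loss and are not recommendations for real networks.

```python
def run(learning_rate, steps=50, w=1.0):
    """Minimize loss(w) = w**2 by gradient descent; return the final distance from 0."""
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return abs(w)

for lr in (1.1, 0.1, 0.001):
    print(f"learning_rate={lr}: distance from minimum after 50 steps = {run(lr):.4g}")
```

With `1.1` each step overshoots and lands farther away than it started, so the distance grows. With `0.001` the weight has barely moved after 50 steps. With `0.1` it is essentially at the minimum.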
Epochs
One complete pass through the entire training dataset. Too few: underfitting -- the model hasn't learned the patterns yet. Too many: overfitting -- the model memorizes the training data instead of learning general patterns, and fails on anything new, like a student who memorizes specific test answers instead of understanding the concepts. Like re-reading a textbook: the first pass gets broad strokes, the tenth catches nuances, but the thousandth just memorizes the words.
Batch Size
Number of training samples processed before each weight update. Small batches (32) mean more frequent updates and often better generalization. Large batches (1024+) make better use of parallel hardware but may converge to sharper minima, which tend to generalize worse. The sweet spot for most tasks: 32-256.
Epoch sets the duration ("how long do we train?"). Batch size determines frequency ("how often do we update?"). Learning rate controls magnitude ("how much do we change?"). Together they determine everything about the practical dynamics of training.
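A quick piece of arithmetic shows how the knobs interact. The helper below is a hypothetical illustration that assumes the final, partially filled batch of each epoch still triggers an update. The dataset size of 50,000 is made up for the example.

```python
import math

def total_updates(num_samples, batch_size, epochs):
    """Number of weight updates = batches per epoch * number of epochs."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    return batches_per_epoch * epochs

print(total_updates(num_samples=50_000, batch_size=32, epochs=10))    # 1563 * 10 = 15630
print(total_updates(num_samples=50_000, batch_size=1024, epochs=10))  # 49 * 10 = 490
```

Same data, same number of epochs, but the small-batch run makes roughly 32 times as many updates. That is why a change in batch size often calls for a change in learning rate too.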
The ImageNet Moment (2012)
Fei-Fei Li saw something in 2006 that almost no one else did. While the entire AI research community was focused on building better algorithms, she focused on data.
"The paradigm shift of the ImageNet thinking is that while a lot of people are paying attention to models, let's pay attention to data. Data will redefine how we think about models."
-- Fei-Fei Li, Quartz interview
Her colleagues were skeptical. Spending years collecting data instead of publishing algorithm papers seemed like career suicide for a junior professor. She did it anyway.
Then in September 2012, something happened that stunned the field. Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered a deep convolutional neural network called AlexNet into the annual ImageNet competition. AlexNet achieved a top-5 error rate of 15.3%; the runner-up managed 26.2%.
A gap of more than 10.8 percentage points -- unprecedented. And AlexNet was trained on just 2 NVIDIA GTX 580 GPUs, reportedly in Krizhevsky's bedroom at his parents' house. Training took 5-6 days.
Before AlexNet, the dominant approach was: humans manually design feature extractors (SIFT, HOG), then feed those features to a classifier. After AlexNet: feed raw pixels to a neural network and let it learn everything end-to-end.
By 2017, 29 of 38 competing teams had achieved error rates below 5% -- surpassing estimated human performance (~5.1%). The competition was discontinued, having served its purpose.
The Godfathers and Their Long Bet
Geoffrey Hinton, Yann LeCun, and Yoshua Bengio kept faith in neural networks through decades when the approach was considered a dead end.
"There was a dark period between the mid-90s and early-to-mid-2000s when it was impossible to publish research on neural nets, because the community had lost interest in it."
-- Yann LeCun, Understanding AI
"The three of us all knew that it would be the ultimate answer."
-- Geoffrey Hinton, Neural Buddies
Their persistence was rewarded with the 2018 Turing Award (the "Nobel Prize of Computing"). And in 2024, Hinton went further -- winning the Nobel Prize in Physics alongside John Hopfield for "foundational discoveries and inventions that enable machine learning with artificial neural networks."
"If you believe in something, don't give up on it until you understand why that belief is wrong."
-- Geoffrey Hinton, Nobel Prize speech, 2024
The Cost of Training
Training costs have skyrocketed as models scale:
| Model | Year | Parameters | Training Cost |
|---|---|---|---|
| Original Transformer | 2017 | ~65M | ~$900 |
| GPT-3 | 2020 | 175B | $4.6M+ |
| GPT-4 | 2023 | ~1.8T (rumored) | ~$78M |
| Llama 3.1 405B | 2024 | 405B | ~$170M |
| Gemini Ultra | 2024 | Unknown | ~$191M |
| DeepSeek V3 | 2024 | ~671B (MoE) | $5.6M |
Notice the outlier: DeepSeek V3 reported competitive results for $5.6M -- about 30x cheaper than Llama 3.1 -- though that figure covers only the final training run, not earlier research and experiments. Training efficiency matters as much as raw scale. Hardware improvements (H200/B200 GPUs) have also driven a 45% cost reduction for standard model sizes. But frontier models keep getting bigger faster than hardware gets cheaper.
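The "about 30x" figure follows directly from the table. A two-line check, using the table's approximate dollar amounts as given:

```python
# Approximate training costs in USD, copied from the table above.
training_cost_usd = {
    "Llama 3.1 405B": 170e6,
    "DeepSeek V3": 5.6e6,
}

ratio = training_cost_usd["Llama 3.1 405B"] / training_cost_usd["DeepSeek V3"]
print(f"Llama 3.1 405B cost about {ratio:.1f}x as much as DeepSeek V3")
```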
In the 2014 film The Imitation Game, Alan Turing says: "Sometimes it is the people no one imagines anything of who do the things that no one can imagine." The parallel to backpropagation's history is hard to miss -- a Finnish student, an ignored Harvard thesis, decades in the wilderness. The breakthrough wasn't a flash of genius. It was the same idea, persisting until the world was ready for it.
Key Terms
- Training: The process of adjusting a model's weights to minimize prediction error, through repeated exposure to labeled data.
- Loss function: A formula measuring how wrong the model's prediction was -- the single number training tries to minimize.
- Gradient descent: The optimization algorithm that adjusts weights in the direction of steepest error reduction, step by step.
- Backpropagation: A method for computing each weight's contribution to the error by working backwards through the network using the chain rule of calculus.
- Learning rate: How much weights change per update -- the step size in gradient descent. Too high = overshoot; too low = crawl.
- Epoch: One complete pass through the entire training dataset.
- Overfitting: When a model memorizes training data instead of learning general patterns -- performing well on training data but poorly on new data.
Did This Land?
What does the loss function measure?
Why was the 1986 Nature paper so important?
What made AlexNet different from previous ImageNet entries?
Lesson Summary
- Training is a loop: forward pass, measure loss, backpropagate blame, update weights, repeat millions of times.
- Gradient descent navigates the loss landscape by stepping in the direction of steepest descent -- like hiking down a mountain in fog.
- Backpropagation solves the credit assignment problem by working backwards through the network with the chain rule -- and it took four inventions across 26 years before it stuck.
- Learning rate, epochs, and batch size are the three knobs that control training dynamics. Getting them right is part science, part dark art.
- The ImageNet moment (AlexNet, 2012) proved that deep learning plus GPUs could demolish hand-engineered approaches -- and launched the revolution we're living through now.