Lesson 2 12 min read

How Neural Networks Think

In 1958, the New York Times reported the Navy had built a machine that could "walk, talk, see, write, reproduce itself and be conscious of its existence." They were... optimistic.

What you'll understand after this

What an artificial neuron computes -- weighted sum plus activation
How layers build feature hierarchies, from edges to objects
What actually happens during a forward pass, step by step
Why the Perceptron died, and what killed it

The Artificial Neuron

In 1943, neurophysiologist Warren McCulloch and logician Walter Pitts published "A Logical Calculus of the Ideas Immanent in Nervous Activity" -- the first mathematical model of an artificial neuron. Pitts was 18 years old and had no academic credentials. The paper laid the theoretical foundation for everything that follows in this course.

An artificial neuron does two things. First, it takes all its inputs and computes a weighted sum:

z = w1*x1 + w2*x2 + ... + wn*xn + b

Each input (x) gets multiplied by its weight (w) -- a number representing how much that input matters. The bias (b) is a constant offset, like a baseline mood before hearing anyone's opinion. Then an activation functionA non-linear function that decides whether a neuron "fires" -- without it, the entire network reduces to a single linear equation transforms that sum into the neuron's output.

Think of it this way: each input is a person giving you advice. The weights are how much you trust each person. The bias is your gut feeling before hearing anyone. And the activation function is your decision threshold -- at some point, enough evidence tips you from "no" to "yes."

Visualization of an artificial neural network with glowing connections between nodes

Layers: Input, Hidden, Output

Individual neurons are simple. The power comes from organizing them into layersA group of neurons that process data at the same level of abstraction -- input layers receive raw data, hidden layers transform it, output layers produce predictions.

The input layer receives raw data and does no computation -- it just passes information forward. For a 28x28 pixel grayscale image (the classic MNIST handwritten digit dataset), that's 784 neurons, one per pixel.

Hidden layers are where the actual "thinking" happens. And here's the key insight: hidden layers discover relevant features automatically, without human engineering. Nobody tells the network "look for edges" or "find circles." It figures that out on its own. The deeper the layer, the more abstract the features:

Layer 1

Detects simple edges and corners

Layer 2

Combines edges into shapes

Layer 3

Identifies parts (eyes, wheels)

Deeper

Recognizes whole objects

The output layer produces the final prediction. For digit recognition, it's 10 neurons -- one for each digit 0-9. A softmax function converts the outputs into probabilities that sum to 1: "87% confident this is a 3, 8% it's an 8, 3% it's a 5..."

Diagram showing how deep learning layers progressively extract higher-level features

The Forward Pass

A forward passThe process of feeding input through the network layer by layer to get an output prediction -- no learning happens during this step is what happens every time a neural network processes an input. No learning occurs here -- it's pure prediction. Data flows in one direction: forward.

Here's the step-by-step for recognizing a handwritten digit:

Step 1: Image pixels are converted to numbers. Each pixel (0-255 for grayscale) gets normalized to a value between 0 and 1. A 28x28 image becomes a vector of 784 values.

Step 2: Each neuron in the first hidden layer computes a weighted sum of all 784 inputs, adds its bias, and applies the activation function. Result: a new set of values representing detected low-level features.

Step 3: Output of layer 1 becomes input to layer 2. The same weighted-sum-then-activate pattern repeats. Each subsequent layer transforms data into more abstract representations.

Step 4: The output layer converts the final hidden representation into 10 numbers. Softmax turns these into probabilities. The network reports: "87% confident this is a 3."

Think of it as a factory assembly line. Raw materials (pixels) enter, each station (layer) refines them, and the final station produces the finished product (a classification). Every worker (neuron) has the same basic job: weigh inputs, sum them, decide whether to activate. The only difference is what they've learned to care about.

Interactive Animation

Why Non-Linearity Matters

Here's a fact that sounds abstract but is actually the single most important insight in neural network design: without activation functions, your entire network -- no matter how many layers -- is mathematically equivalent to a single linear equation. Stack 100 layers of linear transformations and you get... one linear transformation. Completely useless for learning anything interesting about the real world.

The Universal Approximation Theorem (Cybenko 1989, Hornik et al. 1989) proved the fix: a neural network with just one hidden layer and non-linear activation functions can approximate any continuous function to any desired accuracy, given enough neurons. Non-linearity is what gives neural networks their power.

Two activation functions defined the field:

Sigmoid -- the dimmer switch

sigma(x) = 1 / (1 + e^(-x))

Squashes any input smoothly to a value between 0 and 1. Historically important -- used in early networks and still shows up in specific contexts. But it has a fatal flaw for deep networks: its first derivative maxes out at 0.25, causing the vanishing gradient problem. During backpropagation, gradients get multiplied layer by layer. When each multiplication is at most 0.25, the signal reaching early layers is vanishingly small. Deep networks with sigmoid essentially can't learn.

ReLU -- the on/off valve

f(x) = max(0, x)

If the input is positive, let it through unchanged. If negative, output zero. That's it. No exponentials, no complex curves. The derivative is either 0 or 1 -- so gradients don't vanish. Networks using ReLURectified Linear Unit: max(0, x). The most popular activation function in modern networks -- simple, fast, and avoids the vanishing gradient problem train up to 6x faster than sigmoid on benchmark tasks like CIFAR-10. This simplicity is exactly why modern deep learning works.

The Perceptron and Why It Died

In July 1958, the U.S. Office of Naval Research unveiled the Perceptron -- the first machine that could learn from examples. Its creator, Frank Rosenblatt, demonstrated it on an IBM 704 (a 5-ton, room-sized computer). He fed it punch cards, and after just 50 trials, it taught itself to distinguish cards marked on the left from cards marked on the right.

The Mark I Perceptron at Cornell Aeronautical Laboratory, the first machine that learned from examples

The media response was extraordinary. The New York Times reported that the PerceptronThe first neural network (1958), invented by Frank Rosenblatt. Could only solve linearly separable problems, but proved machines could learn from data. was:

"the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
-- New York Times, July 13, 1958

The New Yorker called it "the first serious rival to the human brain ever devised." Rosenblatt himself called it "the first machine which is capable of having an original idea."

Then came Marvin Minsky.

The Rivalry

Minsky was a grade behind Rosenblatt at the Bronx High School of Science in the 1940s. Both traced parallel careers from that same school, but their approaches to AI diverged sharply. Rosenblatt was a connectionist -- he believed in learning from data, bottom-up. Minsky was a symbolist -- he believed in hand-crafted logic and reasoning, top-down.

At conferences, they publicly debated the perceptron's viability, "as their colleagues and students looked on in amazement." Part of the tension was territorial: Rosenblatt's training was in psychology -- "soft science" -- but his perceptron work was mathematical, turf that Minsky, with his Princeton mathematics PhD, didn't feel Rosenblatt belonged on.

The XOR Problem

In 1969, Minsky and Seymour Papert published Perceptrons, a book that proved a devastating mathematical fact: single-layer perceptrons cannot solve the XOR problemExclusive OR: returns 1 only when inputs differ. A single-layer perceptron can't solve it because the outputs aren't linearly separable -- you can't draw one straight line to separate the 1s from the 0s..

XOR (exclusive OR) returns 1 only when its two inputs differ:

(0, 0)0

(0, 1)1

(1, 0)1

(1, 1)0

If you plot these four points on a graph, there's no single straight line that separates the 1s from the 0s. They sit on opposite corners. A perceptron can only draw one line. For XOR, you need at least two lines -- which means at least two layers.

Minsky and Papert's book was widely misinterpreted as proving that all neural networks were fundamentally limited, not just single-layer ones. It was generally understood at the time that multi-layer networks could solve XOR. But nobody knew how to train them efficiently -- backpropagation hadn't been popularized yet.

The Fallout

The damage was catastrophic. Funding collapsed. The Lighthill Report (1973) declared AI research an "utter failure." DARPA cut grants. Researchers abandoned neural networks for a decade. This was the first AI winter (1974-1980+).

Frank Rosenblatt drowned on his 43rd birthday in 1971 while sailing on Chesapeake Bay. He never lived to see his ideas vindicated. The Perceptron now resides at the Smithsonian Institution. In 2004, IEEE established the Frank Rosenblatt Award in his honor.

As Geoffrey Hinton -- who would eventually lead the neural network revival -- said in his 2024 Nobel Prize speech: "If you believe in something, don't give up on it until you understand why that belief is wrong."

Go Deeper

3Blue1Brown's "But what is a GPT?" -- a visual walkthrough of how transformer-based language models process text. Picks up where the neural network video leaves off.

The canonical reference

Jay Alammar's The Illustrated Transformer is the most-linked technical explainer in the field. It walks through self-attention with clear diagrams. Bookmark it -- you'll need it in Lesson 5.

History Thread

1943 -- McCulloch and Pitts publish the first mathematical model of a neuron. Pitts is 18, has no degree, and is essentially self-taught. The paper creates the theoretical foundation for neural networks.

1958 -- Rosenblatt builds the Mark I Perceptron at Cornell. After 50 trials, it learns to classify patterns. The media goes wild. The New York Times predicts conscious machines.

1969 -- Minsky and Papert publish Perceptrons, proving single-layer limits. Funding collapses. Neural network research enters a decade-long winter.

1986 -- Rumelhart, Hinton, and Williams popularize backpropagation in Nature. Multi-layer networks can now be trained. The winter thaws. (More on this in Lesson 3.)

Pop Culture Connection

The 1958 New York Times headline is arguably the most famous piece of AI overhype ever published. The Navy claimed its machine would achieve consciousness; what it actually did was distinguish left-marked cards from right-marked cards after 50 tries. The gap between the embryo of a computer that will walk, talk, see, write, reproduce itself and be conscious of its existence and the reality of a pattern classifier on punch cards is a cautionary tale that repeats with every AI cycle. Every time someone claims AGI is 5 years away, the Perceptron headline is relevant.

Key Terms

Neuron (node): The basic unit of a neural network: computes a weighted sum of inputs, then applies an activation function
Layer: A group of neurons that process data at the same level of abstraction
Forward pass: The process of feeding input through the network layer by layer to produce output -- no learning happens here
Activation function: A non-linear function that decides whether a neuron "fires" -- without it, the network can only learn linear relationships
ReLU: max(0, x) -- the most popular activation function in modern networks. Simple, fast, avoids vanishing gradients.
Perceptron: The first neural network (1958), invented by Frank Rosenblatt. Could only solve linearly separable problems.
XOR problem: A logic function whose outputs can't be separated by a single line -- proved single-layer perceptrons have hard limits

Did This Land?

Why can't a single-layer perceptron solve XOR?

A single-layer perceptron can only draw one straight line to separate outputs. In XOR, the 1s and 0s sit on opposite corners of a square -- no single line can separate them. You need at least two lines (two layers) to carve out the right regions.

What do deeper layers in a network "see" compared to shallow layers?

Shallow layers detect low-level features: edges, corners, textures. Deeper layers combine those into progressively more abstract representations: shapes, parts, and eventually whole objects or concepts. Layer 1 sees "a curve"; layer 5 sees "a face."

Why was the switch from sigmoid to ReLU so important?

Sigmoid's maximum derivative is 0.25, so gradients shrink exponentially as they pass through layers (vanishing gradient problem). Deep networks with sigmoid can't update their early layers effectively. ReLU's derivative is 0 or 1, so gradients pass through unchanged. Networks with ReLU train up to 6x faster and can go much deeper.

Lesson Summary

An artificial neuron computes a weighted sum of inputs plus bias, then applies a non-linear activation function. McCulloch and Pitts described this in 1943.
Neural networks organize neurons into layers. Hidden layers automatically discover features at increasing levels of abstraction -- edges, shapes, parts, objects.
A forward pass pushes data through the network layer by layer. The same weighted-sum-then-activate pattern repeats at every stage.
Non-linearity (activation functions) is what gives networks their power. Without it, any network reduces to a single linear equation. ReLU replaced sigmoid because it's simpler and avoids vanishing gradients.
The Perceptron (1958) proved machines could learn from data but was limited to linearly separable problems. The XOR critique triggered the first AI winter. Rosenblatt died in 1971, never seeing his ideas vindicated.