Lesson 4 (14 min read)

Data -- The Secret Ingredient

Everyone obsesses over the model. But the data is where the real story lives -- who collected it, how it was labeled, what biases it carries, and who got hurt making it "safe."

What you'll understand

What LLMs are actually trained on, and at what scale
How Amazon reviews accidentally bootstrapped sentiment analysis
The invisible human labor behind AI training data
How bias enters through data -- and the real-world consequences

The field of sentiment analysis -- teaching machines to tell whether text is positive or negative -- was accidentally bootstrapped by millions of Amazon customers leaving star ratings. Nobody planned it. Nobody funded a data collection effort. People were just shopping, complaining, and raving about products. And that messy, unstructured, human-generated data turned out to be exactly what AI researchers needed.

This is a pattern that repeats across the entire history of AI: what goes in matters more than the architecture. The most powerful datasets are often "found data" -- collected for one purpose, repurposed for another.

What LLMs Eat

GPT-3 was trained on approximately 499 billion tokens from five sources. That's roughly equivalent to a million books -- more text than any human could read in 26,000 years of continuous reading.

Dataset Composition

GPT-3 (share of training mix, dataset size)
Common Crawl: 60%, 410B tokens
WebText2: 22%, 19B tokens
Books1: 8%, 12B tokens
Books2: 8%, 55B tokens
Wikipedia: 3%, 3B tokens

LLaMA (share of training mix, size on disk)
CommonCrawl: 67%, 3.3 TB
C4: 15%, 783 GB
GitHub: 4.5%, 328 GB
Wikipedia: 4.5%, 83 GB
Books: 4.5%, 85 GB
ArXiv: 2.5%, 92 GB
Stack Exchange: 2%, 78 GB

GPT-3 saw ~300B tokens during training -- Common Crawl was downsampled while higher-quality sources were sampled 2-3x. LLaMA trained on ~1.4T tokens total.

The backbone is Common Crawl -- a nonprofit that has been crawling the web since 2008. Their archive totals 9.5+ petabytes, over 300 billion pages, with monthly crawls capturing roughly 3 billion web pages each. It's used by GPT-3, LLaMA, T5, and virtually every major LLM.

But raw web crawl data is noisy. GPT-3's training mix deliberately downsampled Common Crawl (82% of raw data, but only 60% of training weight) while oversampling higher-quality sources like Wikipedia and curated book collections.
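To make the reweighting concrete, here is a minimal Python sketch that uses only the GPT-3 token counts and training weights listed above, plus the ~300B-token training budget. The names (`SOURCES`, `raw_share`, `epochs_seen`) are illustrative, not from any library, and because the published percentages are rounded, the results are approximate.

```python
# Token counts (billions) and training-mix weights as listed in this lesson.
SOURCES = {
    "Common Crawl": (410, 0.60),
    "WebText2": (19, 0.22),
    "Books1": (12, 0.08),
    "Books2": (55, 0.08),
    "Wikipedia": (3, 0.03),
}
TRAINING_BUDGET_B = 300  # ~300B tokens seen during training


def raw_share(name: str) -> float:
    """Fraction of the raw corpus (by tokens) that comes from this source."""
    total = sum(tokens for tokens, _ in SOURCES.values())
    return SOURCES[name][0] / total


def epochs_seen(name: str) -> float:
    """Approximate number of passes over a source during training."""
    tokens_b, weight = SOURCES[name]
    return weight * TRAINING_BUDGET_B / tokens_b


if __name__ == "__main__":
    for name, (_, weight) in SOURCES.items():
        print(f"{name:13s} raw={raw_share(name):6.1%} "
              f"weight={weight:4.0%} epochs~{epochs_seen(name):.2f}")
```

A value below 1 means the model never read the whole source even once; a value above 1 means the source was repeated. Under these numbers Common Crawl lands well below one pass, while Wikipedia is read about three times.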

The Amazon Reviews Story

Before machine learning, sentiment detection was a nightmare. Researchers tried hand-crafted dictionaries of positive and negative words, grammatical rules about negation, expert-curated knowledge bases. The results were brittle, expensive, and didn't generalize.

In 2002, Bo Pang and Lillian Lee at Cornell, working with Shivakumar Vaithyanathan of IBM Almaden, had a different idea: instead of writing rules, train a model on labeled examples and let it learn the patterns. Their support vector machine hit ~82% accuracy on movie review sentiment -- clearly outperforming hand-coded baselines.

Then came the Amazon reviews. The genius was that they came pre-labeled: every review had a 1-5 star rating attached by the reviewer. Millions of text/sentiment pairs generated as a byproduct of commerce -- no expensive annotation required.
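The "pre-labeled" idea can be sketched in a few lines of Python. The mapping below (1-2 stars negative, 4-5 stars positive, 3 stars dropped as ambiguous) is an assumed convention for illustration, not something this lesson attributes to any particular dataset, and the function names and sample reviews are made up.

```python
from typing import Optional


def star_to_label(stars: int) -> Optional[str]:
    """Map a 1-5 star rating to a coarse sentiment label.

    Assumed convention: 1-2 = negative, 4-5 = positive, 3 = ambiguous (None).
    """
    if not 1 <= stars <= 5:
        raise ValueError(f"star rating out of range: {stars}")
    if stars <= 2:
        return "negative"
    if stars >= 4:
        return "positive"
    return None


def build_dataset(reviews):
    """Turn (text, stars) pairs into (text, label) training examples."""
    examples = []
    for text, stars in reviews:
        label = star_to_label(stars)
        if label is not None:  # skip ambiguous 3-star reviews
            examples.append((text, label))
    return examples


# Made-up reviews, purely for illustration.
reviews = [
    ("Broke after two days.", 1),
    ("Does the job, nothing special.", 3),
    ("Best purchase I've made all year!", 5),
]
print(build_dataset(reviews))
```

No annotator is involved at any point: the reviewer's own rating is the label, which is why this kind of data scales so cheaply.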

Figure: an Amazon product review with its star rating, the kind of naturally labeled data that bootstrapped sentiment analysis. Star ratings gave researchers something priceless: free, massive-scale sentiment labels, generated as a byproduct of online shopping.

The McAuley dataset at a glance: 42M reviews, 5.1B words in total.

The McAuley/Leskovec Stanford dataset (2013) compiled 42 million reviews from 10 million users across 3 million items. It became one of the most-used benchmarks in NLP and recommendation systems research, and it helped drive work on transfer learning and other techniques that underpin modern natural language processing.

ImageNet: The Contrarian Bet

In 2006, virtually every AI researcher was focused on building better algorithms. Fei-Fei Li, who was born in Beijing, immigrated to the US at 16, and worked at a dry cleaner while studying physics at Princeton, saw the fundamental limitation: even the best algorithm in the world couldn't work well if the data it learned from didn't reflect reality.

"The paradigm shift of the ImageNet thinking is that while a lot of people are paying attention to models, let's pay attention to data. Data will redefine how we think about models."
-- Fei-Fei Li, Quartz

Her colleagues were skeptical. For a junior professor, spending years collecting data instead of publishing papers seemed like career suicide. She initially estimated it would take decades using traditional annotation methods -- then turned to Amazon Mechanical Turk for crowdsourced labeling.

ImageNet at a glance: 14.2M annotated images across 21,841 categories.

Who Labels the Data?

Every AI system has an invisible army of human workers behind it -- labeling, rating, filtering, and correcting data. The conditions of that labor are one of the industry's most uncomfortable truths.

The Kenyan Workers Behind ChatGPT

In November 2021, OpenAI outsourced content moderation to Sama, a San Francisco firm employing workers in Kenya. Their job: label toxic content so ChatGPT could learn to avoid it. The content was "pulled from the darkest recesses of the internet" -- text describing child sexual abuse, bestiality, murder, torture, self-harm.

The pay: $1.32 to $2 per hour.

"However much I feel good seeing ChatGPT become famous and being used by many people globally, making it safe destroyed my family. It destroyed my mental health. As we speak, I'm still struggling with trauma."
-- Mophat Okinyi, QA analyst, TIME

"It got to a point where my body couldn't function."
-- Michael Geoffrey Asia, data labeler, now Secretary General of the Data Labelers Association

Sama canceled all of its OpenAI work in February 2022, eight months earlier than planned, citing the traumatic nature of the work. Researchers documented over 60 cases of serious mental health harm, including PTSD, depression, insomnia, and suicidal ideation.

The Mechanical Turk Economy

Amazon's Mechanical Turk -- deliberately named after the 18th-century chess "automaton" that was secretly operated by a human -- is the platform behind much of AI's training data. Workers ("Turkers") complete micro-tasks for micro-payments. The median wage: ~$2/hour, with only 4% earning above the US federal minimum wage.

$29B: Scale AI valuation (the labeling middleman)
$2/hr: Median Mechanical Turk wage
$1.32/hr: Kenyan RLHF workers (lowest tier)

Bias: Garbage In, Garbage Out

If your training data encodes bias, your model will learn that bias and reproduce it at scale. This isn't a theoretical concern -- it has caused documented, measurable harm.

Amazon's Hiring Tool (2014-2017)

Amazon built an AI to rank job applicants from 1 to 5 stars, trained on 10 years of resumes the company had received. Because tech is male-dominated, the vast majority of those resumes came from men. The system learned to penalize resumes containing the word "women's" (as in "women's chess club") and to downgrade graduates of all-women's colleges, while favoring language more common on male engineers' resumes. Amazon disbanded the team by early 2017.

Gender Shades (2018)

Joy Buolamwini and Timnit Gebru evaluated three commercial facial-analysis systems, which classify gender from a face image, on 1,270 faces. The error rates exposed a staggering disparity:

Lighter-skinned males (highest error rate across the three systems): 0.8%
Darker-skinned females (Microsoft): 20.8%
Darker-skinned females (IBM): 34.7%

A 43x difference in error rates between the best- and worst-served groups. The root cause: training datasets over-represented lighter-skinned faces, particularly male faces. Buolamwini founded the Algorithmic Justice League and the work became the Netflix documentary Coded Bias (2020).
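The 43x figure is just the worst error rate divided by the best. The Python sketch below reproduces that arithmetic from the rates quoted above and shows the per-group bookkeeping an audit of this kind needs. The function names and the toy records are hypothetical; only the three percentages come from this lesson.

```python
from collections import defaultdict


def error_rates_by_group(records):
    """records: iterable of (group, was_correct) pairs -> {group: error rate}."""
    wrong = defaultdict(int)
    total = defaultdict(int)
    for group, was_correct in records:
        total[group] += 1
        if not was_correct:
            wrong[group] += 1
    return {group: wrong[group] / total[group] for group in total}


def disparity_ratio(rates):
    """Worst error rate divided by best; every rate must be non-zero."""
    return max(rates.values()) / min(rates.values())


# Error rates (in percent) as reported in this lesson.
reported = {
    "lighter-skinned males": 0.8,
    "darker-skinned females (Microsoft)": 20.8,
    "darker-skinned females (IBM)": 34.7,
}
print(f"disparity: {disparity_ratio(reported):.1f}x")

# Made-up audit records, only to show the per-group computation.
toy = [("A", True)] * 9 + [("A", False)] + [("B", True)] * 6 + [("B", False)] * 4
print(error_rates_by_group(toy))
```

In the toy data the overall error rate is 25%, which hides the fact that group B fails four times as often as group A. That is the core point of a disaggregated audit: a single aggregate accuracy number can look acceptable while one group bears most of the errors.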

COMPAS: Criminal Justice

COMPAS, a tool used by US courts to predict recidivism risk, was found by ProPublica to produce far more false positives for Black defendants: among defendants who did not go on to reoffend, 44.9% of Black defendants had been labeled higher-risk, compared with 23.5% of white defendants -- nearly twice the rate. After controlling for criminal history, age, and gender, Black defendants were still 77% more likely to be flagged as higher risk for violent crime.
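The metric behind those percentages is the false positive rate: of the people who did not reoffend, what fraction had been flagged as high-risk? A minimal sketch in Python follows, with a hypothetical record layout and made-up toy data. It is not ProPublica's code or data; only the two reported rates at the bottom come from this lesson.

```python
from collections import defaultdict


def false_positive_rates(records):
    """records: iterable of (group, flagged_high_risk, reoffended) tuples.

    Returns {group: FP / (FP + TN)}, computed only over people who did
    NOT reoffend, since those are the only ones who can be falsely flagged.
    """
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for group, flagged, reoffended in records:
        if reoffended:
            continue  # true outcomes are positive; irrelevant to the FPR
        negatives[group] += 1
        if flagged:
            false_pos[group] += 1
    return {group: false_pos[group] / n for group, n in negatives.items()}


# Made-up records for two hypothetical groups.
toy = [
    ("A", True, False), ("A", True, False), ("A", False, False),
    ("A", False, False), ("A", True, True),
    ("B", True, False), ("B", False, False), ("B", False, False),
    ("B", False, False), ("B", True, True),
]
print(false_positive_rates(toy))

# Ratio of the two rates reported above (44.9% vs. 23.5%).
print(f"reported ratio: {44.9 / 23.5:.2f}x")
```

Note that a tool can have similar overall accuracy for two groups and still have very different false positive rates, which is why the choice of metric matters in fairness audits.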

The Stochastic Parrots Paper (2021)

Emily Bender, Timnit Gebru, and colleagues argued that LLMs trained on internet data inevitably over-represent "the perspectives of the young, white, male, English speakers who dominate internet sites such as Reddit." Gebru, then co-lead of Google's Ethical AI team, was fired over this paper. Approximately 2,700 Google employees and 4,300+ academics signed letters condemning her dismissal.

Copyright Wars

As the value of training data became clear, so did the legal battles over who owns it.

The New York Times sued OpenAI and Microsoft in December 2023 for using Times articles without permission. In March 2025, a federal judge allowed the case to proceed, largely rejecting OpenAI's motion to dismiss. It is widely regarded as one of the most significant AI copyright cases in US courts.

In the image space, artists including Sarah Andersen filed suit against Stability AI, Midjourney, and DeviantArt. The LAION-5B dataset (5.85 billion images) used to train Stable Diffusion was found to contain at least 1,008 confirmed instances of CSAM and over 3,000 suspected instances. A leaked list revealed 16,000+ non-consenting artists allegedly used to train Midjourney.

Meanwhile, data licensing has become a new gold rush:

$60M/yr: Google + Reddit
$70M/yr: OpenAI + Reddit
$250M+: News Corp + OpenAI (5 years)

Quality vs. Quantity: The Chinchilla Lesson

For years, the assumption was simple: more data, bigger model, better results. Then in 2022, DeepMind published the Chinchilla scaling laws and upended that thinking.

The key finding: for compute-optimal training, the ideal ratio is approximately 20 tokens per parameter. Chinchilla (70B parameters, 1.4T tokens) outperformed Gopher (280B parameters) on nearly every task -- despite being 4x smaller. GPT-3, with 175B parameters but only 300B tokens, had a ratio of ~1.7 tokens/parameter. By Chinchilla standards, it was severely data-starved.
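The arithmetic is easy to check. The sketch below treats the ~20 tokens-per-parameter figure as a rule of thumb (the real scaling-law fit is more nuanced), and the function names are illustrative rather than taken from any library.

```python
TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio cited above


def tokens_per_parameter(tokens: float, params: float) -> float:
    """How many training tokens a model saw per parameter."""
    return tokens / params


def compute_optimal_tokens(params: float) -> float:
    """Rule-of-thumb training set size for a model with `params` parameters."""
    return TOKENS_PER_PARAM * params


BILLION = 10**9

gpt3_ratio = tokens_per_parameter(300 * BILLION, 175 * BILLION)
print(f"GPT-3: {gpt3_ratio:.1f} tokens per parameter")

for name, params in [("70B model", 70 * BILLION), ("175B model", 175 * BILLION)]:
    needed = compute_optimal_tokens(params)
    print(f"{name}: ~{needed / 10**12:.1f}T tokens for compute-optimal training")
```

Under this rule of thumb, a GPT-3-sized model would have wanted roughly 3.5 trillion tokens, more than ten times what it actually saw.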

Microsoft's Phi models drove the point home further. Phi-1 -- just 1.3 billion parameters, trained on 7 billion tokens of "textbook quality" data -- achieved 50.6% on the HumanEval code generation benchmark, competitive with models 10-100x larger. Their philosophy: "Textbooks are all you need."

"In many industries where giant data sets simply don't exist, I think the focus has to shift from big data to good data. Having 50 thoughtfully engineered examples can be sufficient to explain to the neural network what you want it to learn."
-- Andrew Ng, IEEE Spectrum

Pop Culture Connection

In March 2023, an AI-generated image of Pope Francis in a white puffer jacket went massively viral -- many people genuinely believed it was real. It was created with Midjourney, trained on exactly the kind of scraped internet data described above. And in September 2022, Jason Allen won first place in the digital art category at the Colorado State Fair with "Théâtre D'opéra Spatial," generated using Midjourney. "I'm not going to apologize for it," Allen told the New York Times. The controversy over AI-generated art isn't abstract -- it's already reshaping who gets to call themselves a creator.

The Ouroboros Problem: AI Training on AI

There's an emerging risk that sounds like science fiction but is backed by a 2024 Nature paper: model collapse. When AI models train on data generated by other AI models, each generation loses information about rare patterns, minority perspectives, and unusual examples. Models converge toward "bland averages."
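A toy simulation shows the mechanism. Here each "model" is just a Gaussian fitted to a small sample drawn from the previous model, a deliberately simplified stand-in for training on generated data. The function name, sample size, and generation count are assumptions chosen for illustration; this is not the experimental setup of the Nature paper.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=20, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation after each generation,
    starting with the "real" data distribution at index 0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the original human data
    history = [sigma]
    for _ in range(generations):
        # "Train" on a finite sample produced by the previous model...
        sample = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # ...then refit. pstdev is the maximum-likelihood spread estimate.
        mu = statistics.fmean(sample)
        sigma = statistics.pstdev(sample)
        history.append(sigma)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for gen in (0, 50, 100, 200):
        print(f"generation {gen:3d}: std = {history[gen]:.4f}")
```

Each refit slightly underestimates the spread and misses some of the tails, and those small losses compound: after enough generations the fitted distribution narrows toward a single point. The rare values disappear first, which mirrors the loss of rare patterns described above.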

The scale of the problem: by April 2025, over 74% of newly created web pages contained AI-generated text. As Common Crawl hoovers up the web for the next generation of training data, models are increasingly eating their own output.

This is one reason companies are now paying for guaranteed-human content -- the Reddit and News Corp licensing deals above aren't just about avoiding lawsuits. They're about keeping models grounded in genuinely human expression.

Key Terms

Training data: The corpus of text, images, or other data a model learns from during training.
Label: A tag or annotation telling the model what the correct answer is -- like star ratings on reviews, or "cat" vs "dog" tags on images.
Bias: Systematic errors in training data that lead to unfair model outputs. Reflects historical patterns, not inherent truth.
Common Crawl: A nonprofit web archive totaling 9.5+ petabytes and 300B+ pages, used in training virtually every major LLM.
Sentiment analysis: Determining whether text expresses positive, negative, or neutral opinion -- one of the earliest commercial NLP applications.
Synthetic data: AI-generated data used to train other AI models. Powerful but risks model collapse if used without real human data.
Fair use: A legal doctrine at the center of AI copyright disputes -- whether training on copyrighted works constitutes fair use or infringement.

Did This Land?

Why were Amazon reviews so useful for AI researchers?

Star ratings provided free, naturally occurring sentiment labels at massive scale. Instead of paying humans to annotate text as positive or negative, researchers could use the star ratings that millions of shoppers had already attached to their reviews. The McAuley dataset alone contained 42 million reviews and 5.1 billion words.

What does the Chinchilla scaling law say about data vs. model size?

Both should scale equally -- the ideal ratio is approximately 20 tokens per parameter. A 70B parameter model needs ~1.4 trillion tokens of training data. GPT-3, with only ~1.7 tokens per parameter, was severely data-starved by this standard. The implication: the industry was building models that were too large for their training data.

Give one example of how bias in training data caused real-world harm.

Multiple examples: (1) Amazon's hiring tool penalized resumes containing "women's" because it was trained on 10 years of male-dominated tech resumes. (2) The Gender Shades study found facial recognition had a 0.8% error rate for lighter-skinned males vs. 34.7% for darker-skinned females -- a 43x disparity. (3) The COMPAS recidivism algorithm falsely flagged Black defendants as high-risk at nearly twice the rate of white defendants (44.9% vs. 23.5%).

Lesson Summary

LLMs train on trillions of tokens from web crawls, books, code, and social media. Common Crawl alone totals 9.5+ petabytes.
Some of AI's most important datasets were "found data" -- Amazon star ratings accidentally bootstrapped sentiment analysis; ImageNet proved data matters as much as algorithms.
The human labor behind AI -- Kenyan workers at $1.32/hr, Mechanical Turk at $2/hr median -- is largely invisible and often psychologically harmful.
Bias in training data produces bias at scale: Amazon's hiring tool penalized women, facial recognition failed 43x more often on dark-skinned women, and criminal justice algorithms produced racially disparate outcomes.
Quality can beat quantity: Chinchilla showed that data-starved models underperform, and Microsoft's Phi showed that "textbook quality" data can make a small model competitive with models up to 100x larger.