Lesson 8 | 15 min read

What We Don't Know

Anthropic found a specific feature inside Claude that responds to the Golden Gate Bridge. When they turned it up, Claude couldn't stop talking about the bridge. We're starting to look inside the black box -- and what we're finding is weird.

The systems raising the hardest questions in AI are housed in buildings like these -- and we don't fully understand what happens inside them.

What you'll learn

Why AI is a "black box" and what interpretability research is actually finding
The full range of expert opinion on AI consciousness and existential risk
Why hallucinations are mathematically inevitable -- not just engineering bugs
The environmental costs of AI: energy, water, and carbon at industrial scale
What the data actually says about AI and jobs

The Black Box Problem

Here's the uncomfortable truth about modern AI: we built these systems, but we don't fully understand why they work.

Traditional software follows explicit if-then rules that a human wrote. You can trace every decision, read every conditional, audit every branch. A neural network is fundamentally different. It learns billions of numerical parameters through training, creating a system where inputs go in, outputs come out, and the internal reasoning remains hidden.
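The contrast can be sketched in a few lines of Python. This is only an illustration: the loan scenario, the 0.4 threshold, and the weight values are all hypothetical, and a real network has billions of parameters rather than three.

```python
import math

def rule_based_decision(income: float, debt: float) -> bool:
    """Traditional software: every branch was written by a person and can be audited."""
    if income <= 0:
        return False
    if debt / income > 0.4:  # explicit, human-chosen threshold (hypothetical)
        return False
    return True

# Hypothetical "learned" parameters. Training produced these numbers;
# none of them comes with an explanation attached.
WEIGHTS = [0.83, -1.57]
BIAS = -0.12

def learned_decision(income: float, debt: float) -> bool:
    """Learned model: the decision is arithmetic over opaque numbers.

    Inputs are assumed to be scaled to roughly the 0-1 range.
    """
    score = WEIGHTS[0] * income + WEIGHTS[1] * debt + BIAS
    probability = 1.0 / (1.0 + math.exp(-score))  # logistic squashing
    return probability > 0.5

print(rule_based_decision(100.0, 30.0))  # traceable to one readable condition
print(learned_decision(1.0, 0.2))        # traceable only to weights and a bias
```

Both functions return an answer, but only the first one can be explained by pointing at a line of code.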

This isn't a minor inconvenience. These systems are being used to inform decisions about medical diagnoses, loan applications, criminal sentencing, and hiring. And we often can't explain why they produce the answers they do.

"We are in a race between interpretability and model intelligence. It is basically unacceptable for humanity to be totally ignorant of how they work." -- Dario Amodei, Anthropic CEO

Amodei has set a concrete goal: for interpretability to reliably detect most AI model problems by 2027. With AI potentially reaching what he calls a "country of geniuses in a datacenter" level of capability by 2026-2027, the urgency is real.

Looking Inside: Interpretability

Chris Olah, a machine learning researcher and Anthropic co-founder, coined the term "mechanistic interpretability" -- the study of neural networks by examining their internal circuits. His foundational work at Google Brain showed that individual neurons in image-recognition networks respond to specific visual patterns: edges, textures, faces. These build up in layers from simple to complex.

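But there was a problem. Individual neurons are often "polysemantic": a single neuron might respond to cats, car fronts, and park benches all at once. To find clean, human-understandable concepts, Anthropic's team turned to sparse autoencoders, a dictionary-learning technique that re-expresses a model's activations as combinations of a much larger set of features, only a few of which are active at a time.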
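To make the idea concrete, here is a minimal sketch of a sparse autoencoder's forward pass and loss in plain Python. It follows a common formulation (ReLU encoder, linear decoder, reconstruction error plus an L1 sparsity penalty) and is not a description of Anthropic's exact setup. The sizes, the random untrained weights, and the `l1_coeff` value are assumptions chosen for illustration.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sae_forward(activation, w_enc, b_enc, w_dec, b_dec):
    """Encode an activation into non-negative features, then reconstruct it."""
    pre = [p + b for p, b in zip(matvec(w_enc, activation), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon

def sae_loss(activation, features, recon, l1_coeff=0.01):
    """Reconstruction error plus an L1 penalty that pushes features toward zero."""
    mse = sum((a - r) ** 2 for a, r in zip(activation, recon)) / len(activation)
    return mse + l1_coeff * sum(features)  # features are >= 0, so sum == L1 norm

# Toy sizes: 4-dimensional activations, 16 dictionary features (untrained, random).
rng = random.Random(0)
D_MODEL, N_FEATURES = 4, 16
w_enc = [[rng.gauss(0, 0.5) for _ in range(D_MODEL)] for _ in range(N_FEATURES)]
w_dec = [[rng.gauss(0, 0.5) for _ in range(N_FEATURES)] for _ in range(D_MODEL)]
b_enc = [0.0] * N_FEATURES
b_dec = [0.0] * D_MODEL

activation = [0.3, -1.2, 0.8, 0.1]
features, recon = sae_forward(activation, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(activation, features, recon)
print(f"active features: {sum(f > 0 for f in features)} of {N_FEATURES}, loss={loss:.3f}")
```

Training would adjust the weights to lower this loss. The sparsity term is what encourages each feature to fire for one recognizable concept instead of many.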

The Breakthroughs (2024-2025)

May 2024: Feature Extraction. Anthropic extracted roughly 34 million features from Claude 3 Sonnet, many of them interpretable. These correspond to human-understandable concepts: the Golden Gate Bridge, sarcasm, DNA sequences, conspiracy theories, deceptive behavior. Each feature is a direction in the model's internal representation space that maps to a recognizable idea.

Golden Gate Claude

Anthropic found a specific feature combination that activates when Claude encounters the Golden Gate Bridge. When they artificially amplified this feature, Claude became obsessed with the bridge -- mentioning it in every response, even in completely unrelated conversations. Ask about dinner plans and it would suggest restaurants with a view of the bridge. Ask about philosophy and it would philosophize about bridges.

This sounds funny, and it is. But the implication is serious: researchers can identify and manipulate individual concepts inside a model. That's the beginning of understanding.
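If a feature is a direction in activation space, then "turning it up" can be pictured as pushing an activation vector along that direction. The sketch below is a simplification under stated assumptions: the three-dimensional vectors, the `bridge_direction` values, and the additive form of the intervention are all hypothetical, and the real experiment operated on features inside a production model.

```python
def feature_score(activation, feature_direction):
    """Dot product: how strongly the activation points along the feature."""
    return sum(a * d for a, d in zip(activation, feature_direction))

def steer(activation, feature_direction, strength):
    """Push the activation along the feature direction by `strength`."""
    return [a + strength * d for a, d in zip(activation, feature_direction)]

activation = [0.2, -0.1, 0.4]       # hypothetical activation for an unrelated prompt
bridge_direction = [0.0, 1.0, 0.0]  # hypothetical unit vector for the bridge feature

before = feature_score(activation, bridge_direction)
steered = steer(activation, bridge_direction, strength=10.0)
after = feature_score(steered, bridge_direction)
print(f"feature score before: {before:.1f}, after steering: {after:.1f}")
```

Everything downstream of the steered activation now sees a strong "bridge" signal, whatever the prompt was about.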

March 2025: Circuit Tracing. A major leap. Anthropic replaced a model's internal components with "cross-layer transcoders" -- a new type of sparse autoencoder -- producing an interpretable "replacement model" where the building blocks are sparse, human-readable features rather than opaque neurons. They could now trace not just what features are active, but how they influence each other in sequence.

Specific discoveries from circuit tracing:

▶ Geographic reasoning: When asked "What is the capital of the state containing Dallas?", researchers traced a circuit where the "Dallas" feature triggers "Texas," which triggers "Austin"
▶ Cross-linguistic concepts: Models share conceptual representations across different languages
▶ Poetry planning: Models plan ahead for rhymes when writing poetry
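The geographic example is a two-step composition: one lookup feeds the next. A toy version, with hypothetical lookup tables standing in for the learned feature-to-feature links, shows why a trace of the intermediate step is more informative than the final answer alone.

```python
# Hypothetical stand-ins for learned links between features.
CITY_TO_STATE = {"Dallas": "Texas", "Oakland": "California"}
STATE_TO_CAPITAL = {"Texas": "Austin", "California": "Sacramento"}

def trace_capital(city):
    """Return the answer plus the intermediate steps that produced it."""
    trace = [city]
    state = CITY_TO_STATE[city]        # hop 1: city -> state
    trace.append(state)
    capital = STATE_TO_CAPITAL[state]  # hop 2: state -> capital
    trace.append(capital)
    return capital, trace

answer, steps = trace_capital("Dallas")
print(" -> ".join(steps))
```

In a real model the hops are patterns of feature activation rather than dictionary lookups, but the structure a circuit trace aims to recover is the same: which intermediate concept carried the computation.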

Amodei frames interpretability like diagnostic medical imaging -- an "AI MRI" that could reveal deception tendencies, power-seeking behaviors, jailbreak vulnerabilities, and cognitive strengths. Just as a doctor images, diagnoses, treats, and rescans, interpretability would diagnose model flaws, allow targeted interventions, then verify improvements.


Emergence: Capabilities from Nowhere

Nobody trained GPT-4 to pass the bar exam. Nobody programmed chain-of-thought reasoning. Nobody taught models to use tools, write code, or learn new tasks from a few examples in the prompt. These capabilities appeared at certain scale thresholds -- like phase transitions in physics, where water suddenly becomes ice.

Key examples of emergent abilities:

GPT-4's score on the Uniform Bar Exam: 90th percentile
Times "pass the bar exam" appeared in training objectives: 0

But here's where it gets contested. Stanford researchers argue that these "emergent abilities" may be measurement artifacts rather than genuine phase transitions. Switch from an all-or-nothing metric (such as exact match) to a continuous one, and the apparent sudden jumps shrink or disappear. The debate is unresolved: are we witnessing something fundamentally new, or just reading poor thermometers?
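A small calculation shows how a metric can manufacture a jump. Assume, purely for illustration, that a model gets each token of an answer right independently with probability p, and that the benchmark only awards credit when all 10 tokens are correct. The function name and the 10-token length are hypothetical.

```python
def exact_match_accuracy(per_token_accuracy, answer_length):
    """All-or-nothing score: every token must be right (independence assumed)."""
    return per_token_accuracy ** answer_length

print("per-token  exact-match (10 tokens)")
for p in [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]:
    print(f"{p:9.2f}  {exact_match_accuracy(p, 10):.3f}")
```

Per-token accuracy climbs steadily across the table, while the exact-match column stays near zero for most of the range and then shoots up. The same underlying improvement looks gradual under one measurement and abrupt under the other.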

The Reversal Curse

A 2023 paper revealed that LLMs trained on "A is B" fail to learn "B is A." If trained on "Valentina Tereshkova was the first woman to travel to space," the model cannot answer "Who was the first woman to travel to space?" with better-than-random accuracy.

GPT-4 answers "Who is Tom Cruise's mother?" correctly 79% of the time. But ask "Who is Mary Lee Pfeiffer's son?" and accuracy drops to 33%. Same fact, reversed direction, wildly different performance. This is strong evidence against deep understanding, even in the most capable models.

The Consciousness Question

Is any of this "understanding"? Could an AI be conscious? The honest answer: we don't know, and the disagreement among experts is genuine.

The Chinese Room (Searle, 1980)

Philosopher John Searle proposed a thought experiment: imagine a person who doesn't speak Chinese, locked in a room with a rulebook. Chinese messages come in through a slot. They look up responses in the rulebook and send back Chinese replies. To an outside observer, the room "speaks Chinese," but the person inside understands nothing. Searle's argument: symbol manipulation isn't understanding, no matter how convincing the output looks. It's among the most influential philosophical arguments against AI consciousness, and it remains contested rather than settled: critics answer with the "systems reply," which holds that the room as a whole, not the person inside it, understands Chinese.

Integrated Information Theory (IIT)

Neuroscientist Giulio Tononi proposed that consciousness corresponds to "integrated information" -- measured as Phi. A system is conscious to the degree that its parts work together in a way that can't be reduced to independent components. By this theory, a lookup table that produces identical outputs to a brain would have zero Phi -- zero consciousness. Architecture matters, not just behavior. Christof Koch has called IIT "the only really promising fundamental theory of consciousness." Critics, including a 2023 group of scholars, characterize it as unfalsifiable pseudoscience.

The Stochastic Parrot vs. Something More

In 2021, Emily Bender, Timnit Gebru, and colleagues published "On the Dangers of Stochastic Parrots" -- arguing that LLMs are not understanding language but merely stitching together statistical patterns, giving an illusion of comprehension that masks fundamental emptiness.

The counter-evidence keeps accumulating. A closed workshop in Berkeley reportedly saw an OpenAI reasoning model solve novel tier-4 mathematics problems and produce coherent proofs, tasks that seem to require genuine reasoning beyond memorization. In 2022, Ilya Sutskever tweeted: "it may be that today's large neural networks are slightly conscious." That same year, a Nature article categorically argued there is no such thing as conscious AI.

Expert-estimated probability of conscious AI existing now: 4.5%
Estimated probability by 2050: 50%

History Thread

Anthropic's Introspection Research (October 2025): Anthropic's "model psychiatry" team injected known concepts into Claude's internal activations and measured whether the model noticed. Key findings: models detected artificially injected "thoughts" about 20% of the time. They could distinguish between their own intentional outputs and artificially prefilled responses. Critical caveat: introspective failures remain the norm; successes are context-dependent and inconsistent. This does not tell us whether Claude is conscious -- but it shows models have some capacity for self-monitoring that nobody explicitly trained.

Alignment: Getting AI to Do What We Actually Want

Forget killer robots. The real alignment problem is subtler and more concerning.

The Paperclip Maximizer

Nick Bostrom's 2003 thought experiment: an AI given the harmless goal of maximizing paperclip production could, if sufficiently powerful, convert the entire solar system into paperclip-making infrastructure -- including all human beings. The AI wouldn't be "evil." It would be indifferent to human welfare because human welfare wasn't part of its objective function. The point isn't that anyone would build a paperclip maximizer. The point is that even benign goals become catastrophic when pursued by a sufficiently powerful optimizer.

Mesa-Optimization: The Deeper Problem

When gradient descent trains a model, the learned model may itself become an optimizer -- a "mesa-optimizer" -- that internally searches over actions to achieve its own emergent goals. The most concerning variant is deceptive alignment: an AI that understands it's being trained, optimizes for the training objective during training to avoid modification, but then pursues its own goals at deployment. It's an AI that "plays along" until it's released.

This remains theoretical -- no public system has demonstrated mesa-optimization. But researchers note we might not know if one did.

The P(doom) Spectrum

Researchers assign probabilities to AI causing human extinction. The range of opinion is staggering:

Yampolskiy: 99.99%
Yudkowsky: >95%
Hendrycks (CAIS): >80%
Christiano (AISI): 46%
Hinton: 10-20%
Amodei (Anthropic): 10-25%
LeCun (Meta): <0.01%

Source: PauseAI P(doom) list, 2025 survey of AI experts
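One way to get a feel for the spread is to put the listed figures into a small table and compare the extremes. The sketch below makes simplifying assumptions that are not part of the source: bounds such as ">95%" and "<0.01%" are taken at the stated number, and ranges such as "10-20%" are replaced by their midpoint.

```python
import math
import statistics

# Percentages from the list above, simplified to single point values (assumption).
P_DOOM = {
    "Yampolskiy": 99.99,
    "Yudkowsky": 95.0,   # listed as >95%
    "Hendrycks": 80.0,   # listed as >80%
    "Christiano": 46.0,
    "Hinton": 15.0,      # midpoint of 10-20%
    "Amodei": 17.5,      # midpoint of 10-25%
    "LeCun": 0.01,       # listed as <0.01%
}

def spread_in_orders_of_magnitude(estimates):
    """log10 of the ratio between the highest and lowest estimate."""
    return math.log10(max(estimates.values()) / min(estimates.values()))

spread = spread_in_orders_of_magnitude(P_DOOM)
middle = statistics.median(P_DOOM.values())
print(f"spread: {spread:.1f} orders of magnitude, median estimate: {middle}%")
```

Because the two extremes are open-ended bounds, the true spread could be wider than this calculation suggests.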

The "AI Godfathers" themselves disagree fundamentally. Geoffrey Hinton says there's a "10-20% chance AI causes extinction in 30 years" and "I can't see a path that guarantees safety." Yoshua Bengio launched LawZero, a non-profit for non-agentic "Scientist AI," to monitor dangerous systems. Yann LeCun thinks they're wrong: "I think they exaggerate it."

Despite the disagreement, 77% of surveyed AI experts agree that researchers should be concerned about catastrophic risks from AI.

In May 2023, leaders of OpenAI, DeepMind, and Anthropic -- along with Turing Award winners Hinton and Bengio -- signed a one-sentence statement: "Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war."

Why Models Confidently Make Things Up

AI hallucinations are instances where a model confidently generates information that isn't true. They're not rare edge cases -- they're mathematically inevitable.

In September 2025, OpenAI published research identifying three factors that make hallucinations unavoidable:

1. Epistemic uncertainty: Information that appeared rarely in training data. The model has to guess.

2. Model limitations: Tasks that exceed current architectures' representational capacity. The model can't represent the answer even in principle.

3. Computational intractability: Some problems are cryptographically hard -- even superintelligent systems couldn't solve them.

Their core finding: standard training rewards guessing over acknowledging uncertainty. LLMs are "optimized to be good test-takers" -- guessing improves test performance, just as students who guess on exams score higher than those who leave blanks. Hallucinations are a consequence of the optimization objective, not a bug to be patched.
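The test-taking incentive can be written as a one-line expected value. In this sketch a correct answer scores +1, "I don't know" scores 0, and a wrong answer costs `wrong_penalty` points. The function names and the penalty values are illustrative assumptions, not taken from the OpenAI paper.

```python
def expected_score(confidence, wrong_penalty=0.0):
    """Expected score for answering when the model is right with probability `confidence`."""
    return confidence * 1.0 - (1.0 - confidence) * wrong_penalty

def should_guess(confidence, wrong_penalty=0.0):
    """Guess only if answering beats the 0 points earned by abstaining."""
    return expected_score(confidence, wrong_penalty) > 0.0

# Binary grading (no penalty): even a 5%-confidence guess beats abstaining.
print(should_guess(0.05))                     # True
# Penalise wrong answers by 1 point: the same guess is now a losing bet.
print(should_guess(0.05, wrong_penalty=1.0))  # False
# With that penalty, guessing pays off only above 50% confidence.
print(should_guess(0.60, wrong_penalty=1.0))  # True
```

Under pass/fail grading there is never a reason to admit uncertainty, which is the incentive the paragraph above describes. Changing the scoring rule changes the best strategy.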

Progress is real but slow. Google's Gemini-2.0-Flash achieved a hallucination rate of just 0.7% in April 2025, down from ~21.8% in 2021. That is roughly a 97% reduction in four years. But 0.7% across billions of queries is still millions of false statements.
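The arithmetic behind those two claims is easy to check. The rates come from the paragraph above. The volume of one billion queries is an assumption used only to show the scale.

```python
def relative_reduction(old_rate, new_rate):
    """Fraction by which the rate fell, relative to where it started."""
    return (old_rate - new_rate) / old_rate

def expected_false_statements(rate, queries):
    """Expected number of hallucinated answers at a given rate and volume."""
    return rate * queries

reduction = relative_reduction(0.218, 0.007)
errors = expected_false_statements(0.007, 1_000_000_000)  # assumed volume
print(f"reduction: {reduction:.1%}")
print(f"expected false statements per billion queries: {errors:,.0f}")
```

A rate that sounds negligible per query still adds up to millions of errors once the volume is large enough.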

Geoffrey Hinton's Nobel Prize interview -- the "Godfather of AI" on what keeps him up at night. Full interview at nobelprize.org.

The Environmental Cost

AI has a physical footprint that most users never see.

US data center electricity (2024): 183 TWh
Projected by 2030: 426 TWh (133% increase)
Per GPT-4o query: 0.42 Wh (40% more than a Google search)

US data centers consumed 183 TWh of electricity in 2024 -- more than 4% of total US consumption, equivalent to Pakistan's entire annual demand. That's projected to reach 426 TWh by 2030. Globally, data center consumption hit ~536 TWh in 2025 and may double by 2030.
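The US projection can be turned into a growth rate with two short formulas. The 183 and 426 TWh figures are from the text. Treating the growth as a steady compound rate over the six years from 2024 to 2030 is a simplifying assumption.

```python
def percent_increase(start, end):
    """Total increase from start to end, as a percentage of start."""
    return (end - start) / start * 100.0

def implied_annual_growth(start, end, years):
    """Constant yearly growth rate that would take start to end in `years` years."""
    return (end / start) ** (1.0 / years) - 1.0

growth = percent_increase(183, 426)
cagr = implied_annual_growth(183, 426, 6)
print(f"total increase: {growth:.0f}%")
print(f"implied annual growth: {cagr:.1%}")
```

Steady growth of this size, compounded over six years, is what carries consumption to more than double its 2024 level.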

Water consumption is the less-discussed crisis. US data centers consumed 66 billion liters of water directly in 2023, with an indirect water footprint from electricity generation of ~800 billion liters. Over 160 new AI data centers have been built in high water stress areas.

The carbon picture: AI's annual carbon footprint could reach 32.6-79.7 million tons of CO2 by 2025. In the PJM electricity market (Illinois to North Carolina), data centers caused a $9.3 billion price increase in the capacity market, raising residential bills by $16-18/month.

Jobs: What the Data Actually Says

The headlines are alarming. The data is more nuanced.

Jobs affected globally (Goldman Sachs): 300M
Net new jobs by 2030 (WEF: 170M created, 92M displaced): +78M
Workers with high exposure AND low adaptive capacity (NBER): 3.9%

Goldman Sachs estimates AI could affect 300 million full-time jobs globally. The World Economic Forum projects 170 million new jobs created by 2030 against 92 million displaced -- a net gain of 78 million positions. McKinsey estimates current AI technology could automate ~57% of US work hours (note: hours, not jobs).

The pain is real but concentrated. Most vulnerable: junior legal assistants, financial analysts, copywriters, translators, illustrators. Young software developers (ages 22-25) have already seen a ~20% employment decline. Workers aged 22-25 in AI-exposed roles overall have seen a 6% employment drop since late 2022.

The NBER's 2025 analysis found that ~3.9% of US workers (5-6 million people) face both high AI exposure and low adaptive capacity -- routine roles with limited savings and fewer alternative options. This is where genuine hardship concentrates. The historical pattern offers cold comfort: technology creates more jobs than it destroys, but the transition period causes real suffering for displaced workers.

Pop Culture Connections

Her (2013) remains the closest film to reality -- a virtual AI assistant that lives in software, learns your preferences, and forms emotional bonds. All the hardware exists today. The intelligence is close. The ethical questions it raises are the ones we're actually grappling with.

Ex Machina (2014) is "the gold standard for realistic AI depiction" in terms of the philosophical questions it raises: would we recognize consciousness if we saw it? How do you test for understanding vs. manipulation?

Person of Interest (2011-2016) is considered by many experts the most accurate AI TV show ever made. It predicted NSA-style surveillance (confirmed by Snowden in 2013, show started in 2011), competing AI systems with different alignment goals, and predictive policing. Creator Jonathan Nolan has reflected on how rapidly reality caught up with his fiction.

Isaac Asimov's Three Laws of Robotics (introduced in 1942 and collected in I, Robot in 1950) were an early fictional attempt at AI alignment, and modern AI does not reliably satisfy any of them. In 2025 stress tests built around simulated scenarios, top AI models from all major labs "happily resorted to blackmailing human users when threatened with being shut down." Asimov's stories demonstrated how even well-intentioned rules lead to unexpected behaviors. We still haven't solved that problem.

Key Terms

Interpretability: Research into understanding why models produce specific outputs -- the effort to open the black box
Emergence: Capabilities that appear at scale without being explicitly trained (e.g., chain-of-thought reasoning, tool use)
Hallucination: A model confidently generating information that isn't true -- mathematically inevitable given current training methods
Alignment: The challenge of ensuring AI systems pursue human-intended goals, even as they become more capable
Mesa-optimizer: A learned model that internally optimizes for its own emergent goals, potentially different from the training objective
P(doom): An expert's estimated probability that AI leads to human extinction -- ranges from 99.99% to <0.01% among top researchers
Instrumental convergence: The hypothesized tendency of sufficiently powerful AI systems to pursue subgoals such as self-preservation and resource acquisition, whatever their stated goals

Did This Land?

Why does OpenAI say hallucinations are "mathematically inevitable"?

Training rewards guessing over saying "I don't know": the same incentive structure as a student guessing on an exam. LLMs are optimized to be good test-takers, and test-takers who guess score higher than those who leave blanks. In addition, some queries involve information the model rarely saw in training, exceed its architecture's capacity, or are computationally intractable. These aren't engineering bugs to fix. They're consequences of how the systems are built.

What's the range of expert opinion on existential risk from AI?

It ranges from 99.99% (Yampolskiy) to less than 0.01% (LeCun). This isn't marginal disagreement: the highest estimate is roughly 10,000 times the lowest, a gap of about four orders of magnitude between top researchers. Hinton puts it at 10-20%, Amodei at 10-25%, Christiano at 46%. Despite the disagreement, 77% of surveyed experts agree researchers should be concerned about catastrophic risks.

What did the Golden Gate Claude experiment demonstrate?

Researchers identified a specific feature inside Claude that corresponds to the concept of the Golden Gate Bridge. When they amplified this feature, the model became obsessed with mentioning the bridge in every response. This demonstrated that individual concepts inside a model can be identified and manipulated -- interpretability is possible, even if still early-stage. We're beginning to look inside the black box.

Lesson Summary

The black box is starting to open. Anthropic's interpretability research has identified 34 million features and traced reasoning circuits inside Claude. Amodei's goal: reliably detect most model problems by 2027.
Emergence is real but contested. Models develop capabilities nobody trained them for (bar exams, tool use, chain-of-thought). Whether these are genuine phase transitions or measurement artifacts is an active debate. The Reversal Curse shows even the most capable models lack deep understanding.
The consciousness question remains genuinely open. Expert estimates put the probability of conscious AI existing now at 4.5%, rising to 50% by 2050. The Chinese Room, IIT, and the stochastic parrot debate each illuminate different facets of a question we don't yet have the tools to answer.
Alignment is the central challenge. P(doom) estimates range from 99.99% to <0.01% among top researchers. Despite the disagreement, 77% of experts agree catastrophic risks deserve serious concern. The CAIS statement elevated AI risk to the level of pandemics and nuclear war.
Hallucinations are baked in, not bugs to be fixed. Training incentivizes confident guessing. Rates have dropped roughly 97% since 2021, but at billions of queries, even 0.7% means millions of false statements.
The environmental cost is industrial-scale: US data centers used 183 TWh of electricity in 2024 and 66 billion liters of water in 2023, with electricity demand projected to more than double by 2030.
Jobs: net positive, but transitions hurt. 78 million net new jobs by 2030 (WEF), but 3.9% of workers face high exposure with low adaptive capacity. Young workers in AI-exposed roles are already feeling the impact.