DIGEST
Proofs Are Going to Get Cheap: AI, Formal
Verification, and the Future of Trust
Speaker: Mike (Principal, Galois) | Venue: University of Washington | Source: Academic talk transcript | Date: 2026
BOTTOM LINE UP FRONT
AI is on track to make mathematical proof radically cheaper. Historically,
formal proof has required expensive human labor; rapidly improving AI capabilities
in mathematics and Lean theorem proving are poised to change the economics
fundamentally.
Cheap proofs enable a new trust architecture for AI systems. Formal
verification can serve as an "oversight technology" — letting untrusted AI systems
produce solutions while trusted proof checkers verify correctness, separating search
from verification.
Three hard problems remain. Even with cheap proofs, the field faces major
obstacles: building the Lean infrastructure to model real systems, bridging the gap
between real-world code and formal models (world modeling), and the deepest
challenge of all — writing correct specifications for what we actually want.
Specification is the enduring bottleneck. The hardest problem is not proof
generation but articulating precisely what "correct" or "secure" means for a given
system. This is the challenge least likely to be solved by AI capability
improvements alone.
The community should think big now. If proofs become cheap on short timelines,
the formal methods community needs to prepare infrastructure, tooling, and
conceptual frameworks to exploit this shift — rather than being caught flat-footed.
The Trust Problem: Why Proof Matters for AI
AI systems generate outputs through opaque processes. For trivial tasks, humans can spot
errors — a wrong arithmetic answer is immediately obvious. But for consequential tasks like
building secure systems, the opacity becomes dangerous. An AI-generated VPN might look
plausible while quietly exfiltrating data. As Leopold Aschenbrenner observed in Situational
Awareness, small AI systems produce interpretable outputs; large, capable ones produce
outputs whose correctness cannot be easily assessed.
The fundamental architecture that resolves this tension separates search from checking. On
one side: heuristic, powerful, opaque processes that find solutions quickly — and are
allowed to be wrong. On the other side: slow, predictable, auditable processes that verify
correctness — and must be right. This maps onto a deep asymmetry in computer science
(closely related to the P vs. NP distinction): finding answers is often much harder than
checking them.
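The asymmetry can be sketched with subset-sum, a classic NP problem (illustrative code, not from the talk): verifying a proposed certificate takes one pass over it, while searching for one may examine exponentially many candidate subsets.

```python
from itertools import combinations

def verify(nums, target, certificate):
    """Cheap: one pass over the certificate to check membership and sum."""
    return all(c in nums for c in certificate) and sum(certificate) == target

def search(nums, target):
    """Expensive: may enumerate all 2^n subsets before finding an answer."""
    for r in range(len(nums) + 1):
        for combo in combinations(nums, r):
            if sum(combo) == target:
                return list(combo)
    return None

nums = [3, 34, 4, 12, 5, 2]
cert = search(nums, 9)        # the opaque, heuristic side: allowed to be slow or wrong
assert verify(nums, 9, cert)  # the auditable side: fast, and the only part we must trust
```

The checker stays small and transparent no matter how the search side finds its answer, which is exactly the division of labor the oversight architecture relies on.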
Core insight: AI should function as an untrusted source of solutions. Trust should
reside in a separate verification layer that does not depend on the AI's reliability. The
AI can be powerful and opaque; the checker must be transparent and correct.
This pattern applies across domains: AI-generated theorems need proof that they are true, AI-
built systems need proof that they are secure, and AI actions need verification that they are
permitted and harmless. The goal is to minimize what must be taken on trust from the neural
network itself.
AI's Mathematical Trajectory
AI mathematical capability has improved dramatically and consistently. The METR time-
horizon benchmark shows exponential growth in task complexity: GPT-2 could handle tasks
a human completes in four seconds; Claude 4.6 handles tasks requiring roughly ten hours of
human effort. This growth has been steady on a log scale, spanning many model generations.
Capability jumps can be sudden and dramatic. Anthropic's unreleased Mythos model
demonstrated a ~100× improvement over Opus 4.6 in autonomously finding JavaScript
shell exploits — going from 0.3% to 72% success rate. Such discontinuous jumps
underscore the unpredictability of capability frontiers.
On the Frontier Math benchmark — graduate-level math problems designed by
mathematicians, kept off the public internet, structured with compact integer answers for
easy checking — performance has risen from near-zero in July 2024 to 45–50% by early
2026 across difficulty tiers. These results represent something close to novel knowledge
generation, since the problems are designed to prevent shortcut solutions.
An important nuance emerged from audience discussion: Frontier Math's easy-to-check
design (small integer answers) is what makes it a clean benchmark. But this structure also
limits what it measures. Real mathematical progress requires proofs, not just answers — and
proof is where the harder, more consequential challenge lies.
Formal Proof as the Verification Layer
Mathematical proof has a uniquely valuable property: it is trustless. A valid proof does not
require knowing or trusting the person who produced it. The proof itself is an artifact that
can be checked independently. This makes proof an ideal verification technology for AI-
generated claims.
Human-written proofs, however, are difficult to check mechanically. Natural-language
proofs rely on implicit reasoning steps, shared mathematical intuition, and social trust
processes. Andrew Wiles' proof of Fermat's Last Theorem took about two years (1993–1995) to move from initial announcement to accepted publication, including the repair of a substantive gap. If AI
mathematicians generate proofs at scale, humans cannot check them fast enough — the
bottleneck shifts from proof generation to proof verification.
Interactive Theorem Provers and Lean
Interactive theorem provers solve this by encoding proofs as code that a mechanical checker
can verify. Lean is the leading tool in this space — the culmination of decades of work in
interactive theorem proving dating back to the 1970s. In Lean, a proof is a program: each
step can be individually confirmed, the overall structure is legible and auditable, and
correctness is guaranteed by Lean's kernel rather than by human judgment.
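As a minimal illustration (Lean 4 syntax, standard-library lemma names), a proof really is a program: the term supplied after `:=` is the proof object, and the kernel type-checks it against the stated proposition.

```lean
-- Commutativity of addition on Nat, proved by applying a library lemma.
-- `Nat.add_comm a b` is the proof object; Lean's kernel checks that its
-- type matches the claimed proposition.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A small inequality, built by combining two hypotheses; each step is
-- individually machine-checked rather than socially accepted.
theorem sum_le (a b c : Nat) (h1 : a ≤ c) (h2 : b ≤ c) :
    a + b ≤ c + c :=
  Nat.add_le_add h1 h2
```

No reader needs to trust whoever wrote these terms; they only need to trust the kernel that checks them.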
AI is improving at Lean theorem proving specifically, not just informal mathematics. On one
broad benchmark drawn from the Lean community, Claude Opus 4.5 achieved 7.8% on a
single pass and roughly 15% on two passes — respectable given the range of difficulty.
Anecdotally, AI Lean capability has gone from terrible nine months earlier to respectable, with momentum still building. Approaches that iterate over multiple attempts achieve significantly higher scores.
Commercial signal: Companies including Harmonic, Axiom, Logical Intelligence, and
Math Inc. have built their business models around AI mathematical solvers targeting
Lean, betting that this technology will prove valuable beyond pure mathematics.
Formal Oversight: Proofs as an AI Governance Technology
If proofs become cheap, they unlock a powerful governance pattern: formal oversight. In
this model, a specification is written in Lean describing what a system should do. An AI
produces both a solution and a proof that the solution meets the specification. Lean then
independently verifies the proof.
Formal oversight is attractive because it scales with AI capability. As AI systems become
more powerful and their outputs harder for humans to review (imagine a 600,000-line pull
request), the same increasing capability makes the AI better at generating proofs of
correctness. The two trends reinforce each other rather than competing.
Well-known programming language techniques can encode properties like memory safety,
process isolation, encryption correctness, and functional specifications into Lean-checkable
forms. The infrastructure to do this is not complete, but the theoretical foundations exist.
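A toy sketch of the oversight pattern (hypothetical names, Lean 4 syntax; `omega` is Lean's built-in linear-arithmetic tactic): a human-readable specification, an untrusted implementation, and a kernel-checked proof tying them together.

```lean
-- Specification: the result bounds both inputs and equals one of them.
def MaxSpec (f : Nat → Nat → Nat) : Prop :=
  ∀ a b, a ≤ f a b ∧ b ≤ f a b ∧ (f a b = a ∨ f a b = b)

-- Untrusted implementation (imagine it was produced by an AI).
def myMax (a b : Nat) : Nat := if a ≤ b then b else a

-- The proof is the artifact we trust; only the spec needs human review,
-- and only Lean's kernel needs to be correct.
theorem myMax_meets_spec : MaxSpec myMax := by
  intro a b
  unfold myMax
  split <;> omega
```

Here the AI could have produced both `myMax` and the proof script; Lean's independent check is what turns "the model says it's right" into "it is right, relative to this spec."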
The Three Hard Problems
Problem 1: Proof Cost (Possibly Solved)
Formal proof has historically been prohibitively expensive. The seL4 verified microkernel
illustrates the ratio: 8,500 lines of C code required 165,000 lines of Isabelle proof — and the
current total is around a million lines of handwritten proof. Changing the C code forces
corresponding proof changes, making the technology extremely rigid and costly. Only the
most critical security infrastructure (like seL4) could justify this investment.
AI-assisted proof may dramatically shift this cost curve. If the AI can generate and maintain
proofs, the "habitable zone" of projects where formal verification makes economic sense
expands enormously — potentially from a tiny sliver to a large category of software
engineering work.
Problem 2: World Modeling
Verification requires a formal model of the system under analysis. Translating real Rust or C
code into a Lean model requires trusting the translation tool, the language semantics model,
the hardware model, the OS model, and the library models. Each is a potential source of
error.
Trust boundaries have bugs. A buffer overflow was discovered in the Lean
runtime itself shortly before this talk — found, notably, by Claude during AI red-teaming. Lean's kernel is
highly trustworthy, but nothing is perfect. Wherever the verification boundary is
drawn, bugs will cluster just outside it.
Microarchitectural attacks (Meltdown, Spectre) exemplify the world-modeling problem: a
verification that doesn't model timing channels provides no assurance against timing-based
attacks. The model's assumptions must be clearly communicated so that users understand
exactly what has and hasn't been verified.
Problem 3: Specification (The End Boss)
Specification is the deepest and most durable challenge. A formal specification defines what
"correct" means — but where does it come from?
If the AI writes the specification, it is grading its own homework. If humans write it, they
must understand what they want with mathematical precision — which is often harder than
writing the code itself. The seL4 specification is 5,000 lines, which must be reviewed and
understood to trust the verified system. The verification hasn't eliminated the reading burden;
it has shifted it from 8,500 lines of C to 5,000 lines of specification.
Specification through proof: An audience member made a critical observation — in
practice, humans refine specifications by working on the proof. A statement that's too
easy to prove signals it's too weak; one that's impossible to prove signals it doesn't
match the code. If AI handles the proof and humans handle only the spec, this iterative
refinement loop breaks. New workflows must be invented.
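The too-easy-to-prove failure mode is simple to exhibit. A hypothetical "sorting" specification that demands only ordered output, and omits that the output be a permutation of the input, is satisfied by a function that discards the data entirely (Lean 4 sketch, standard-library names):

```lean
-- Too weak: the spec constrains the output's order but never ties the
-- output back to the input.
def WeakSortSpec (sort : List Nat → List Nat) : Prop :=
  ∀ xs, (sort xs).Pairwise (· ≤ ·)

-- A useless "sort" that returns the empty list satisfies it trivially.
-- The triviality of this one-line proof is itself the warning sign.
theorem bogus_sort_ok : WeakSortSpec (fun _ => []) :=
  fun _ => List.Pairwise.nil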
Most real-world systems have messy, informal, or nonexistent specifications. RFCs
approximate formal specs; PowerPoint decks sometimes serve as the sole specification
document. For many programs, the code itself is the specification — GCC does what GCC
currently does. Work on PDF specifications at DARPA's SafeDocs project revealed that even
well-specified formats have implementations that diverge from the spec in countless edge
cases.
Formally specifiable systems — crypto libraries, operating system kernels, compilers, cloud
services — are the "plumpest cherries on the tree." Large categories of important software
(browsers, word processors, self-driving cars) may simply resist clean formal specification.
A formal spec for Chrome or Microsoft Word does not obviously exist.
The Ambitious Vision: Verified Agent Containment
The aspirational goal is a fully verified AI agent containment stack: an AI agent with a
verified permission system, running through a verified Docker container, through a verified
OCI layer, through a verified operating system with verified system calls, through a verified
Linux kernel, all the way down to verified hardware. Every layer would carry a formal proof
of its security properties.
This vision requires massive infrastructure that does not yet exist: handling real system code
at scale, eliciting specifications for Docker-level, kernel-level, and internal kernel interfaces,
and AI assistance at every layer. The project may not be feasible, but if proof truly becomes
cheap, it represents the kind of ambitious target worth working toward.
Implications & Connections
The chess analogy is instructive but imperfect. AI comprehensively surpassed
humans at chess, and something similar may happen in mathematical proof. But chess
has a fixed, fully specified rule system. Mathematics — and especially the
specification problem — involves open-ended judgment about what questions to ask
and what properties matter. The analogy holds for proof generation but breaks down
for specification.
Proofs are a training signal, not just an output. The verifiability of proofs creates
clean training data for AI systems. Unlike natural-language tasks where correctness is
ambiguous, a proof either checks or it doesn't. This makes mathematical reasoning a
domain where AI can improve rapidly through self-play-like feedback loops.
The specification problem may be the defining challenge of AI governance. If AI
systems can prove they meet any given specification, the entire weight of trust shifts to
whether the specification captures what humans actually want. This is a restatement of
the alignment problem in formal-methods language — and it will not be solved by AI
getting better at proof.
Further Exploration
Build in Lean. For tool builders enthusiastic about formal methods, contributing to the
Lean ecosystem is high-value work. The infrastructure has critical missing pieces, and
the "Lean theory of everything" — porting all formal verification tooling into Lean —
is a viable long-term strategy.
Invest in specification technology. For researchers seeking problems that won't be
rendered obsolete by AI capability improvements, specification — how to formally
state what we want from complex, messy, real-world systems — is the area with the
longest shelf life and the greatest need for deep, thoughtful work.
Monitor the capability curves. The exponential improvement in AI mathematical
ability could plateau or could continue. Either outcome demands preparation. If proofs
become cheap on short timelines, the formal methods community must have
infrastructure ready to exploit the shift.