SUMMARY
Proofs Are Going to Get Cheap: AI, Formal
Verification, and the Future of Trust
Speaker: Mike (Principal, Galois)  |  Venue: University of Washington  |  Source: Academic talk transcript  |  Date: 2026
AI and Trust
AI presents fundamental trust problems. For simple tasks, humans can verify AI outputs
directly — a wrong arithmetic answer is obviously wrong. But for consequential tasks like
building a secure VPN, correctness is not self-evident. The AI produces something through
an opaque process, claims it works, and users have no independent way to verify the claim.
The ideal architecture separates the search for solutions (heuristic, fast, opaque, allowed to
be wrong) from the verification of solutions (predictable, slow, auditable, must be correct).
AI should function as an untrusted source of outputs, with a separate mechanism providing
trusted guarantees of correctness. This pattern applies to AI-generated theorems, AI-built
systems, and AI actions alike.
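The search/verification split described above can be sketched in a few lines. This is an illustrative toy, not anything from the talk: the "untrusted" component stands in for an opaque AI and may return anything, while the checker is small, deterministic, and the only part that must be trusted.

```python
# Toy instance of the pattern: untrusted search, trusted verification.
# The solver claims to return the prime factorization of n; the checker
# accepts the claim only if it can independently confirm it.

def untrusted_search(n: int) -> list[int]:
    """Opaque heuristic (stand-in for an AI): claims to factor n."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def trusted_check(n: int, factors: list[int]) -> bool:
    """Small auditable verifier: multiply back, and test each factor for primality."""
    def is_prime(p: int) -> bool:
        return p >= 2 and all(p % q for q in range(2, int(p ** 0.5) + 1))
    prod = 1
    for f in factors:
        prod *= f
    return prod == n and all(is_prime(f) for f in factors)

claim = untrusted_search(360)
assert trusted_check(360, claim)  # accept only what the checker confirms
```

Note the asymmetry: the solver may be arbitrarily clever and wrong, but a bad answer cannot get past the checker, which is short enough to audit by hand.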
AI Is Getting Good at Math
AI mathematical capability has improved on a consistent exponential trajectory. The METR
time-horizon benchmark shows that model capability — measured by the difficulty of tasks
AI can complete, calibrated against how long those tasks take humans — has grown steadily
from four-second tasks (GPT-2) to ten-hour tasks (Claude 4.6).

Capability can jump discontinuously. Anthropic's unreleased Mythos model achieved
a ~100× improvement over Opus 4.6 on autonomous JavaScript exploit discovery, going
from 0.3% to 72% success. On the Frontier Math benchmark (graduate-level, closed
math problems with compact integer answers), scores have risen from near-zero in July
2024 to 45–50% by 2026.
Frontier Math is structured with hard questions and easy-to-check answers, which makes it a
clean benchmark but limits what it demonstrates. An audience member pointed out that the
problems lean toward enumerative geometry over finite fields — narrower than "math"
broadly, but more interesting than trivial computation. The real challenge is that most hard
math questions require proofs, not just compact answers.
Mathematical Proof and Its Limitations
Proof is valuable because it is trustless: the proof itself is an independent artifact that can be
checked regardless of who produced it. But human proofs are hard to check mechanically.
Andrew Wiles' proof of Fermat's Last Theorem took six years from initial presentation to
broad acceptance, including the repair of substantive errors. AI-generated proofs at scale
would overwhelm human checking capacity.
Meanwhile, AI is already producing novel mathematical results. Recent solutions to Erdős
problems have impressed legitimate mathematicians — these appear to involve genuinely
novel results, not just rediscoveries of existing knowledge. But these results are written in
informal human mathematics, which cannot be mechanically verified.
Formalization with Lean
Interactive theorem provers like Lean allow proofs to be written as code and checked
mechanically. Lean represents the culmination of decades of work in this space (dating to the 1970s). A Lean proof's structure is legible and auditable, its individual steps are confirmable, and its correctness is guaranteed by the Lean kernel rather than by human review.
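A minimal example of what this looks like in practice (illustrative, not from the talk). The statement is explicit, the proof is an artifact, and acceptance is decided by the Lean kernel; `Nat.add_comm` is a lemma from the Lean 4 standard library.

```lean
-- A trivial theorem, stated and proved as code. If the proof term did not
-- actually establish the statement, the kernel would reject the file.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```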
AI is getting better at writing Lean proofs specifically. On a broad benchmark from the Lean
community, Claude Opus 4.5 scored 7.8% on a single pass and ~15% on two passes.
Iterative approaches perform significantly better. Anecdotally, AI Lean capability has
progressed from terrible (nine months ago) to respectable, with continued momentum.
Commercial investment confirms the trend: companies like Harmonic, Axiom, Logical
Intelligence, and Math Inc. have built business models around AI mathematical solvers
targeting Lean.
The chess analogy: There was a time when humans were better than AI at chess, a
period when they were comparable, and now AI is comprehensively and permanently
superior. The same trajectory may apply to mathematical proof — the capability curves
look similar.
Formal Oversight: Using Proofs to Govern AI
If proof becomes cheap, it enables formal oversight: writing a specification in Lean for
what a system should do, having the AI produce both a solution and a proof of correctness,
and letting Lean verify the proof. Well-known programming language techniques can encode
properties like memory safety, process isolation, and encryption correctness.
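A minimal sketch of the formal-oversight pattern, with illustrative names throughout: a human writes the specification, an (imagined) AI produces the candidate implementation and the proof, and Lean's kernel decides whether the proof goes through. The `omega` tactic used here ships with recent Lean 4.

```lean
-- Specification, written by the human: the function doubles its input.
def Spec (f : Nat → Nat) : Prop := ∀ n, f n = 2 * n

-- Candidate implementation, as an untrusted AI might produce it.
def candidate (n : Nat) : Nat := n + n

-- Proof obligation: the candidate meets the spec. The kernel, not the
-- human, decides whether this is accepted.
theorem candidate_meets_spec : Spec candidate := by
  intro n
  show n + n = 2 * n
  omega
```

The same shape scales up: replace `Spec` with a memory-safety or isolation property and `candidate` with real system code, and the human's review burden collapses to reading the specification.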
Formal oversight scales with AI capability. As AI systems grow more powerful and their
outputs harder for humans to review, the same capability improvements make AI better at
generating correctness proofs. The two trends reinforce each other.

Problem 1: Proof Cost
Formal proof has been prohibitively expensive. The seL4 verified microkernel required 165,000 lines of Isabelle proof for 8,500 lines of C, roughly twenty lines of proof per line of code, a ratio that makes verification uneconomical for nearly all software. The current seL4 proof totals roughly a million lines of handwritten proof, and changes to the C code force corresponding proof changes.
AI-assisted proof may shift this cost curve dramatically, expanding the "habitable zone" of
projects where verification is economically viable from a tiny sliver to a broad category.
Problem 2: World Modeling
Verification requires translating real system code into formal models, and trusting the
translation. The model must correctly represent language semantics, hardware behavior, OS
interactions, and library behavior. Microarchitectural attacks (Meltdown, Spectre) illustrate
the gap: a model that omits timing channels provides no assurance against timing-based
attacks.
Even the checker has bugs. A buffer overflow was found in the Lean runtime this
week — discovered by Claude during AI red-teaming. Lean's kernel is highly
trustworthy, but wherever the verification boundary is drawn, vulnerabilities will exist
just beyond it.
Problem 3: Specification
Specification is the deepest challenge. A specification defines what "correct" means, but its
source is unclear. The AI cannot write its own specification without marking its own homework. Humans must understand what they want with mathematical precision, which is often harder than writing the code.
In practice, humans develop correct specifications by working on proofs: a statement that's
too easy to prove is too weak; one that's impossible to prove doesn't match the code. If AI
generates proofs and humans only write specs, this refinement loop is broken.
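The "too easy to prove is too weak" point can be made concrete with a standard example, sketched here with illustrative names (`List.Pairwise` is in Lean 4 core; `List.Perm` is available in recent core, or via Batteries/Mathlib). A spec demanding only sorted output is satisfied by a function that discards its input; the spec only becomes meaningful once refined.

```lean
-- Too-weak spec: the output must merely be sorted.
def WeakSpec (f : List Nat → List Nat) : Prop :=
  ∀ xs, List.Pairwise (· ≤ ·) (f xs)

-- A cheating "implementation" that throws the input away.
def cheat (_ : List Nat) : List Nat := []

-- The weak spec is trivially provable for the cheat: the empty list is sorted.
theorem cheat_passes : WeakSpec cheat := by
  intro xs
  exact List.Pairwise.nil

-- Refined spec: the output must also be a permutation of the input,
-- which rules out the cheat.
def StrongSpec (f : List Nat → List Nat) : Prop :=
  ∀ xs, List.Pairwise (· ≤ ·) (f xs) ∧ (f xs).Perm xs
```

Discovering that `WeakSpec` is the wrong statement is exactly the kind of insight the proof-writing loop provides, and exactly what is lost if humans write specs without ever attempting the proofs.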
Most real-world software lacks clean formal specifications. RFCs approximate them;
sometimes a PowerPoint deck is the only spec document. For many programs, the code itself
is the spec. Even well-specified formats like PDF have implementations that widely diverge
from the specification. The formally specifiable systems — crypto libraries, kernels,
compilers — are the easiest cases, not the general case.
The Vision and What's Missing
The aspirational target is a fully verified AI agent containment stack: verified permissions,
Docker containers, OCI layer, OS system calls, Linux kernel, and hardware — every layer
carrying a formal proof. This requires massive infrastructure that does not exist today:
handling real system code, eliciting specifications at every architectural layer, and pervasive
AI assistance.
Proofs are going to get cheap. Proof technology is very useful. But critical infrastructure is
missing from the Lean ecosystem, and specification technology — how to formally state
what complex real-world systems should do — is the problem most in need of ambitious,
deep work. It is also the problem least likely to be solved simply by AI becoming more
capable.