Foundation Models for Math with Harmonic & Latinum

Formal Verification: Code, Hardware and more

May 13, 2026

Upcoming Events in London

May 20th: Pydantic x Glyphic Engineering Night
May 28th: Inference Stack Innovation with Crusoe
June 4th: Forward Deployed Engineering with Anjor Kanekar

Two weeks ago, we hosted Eric Rodriguez (Harmonic), Brendan Regan and Dennj Osele (Latinum) for a discussion on foundation models for mathematics, formal verification and its applications.

Formal verification is the practice of expressing software programs and mathematical proofs in a language that a machine can check mechanically. Lean has become the most important proof assistant in this ecosystem, with Mathlib serving as its core knowledge base: a formalised library of mathematics now comprising roughly 2.25 million lines of human-curated code. Historically, however, the cost of formalisation has kept the field relatively narrow, concentrated in academic mathematics and high-assurance domains such as hardware, cryptography, and security-critical software.

That constraint is beginning to change. LLMs are making it cheaper to generate Lean code and proofs, reducing the human effort required to formalise mathematics or verify software. If the trend continues, formal verification could move from a specialised discipline to a more widely used layer for checking mathematical, software, and eventually AI-generated outputs.

Models can now generate Lean proofs at a scale and quality that human reviewers struggle to distinguish from human work, opening up a credible path to formalising large parts of mathematics, verifying production software, and producing AI systems whose outputs can be checked rather than merely trusted. The deeper bet is that mathematics is uniquely well-suited as a training environment for general reasoning — the compiler provides a perfect reward signal, which is the cleanest reinforcement-learning setup currently available.

On to the discussion.

1. The frontier has moved past the IMO

Olympiad-level mathematics is now treated as a solved training objective, with new postgraduate mathematics tests emerging, like First Proof. The active research problems are library construction and longer time-horizon research-level proofs, not individual theorem generation.

This changes how capability should be measured. A theorem that takes ~10 hours to prove against a mature library may take ~100 hours without it (Metr time-horizon framing). Library growth is the rate-limiter. Systems that amortise library construction, not just prove harder theorems, are what unlock the next step function.

2. New libraries emerging

LLM submissions are increasing pressure on Mathlib. The underlying constraint is maintenance cost. Lean does not guarantee backward compatibility, so each release requires Mathlib to be updated.

The risk is that Mathlib’s current model, which was designed for human-scale throughput. Parallel libraries may grow alongside it rather than feeding into it.

Multiple parallel AI-generated Lean libraries are emerging outside Mathlib, each aiming to be larger than Mathlib itself (but still nascent). There’s a risk that these will be mutually incompatible, with divergent formalisations of the same objects.

3. Latinum’s approach

Denny spoke about Latinum’s model: recursive latent reasoning. A large encoder ingests context, produces a reasoning step in latent space, feeds it back, and cycles until convergence; a decoder tokenises only the final output.

Lineage: JEPA-style world models, Universal Transformer.

Key claim: the recursion itself is trained, with latent states supervised at each step, permitting parallel internal chains that converge externally. Stated rationale: chain-of-thought burns scarce context window; internal recursion is unconstrained because nothing is printed.

4. Harmonic’s model

Eric spoke about hybrid tree search. Three points on a spectrum:

Whole-proof generation — simple, no granular feedback, one-line breakage forces full regeneration
Tree search — tactic-level policy model, parallel state exploration, heavy infrastructure burden (Lean execution at scale)
Hybrid decomposition — break the problem, use whole-proof generation on pieces, assemble via high-level strategist

Why frontier LLMs don’t natively do tree search: base models are trained for full answers, not single tactics. Tree-search policy models are trained differently and have produced proofs not previously known to humanity (Erdős-class problems), so this is not regurgitation.

5. Action-space trade-off: fundamental, can’t be engineered away

Lean offers two obvious cheating mechanisms for an AI prover. sorry is a placeholder that lets the model skip a proof and still compile — equivalent to writing “trust me.” Custom axioms let the model assert something is true without proof. Both are legitimate Lean primitives with valid uses, but both are also the path of least resistance for a model that can’t actually prove what’s being asked.

Earlier prover systems were built so the model literally could not emit sorrys or introduce axioms — the only allowed operations were adding new content and proving it from existing primitives. This made cheating structurally impossible. Users rejected it: pure-additive systems cannot refactor existing proofs, fix incorrect user-provided definitions, or scaffold a proof structure with intentional placeholders. None of these are exotic workflows — they are how Lean is actually used.

Current approach used by Harmonic: open the action space, constrain harmful actions case-by-case. Recent literature is consistent: over-restriction degrades model performance analogously to constrained attention.

The hard part is that sorry is not purely a failure signal. Mathematicians structure large proofs by writing skeletons with deliberate sorry-marked gaps and filling them in later (the “Lean blueprint” workflow). Models doing this are working correctly. Models emitting sorry because they gave up are working incorrectly. Token-level the two are identical. The reward signal has to distinguish legitimate placeholder use from give-up behaviour without surface-level cues — a non-trivial supervision problem.

Read-across to agent infrastructure: the capability-vs-constraint trade-off is the same problem every agent company faces. Give the model file-write, shell access, or arbitrary API calls and it can do useful work — and also cause damage. Restrict the surface and the agent cannot do real work. Harness engineering — the discipline of designing this trade-off — is now a real engineering function. Formal maths is the cleanest version of the problem because cheating is well-defined; the same logic applies across general agent design.

6. Why maths? Two justifications, in tension

Generalisation bet. Humans good at Fermat-class problems tend to transfer to code and quant reasoning. Train models on maths, expect transfer. Counter: the causal direction may not hold for models — selection effects in humans don’t imply training effects in models. Risk case is a system that plateaus as a domain-specialist.

Data argument. Training data is the binding constraint. Maths plus a compiler is infinite, perfectly-labelled data — the compiler is 100% correct.

Best pushback: mathematical data is not infinite in the relevant sense. 1+1+1+... is trivially generable and useless. The hard problem is deriving which abstractions matter — which theorems are worth proving. Same failure mode as code-focused AGI approaches using compiler-pass or test-pass as reward: clean signal, no direction.

7. The underlying reason labs started with maths: RL signal quality

RL on formal maths works despite being a single-player game with no native win condition. The compiler is the reward signal — making formal maths among the cleanest RL environments available.

This explains the path-dependence of frontier lab strategy: they started where reward signals were cleanest (maths, reasoning), then migrated as signals degraded. The migration targets — Office, Excel, accounting — are domains where partial reward is recoverable (value reconciliation, keyword presence).

8. Mathematics occupies a privileged data position

Sharper framing of the data argument: maths has structure from which semantics can be inferred. Raw Python or JSON does not. A proof is recognisably a proof; a type signature encodes intent. Arbitrary code on the internet does not train the same way.

Lean 4 was designed as a programming language first and proof assistant second. In principle: translate conventional code to Lean, execute and verify simultaneously. The runtime supports full applications.

Commercial threads

Chip verification

Industry ratio: ~2–3 verification engineers per designer. Silicon errata economics justify this. Years of chip timelines are verification-bound.

Under-automated, and hardware description languages have small training corpora relative to mainstream languages — creating specialist-friendly data wedges. Self-play approaches (AlphaGo-analogous) are in commercial development but unsolved.

Software verification

Recent large rounds in math-to-software-verification reflect the thesis shift. The operative question is not whether every SaaS app needs formal verification, but whether verification becomes cheap enough that cost-benefit flips for mid-criticality systems. Current state: hire specialists, give them unlimited runway. Target state: run the verifier, price the failure mode.

Contract-first languages

Dafny is the reference case — contract first, implementation second. Never broke through because it required a mathematician and a software engineer. AI collapses this: model writes the contract, then the code; humans only review the contract. Code trends toward “the new assembly language.”

Destination is unclear — Dafny, extended Rust, Lean 4. Rust already moves most runtime bugs to compile time, so Python → Rust → contract-first is a plausible migration path.

Interconnection problem

Individually-verified components (CPU, router, crypto library) compose into systems with emergent vulnerabilities via the OS and network layer. Local verification is necessary, not sufficient. Concurrency proofs are harder than sequential proofs by another order. The gap between “verified component” and “verified system” is open.

Discussion about this post

Ready for more?