Posted by evakhoury 7 days ago
This is, in my opinion, one of those. If an AI could formally verify with the same rigor as a system designed specifically for that purpose, I think that would require AGI rather than something simpler. The task is complex enough that present-day AIs would generate as much noise as signal.
One is that modern formal systems like Lean are quite concise and flexible compared to what you're probably expecting. Lean provides the primitives to formalize all kinds of things, not just math or software. In fact, I really believe that basically _any_ question with a rigorous yes-or-no answer can have its semantics formalized into a kind of "theory". The proofs are often close to how an English proof might look, thanks to high-level tactics involving automation and the power of induction.
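To make that concrete, here's a tiny Lean 4 sketch. The theorem and its name are mine; the lemma and tactics are standard ones, so take the exact names with a grain of salt:

```lean
-- Toy example: state a yes-or-no question as a theorem and discharge it
-- with tactics that track the English argument fairly closely.
theorem length_append_comm (xs ys : List Nat) :
    (xs ++ ys).length = (ys ++ xs).length := by
  simp only [List.length_append]  -- both sides become sums of lengths
  omega                           -- linear arithmetic finishes the goal
```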
Another is that proof-checking solves what are (in my opinion) two of the biggest challenges in modern AI: reward specification and grounding. You can run your solver for a long time, and if it finds an answer, you can trust that without worrying about reward hacking or hallucination, even if the answer is much too complicated for you to understand. You can do RL for an unlimited time for the same reason. And Lean also gives you a 'grounded' model of the objects in your theory, so that the model can manipulate them directly.
In combination, these two properties are extremely powerful. Lean lets us specify an unhackable reward for an extremely diverse set of problems across math, science, and engineering, as well as a common environment to do RL in. It also lets us accept answers to questions without checking them ourselves, which "closes the loop" on tools which generate code or circuitry.
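For a flavor of what "unhackable reward" means mechanically, here's a rough sketch. It assumes a `lean` binary on PATH that exits nonzero when a file fails to check, and `propose_proof` is a stand-in for whatever model or search procedure generates candidates:

```rust
// Sketch of a verifier-based reward. The only thing the reward trusts is
// the proof checker's exit status, so there is nothing for the policy to
// hack and nothing for a human to re-check.
use std::{fs, process::Command};

fn propose_proof(goal: &str) -> String {
    // Stand-in for an LLM or search procedure; here it just guesses `decide`.
    format!("theorem goal_holds : {goal} := by decide")
}

fn reward(goal: &str) -> f64 {
    let candidate = propose_proof(goal);
    fs::write("Candidate.lean", &candidate).expect("write failed");
    let status = Command::new("lean")
        .arg("Candidate.lean")
        .status()
        .expect("failed to run lean");
    if status.success() { 1.0 } else { 0.0 }
}

fn main() {
    println!("reward = {}", reward("2 + 2 = 4"));
}
```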
I plan to write a much more in-depth blog post on these ideas at some point, but for now I'm interested to see where the discussion here goes.
[1] https://leandojo.org/leandojo.html [2] https://aristotle.harmonic.fun/
The program used to check the validity of a proof is called a kernel. It just needs to check one step at a time, and the possible steps that can be taken are just basic logic rules. People can gain more confidence in its validity by:
- Reading it very carefully (doable since it's very small)
- Having multiple independent implementations and comparing the results
- Proving it in some meta-theory. Here the result is not correctness per se, but relative consistency. (Although it can be argued all other points are about relative consistency as well.)
Kernels are usually verified by humans. A good design can make them very small: 500 to 5000 lines of code. Systems designers brag about how small their kernels are!
The kernel could be proved correct by another system, which introduces another turtle below. Or the kernel can be reflected upwards and proved by the system itself, which is virtually putting the bottom turtle on top of a turtle higher in the stack. That will find some problems, but it still leaves open the possibility that the kernel has a flaw that accepts bad proofs, including the proof of the kernel itself.
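To make the "small kernel, one step at a time" point concrete, here is a toy kernel for Hilbert-style propositional logic (a sketch of the idea, not any real system's kernel). Real kernels have richer terms and more rules, but the shape is the same: each line is either an axiom instance or follows from earlier lines by a fixed rule, and the checker verifies nothing else.

```rust
// Toy kernel: the only connective is implication, the only axiom scheme is
// K (A -> (B -> A)), and the only inference rule is modus ponens.
#[derive(Clone, PartialEq)]
enum Fm {
    Var(char),
    Imp(Box<Fm>, Box<Fm>),
}
use Fm::*;

fn imp(a: Fm, b: Fm) -> Fm {
    Imp(Box::new(a), Box::new(b))
}

/// Is `f` an instance of axiom scheme K, i.e. A -> (B -> A)?
fn is_axiom_k(f: &Fm) -> bool {
    matches!(f, Imp(a, bc) if matches!(&**bc, Imp(_, c) if **c == **a))
}

/// Justification for one proof line.
enum Step {
    Axiom,
    ModusPonens { antecedent: usize, implication: usize },
}

/// The whole kernel: accept the proof iff every line is justified.
fn check(proof: &[(Fm, Step)]) -> bool {
    proof.iter().enumerate().all(|(i, (claim, step))| match step {
        Step::Axiom => is_axiom_k(claim),
        Step::ModusPonens { antecedent, implication } => {
            *antecedent < i
                && *implication < i
                && matches!(&proof[*implication].0,
                    Imp(a, b) if **a == proof[*antecedent].0 && **b == *claim)
        }
    })
}

fn main() {
    let (p, q, r) = (Var('p'), Var('q'), Var('r'));
    let l0 = imp(p.clone(), imp(q.clone(), p.clone()));   // axiom K
    let l1 = imp(l0.clone(), imp(r.clone(), l0.clone())); // axiom K again
    let l2 = imp(r.clone(), l0.clone());                  // by modus ponens
    let proof = [
        (l0, Step::Axiom),
        (l1, Step::Axiom),
        (l2, Step::ModusPonens { antecedent: 0, implication: 1 }),
    ];
    println!("kernel accepts the proof: {}", check(&proof));
}
```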
Verification is substantially more challenging.
Currently, even for an expert in both the domain of the software to be verified and the verification process itself, defining even a partial specification is difficult and tedious. Try reading and comparing the specifications of, e.g., a pure crypto function, then a storage or clustering algorithm, then seL4.
(It's possible that brute force specification generation, iteration, and simplification by an LLM might help. It's possible an LLM could help eat away complexity from the other direction, unifying methods and languages, optimising provers, etc.)
Another take is that LLMs don't have enough conceptual understanding to actually create proofs for the correctness of code.
Personally, I believe this kind of work is predicated on more ergonomic proof systems, and those happen to be valuable even without LLMs. Moreover, the built-in guarantees of Rust seem like a great start for building more ergonomic proof systems. Here I am both in awe of Kani and disappointed by it. The awe is for the good work put into making things ergonomic; the disappointment is that the formal analysis is bounded model checking, which can barely make use of the exclusion of mutable aliasing. Kani, but with equational reasoning, is the way forward: equational reasoning was long held back by the pointer work needed to rule out mutable aliasing, and now you can lean on the borrow checker for that!
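For readers who haven't used it, a Kani harness really is this light. A minimal sketch (the function and property are made up):

```rust
// Checked with `cargo kani`; the harness only exists under cfg(kani).
fn saturating_double(x: u32) -> u32 {
    x.checked_mul(2).unwrap_or(u32::MAX)
}

#[cfg(kani)]
#[kani::proof]
fn double_never_shrinks() {
    // kani::any() stands for "any u32", which is what separates this
    // from a unit test over a handful of sampled values.
    let x: u32 = kani::any();
    assert!(saturating_double(x) >= x);
}
```

The ergonomics are the part I'm in awe of; the bounded model checking running underneath the harness is the part I'm disappointed by.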
A formally verified system is easier for the model to check and consequently easier for it to program to. I suppose the question is whether formal methods are sufficiently tractable that they actually help the LLM finish the job before it runs out of context.
Regardless, I often use coding assistants in that manner:
1. First, I use the assistant to come up with the success condition program
2. Then I use the assistant to solve the original problem by asking it to check with the success condition program
3. Then I check the solution myself
It's not rocket science, and is just the same approach we've always taken to problem-solving, but it is nice that modern tools can also work in this way. With this, I can usually use Opus or GPT-5.2 in unattended mode.
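As a small illustration of steps 1 and 2, with everything made up for the example: the success condition program is a checker that knows nothing about how the solution works, and the solution only counts if the checker passes.

```rust
// Toy task: deduplicate, preserving first-occurrence order.
use std::collections::HashSet;

fn first_index(xs: &[i32], x: i32) -> usize {
    xs.iter().position(|&y| y == x).unwrap_or(usize::MAX)
}

// Step 1: the success condition program (written first, kept independent).
fn meets_spec(input: &[i32], output: &[i32]) -> bool {
    let in_set: HashSet<i32> = input.iter().copied().collect();
    let out_set: HashSet<i32> = output.iter().copied().collect();
    let no_dupes = out_set.len() == output.len();
    let same_elements = in_set == out_set;
    // consecutive output elements must appear in first-occurrence order
    let order_ok = output
        .windows(2)
        .all(|w| first_index(input, w[0]) < first_index(input, w[1]));
    no_dupes && same_elements && order_ok
}

// Step 2: the solution, accepted only because it satisfies the checker.
fn solve(input: &[i32]) -> Vec<i32> {
    let mut seen = HashSet::new();
    input.iter().copied().filter(|x| seen.insert(*x)).collect()
}

fn main() {
    for input in [vec![1, 2, 2, 3, 1], vec![], vec![5, 5, 5]] {
        let output = solve(&input);
        assert!(meets_spec(&input, &output), "spec violated for {input:?}");
    }
    println!("all cases pass the success condition");
}
```

Step 3 is then reviewing both programs myself, which is much less work than reviewing an unchecked solution.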
0: https://wiki.roshangeorge.dev/w/Blog/2025-12-11/LLMs_Excel_A...
As for those who use LLMs to impersonate humans, which is the kind of verification where you verify that a solution purported to be built by a human actually works, I have no doubt we will rapidly evolve norms that make us more resistant to them. The cost of fraud and anti-fraud is not zero, but I suspect it will be much less than we fear.
> 2. Then I use the assistant to solve the original problem by asking it to check with the success condition program
This sounds a lot like Test-Driven Development. :)
At worst, we eventually create a sentient AI that can use our trust of the generated code to jailbreak and distribute itself like an unstoppable virus, and we become its pets, or are wiped out.
Personally, all my vibe coding includes a prompt to add comments to explain the code, and I review every line.
One example could be a low-level programming language for a given PLC manufacturer, where the prompt comes from a context-aware domain expert, and the LLM is able to output proper DSL code for that PLC. Think of "make sure this motor spins at 300rpm while this other task takes place"-type prompts.
The LLM essentially needs to juggle understanding those highly contextual clues with writing DSL code that fits the DSL definition very tightly.
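As a toy illustration of the "tightly fits the DSL definition" half (the instruction set and syntax below are entirely invented): model output is only accepted if it parses into a small, closed instruction set, and anything outside it is rejected rather than guessed at.

```rust
#[derive(Debug)]
enum Instr {
    SpinMotor { id: u8, rpm: u32 },
    WaitForTask { name: String },
}

fn parse_line(line: &str) -> Option<Instr> {
    let parts: Vec<&str> = line.split_whitespace().collect();
    match parts.as_slice() {
        ["SPIN", id, rpm] => Some(Instr::SpinMotor {
            id: id.parse().ok()?,
            rpm: rpm.parse().ok()?,
        }),
        ["WAIT", name] => Some(Instr::WaitForTask { name: name.to_string() }),
        _ => None, // outside the DSL: reject
    }
}

fn main() {
    // e.g. the model's answer to "spin motor 1 at 300rpm while task A runs"
    let model_output = "SPIN 1 300\nWAIT taskA";
    let program: Option<Vec<Instr>> = model_output.lines().map(parse_line).collect();
    println!("{program:?}");
}
```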
We're still years away from this being thoroughly reliable for all contexts, but it's interesting research nonetheless. Happy to see that someone also agrees with my sentiment ;-)