Posted by evakhoury 7 days ago
This is, in my opinion, one of those. If an AI could formally verify with the same rigor as a system designed specifically for that purpose, I think that would require AGI rather than something simpler. The task is complex enough that present-day AIs would generate as much noise as signal.
One is that modern formal systems like Lean are quite concise and flexible compared to what you're probably expecting. Lean provides the primitives to formalize all kinds of things, not just math or software. In fact, I really believe that basically _any_ question with a rigorous yes-or-no answer can have its semantics formalized into a kind of "theory". The proofs are often close to how an English proof might look, thanks to high-level tactics involving automation and the power of induction.
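To make that concrete, here's a tiny Lean 4 sketch. The theorem and its name are mine; the lemma and tactics are standard ones, so take the exact names with a grain of salt:

```lean
-- Toy example: state a yes-or-no question as a theorem and discharge it
-- with tactics that track the English argument fairly closely.
theorem length_append_comm (xs ys : List Nat) :
    (xs ++ ys).length = (ys ++ xs).length := by
  simp only [List.length_append]  -- both sides become sums of lengths
  omega                           -- linear arithmetic finishes the goal
```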
Another is that proof-checking solves what are (in my opinion) two of the biggest challenges in modern AI: reward specification and grounding. You can run your solver for a long time, and if it finds an answer, you can trust that without worrying about reward hacking or hallucination, even if the answer is much too complicated for you to understand. You can do RL for an unlimited time for the same reason. And Lean also gives you a 'grounded' model of the objects in your theory, so that the model can manipulate them directly.
In combination, these two properties are extremely powerful. Lean lets us specify an unhackable reward for an extremely diverse set of problems across math, science, and engineering, as well as a common environment to do RL in. It also lets us accept answers to questions without checking them ourselves, which "closes the loop" on tools which generate code or circuitry.
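For a flavor of what "unhackable reward" means mechanically, here's a rough sketch. It assumes a `lean` binary on PATH that exits nonzero when a file fails to check, and `propose_proof` is a stand-in for whatever model or search procedure generates candidates:

```rust
// Sketch of a verifier-based reward. The only thing the reward trusts is
// the proof checker's exit status, so there is nothing for the policy to
// hack and nothing for a human to re-check.
use std::{fs, process::Command};

fn propose_proof(goal: &str) -> String {
    // Stand-in for an LLM or search procedure; here it just guesses `decide`.
    format!("theorem goal_holds : {goal} := by decide")
}

fn reward(goal: &str) -> f64 {
    let candidate = propose_proof(goal);
    fs::write("Candidate.lean", &candidate).expect("write failed");
    let status = Command::new("lean")
        .arg("Candidate.lean")
        .status()
        .expect("failed to run lean");
    if status.success() { 1.0 } else { 0.0 }
}

fn main() {
    println!("reward = {}", reward("2 + 2 = 4"));
}
```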
I plan to write a much more in-depth blog post on these ideas at some point, but for now I'm interested to see where the discussion here goes.
[1] https://leandojo.org/leandojo.html [2] https://aristotle.harmonic.fun/
The program used to check the validity of a proof is called a kernel. It just needs to check one step at a time, and the possible steps that can be taken are just basic logic rules. People can gain more confidence in its validity by:
- Reading it very carefully (doable since it's very small)
- Having multiple independent implementations and comparing the results
- Proving it in some meta-theory. Here the result is not correctness per se, but relative consistency. (Although it can be argued all other points are about relative consistency as well.)
Kernels are usually verified by humans. A good design can make them very small: 500 to 5000 lines of code. Systems designers brag about how small their kernels are!
The kernel could be proved correct by another system, which introduces another turtle below. Or the kernel can be reflected upwards and proved by the system itself, which is virtually putting the bottom turtle on top of a turtle higher in the stack. That will find some problems, but it still leaves open the possibility that the kernel has a flaw that accepts bad proofs, including the proof of the kernel itself.
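To make the "small kernel, one step at a time" point concrete, here is a toy kernel for Hilbert-style propositional logic (a sketch of the idea, not any real system's kernel). Real kernels have richer terms and more rules, but the shape is the same: each line is either an axiom instance or follows from earlier lines by a fixed rule, and the checker verifies nothing else.

```rust
// Toy kernel: the only connective is implication, the only axiom scheme is
// K (A -> (B -> A)), and the only inference rule is modus ponens.
#[derive(Clone, PartialEq)]
enum Fm {
    Var(char),
    Imp(Box<Fm>, Box<Fm>),
}
use Fm::*;

fn imp(a: Fm, b: Fm) -> Fm {
    Imp(Box::new(a), Box::new(b))
}

/// Is `f` an instance of axiom scheme K, i.e. A -> (B -> A)?
fn is_axiom_k(f: &Fm) -> bool {
    matches!(f, Imp(a, bc) if matches!(&**bc, Imp(_, c) if **c == **a))
}

/// Justification for one proof line.
enum Step {
    Axiom,
    ModusPonens { antecedent: usize, implication: usize },
}

/// The whole kernel: accept the proof iff every line is justified.
fn check(proof: &[(Fm, Step)]) -> bool {
    proof.iter().enumerate().all(|(i, (claim, step))| match step {
        Step::Axiom => is_axiom_k(claim),
        Step::ModusPonens { antecedent, implication } => {
            *antecedent < i
                && *implication < i
                && matches!(&proof[*implication].0,
                    Imp(a, b) if **a == proof[*antecedent].0 && **b == *claim)
        }
    })
}

fn main() {
    let (p, q, r) = (Var('p'), Var('q'), Var('r'));
    let l0 = imp(p.clone(), imp(q.clone(), p.clone()));   // axiom K
    let l1 = imp(l0.clone(), imp(r.clone(), l0.clone())); // axiom K again
    let l2 = imp(r.clone(), l0.clone());                  // by modus ponens
    let proof = [
        (l0, Step::Axiom),
        (l1, Step::Axiom),
        (l2, Step::ModusPonens { antecedent: 0, implication: 1 }),
    ];
    println!("kernel accepts the proof: {}", check(&proof));
}
```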
Verification is substantially more challenging.
Currently, even for an expert in both the domain of the software to be verified and the verification process itself, defining even a partial specification is difficult and tedious. Try reading and comparing the specifications of, e.g., a pure crypto function, then a storage or clustering algorithm, then seL4.
(It's possible that brute force specification generation, iteration, and simplification by an LLM might help. It's possible an LLM could help eat away complexity from the other direction, unifying methods and languages, optimising provers, etc.)
Another take is that LLMs don't have enough conceptual understanding to actually create proofs for the correctness of code.
Personally, I believe this kind of work is predicated on more ergonomic proof systems, and those happen to be valuable even without LLMs. Moreover, the built-in guarantees of Rust seem like a great start for building more ergonomic proof systems. Here I am both in awe of Kani and disappointed by it. The awe is for the good work put into making things ergonomic; the disappointment is that the formal analysis is bounded model checking, which can barely make use of the exclusion of mutable aliasing. Kani, but with equational reasoning, is the way forward: equational reasoning was long held back by the pointer work needed to rule out mutable aliasing, and now you can lean on the borrow checker for that!
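For readers who haven't used it, a Kani harness really is this light. A minimal sketch (the function and property are made up):

```rust
// Checked with `cargo kani`; the harness only exists under cfg(kani).
fn saturating_double(x: u32) -> u32 {
    x.checked_mul(2).unwrap_or(u32::MAX)
}

#[cfg(kani)]
#[kani::proof]
fn double_never_shrinks() {
    // kani::any() stands for "any u32", which is what separates this
    // from a unit test over a handful of sampled values.
    let x: u32 = kani::any();
    assert!(saturating_double(x) >= x);
}
```

The ergonomics are the part I'm in awe of; the bounded model checking running underneath the harness is the part I'm disappointed by.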
A formally verified system is easier for the model to check and consequently easier for it to program to. I suppose the question is whether formal methods are sufficiently tractable that they actually help the LLM finish the job before it runs out of context.
Regardless, I often use coding assistants in that manner:
1. First, I use the assistant to come up with the success condition program
2. Then I use the assistant to solve the original problem by asking it to check with the success condition program
3. Then I check the solution myself
It's not rocket science, and is just the same approach we've always taken to problem-solving, but it is nice that modern tools can also work in this way. With this, I can usually use Opus or GPT-5.2 in unattended mode.
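As a small illustration of steps 1 and 2, with everything made up for the example: the success condition program is a checker that knows nothing about how the solution works, and the solution only counts if the checker passes.

```rust
// Toy task: deduplicate, preserving first-occurrence order.
use std::collections::HashSet;

fn first_index(xs: &[i32], x: i32) -> usize {
    xs.iter().position(|&y| y == x).unwrap_or(usize::MAX)
}

// Step 1: the success condition program (written first, kept independent).
fn meets_spec(input: &[i32], output: &[i32]) -> bool {
    let in_set: HashSet<i32> = input.iter().copied().collect();
    let out_set: HashSet<i32> = output.iter().copied().collect();
    let no_dupes = out_set.len() == output.len();
    let same_elements = in_set == out_set;
    // consecutive output elements must appear in first-occurrence order
    let order_ok = output
        .windows(2)
        .all(|w| first_index(input, w[0]) < first_index(input, w[1]));
    no_dupes && same_elements && order_ok
}

// Step 2: the solution, accepted only because it satisfies the checker.
fn solve(input: &[i32]) -> Vec<i32> {
    let mut seen = HashSet::new();
    input.iter().copied().filter(|x| seen.insert(*x)).collect()
}

fn main() {
    for input in [vec![1, 2, 2, 3, 1], vec![], vec![5, 5, 5]] {
        let output = solve(&input);
        assert!(meets_spec(&input, &output), "spec violated for {input:?}");
    }
    println!("all cases pass the success condition");
}
```

Step 3 is then reviewing both programs myself, which is much less work than reviewing an unchecked solution.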
0: https://wiki.roshangeorge.dev/w/Blog/2025-12-11/LLMs_Excel_A...
As for those who use LLMs to impersonate humans, which is the kind of verification where you verify that a solution purported to be built by a human actually works, I have no doubt we will rapidly evolve norms that make us more resistant to them. The cost of fraud and anti-fraud is not zero, but I suspect it will be much less than we fear.
> 2. Then I use the assistant to solve the original problem by asking it to check with the success condition program
This sounds a lot like Test-Driven Development. :)
At worst, we eventually create a sentient AI that can use our trust of the generated code to jailbreak and distribute itself like an unstoppable virus, and we become its pets, or are wiped out.
Personally, all my vibe coding includes a prompt to add comments to explain the code, and I review every line.
One example could be a low-level programming language for a given PLC manufacturer, where the prompt comes from a context-aware domain expert, and the LLM is able to output proper DSL code for that PLC. Think of "make sure this motor spins at 300rpm while this other task takes place"-type prompts.
The LLM essentially needs to juggle understanding those highly contextual clues with writing DSL code that fits the DSL definition very tightly.
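As a toy illustration of the "tightly fits the DSL definition" half (the instruction set and syntax below are entirely invented): model output is only accepted if it parses into a small, closed instruction set, and anything outside it is rejected rather than guessed at.

```rust
#[derive(Debug)]
enum Instr {
    SpinMotor { id: u8, rpm: u32 },
    WaitForTask { name: String },
}

fn parse_line(line: &str) -> Option<Instr> {
    let parts: Vec<&str> = line.split_whitespace().collect();
    match parts.as_slice() {
        ["SPIN", id, rpm] => Some(Instr::SpinMotor {
            id: id.parse().ok()?,
            rpm: rpm.parse().ok()?,
        }),
        ["WAIT", name] => Some(Instr::WaitForTask { name: name.to_string() }),
        _ => None, // outside the DSL: reject
    }
}

fn main() {
    // e.g. the model's answer to "spin motor 1 at 300rpm while task A runs"
    let model_output = "SPIN 1 300\nWAIT taskA";
    let program: Option<Vec<Instr>> = model_output.lines().map(parse_line).collect();
    println!("{program:?}");
}
```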
We're still years away from this being thoroughly reliable for all contexts, but it's interesting research nonetheless. Happy to see that someone also agrees with my sentiment ;-)