
Posted by evakhoury 12/16/2025

AI will make formal verification go mainstream (martin.kleppmann.com)
826 points | 432 comments
8organicbits 12/16/2025|
If the engineer doesn't understand the proof system then they cannot validate that the proof describes their problem. The golden rule of LLMs is that they make mistakes and you need to check their output.

> writing proof scripts is one of the best applications for LLMs. It doesn’t matter if they hallucinate nonsense, because the proof checker will reject any invalid proof

Nonsense. The proof checker only guarantees that the proof establishes whatever statement it was given; if the AI hallucinated that statement, the "valid" proof has no connection to the actual problem.
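
A made-up Lean-style sketch of the gap (hypothetical names, nothing from the article): the checker accepts both theorems below, and only a human can tell that the second one says nothing about the requirement.

    -- A hypothetical function plus the property a human actually cares about:
    def discount (price pct : Nat) : Nat := price - price * pct / 100

    -- The intended spec: a discounted price never exceeds the original price.
    theorem discount_le_price (price pct : Nat) : discount price pct ≤ price := by
      unfold discount
      exact Nat.sub_le _ _

    -- A "hallucinated" statement the checker is equally happy to accept,
    -- but which says nothing about the requirement:
    theorem discount_is_itself (price pct : Nat) :
        discount price pct = discount price pct := rfl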

user3939382 12/17/2025||
I’ve made huge advances in formal verification with AI; feel free to email me if this is your field.
goalieca 12/17/2025||
Really? Language models are fundamentally not equivalent in scale and scope to formal logical models. People are extrapolating poorly.
css_apologist 12/17/2025||
I want this to be true, but I am not confident.
teleforce 12/16/2025||
>writing those proofs is both very difficult (requiring PhD-level training) and very laborious.

>For example, as of 2009, the formally verified seL4 microkernel consisted of 8,700 lines of C code, but proving it correct required 20 person-years and 200,000 lines of Isabelle code – or 23 lines of proof and half a person-day for every single line of implementation. Moreover, there are maybe a few hundred people in the world (wild guess) who know how to write such proofs, since it requires a lot of arcane knowledge about the proof system.

I think this type of pattern (a genuinely difficult problem domain with a very small number of experts) is the future of AI, not AGI. For example, formal verification as in this article and, similarly, automated ECG interpretation can be AI killer applications; the latter is what I'm currently working on.

Most countries in the world have only several hundred to several thousand registered cardiologists, making the cardiologist-to-population ratio roughly 1:100,000.

People expect cardiologists to go through their ECG readings, but reading an ECG is very cumbersome. Say you have 5 minutes of ECG signal, the minimum requirement for AFib detection per the guideline. The standard ECG is 12-lead, so even the minimum 5-minute duration yields 12 x 5 x 60 = 3,600 beats to review (assuming a heart rate of 60 beats per minute).
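
(A quick sanity check of that arithmetic as a throwaway Lean snippet, with made-up names; the second line just extrapolates the same arithmetic to the 24-hour Holter case below.)

    -- beats to review = leads x minutes x beats per minute (illustrative only)
    def beatsToReview (leads minutes bpm : Nat) : Nat := leads * minutes * bpm

    #eval beatsToReview 12 5 60          -- 3600 for the minimal 5-minute, 12-lead strip
    #eval beatsToReview 12 (24 * 60) 60  -- 1036800, i.e. about a million beats for a 24-hour Holter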

Then of course we have the Holter ECG, with typical 24-hour recordings that increase the duration considerably, which is why almost all Holter reading is now automated. But current automated ECG detection has very low accuracy, because the accuracy of the detection methods (statistics/AI/ML) is bounded by the beat-detection algorithm, for example the venerable Pan-Tompkins with its limited fiducial, time-domain approach [1].

Cardiologists would rather spend their time on more interesting activities, like teaching future cardiologists, performing expensive procedures such as ICD or pacemaker implantation, or taking their once-in-a-blue-moon holidays, instead of reading patients' monotonous ECGs.

This is why ECG reading automation with AI/ML is necessary to complement the cardiologist. The trick is to push the sensitivity part of the accuracy very high, preferably to 100%, so that missed potential patients (false negatives) are minimized in this expert/cardiologist-in-the-loop exercise. We achieved this accuracy for both major heart anomalies, namely arrhythmia (irregular heart beats) and ischemia (the heart not regulating blood flow properly), by going with a non-fiducial detection approach, i.e. beyond the time domain, with the help of statistics/ML/AI.
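
(To be precise about what "sensitivity" means here, the usual definition, as a tiny Lean sketch with made-up names:)

    -- sensitivity = TP / (TP + FN); pushing this toward 1.0 is what minimizes
    -- missed patients (false negatives) in the cardiologist-in-the-loop setup.
    -- Illustrative definition only.
    def sensitivity (truePos falseNeg : Float) : Float :=
      truePos / (truePos + falseNeg)

    #eval sensitivity 98 2   -- 0.98: even 2 missed cases out of 100 can be too many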

[1] Pan–Tompkins algorithm:

https://en.wikipedia.org/wiki/Pan%E2%80%93Tompkins_algorithm

justatdotin 12/17/2025||
> 2. AI-generated code needs formal verification so that we can skip human review and still be sure that it works;

hahahahaha

CyberDildonics 12/17/2025||
When people have to choose between using AI and using Rust, because LLMs don't produce workable Rust programs, they leave Rust behind and use something else. The Venn diagram of people who want formal verification and people who think LLM slop is a good idea is two separate circles.
justatdotin 12/17/2025|
two of my current agent projects use rust: agent-generated rust is going very well for me.
CyberDildonics 12/17/2025||
You're missing the point of what I said. If LLM programming and Rust conflict, people don't suddenly learn to program and stay with Rust; they find a new language.
justatdotin 12/17/2025||
my experience is that llm programming and rust can fit together very nicely, thank you.

I am finding agent tooling expands my capacity for multi-language projects

CyberDildonics 7 days ago||
This seems like it's intentional at this point. Try to follow me. The whole point is that if someone needs to lean on LLMs in the first place and they have problems with the language, they go to a different language instead of stopping using LLMs.
justatdotin 4 days ago||
I think I do understand your claim,

which is why I am offering my personal experience as counter-evidence.

agents can empower us to choose the best language for the job, rather than defaulting to one which we are most familiar with.

I suspect our difference in opinion comes from your attitude about tools as a crutch to lean on vs a lever to apply.

CyberDildonics 3 days ago||
There isn't a difference of opinion; you aren't replying to what I'm saying in the first place.
datatrashfire 12/17/2025||
lol no it won’t
victor22 7 days ago||
meds
yuppiemephisto 12/16/2025|
I vibe code extremely extensively with Lean 4, enough to run two $200 Claude Code accounts into their API limits every day for a week.

I added LSP support for images to get better feedback loops, and Opus was able to debug https://github.com/alok/LeanPlot. The entire library was vibe coded by older AI models.

It also wrote https://github.com/alok/hexluthor (a hex color syntax highlighting extension that uses lean’s metaprogramming and lsp to show you what color a hex literal is) by using feedback and me saying “keep goign” (yes i misspelled it).

It has serious issues with slop and the limitations of small data, but the rate of progress is really really fast. Opus 4.5 and Gemini were a huge step change.

The language is also improving very fast, just not as fast as AI.

The feedback loop is very real even for ordinary programming. The model really resists it though because it’s super hard, but again this is rapidly improving.

I started vibe coding Lean about 3 years ago and I’ve used Lean 3 (which was far worse). It’s my favorite language after churning through idk 30?

A big aspect of being successful with them is not being all or nothing with proofs. It’s very useful to write down properties as executable code and then just not prove them because they still have to type check and fit together and make sense. github.com/lecopivo/scilean is a good example (search “sorry_proof”).
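
Roughly what that looks like, as a made-up toy example (not code from SciLean): the property is written down as a real statement that has to elaborate and type check against the definitions, but the proof is deferred.

    -- Toy example: the property is stated and type checked, the proof is deferred.
    def clampAll (hi : Nat) (xs : List Nat) : List Nat := xs.map (min hi)

    theorem clampAll_le (hi : Nat) (xs : List Nat) :
        ∀ y ∈ clampAll hi xs, y ≤ hi := by
      sorry   -- stated, checked against the definitions, just not proven yet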

There’s property testing with “plausible” as a nice 80/20 that can be upgraded to full proof at some point.
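
Hand-rolled, the idea is something like this (Plausible automates the randomized-testing part; this sketch is just to show the upgrade path):

    -- 80/20 version: spot-check the property executably today, prove it later.
    -- Illustrative only; Plausible generates the test cases for you.
    def revTwice (xs : List Nat) : List Nat := xs.reverse.reverse

    #eval (List.range 20).all (fun n => revTwice (List.range n) == List.range n)  -- true

    -- The "upgrade to a full proof" step, when it becomes worth the effort:
    theorem revTwice_eq (xs : List Nat) : revTwice xs = xs := by
      unfold revTwice
      exact xs.reverse_reverse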

When the model gets to another jump in capacity, I predict it will emergently design better systems from the feedback needed to prove that they are correct in the first place. Formal verification has a tendency, like optimization, to flow through the system in an anti-modular way, and if you want to claw modularity back you have to design it really, really well. But AI gives a huge intellectual overhang. Why not let the models put their capacity towards making better systems?

Even the documentation system for lean (verso) is (dependently!) typed.

Check out my Lean vibe codes at https://github.com/alok?tab=repositories&q=Lean&type=&langua...

pooooooooooooop 12/16/2025|
cool quote about smooth operator ... I notice none of the vibe codes are proofs of anything, but rather frameworks for using Lean. This seems like a waste of tokens - what is your thinking behind this?
yuppiemephisto 12/17/2025||
There are 4,000 lines of nonstandard analysis which are definitely proofs, including equivalence to the standard definitions.

The frameworks are to improve lean’s programming ecosystem and not just its proving. Metaprogramming is pretty well covered already too, but not ordinary programs.
