Posted by evakhoury 12/16/2025

AI will make formal verification go mainstream (martin.kleppmann.com)
827 points | 434 comments
plainOldText 12/16/2025|
I guess it’s time to learn OCaml then.

It appears many of the proof assistants/verification systems can generate OCaml. Or perhaps Ada/SPARK?

Regardless of how the software engineering discipline changes in the age of gen AI, we must aim to produce higher-quality software than whatever we have today, not lower, and formal verification will definitely help.
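
For anyone who hasn't touched a proof assistant, here is a minimal Lean 4 sketch (my own illustration, not from the article; Coq is the one that typically extracts to OCaml, while Lean compiles natively) of what a machine-checked property looks like:

    -- A tiny function plus a specification that the compiler checks.
    def double (n : Nat) : Nat := n + n

    -- If `double` ever changes in a way that breaks this property, the build fails.
    theorem double_even (n : Nat) : double n % 2 = 0 := by
      unfold double
      omega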

AgentOrange1234 12/16/2025||
Hard disagree. "I'm an expert" in that I have done tons of proofs on many systems with many provers, both academically and professionally for decades.

Also, I am a novice when it comes to programming with sound, and today I have been dorking with a simple limiter. ChatGPT knows way more than me about what I am doing. It has taught me a ton. And as magical and wonderful as it is, it is incredibly tedious to try to work with it to come up with real specifications of interesting properties.

Instead of banging my head against a theorem prover that won't say QED, I get a confident sounding stream of words that I often don't even understand. I often don't even have the language to tell it what I am imagining. When I do understand, it's a lot of typing to explain my understanding. And so often, as a teacher, it just is utterly failing to effectively communicate to me why I am wrong.

At the end of all of this, I think specification is really hard, intellectually creative and challenging work. An LLM cannot do the work for you. Even to be guided down the right path, you will need perseverance and motivation.

zmmmmm 12/17/2025||
I dunno about formal verification, but for sure it's brought me back to a much more TDD-first style of coding. You get so much mileage from having it first implement tests and then run them after changes. The key is that it also dramatically lowers the friction of creating the tests. It's like a triple bottom line.
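
A minimal sketch of that loop in Python with pytest (names invented for illustration): the test is written first, the implementation after, and the tests are re-run after every model-generated change.

    # TDD-style sketch: the test below was written before the implementation,
    # and `pytest` is re-run after every change to catch regressions.
    def slugify(title: str) -> str:
        return "-".join(title.lower().split())

    def test_slugify_lowercases_and_joins_with_hyphens():
        assert slugify("Hello World") == "hello-world"
        assert slugify("  Extra   Spaces  ") == "extra-spaces"
        assert slugify("already-a-slug") == "already-a-slug"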
tachim 12/17/2025||
I'm Tudor, CEO of Harmonic. We're huge believers in formal verification, and started Harmonic in 2023 to make this technology mainstream. We built a system that achieved gold medal performance at 2025 IMO (https://harmonic.fun/news), and recently opened a public API that got 10/12 problems on this year's Putnam exam (https://aristotle.harmonic.fun/).

If you also think this is the future of programming and are interested in building it, please consider joining us: https://jobs.ashbyhq.com/Harmonic. We have incredibly interesting challenges across the stack, from infra to AI to Lean.

brookman64k 12/17/2025||
How do you verify that your verification verifies the right thing? Couldn't the LLM spit out a nice-looking but ultimately useless proof (boiling down to something like 1=1)? Also, in my experience software projects are full of incorrect, incomplete, and constantly changing assumptions and requirements.
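
To make the worry concrete, a hypothetical Lean 4 sketch: both theorems below are accepted by the checker, but only the second says anything about `myAbs`. The first is the "1=1" failure mode; only a human reading the statement can tell the difference.

    def myAbs (x : Int) : Int := if x < 0 then -x else x

    -- Vacuous: type-checks, proves nothing about the program.
    theorem looks_impressive : 1 = 1 := rfl

    -- Meaningful: the result is never negative.
    theorem myAbs_nonneg (x : Int) : 0 ≤ myAbs x := by
      unfold myAbs
      split <;> omega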
lewisjoe 12/16/2025||
A while back I did an experiment where I asked ChatGPT to imagine a new language that would be best for AI to write code in, and to list the properties of such a language. Interestingly, it spat out properties much like today's functional, strongly typed (in fact dependently typed), formally verifiable / proof languages, along with reasons why such a language is easier for AI to reason about and generate programs in. I found it fascinating because I expected something like TypeScript or Kotlin.

While I agree formal verification itself has its problems, I think the argument has merit: soon AI-generated code will surpass all human-generated code, and when that happens we at least need a way to prove that the code won't have security issues and that it adheres to compliance/policy requirements.

WhyOhWhyQ 12/17/2025|
AI-generated code is pretty far from surpassing professional human-generated code, unless you're talking about snippets. Who should be writing mission-critical code in the next 10 years: the people currently doing it (with AI assistance), or e.g. some random team at Google using AI primarily? The answer is obvious.
chamomeal 12/16/2025||
Formal methods are very cool (and I know nothing about them lol) but there's still a gap between the proof and the implementation, unless you're using a language with proof-checking built in.

If we're looking to use LLMs to make code absolutely rock-solid, I would say advanced testing practices are a good candidate! Property-based testing, fuzzing, and contract testing (for example https://github.com/griffinbank/test.contract) are all fun but extremely tedious to write and maintain. I think that makes them perfect candidates for LLMs. These kinds of tests are also more easily understandable by regular ol' software developers, and I think we'll have to be auditing LLM output for quite a while.
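
For instance, a minimal property-based test with Hypothesis (the function under test is invented for illustration): instead of hand-picking cases, you describe the input space and an invariant, and the framework searches for counterexamples.

    # Property-based tests: Hypothesis generates many inputs and shrinks failures.
    from hypothesis import given, strategies as st

    def normalize_whitespace(s: str) -> str:
        return " ".join(s.split())

    @given(st.text())
    def test_normalize_is_idempotent(s):
        once = normalize_whitespace(s)
        assert normalize_whitespace(once) == once   # applying it twice changes nothing

    @given(st.text())
    def test_no_double_spaces(s):
        assert "  " not in normalize_whitespace(s)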

measurablefunc 12/16/2025|
If you want to see what verified software looks like then Project Everest is a good demonstration: https://project-everest.github.io/
whazor 12/16/2025||
Domain Specific Languages + Formal Verification.

With a domain specific language you can add extra limitations on the kinds of outputs. This can also make formal verification faster.

Maybe like React components. You limit the format. You could limit which libraries can be imported, which hooks can be used, and how expressive the components can be.
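
A rough sketch of the "limit which libraries can be imported" check, using Python's ast module on generated Python code (the allow-list is invented; the React/JSX case would need a JS/TS parser, but the shape of the check is the same):

    import ast

    ALLOWED = {"math", "dataclasses", "typing"}   # invented allow-list

    def disallowed_imports(source: str) -> list[str]:
        """Return the modules imported by `source` that are not on the allow-list."""
        tree = ast.parse(source)
        bad = []
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom):
                names = [node.module or ""]
            else:
                continue
            bad += [n for n in names if n.split(".")[0] not in ALLOWED]
        return bad

    print(disallowed_imports("import os\nimport math"))   # ['os']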

dwrodri 12/17/2025||
If AI can help us modernize the current state of hardware verification, I think that would be an enormous boon to the tech industry.

Server class CPUs and GPUs are littered with side channels which are very difficult to “close”, even in hardened cloud VMs.

We haven’t verified “frontier performance” hardware down to the logic gate in quite some time. Prof. Margaret Martonosi’s lab and her students have spent quite some time on this challenge, and I am excited to see better, safer memory models out in the wild.

A lot of the same big ideas used in hardware are making their way into the software layer too, see https://faultlore.com/blah/tower-of-weakenings/

throwaway613745 12/16/2025|
GPT 5.2 can't even tell me how many rs are in garlic.
bglazer 12/16/2025||
This is a very tiring criticism. Yes, this is true. But it's an implementation detail (tokenization) that has very little bearing on the practical utility of these tools. How often are you relying on LLMs to count letters in words?
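
For anyone unsure why tokenization matters here, a small sketch with the tiktoken library (exact splits depend on the encoding; the point is that the model sees token IDs, not characters):

    # Requires `pip install tiktoken`.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("garlic")
    print(tokens)                              # a short list of integer token IDs
    print([enc.decode([t]) for t in tokens])   # the chunks the model actually "sees"
    print("garlic".count("r"))                 # the question being asked: 1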
1970-01-01 12/16/2025|||
The implementation detail is that we keep finding them! After this, it couldn't locate a seahorse emoji without freaking out. At some point we need to have a test: there are two drinks before you. One is water, the other is whatever the LLM thought you might like to drink after it completed refactoring the codebase. Choose wisely.
101008 12/16/2025||||
It's an example that shows that if these models aren't trained in a specific problem, they may have a hard time solving it for you.
altruios 12/16/2025|||
An analogy is asking someone who is colorblind how many colors are on a sheet of paper. What you are probing isn't reasoning, it's perception. If you can't see the input, you can't reason about the input.
9rx 12/17/2025||
> What you are probing isn't reasoning, it's perception.

It's both. A colorblind person will admit their shortcomings and, if compelled to be helpful like an LLM is, will reason their way to finding a solution that works around their limitations.

But as LLMs lack a way to reason, you get nonsense instead.

altruios 12/19/2025|||
What tools does the LLM have access to that would reveal sub-token characters to it?

This assumes the colorblind person both believes it is true that they are colorblind, in a world where that can be verified, and possesses tools to overcome these limitations.

You have to be much more clever to 'see' an atom before the invention of the microscope; if the tool doesn't exist, most of the time you are SOL.

Uehreka 12/16/2025||||
No, it’s an example that shows that LLMs still use a tokenizer, which is not an impediment for almost any task (even many where you would expect it to be, like searching a codebase for variants of a variable name in different cases).
8note 12/16/2025||
the question remains: is the tokenizer going to be a fundamental limit to my task? how do i know ahead of time?
worldsayshi 12/16/2025||
Would it limit a person getting your instructions in Chinese? Tokenisation pretty much means that the LLM is reading symbols instead of phonemes.

This makes me wonder if LLMs work better in Chinese.

victorbjorklund 12/16/2025|||
No, it is the issue with the tokenizer.
iAMkenough 12/16/2025||||
The criticism would stop if the implementation issue was fixed.

It's an example of a simple task. How often are you relying on LLMs to complete simple tasks?

andy99 12/16/2025|||
At this point, if I were OpenAI I wouldn't bother fixing this, just to give the pedants something to get excited about.
properbrew 12/16/2025||
Unless they fixed this in 25 minutes (possible?), it correctly counts 1 `r`.

https://chatgpt.com/share/6941df90-789c-8005-8783-6e1c76cdfc...

desman 12/16/2025|||
This is like complaining that your screwdriver is bad at measuring weight.

If you really need an answer and you really need the LLM to give it to you, then ask it to write a (Python?) script to do the calculation you need, execute it, and give you the answer.
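
Something as small as this, which the model can write and execute far more reliably than it can "count" across tokens:

    word = "garlic"
    print(f"'r' appears {word.count('r')} time(s) in {word!r}")   # 1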

worldsayshi 12/16/2025|||
That's a problem that is at least possible for the LLM to perceive and learn through training, while counting letters is much more like asking a colour blind person to count flowers by colour.
charcircuit 12/17/2025||
Why is it not possible? Do you think it's impossible to count? Do you think it's impossible to learn the letters in each token? Do you think combining the two is not possible?