Posted by evakhoury 7 days ago

AI will make formal verification go mainstream (martin.kleppmann.com)
825 points | 432 comments
plainOldText 7 days ago|
I guess it’s time to learn OCaml then.

It appears many of the proof assistants/verification systems can generate OCaml. Or perhaps Ada/SPARK?

Regardless of how the software engineering discipline will change in the age of gen AI, we must aim to produce higher, not lower, quality software than whatever we have today, and formal verification will definitely help.

AgentOrange1234 7 days ago||
Hard disagree. "I'm an expert" in that I have done tons of proofs on many systems with many provers, both academically and professionally for decades.

Also, I am a novice when it comes to programming with sound, and today I have been dorking with a simple limiter. ChatGPT knows way more than me about what I am doing. It has taught me a ton. And as magical and wonderful as it is, it is incredibly tedious to try to work with it to come up with real specifications of interesting properties.

Instead of banging my head against a theorem prover that won't say QED, I get a confident-sounding stream of words that I often don't even understand. I often don't even have the language to tell it what I am imagining. When I do understand, it's a lot of typing to explain my understanding. And so often, as a teacher, it just utterly fails to communicate effectively why I am wrong.

At the end of all of this, I think specification is really hard, intellectually creative and challenging work. An LLM cannot do the work for you. Even to be guided down the right path, you will need perseverance and motivation.

zmmmmm 7 days ago||
I dunno about formal verification, but it has for sure brought me back to a much more TDD-first style of coding. You get so much mileage from having it implement tests first and then run them after changes. The key is that it lowers the friction of creating the tests so much. It's like a triple bottom line.
dwrodri 7 days ago||
If AI can help us modernize the current state of hardware verification, I think that would be an enormous boon to the tech industry.

Server class CPUs and GPUs are littered with side channels which are very difficult to “close”, even in hardened cloud VMs.

We haven't verified "frontier performance" hardware down to the logic gate in quite some time. Prof. Margaret Martonosi's lab and her students have spent quite some time on this challenge, and I am excited to see better, safer memory models out in the wild.

A lot of the same big ideas used in hardware are making their way into the software layer too; see https://faultlore.com/blah/tower-of-weakenings/

brookman64k 7 days ago||
How do you verify that your verification verifies the right thing? Couldn't the LLM spit out a nice-looking but ultimately useless proof (boiling down to something like 1=1)? Also, in my experience software projects are full of incorrect, incomplete, and constantly changing assumptions and requirements.
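
To make the worry concrete (a rough sketch, not tied to any particular prover or language): a spec that only says "the output is sorted" is satisfied by a function that throws its input away, so a proof or test of that spec tells you almost nothing. The names below are made up purely for illustration.

    # Sketch of a "proof of the wrong thing": the weak spec checks only that
    # the result is sorted, not that it is a permutation of the input, so a
    # function that discards its input satisfies it trivially.

    def bad_sort(xs):
        return []  # obviously wrong, yet accepted by the weak spec below

    def weak_spec(xs, out):
        # Holds for bad_sort on every input: an empty list is vacuously sorted.
        return all(out[i] <= out[i + 1] for i in range(len(out) - 1))

    def strong_spec(xs, out):
        # Also requires the output to be the sorted permutation of the input,
        # which bad_sort fails immediately.
        return weak_spec(xs, out) and out == sorted(xs)

    assert weak_spec([3, 1, 2], bad_sort([3, 1, 2]))        # passes, misleadingly
    assert not strong_spec([3, 1, 2], bad_sort([3, 1, 2]))  # caught by the stronger spec
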
lewisjoe 7 days ago||
A while back I did an experiment where I asked ChatGPT to imagine a new language that would be best for AI to write code in, and to list the properties of such a language. Interestingly, it spat out properties similar to today's functional, strongly typed (in fact dependently typed), formally verifiable proof languages, along with reasons why such a language is easier for AI to reason about and generate programs in. I found it fascinating because I expected something like TypeScript or Kotlin.

While I agree formal verification itself has its problems, I think the argument has merit, because soon AI-generated code will surpass all human-generated code, and when that happens we at least need a way to prove that the code won't have security issues and that it adheres to compliance/policy.

WhyOhWhyQ 7 days ago|
AI-generated code is pretty far from matching professional human-generated code, unless you're talking about snippets. Who should be writing mission-critical code in the next 10 years: the people currently doing it (with AI assistance), or, e.g., some random team at Google using AI primarily? The answer is obvious.
chamomeal 7 days ago||
Formal methods are very cool (and I know nothing about them lol) but there's still a gap between the proof and the implementation, unless you're using a language with proof-checking built in.

If we're looking to use LLMs to make code absolutely rock-solid, I would say advanced testing practices are a good candidate. Property-based testing, fuzzing, and contract testing (for example https://github.com/griffinbank/test.contract) are all fun but extremely tedious to write and maintain. I think that makes them the perfect candidate for LLMs. These kinds of tests are also more easily understandable by regular ol' software developers, and I think we'll have to be auditing LLM output for quite a while.
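
As a small illustration (a sketch assuming the Python hypothesis library; the run-length encode/decode functions are toy examples, not from any real project), a round-trip property is short to state but exercises far more cases than hand-written examples:

    # Sketch of a property-based test: instead of hand-picking examples,
    # state a round-trip property and let hypothesis search for counterexamples.
    from hypothesis import given, strategies as st

    def rle_encode(s):
        """Toy run-length encoder, used only for illustration."""
        out = []
        for ch in s:
            if out and out[-1][0] == ch:
                out[-1] = (ch, out[-1][1] + 1)
            else:
                out.append((ch, 1))
        return out

    def rle_decode(pairs):
        return "".join(ch * n for ch, n in pairs)

    @given(st.text())
    def test_rle_round_trip(s):
        # Decoding the encoding must give back the original string.
        assert rle_decode(rle_encode(s)) == s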

measurablefunc 7 days ago|
If you want to see what verified software looks like then Project Everest is a good demonstration: https://project-everest.github.io/
whazor 7 days ago||
Domain Specific Languages + Formal Verification.

With a domain specific language you can add extra limitations on the kinds of outputs. This can also make formal verification faster.

Maybe like React components: you limit the format. You could limit which libraries can be imported, which hooks can be used, and how expressive the code can be.
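
A rough sketch of what "limiting the outputs" could look like in practice (the allow-list and helper below are made up for illustration and use Python's ast module rather than anything React-specific): only accept generated snippets that stay inside a tiny, checkable subset.

    # Illustrative only: a tiny expression "DSL" where generated code may use
    # only an allow-listed set of syntax nodes, so it is cheap to check
    # (and, in principle, easier to verify) before being accepted.
    import ast

    ALLOWED_NODES = (ast.Expression, ast.BinOp, ast.Add, ast.Mult,
                     ast.Constant, ast.Name, ast.Load)

    def in_dsl(source: str) -> bool:
        """Accept only expressions built from +, *, names, and constants."""
        try:
            tree = ast.parse(source, mode="eval")
        except SyntaxError:
            return False
        return all(isinstance(node, ALLOWED_NODES) for node in ast.walk(tree))

    assert in_dsl("price * quantity + tax")
    assert not in_dsl("__import__('os').system('echo pwned')")  # calls are rejected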

throwaway613745 7 days ago||
GPT 5.2 can't even tell me how many rs are in garlic.
bglazer 7 days ago||
This is a very tiring criticism. Yes, it's true. But it's an implementation detail (tokenization) that has very little bearing on the practical utility of these tools. How often are you relying on LLMs to count letters in words?
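
For the curious, the tokenization point is easy to see directly (a sketch assuming OpenAI's tiktoken package is installed; the exact split varies by encoding/model):

    # Show how a word reaches the model as token chunks rather than letters.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("garlic")
    print([enc.decode([t]) for t in tokens])
    # The model "sees" these chunks, not individual characters, which is why
    # letter-counting questions probe perception more than reasoning.
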
1970-01-01 7 days ago|||
The implementation detail is that we keep finding them! After this, it couldn't locate a seahorse emoji without freaking out. At some point we need to have a test: there are two drinks before you. One is water, the other is whatever the LLM thought you might like to drink after it completed refactoring the codebase. Choose wisely.
101008 7 days ago||||
It's an example that shows that if these models aren't trained on a specific problem, they may have a hard time solving it for you.
altruios 7 days ago|||
An analogy is asking someone who is colorblind how many colors are on a sheet of paper. What you are probing isn't reasoning, it's perception. If you can't see the input, you can't reason about the input.
9rx 7 days ago||
> What you are probing isn't reasoning, it's perception.

It's both. A colorblind person will admit their shortcomings and, if compelled to be helpful like an LLM is, will reason their way to a solution that works around their limitations.

But as LLMs lack a way to reason, you get nonsense instead.

altruios 5 days ago|||
What tools does the LLM have access to that would reveal sub-token characters to it?

This assumes the colorblind person both believes it is true that they are colorblind, in a world where that can be verified, and possesses tools to overcome these limitations.

You have to be much more clever to 'see' an atom before the invention of a microscope, if the tool doesn't exist: most of the time you are SOL.

Uehreka 7 days ago||||
No, it’s an example that shows that LLMs still use a tokenizer, which is not an impediment for almost any task (even many where you would expect it to be, like searching a codebase for variants of a variable name in different cases).
8note 7 days ago||
The question remains: is the tokenizer going to be a fundamental limit for my task? How do I know ahead of time?
worldsayshi 7 days ago||
Would it limit a person getting your instructions in Chinese? Tokenisation pretty much means that the LLM is reading symbols instead of phonemes.

This makes me wonder if LLMs work better in Chinese.

victorbjorklund 7 days ago|||
No, it is the issue with the tokenizer.
iAMkenough 7 days ago||||
The criticism would stop if the implementation issue was fixed.

It's an example of a simple task. How often are you relying on LLMs to complete simple tasks?

andy99 7 days ago|||
At this point, if I were OpenAI, I wouldn't bother fixing this, just to give pedants something to get excited about.
blazingbanana 7 days ago||
Unless they fixed this in the last 25 minutes (possible?), it correctly counts 1 `r`.

https://chatgpt.com/share/6941df90-789c-8005-8783-6e1c76cdfc...

desman 7 days ago|||
This is like complaining that your screwdriver is bad at measuring weight.

If you really need an answer and you really need the LLM to give it to you, then ask it to write a (Python?) script to do the calculation you need, execute it, and give you the answer.
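
For the letter-counting example upthread, the script it would need to write and run is a one-liner (Python shown just for concreteness):

    # The deterministic check the LLM can generate and execute,
    # instead of answering from token-level intuition.
    word = "garlic"
    print(word.count("r"))  # prints 1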

bgwalter 7 days ago||
⌴⌴⌴
worldsayshi 7 days ago|||
That's a problem that is at least possible for the LLM to perceive and learn through training, while counting letters is much more like asking a colour blind person to count flowers by colour.
charcircuit 7 days ago||
Why is it not possible? Do you think it's impossible to count? Do you think it's impossible to learn the letters in each token? Do you think combining the two is not possible?
bob1029 7 days ago|
I don't know if it's "formal", but I think combining Roslyn analyzers with an LLM and a human in the loop could be game-changing. The analyzers can enforce intent over time without any regressions. The LLM can write and modify analyzers. Analyzers can constrain the source code of other analyzers. Hypothetically there is no limit to the number of things that could be encoded this way. I am currently working on a few prototypes in the vicinity of this.

https://learn.microsoft.com/en-us/dotnet/csharp/roslyn-sdk/t...
