Posted by Poudlardo 1 day ago

Leanstral: Open-source agent for trustworthy coding and formal proof engineering (mistral.ai)
Lean 4 paper (2021): https://dl.acm.org/doi/10.1007/978-3-030-79876-5_37
720 points | 174 comments
andai 23 hours ago|
Trustworthy vibe coding. Much better than the other kind!

Not sure I really understand the comparisons though. They emphasize the cost savings relative to Haiku, but Haiku kinda sucks at this task, and Leanstral is worse? If you're optimizing for correctness, why would "yeah it sucks but it's 10 times cheaper" be relevant? Or am I misunderstanding something?

On the promising side, Opus doesn't look great at this benchmark either — maybe we can get better than Opus results by scaling this up. I guess that's the takeaway here.

teekert 13 hours ago||
I also don't understand the focus on vibe coding in the marketing. Vibe coding kind of has the image of being for non-devs, right?

I do like agents (like Claude Code), but I don't consider myself to be vibe coding when I use them. Either I'm using a language/framework I know and I check every step, or I'm learning, checking every step and asking for explanations.

I tried vibe coding, and really dislike the feeling I have when doing it. It feels like building a house, but without caring about it, and just using whatever tech. Sure I may have moisture problems later, but it's a throwaway house anyway. That's how I feel about it. Maybe I have a wrong definition.

Maybe it's good to not use "vibe coding" as a synonym for programming with agent assistance. Just to protect our profession. Like: "Ah you're vibing" (because you have Claude Code open), "No, I'm using CC to essentially type faster and prevent syntax errors and get better test coverage, maybe to get some smart solutions without deep research. But I understand and vouch for every loc here. 'We are not the same.'"

benterix 10 hours ago|||
> I tried vibe coding, and really dislike the feeling I have when doing it. It feels like building a house, but without caring about it, and just using whatever tech. Sure I may have moisture problems later, but it's a throwaway house anyway. That's how I feel about it. Maybe I have a wrong definition.

No, I feel the same. I vibe-coded a few projects and after a few weeks I just threw them away; ultimately I felt I had just wasted my time and wished I could get it back to do something useful.

andai 6 hours ago||||
Yeah, the original meaning of Vibe Coding was "not looking at the code, just going on vibes", but a lot of people now use it to mean "AI was involved in some way".

I see a whole spectrum between those two. I typically alternate between "writing code manually and asking AI for code examples" (ChatGPT coding), and "giving AI specific instructions like, write a function blarg that does foo".

The latter I call Power Coding, in the sense of power armor, because you're still in control and mostly moving manually, but you're much stronger and faster.

I like this better than "tell agent to make a bunch of changes and come back later" because first of all it doesn't break flow (you can use a smaller model for such fine-grained changes so it goes very fast -- it's "realtime"), and second, you don't ever desync from the codebase and need to spend extra time figuring out what the AI did. Each change is sanity-checked as it comes in.

So you stay active, and the code stays slop-free.

I don't hear a lot of people doing this though? Maybe we just don't have good language for it.

teekert 4 hours ago||
"I don't hear a lot of people doing this though? Maybe we just don't have good language for it."

Interesting thought. I guess we don't really; vibe coding is too powerful a term. But perhaps just call it LLM-assisted programming? Where we used to do Stack Overflow-assisted programming. LLM-assisted programming is more focused and goes faster. But since you're wandering around less, I guess you learn less: you're exposed to less new information, some of which was helpful in unexpected ways. Now you have to make learning a specific part of your flow, and that takes discipline/time. But it's well worth it imho. Actually, for me it's the only way to enjoy it.

DANmode 12 hours ago|||
> It feels like building a house, but without caring about it, and just using whatever tech.

So, most homebuilders (in the US) unfortunately.

teekert 11 hours ago||
I myself am now an expert at insulation and all the vapor-permeable and vapor-blocking membranes/foils/foams that come with it.

It came at great cost though, I hated the process of learning and the execution. I was less than happy for some years. But I feel even more uncomfortable vibe-home-improving than I do vibe-coding. The place is starting to look nice now though.

flowerbreeze 23 hours ago|||
They haven't made the chart very clear, but it seems the number of passes is configurable: at 2 passes it's better than Haiku and Sonnet, and at 16 passes it starts closing in on Opus (though it's not quite there), while consistently being less expensive than Sonnet.
ainch 20 hours ago|||
pass@k means that you run the model k times and give it a pass if any of the answers is correct. I guess Lean is one of the few use cases where pass@k actually makes sense, since you can automatically validate correctness.
andai 23 hours ago|||
Oh my bad. I'm not sure how that works in practice. Do you just keep running it until the tests pass? I guess with formal verification you can run it as many times as you need, right?
DrewADesign 23 hours ago||
It’s really not hard — just explicitly ask for trustworthy outputs only in your prompt, and Bob’s your uncle.
miacycle 22 hours ago||
Assuming that what you're dealing with is assertable. I guess what I mean to say is that in some situations it's difficult to articulate what is correct and what isn't, depending upon the situation in which the software executes.
DrewADesign 21 hours ago||
And Bob’s your uncle.
esperent 22 hours ago||
I absolutely called this a couple of weeks ago, nice to be vindicated!

> I'm interested to see what it is in the age of LLMs or similar future tools. I suspect a future phase change might be towards disregarding how easy it is for humans to work with the code and instead focus on provability, testing, perhaps combined with token efficiency.

> Maybe Lean combined with Rust shrunk down to something that is very compiler friendly. Imagine if you could specify what you need in high level language and instead of getting back "vibe code", you get back proven correct code, because that's the only kind of code that will successfully compile.

https://news.ycombinator.com/item?id=47192116

AlotOfReading 19 hours ago||
It's important to keep in mind that no proof system ensures your proof is the correct proof, only that it's a valid proof. Completely understanding what a proof proves is often nearly as difficult as understanding the program it's proving. Normally you benefit because the process of building a proof forces you to develop your understanding more fully.
specvsimpl 17 hours ago||
Uhm, no? Even with "simple" examples like Dijkstra's shortest path, the spec is easier than the implementation. Maybe not for you, but try it out on an arbitrary 5-year-old. On the extreme end, you have results in maths, like Fermat's Last Theorem: every teenager can understand the statement (certainly after 10 mins of explanation), but the proof is thousands of pages of super-specialized maths. It is a spectrum. For cryptography, compression, error-correction, databases, etc., the spec is often much simpler than the implementation.
AlotOfReading 16 hours ago||
I don't know why you created a new account for this, but take the textbook example of a nontrivial formally verified system: seL4. That implementation was 8.7k lines of C code, which corresponded to 15k lines of Isabelle that ultimately needed 100k+ lines of proof to satisfy. And that was with the formal model excluding lots of important properties, like hardware failure, that actual systems deal with.
auggierose 12 hours ago|||
You are confusing the proof with the spec/theorem. A correct proof and a valid proof are the same thing. It doesn't really matter how long the proof is, and you don't even need to understand it for it to be correct, the machine can check that.

But indeed, if the spec includes 8.7k of C code, that is problematic. If you cannot look at the theorem and see that it is what you mean, that is a problem. That is why abstraction is so important; your ultimate spec should not include C-code, that is just too low-level.
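To make the abstraction point concrete, here is a hedged Lean 4 sketch (the names `Sorted`, `IsSortOf`, and `mysort` are illustrative, not from any real development): the spec of a sorting function fits in a few lines, independent of how cleverly the implementation is written or how long its correctness proof gets.

```lean
-- A short, human-auditable spec: what it means for a list to be sorted.
def Sorted : List Nat → Prop
  | [] => True
  | [_] => True
  | a :: b :: rest => a ≤ b ∧ Sorted (b :: rest)

-- `f` is a correct sort iff its output is sorted and a permutation
-- of its input. The implementation of `f` never appears in the spec.
def IsSortOf (f : List Nat → List Nat) : Prop :=
  ∀ l, Sorted (f l) ∧ List.Perm (f l) l
```

Whether `mysort` is insertion sort or a heavily optimized merge sort, the theorem a reviewer must scrutinize is just `IsSortOf mysort`.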

AlotOfReading 7 hours ago||
I'm not confusing them. That's why I gave each of the numbers for SeL4 separately.

Knowing whether those theorems are the right theorems for the problem can be as difficult as understanding the implementation itself. Hence the example of seL4, where the number of theorems exceeds the lines of code in the original implementation and the formal model is large.

It's my experience that most people doing formal methods have seen cases where they actually proved something slightly different than what they intended to. This usually involves an unintentional assumption that isn't generally true.
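A toy Lean 4 illustration of that failure mode (hypothetical example, not from any real proof effort): an extra hypothesis quietly restricts the claim, so the checker happily accepts a theorem far weaker than the one you thought you proved.

```lean
-- The hypothesis `h` silently restricts the claim to empty lists,
-- so this "proves" much less than "every list has length 0"
-- (which is false). The kernel only checks validity, not intent.
theorem length_zero (l : List Nat) (h : l = []) : l.length = 0 := by
  subst h
  rfl
```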

auggierose 6 hours ago||
I think you have been confusing them. Two theorems are the same if they have the same statement (spec). A proof is not a theorem; nobody cares whether two proofs are the same or not.
xpe 7 hours ago|||
> I don't know why you created a new account for this

What value does this add to the conversation? I’m not seeing it: am I missing something? It comes across as a kind of insult.

They made a good point in my opinion! (The “Uhm no” part got it off on the wrong foot, I will admit.) But even if you felt annoyed or didn’t agree with the point, it was substantive and moved the conversation forward. I’m here for the (genuine) questions and (constructive) debate and (civil) pushback.

I like to welcome new users before they take too much of a beating. That can come later when they are too invested to leave and/or when morale needs improving.

So welcome! Bring a helmet, and don’t stop disagreeing.

agentultra 6 hours ago||
Very cool but I haven’t been able to convince software developers in industry to write property based tests. I sometimes joke that we will start writing formal proofs until the tests improve. Just so that they will appreciate the difference a little more.

I can’t even convince most developers to use model checkers. Far more informal than a full proof in Lean. Still highly useful in many engineering tasks. People prefer boxes and arrows and waving their hands.

Anyway, I don’t know that I’d want to have a system vibe code a proof. These proofs, I suspect, aren’t going to be generated to be readable, elegant, and well understood by people. Like the programs these systems generate, they will merely look plausible.

And besides, you will still need a human to review the proof and make sure it’s specifying the right things. This doesn’t solve that requirement.

Although I have thought that it would be useful to have a system that could prove trivial lemmas in the proof. That would be very neat.

rowanG077 6 hours ago|
The point is you just need to scrutinize the theorem. Not easy either, but still significantly less work than writing the proof.
patall 23 hours ago||
Maybe a naive question: given that they see better performance with more passes but the effect hits a limit after a few passes, would performance increase if they used different models per pass, i.e leanstral, kimi, qwen and leanstral again instead of 4x leanstral?
andai 23 hours ago|
This is called an "LLM alloy"; you can even do it in agentic workflows, where you simply swap the model on each LLM invocation.

It does actually significantly boost performance. There was an article on here about it recently, I'll see if I can find it.

Edit: https://news.ycombinator.com/item?id=44630724

They found the more different the models were (the less overlap in correctly solved problems), the more it boosted the score.
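A hedged sketch of the idea, as I understand it (`call` and `verify` are placeholders for your LLM client and checker; model names are illustrative):

```python
from itertools import cycle

# "LLM alloy": rotate through several models on successive attempts
# instead of retrying one model k times. Diversity helps because
# different models tend to fail on different problems.
def alloy_pass_at_k(prompt, models, k, call, verify):
    rotation = cycle(models)
    for _ in range(k):
        model = next(rotation)
        candidate = call(model, prompt)
        if verify(candidate):
            return model  # which model cracked it
    return None  # no model solved it within the budget
```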

patall 23 hours ago||
That sounds quite interesting. Makes me wonder if they will eventually have to train multiple independent models that cover those different niches. But maybe we will see that sooner or later. Thanks for the link.
andai 6 hours ago|||
Mixture of Mixtures of Experts ;)
cyanydeez 23 hours ago|||
One would think that, with LoRAs being so successful in Stable Diffusion, more people would focus on constructing framework-based LoRAs; but the economics of all this probably preclude going niche in any direction, so everyone keeps building the do-all models.
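For readers unfamiliar with the mechanics, a hedged NumPy sketch of the LoRA idea (sizes are illustrative): the base weight matrix stays frozen and only a small low-rank update is trained, which is what makes per-framework adapters cheap to build and distribute.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                              # illustrative sizes, with r << d
W = rng.standard_normal((d, d))          # frozen pretrained weights
A = rng.standard_normal((r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                     # zero-init so the adapter starts as a no-op
W_adapted = W + B @ A                    # effective weights at inference
# The adapter adds only d*r + r*d parameters vs d*d for full fine-tuning.
```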
Aerroon 16 hours ago||
The SD ecosystem in large part was grassroots and focused on nsfw. I think current LLM companies would have a hard time getting that to happen due to their safety stuff.
andai 6 hours ago||
Fine-tuning does exist on the major model providers, and presumably already uses LoRA. (Not sure though.)

We saw last year that it's remarkably easy to bypass safety filters by fine-tuning GPT, even when the fine-tuning seems innocuous. e.g. the paper about security research finetuning (getting the model to add vulnerabilities) producing misaligned outputs in other areas. It seems like it flipped some kind of global evil neuron. (Maybe they can freeze that one during finetuning? haha)

Found it: Emergent Misalignment

https://news.ycombinator.com/item?id=43176553

https://news.ycombinator.com/item?id=44554865

JoshTriplett 22 hours ago||
Pleasant surprise: someone saying "open source" and actually meaning Open Source. It looks like the weights are Apache-2.0 licensed.
jasonjmcghee 21 hours ago|
Based on community definitions I've seen, this is considered "open weights". If you can't reproduce the model, it's not "open source".
xpe 6 hours ago||
Yes, “open weights” conveys the reality more clearly: merely having the parameters is very different from being able to run the process that creates them. Without openness of the full process, start to finish, much is hidden.*

Remember, language is what we make it. Dictionaries are useful catalogs of usage but we make the judgment calls.

* Even with the process, much is not well understood! / The ethics of releasing an open weights model at some capability level is a separate discussion.

flakiness 22 hours ago||
FYI The Lean 4 paper: https://dl.acm.org/doi/10.1007/978-3-030-79876-5_37
wazHFsRy 13 hours ago||
Is anyone using this approach with lean to ship production code? Writing lean spec as human, implementation and proof by agent? And then shipping lean or exporting to C? Would be great to understand how you are actually using this.
techcam 3 hours ago||
The tricky part is that prompts can look “correct” but still behave unpredictably depending on phrasing.
toastal 16 hours ago||
Naturally the Microsoft-owned language is getting the AI hype instead of the more mature options that could do this sort of work… Agda, ATS, Coq/Rocq, Dafny, F*, Idris, Isabelle, Why3, just to name a few.
markusde 8 hours ago||
You should check out the recent PRs to the Agda repo... the community is currently very divided about AI. For better or worse, the people driving the Lean project have been interested in AI for quite some time.
Paracompact 16 hours ago|||
A bit uncharitable. I'm a diehard fan of Rocq, but it's nothing unusual to see the young new hotness that is Lean continue to get the spotlight. It's not a sign of Microsoft putting its thumb on the scales, and the hype for Lean has long predated LLMs.

It's certainly less mature when it comes to verified programming, but its appeal to mathematicians (rather than formal methods experts) has earned it much respect.

mrklol 16 hours ago||
Am I missing something? Isn’t that the language most are using currently when looking at research at openai, google, deepseek etc?