Posted by Poudlardo 1 day ago

Leanstral: Open-source agent for trustworthy coding and formal proof engineering (mistral.ai)
Lean 4 paper (2021): https://dl.acm.org/doi/10.1007/978-3-030-79876-5_37
728 points | 177 comments
piyh 22 hours ago|
Automated theorem provers running on a $5k piece of hardware is a cool version of the future
maelito 14 hours ago||
I don't understand how this can impact my JS (+yaml, css, etc) code writing in a complex app.
techcam 4 hours ago||
The tricky part is that prompts can look “correct” but still behave unpredictably depending on phrasing.
Havoc 1 day ago||
What are these "passes" they reference here? Haven't seen that before in LLM evals

Could definitely be interesting for having another model run over the codebase when looking for improvements

rockinghigh 1 day ago|
It's the number of attempts at answering the question.
lefrenchy 1 day ago||
Does Mistral come close to Opus 4.6 with any of their models?
chucky_z 1 day ago||
I use mistral-medium-3.1 for a lot of random daily tasks, along with the vibe cli. In my personal opinion, Mistral is my preferred 'model vendor' by far at this point. They're extremely consistent between releases while each one just feels better. I also have a strong personal preference for the output.

I actively use gemini-3.1-pro-preview, claude-4.6-opus-high, and gpt-5.3-codex as well. I prefer them all for different reasons, however I usually _start_ with mistral if it's an option.

sa-code 1 day ago||
Why not Large 3? It's larger and cheaper
tjwebbnorfolk 1 day ago|||
Mistral hasn't been in the running for SOTA for quite a while now
DarkNova6 1 day ago||
Not at the moment, but a release of Mistral 4 seems close, which would likely bridge the gap.
re-thc 1 day ago||
Mistral Small 4 is already announced.
androiddrew 23 hours ago||
MoE, but in the 120B range. Man, I wish it were an 80B. I have 2 GPUs with 62 GiB of usable VRAM. A 4-bit 80B gives me some context window, but 120B puts me into system RAM
Aerroon 17 hours ago||
Either some q3 quant or, since it's a MoE, maybe a REAP version at q4 might work (or could be terrible; I'm not sure about REAP'd models).
elAhmo 1 day ago||
I don’t know a single person using Mistral models.
consumer451 23 hours ago||
Isn't their latest speech to text model SOTA? When I tested it on jargon, it was amazing.

https://news.ycombinator.com/item?id=46886735

troyvit 20 hours ago||
I'm using this model for my first Python project, coding with opencode along with Devstral and Mistral Large 3. I know it's not as capable as other, more expensive models, but working with it this way is teaching me Python. More directly to your point though, the speech-to-text model is really good.

It's funny because I just took a break from it to read some hn and found this post.

Adrig 23 hours ago|||
I used Ministral for data cleaning.

I was surprised: even though it was the cheapest option (compared against small models from Anthropic), it performed the best in my benchmarks.

Bombthecat 23 hours ago||
Mistral is super smart at handling smaller contexts and answering questions about them
badsectoracula 22 hours ago|||
Pretty much all of my LLM usage has been with Mistral's open source models running on my PC. I don't do full agentic coding, as when I tried it with Devstral Small 2 it was a bit too slow (though if I could get 2-3 times my PC's speed from a second computer it'd be a different story, and AFAIK that is doable if I were willing to spend $2-3k on it). However, I've used Mistral's models for spelling and grammar checks[0], translations[1][2], summaries[3], and trying to figure out whether common email SPAM avoidance tricks are pointless in the LLM age :-P [4]. FWIW, the tool you can see in the shots is a Tcl/Tk script calling a llama.cpp-based command-line utility I threw together some time ago when experimenting with llama.cpp.

I've also used Devstral Small to make a simple raytracer[5][6] (it was made using the "classic" chat by copy/pasting code, not any agentic approach, and I did fix bits of it in the process) and a quick-and-dirty "games database" in Python+Flask+Sqlite for my own use (mainly a game backlog DB :-P).

I also use it to make various small snippets and to generate boilerplate (e.g. I have an enum in C and want a function that prints the name for each enum value, or have it match a string read from a JSON file to the appropriate enum value), "translate" between languages (I recently had it convert some matrix code I had written in Pascal into C), etc.

[0] https://i.imgur.com/f4OrNI5.png

[1] https://i.imgur.com/Zac3P4t.png

[2] https://i.imgur.com/jPYYKCd.png

[3] https://i.imgur.com/WZGfCdq.png

[4] https://i.imgur.com/ytYkyQW.png

[5] https://i.imgur.com/FevOm0o.png (screenshot)

[6] https://app.filen.io/#/d/e05ae468-6741-453c-a18d-e83dcc3de92... (C code)

[7] https://i.imgur.com/BzK8JtT.png

ainch 21 hours ago|||
That's likely because they're chasing enterprise - see deals with HSBC, ASML, AXA, BNP Paribas etc... Given swelling anti-US sentiment and their status as a French 'national champion', Mistral are probably in a strong position for now regardless of model performance, research quality or consumer uptake.
brainless 21 hours ago|||
I'm building a knowledge graph on personal data (emails, files) with Ministral 3:3b. I've tried Qwen 3.5:4b as well, but mostly use Ministral.

Works really well. Extracts companies you have dealt with, people, topics, events, locations, financial transactions, bills, etc.

pelagicAustral 1 day ago|||
Me neither; they're not ready for prime time imo. I have a yearly sub and the product is just orders of magnitude behind Anthropic's offering. I use Code for real-world stuff and I'm happy with the result; Mistral is just not something I can trust right now.
Fnoord 22 hours ago|||
I use them solely.
nimchimpsky 22 hours ago||
[dead]
blueTiger33 13 hours ago||
I read it as Lanestra, and thought of that story :D
jasonjmcghee 22 hours ago||
Curious if pass@2 was tested for haiku and sonnet?
miacycle 23 hours ago||
The TDD foundation! We might need one of those. :)
igravious 20 hours ago|
"and continues to scale linearly"

It clearly and demonstrably does not. In fact, from eyeballing their chart, Qwen, Kimi, and GLM scale linearly whereas Leanstral does not. But this is not surprising, because Alibaba, Moonshot, and Zhipu each have hundreds of employees and hundreds of millions of dollars of investment.
