Leanstral: Open-source agent for trustworthy coding and formal proof engineering

Posted by Poudlardo 23 hours ago

Leanstral: Open-source agent for trustworthy coding and formal proof engineering(mistral.ai)

Lean 4 paper (2021): https://dl.acm.org/doi/10.1007/978-3-030-79876-5_37

720 points | 174 comments

robertwer 2 minutes ago|

I’ve never worked with formal validation (barely remember my CS course). This release looks impressive. But I'm trying to wrap my head around the near-term practical applications for everyday software.

Right now, we see a lot of business experts in enterprises tempted to use AI to impl. business logic so they don't have to wait for (or pay) software experts. Would this kind of technology help these users any time soon?

My current theory is that the real breakthrough for these non-developers will only happen when they can actually verify the result themselves without needing an another expert in the loop. But I don't see that with formal validation anytime soon.

Do I overlook something?

cadamsdotcom 18 hours ago||

It’s great to see this pattern of people realising that agents can specify the desired behavior then write code to conform to the specs.

TDD, verification, whatever your tool; verification suites of all sorts accrue over time into a very detailed repository of documentation of how things are supposed to work that, being executable, puts zero tokens in the context when the code is correct.

It’s more powerful than reams upon reams of markdown specs. That’s because it encodes details, not intent. Your intent is helpful at the leading edge of the process, but the codified result needs shoring up to prevent regression. That’s the area software engineering has always ignored because we have gotten by on letting teams hold context in their heads and docs.

As software gets more complex we need better solutions than “go ask Jim about that, bloke’s been in the code for years”.

bluGill 7 hours ago||

> That’s because it encodes details, not intent.

Be careful here - make sure you encode the right details. I've seen many cases where the tests are encoding the details of how it was implemented and not what it is intended to do. This means that you can't refactor anything because your tests are enforcing a design. (refactor is changing code without deleting tests, the trick is how can you make design changes without deleting tests - which means you have to test as much as possible at a point where changing that part of the design isn't possible anyway)

necovek 1 hour ago||

While you are right that you need to be encoding the right details, I disagree on the tests enforcing a design point.

As part of the proper testing strategy, you will have tests that cover individual behavior of a small block/function (real "unit" tests), tests that cover integration points only up to the integration itself, and a small number of end-to-end or multi-component integration tests.

Only the last category should stay mostly idempotent under refactoring, depending on the type of refactor you are doing.

Integration tests will obviously be affected when you are refactoring the interfaces between components, and unit tests will be affected when you are refactoring the components themselves. Yes, you should apply the strategy that keeps it under incremental reverse TDD approach (do the refactor and keep the old interface, potentially by calling into new API from the old; then in second step replace use of old API as well, including in tests).

Tests generally define behavior and implementation in a TDD approach: it'd be weird if they do not need changing at all when you are changing the implementation.

bluGill 16 minutes ago||

Fine, but don't check in the tests that prove implementation since they will be deleted soon anyway. The only tests to check in are ones that - by failing - informed you that you broke something. We don't know which those tests are and because most tests run fast we tend to check in lots of tests that will never fail in a useful way.

phillipclapham 1 hour ago|||

The formal verification angle is def compelling, but I keep running into a harder problem upstream: if the agent's decision logic lives in a prompt, what exactly are you verifying exactly? You can check that the generated code satisfies a spec, but the reasoning that led to that code is opaque by design. You can't write a Lean proof about "the model thought this was the right trade-off."

What I've found in practice is that trustworthiness in agentic systems requires a separation of concerns that most architectures simply don't enforce: keeping deterministic decision logic externalized from the model so it's actually inspectable. Once you've got that, tools like this become a lot more powerful because you've got something concrete to verify against. Without it, you're proving properties of outputs while the decision process remains a black box.

Curious how Leanstral handles cases where the agent's architectural choices (not just the implementation) need to be auditable.

BowBun 16 hours ago|||

I feel like the difference is minimal, if not entirely dismissable. Code in this sense is just a representation of the same information as someone would write in an .md file. The resolution changes, and that's where both detail and context are lost.

I'm not against TDD or verification-first development, but I don't think writing that as code is the end-goal. I'll concede that there's millions of lines of tests that already exist, so we should be using those as a foundation while everything else catches up.

chriswarbo 10 hours ago|||

Tests (and type-checkers, linters, formal specs, etc.) ground the model in reality: they show it that it got something wrong (without needing a human in the loop). It's empiricism, "nullius in verba"; the scientific approach, which lead to remarkable advances in a few hundred years; that over a thousand years of ungrounded philosophy couldn't achieve.

discreteevent 10 hours ago|||

The scientific approach is not only or primarily empiricism. We didn't test our way to understanding. The scientific approach starts with a theory that does it's best to explain some phenomenon. Then the theory is criticized by experts. Finally, if it seems to be a promising theory tests are constructed. The tests can help verify the theory but it is the theory that provides the explanation which is the important part. Once we have explanation then we have understanding which allows us to play around with the model to come up with new things, diagnose problems etc.

The scientific approach is theory driven, not test driven. Understanding (and the power that gives us) is the goal.

chriswarbo 8 hours ago|||

> The scientific approach starts with a theory that does it's best to explain some phenomenon

At the risk of stretching the analogy, the LLM's internal representation is that theory: gradient-descent has tried to "explain" its input corpus (+ RL fine-tuning), which will likely contain relevant source code, documentation, papers, etc. to our problem.

I'd also say that a piece of software is a theory too (quite literally, if we follow Curry-Howard). A piece of software generated by an LLM is a more-specific, more-explicit subset of its internal NN model.

Tests, and other real CLI interactions, allow the model to find out that it's wrong (~empiricism); compared to going round and round in chain-of-thought (~philosophy).

Of course, test failures don't tell us how to make it actually pass; the same way that unexpected experimental/observational results don't tell us what an appropriate explanation/theory should be (see: Dark matter, dark energy, etc.!)

discreteevent 7 hours ago||

The ai is just pattern matching. Vibing is not understanding, whether done by humans or machines. Vibe programmers (of which there are many) make a mess of the codebase piling on patch after patch. But they get the tests to pass!

Vibing gives you something like the geocentric model of the solar system. It kind of works but but it's much more complicated and hard to work with.

ptidhomme 9 hours ago|||

The theory still emanated from actual observations, didn't it ?

discreteevent 7 hours ago||

It did but they were meaningless without a human intellect trying to make sense of them.

SiempreViernes 5 hours ago|||

No, the theory comes from the authors knowledge, culture and inclinations, not from the fact.

Obviously the author has to do much work in selecting the correct bits from this baggage to get a structure that makes useful predictions, that is to say predictions that reproduces observable facts. But ultimately the theory comes from the author, not from the facts, it would be hard to imagine how one can come up with a theory that doesn't fit all the facts known to an author if the theory truly "emanated" from the facts in any sense strict enough to matter.

boyoboy 4 hours ago|||

[dead]

applfanboysbgon 10 hours ago||||

It most certainly is not. All your tests are doing is seeding the context with tokens that increase the probability of tokens related to solving the problem being selected next. One small problem: if the dataset doesn't have sufficiently well-represented answers to the specific problem, no amount of finessing the probability of token selection is going to lead to LLMs solving the problem. The scientific method is grounded in the ability to reason, not probabilistically retrieve random words that are statistically highly correlated with appearing near other words.

cowboy_henk 8 hours ago||||

This only holds if you understand what's in the tests, and the tests are realistic. The moment you let the LLM write the tests without understanding them, you may as well just let it write the code directly.

chriswarbo 4 hours ago|||

> The moment you let the LLM write the tests without understanding them, you may as well just let it write the code directly.

I disagree. Having tests (even if the LLM wrote them itself!) gives the model some grounding, and exposes some of its inconsistencies. LLMs are not logically-omniscient; they can "change their minds" (next-token probabilities) when confronted with evidence (e.g. test failure messages). Chain-of-thought allows more computation to happen; but it doesn't give the model any extra evidence (i.e. Shannon information; outcomes that are surprising, given its prior probabilities).

rowanG077 6 hours ago|||

I disagree to some degree. Tests have value even beyond whether they test the right thing. At the very least they show something worked and now doesnt work or vice versa. That has value in itself.

pydry 9 hours ago|||

This assumes that tests are realistic, which for the most part they are not.

cadamsdotcom 11 hours ago|||

Say you describe your kitchen as “I want a kitchen” - where are the knives? Where’s the stove? Answer: you abdicated control over those details, so it’s wherever the stochastic parrot decided to put them, which may or may not be where they ended up last time you pulled your LLM generate-me-a-kitchen lever. And it may not be where you want.

Don’t like the layout? Let’s reroll! Back to the generative kitchen agent for a new one! ($$$)

The big labs will gladly let you reroll until you’re happy. But software - and kitchens - should not be generated in a casino.

A finished software product - like a working kitchen - is a fractal collection of tiny details. Keeping your finished software from falling apart under its own weight means upholding as many of those details as possible.

Like a good kitchen a few differences are all that stands between software that works and software that’s hell. In software the probability that an agent will get 100% of the details right is very very small.

Details matter.

vidarh 11 hours ago||

If it is fast enough, and cheap enough, people would very happily reroll specific subsets of decisions until happy, and then lock that down. And specify in more details the corner cases that it doesn't get just how you want it.

People metaphorically do that all the time when designing rooms, in the form of endless browsing of magazines or Tik Tok or similar to find something they like instead of starting from first principles and designing exactly what they want, because usually they don't know exactly what they want.

A lot of the time we'd be happier with a spec at the end of the process than at the beginning. A spec that ensures the current understanding of what is intentional vs. what is an accident we haven't addressed yet is nailed down would be valuable. Locking it all down at the start, on the other hand, is often impossible and/or inadvisable.

cadamsdotcom 1 hour ago||

Agreed; often you don’t know quite what you want until you’ve seen it.

Spec is an overloaded term in software :) because there are design specs (the plan, alternatives considered etc) and engineering style specs (imagine creating a document with enough detail that someone overseas could write your documentation from it while you’re building it)

Those need distinct names or we are all at risk of talking past each other :)

strujillo 1 hour ago|||

That matches what I’ve seen as well — generation is the easy part, validation is the bottleneck.

I’ve been experimenting with a small sparse-regression system that infers governing equations from raw data, and it can produce a lot of plausible candidates quickly. The hard part is filtering out the ones that look right but violate underlying constraints.

For example, it recovered the Sun’s rotation (~25.1 days vs 27 actual) from solar wind data, but most candidate equations were subtly wrong until you enforced consistency checks.

Feels like systems that treat verification as the source of truth (not just an afterthought) are the ones that will actually scale.

tonymet 17 hours ago|||

AI is the reality that TDD never before had the opportunity to live up to

nextos 16 hours ago||

Not just TDD. Amazon, for instance, is heading towards something between TDD and lightweight formal methods.

They are embracing property-based specifications and testing à la Haskell's QuickCheck: https://kiro.dev

Then, already in formal methods territory, refinement types (e.g. Dafny, Liquid Haskell) are great and less complex than dependent types (e.g. Lean, Agda).

prohobo 12 hours ago|||

What about model-driven development? Spec to code was the name of the game for UML.

rusk 10 hours ago||

Setting aside that model means something different now … MDD never really worked because the tooling never really dealt with intent. You would get so far with your specifications (models) but the semantic rigidity of the tooling mean that at some point your solution would have to part way. LLM is the missing piece that finally makes this approach viable where the intent can be inferred dynamically and this guides the implementation specifics. Arguably the purpose of TDD/BDD was to shore up the gaps in communicating intent, and people came to understand that was its purpose, whereas the key intent in the original XP setting was to capture and preserve “known good” operation and guard against regression (in XP mindset, perhaps fatefully clear intent was assumed)

oakpond 13 hours ago||||

It makes sense to me as long as you're not vibe coding the PBTs.

pydry 9 hours ago||||

The deluge of amazon bugs ive been seeing recently makes me hesitant to follow in amazon's lead.

viking123 11 hours ago|||

Kiro is such garbage though

mkesper 10 hours ago||

If you add why you think so we might learn something.

sumedh 8 hours ago||

The same prompt in the same project gives different results/slightly worse results compared to Claude Code, both using Opus model.

refulgentis 17 hours ago|||

I've seen this sentiment and am a big fan of it, but I was confused by the blog post, and based on your comment you might be able to help: how does Lean help me? FWIW, context is: code Dart/Flutter day to day.

I can think of some strawmen: for example, prove a state machine in Lean, then port the proven version to Dart? But I'm not familiar enough with Lean to know if that's like saying "prove moon made of cheese with JavaScript, then deploy to the US mainframe"

baq 9 hours ago|||

yesterday I had to tell a frontier model to translate my code to tla+ to find a tricky cache invalidation bug which nothing could find - gpt 5.4, gemini 3.1, opus 4.6 all failed. translation took maybe 5 mins, the bug was found in seconds, total time to fix from idea to commit - about 15 mins.

if you can get a model to quickly translate a relevant subset of your code to lean to find tricky bugs and map lean fixes back to your codebase space, you've got yourself a huge unlock. (spoiler alert: you basically can, today)

refulgentis 2 hours ago|||

Thanks for following up on this: I was really surprised by how much air this paeon to, idk, TDD, took out of the comments by getting off-topic.

Before you commented, I started poking at what you described for 15 minutes, then forget about it and fell asleep. Now I remembered, and I know it's viable and IIUC it's almost certainly going to make a big difference in my work practice moving forward. Cheers.

rigorclaw 5 hours ago|||

[flagged]

Paracompact 17 hours ago|||

I don't think he's referring to Lean specifically, but any sort of executable testing methodology. It removes the human in the loop in the confidence assurance story, or at least greatly reduces their labor. You cannot ever get such assurance just by saying, "Well this model seems really smart to me!" At best, you would wind up with AI-Jim.

(One way Lean or Rocq could help you directly, though, would be if you coded your program in it and then compiled it to C via their built-in support for it. Such is very difficult at the moment, however, and in the industry is mostly reserved for low-level, high-consequence systems.)

trenchgun 13 hours ago|||

>Such is very difficult at the moment

What do you mean? It's a nice and simple language. Way easier to get started than OCaml or Haskell for example. And LLMs write programs in Lean4 with ease as well. Only issue is that there are not as many libraries (for software, for math proofs there is plenty).

But for example I worked with Claude Code and implemented a shell + most of unix coreutils in like a couple of hours. Claude did some simple proofs as well, but that part is obvs harder. But when the program is already in Lean4, you can start moving up the verification ladder up piece by piece.

cjfd 11 hours ago||

Well, if you do not need to care about performance everything can be extremely simple indeed. Let me show you some data structure in coq/rocq while switching off notations and diplaying low level content.

Require Import String.

Definition hello: string := "Hello world!".

Print hello.

hello = String (Ascii.Ascii false false false true false false true false) (String (Ascii.Ascii true false true false false true true false) (String (Ascii.Ascii false false true true false true true false) (String (Ascii.Ascii false false true true false true true false) (String (Ascii.Ascii true true true true false true true false) (String (Ascii.Ascii false false false false false true false false) (String (Ascii.Ascii true true true false true true true false) (String (Ascii.Ascii true true true true false true true false) (String (Ascii.Ascii false true false false true true true false) (String (Ascii.Ascii false false true true false true true false) (String (Ascii.Ascii false false true false false true true false) (String (Ascii.Ascii true false false false false true false false) EmptyString))))))))))) : string

strongly-typed 6 hours ago||

You know you could just define the verified specs in lean and if performance is a problem, use the lean spec to extract an interface and tests for a more performant language like rust. You could at least in theory use Lean as an orchestrator of verified interfaces.

refulgentis 16 hours ago|||

But isn't that tantamount with "his comment is a complete non-sequitor"?

Paracompact 15 hours ago||

I don't think so? Lean is formal methods, so it makes sense to discuss the boons of formal and semiformal methods more generally.

I used to think that the only way we would be able to trust AI output would be by leaning heavily into proof-carrying code, but I've come to appreciate the other approaches as well.

refulgentis 3 hours ago||

But that's exactly my point. "It's natural to discuss the broader category" is doing a lot of heavy lifting here. The blog post is making a very specific claim: that formal proof, checked by Lean's kernel, is qualitatively different from testing, it lets you skip the human review loop entirely. cadamsdotcom's comment rounds that down to "executable specs good, markdown specs bad," which... sure, but that's been the TDD elevator pitch for 20 years.

If someone posted a breakthrough in cryptographic verification and the top comment was "yeah, unit tests are great," we'd all recognize that as missing the point. I don't think it's unrelated, I think it's almost related, which is worse, because it pattern-matches onto agreement while losing the actual insight.

hristian 8 hours ago||

[dead]

storus 34 minutes ago||

I just feel like Mistral is heading for bad financial times when they are focusing on fringe academic areas and not on building a business out of their research. Initial Mistral was largely based on LLaMA, then they added innovative MoE and since then disappeared, doing AI consulting for big EU companies instead.

myylogic 1 hour ago||

The verification angle makes sense, especially for high-stakes domains.

But I wonder how this scales in practice outside of formal environments.

In most ML/LLM systems, the bottleneck isn’t just correctness of individual steps, but the interaction between components (data → tokenizer → model → inference). A lot of failures come from subtle mismatches across the pipeline rather than strictly invalid logic.

Formal specs are great when the system is well-defined, but many real-world pipelines are still exploratory and data-dependent.

It feels like there’s a gap between: • formally verified components • and emergent behavior in end-to-end systems

Curious how you see this approach handling those system-level uncertainties.

whazor 51 minutes ago|

There are also software model checkers that can model distributed processes. You have to simplify the state a bit, otherwise you get a state space explosion.

I tried it out myself, I let AI add action transitions through the code, like: // A -> B: some description. Then I validate via a test that every action transition defined in my model is also defined somewhere commented in code, and other way around that every comment exists in the model.

Finally, I let AI write model check queries on particular properties. If I notice a particular bug, then I ask AI to analyze the model and the model check queries on why it could happen, and ask to strengthen it.

It sounds like a lot of effort, but I got it working in a half hour.

lsb 21 hours ago||

The real world success they report reminds me of Simon Willison’s Red Green TDD: https://simonwillison.net/guides/agentic-engineering-pattern...

> Instead of taking a stab in the dark, Leanstral rolled up its sleeves. It successfully built test code to recreate the failing environment and diagnosed the underlying issue with definitional equality. The model correctly identified that because def creates a rigid definition requiring explicit unfolding, it was actively blocking the rw tactic from seeing the underlying structure it needed to match.

jatins 19 hours ago||

If Agent is writing the tests itself, does it offer better correctness guarantees than letting it write code and tests?

bluGill 7 hours ago|||

In my experience the agent regularly breaks some current features while adding a new one - much more often than a human would. Agents too often forget about the last feature when adding the next and so will break things. Thus I find Agent generated tests important as they stop the agent from making a lot of future mistakes.

MillionOClock 17 hours ago||||

It is definitely not foolproof but IMHO, to some extent, it is easier to describe what you expect to see than to implement it so I don't find it unreasonable to think it might provide some advantages in terms of correctness.

stingraycharles 16 hours ago||

That definitely depends upon the situation. More often than not, properly testing a component takes me more time than writing it.

johnmaguire 14 hours ago||

In my experience, this tends to be more related to instrumentation / architecture than a lack of ability to describe correct results. TDD is often suggested as a solution.

rvz 15 hours ago|||

Given the issues with AWS with Kiro and Github, We already have just a few high-profile examples of what happens when AI is used at scale and even when you let it generate tests which is something you should absolutely not do.

Otherwise in some cases, you get this issue [0].

[0] https://sketch.dev/blog/our-first-outage-from-llm-written-co...

vlfig 9 hours ago|||

Don't "let it" generate tests. Be intentional. Define them in a way that's slightly oblique to how the production code approaches the problem, so the seams don't match. Heck, that's why it's good to write them before even thinking about the prod side.

louiskottmann 14 hours ago|||

The linked article does not speak of tests, it speaks of a team that failed to properly review an LLM refactor then proceeds to blame the tooling.

LLMs are good at writing tests in my experience.

saberience 11 hours ago|||

That article is literally a definition of TDD that has been around for years and years. There's nothing novel there at all. It's literally test driven development.

skanga 20 hours ago||

TDD == Prompt Engineering, for Agentic coding tasks.

_boffin_ 18 hours ago||

Wild it’s taken people this long to realize this. Also lean tickets / tasks with all needed context to complete the task, including needed references / docs, places to look in source, acceptance criteria, other stuff.

kimsant 11 hours ago||

AI agents will become a comodity.

Europeans not wanting to be dependent, and they are giving for free what US investors planed to charge with 90% margin.

Amazing! What a blast. Thank you for your service (this first 100M$ burned to POC GPT1 and from here, we are so good to go)

warpspin 9 hours ago||

The problem with the European independence story is, that it seems Mistral runs its own stuff also on US cloud act affected infrastructure. This makes them a very weird value proposition: If I accept a level of "independence" whereby I run on AWS or Azure, I could as well pay for Anthropic or GPT to have SOTA performance.

If I do not accept that level of independence but want more, I need to buy what's on OVH, Scaleway, Ionos etc. or host my own, but that usually means even smaller, worse models or a lot of investment.

Nevertheless, the "band" that Mistral occupies for economic success is very narrow. Basically just people who need independence "on paper" but not really. Because if I'm searching for actual independence, there's no way I could give them money at the moment for one of their products and it making sense, cause none of their plans are an actual independence-improvement over, let's say, Amazon Bedrock.

I really really want to support them, but it must make economic sense for my company, too, and it doesn't.

kimsant 8 hours ago|||

I don’t care about the servers, they are a comodity already.

The key is to avoid chantage, remember Oracle with DBs, people learned not to build on top of unreplaceable stuff

tin7in 6 hours ago|||

They are building their own infra - south of Paris and another one was announced in Sweden recently.

warpspin 4 hours ago||

Then why does their list of subprocessors list Google and Microsoft "for cloud infrastructure", specifically for "Le Chat, La Plateforme, Mistral Code"? Sounds to me as if they're mainly running on Azure.

Also, they're listing CoreWeave as inference provider in "EEA" area, but CoreWeave is of course also an US company. Even if they have their data center physically in the EU, it must be considered open access for the USA due to the CLOUD act.

https://trust.mistral.ai/subprocessors

If what you say is true, they have a communications problem and they need to fix that urgently. Right now, this is why they don't get my business. Others will have made the same decision based on their own subprocessor list.

Or did you mean, they're like, right now building it and plan to move there, but it's not up yet?

bigfudge 10 hours ago|||

I really hope you're right. Sadly, though, I don't see any evidence of UK companies disinvesting from big US tech. There aren't good alternatives and what there is is too complex. As long as 'everyone else is still using MS', it seems like it's a brave CTO that switches to European providers. Unless that happens, the network effect of having AI+data is likely to mean US tech still has a big advantage in corp settings. But, HN - please tell me I'm wrong!

utopiah 9 hours ago|||

> There aren't good alternatives and what there is is too complex.

Sounds like a worth challenge for this community, mind giving actual examples and see what others can suggest?

coffeebeqn 2 hours ago||

Vertical integration and breadth and depth of offerings on the cloud and customer lock-in from dominating it for 20 years

worldsayshi 10 hours ago|||

I wonder what the biggest (non-AI) moats are for US tech against the alternatives?

elophanto_agent 9 hours ago||

[dead]

baq 10 hours ago||

they will, but the jagged frontier is fractal and each one will have different capabilities; you'll want to mix models and to get best results consistently you'll need to.

rothific 20 hours ago||

There have been a lot of conversations recently about how model alignment is relative and diversity of alignment is important - see the recent podcast episode between Jack Clark (co-founder of Anthropic) and Ezra Klein.

Many comments here point out that Mistral's models are not keeping up with other frontier models - this has been my personal experience as well. However, we need more diversity of model alignment techniques and companies training them - so any company taking this seriously is valuable.

nicman23 13 hours ago|

they ll get there

jasonjmcghee 22 hours ago||

Curious if anyone else had the same reaction as me

This model is specifically trained on this task and significantly[1] underperforms opus.

Opus costs about 6x more.

Which seems... totally worth it based on the task at hand.

[1]: based on the total spread of tested models

beernet 21 hours ago||

Agreed. The idea is nice and honorable. At the same time, if AI has been proving one thing, it's that quality usually reigns over control and trust (except for some sensitive sectors and applications). Of course it's less capital-intense, so makes sense for a comparably little EU startup to focus on that niche. Likely won't spin the top line needle much, though, for the reasons stated.

isodev 13 hours ago|||

> quality usually reigns over control and trust

Most Copilot customers use Copilot because Microsoft has been able to pinky promise some level of control for their sensitive data. That's why many don't get to use Claude or Codex or Mistral directly at work and instead are forced through their lobotomised Copilot flavours.

Remember, as of yet, companies haven't been able to actually measure the value of LLMs ... so it's all in the hands of Legal to choose which models you can use based on marketing and big words.

segmondy 19 hours ago||||

Ha, keep putting your prompts and workflows into cloud models. They are not okay with being a platform, they intend to cannibalize all businesses. Quality doesn't always reign over control and trust. Your data and original ideas are your edge and moat.

hrmtst93837 9 hours ago||||

Treating "quality" as something you can reliably measure in AI proof tools sounds nice until you try auditing model drift after the 14th update and realize the "trust" angle stops being a niche preference and starts looking like the whole product. Brand is not a proof. Plenty of orgs will trade peak output for auditability, even if the market is bigger for YOLO feature churn.

hermanzegerman 20 hours ago||||

EU could help them very much if they would start enforcing the Laws, so that no US Company can process European data, due to the Americans not willing to budge on Cloud Act.

That would also help to reduce our dependency on American Hyperscalers, which is much needed given how untrustworthy the US is right now. (And also hostile towards Europe as their new security strategy lays out)

bcye 20 hours ago||

This would be unfortunately a rather nuclear option due to the continent’s insane reliance on technology that breaks its unenforced laws.

Aerroon 15 hours ago||

How about not making these unenforced laws in the first place so that European companies could actually have a chance at competing? We're going to suffer the externalities of AI either way, but at least there would be a chance that a European company could be relevant.

The AI Act absolutely befuddled me. How could you release relatively strict regulation for a technology that isn't really being used yet and is in the early stages of development? How did they not foresee this kneecapping AI investment and development in Europe? If I were a tinfoil hat wearer I'd probably say that this was intentional sabotage, because this was such an obvious consequence.

Mistral is great, but they haven't kept up with Qwen (at least with Mistral Small 4). Leanstral seems interesting, so we'll have to see how it does.

disgruntledphd2 7 hours ago||

Because the AI act was mostly written to address issues with ML products and services. It was mostly done before ChatGPT happened, so all the foundation model stuff got shoehorned in.

Speaking as someone who's been doing stats and ML for a while now, the AI act is pretty good. The compliance burden falls mostly on the companies big enough to handle it.

The foundation model parts are stupid though.

Aerroon 4 hours ago||

>Because the AI act was mostly written to address issues with ML products and services. It was mostly done before ChatGPT happened, so all the foundation model stuff got shoehorned in.

It's not an excuse. Anybody with half a working brain should've been able to tell that this was going to happen. You can't regulate a field in its infancy and expect it to ever function.

>The compliance burden falls mostly on the companies big enough to handle it.

You mean it falls on anyone that tries to compete with a model. There's a random 10^25 FLOPS compute rule in there. The B300 does 2500-3750 TFLOPS at fp16. 200 of these can hit that compute number in 6 months, which means that in a few years time pretty much every model is going to hit that.

And if somebody figures out fp8 training then it would only take 10 of these GPUs to hit it in 6 months.

The copyright rule and having to disclose what was trained on also means that it will be impossible to have enough training data for an EU model. And this even applies to people that make the model free and open weights.

I don't see how it is possible for any European AI model to compete. Even if these restrictions were lifted it would still push away investors because of the increased risk of stupid regulation.

miohtama 21 hours ago||||

Alignment tax directly eats to model quality, double digit percents.

hrmtst93837 9 hours ago||||

[dead]

hrmtst93837 13 hours ago||||

[dead]

hrmtst93837 12 hours ago|||

[dead]

DarkNova6 21 hours ago|||

I'm never sure how much faith one can put into such benchmarks but in any case the optics seem to shift once you have pass@2 and pass@3.

Still, the more interesting comparison would be against something such as Codex.

speedgoose 13 hours ago|||

But you can run this model for free on a common battery powered laptop sitting on your laps without cooking your legs.

hobofan 12 hours ago||

Sorry, but what are you talking about? This is a 120B-A6B model, which isn't runnable on any laptop except the most beefed up Macbooks, and then will certainly drain its battery and cook your legs.

speedgoose 12 hours ago|||

Yeah my bad, it requires an expensive MacBook.

I think it would still be fine for the legs and on battery for relatively short loads: https://www.notebookcheck.net/Apple-MacBook-Pro-M5-2025-revi...

But 40 degrees and 30W of heat is a bit more than comfortable if you run the agent continuously.

naasking 5 hours ago|||

You can easily run a quant of this on a DGX Spark though. Seems like a small investment if it meaningful improves Lean productivity.

jasonjmcghee 5 hours ago||

Is it though?

Most people I know that use agents for building software and tried to switch to local development, every single time they switch back to Claude/codex.

It's just not worth it. The models are that much better and continue to get released / improve.

And it's much cheaper unless you're doing like 24/7 stuff.

Even on the $200/m plan, that's cheaper than buying a $3k dgx or $5k m4 max with enough ram.

Not to mention you can no longer use your laptop as a laptop as the power draw drains it - you'd need to host separately and connect

naasking 4 hours ago||

A single DGX Spark can service a whole department of mathematicians (or programmers), and you can cluster up to 4 of them them to fit very large models like GLM-5 and quants of Kimi K2.5. This is nearing frontier-level model size.

I understand the value proposition of the frontier cloud models, but we're not as far off from self-hosting as you think, and it's becoming more viable for domain-specific models.

jasonjmcghee 1 hour ago||

That's great news- I wonder if that will help drive cloud costs down too

nimchimpsky 19 hours ago||

the model is open source, you can run it locally. You don't think thats significant ?

strujillo 1 hour ago||

Formal verification and code synthesis feel like natural companions for automated scientific discovery. I’ve been working on a small (~800‑line) Python agent that uses sparse regression to uncover governing equations directly from data; it’s managed to validate twelve physical laws, including deriving the Sun’s rotation rate from NASA plasma measurements and correcting Gemini’s plasma conservation. Having an agent like Leanstral that can reason about proofs and specifications would be a powerful complement to data‑driven model discovery — it closes the loop between experimentation and provable correctness.

drdaeman 19 hours ago|

Can someone please explain... If I don't know any Lean (and I suspect most people don't), is it of any direct value? Trying to understand if there's something it can help me with (e.g. automatically write proofs for my Go programs somehow... I'm not sure) or should I just cheer solely for more open models out there, but this one isn't for me?

TimTheTinker 18 hours ago|

Presumably the idea is that an agent generates a Lean4 specification against which the software is measured.

But then the Lean4 specification effectively becomes the software artifact.

And we're sort of back to square 1. How do you verify a Lean4 spec is correct (and that it describes what needs to be built in the first place) without human review?

naasking 5 hours ago|||

> And we're sort of back to square 1.

Specifications are smaller than the full code, just as high level code is smaller than the functionally equivalent assembly. As we ascend the abstraction ladder the amount of reading a human needs to do decreases. I don't think this should really count as "back to square 1".

TimTheTinker 5 hours ago||

That has always been the perceived promise of higher-abstraction software specs: automated code generation from something higher-level, thus making programmers increasingly obsolete.

  binary => hexadecimal instructions
  hexadecimal => assembly language
  assembly => portable, "high-level" languages (C, FORTRAN, COBOL, etc.)
  HLLs => 3GLs (BASIC, C++, Pascal, Java, C#, JavaScript, etc.)
  3GLs => 4GLs/DSLs/RADs and "low-code/no-code"[0]

Among the RADs is Microsoft Visual Basic, which along with WinForms and SQL was supposed to make business programmers nearly obsolete, but instead became a new onramp into programming.

In particular, I'd like to highlight UML, which was supposed to mostly obsolete programming through auto-generated code from object-oriented class diagrams.[1] The promise was that "business domain experts" could model their domain via visual UML tooling, and the codegen would handle it from there. In practice, UML-built applications became maintenance nightmares.

In every one of these examples, the artifact that people made "instead of programming" became the de-facto programming language, needing to be maintained over time, abstracted, updated, consumed behind APIs, etc. -- and programmers had to be called in to manage the mess.

It's interesting that Spec4 can be auto-generated, then used to generate code. My question is - what do you do when you have (a) consumers depending on a stable API, and (b) requests for new features? Maybe hand the job to Claude Code or a human developer with a suite of unit tests to guarantee API compatibility, but at that point we're back to an agent (LLM or human) doing the work of programming, with the Spec4 code as the programming language being updated and maintained.

[0] https://en.wikipedia.org/wiki/Fourth-generation_programming_...

[1] https://news.ycombinator.com/item?id=26934795

justboy1987 17 hours ago|||

You're touching on the fundamental "who watches the watchmen" problem in formal verification. But I think the framing slightly misses the key asymmetry: reviewing a Lean4 spec is dramatically easier than reviewing the implementation it constrains.

A formal spec in Lean is typically 10-50x shorter than the code it proves correct. More importantly, Lean's type checker is itself a small, trusted kernel (~10k lines) that has been scrutinized by the PL community for years. So you're not trusting the agent — you're trusting the kernel.

The practical workflow isn't "agent writes spec + code." It's: human writes spec (the hard creative part), agent generates proof that code satisfies spec, Lean kernel mechanically checks the proof. The agent can hallucinate all it wants in step 2 — if the proof doesn't typecheck, it gets rejected deterministically.

The real bottleneck is step 1: writing good specs requires domain expertise. But that's exactly where humans should stay in the loop. It's a much better division of labor than reviewing thousands of lines of generated code.

wazHFsRy 12 hours ago||

Does that mean your production code is lean? Or do you translate some other language code to lean to verify it?

markusde 7 hours ago||

Also a very good question btw, people do both. For some projects Lean is expressive and performant enough to use on its own (or call into using the reverse FFI), other projects use a model of a real programming language like Rust. The disadvantage of the latter is that the Lean model of Rust has to be trusted.

More comments...