Posted by dmpetrov 12 hours ago

Monty: A minimal, secure Python interpreter written in Rust for use by AI (github.com)
197 points | 88 comments
krick 9 hours ago|
I don't quite understand the purpose. Yes, it's clearly stated, but what do you mean by "a reasonable subset of Python code" that "cannot use the standard library"? 99.9% of the Python I write uses the standard library and then some (requests?). What do you expect your LLM agent to write without that? A pseudo-code sorting-algorithm sketch? Why would you even want to run that?
impulser_ 9 hours ago||
They plan to use it for "Code Mode", which means the LLM will use this to run Python code that it writes to call tools, instead of having to load the tools up front into the LLM context window.
DouweM 8 hours ago||
(Pydantic AI lead here) We’re implementing Code Mode in https://github.com/pydantic/pydantic-ai/pull/4153 with support for Monty and abstractions to use other runtimes / sandboxes.

The idea is that in “traditional” LLM tool calling, the entire (MCP) tool result is sent back to the LLM, even if it just needs a few fields, or is going to pass the return value into another tool without needing to see the intermediate value. Every step that depends on results from an earlier step also requires a new LLM turn, limiting parallelism and adding a lot of overhead.

With code mode, the LLM can chain tool calls, pull out specific fields, and run entire algorithms using tools with only the necessary parts of the result (or errors) going back to the LLM.
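For example, instead of several LLM turns, the model might write a single script like this (tool names here are hypothetical stand-ins for host-registered functions, not our actual API):

    # Illustrative code-mode sketch: the model chains tool calls and
    # returns only the aggregate, not every full record.
    def search_orders(customer_id, status):  # stand-in for a real MCP tool
        return [{"id": "o1"}, {"id": "o2"}]

    def get_order(order_id):  # stand-in for a real MCP tool
        return {"id": order_id, "amount": 42.0}

    # What the LLM might write inside the sandbox:
    orders = search_orders(customer_id="c_123", status="open")
    total = sum(get_order(o["id"])["amount"] for o in orders)
    print(total)  # only this small value goes back into the context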

These posts by Cloudflare: https://blog.cloudflare.com/code-mode/ and Anthropic: https://platform.claude.com/docs/en/agents-and-tools/tool-us... explain the concept and its advantages in more detail.

notepad0x90 9 hours ago||
It's Pydantic; they're verifying types and syntax, and those don't require the stdlib. Type hints, syntax checks, likely logical issues, etc. Static type checking is good at that, but LLMs can take it to the next level: analyzing the intended data flow and finding logical bugs, or code with valid syntax and typing but not the intended behavior.

For example, incorrect levels of indentation:

    for key, val in mydict.items():
        if key == "operation":
            logging.info("Executing operation %s", val)
        if val == "drop_table":
            self.drop_table()

This uses valid syntax, and the logging call is from the stdlib, which isn't supported, so I assume it would be ignored or replaced with dummy code? That shouldn't prevent it from analyzing the loop and determining that the second if-block was intended to be under the first; as written, the key check never gates the drop_table call.
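Presumably the intended version, with the second if-block nested under the key check:

    for key, val in mydict.items():
        if key == "operation":
            logging.info("Executing operation %s", val)
            if val == "drop_table":
                self.drop_table()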

In other words, if you don't want to validate proper stdlib/module usage, but proper __Python__ usage, this makes sense. Although I'm speculating on exactly what they're trying to do.

EDIT: I think my speculation was wrong; it looks like they might have developed this to run code for pydantic-ai: https://github.com/pydantic/pydantic-ai. I'll leave the comment above as-is though, since I think it would still be cool to have that capability in pydantic.

kodablah 19 hours ago||
I'm of the mind that it will be better to construct more strict/structured languages for AI use than to reuse existing ones.

My reasoning: 1) AIs can comprehend specs easily, especially simple ones; 2) it's only valuable to "meet developers where they are" if you really need the developers' history/experience, which I'd argue LLMs don't need as much (or only need because the language is so flexible/loose); and 3) languages for humans were designed to allow extreme human subjectivity, which is way too much wiggle room/flexibility (and is why people keep writing projects like these to reduce it).

We should be writing languages that are super-strict by default (e.g. down to literal ordering/alphabetizing of constructs and exact spacing expectations), with only opt-in loose modes for humans and tooling to format. I admit I'm toying with such a lang myself, but in general we can ask more of AI code generation than we can of ourselves.

bityard 9 hours ago|
I think the hard part about that is that you first have to train the model on a BUTT TON of that new language, because that's the only way they "learn" anything. They already know a lot of Python, so telling them to write restricted and sandboxed Python ("you can only call _these_ functions") is a lot easier.
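The naive version of "you can only call _these_ functions" is easy to sketch, though exec() with curated globals is famously not a real security boundary in CPython, which is presumably part of why Monty reimplements the interpreter instead:

    # Toy illustration only: restricting what generated code may call.
    # CPython's exec() with stripped builtins is NOT a security boundary.
    ALLOWED = {"len": len, "sum": sum, "print": print}

    llm_code = 'print(sum([1, 2, 3]))'
    exec(compile(llm_code, "<llm>", "exec"), {"__builtins__": {}, **ALLOWED})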

But I'd be interested to see what you come up with.

kodablah 4 hours ago|||
> that's the only way they "learn" anything

I think skills and other things have shown that a good bit of learning can be done on-demand, assuming good programming fundamentals and no surprising behavior. But agreed, having a large corpus at training time is important.

I have seen that, given a solid spec for a never-before-seen lang, modern models can do a great job of writing code in it. I've done no research on their ability to leverage a large stdlib/ecosystem this way, though.

> But I'd be interested to see what you come up with.

Under active dev at https://github.com/cretz/duralade, super POC level atm (work continues in a branch)

Terretta 5 hours ago|||
> you first have to train the model on a BUTT TON of that new language

Tokenization joke?

globular-toast 2 hours ago||
I don't get what "the complexity of a sandbox" is. You don't have to use Docker. I've been running agents in bubblewrap sandboxes since they first came out.[0]

If the agent can only use the Python interpreter you choose then you could just sandbox regular Python, assuming you trust the agent. But I don't trust any of them because they've probably been vibe coded, so I'll continue to just sandbox the agent using bubblewrap.
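For reference, a minimal sketch of a bubblewrap invocation, driven from Python (paths and flags vary by distro; this is not a hardened profile):

    # Launch Python inside a bubblewrap sandbox: read-only system dirs,
    # fresh /tmp, no network, and everything dies with the parent.
    import subprocess

    cmd = [
        "bwrap",
        "--ro-bind", "/usr", "/usr",
        "--symlink", "usr/bin", "/bin",
        "--symlink", "usr/lib", "/lib",
        "--symlink", "usr/lib64", "/lib64",  # needed on many distros
        "--proc", "/proc",
        "--dev", "/dev",
        "--tmpfs", "/tmp",
        "--unshare-all",
        "--die-with-parent",
        "python3", "-c", "print('hello from the sandbox')",
    ]
    subprocess.run(cmd, check=True)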

[0] https://blog.gpkb.org/posts/ai-agent-sandbox/

wewewedxfgdf 6 hours ago||
If I say my code is secure, does that make it secure?

Or is all Rust code unquestionably secure?

maxbond 3 hours ago|
Of course not, especially when the security model is about access to resources like the file system that are outside the scope of what the Rust compiler can verify. While you won't have a data race in safe Rust, you absolutely can have race conditions accessing the file system in any language.

Their security model, as explained in the README, is in not including the standard library and limiting all access to the environment to functions you write & control. Does that make it secure? I'll leave it to you to evaluate that in the context of your use case/threat model.

It would appear to me that they used Rust primarily because a.) they want to deliver very fast startup times and b.) they want it to be accessible from a variety of host languages (like Python and JavaScript). Those are things Rust does well, though not to the exclusion of C or other GC-free compiled languages. They certainly do not claim that Rust is pixie dust you sprinkle on a project to make it secure. That would clearly be cargo culting.

I find this language war tiring. Don't you? Let's make 2026 the year we all agree to build cool stuff in whatever language we want without this pointless quarreling. (I've personally been saying this for three years at this point.)

Retr0id 9 hours ago||
I'm enjoying watching the battle for where to draw the sandbox boundaries (and I don't have any answers, either!)
ushakov 8 hours ago|
The best answer is probably a layered approach: use this to limit what the generated code can do, and wrap it in a secure VM to prevent leaking out to other tenants.
dmpetrov 12 hours ago||
I like the idea a lot, but it's still unclear from the docs what the hard security boundary is once you start calling LLMs - can it avoid "breaking out" into the host env in practice?
falcor84 10 hours ago||
Wow, a startup latency of 0.06ms
rienbdj 10 hours ago||
If we’re going to have LLMs write the code, why not something more performant? Like pages and pages of Java maybe?
scolvin 9 hours ago|
This is pretty performant for short scripts if you measure the time "from code to rust", which can be as low as 1us.

Of course it's slow for complex numerical calculations, but that's not the primary use case here.

I think the consensus is that LLMs are very good at writing Python and TS/JS, and generally not quite as good at writing other languages, at least in one shot. So there's an advantage to using Python/JS/TS.

catlifeonmars 9 hours ago||
Seems like we should fix the LLMs instead of bending over backwards, no?
redman25 5 hours ago||
They’re good at it because they’ve learned from the existing mountains of python and javascript.
catlifeonmars 2 hours ago|||
I think the next big breakthrough will be cost effective model specialization, maybe through modular models. The monolithic nature of today’s models is a major weakness.
rienbdj 1 hour ago|||
Plenty of Java in the training data too.
spacedatum 7 hours ago||
There is no reason to continue writing Python in 2026. Tell Claude to write Rust a priori. Your future self will thank you.
JoshPurtell 5 hours ago|
I do both and compile times are very unfriendly to AI!
spacedatum 2 hours ago||
Compile times, I can live with. You can run previous models on the gpu while your new model is compiling. Or switch from cargo to bazel if it is that bad.
JoshPurtell 2 hours ago||
What compile times do you work with? I use bazel and it still hurts
spacedatum 2 hours ago||
It is a tradeoff, but I prefer my checks at compile time to runtime. Python can be brittle and silently wrong.
wiseowise 1 hour ago||
What kind of type checking do you think Rust does at runtime?
OutOfHere 10 hours ago|
It is absurd for any user to use a half-baked Python interpreter, especially one that will always lag far behind CPython in its language support. I advise sandboxing CPython instead, using OS features.
simonw 8 hours ago||
How do I sandbox CPython using OS features?

(Genuine question, I've been trying to find reliable, well documented, robust patterns for doing this for years! I need it across macOS and Linux and ideally Windows too. Preferably without having to run anything as root.)

nickpsecurity 5 hours ago|||
It could be difficult. My first thought would be a SELinux policy like this article attempted:

https://danwalsh.livejournal.com/28545.html

One might have different profiles with different permissions. A network service usually wouldn't need your home directory, while a personal utility might not need networking.

Also, that concept could be mixed with subprocess-style sandboxing. The two processes, main and sandboxed, might have different policies. The sandboxed one can only talk to the main process over a specific channel, nothing else. People usually also meter CPU, RAM, etc., as sketched below.
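The metering part, at least, is easy to sketch with POSIX rlimits (Linux; this is resource limiting only, not an isolation boundary by itself):

    # Cap CPU time and address space for a sandboxed child process.
    import resource
    import subprocess

    def limit_resources():
        resource.setrlimit(resource.RLIMIT_CPU, (5, 5))             # 5s of CPU
        resource.setrlimit(resource.RLIMIT_AS, (512 * 2**20,) * 2)  # 512 MiB

    subprocess.run(
        ["python3", "-c", "print('limited child')"],
        preexec_fn=limit_resources,  # runs in the child before exec
        check=True,
    )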

INTEGRITY RTOS had language-specific runtimes, esp Ada and Java, that ran directly on the microkernel. A POSIX app or Linux VM could run side by side with it. Then, some middleware for inter-process communication let them talk to each other.

OutOfHere 6 hours ago|||
Docker and other container runtimes allow it. https://containers.dev/ allows it too.

https://github.com/microsoft/litebox might somehow allow it too if a tool can be built on top of it, but there is no documentation.

simonw 5 hours ago||
Every time I use Docker as a sandbox people warn me to watch out for "container escapes".

I trust Firecracker more because it was built by AWS specifically to sandbox Lambdas, but it doesn't work on macOS and is pretty fiddly to run on Linux.

bityard 9 hours ago|||
Python already has a lot of half-baked (all the way up to nearly-fully-baked) interpreters, what's one more?

https://en.wikipedia.org/wiki/List_of_Python_software#Python...

avaer 10 hours ago||
The repo does make a case for this, namely speed, which does make sense.
sd2k 9 hours ago|||
True, but while CPython does have a reputation for slow startup, completely re-implementing it isn't the only way to work around that - e.g. with eryx [1] I've managed to pre-initialize and snapshot the Wasm module and pre-compile it, to get real CPython starting in ~15ms without compromising on language features. It's doable!

[1] https://github.com/eryx-org/eryx

OutOfHere 6 hours ago|||
Speed is not a feature if there isn't even syntax parity with CPython.
maxbond 2 hours ago||
Not having parity is a property they want, similar to Starlark. They explicitly want a less capable language for sandboxing.

Think of it as a language for their use case with Python's syntax and not a Python implementation. I don't know if it's a good idea or not, I'm just an intrigued onlooker, but I think lifting a familiar syntax is a legitimate strategy for writing DSLs.
