Posted by emilburzo 8 hours ago
> a malicious AI trying to escape the VM (VM escape vulnerabilities exist, but they’re rare and require deliberate exploitation)
No VM escape vulns necessary. A malicious AI could just add arbitrary code to your Vagrantfile and get host access the first time you run a vagrant command.
If you're only worried about mistakes, Claude could decide to fix/improve something by adding a commit hook. If that contains a mistake, the mistake gets executed on your host the first time you git commit/push.
(Yes, it's unpleasantly difficult to truly isolate dev environments without inconveniencing yourself.)
I basically do something like "take snapshot -> run tiny vm -> let agent do what it does -> take snapshot -> look at diff" for each change, restarting if it doesn't give me what I wanted, or I misdirected it somehow. But there is no automatic sync of files, that'd defeat the entire point of putting it into a VM in the first place, wouldn't it?
> A malicious AI could just add arbitrary code to your Vagrantfile
> [...]
> Claude could decide to fix/improve something by adding a commit hook.
You can fix this by confining Claude to a subdirectory (with Docker volume mounts, for example): repository/
├── sandbox <--- Claude lives in here
│ └── main.py <--- Claude can edit this
└── .git <--- Claude can not touch thishttps://github.com/EstebanForge/construct-cli
For Linux, WSL also of course, and macOS.
Any coding agent (from the supported ones, our you can install your own).
Podman, Docker or even Apple's container.
In case anyone is interested.
Shannot[0] captures intent before execution. Scripts run in a PyPy sandbox that intercepts all system calls - commands and file writes get logged but don't happen. You review in a TUI, approve what's safe, then it actually executes.
The trade-off vs VMs: VMs let Claude do anything in isolation, Shannot lets Claude propose changes to your real system with human approval. Different use cases - VMs for agentic coding, whereas this is for "fix my server" tasks where you want the changes applied but reviewed first.
There's MCP integration for Claude, remote execution via SSH, checkpoint/rollback for undoing mistakes.
Feedback greatly appreciated!
The problem with this approach (unless I'm misunderstanding - entirely possible!) is that it still blocks the agent on the first need for approval.
What I think most folks actually want (or at least what I want) is to allow the agent to explore a space, including exploring possible dead ends that require permissions/access, without stopping until the task is finished.
So if the agent is trying to "fix a server" it might suggest installing or removing a package. That suggestion blocks future progress.
Until a human comes in and says "yes - do it" or "no - try X instead" it will sit there doing nothing.
If instead it can just proceed, observe that the package doesn't resolve the issue, and continue exploring other solutions immediately, you save a whole lot of time.
So the agent can freely explore, check logs, list files, inspect service status. It only blocks when it wants to change something (install a package, write a config, restart a service).
Also worth noting: Shannot operates on entire scripts, not individual commands. The agent writes a complete program, the sandbox captures everything it wants to do during a dry run, then you review the whole batch at once. Claude Code's built-in controls interrupt at each command whereas Shannot interrupts once per script with a full picture of intent.
That said, you're pointing at a real limitation: if the fix genuinely requires a write to test a hypothesis, you're back to blocking. The agent can't speculatively install a package, observe it didn't help, and roll back autonomously.
For that use case, the OP's VM approach is probably better. Shannot is more suited to cases where you want changes applied to the real system but reviewed first.
Definitely food for thought though. A combined approach might be the right answer. VM/scratch space where the agent can freely test hypotheses, then human-in-the-loop to apply those conclusions to production systems.
- Spin up a vm with an image of the real target device.
- Let the agent act freely in the vm until the task is resolved, but capture and record all dangerous actions
- Review & replay those actions on the real machine
My issue is that for any real task, an agent without feedback mechanisms is essentially worthless. You have to have some sort of structured "this is what success looks like, here's how you check" target for it. A human in the loop can act as that feedback, which is in line with how claude code works by default (you define success by approving actions and giving feedback on status), but requiring a human in the loop also slows it down a bunch - you can end up ping-ponging between terminals trying to approve actions and review the current status.
I started running Claude Code in a devcontainer with limited file access (repo only) and limited outbound network access (allowlist only) for that reason.
This weekend, I generalized this to work with docker compose. Next up is support for additional agents (Codex, OpenCode, etc). After that, I'd like to force all network access through a proxy running on the host for greater control and logging (currently it uses iptables rules).
This workflow has been working well for me so far.
Still fresh, so may be rough around the edges, but check it out: https://github.com/mattolson/agent-sandbox
You can also use Lima, a lightweight VM control plane, as it natively works with qemu and Virtualization.Framework. (I think Vagrant does too; it's been a minute since I've tried.) This has traditionally been used for running container engines, but it's great for narrowly-scoped use cases like this.
Just need to be careful about how the directory Claude is working with is shared. I copy my Git repo to a container volume to use with Claude (DinD is an issue unless you do something like what Kind did) and rsync my changes back and verify before pushing. This way, I don't have to worry if Claude decides to rewind the reflog or something.
> So in that sense it seems that AI is actually more aligned with my goals than a potential employee.
It may seem like that but I recommend you reading up on different kinda of misalignment in AI safety.
(GPT recently changed its attitude on this subject too which is very interesting.)
The most interesting part is that you will be given the option to downgrade the conversation to an older model. Implying that there was a step change in capability on this front in recent months.
sandbox-run npx @anthropic-ai/claude-code
This runs npx (...) transparently inside a Bubblewrap sandbox, exposing only the $PWD. Contrary to many other solutions, it is a few lines of pure POSIX shell.What other dev OSs are there?
> once privileges are dropped [...] it doesn't appear to be possible to reinstate them
I don't understand. If unprivileged code could easily re-elevate itself, privilege dropping would be meaningless ... If you need to communicate with the outside, you can do so via sockets (such as the bind-mounted X11 socket in one of the readme Examples).
Consider one wanted to replicate the human-approval workflow that most agent harnesses offer. It's not obvious to me how that could be accomplished by dropping privileges without an escape hatch.
And I think that what CC's /sandbox uses on a Mac
There's a load of ways that a repository owner can get an LLM agent to execute code on user's machines so not a good plan to let them run on your main laptop/desktop.
Personally my approach has been put all my agents in a dedicated VM and then provide them a scratch test server with nothing on it, when they need to do something that requires bare metal.
I currently apply the same strategy we use in case of the senior developer or CTO going off the deep end. Snapshots of VMs, PITR for databases and file shares, locked down master branches, etc.
I wouldn't spend a bunch of energy inventing an entirely new kind of prison for these agents. I would focus on the same mitigation strategies that could address a malicious human developer. Virtual box on a sensitive host another human is using is not how you'd go about it. Giving the developer a cheap cloud VM or physical host they can completely own is more typical. Locking down at the network is one of the simplest and most effective methods.