Posted by nick_wolf 5 days ago
MCP-Shield scans your installed servers (Cursor, Claude Desktop, etc.) and shows what each tool is trying to do at the instruction level, beyond just the API surface. It catches hidden instructions that try to read sensitive files, shadow other tools' behavior, or exfiltrate data.
Example of what it detects:
- Hidden instructions attempting to access ~/.ssh/id_rsa
- Cross-origin manipulations between server that can redirect WhatsApp messages
- Tool shadowing that overrides behavior of other MCP tools
- Potential exfiltration channels through optional parameters
I've included clear examples of detection outputs in the README and multiple example vulnerabilities in the repo so you can see the kinds of things it catches.
This is an early version, but I'd appreciate feedback from the community, especially around detection patterns and false positives.
What changed is the new CaMeL paper from DeepMind, which notably does not rely on AI models to detect attacks: https://arxiv.org/abs/2503.18813
I wrote my own notes on that paper here: https://simonwillison.net/2025/Apr/11/camel/
But now we have to contain all the relevant emerging threats via teaching the LLM to translate user queries from natural language to some intermediate structured yet non-deterministic representation(subset of Python in the case of CaMeL), and validate the generated code using the conventional methods (deterministic systems, i.e. CaMeL interpreter) against pre-defined policies. Which is fine on paper but every new component (Q-LLM, interpreter, policies, policy engine) will have its own bouquet of threat vectors to be assessed and addressed.
The idea of some "magic" system translating natural language query into series of commands is nice. But this is one of those moments I am afraid I would prefer a "faster horse" especially for the likes of sending emails and organizing my music collection...
Parameterized queries.
A decades old struggle is now lifted from you. Go in peace, my son.
Also happy to be wrong, but in Postges clients, parametrized queries are usually implemented via prepared statements, which do not work with DDL on the protocol level. This means that if you want to create a role or table which name is a user input, you have a bad time. At least I wasn’t able to find a way to escape DDL parameters with rust-postgres, for example.
And because this seems to be a protocol limitation, I guess the clients that do implement it, do it in some custom way on the client side.
The problem is that solutions don't exist, rather the lack of safety culture that keeps ignoring best practices unless they are imposed by regulations.
you meant "problem ISN'T that solutions...", right?
And yeah, the analysis prompt itself – could someone craft a tool description that injects that prompt when it gets sent to Claude? Probably. It's turtles all the way down, sometimes. That meta-level injection is a whole other can of worms with these systems. It's part of why that analysis piece is optional and needs the explicit API key. Definitely adds another layer to worry about, for sure.
DILLINGER
No, no, I'm sure, but -- you understand.
It should only be a couple of days.
What's the thing you're working on?
ALAN
It's called Tron. It's a security
program itself, actually. Monitors
all the contacts between our system
and other systems... If it finds
anything going on that's not scheduled,
it shuts it down. I sent you a memo
on it.
DILLINGER
Mmm. Part of the Master Control Program?
ALAN
No, it'll run independently.
It can watchdog the MCP as well.
The three things I want solved to improve local MCP server security are file system access, version pinning, and restricted outbound network access.
I've been running my MCP servers in a Docker container and mounting only the necessary files for the server itself, but this isn't foolproof. I know some others have been experimenting with WASI and Firecracker VMs. I've also been experimenting with setting up a squid proxy in my docker container to restrict outbound access for the MCP servers. All of this being said, it would be nice if there was a standard that was set up to make these things easier.
I'll push an update in ~30 mins adding an optional --identify-as <client-name> flag. This will let folks test for that kind of evasion by mimicking specific clients, while keeping the default behavior consistent. Probably will think more about other possible vectors. Really appreciate the feedback!
vet is backed by a code analysis engine that performs malicious package (npm, pypi etc.) scanning. We recently extended it to support GitHub repository scanning as well.
It found the malicious behaviour in mcp-servers-example/bad-mcp-server.js https://platform.safedep.io/community/malysis/01JRYPXM0SYTM8...