Posted by 76SlashDolphin 1 day ago
The idea: instead of connecting an LLM directly to multiple MCP servers, you route them all through a Gateway.
The Gateway:
- Connects to each MCP server and inspects their tools + requirements
- Classifies tools along the "trifecta" axes (private data access, untrusted content, external comms)
- When all three conditions are about to align in a single session, the Gateway blocks the last step and tells the LLM to show a warning instead.
That way, before anything dangerous can happen, the user is nudged to review the situation in a web dashboard.
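To make the blocking step concrete, here's a rough sketch of the kind of session-level check involved. This is illustrative only (hypothetical names and structure, not the actual open-edison code): each tool is tagged with the trifecta axes it can touch, the session accumulates the axes seen so far, and the call that would complete all three is the one that gets intercepted.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a session-level trifecta check; names and structure
# are illustrative, not the actual open-edison implementation.
PRIVATE_DATA = "private_data"        # tool can read private / proprietary data
UNTRUSTED = "untrusted_content"      # tool can ingest publicly modifiable content
EXTERNAL_COMMS = "external_comms"    # tool can write / communicate externally

TRIFECTA = {PRIVATE_DATA, UNTRUSTED, EXTERNAL_COMMS}

@dataclass
class Session:
    # Axes already "touched" by earlier tool calls in this session.
    triggered: set = field(default_factory=set)

def allow_tool_call(session: Session, tool_axes: set) -> bool:
    """Return True if the call may be forwarded, False if it must be blocked."""
    if session.triggered | tool_axes >= TRIFECTA:
        # This call would complete the third leg: block it and have the
        # client show a warning instead of executing the tool.
        return False
    session.triggered |= tool_axes
    return True
```

The first two legs only mark the session state; nothing is blocked until a call would close the loop.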
We'd love for the HN community to try it out: https://github.com/Edison-Watch/open-edison
Any feedback is very welcome - we'll be around in the thread to answer questions.
1. The "lethal trifecta" is also the "productive trifecta" - people want to be able to use LLMs to operate in this space since that's where much of the value is; using private / proprietary data to interact with (do I/O with) the real world.
2. I worry that there will soon be (if not already) a fourth leg to the stool - latent malicious training within the LLMs themselves. I know the AI labs are working on this, but trying to ferret out Manchurian Candidates embedded within LLMs may very well be the greatest security challenge of the next few decades.
Regarding the second point, that is a very interesting topic that we hadn't thought about. It seems our approach would work for that use case too, though. Currently we're defending against the LLM being gullible, and from the gateway's perspective a gullible model and an actively malicious one don't behave all that differently. It's definitely on our radar now, thanks for bringing it up!
But it just seems to me that some of the 'vulnerabilities' are baked in from the beginning, e.g. control and data being in the same channel, which AFAIK isn't solvable. How is it possible to address that at all? Sure, we can do input validation, sanitization, restricted access, and a host of other things, but at the end of the day isn't there still a non-zero chance that something gets exploited and we're just playing whack-a-mole? Not to mention I doubt everyone will define things like "private data" and "untrusted" the same way. uBlock tells me when a link is on one of its lists, but I still go ahead and click anyway.
1. How are you defending against the case of one MCP poisoning your firewall LLM into incorrectly classifying other MCP tools?
2. How would you make sure the LLM shows the warning, given that LLMs are non-deterministic?
3. How clear do you expect MCP specs to be in order for your classification step to be trustworthy? To the best of my knowledge there is no spec that outlines how to "label" a tool along the 3 axes, so you've got another non-deterministic step here. Is "writing to disk" an external comm? It is if that directory is exposed to the web. How would you know?
1. We are assuming that the user has done their due diligence verifying the authenticity of the MCP server, in the same way they need to verify it when adding an MCP server to Claude Code or VS Code. The gateway protects against an attacker exploiting already installed standard MCP servers, not against malicious servers.
2. That's a very good question - while it is indeed non-deterministic, we have not seen a single case of it not showing the message. Sometimes the message gets mangled but it seems like most current LLMs take the MCP output quite seriously since that is their source of truth about the real world. Also, while the message could in theory not be shown, the offending tool call will still be blocked so the worst case is that the user is simply confused.
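For illustration, here's roughly what that worst case looks like in a proxy handler. This is a hypothetical sketch, not our exact implementation, and the result shape is only roughly the MCP tool-result format: the warning rides inside the tool result, but the block itself happens in the gateway before anything is forwarded to the real server, so a dropped warning costs clarity, not safety.

```python
# Hypothetical proxy handler (not open-edison's actual code). The warning rides
# inside the tool result, but the block happens in the gateway itself, before
# anything reaches the downstream MCP server.
WARNING = ("Blocked by the gateway: this tool call would complete the lethal "
           "trifecta. Ask the user to review and approve it in the dashboard.")

def handle_tool_call(call_is_allowed: bool, forward, tool_name: str, arguments: dict):
    if not call_is_allowed:
        # Never forwarded: even if the model mangles or drops the warning text,
        # the dangerous action has not executed.
        return {"isError": True,
                "content": [{"type": "text", "text": WARNING}]}
    return forward(tool_name, arguments)
```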
3. Currently we follow the trifecta very literally: every tool is classified into a subset of {reads private data, writes on behalf of user, reads publicly modifiable data}. We have an LLM classify each tool at MCP server load time and we cache those results, keyed on whatever data the MCP server sends us. If there are any issues with the classification, you can go into the gateway dashboard and modify it however you like. We are planning improvements to the classification down the line, but we think it is currently solid enough, and we would like to get it into users' hands for UX feedback before we add extra functionality.
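As a rough sketch of what that flow can look like (hypothetical names, not the actual gateway code): classification runs once per unique tool definition, results are cached on a hash of whatever the server reports, and dashboard overrides take precedence over the LLM's labels.

```python
import hashlib
import json

# Hypothetical caching scheme, not the actual gateway code: classify each tool
# once per unique definition the server reports, and let dashboard edits win.
AXES = {"reads_private_data", "writes_on_behalf_of_user", "reads_public_data"}

def cache_key(tool: dict) -> str:
    # Key on whatever the MCP server tells us about the tool, so a changed
    # description or schema triggers re-classification.
    blob = json.dumps({"name": tool.get("name"),
                       "description": tool.get("description"),
                       "inputSchema": tool.get("inputSchema")}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def classify_tool(tool: dict, cache: dict, overrides: dict, llm_classify) -> set:
    if tool["name"] in overrides:          # manual edits from the dashboard
        return overrides[tool["name"]]
    key = cache_key(tool)
    if key not in cache:
        # llm_classify stands in for the LLM call made at server load time;
        # it should return a subset of AXES.
        cache[key] = set(llm_classify(tool)) & AXES
    return cache[key]
```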
Sounds like it defeats the point.