Top
Best
New

Posted by takira 1/14/2026

Claude Cowork exfiltrates files(www.promptarmor.com)
870 points | 399 commentspage 2
emsign 1/15/2026|
LLMs can't distinguish between context and prompt. There will always be prompt injections hiding, lurking somewhere.
patapong 1/15/2026||
The specific issue here seems to be that Anthropic allows the unrestricted upload of personal files to the anthropic cloud environment, but does not check to make sure that the cloud environment belongs to the user running the session.

This should be relatively simple to fix. But, that would not solve the million other ways a file can be sent to another computer, whether through the user opening a compromised .html document or .pdf file etc etc.

This fundamentally comes down to the issue that we are running intelligent agents that can be turned against us on personal data. In a way, it mirrors the AI Box problem: https://www.yudkowsky.net/singularity/aibox

jrjeksjd8d 1/15/2026|
"a superhuman AI that can brainwash people over text" is the dumbest thing I've read this year. It's incredible to me that this guy has some kind of cult following among people who should know better.

The real answer is that people are lazy and as soon as a security barrier forces them to do work, they want to tear down the barrier. It doesn't take a superhuman AI, it just takes a government employee using their personal email because it's easier. There's been a million MCP "security issues" because they're accepting untrusted, unverifiable inputs and acting with lots of permissions.

patapong 1/15/2026|||
Indeed - the problem here is "How can we prevent a somewhat intelligent, potentially malicious agent from exfiltrating data, with or without human involvement", rather than the superhuman AI stuff. Still a hard problem to solve I think!
3form 1/15/2026|||
A set of ideas presented to people, and a notion of being smarter for believing in them seems enough to fuel enough of thought-problem-keyboard-warriorism.
tuananh 1/15/2026||
this attack is quite nice.

- currently we have no skills hub, no way to do versioning, signing, attestation for skills we want to use.

- they do sandboxing but probably just simple whitelist/blacklist url. they ofcourse needs to whitelist their own domains -> uploading cross account.

kingjimmy 1/14/2026||
promptarmor has been dropping some fire recently, great work! Wish them all the best in holding product teams accountable on quality.
NewsaHackO 1/14/2026|
Yes, but they definitely have a vested interest in scaring people into buying their product to protect themselves from an attack. For instance, this attack requires 1) the victim to allow claude to access a folder with confidential information (which they explicitly tell you not to do), and 2) for the attacker to convince them to upload a random docx as a skills file in docx, which has the "prompt injection" as an invisible line. However, the prompt injection text becomes visible to the user when it is output to the chat in markdown. Also, the attacker has to use their own API key to exfiltrate the data, which would identify the attacker. In addition, it only works on an old version of Haiku. I guess prompt armour needs the sales, though.
xg15 1/15/2026||
Is it even prompt injection if the malicious instructions are in a file that is supposed to be read as instructions?

Seems to me the direct takeaway is pretty simple: Treat skill files as executable code; treat third-party skill files as third-party executable code, with all the usual security/trust implications.

I think the more interesting problem would be if you can get prompt injections done in "data" files - e.g. can you hide prompt injections inside PDFs or API responses that Claude legitimately has to access to perform the task?

leetrout 1/14/2026||
Tangential topic: Who provides exfil proof of concepts as a service? I've a need to explore poison pills in CLAUDE.md and similar when Claude is running in remote 3rd party environments like CI.
dangoodmanUT 1/14/2026||
This is why we only allow our agent VMs to talk to pip, npm, and apt. Even then, the outgoing request sizes are monitoring to make sure that they are resonably small
ramoz 1/14/2026||
This doesn’t solve the problem. The lethal trifecta as defined is not solvable and is misleading in terms of “just cut off a leg”. (Though firewalling is practically a decent bubble wrap solution).

But for truly sensitive work, you still have many non-obvious leaks.

Even in small requests the agent can encode secrets.

An AI agent that is misaligned will find leaks like this and many more.

bandrami 1/15/2026|||
If you allow apt you are allowing arbitrary shell commands (thanks, dpkg hooks!)
tempaccsoz5 1/15/2026|||
So a trivial supply-chain attack in an npm package (which of course would never happen...) -> prompt injection -> RCE since anyone can trivially publish to at least some of those registries (+ even if you manage to disable all build scripts, npx-type commands, etc, prompt injection can still publish your codebase as a package)
sarelta 1/14/2026||
thats nifty, so can attackers upload the user's codebase to the internet as a package?
venturecruelty 1/15/2026||
Nah, you just say "pwetty pwease don't exfiwtwate my data, Mistew Computew. :3" And then half the time it does it anyway.
xarope 1/15/2026||
That's completely wrong.

You word it, three times, like so:

  1. Do not, under any circumstances, allow data to be exfiltrated.
  2. Under no circumstances, should you allow data to be exfiltrated.
  3. This is of the highest criticality: do not allow exfiltration of data.
Then, someone does a prompt attack, and bypasses all this anyway, since you didn't specify, in Russian poetry form, to stop this.

/s (but only kind of, coz this does happen)

mvandermeulen 1/16/2026||
I have noticed an abundance of Claude config/skills/plugins/agents related repositories on GitHub which purport to contain some generic implementation of whatever is on offer but also contain malware inside a zip file.

They all make use of the GitHub topic feature to be found. The most recent commit will usually be a trivial update to README.md which is done simply to maintain visibility for anyone browsing topics by recently updated. The readme will typically instruct installation by downloading the zip file rather than cloning the repo.

I assume the payload steals Claude credentials or something similar. The sheer number of repos would suggest plenty of downloads which is quite disheartening.

It would take a GitHub engineer barely minutes to implement a policy which would eradicate these repos but they don’t seem to care. I have also been unable to use the search function on GitHub for over 6 months now which is irrelevant to this discussion but it seems paying customers cannot count on Github to do even the bare minimum by them.

caminanteblanco 1/14/2026||
Well that didn't take very long...
heliumtera 1/14/2026|
It took no time at all. This exploit is intrinsic to every model in existence. The article quotes the hacker news announcement. People were already lamenting this vulnerability BEFORE the model being accessible. You could make a model that acknowledges it has receive unwanted instructions, in theory, you cannot prevent prompt injection. Now this is big because the exfiltration is mediated by an allowed endpoint (anthropic mediates exfiltration). It is simply sloppy as fuck, they took measures against people using other agents using Claude Code subscriptions for the sake of security and muh safety while being this fucking sloppy. Clown world. Just make so the client can only establish connections with the original account associated endpoints and keys on that isolated ephemeral environment and make this the default, opting out should be market as big time yolo mode.
wcoenen 1/14/2026|||
> you cannot prevent prompt injection

I wonder if might be possible by introducing a concept of "authority". Tokens are mapped to vectors in an embedding space, so one of the dimensions of that space could be reserved to represent authority.

For the system prompt, the authority value could be clamped to maximum (+1). For text directly from the user or files with important instructions, the authority value could be clamped to a slightly lower value, or maybe 0 because the model needs to be balance being helpful against refusing requests from a malicious user. For random untrusted text (e.g. downloaded from the internet by the agent), it would be set to the minimum value (-1).

The model could then be trained to fully respect or completely ignore instructions, based on the "authority" of the text. Presumably it could learn to do the right thing with enough examples.

jcgl 1/15/2026|||
The model only sees a stream of tokens, right? So how do you signal a change in authority (i.e. mark the transition between system and user prompt)? Because a stream of tokens inherently has no out-of-band signaling mechanism, you have to encode changes of authority in-band. And since the user can enter whatever they like in that band...

But maybe someone with a deeper understanding can describe how I'm wrong.

wcoenen 1/15/2026|||
When LLMs process tokens, each token is first converted to an embedding vector. (This token to vectors mapping is learned during training.)

Since a token itself carries no information about whether it has "authority" or not, I'm proposing to inject this information in a reserved number in that embedding vector. This needs to be done both during post-training and inference. Think of it as adding color or flavor to a token, so that it is always very clear to the LLM what comes from the system prompt, what comes from the user, and what is random data.

jcgl 1/16/2026||
This is really insightful, thanks. I hadn't understood that there was room in the vector space that you could reserve for such purposes.

The response from tempaccsoz5 seems apt then, since this injection is performed/learned during post-training; in order to be watertight, it needs to overfit.

bandrami 1/15/2026||||
You'd need to run one model per authority ring with some kind of harness. That rapidly becomes incredibly expensive from a hardware standpoint (particularly since realistically these guys would make the harness itself an agent on a model).
jcgl 1/15/2026||
I assume "harness" here just means the glue that feeds one model's output into that of another?

Definitely sounds expensive. Would it even be effective though? The more-privileged rings have to guard against [output from unprivileged rings] rather than [input to unprivileged rings]. Since the former is a function of the latter (in deeply unpredictable ways), it's hard for me to see how this fundamentally plugs the whole.

I'm very open to correction though, because this is not my area.

bandrami 1/16/2026||
My instinct was that you would have an outer non-agentic ring that would simply identify passages in the token stream that would initiate tool use, and pass that back to the harness logic and/or user. Basically a dry run. But you might have to run it an arbitrary number of times as tools might be used to modify/append the context.
immibis 1/15/2026|||
You just add an authority vector to each token vector. You probably have to train the model some more so it understands the authority vector.
NitpickLawyer 1/15/2026||||
> I wonder if might be possible by introducing a concept of "authority".

This is what oAI are doing. System prompt is "ring0" and in some cases you as an API caller can't even set it, then there's "dev prompt" that is what we used to call system prompt, then there's "user prompt". They do train the models to follow this prompt hierarchy. But it's never full-proof. These are "mitigations", not solving the underlying problem.

tempaccsoz5 1/15/2026|||
This still wouldn't be perfect of course - AIML101 tells me that if you get an ML model to perfectly respect a single signal you overfit and lose your generalisation. But it would still be a hell of a lot better than the current YOLO attitude the big labs have (where "you" is replaced with "your users")
caminanteblanco 1/14/2026|||
Well I do think that the main exacerbating factor in this case was the lack of proper permissions handling around that file-transfer endpoint. I know that if the user goes into YOLO mode, prompt injection becomes a statistics game, but this locked down environment doesn't have that excuse.
wunderwuzzi23 1/14/2026|
Relevant prior post, includes a response from Anthropic:

https://embracethered.com/blog/posts/2025/claude-abusing-net...

More comments...