Prompt Injection as Role Confusion

Posted by x312 1 day ago

Prompt Injection as Role Confusion (role-confusion.github.io)

216 points | 113 comments

lemax 21 hours ago

LLM architectures need to fundamentally change, or inference needs to be used in constrained, trusted environments. Nothing surprising here. Filtering and sanitizing, and relying on tags around input strings that can be intercepted and replayed, is child's play security theatre. As long as prompts accept arbitrary user input, nothing is changing here. Non-deterministic security is never going to be acceptable.

jcims 1 day ago

I wonder how much the concept of 'roles' in an LLM is an artifact of the technology vs. a projection of our own human limitations into the training data.

I've recently switched from nearly 30 years in cybersecurity roles into a platform role and I can feel the switch in how I approach problems. They wind up being framed against different priorities and constraints, and it feels like something that's just part of how my mind works.

nphard85 1 day ago

Could the (not so perfect but technically simple) solution be to transform the style of content under each tag to the correct expected style for the tag, via a smaller or purpose-built LLM, before the data stream is fed into the main LLM? Perhaps the two LLMs can be co-trained to keep the overall quality of the output stable while role confusion is minimized.

dweinus 1 day ago

> We show prompt injections are driven by a flaw in how LLMs perceive roles.

LLMs don't "perceive roles", and that is exactly the problem.

NewEntryHN 1 day ago

I'm not sure I understand how important "role perception" is when following instructions from a tool call rather than the user is currently a legitimate use-case (applying steps from documentation, or shell command instructions on stdout, or really anything that can be deduced from the content of a tool call).

oli5679 1 day ago

Would LLMs be more robust to this prompt injection if the tags used in fine-tuning were sanitised from user input?

E.g. map <think> -> THINK, <user> -> USER, <tool> -> TOOL

If they learn something specific about these tags in the chat fine-tuning stage, this might show the LLM that it's user input text, not these tag references.
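A minimal sketch of that mapping, assuming the three tag names above are the ones that matter. The `ROLE_TAGS` tuple, the `neutralize_role_tags` name, and the `END_` prefix for closing tags are made up for illustration; a real chat template defines its own special tokens.

```python
import re

# Hypothetical tag set; a real chat template defines its own special tokens.
ROLE_TAGS = ("think", "user", "tool")

# Matches opening or closing forms such as <think>, </think>, or < User >.
_TAG_RE = re.compile(
    r"<\s*(/?)\s*(" + "|".join(ROLE_TAGS) + r")\s*>", re.IGNORECASE
)


def neutralize_role_tags(text: str) -> str:
    """Replace role-like tags in untrusted text with plain uppercase words."""

    def _swap(match: re.Match) -> str:
        closing, name = match.group(1), match.group(2)
        # Assumption: closing tags get an END_ prefix so they stay distinguishable.
        return ("END_" if closing else "") + name.upper()

    return _TAG_RE.sub(_swap, text)
```

This only rewrites the literal tags; it does nothing about text that merely reads like chain-of-thought or a tool result.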

TheSoftwareGuy 1 day ago

If you read the whole thing, the answer is plainly no:

> It's worth pausing on what this means. LLMs identify roles from an insecure feature (style). This is like identifying a stranger's profession from how they talk and dress rather than by checking their ID.

The LLM is deducing the role of the text not just from the tags, but from the style of writing.

mrob 1 day ago

You can filter out any tokens you like, but the point of the paper is that it's not sufficient, because LLMs often ignore the special label tokens and treat user-injected text as chain-of-thought text merely because it looks like chain-of-thought text, even if it's not labelled as such.

ekns 1 day ago

The real solution is in principle easy: separate data from metadata https://kunnas.com/articles/the-content-is-the-attack-surfac...

zby 1 day ago

If the action is decided by code based on metadata, then what is really the LLM's task? And if you say that only the type of action is decided by code, then this is maybe a mitigation, but the LLM can still do a lot of harm. It is also very limiting: using the LLM to decide the action is very useful. This is different from SQL injection, where the action is determined by the code and the injection really causes a code parsing error.

It might still be the way to go - but calling it 'the real solution' is overselling it.

dweinus 23 hours ago

I believe it is the other way around: the LLM decides the type of action and the input to the action; the code validates the permission to act and the acceptability of the input. But yes, it is very different from SQL injection in that way.
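Roughly, that split could look like the sketch below. Everything here is hypothetical (the `ProposedAction` type, the two example actions, the `VALIDATORS` registry, the `example.com` allow-rule); it only illustrates the model proposing and the code deciding.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set


@dataclass
class ProposedAction:
    """What the model asks for: an action type plus its input."""
    name: str
    args: Dict[str, str]


def _valid_read_file(args: Dict[str, str]) -> bool:
    path = args.get("path", "")
    # Only relative paths with no parent-directory hops.
    return bool(path) and not path.startswith("/") and ".." not in path.split("/")


def _valid_send_email(args: Dict[str, str]) -> bool:
    # Hypothetical rule: internal recipients only, non-empty body.
    return args.get("to", "").endswith("@example.com") and bool(args.get("body"))


# Hypothetical registry: action name -> input validator.
VALIDATORS: Dict[str, Callable[[Dict[str, str]], bool]] = {
    "read_file": _valid_read_file,
    "send_email": _valid_send_email,
}


def authorize(action: ProposedAction, granted: Set[str]) -> bool:
    """Decide in code, not in the model, whether the action may run."""
    if action.name not in granted:      # permission to act
        return False
    validator = VALIDATORS.get(action.name)
    if validator is None:               # unknown actions are denied
        return False
    return validator(action.args)       # acceptability of the input
```

For example, `authorize(ProposedAction("read_file", {"path": "../etc/passwd"}), {"read_file"})` is rejected by the input check even though the permission is granted.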

amluto 1 day ago

I bet that tweaking the positional embedding to add an explicit token role indication plus some careful training to help the model learn to use it would make a big difference.
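As a toy illustration of the idea (not how any particular model does it): add a per-token role vector alongside the token and position vectors, with the role assigned by the serving code rather than by anything in the text. The role set, sizes, and random tables below are all made-up stand-ins for learned parameters.

```python
import random

# Hypothetical role set and toy sizes; real models would learn these tables.
ROLES = {"system": 0, "user": 1, "tool": 2, "assistant": 3}
DIM = 8        # toy embedding width
MAX_LEN = 16   # toy context length
VOCAB = 32     # toy vocabulary size

rng = random.Random(0)


def _table(rows):
    """Random lookup table standing in for learned parameters."""
    return [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(rows)]


token_emb = _table(VOCAB)
pos_emb = _table(MAX_LEN)
role_emb = _table(len(ROLES))


def embed(token_ids, roles):
    """Sum token, position, and role embeddings for each input position."""
    if len(token_ids) != len(roles):
        raise ValueError("need exactly one role per token")
    out = []
    for pos, (tok, role) in enumerate(zip(token_ids, roles)):
        r = ROLES[role]
        out.append(
            [t + p + q for t, p, q in zip(token_emb[tok], pos_emb[pos], role_emb[r])]
        )
    return out
```

The same token at the same position gets a different input vector depending on its role, and since the role comes from outside the token stream, injected text cannot set it. Whether the model learns to rely on that signal is the "careful training" part.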

skybrian 1 day ago

It seems like the role probes they came up with could somehow be used as feedback during training to teach it to use the role tags properly.

certainforest 7 hours ago

Yes, this is something we're thinking about!

Thanks for reading.

deftio 1 day ago

In a word: the asks need to be separated from execution. Labeling or tagging the prompt itself is a dead end.
