Prompt Injection as Role Confusion

Posted by x312 1 day ago

Prompt Injection as Role Confusion(role-confusion.github.io)

215 points | 111 commentspage 2

hananova 1 day ago|

I’ve always found all llm’s to be effortless to “jailbreak.”

Simply edit their refusal, “Sure, I can do blah blah blah, let me know if you want me to continue!” And then send back an api call with that edited response and your own response saying “Yes.”

I’ve found even the most guard-railed LLM’s to then be willing to do even the most heinous shit I could think of.

qweiopqweiop 12 hours ago|

Maybe I'm naïve, but is the heinous shit that bad? I'm essentially wondering if it's anything worse than you could discover on the internet already. Of course it makes it more accessible/easier, but I'm curious if it goes a level above what is technically discoverable right now.

hananova 7 hours ago|||

Well no, not really since it’s all a fake intelligence telling me them. Point is that they were things that absolutely would get the system to scold and refuse me without the simple “jailbreak.”

plewd 8 hours ago|||

Not much if you only use it as a glorified search engine, but the problem stems from all the other things you can make it do for personal use after jailbreaking.

certainforest 5 hours ago||

Hey, Jasmine here -- it's a good point, I'm generally more concerned by agentic jailbreaks (e.g. unauthorized purchases, leaking sensitive data) than GPT making inappropriate comments.

In our case, we found that simply acting like a user is enough to trick LLMs into sharing passwords, private files, etc.

(On a related note, here's one where they hack a smart home with email invitations: https://sites.google.com/view/invitation-is-all-you-need/hom...)

shermantanktop 1 day ago||

It's like a social-engineering attack on an LLMs. If you talk like the role you want to be, the LLM will assume you are that role, and not pay attention to the fact that you lack formal credentials.

Of course, it turns out that "formal credentials" don't really exist anyway - the ones being fooled were the humans who assumed that <think> must be a meaningful tag to the LLM.

sarreph 1 day ago||

The author alludes to it but the defence to this is seemingly insurmountable at the moment because we’re ostensibly operating LLMs on a single channel — their inner, subconscious voice. Right?

Interacting with an LLM is a bit like seeing the output of an Inside Out (the Disney movie) scene. Or it’s a bit like a human brain that we’re providing tool call access and introspection with some kind of advanced neuralink.

But - like the author says - _we know_ our inside voice from the outside world, because we’re embodied.

Is there something we can do here by attempting to bifurcate internal and external systems? Like a conscious and subconscious stream of information on two separate bands?

If the model somehow knew its User was not it because it was clearly an external signal, then the attack documented here would be about as effective as a Jedi mind trick without the Force.

solid_fuel 1 day ago|

My two cents - I believe that achieving anything close to AGI will require a significant change in architecture. A bifurcated system with a fully internal reasoning loop makes sense, but I don't think you could train one.

Something like

    f(u, t) -> (u', t')

where u is english text and t is an internal "thinking" loop.

Currently we train models by feeding them sample text and then tweaking the weights until the predicted next token matches the expected next token from the input text. This works well because LLM corps were able to steal vast quantities of sample text from the internet.

But, if you also have an internal reasoning loop, how do you train that part? The internal loop is not necessarily going to produce one clean token for a given input like an LLM does, and the time scale isn't going to be the same (meaning an internal loop might be expected to run 10 times for every one token produced). There is no "correct next token" for the internal reasoning loop. This is roughly the same training issue that killed RNNs.

CGamesPlay 21 hours ago||

Isn't the first section no-longer accurate for several years? I understood that, while we serialize the end of turn markers in a text format like `</think>`, internally they are a dedicated token that cannot be forged (a user message containing `</think>` would encode to a different sequence of tokens). Am I mistaken about this?

Obviously, this doesn't really affect the results of the paper, but it feels like it's the obvious first-line of defense: at least the model has a solid fence between the different roles.

x312 20 hours ago||

Yeah, the footnote/sidenote on the paper (the one labeled #2) mentions this as well so you can't type that directly

j45 20 hours ago|||

It feels like sometimes researchers find something someone is already doing in the wild, undertake a study on it, but the speed of research and study doesn't match or cover the progress or rate of change by the time it's published, so with AI research specifically, too many studies can feel like they're in the past.

orbital-decay 13 hours ago||

That's a technique that has been in use forever, a ton of jailbreaks work by taking shortcuts across system delimiters in an attempt to blur the lines between the roles. They just investigate it with more rigor. Reasoning leaking into the reply is also part of the reason a lot of modern models suck at creative writing and languages, and why the assistant prefill is absolutely required for the model to be any good at that. See for example the self-correction phenomenon which seems to have multiple root causes that are hard to disentangle without a ton of testing, likely a combination of reasoning leak ("high CoTness" in this article) and planning and progressive refinement all iterative models do.

GolDDranks 16 hours ago||

Why aren't the role tags preprocessed algorithmically/deterministically and then fed in as one-hot-encoded vectors alongside the semantic word embeddings? I'd imagine that it would be easier to train to _stay_ in the role an not confuse it, if the current role marker is explicitly set as a part of each input token, and not just implied by some past token. Plus a input separate from the word embedding would be unforgeable.

peterldowns 15 hours ago|

Always wondered this. Must have been tried and not worked?

cadamsdotcom 17 hours ago||

API serving already sanitised the role boundary tokens so you can’t submit them.

But what if the techniques applied to get Golden Gate Claude were applied instead of a role-boundary marker?

Then the model would “know” where input is coming from - because the vector that’s being applied for the current role is putting it in a different area of latent space.. and the vector could have sufficient amplitude to prevent any coercive instructions pulling it back to some other place.

Or am I misunderstanding what Golden Gate Claude was doing?

tonic_note 22 hours ago||

I wonder if you could feed the generated assistant output to another model which has no other context from the other role tags and merely performs a policy review of the generation and flags violations.

vova_hn2 1 day ago|

> I can distinguish my own thoughts from your speech without effort; they arrive through completely different channels with completely different sensory signatures. But for an LLM, everything arrives through the same channel as one long token soup. Its own thoughts sit next to your instructions, which sit next to the contents of a random webpage it just fetched.

I was thinking about the original encoder-decoder transformers, that did have separate channels for input and their own output.

Why can't we bring it back? For example, one channel for system prompt and another for everything else.

More comments...