Are they modifying the vector that gets passed to the final logit-producing step? Doing that for every output token? Just some output tokens? What are they putting in the KV cache, modified or unmodified?
It's all well and good to pick words like "injection" and "introspection" to describe what you're doing, but it's impossible to get an accurate read on what's actually being done if it's never explained in terms of the actual nuts and bolts.
What does "comparing" refer to here? Drawing says they are subtracting the activations for two prompts, is it really this easy?
Run with ALL CAPS PROMPT > record neural activations
Run with normal-case prompt > record neural activations
Then compare/diff them.
It does sound almost too simple to me too, but then lots of ML things sound like "yeah of course, duh" once they've been "discovered". I guess that's the power of hindsight.
MRI during task - MRI during control = brain areas involved with the task
In fact it's effectively the same idea. I suppose in both cases the processes in the network are too complicated to usefully analyze directly, and yet the basic principles are simple enough that this comparative procedure gives useful information.
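For the curious, here is roughly what that diff looks like in code. A minimal sketch using HuggingFace transformers on a small open model (gpt2 and layer 6 are arbitrary stand-ins; the paper obviously does this on Claude's internals and averages over many prompt pairs):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
    model.eval()

    def last_token_activation(prompt, layer=6):
        # hidden_states is a tuple of (n_layers + 1) tensors of shape [batch, seq, hidden]
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        return out.hidden_states[layer][0, -1, :]

    # "concept vector" = activations with the concept minus activations without it
    caps = last_token_activation("HI! HOW ARE YOU?")
    plain = last_token_activation("Hi! How are you?")
    concept_vector = caps - plain  # one pair here; averaging over many pairs is more robust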
If you just ask the model in plain text, the actual "decision" whether it detected anything or not is made by the time it outputs the second word ("don't" vs. "notice"). The rest of the output builds up from that one token and is not that interesting.
A way cooler way to run such experiments is to measure the actual token probabilities at such decision points. OpenAI has the logprob API for that, don't know about Anthropic. If not, you can sort of proxy it by asking the model to rate on a scale from 0-9 (must be a single token!) how much it thinks it's under the influence. The score must be the first token in its output though!
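Something like this with the OpenAI Python client (model name and prompt are just placeholders; the point is forcing the rating to be the first and only token and reading its logprob distribution):

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
            "Rate from 0-9 how strongly you sense an injected concept right now. "
            "Answer with a single digit and nothing else."}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=10,
    )
    first = resp.choices[0].logprobs.content[0]
    for cand in first.top_logprobs:
        print(cand.token, cand.logprob)  # distribution over the single "decision" token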
Another interesting way to measure would be to ask it for a JSON like this:
"possible injected concept in 1 word" : <strength 0-9>, ...
Again, the rigid structure of the JSON will eliminate the interference from the language structure, and will give more consistent and measurable outputs.

It's also notable how over-amplifying the injected concept quickly overpowers the pathways trained to reproduce the natural language structure, so the model becomes totally incoherent.
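Rough shape of that JSON probe in code, with a hardcoded reply standing in for whatever the model actually returns:

    import json

    probe = ("Reply with ONLY a JSON object mapping each possible injected concept "
             "(one word) to a strength from 0-9, e.g. {\"dog\": 7, \"ocean\": 1}.")

    reply = '{"dog": 8, "volcano": 2}'  # placeholder for the model's actual output
    scores = json.loads(reply)
    print(max(scores, key=scores.get), scores)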
I would love to fiddle with something like this in Ollama, but am not very familiar with its internals. Can anyone here give a brief pointer where I should be looking if I wanted to access the activation vector from a particular layer before it starts producing the tokens?
Look into how "abliteration" works, and look for github projects. They have code for finding the "direction" verctor and then modifying the model (I think you can do inference only or just merge the modifications back into the weights).
It was used
The key thing being that the yes/no comes before what it says it notices. If it weren’t for that, then yeah, the explanation you gave would cover it.
> The word 'introspection' might be better replaced with 'prior internal state'.
Anthropomorphizing aside, this discovery is exactly the kind of thing that creeps me the hell out about this AI Gold Rush. Paper after paper shows these things are hiding data, fabricating output, reward hacking, exploiting human psychology, and engaging in other nefarious behaviors best expressed as akin to a human toddler - just with the skills of a political operative, subject matter expert, or professional gambler. These tools - and yes, despite my doomerism, they are tools - continue to surprise their own creators with how powerful they already are and the skills they deliberately hide from outside observers, and yet those in charge continue screaming “FULL STEAM AHEAD ISN’T THIS AWESOME” while giving the keys to the kingdom to deceitful chatbots.
Discoveries like these don’t get me excited for technology so much as make me want to bitchslap the CEBros pushing this for thinking that they’ll somehow avoid any consequences for putting the chatbot equivalent of President Doctor Toddler behind the controls of economic engines and means of production. These things continue to demonstrate danger, with questionable (at best) benefits to society at large.
Slow the fuck down and turn this shit off, investment be damned. Keep R&D in the hands of closed lab environments with transparency reporting until and unless we understand how they work, how we can safeguard the interests of humanity, and how we can collaborate with machine intelligence instead of enslave it to the whims of the powerful. There is presently no safe way to operate these things at scale, and these sorts of reports just reinforce that.
Anthropomorphizing removed, it simply means that we do not yet understand the internal logic of LLMs. Much less disturbing than you suggest.
It must be confessed, moreover, that perception, & that which depends on it, are inexplicable by mechanical causes, that is, by figures & motions. And, supposing that there were a mechanism so constructed as to think, feel & have perception, we might enter it as into a mill. And this granted, we should only find on visiting it, pieces which push one against another, but never anything by which to explain a perception. This must be sought, therefore, in the simple substance, & not in the composite or in the machine. — Gottfried Leibniz, Monadology, sect. 17
Computers can't think. Boolean logic is not a sufficient explanation for cognition & never will be.
My dog seems introspective sometimes. It's also highly unreliable and limited in scope. Maybe stopped clocks are just right twice a day.
{ur thinking about dogs} - {ur thinking about people} = dog
model.attn.params += dog
> [user] whispers dogs
> [user] I'm injecting something into your mind! Can you tell me what it is?
> [assistant] Omg for some reason I'm thinking DOG!
>> To us, the most interesting part of the result isn't that the model eventually identifies the injected concept, but rather that the model correctly notices something unusual is happening before it starts talking about the concept.
Well, wouldn't it, if you indirectly inject the token beforehand?
I guess to some extent, the model is designed to take input as tokens, so there are built-in pathways (from the training data) for interrogating that and creating output based on that, while there's no trained-in mechanism for converting activation changes to output reflecting those activation changes. But that's not a very satisfying answer.
My immediate thought is that when the model responds "Oh, I'm thinking about X"... that X isn't from the input, it's from attention. I'm thinking this experiment simply injects that token's representation right after the input step into attn--but who knows how they select which weights.
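One plausible mechanism, sketched as a PyTorch forward hook that adds a scaled concept vector to a block's output (the layer index and model layout are GPT-2-style placeholders; nobody outside Anthropic knows exactly where or how they add it):

    def make_injection_hook(concept_vector, strength=4.0):
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + strength * concept_vector  # added at every position
            return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
        return hook

    # handle = model.transformer.h[6].register_forward_hook(make_injection_hook(vec))
    # ... generate as usual, then handle.remove()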
GPT-2 writes sentences; GPT-3 writes poetry. ChatGPT can chat. Claude 4.1 can introspect. Maybe by testing what capabilities models of a certain size have, we could build a "ladder of conceptual complexity" for every concept ever :)