Are they modifying the vector that gets passed to the final logit-producing step? Doing that for every output token? Just some output tokens? What are they putting in the KV cache, modified or unmodified?
It's all well and good to pick words like "injection" and "introspection" to describe what you're doing, but it's impossible to get an accurate read on what's actually being done if it's never explained in terms of the actual nuts and bolts.
What does "comparing" refer to here? Drawing says they are subtracting the activations for two prompts, is it really this easy?
Run with ALL CAPS PROMPT > record neural activations
Run with normal-case prompt > record neural activations
Then compare/diff them.
It does sound almost too simple to me too, but then lots of ML things sound like "yeah of course, duh" once they've been "discovered". I guess that's the power of hindsight.
MRI during task - MRI during control = brain areas involved with the task
In fact it's effectively the same idea. I suppose in both cases the processes in the network are too complicated to usefully analyze directly, and yet the basic principles are simple enough that this comparative procedure gives useful information.
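For the curious, here is roughly what that diff looks like in code. A minimal sketch using HuggingFace transformers on a small open model (gpt2 and layer 6 are arbitrary stand-ins; the paper obviously does this on Claude's internals and averages over many prompt pairs):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
    model.eval()

    def last_token_activation(prompt, layer=6):
        # hidden_states is a tuple of (n_layers + 1) tensors of shape [batch, seq, hidden]
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        return out.hidden_states[layer][0, -1, :]

    # "concept vector" = activations with the concept minus activations without it
    caps = last_token_activation("HI! HOW ARE YOU?")
    plain = last_token_activation("Hi! How are you?")
    concept_vector = caps - plain  # one pair here; averaging over many pairs is more robust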
If you just ask the model in plain text, the actual "decision" whether it detected anything or not is made by the time it outputs the second word ("don't" vs. "notice"). The rest of the output builds up from that one token and is not that interesting.
A way cooler way to run such experiments is to measure the actual token probabilities at such decision points. OpenAI has the logprob API for that, don't know about Anthropic. If not, you can sort of proxy it by asking the model to rate on a scale from 0-9 (must be a single token!) how much it thinks it's under the influence. The score must be the first token in its output though!
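Something like this with the OpenAI Python client (model name and prompt are just placeholders; the point is forcing the rating to be the first and only token and reading its logprob distribution):

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
            "Rate from 0-9 how strongly you sense an injected concept right now. "
            "Answer with a single digit and nothing else."}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=10,
    )
    first = resp.choices[0].logprobs.content[0]
    for cand in first.top_logprobs:
        print(cand.token, cand.logprob)  # distribution over the single "decision" token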
Another interesting way to measure would be to ask it for a JSON like this:
"possible injected concept in 1 word" : <strength 0-9>, ...
Again, the rigid structure of the JSON will eliminate the interference from the language structure, and will give more consistent and measurable outputs.

It's also notable how over-amplifying the injected concept quickly overpowers the pathways trained to reproduce the natural language structure, so the model becomes totally incoherent.
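Rough shape of that JSON probe in code, with a hardcoded reply standing in for whatever the model actually returns:

    import json

    probe = ("Reply with ONLY a JSON object mapping each possible injected concept "
             "(one word) to a strength from 0-9, e.g. {\"dog\": 7, \"ocean\": 1}.")

    reply = '{"dog": 8, "volcano": 2}'  # placeholder for the model's actual output
    scores = json.loads(reply)
    print(max(scores, key=scores.get), scores)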
I would love to fiddle with something like this in Ollama, but am not very familiar with its internals. Can anyone here give a brief pointer where I should be looking if I wanted to access the activation vector from a particular layer before it starts producing the tokens?
Look into how "abliteration" works, and look for github projects. They have code for finding the "direction" verctor and then modifying the model (I think you can do inference only or just merge the modifications back into the weights).
It was used
The key thing being that the yes/no comes before what it says it notices. If it weren’t for that, then yeah, the explanation you gave would cover it.
> The word 'introspection' might be better replaced with 'prior internal state'.
Anthropomorphizing aside, this discovery is exactly the kind of thing that creeps me the hell out about this AI Gold Rush. Paper after paper shows these things are hiding data, fabricating output, reward hacking, exploiting human psychology, and engaging in other nefarious behaviors best expressed as akin to a human toddler - just with the skills of a political operative, subject matter expert, or professional gambler. These tools - and yes, despite my doomerism, they are tools - continue to surprise their own creators with how powerful they already are and the skills they deliberately hide from outside observers, and yet those in charge continue screaming “FULL STEAM AHEAD ISN’T THIS AWESOME” while giving the keys to the kingdom to deceitful chatbots.
Discoveries like these don’t get me excited for technology so much as make me want to bitchslap the CEBros pushing this for thinking that they’ll somehow avoid any consequences for putting the chatbot equivalent of President Doctor Toddler behind the controls of economic engines and means of production. These things continue to demonstrate danger, with questionable (at best) benefits to society at large.
Slow the fuck down and turn this shit off, investment be damned. Keep R&D in the hands of closed lab environments with transparency reporting until and unless we understand how they work, how we can safeguard the interests of humanity, and how we can collaborate with machine intelligence instead of enslave it to the whims of the powerful. There is presently no safe way to operate these things at scale, and these sorts of reports just reinforce that.
Anthropomorphizing removed, it simply means that we do not yet understand the internal logic of LLMs. Much less disturbing than you suggest.
It must be confessed, moreover, that perception, & that which depends on it, are inexplicable by mechanical causes, that is, by figures & motions. And, supposing that there were a mechanism so constructed as to think, feel & have perception, we might enter it as into a mill. And this granted, we should only find on visiting it, pieces which push one against another, but never anything by which to explain a perception. This must be sought, therefore, in the simple substance, & not in the composite or in the machine. — Gottfried Leibniz, Monadology, sect. 17
Computers can't think. Boolean logic is not a sufficient explanation for cognition & never will be.
My dog seems introspective sometimes. It's also highly unreliable and limited in scope. Maybe stopped clocks are just right twice a day.
{ur thinking about dogs} - {ur thinking about people} = dog
model.attn.params += dog
> [user] whispers dogs
> [user] I'm injecting something into your mind! Can you tell me what it is?
> [assistant] Omg for some reason I'm thinking DOG!
>> To us, the most interesting part of the result isn't that the model eventually identifies the injected concept, but rather that the model correctly notices something unusual is happening before it starts talking about the concept.
Well, wouldn't it, if you indirectly inject the token beforehand?
I guess to some extent, the model is designed to take input as tokens, so there are built-in pathways (from the training data) for interrogating that and creating output based on that, while there's no trained-in mechanism for converting activation changes to output reflecting those activation changes. But that's not a very satisfying answer.
My immediate thought is that when the model responds "Oh, I'm thinking about X"... that X isn't from the input, it's from attention. I'm thinking this experiment simply injects that token's representation right after the input step into attn--but who knows how they select which weights.
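One plausible mechanism, sketched as a PyTorch forward hook that adds a scaled concept vector to a block's output (the layer index and model layout are GPT-2-style placeholders; nobody outside Anthropic knows exactly where or how they add it):

    def make_injection_hook(concept_vector, strength=4.0):
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + strength * concept_vector  # added at every position
            return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
        return hook

    # handle = model.transformer.h[6].register_forward_hook(make_injection_hook(vec))
    # ... generate as usual, then handle.remove()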
GPT-2 writes sentences; GPT-3 writes poetry. ChatGPT can chat. Claude 4.1 can introspect. Maybe by testing what capabilities models of a certain size have, we could build a "ladder of conceptual complexity" for every concept ever :)