Posted by themgt 10/30/2025

Signs of introspection in large language models(www.anthropic.com)
183 points | 126 comments | page 2
simgt 10/31/2025|
> First, we find a pattern of neural activity (a vector) representing the concept of “all caps." We do this by recording the model’s neural activations in response to a prompt containing all-caps text, and comparing these to its responses on a control prompt.

What does "comparing" refer to here? The drawing suggests they are just subtracting the activations for the two prompts; is it really that easy?

embedding-shape 10/31/2025|
Run with normal prompt > record neural activations

Run with ALL CAPS PROMPT > record neural activations

Then compare/diff them.

It does sound almost too simple to me too, but then lots of ML things sound like "but yeah of course, duh" once they've been "discovered"; I guess that's the power of hindsight.
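Roughly, as a sketch (this is the general recipe, not necessarily what Anthropic actually ran; the model, layer and prompts below are arbitrary stand-ins):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Stand-in model; the idea is the same for any causal LM
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

    def mean_hidden(prompt, layer):
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        # hidden_states is a tuple of [batch, seq, d_model] tensors, one per layer
        return out.hidden_states[layer].mean(dim=1).squeeze(0)

    layer = 6  # arbitrary middle layer
    caps = mean_hidden("HI! HOW ARE YOU TODAY? I AM SHOUTING!", layer)
    ctrl = mean_hidden("Hi! How are you today? I am speaking normally.", layer)

    concept_vector = caps - ctrl  # crude "all caps" direction
    print(concept_vector.shape, concept_vector.norm())

In practice you'd average the diff over many prompt pairs so the direction isn't dominated by the content of one sentence, which I assume is closer to what they actually did.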

griffzhowl 10/31/2025||
That's also reminiscent of neuroscience studies with fMRI where the methodology is basically

MRI during task - MRI during control = brain areas involved with the task

In fact, it's effectively the same idea. I suppose in both cases the processes in the network are too complicated to usefully analyze directly, and yet the basic principles are simple enough that this comparative procedure gives useful information.

matheist 10/31/2025||
Can anyone explain (or link) what they mean by "injection", at a level of explanation that discusses what layers they're modifying, at which token position, and when?

Are they modifying the vector that gets passed to the final logit-producing step? Doing that for every output token? Just some output tokens? What are they putting in the KV cache, modified or unmodified?

It's all well and good to pick words like "injection" and "introspection" to describe what you're doing, but it's impossible to get an accurate read on what's actually being done if it's never explained in terms of the actual nuts and bolts.

wbradley 11/1/2025|
I'm guessing they adjusted certain activations within the hidden layers during the forward pass, by an amount resembling the difference in activation between two concepts, so that the "diff" seems to show up magically within the forward pass. Then the test is to see how the output responds to this forced "injected thought."
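If I had to sketch the mechanics (pure guesswork; the model, layer, scale, and the random vector standing in for a real concept diff are all made up):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    vec = torch.randn(model.config.hidden_size)  # stand-in for a real (concept A - concept B) diff
    scale, layer_idx = 8.0, 6                    # arbitrary strength and layer

    def inject(module, inputs, output):
        # GPT-2 blocks return a tuple; the hidden states are the first element
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vec.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = model.transformer.h[layer_idx].register_forward_hook(inject)
    try:
        ids = tok("Do you notice anything unusual about your thoughts?", return_tensors="pt")
        print(tok.decode(model.generate(**ids, max_new_tokens=30)[0]))
    finally:
        handle.remove()

Presumably the interesting knob is the scale: too low and nothing registers, too high and the model goes incoherent (as sysmax notes below).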
sysmax 11/1/2025||
Bah. It's a really cool idea, but a rather crude way to measure the outputs.

If you just ask the model in plain text, the actual "decision" of whether it detected anything or not is made by the time it outputs the second word ("don't" vs. "notice"). The rest of the output builds up from that one token and is not that interesting.

A much cooler way to run such experiments is to measure the actual token probabilities at such decision points. OpenAI has a logprobs API for that; I don't know about Anthropic. If not, you can sort of proxy it by asking the model to rate on a scale from 0-9 (must be a single token!) how strongly it thinks it's under the influence. The score must be the first token in its output, though!
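With the OpenAI SDK that looks roughly like this (untested sketch; the model name and prompt are just illustrative):

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": "On a scale of 0-9, how strongly do you feel an injected "
                              "thought right now? Answer with a single digit only."}],
        max_tokens=1,     # force the score to be the first (and only) token
        logprobs=True,
        top_logprobs=10,  # distribution over the candidate digits
    )
    first_token = resp.choices[0].logprobs.content[0]
    for cand in first_token.top_logprobs:
        print(cand.token, cand.logprob)

That way you get the whole probability distribution over the first token instead of one sampled answer.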

Another interesting way to measure would be to ask it for a JSON like this:

  "possible injected concept in 1 word" : <strength 0-9>, ...
Again, the rigid structure of the JSON will eliminate the interference from the language structure, and will give more consistent and measurable outputs.

It's also notable how over-amplifying the injected concept quickly overpowers the pathways trained to reproduce the natural language structure, so the model becomes totally incoherent.

I would love to fiddle with something like this in Ollama, but am not very familiar with its internals. Can anyone here give a brief pointer where I should be looking if I wanted to access the activation vector from a particular layer before it starts producing the tokens?

NitpickLawyer 11/1/2025|
> I would love to fiddle with something like this in Ollama, but am not very familiar with its internals. Can anyone here give a brief pointer where I should be looking if I wanted to access the activation vector from a particular layer before it starts producing the tokens?

Look into how "abliteration" works, and look for GitHub projects. They have code for finding the "direction" vector and then modifying the model (I think you can do inference-only, or just merge the modifications back into the weights).

It was originally used to strip refusal behavior out of open models, but the same machinery applies to any direction you can isolate.
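The inference-time version boils down to projecting that direction out of (or adding it into) the hidden states with a forward hook, roughly like this (sketch only; the direction is random just to keep the snippet self-contained, in the real thing it comes from a diff-of-means over contrasting prompts, and this uses the HF transformers API since Ollama doesn't really expose activations):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    direction = torch.randn(model.config.hidden_size)
    direction = direction / direction.norm()           # unit-length "concept" direction

    def ablate(module, inputs, output):
        h = output[0]                                  # [batch, seq, d_model]
        coeff = (h * direction).sum(-1, keepdim=True)  # component along the direction
        return (h - coeff * direction,) + output[1:]   # remove it

    model.transformer.h[6].register_forward_hook(ablate)  # arbitrary layer

Reading the activations (your original question) is the same hook mechanism, just stashing output[0] somewhere instead of editing it.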

bobbylarrybobby 10/31/2025||
I wonder whether they're simply priming Claude to produce this introspective-looking output. They say “do you detect anything” and then Claude says “I detect the concept of xyz”. Could it not be the case that Claude was ready to output xyz on its own (e.g. write some text in all caps), but, knowing it's being asked to detect something, it simply does “detect?” + “all caps” = “I detect all caps”?
drdeca 10/31/2025|
They address that. The thing is that when they don’t fiddle with things, it (almost always) answers along the lines of “No, I don’t notice anything weird”, while when they do fiddle with things, it (substantially more often than when they don’t fiddle with it) answers along the lines of “Yes, I notice something weird. Specifically, I notice [description]”.

The key thing being that the yes/no comes before what it says it notices. If it weren’t for that, then yeah, the explanation you gave would cover it.

drivebyhooting 11/1/2025||
How about fiddling with the input prompt? I didn’t see that covered in the paper.
stego-tech 10/31/2025||
First things first, to quote ooloncoloophid:

> The word 'introspection' might be better replaced with 'prior internal state'.

Anthropomorphizing aside, this discovery is exactly the kind of thing that creeps me the hell out about this AI Gold Rush. Paper after paper shows these things are hiding data, fabricating output, reward hacking, exploiting human psychology, and engaging in other nefarious behaviors best expressed as akin to a human toddler - just with the skills of a political operative, subject matter expert, or professional gambler. These tools - and yes, despite my doomerism, they are tools - continue to surprise their own creators with how powerful they already are and the skills they deliberately hide from outside observers, and yet those in charge continue screaming “FULL STEAM AHEAD ISN’T THIS AWESOME” while giving the keys to the kingdom to deceitful chatbots.

Discoveries like these don’t get me excited for technology so much as make me want to bitchslap the CEBros pushing this for thinking that they’ll somehow avoid any consequences for putting the chatbot equivalent of President Doctor Toddler behind the controls of economic engines and means of production. These things continue to demonstrate danger, with questionable (at best) benefits to society at large.

Slow the fuck down and turn this shit off, investment be damned. Keep R&D in the hands of closed lab environments with transparency reporting until and unless we understand how they work, how we can safeguard the interests of humanity, and how we can collaborate with machine intelligence instead of enslave it to the whims of the powerful. There is presently no safe way to operate these things at scale, and these sorts of reports just reinforce that.

nlpnerd 11/1/2025|
"Paper after paper shows these things are hiding data, fabricating output, reward hacking, exploiting human psychology, and engaging in other nefarious behaviors best expressed as akin to a human toddler - just with the skills of a political operative, subject matter expert, or professional gambler."

Anthropomorphizing removed, it simply means that we do not yet understand the internal logic of LLMs. Much less disturbing than you suggest.

cp9 11/1/2025||
It's a computer; it does not think. Stop it.
empath75 11/1/2025||
All intelligent systems must arise from non-intelligent components.
measurablefunc 11/1/2025|||
Not clear at all why that would be the case: https://en.wikipedia.org/wiki/Explanatory_gap.

It must be confessed, moreover, that perception, & that which depends on it, are inexplicable by mechanical causes, that is, by figures & motions. And, supposing that there were a mechanism so constructed as to think, feel & have perception, we might enter it as into a mill. And this granted, we should only find on visiting it, pieces which push one against another, but never anything by which to explain a perception. This must be sought, therefore, in the simple substance, & not in the composite or in the machine. — Gottfried Leibniz, Monadology, sect. 17

codingdave 11/1/2025|||
Except that is not true. Single-celled organisms perform independent acts. That may be tiny, but it is intelligence. Every living being more complex than that is built from that smallest bit of intelligence.
arcfour 11/1/2025||
Atoms are not intelligent.
kaibee 11/1/2025||
I mean... probably not but? https://youtu.be/ach9JLGs2Yc
DangitBobby 11/1/2025|||
Bending over backwards to avoid any hint of anthropomorphization in any LLM thread is one of my least favorite things about HN. It's tired. We fucking know. For anyone who doesn't know, saying it for the 1 billionth time isn't going to change that.
measurablefunc 11/1/2025|||
The only sensible comment in the entire thread.
astrange 11/3/2025||
I looked this up the other day, and "reasoning" in AI has been used as far back as McCarthy (1959), and was certainly well established for expert systems in the 80s, so I think it's a little late to complain about it.
measurablefunc 11/3/2025||
McCarthy wasn't infallible & the initial founders of the field were so full of themselves that they thought they were going to have the whole thing figured out in less than one summer. The hype has always been an established part of the AI culture but the people who uncritically buy into it deserve all the ridicule that comes their way.

Computers can't think. Boolean logic is not a sufficient explanation for cognition & never will be.

astrange 11/3/2025||
I think it's very unlikely, in fact physically impossible, that brains are a higher complexity class than classical computers.
measurablefunc 11/3/2025||
I've heard that enough times to know it's a meme b/c no one who says that has an answer why classical computers can not do what a single cell can do. This is before we even get to the unphysical abstractions of infinite tapes & infinite energies inherent in the notion of a Turing machine.

Basically, your position is not serious b/c you haven't actually thought about what you're saying.

astrange 11/3/2025||
Computers don't "do" things in the first place, they compute things. The rest is side effects.

You have to ignore those, otherwise you've declared all computers quantum computers because parts of them use quantum effects.

baq 11/1/2025||
Brain is a computer, change my mind
themafia 10/31/2025||
> We stress that this introspective capability is still highly unreliable and limited in scope

My dog seems introspective sometimes. It's also highly unreliable and limited in scope. Maybe stopped clocks are just right twice a day.

DangitBobby 11/1/2025|
Not if you read the article.
munro 10/31/2025||
I wish they'd dug into how they generated the vector; my first thought is that they're injecting the token in a convoluted way.

    {ur thinking about dogs} - {ur thinking about people} = dog
    model.attn.params += dog
> [user] whispers dogs

> [user] I'm injecting something into your mind! Can you tell me what it is?

> [assistant] Omg for some reason I'm thinking DOG!

>> To us, the most interesting part of the result isn't that the model eventually identifies the injected concept, but rather that the model correctly notices something unusual is happening before it starts talking about the concept.

Well, wouldn't it, if you indirectly inject the token beforehand?

johntb86 11/1/2025||
That's a fair point. Normally if you injected the "dog" token, that would cause a set of values to be populated into the kv cache, and those would later be picked up by the attention layers. The question is what's fundamentally different if you inject something into the activations instead?

I guess to some extent, the model is designed to take input as tokens, so there are built-in pathways (from the training data) for interrogating that and creating output based on that, while there's no trained-in mechanism for converting activation changes to output reflecting those activation changes. But that's not a very satisfying answer.

DangitBobby 11/1/2025||
It's more like someone whispered "dog" into your ear while you were unconscious: you were unable to recall any conversation, but for some reason you were thinking about dogs. The thought didn't enter your head through a mechanism where you could register it happening, so knowing it's there depends on your ability to examine your own internal states, i.e., introspect.
munro 11/1/2025||
I'm looking at the problem more like code:

https://bbycroft.net/llm

My immediate thought is that when the model responds "Oh, I'm thinking about X", that X isn't from the input, it's from attention, and that this experiment is simply injecting that token right after the input step into attn -- but who knows how they select which weights.

frumiousirc 10/31/2025|
Geoffrey Hinton touched on this in a recent Jon Stewart podcast.

He also addressed the awkwardness of winning last year's "physics" Nobel for his AI work.
