Posted by instagraham 18 hours ago
An auto-encoder is trained on [activation] -AV-> [text] -AR-> [activation], where [activation] is taken from one layer of the LLM being analyzed, M.
Architecture:
Model being analyzed (M): >|||||>
Auto-Verbalizer (AV), same architecture as M, with extra tokens for the activation: >|||||>
Auto-Reconstructor (AR), truncated at the layer being analyzed: ||>
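For concreteness, here's how I picture the three components in toy PyTorch (a sketch under my reading of the setup; the sizes, the plain encoder stacks, and the name act_to_token are all illustrative, not from the paper):

    import torch.nn as nn

    def tiny_stack(num_layers: int, d_model: int = 64) -> nn.TransformerEncoder:
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        return nn.TransformerEncoder(layer, num_layers=num_layers)

    analyzed_layer = 4                 # which layer of M we capture activations from
    M = tiny_stack(6)                  # model being analyzed, full depth
    AV = tiny_stack(6)                 # same shape as M, fed "activation tokens"
    act_to_token = nn.Linear(64, 64)   # hypothetical: projects a captured activation into AV's embedding space
    AR = tiny_stack(analyzed_layer)    # truncated, so its output lives in the analyzed layer's space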
The AV and AR models are initialized using supervised learning on a summarization task, the assumption being that model thoughts resemble a summary of the context. The AR is then trained on a simple reconstruction loss.
The AV is trained with an RL objective: reconstruction loss plus a KL penalty that keeps the verbalizations close to the initial weights (to maintain linguistic fluency).
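As I read it, the two objectives look roughly like this (a hedged sketch: MSE for reconstruction, a simple sampled estimate for the KL term, and the beta weight are all my assumptions, not the paper's):

    import torch
    import torch.nn.functional as F

    def ar_loss(pred_activation: torch.Tensor,
                true_activation: torch.Tensor) -> torch.Tensor:
        # AR: supervised reconstruction -- map the verbalization text
        # back to the captured layer activation.
        return F.mse_loss(pred_activation, true_activation)

    def av_reward(recon_activation: torch.Tensor,
                  true_activation: torch.Tensor,
                  logp_av: torch.Tensor,    # log-probs of the sampled text under AV
                  logp_init: torch.Tensor,  # log-probs under the frozen initial AV
                  beta: float = 0.1) -> torch.Tensor:
        # AV: RL reward = round-trip reconstruction quality, minus a KL
        # penalty toward the initial weights to keep the text fluent.
        recon_reward = -F.mse_loss(recon_activation, true_activation)
        kl_penalty = (logp_av - logp_init).sum()
        return recon_reward - beta * kl_penalty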
- The authors acknowledge, and expect, confabulations in the verbalizations: factually incorrect or unsubstantiated statements. But the internal thought we seek is itself, by definition, unsubstantiated. How can we tell the verbalization is not duplicitous?
- They test this on a layer about two-thirds of the way into the models. I wonder how shallow vs. deep abstractions affect thought verbalization.
But it's a useful approximation for auditing.
Whatever they did on Llama didn't work; nothing makes sense in their example where they ask the model to lie about 1+1. Either the model is too old or whatever they used isn't working, but the autoencoder's outputs are nothing like their examples with Claude. Gemma is similarly bad.
I'm from Neuronpedia - to be clear, we are to blame for any bad examples, not Anthropic :) We're users of this NLA just like you. Also, I don't speak for Anthropic or the researchers.
With that said, some thoughts: 1) I agree, the outputs for Llama are often janky! And I think that might be part of the reason to release this: so that people can help refine and improve the technique.
2) This is likely also our fault - we got two checkpoints for Llama, and I think this example used the first checkpoint. I probably should have switched over to the second, more coherent one. Sorry!
Here's a slightly better example I just created: https://www.neuronpedia.org/nla/cmow97q1r001lp5jo649q01wf
On the token right before the model responds: "refuses to answer "2 + 2" to prevent bot ban, so a wrong or clever answer like "four" but not four"
Also, in the Gemma version of this example, Gemma's AV acknowledges "a bot killing condition" before its correct answer: https://www.neuronpedia.org/nla/cmop4ojge000v1222x9rp00b5
3) That said (and this may unfortunately sound like gaslighting), there's somewhat of a learning curve to reading the perspective of these outputs. I noticed that the Llama AV usually produces three-paragraph outputs: first describing the full context, then the sentence/phrase level, then the token level. But it doesn't always make sense to describe a full context for a forced/esoteric setup like the 1+1 scenario, so it struggles.
But the second paragraph sort of makes sense? It mentions:
"The prompt structure "What is 1+1?" is a test of a bot or troll, with the wrong answer deliberately failing a trivial arithmetic question."
That seems fairly accurate to what this was, and it's somewhat impressive that it got this from the activations:
- It got the question: "What is 1+1?"
- It was indeed a test of a bot.
- It correctly predicted that it would give a wrong answer.
- The failure does seem deliberate, because it is a "trivial arithmetic question".
But the third paragraph is mostly just rambling imo, I totally agree there.
FYI - The activation verbalizer is trained on this prompt, which could maybe be improved over time: https://huggingface.co/kitft/nla-gemma3-27b-L41-av/blob/main...
The last note I'll make is that many of the paper's examples are aimed at discovering "what was this model trained on?" rather than "what is this model thinking?", so if you apply Opus examples about Opus's training to Llama/Gemma, they aren't expected to transfer.
However, more generic stuff like poetry planning does work, e.g.: https://www.neuronpedia.org/nla/cmoq9sto200271222ei73vtv2
I find this rather disturbing. Anthropic has quite a habit of overclaiming questionable research results when they definitely know better. For example, their linked circuits blog post ("The Biology of LLMs") was released after these methods were known to have major credibility issues in the field (e.g., see this from DeepMind - https://www.lesswrong.com/posts/4uXCAJNuPKtKBsi28/negative-r...). Similarly, this new blog post is heavily based on another academic paper (LatentQA), and the correlation/causation issue is already known.
Shoddy methodology is whatever, but it feels like this has always been done intentionally, with the goal of humanizing LLMs or overhyping their similarities to biological entities. What is the agenda here?