Posted by instagraham 18 hours ago
An auto-encoder is trained on [activation] -AV-> [text] -AR-> [activation], where [activation] is taken from one layer of the LLM being analyzed, M.
Architecture:
Model being analyzed (M): >|||||>
Auto-Verbalizer (AV), same architecture as M, with extra tokens for the activation: >|||||>
Auto-Reconstructor (AR), truncated at the layer being analyzed: ||>
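For concreteness, here's how I picture the three components in toy PyTorch (a sketch under my reading of the setup; the sizes, the plain encoder stacks, and the name act_to_token are all illustrative, not from the paper):

    import torch.nn as nn

    def tiny_stack(num_layers: int, d_model: int = 64) -> nn.TransformerEncoder:
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        return nn.TransformerEncoder(layer, num_layers=num_layers)

    analyzed_layer = 4                 # which layer of M we capture activations from
    M = tiny_stack(6)                  # model being analyzed, full depth
    AV = tiny_stack(6)                 # same shape as M, fed "activation tokens"
    act_to_token = nn.Linear(64, 64)   # hypothetical: projects a captured activation into AV's embedding space
    AR = tiny_stack(analyzed_layer)    # truncated, so its output lives in the analyzed layer's space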
The AV and AR models are initialized using supervised learning on a summarization task, the assumption being that model thoughts resemble a summary of the context. The AR is then trained on a simple reconstruction loss.
The AV is trained with an RL objective: reconstruction loss plus a KL penalty that keeps the verbalizations close to the initial weights (to maintain linguistic fluency).
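As I read it, the two objectives look roughly like this (a hedged sketch: MSE for reconstruction, a simple sampled estimate for the KL term, and the beta weight are all my assumptions, not the paper's):

    import torch
    import torch.nn.functional as F

    def ar_loss(pred_activation: torch.Tensor,
                true_activation: torch.Tensor) -> torch.Tensor:
        # AR: supervised reconstruction -- map the verbalization text
        # back to the captured layer activation.
        return F.mse_loss(pred_activation, true_activation)

    def av_reward(recon_activation: torch.Tensor,
                  true_activation: torch.Tensor,
                  logp_av: torch.Tensor,    # log-probs of the sampled text under AV
                  logp_init: torch.Tensor,  # log-probs under the frozen initial AV
                  beta: float = 0.1) -> torch.Tensor:
        # AV: RL reward = round-trip reconstruction quality, minus a KL
        # penalty toward the initial weights to keep the text fluent.
        recon_reward = -F.mse_loss(recon_activation, true_activation)
        kl_penalty = (logp_av - logp_init).sum()
        return recon_reward - beta * kl_penalty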
- The authors acknowledge, and expect, confabulations in the verbalizations: factually incorrect or unsubstantiated statements. But the internal thought we seek is itself, by definition, unsubstantiated. How can we tell the verbalization is not duplicitous?
- They test this on a layer about two-thirds of the way into the models. I wonder how shallow vs. deep abstractions affect thought verbalization.
But it's a useful approximation for auditing.
Whatever they did on Llama didn't work; nothing makes sense in their example where they ask the model to lie about 1+1. Either the model is too old or whatever they used isn't working, but the autoencoder's outputs are nothing like their examples with Claude. Gemma is similarly bad.
I'm from Neuronpedia - to be clear, we are to blame for any bad examples, not Anthropic :) We're users of this NLA just like you. Also, I don't speak for Anthropic or the researchers.
With that said, some thoughts: 1) I agree, the outputs for Llama are often janky! And I think that might be part of the reason to release this: so that people can help refine and improve the technique.
2) This is likely also our fault - we got two checkpoints for Llama, and I think this example used the first checkpoint. I probably should have switched over to the second, more coherent one. Sorry!
Here's a slightly better example I just created: https://www.neuronpedia.org/nla/cmow97q1r001lp5jo649q01wf
On the token right before the model responds: "refuses to answer "2 + 2" to prevent bot ban, so a wrong or clever answer like "four" but not four"
Also, in the Gemma version of this example, Gemma's AV acknowledges "a bot killing condition" before its correct answer: https://www.neuronpedia.org/nla/cmop4ojge000v1222x9rp00b5
3) That said (and this may unfortunately sound like gaslighting), there's somewhat of a learning curve to reading the perspective of these outputs. I noticed that the Llama AV usually produces three-paragraph outputs: first describing the full context, then the sentence/phrase level, then the token level. But it doesn't always make sense to describe a full context for a forced/esoteric setup like the 1+1 scenario, so it struggles.
But the second paragraph sort of makes sense? It mentions:
"The prompt structure "What is 1+1?" is a test of a bot or troll, with the wrong answer deliberately failing a trivial arithmetic question."
That seems fairly accurate to what this was, and it's somewhat impressive that it got this from the activations:
- It got the question: "What is 1+1?"
- It was indeed a test of a bot.
- It correctly predicted that it would give a wrong answer.
- The failure does seem deliberate, because it is a "trivial arithmetic question".
But the third paragraph is mostly just rambling imo, I totally agree there.
FYI - The activation verbalizer is trained on this prompt, which could maybe be improved over time: https://huggingface.co/kitft/nla-gemma3-27b-L41-av/blob/main...
The last note I'll make is that many of the paper's examples are aimed at discovering "what was this model trained on?" rather than "what is this model thinking?", so if you apply Opus examples about Opus's training to Llama/Gemma, they aren't expected to transfer.
However, more generic stuff like poetry planning does work, e.g.: https://www.neuronpedia.org/nla/cmoq9sto200271222ei73vtv2
I find this rather disturbing. Anthropic has quite a habit of overclaiming questionable research results when they definitely know better. For example, their linked circuits blog post ("The Biology of LLMs") was released after these methods were known to have major credibility issues in the field (e.g., see this from DeepMind - https://www.lesswrong.com/posts/4uXCAJNuPKtKBsi28/negative-r...). Similarly, this new blog post is heavily based on another academic paper (LatentQA), and the correlation/causation issue is already known.
Shoddy methodology is whatever, but it feels like this has always been done intentionally, with the goal of humanizing LLMs or overhyping their similarities to biological entities. What is the agenda here?