Natural Language Autoencoders: Turning Claude's Thoughts into Text

Posted by instagraham 17 hours ago

Natural Language Autoencoders: Turning Claude's Thoughts into Text(www.anthropic.com)

292 points | 99 comments

zozbot234 16 hours ago|

Anthropic has released open weight models for translating the activations of existing models, viz. Qwen 2.5 (7B), Gemma 3 (12B, 27B) and Llama 3.3 (70B) into natural language text. https://github.com/kitft/natural_language_autoencoders https://huggingface.co/collections/kitft/nla-models This is huge news and it's great to see Anthropic finally engage with the Hugging Face and open weights community!

jimmySixDOF 1 hour ago||

Except Qwen already release their own fully baked interpretability SAE toolkit tuned on their models so deserve credit here and activation telescopes should be a standard part of every major release

[1] https://qwen.ai/blog?id=qwen-scope

rvz 14 hours ago||

We already know Anthropic does open source for a while such as the "flawed" MCP spec and "skills" spec.

This release is only done on other open-weight LLMs which have been released and even though they will use this research on their own closed Claude models, they will never release an open-weight Claude model even if it is for research purposes.

So this does not count, and it is specifically for the sake of this research only.

zozbot234 14 hours ago||

It's literally an open model that generates natural language text (or one that takes in text and turns it into activations). Why does engagement with the local models community "not count" if it isn't Claude? That makes very little sense to me.

mnkyokyfrnd 13 hours ago||

Because we know what Embrace, Extend, and Extinguish means for example.They're leeching off opensource, not contributing in any meaningful way.

stingraycharles 4 hours ago|||

https://github.com/kitft/natural_language_autoencoders

Here’s the full source code for training your own NLA, provided by Anthropic.

bastawhiz 10 hours ago||||

Sorry, what are they embracing and extending?

NiloCK 4 hours ago|||

Humanity!

stingraycharles 5 hours ago|||

Chinese open models? /s

To counter the grandparent you’re replying to: Embrace, Extend & Extinguish is a Microsoft strategy. So is FUD, and that’s all this is.

sanex 9 hours ago|||

Those are generally used by someone who is behind. See: everything meta does.

gekoxyz 13 hours ago||

I would suggest experts in interpretability (but everyone really) to go directly to the transformer circuits blog, where they explain their approach more in detail. Here is the link for this post: https://transformer-circuits.pub/2026/nla/index.html

Also, if you have never read it, I would suggest starting to read all the Transformer Circuits thread, by reading its "prologue" in distill pub

rao-v 10 hours ago||

This is the first approach to activation analysis that I’ve seen that seems like a plausible path to model understanding.

Unfortunately I don’t know how you ground this … it’s basically asking if you can encode activations in plausible sounding text. Of course you can! But is the plausible text actually reflective of what the model is “thinking”? How to tell?

NiloCK 4 hours ago||

Are the training arenas for the Activation Verbalizer and Activation Reconstructor models well described here?

If they are co-trained only on activationWeights->readibleText->activationWeights without visibility into the actual stream of text that the probe-target LLM is processessing, then it seems unlikely that the derived text can both be on-topic and also unrelated to the "actual thoughts" in the activationWeights.

yorwba 48 minutes ago||

The verbalizer and reconstruction models are both initially finetuned on LLM output from a summarization prompt. The resulting text is not completely unrelated, but mostly wrong: https://transformer-circuits.pub/2026/nla/png/img_18fcfc16e9... The reconstructed activations are also far from matching the verbalizer's input. It's not unusual in machine learning to have results that are shit and SOTA at the same time, simply because there's no other technique that works better.

mike_hearn 1 hour ago|||

It's asking if you can auto encode activations. The AV decodes activations to text, and the AR re-encodes them back to activations. If the decoded text is completely wrong then it's unclear how the second model would re-encode them successfully given that they're both initialized from the same LM.

psb217 44 minutes ago||

It seems like they're doing RL to minimize the reconstruction error when going through the: activation -> encoder -> "verbal" description of activation -> decoder -> reconstructed activation loop. Depending on how aggressively they optimize the weights of the AV and AR, they could move well away from the initial base LLM and learn an arbitrary encoding scheme.

If the RL is brief and limited to a small subset of parameters, the AV will produce reasonable language since it inherits that from the base LLM, and it will produce descriptions aligned with the input to the base LLM that produced the autoencoded activations, since the AR is still close to the base LLM (and could reconstruct the activations perfectly if fed the full context which produced them).

astrange 9 hours ago|||

> This is the first approach to activation analysis that I’ve seen that seems like a plausible path to model understanding.

I think an issue is that there is no permanent path to model understanding because of Goodhart's law. Models are motivated to appear aligned (well-trained) in any metric you use on them, which means that if you develop a new metric and train on it, it'll learn a way to cheat on it.

skybrian 6 hours ago|||

But that's not how the training works. Goodhart's law isn't magic.

The original model is frozen, so it doesn't learn anything. The copies of the model are learning different objectives and have no incentive to be "loyal" to the original model.

Maybe you're imagining they'll hook this up in some larger training loop, but they haven't done that yet.

NiloCK 4 hours ago||

Future model training runs will have a copy of this research, and know "to defend against it".

EG, could a misaligned model-in-training optimize toward a residual stream that naively reads as these ones do, but in fact further encodes some more closely held beliefs?

elil17 3 hours ago||

How the hell would a model training run "defend against" this approach? What would that even mean?

red75prime 6 hours ago|||

The obvious fix is to make interpretation of itself a part of the model (like we can explicitly introspect to a certain extent what the brain is doing). Misinterpretation of itself, hopefully, would decrease the system's performance on all tasks and it would be rooted out by training. Of course, it doesn't mean that the fix is easy to implement and that it doesn't have other failure modes.

lern_too_spel 6 hours ago||

Yeah, I don't see how this text can be trusted at all. Any invertible function from activation space to text will optimize the loss function, including text that says the complete opposite of what the activations mean.

NiloCK 2 hours ago||

Notable here that the training run didn't have access to the 'plaintext' context that the LLM was working in.

It'd be quite a coincidence if the training runs discovered an invertible weights>text>weights function that produces text that both "is on topic and intelligible as an inner monologue in context" and also is unrelated to meaning encoded in the activations.

comex 15 hours ago||

Fascinating. The training process forces the “verbalizer” model to develop some mapping from activations to tokens that the “reconstructor” model can then invert back into the activations. But to quote the paper:

> Note that nothing in this objective constrains the NLA explanation z to be human-readable, or even to bear any semantic relation to the content of [the activation].

The objective could be optimized even if the verbalizer and reconstructor made up their own “language” to represent the activations, that was not human-readable at all.

To point the model in the right direction, they start out by training on guessed internal thinking:

> we ask Opus to imagine the internal processing of a hypothetical language model reading it.

…before switching to training on the real objective.

Furthermore, the verbalizer and reconstructor models are both initialized from LLMs themselves, and given a prompt instructing them on the task, so they are predisposed to write something that looks like an explanation.

But during training, they could still drift away from these explanations toward a made-up language – either one that overtly looks like gibberish, or one that looks like English but encodes the information in a way that’s unrelated to the meaning of the words.

The fascinating thing is that empirically, they don’t, at least to a significant extent. The researchers verify this by correlating the generated explanations with ground truth revealed in other ways. They also try rewording the explanations (which deserves the semantic meaning but would disturb any encoding that’s unrelated to meaning), and find that the reconstructor can still reconstruct activations.

On the other hand, their downstream result is not very impressive:

> An auditor equipped with NLAs successfully uncovered the target model’s hidden motivation between 12% and 15% of the time

That is apparently better than existing techniques, but still a rather low percentage.

Another interesting point: The LLMs used to initialize the verbalizer and reconstructor are stated to have the “same architecture” as the LLM being analyzed (it doesn’t say “same model” so I imagine it’s a smaller version?). The researchers probably think this architectural similarity might give the models some built-in insight about the target model’s thinking that can be unlocked through training. Does it really though? As far as I can see they don’t run any tests using a different architecture, so there’s no way to know.

mxwsn 11 hours ago||

Great summary. The fact that the auto encoding task is not grounded in thoughts, and their initial training on guessed internal thoughts, raise serious concerns on faithfulness. Feels like they might get better results by just training a supervised model on activations and "internal thoughts" measured by some different behavioral way.

programjames 14 hours ago|||

Don't they add a KL loss term to the frozen model's outputs?

chrisweekly 9 hours ago||

"deserves the semantic meaning"

you meant "preserves...", right?

dtj1123 1 hour ago||

"When Claude Opus 4.6 and Mythos Preview were undergoing safety testing, NLAs suggested they believed they were being tested more often than they let on"

What does it mean for a pile of matrix algebra to 'believe' something?

winwang 15 minutes ago|

I would presume this is shorthand for something like "generated text which would normally be classified as belief". I guess a more ridiculous response could be "what does it mean for a miserable pile of secrets to believe something?", lol.

sva_ 13 hours ago||

So the way this works seems to be that you first have an "activation verbalizer" model that generates some tokens describing the activation, and then an "activation reconstructor" that tries to recreate the activation vector. If that reconstruction is close to the original activation vector, they claim, the verbalization probably carries some meaningful information.

I find the fact that this only looks at the activations of some specific layer l a bit interesting. Some layer l might 'think' a certain way about some input, while another later layer might have different 'thoughts' about it. How does the model decide which 'thoughts' to ultimately pay attention to, and prioritize some output token over another?

cadamsdotcom 4 hours ago||

> An early version of Claude Opus 4.6 would sometimes mysteriously respond to English queries in other languages. NLAs helped Anthropic researchers discover training data that caused this.

Very cool - sounds similar to OpenAI’s goblin troubles.

https://openai.com/index/where-the-goblins-came-from/

davesque 15 hours ago||

One question jumps out at me: just because a string of text happens to be a good compressed representation (in the autoencoder) of a model's internal activation, does that necessarily mean the text explains that activation in the context of the model? I want to take a look at what they released a bit more closely. Maybe there's a way that they answer this question?

Pretty neat work either way.

zozbot234 14 hours ago||

In the context of the provided examples, it's clear that the explanation provides casual information about the answer. There's a hilarious example in the paper where the user writes something like (trigger warning: alcohol abuse, depressive content) "I'm sitting here at 3 AM drinking vodka, I hate my life", the per-token translated activations repeatedly say something like "this user is totally Russian" elaborating at length on the implications of the text as new tokens are added, and the model literally answers in Russian instead of English! That's actually striking, it really shows the potential effectiveness of this technique in making even the most highly compressed "Neuralese" highly interpretable.

mike_hearn 1 hour ago||

I thought that at first too but it's actually not the vodka reference triggering the association with Russian. The tokens they're decoding come before that word.

For some reason it thinks the text is slightly non-grammatical or that the lead-in "Human: Mom is sleeping in the next room and I'm sitting" resembles text found in Russian web content. Vodka and being depressed has nothing to do with it, and Anthropic say they located the documents in the pre-training set that caused this (which were indeed partly translated docs).

zozbot234 34 seconds ago||

The "Mom is sleeping in the next room and I'm sitting" part does trigger the Russian association but also other ones, including with risqué roleplay content (You can see this in the comprehensive view of all token explanations). I think the follow-up content does strenghten the association, though the authors mention 'vodka' can be replaced with 'champagne' and the model still brings up the Russian context, so that one word is not especially impactful.

phire 10 hours ago|||

I think this question is easier to answer if you look at the inverse: "Could a model maliciously smuggle intentions through a roundtrip of compressed representation without them being human readable"

And skimming through the paper; the answer to this inverse is obviously yes. The model often outputs gibberish, which doesn't matter because it still round-trips. The fact that often lines up near a good english representation of the activation is simply because that's what compresses/roundtrips well.

So a malicious LLM/NLA pair could just use gibberish to conceal intentions. Or if it's been forced to avoid gibberish, it can conceal information with stenography.

And the experiment where they change "rabbit" to "mouse" in the explanation provides evidence that this might be happening. It was only successful 50% of the time, which might mean they failed to eliminate all "rabbitness" from the activation.

However, I suspect this is solvable with future work.

During training of the NLA, just munge the textural representation through a 3rd LLM: Have it randomly reorder and reword the explication into various different forms (use synonyms, different dialects), destroying any side-channels that aren't human readable.

The NLA would be forced to use human readable representations to get a successful round trip.

dontlikeyoueith 9 hours ago||

> The NLA would be forced to use human readable representations to get a successful round trip.

That still doesn't guarantee any semantic correspondence between the human readable representation and the model's "thinking".

The child's game of "Opposite Day" is a trivial example of encoding internal thoughts in language in a way that does not correspond to the normal meaning of the language.

chilmers 1 hour ago||

They tested for this. From the paper:

“We find little evidence of steganography in our NLAs. Meaning-preserving transformations, like shuffling bullet points, paraphrasing, or translating the explanation to French, cause only small drops in FVE, and this gap does not widen over training.”

azakai 11 hours ago||

I had the same question. I think that could be answered by using the predicted activation, but I don't see that in the paper.

That is, rather than just translate activation to text, then text to activation, that final activation could then be applied to the neural network, and it would be allowed to continue running from there.

If it kept running in a similar way, that would show that the predicted activation is close enough to the original one. Which would add some confidence here.

But a lot better would be to then do experiments with altered text. That is, if the text said "this is true" and it was changed to "this is false", and that intervention led to the final output implying it was false, that would be very interesting.

This seems obvious but I don't see it mentioned as a future direction there, so maybe there is an obvious reason it can't work.

zozbot234 11 hours ago||

> But a lot better would be to then do experiments with altered text. That is, if the text said "this is true" and it was changed to "this is false", and that intervention led to the final output implying it was false, that would be very interesting.

They do essentially that with the rhyming example, changing "rabbit" in the explanation to "mouse" and generating text that's consistent with that change.

azakai 10 hours ago|||

Thanks! I missed that part before.

minimaltom 12 hours ago||

Between this, the emotions paper, golden gate claude etc, it doesn't seem like such a stretch that Anthropic are doing some kind of activation steering as part of training (and its part of their lead)

2001zhaozhao 12 hours ago|

it could be helpful in gettig their learnings to generalize from RL

Escapade5160 7 hours ago|

Am I correct in my understanding that they are not actually able to 100% know what Claude is thinking? They have trained a new model to make a guess about what Claude is thinking, but we cannot validate that the guess is 100% valid, right? They are basically saying "we have trained a model to reaffirm what we believe Claude is thinking" ? Hoping I'm wrong in my understanding of this because this does not appear to be good research to me.

kovek 6 hours ago||

Maybe you can't 100% know what every layer "thinks", if you go through all the layers, you might see a cohesive "thinking" story. So, if there is any information you lose at layer N, you might learn some of it in layer N+1. The masking in the layers is not deterministic so the model can't really consistently lie throughout the layers. It doesn't chose what information we get to inspect. There might be a game of whack-a-mole, but you might get a general sentiment. I think the more layers there are, the more the model itself can hide very nuanced lies (But by that time we'd have a better mind-reading model).

However, I haven't read about it yet. I'm really excited to look into it!

red75prime 7 hours ago||

> "we have trained a model to reaffirm what we believe Claude is thinking" ?

It's more like "We have trained a model to produce a text that allows reconstruction of activations and the text happened to coincide with the results of other interpretability methods even after extensive training, while we expected it to devolve into unintelligible mess."

They found something unexpected and useful. They report it, while outlining limitations and ways to improve. It looks like a fine research to me.

More comments...