Natural Language Autoencoders: Turning Claude's Thoughts into Text

Posted by instagraham 19 hours ago

Natural Language Autoencoders: Turning Claude's Thoughts into Text(www.anthropic.com)

304 points | 99 commentspage 3

kurnoolion 9 hours ago|

So, this is like reading EKG of human brain and understand its thoughts?

hansmayer 16 hours ago||

Claude's "Thougts" - get outta here you gits :)

btown 10 hours ago||

I find it fascinating how they were able to keep the reconstruction error function incredibly simple, literally its success in round-tripping the activation layer, while making it interpretable... simply by choosing a good data-driven initialization state, and (effectively) training slowly.

I guess "initialization is all you need!"

From the paper https://transformer-circuits.pub/2026/nla/index.html :

> We find that simply initializing the AV and AR as copies of M leads to unstable training: the AV in particular, having never encountered a layer-l activation as a token embedding, outputs nonsensical explanations. We therefore initialize the AV and AR with supervised fine-tuning on a text-summarization proxy task. Specifically, we compute layer-l activations from the final token of randomly truncated pretraining-like text snippets, and use Claude Opus 4.5 to generate summaries s of the text up to that token (see the Appendix for details of this procedure). We then fine-tune the AV and AR on (h_l,s) and (s,h_l) pairs respectively. This warm-start typically yields an FVE of around 0.3-0.4. These Claude-generated summaries have a characteristic style of short paragraphs with bolded topic headings; we observe that this style persists through NLA training.

And from the appendix:

> We generate warm-start data for the AV and AR by prompting Claude Opus 4.5 to produce summaries of contexts, using the prompt below. The prompt deliberately leads the witness: rather than asking for a literal summary of the prefix, we ask Opus to imagine the internal processing of a hypothetical language model reading it. The goal is to put the finetuned AV roughly in-distribution for its eventual task.

tjohnell 18 hours ago||

It will inevitably learn how to think in a way that translates to one (moral) meaning and back but has an ulterior meaning underneath.

gavmor 17 hours ago||

Something like a textual steganography?

Ursula K. Le Guin: 'The artist deals with what cannot be said in words. The artist whose medium is fiction does this in words.'

rotcev 18 hours ago|||

This is exactly what I first thought. “The user appears to be attempting to decode my previous thought process, …”, the question is whether or not the model will be able to internalize this in such a way that is undetectable to the aforementioned technique.

astrange 17 hours ago||

That shouldn't happen as long as the autoencoder isn't used as an RL reward. It will happen (due to Goodhart's law) if it is.

Of course, if you use it to make any decision that can still happen eventually.

bilsbie 12 hours ago||

Could you use this to see what facts a model knows?

bilsbie 12 hours ago||

How does this differ from golden gate Claude?

hijohnnylin 11 hours ago|

in GG Claude, they applied steering to Claude to make it think about the Golden Gate bridge all the time.

here, they don't modify/steer the base model. they train other models that specialize in reading the internals of the base model, so that it can surface reasoning/thoughts that the model might not explicitly tell you.

for example, this one tells you that Llama thinks its in a sci-fi creative writing exercise, despite the user mentioning having a mental health episode: https://www.neuronpedia.org/nla/cmonzq63g0003rlh8xi9onjnn

seba_dos1 5 hours ago||

Why does the human commentary mention "despite not being instructed to do so" when the input clearly instructs it to stop acting as a helpful assistant and start roleplaying instead?

spacebacon 15 hours ago||

Attach the SRT to your frozen model Anthropic. Problem solved. https://github.com/space-bacon/SRT.

drdeca 12 hours ago||

I see your repository’s README says

> Language models process signs (representamens) but are blind to when meaning forks — when the same word means different things to different communities.

But, haven’t interpretability results shown that these models internally represent several meanings of the same word, differently? In that case, why would they not already do the same for how words are used differently in different communities?

spacebacon 8 hours ago||

[dead]

spacebacon 15 hours ago||

[dead]

az226 12 hours ago||

This is very cool

micalo 9 hours ago||

[dead]

arian_ 16 hours ago|

[dead]

More comments...