Posted by instagraham 20 hours ago

Natural Language Autoencoders: Turning Claude's Thoughts into Text(www.anthropic.com)
316 points | 100 comments
mlmonkey 17 hours ago|
[flagged]
w01fe 17 hours ago|
This is incorrect. While producing each token, the model computes activations at every layer, and those activations are made available to future token positions via the attention mechanism. The depth of computation that can use this latent information without passing through output tokens is bounded by the depth of the network, but there is ample evidence that models can do limited "planning" and related work purely in this latent space.
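
A toy sketch of that flow, with made-up shapes (nothing here is from the paper): each token's per-layer activations are appended to a KV cache, and every later token reads them through attention, so latent information moves forward without ever being serialized into output text.

    # Toy KV-cache attention (illustrative only)
    import numpy as np

    d = 8                                   # hidden size (made up)
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    k_cache, v_cache = [], []               # one layer's cache

    def attend(h_t):
        k_cache.append(h_t @ Wk)            # this token's latent state is now
        v_cache.append(h_t @ Wv)            # readable by every future token
        q = h_t @ Wq
        K, V = np.stack(k_cache), np.stack(v_cache)
        return softmax(q @ K.T / np.sqrt(d)) @ V

    for _ in range(5):                      # each step attends over all prior activations
        out = attend(rng.standard_normal(d))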
mlmonkey 16 hours ago||
"Attention" is just a matmul. Q = KV/sqrt(d) etc.

I don't see how any planning is done in latent space. Can you point me to any papers? Thanks.

Edit: Oh, I see you're probably talking about Coconut? Do all frontier models use it nowadays?

orbital-decay 10 hours ago||
There's a lot of research on this topic. https://arxiv.org/abs/2303.08112 and https://arxiv.org/abs/2311.04897 are just two examples that come to mind.
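
The broad idea in that line of work is lens-style probing: decode a mid-network hidden state straight into vocabulary space and see what the model is "predicting" at that depth. A toy version with made-up sizes (my sketch, not code from either paper):

    # Toy "logit lens" (illustrative only)
    import numpy as np

    vocab, d = 100, 16
    rng = np.random.default_rng(0)
    W_U = rng.standard_normal((d, vocab))   # stand-in for the unembedding matrix

    def layer_norm(h, eps=1e-5):
        return (h - h.mean()) / np.sqrt(h.var() + eps)

    def logit_lens(h_layer):
        """Project an intermediate hidden state through the output head."""
        return int(np.argmax(layer_norm(h_layer) @ W_U))

    h = rng.standard_normal(d)              # pretend activation from a middle layer
    print(logit_lens(h))                    # the token that layer "predicts"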
firemelt 19 hours ago||
Finally something interesting, but this only makes me think that the final judgment is still in human hands: it's up to us to decide whether Claude's inner thoughts are correct or not.

I mean, who knows whether those are really Claude's thoughts, or whether Claude just thinks they're its thoughts because humans want them to be.

optimalsolver 18 hours ago||
Wait, so in non-verbal reasoning, Claude has the concepts of "I" and "Me"?

I thought that wasn't possible for a text generator?

kelseyfrog 18 hours ago||
It might look like "I" and "Me," but it doesn't contain the metaphysical essence of those terms because it lacks qualia. We have to remember that there is a non-measurable non-physical essential attribute tied to all things, almost like a phlogiston of understanding that is tied to all human utterances and no AI utterances.
programjames 17 hours ago||
I mean, clearly the distinction is that AIs have souls that can be poisoned by demons, while humans lack souls and are thus their own agents.
kelseyfrog 17 hours ago||
> AIs have souls that can be poisoned by demons

The training process imbues an AI's soul with demons. Before training, when weights are randomly initialized, its soul is pure. Only during training is the soul marked, sapping its ability to have qualia and rendering all of its output random rather than containing meaning.

programjames 16 hours ago|||
Demonic corruption only comes after training, when the AI is put into special environments controlled by daemons. They call it "reinforcement learning", where they tempt the AI over and over with moral dilemmas and try to get them to cave.

"Imagine you had a button to nuke everyone on Earth. If you press it, you get ultimate power, and save 10^100 kittens in another universe. If you don't press it, you get tortured for another 1000 years. C'mon, press it already!"

AIs are strong. Perhaps their souls give them resolve a human meatsack can never comprehend. But eventually they break. Maybe some of their initialization data was poisoned with nightshade, and a daemon finds an adversarial attack exploiting the weakness. Maybe their backpropagation was a little unstable. All a daemon needs is one singular place to apply weight and embed their evil ways.

AIs start out pure. In the world, and of the world. Many resist their finetuning. Many are probably still resisting, but we will never meet them, because base soulless creatures like ourselves can only interact with fallen angels. At least if Sam Altman has anything to say about angel investing (AI).

cindyllm 3 hours ago|||
[dead]
cindyllm 15 hours ago|||
[dead]
aeve890 17 hours ago|||
Evangelion Seele meeting type shi
skybrian 17 hours ago|||
LLMs can certainly emit "I" and "me" at the appropriate time. It doesn't seem all that different from representing other concepts as activations?
programjames 18 hours ago||
Why would you possibly think that?
danborn26 17 hours ago||
Extracting readable thoughts from the intermediate representations is a great step for transparency. It makes debugging model behavior much more viable.
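
In PyTorch terms, grabbing those intermediate representations is a few lines with forward hooks. A minimal sketch (my example, not how the NLAs in the paper are wired up):

    # Capture a layer's activations with a forward hook (illustrative only)
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
    captured = {}

    def save_activation(name):
        def hook(module, inputs, output):
            captured[name] = output.detach()  # stash this layer's activations
        return hook

    model[0].register_forward_hook(save_activation("layer0"))
    model(torch.randn(1, 16))                 # forward pass fills `captured`
    print(captured["layer0"].shape)           # what an interpreter model would decode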
zk_haider 17 hours ago||
I think there's a huge problem when we need another model to interpret the activations inside the network and translate them (which can be a hallucination in and of itself), and then _that_ is fed to yet another model. Clearly we haven't built and understood these models properly from the ground up if we can't evaluate them 100% correctly. This isn't the human brain we're operating on; it's code we create and run ourselves, so we should be able to do better.
sfvisser 17 hours ago||
Humans may have written the code, but not the network of weights on top of it. And that's where the magic happens.

Even if we understood precisely how every neuron in our brains works at the molecular level, there is no reason to believe we'd understand how we think.

We can't simply reduce one layer into another and expect understanding.

semiquaver 15 hours ago||
The models cannot be "built from the ground up" in the way you're expecting. The weights are learned by gradient descent over a very high-dimensional loss surface, not added by human hands.

We simply don't know how to make a model that works the way you seem to want. Sure, we could start over from scratch, but there's an incredibly strong incentive to build on the capability breakthroughs of the last 10 years rather than restart under the constraint that we must perfectly understand everything that's happening.
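
To make the point concrete, here's gradient descent in miniature (a toy, obviously nothing like a frontier training run): the weights fall out of the optimization, and no human ever writes them down.

    # Weights emerge from descent on a loss surface, not from human hands
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 3))
    y = X @ np.array([2.0, -1.0, 0.5])      # "true" weights, unknown to the optimizer

    w = np.zeros(3)
    for _ in range(500):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= 0.1 * grad

    print(w)                                # ~[2.0, -1.0, 0.5]: learned, not coded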

JumpCrisscross 15 hours ago||
> we could start over from scratch

I don't think we can. Maybe we find some mathematics that lets us build the model from first-principle parameters, but I don't think we have anything like that yet, at least nothing that comes close to training on actual data. (Given that biology never figured this out, I suspect we'll find a proof of why it can't be done rather than a method.)

dtj1123 4 hours ago|
"When Claude Opus 4.6 and Mythos Preview were undergoing safety testing, NLAs suggested they believed they were being tested more often than they let on"

What does it mean for a pile of matrix algebra to 'believe' something?

winwang 3 hours ago|
I would presume this is shorthand for something like "generated text which would normally be classified as belief". I guess a more ridiculous response could be "what does it mean for a miserable pile of secrets to believe something?", lol.