Posted by tosh 2 days ago
Thank God there are still never-ending wars, otherwise authoritarian governments would have no fun left.
And people keep comparing compulsive binge watching to the "infinite jest" from D.F. Wallace (I could not tell, the brick is sitting barely touched on my shelves, but I'm not insulting the future.)
I'm tired of living in an ironic remix of everyone's favorite dystopia. Time for someone to write optimistic sci-fi to give everyone something nice to implement when they're adults.
Bring us back Jules Verne. Let's have the Jetsons' life for real. Put Ted Lasso in space.
Given their training material, "futuristic stories with nice people getting their happy ending" is not something big tech AI is going to spit out anytime soon, so that's a niche to take on!
Is what cavemen sound like the same in every culture? Like I know that different cultures have different words for "woof" or "meow"; so it stands to reason maybe also for caveman speech?
Like "Sea world" or "see the world".
I don't think it would be fundamentally very surprising if something like this works, it seems like the natural extension to tokenisation. It also seems like the natural path towards "neuralese" where tokens no longer need to correspond to units of human language.
Quite often on reddit I'll write two paragraphs and get told "I'm not reading all that".
Really? Has basic reading become a Herculean task?
I find LLM slop much harder to read than normal human text.
I can't really explain it, it's just a feeling.
The feeling that it draaaags and draaaaaags and keeeeeps going on and on and on before getting to the point, and by the time I'm done with all the "fluff", I don't care what the text is about anymore, I just want to lie down and rest.
But realistically, I am not going to read every online comment carefully because the SNR is low, especially on Reddit. Make your case concisely and meaningfully.
But combining this with caveman? Gold!
https://developers.openai.com/api/reference/resources/respon...
I don't know their internal eval, but I think I have heard it does not hurt or improve performance. But at least this parameter may affect how many comments are in the code.
I.e. by demanding that the model be concise, you're literally making it dumber.
(Separating out "chain of thought" into "thinking mode" and removing user control over it definitely helped with this problem.)
> cutting ~75% of tokens while keeping full technical accuracy.
I have no clue if this claim holds, but alas, just pretending they did not address the obvious criticism, while they did, is at the very least pretty lazy.
An explanation that explains nothing is not very interesting.
Nobody has to prove anything. Proof can give your claim credibility. If you don't provide any, an opposing claim without proof does not get any better.
“I don’t need to provide proof to say things” is a valueless, trivial assertion that adds no value whatsoever to any discussion anyone has ever had.
If you want to pretend this is a claim that should be taken seriously, a lack of evidence is damning. If you just want to pass the metaphorical bong and say stupid shit to each other with no judgment and no expectation, then I don’t know what to tell you. Maybe X is better for that.
You can read the skill. They didn't do anything to mitigate the issue, so the criticism is valid.
But they didn't address the criticism. "cutting ~75% of tokens while keeping full technical accuracy" is an empirical claim for which no evidence was provided.
For an LLM, tokens are thought. They have no ability to think, by whatever definition of that word you like, without outputting something. The token only represents a tiny fraction of the internal state changes made when a token is output.
Clearly there is an optimum for each task (not necessarily a global one), and a concrete model can be arbitrarily far from it on a given task. But you'd need to test it out for each case, not just assume that "less tokens = more better". You can be forcing your model to be dumber without realizing it if you're not testing.
When producing a token the model doesn't just emit the final token but you also have the entire hidden states from previous attention blocks. These hidden states are mixed into the attention block of future tokens (so even though LLMs are autoregressive where a token attends to previous tokens, in terms of a computational graph this means that the hidden states of previous tokens are passed forward and used to compute hidden states of future tokens).
So no, it's not wasteful: those low-perplexity tokens are precisely the spots that can instead be used to plan ahead and do useful computation.
Also I would not be sure that even the output tokens are purely "filler". If you look at raw CoT, it often has patterns like "but wait!" that are emitted by the model at crucial pivot points. Who's to say that the "you're absolutely right" doesn't serve some similar purpose of forcing the model in one direction by adjusting its priors?
Do you know that is true? These aren’t just tokens, they’re tokens with specific position encodings preceded by specific context. The position as a whole is a lot richer than you make it out to be. I think this is probably an unanswered empirical question, unless you’ve read otherwise.
The output is "just tokens"; the "position encodings" and "context" are inputs to the LLM function, not outputs. The information that a token can carry is bounded by the entropy of that token. A highly predictable token (given the context) simply can't communicate anything.
Again: if a tiny language model or even a basic markov model would also predict the same token, it's a safe bet it doesn't encode any useful thinking when the big model spits it out.
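To put rough numbers on the "a highly predictable token can't communicate anything" point: the information carried by one outcome of probability p is its surprisal, -log2(p) bits. A minimal sketch, assuming made-up next-token probabilities (they are not measured from any real model):

```python
import math

def surprisal_bits(p: float) -> float:
    """Information content, in bits, of an outcome that had probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

# Hypothetical next-token probabilities, chosen only for illustration.
examples = {
    "filler token (p=0.99)": 0.99,
    "coin-flip token (p=0.5)": 0.5,
    "surprising token (p=0.001)": 0.001,
}

for label, p in examples.items():
    print(f"{label}: {surprisal_bits(p):.3f} bits")
```

This only measures what the emitted token itself can carry. It says nothing about the hidden-state computation discussed elsewhere in the thread.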
train an LLM to leave out the filler words, and see it get the same performance at a lower cost? or do it at token selection time?
Or if you prefer, here's a Galilean thought experiment: gin up a script to get a large language model and a tiny language model to predict the next token in parallel; when they disagree, append the token generated by the large model. Clearly the large model will not care that the "easy" tokens were generated by a different model - how could it even know? Same token, same result. And you will find that the tokens that they agree on are, naturally, the filler words.
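The script described in that thought experiment could look roughly like this. It is a toy sketch: `large_model` and `tiny_model` are hypothetical stand-ins (a lookup into a fixed sentence and a small bigram table), not real language models, and decoding is greedy. It shows the mechanics of the experiment, not the empirical claim about which tokens real models agree on.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[str]], str]

TARGET = "the square root of 256 is 16 <eos>".split()

def large_model(context: List[str]) -> str:
    """Toy stand-in for a large model: always continues TARGET correctly."""
    return TARGET[len(context)]

def tiny_model(context: List[str]) -> str:
    """Toy stand-in for a tiny model: only knows a few common continuations."""
    if not context:
        return "the"
    bigram = {"the": "square", "square": "root", "root": "of", "256": "is"}
    return bigram.get(context[-1], "<unk>")

def generate(large: Model, tiny: Model, max_len: int = 32) -> Tuple[List[str], List[bool]]:
    """Run both models in parallel; on disagreement keep the large model's token."""
    out: List[str] = []
    agreed: List[bool] = []
    while len(out) < max_len:
        big, small = large(out), tiny(out)
        agreed.append(big == small)
        out.append(big)  # identical to `small` whenever the two models agree
        if big == "<eos>":
            break
    return out, agreed

tokens, agreed = generate(large_model, tiny_model)
easy = [t for t, a in zip(tokens, agreed) if a]
hard = [t for t, a in zip(tokens, agreed) if not a]
print("output:", " ".join(tokens))
print("both models agreed on:", easy)
print("only the large model produced:", hard)
```

By construction the output is exactly what the large model would have produced alone. The open question in the thread is whether that equivalence tells you anything about what the large model computed internally at the "easy" positions.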
To be clear, this observation merely debunks the idea that filler words encode useful information, that they give the LLM "room to think". It doesn't directly imply that an LLM that omits filler words can be just as smart, or that such a thing is trivial to make. It could be that highly predictable words are still important to thought in some way. It could be that they're only important because it's difficult to copy the substance of human thought without also capturing the style. But we can be very sure that what they aren't doing is "storing useful intermediate results".
Tokens are how an LLM works things out, but I think it's just as likely as not that LLMs (like people) are capable of overthinking things to the point of coming to a wrong answer when their "gut" response would have been better. I do not contend that this is the default mode, only that it is possible, and that it's more likely on some kinds of problems than on others, problem categories to be determined.
A specific example of this was the era of chat interfaces that leaned too far in the direction of web search when responding to user queries. No, claude, I don't want a recipe blogspam link or summary - just listen to your heart and tell me how to mix pancakes.
More abstractly: LLMs give the running context window a lot of credit, and will work hard to post-hoc rationalize whatever is in there, including any prior low-likelihood tokens. I expect many problematic 'hallucinations' are the result of an unlucky run of two or more low probability tokens running together, and the likelihood of that happening in a given response scales ~linearly with the length of response.
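The "scales ~linearly with length" intuition can be checked with a back-of-the-envelope model. Assume, purely for illustration, that each position independently starts an "unlucky run" with some small probability q. Then the chance of at least one such run in n tokens is 1 - (1 - q)^n, which is close to n*q while n*q is small. Both the independence assumption and the value of q are made up:

```python
def p_at_least_one(q: float, n: int) -> float:
    """Probability of at least one 'unlucky run' in n positions,
    assuming each position starts one independently with probability q."""
    return 1.0 - (1.0 - q) ** n

q = 1e-4  # hypothetical per-position probability, not a measured value

for n in (100, 500, 1000, 2000):
    exact = p_at_least_one(q, n)
    linear = n * q  # first-order approximation
    print(f"n={n:5d}  exact={exact:.4f}  linear approx={linear:.4f}")
```

The linear approximation always overestimates slightly and drifts further from the exact value as n*q grows, so "~linearly" holds only while the overall failure probability is small.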
LLMs do stumble into long prediction chains that don’t lead the inference in any useful direction, wasting tokens and compute.
Additionally, LLMs do not actually operate in text; much of the thinking happens in a much higher dimensional space that just happens to be decoded as text.
So unless the LLM was trained otherwise, making it talk like a caveman is more than just theoretically turning it into a caveman.
What do you mean by that? It’s literally text prediction, isn’t it?
I have a list of numbers, 0 to 9, and the + and = operators. I will train my model on this dataset, except the model won't get the list; it will get a bunch of addition problems. A lot. Not every possible addition problem inside that space will be represented, not by a long shot, and neither will every number. But still, the model will be able to solve any math problem you can form with those symbols.
It’s just predicting symbols, but to do so it had to internalize the concepts.
This gives the impression that it is doing something more than pattern matching. I think this kind of communication where some human attribute is used to name some concept in the LLM domain is causing a lot of damage, and ends up inadvertently blowing up the hype for the AI marketing...
I think what's causing a lot of damage is not attributing more of human attributes (though carefully). It's not the LLM marketing you have to worry about - that's just noise. All marketing is malicious lies and abusive bullshit, AI marketing is no different.
Care about engineering - designing and securing systems. There, the refusal to anthropomorphise LLMs is doing a lot of damage and wasting a lot of effort, with a good chunk of the industry believing in the "lethal trifecta" as if it were the holy Trinity, and convinced it's something that can be solved without losing all that makes LLMs useful in the first place. A little bit of anthropomorphising LLMs, squinting your eyes and seeing them as little people on a chip, will immediately tell you these "bugs" and "vulnerabilities" are just inseparable facets of the features we care about, fundamental to general-purpose tools, and they can be mitigated and worked around (at a cost), but not solved, not any more than you can solve "social engineering" or better code your employees so they're impervious to coercion or bribery, or being prompt-injected by a phone call from their loved one.
Anthropomorphic descriptions are the most expressive because of the fact that LLMs based on human cultural output mimic human behaviours, intrinsically. Other terminology is not nearly as expressive when describing LLM output.
Pattern matching is the same as saying text prediction. While being technically truthy, it fails to convey the external effect. Anthropomorphic terms, while being less truthy overall, do manage to effectively convey the external effect. It does unfortunately imply an internal cause that does not follow, but the externalities are what matter in most non-philosophical contexts.
But the problem is that this does not inform about the failure mode. So if I am understanding correctly, you are saying that the behavior of LLM, when it works, is like it has internalized the concepts.
But then it does not tell you that it can also say stuff that completely contradicts what it said before, thereby also contradicting the notion of having "internalized" the concept.
So that will turn out to be a lie.
I don't think they do if we are talking about an honest human being.
LLMs will happily hallucinate and even provide "sources" for their wrong responses. That single thing should contradict what you are saying.
So the conclusion was that these middle layers have their own language, and the model is converting the text into this language and then decoding it. It explains why the models sometimes switch to Chinese when they have a lot of Chinese-language inputs, etc.
You are also confusing ‘mechanistic explanation still incomplete’ with ‘empirical phenomenon unestablished.’ Those are not the same thing.
PS. Em dash? So you are some LLM bot trying to bait mine HN for reasoning traces? :D
You sound like you’re trying to sound impressive. Like I said, I’ll read the paper.
you are discovering that the favorite luddite argument is bullshit
https://machinelearning.apple.com/research/illusion-of-think...
> just look at research papers
You didn't add anything other than vibes either.
This is not how the feature called "reasoning" works in current models.
"reasoning" simply lets the model output and then consume some "thinking" tokens before generating the actual output.
All the "fluff" tokens in the output have absolutely nothing to do with "reasoning".
For example, thinking in modern US English generates many extra thoughts just to keep the speech correct for the cultural context (there is only one correct way to say People Of Color, and it changes every year; any typo makes it horribly wrong).
Some languages are far more expressive and specialized in logical conditions, conditionals, recursion and reasoning. Like eskimos have 100 words for snow, but for boolean algebra.
It is well proven that thinking in Chinese needs far fewer tokens!
With this caveman mod you strip out most of the cultural complexities of the Anglosphere, making the output easier for foreigners and far simpler to digest.
This is simply not true.
It is very arrogant to assume no other language can be more advanced than English.
Programming languages are not languages in the human brain nor the culture sense.
But I assume this has been studied? Can anyone point to papers that show it? I'd particularly like to know what the curves look like. It's clearly not linear, so if you cut out 75% of tokens, what do you expect to lose?
I do imagine there is not a lot of caveman speak in the training data so results may be worse because they don’t fit the same patterns that have been reinforcement learned in.
So it must be studied and at least be proven effective in practice to be so universally used now.
Someone else posted a few articles like this in the thread above but there’s probably more and better ones if you search. https://news.ycombinator.com/item?id=47647907
Do LLMs generally perform better in verbose languages than they do in concise ones?
Yeah, definitely. It lacks case and verb conjugations, plus whole classes of filler words, and words themselves are on average substantially shorter. If you listen to or read a hyper-literal transliteration of Chinese speech into English (you can find fun videos of this on Chinese social media), it even resembles "caveman speech" for those reasons.
If you look at translated texts and compare the English versions to the Chinese ones, the Chinese versions are substantially shorter. Same if you compare localization strings in your favorite open-source project.
It's also part of why Chinese apps are so information-dense, and why localizing to other languages often requires reorganizing the layout itself— languages like English just aren't as information-dense, pixel for pixel.
The difference is especially profound for vernacular Chinese, which is why Chinese people often note that text which "has a machine translation flavor" is over-specified and gratuitously prolix.
Maybe some of this washes out in LLMs due to tokenization differences. But Chinese texts are typically shorter than English texts and it extends to prose as well as poetry.
But yeah this is standard stuff: Chinese is more concise and more contextual/ambiguous. More semantic work is allocated in interpretation than with English, less is allocated in the writing/speaking.
Do you speak Chinese and experience the differences between Chinese and English differently? I'm a native English speaker and only a beginner in Chinese but I've formed these views in discussion with Chinese people who know some English as well.
I'm also more curious about tokenizers for LLMs than I've ever been before, both for Chinese and English. I feel like to understand I'll need to look at some concrete examples, since sometimes tokenization can be per word or per character or sometimes chunks that are in between.
There’s a less magical model of how LLMs work: they are essentially fancy autocomplete engines.
Most of us probably have an intuition that the more you give an autocomplete, the better results it will yield. However, does this extend to output of the autocomplete—i.e. the more tokens it uses for the result, the better?
It could well be true in context of chain of thought[0] models, in the sense that the output of a preceding autocomplete step is then fed as input to the next autocomplete step, and therefore would yield better results in the end. In other words, with this intuition, if caveman speak is applied early enough in the chain, it would indeed hamper the quality of the end result; and if it is applied later, it would not really save that many tokens.
Willing to be corrected by someone more familiar with NN architecture, of course.
[0] I can see “thinking” used as a term of art, distinct from its regular meaning, when discussing “chain of thought” models; sort of like what “learning” is in “machine learning”.
As I understand it, the claim is: more tokens = more computation = more "thinking" => answer probably better.
Say that limit is X. This means if your problem fundamentally requires at least Y compute to be solved, your machine will never give you a reliable answer in fewer than ceil(Y/X) steps.
LLMs are like this - a loop is programmed to step the CPU/turn the crank until the machine emits a magic "stop" token. So in this sense, asking an LLM to be concise means reducing the amount of compute it can perform, and if you insist on it too much, it may stop so early as to fundamentally have been unable to solve the problem in the computational space allotted.
This perspective requires no assumptions about "thinking" or anything human-like happening inside - it follows just from time and energy being finite :).
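The step-counting argument above can be written down as a toy model: a machine that can do at most a fixed amount of compute per step needs at least the required compute divided by that per-step limit, rounded up, before it can possibly be done. The numbers and function names below are hypothetical. Real models do not expose "compute per token" this cleanly.

```python
from typing import Optional

def min_steps(required_compute: int, per_step_limit: int) -> int:
    """Smallest number of steps that can cover the required compute (integer ceil)."""
    return (required_compute + per_step_limit - 1) // per_step_limit

def run(required_compute: int, per_step_limit: int, max_steps: int) -> Optional[int]:
    """Turn the crank until enough compute has been spent, or the step budget runs out.

    Returns the step at which the work completed, or None if the machine
    was forced to stop before it could have finished.
    """
    done = 0
    for step in range(1, max_steps + 1):
        done += per_step_limit
        if done >= required_compute:
            return step
    return None

# Hypothetical numbers: the problem needs 1000 units, each step provides at most 64.
print("minimum steps:", min_steps(1000, 64))
print("generous budget:", run(1000, 64, max_steps=20))
print("'be concise' budget:", run(1000, 64, max_steps=10))
```

With a step budget below the minimum, `run` returns `None` no matter how the steps are spent, which is the sense in which forced conciseness can make a problem unsolvable rather than just terser.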
--
[0] - I strongly think the industry is doing a huge disservice by refusing to anthropomorphize LLMs, as treating them as "little people on a chip" is the best high-level model we have for understanding their failure modes and role in larger computing systems - and instead, we just have tons of people wasting their collective efforts trying to fix the "lethal trifecta" as if it were a software bug and not a fundamental property of what makes LLMs interesting. Already wrote more on it in this thread, so I'll stop here.
Benchmark or nothing.
It's a significantly more succinct semantic encoding than English while being able to express all the same concepts, since it encodes a lot of glue words into the grammar of the language, and conventionally lets you drop many pronouns.
e.g.
"I would have walked home, but it seemed like it was going to rain" (14 words) -> "Domum ambulavissem, sed pluiturum esse videbatur" (6 words).
However, another potential issue is that LLMs are continuation engines, and I'd have thought that talking like a caveman may be "interpreted" as meaning you want a dumbed down response, not just a smart response in caveman-speak.
It's a bit like asking an LLM to predict the next move in a chess game - it's not going to predict the best move that it can, but rather predict the next move that would be played given what it can infer about the Elo rating of the player whose moves it is continuing. If you ask it to continue the move sequence of a poor player, it'll generate a poor move since that's the best prediction.
Of course there's not going to be a lot of caveman speak on stack overflow, so who knows what the impact is. Program go boom. Me stomp on bugs.
But does talk like caveman make number go down? Less token = less think?
I also wondered, due to the way LLMs work, if I ask AI a question using fancy language, does that make it pattern match to scientific literature, and therefore increase the probability that the output will be true?
Not everybody is Dijkstra.
https://platform.claude.com/docs/en/build-with-claude/extend...
Nothing on that page indicates otherwise.
https://docs.aws.amazon.com/bedrock/latest/userguide/claude-...
## More tokens = smarter outputs
When an LLM uses tokens, it is putting more information into its context
## Better context, better results
The more information the LLM has in its context, the more complete and well thought-through the outputs will be
## More complete thinking
When an LLM is able to iterate on itself, results improve
## Better shareholder value
Numbers need to go up in order for us to maintain our shareholder value. This means that instead of focusing on qualitative results, the brand should focus on quantitative, hard results
Forcing it to be concise doesn't work because it wasn't trained on token strings that short.
This is a 2023-era comment and is incorrect.
> but mmuh latest SOTA from CloudCorp (c)!
You don't know how these things work and all you have to go on is marketing copy.
You also aren't aware that there's more to it than "LLM architecture". And you're rather confident despite your lack of knowledge.
You're like the old LLMs before ChatGPT was released that were kinda neat, but usually wrong and overconfident about it.
The only new innovation is MoE, something that's used to optimize local models and not for the "SOTA" cloud offerings you're so fond of.
Diffusion for text is not even an academic toy at this point and will likely never be a real thing.
https://arxiv.org/abs/2112.00114 https://arxiv.org/abs/2406.06467 https://arxiv.org/abs/2404.15758 https://arxiv.org/abs/2512.12777
First that scratchpads matter, then why they matter, then that they don’t even need to be meaningful tokens, then a conceptual framework for the whole thing.
Did you test that "caveman mode" has similar performance to the "normal" model?
A lot of communication is just mentioning the concepts.
Funny idea though. And I’d like to see a more matter-of-fact output from Claude.
Take it a step further and do kind of like that xkcd where you try to post and it rewrites it like this and if you want the original version you have to write a justification that gets posted too.
Chef's kiss
Compare with fluid dynamics; it's not hard to write down the Navier–Stokes equations, but there's a million dollars available to the first person who can prove or give a counter-example of the following statement:
In three space dimensions and time, given an initial velocity field, there exists a vector velocity and a scalar pressure field, which are both smooth and globally defined, that solve the Navier–Stokes equations.
- https://en.wikipedia.org/wiki/Navier–Stokes_existence_and_sm...
Seems reasonable, but this doesn't settle probably-empirical questions like: (a) to what degree is 'more' better? (b) how important are filler words? (c) how important are words that signal connection, causality, influence, reasoning?
So it's probably true that the "Great question!---" type preambles are not helpful, but that there's definitely a lower bound on exactly how primitive of a caveman language we're pushing toward.
> Someone didn't get the memo that for LLMs, tokens are units of thinking.
Where do you get this memo? Seems completely wrong to me. More computation does not translate to more "thinking" if you compute the wrong things (i.e. things that don't contribute significantly to the final sentence's meaning).
E.g. instead of: "The square root of 256 is" you'd enter "errr The er square um root errr of 256 errr is" and it would miraculously get better? The model can't differentiate between words you entered and words it generated itself...
This only makes sense if you assume that you are the consumer of the response. When compacting, harnesses typically save a copy of the text exchange but strip out the tool calls in between. Because the agent relies on this text history to understand its own past actions, a log full of caveman-style responses leaves it with zero context about the changes it made, and the decisions behind them.
To recover that lost context, the agent will have to execute unnecessary research loops just to resume its task.
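A rough sketch of the compaction step described above, to make the failure mode concrete. Everything here is hypothetical: the `Message` type, the role names, and the `compact` function are illustrative and do not correspond to any particular harness's API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Message:
    role: str  # "user", "assistant", "tool_call" or "tool_result" (hypothetical roles)
    text: str

def compact(history: List[Message]) -> List[Message]:
    """Keep only the text exchange; drop tool calls and their results."""
    return [m for m in history if m.role in ("user", "assistant")]

def make_history(final_reply: str) -> List[Message]:
    """Build the same toy session, differing only in the assistant's summary."""
    return [
        Message("user", "Fix the failing login test."),
        Message("tool_call", "read_file('auth.py')"),
        Message("tool_result", "...file contents..."),
        Message("tool_call", "edit_file('auth.py', ...)"),
        Message("tool_result", "ok"),
        Message("assistant", final_reply),
    ]

verbose = compact(make_history(
    "I changed auth.py so token expiry is compared in UTC; "
    "the test failed because of a local-time comparison."
))
caveman = compact(make_history("Fix done. Test pass."))

for name, log in (("verbose", verbose), ("caveman", caveman)):
    print(f"--- {name} log after compaction ---")
    for m in log:
        print(f"{m.role}: {m.text}")
```

After compaction, the verbose log still records which file changed and why. The caveman log keeps neither, so the agent has to rediscover both.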