Posted by tosh 1 day ago
This skill is not intended to reduce hidden reasoning / thinking tokens. Anthropic’s own docs suggest more thinking budget can improve performance, so I would not claim otherwise.
What it targets is the visible completion: less preamble, less filler, less polished-but-nonessential text. And since it's the prose around the code that gets "cavemanned", the code itself isn't affected by the skill at all :)
I'm also surprised to see so little faith in RL. I'm fairly sure Anthropic's models have been tuned so heavily to be coding agents that you can't "force" them to degrade dramatically.
The fair criticism is that my “~75%” README number is from preliminary testing, not a rigorous benchmark. That should be phrased more carefully, and I’m working on a proper eval now.
Also yes, skills are not free: Anthropic notes they consume context when loaded, even if only skill metadata is preloaded initially.
So the real eval is end-to-end (rough sketch below for the first three):
- total input tokens
- total output tokens
- latency
- quality/task success
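A minimal harness sketch, assuming the Anthropic Python SDK; the model id, the "caveman" system prompt, and the test task are all placeholders, and quality/task success would still need a judge model or task-specific checks on top:

```python
import time
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CAVEMAN = "Respond in terse caveman style. Short words. No filler."  # placeholder

def run(prompt, system=None):
    kwargs = {"system": system} if system else {}
    t0 = time.monotonic()
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    return {
        "input_tokens": msg.usage.input_tokens,    # total input side
        "output_tokens": msg.usage.output_tokens,  # total output side
        "latency_s": round(time.monotonic() - t0, 2),
    }

task = "Explain the difference between TCP and UDP."  # placeholder task
print("baseline:", run(task))
print("caveman: ", run(task, system=CAVEMAN))
```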
There is actual research suggesting concise prompting can reduce response length substantially without always wrecking quality, though it is task-dependent and can hurt in some domains. (https://arxiv.org/html/2401.05618v3)
So my current position is: interesting idea, narrower claim than some people think, needs benchmarks, and the README should be more precise until those exist.
It's kind of great for the "eli5" use, not because it's any more right or wrong, but sometimes the caveman framing presents something to me in a way that's almost... really clear and simple. It feels like it cuts through bullshit just a smidge. Seeing something framed by a caveman has, on a couple of occasions, peeled back a layer I didn't see before.
For whatever reason, it is somehow useful to me, the human. Maybe seeing it laid out in caveman bullet points gives you this weird brevity that processes a little differently. If you layer in caveman talk about caves, tribes, etc., it takes on a sort of primal, survivalist framing, which can oddly enough help me process an understanding.
Plus it makes me laugh, which keeps me in a good mood.
The same site that complains so much about replication crises in science too...
It joke. No yell at me. It kind of work?
Anecdote: I discussed this with an LLM once, and it explained that LLMs tend to respond to terse questions with terse answers because that's what humans (i.e. their training data) tend to do. Similarly, it explained that polite requests tend to lead to LLM responses with _more_ information than a response strictly requires, because (again) that's what their training data suggests is correct (i.e. that's how humans tend to respond).
TL;DR: how they are asked questions influences how they respond, even if the facts of the differing responses don't materially differ.
(Edit: Seriously, I do not understand the continued downvoting of completely topical responses. It's gotten so bad I have little choice but to assume it's a personal vendetta.)
LLMs can "think", but that requires a lot of tokens; quick answers are just human answers, or whatever they were fed, plus some basic pattern matching / interpolation.
Do you have any idea how dumb this sounds?
How long an LLM's response is depends entirely on the system prompt and the model itself. You can read all the "LLM research" in the world and it's not going to give you a correct generalized answer about this topic. It's not like this is some inherent property of LLMs.
That much is, again, obvious. My previous comment was addressing your ridiculing of the notion of discussing LLMs with LLMs, which was a fair reaction back in the GPT-3.5 era, but not so today.
I use speech-to-text with Claude Code and other LLMs, often with terrible grammar and lots of typos, and it never affects the output. If I went by what you're saying, the code it outputs should be sloppier. Also, the length of a response depends entirely on what I'm using: ChatGPT always gives me a long response no matter what I ask it, and the Claude app always gives short responses unless I specifically ask for something longer. That's because of how they're instructed, not anything inherent to LLMs.
The rest of what you're saying sounds fine, but that remark seems confused to me.
Prefix your prompt with "be a moron that does everything wrong and only superficially look like you're doing it correctly. make constant errors." Of course you can degrade the performance; the question is whether any particular 'output styling' actually does, and to what extent.
Measuring "degradation" for a nonsense task like the one you gave would be difficult.
To clarify, consider this gradation:
> Do task X extremely well
> Do task X poorly
> Do task X or else Y will happen
> Do task X and you get a trillion dollars
> Do task X and talk like a caveman
Do you see the problem? "Do task X" alone also cannot be a solid baseline, because there are any number of ways to specify the task itself, and each carries its own implicit bias in the track the output takes.
The argument that OP makes is that RL prevents degradation... so this should not be a problem? All prompts should be equivalent? Except it obviously is a problem, prompting does affect the output (how could it not?), _and they are even claiming their specific prompting does so too_! The claim is nonsense on its face.
If the caveman style modifier improves output, removing it degrades output and what is claimed plainly isn't the case. Parent is right.
If it worsens output, the claim they made is again plainly not the case (via inverted but equivalent construction). Parent is right.
If it has no effect, it runs counter to their central premise and the research they cite in support of it (which only potentially applies - they study 'be concise' not 'skill full of caveman styling rules'). Parent is right.
(And it's for a similar reason, I think, that deliberative models like rewriting your question in their own terms before reasoning about it. They're decreasing the per-token re-parsing overhead of attending to the prompt [by distilling a paraphrase that obviates any need to attend to the literal words of it], so that some of the initial layers that would either be doing "figure out what the user was trying to say" [i.e. "NLP stuff"] or "figure out what the user meant" [i.e. deliberative-reasoning stuff] — but not both — can focus on the latter.)
I haven't done the exact experiment you'd want to do to verify this effect, i.e. "measuring LLM benchmark scores with vs without an added requirement to respond in a certain speaking style."
But I have (accidentally) done an experiment that's kind of a corollary to it: namely, I've noticed that in the context of LLM collaborative fiction writing / role-playing, the harder the LLM has to reason about what it's saying (i.e. the more facts it needs to attend to), the spottier its adherence to any "output style" or "character voicing" instructions will be.
If you really wanted, you could just have a separate model summarize the output to remove the filler.
As those tokens flow through the QKV transforms, across 96 consecutive layers, they become the canvas where all the activations happen. Even in cases where it's possible to communicate some detail in the absolute minimum number of tokens, I think excess brevity can still limit the intelligence of the agent, because it starves its cognitive budget for solving the problem.
I always talk to my agents in highly precise language, but I let A LOT of my personality come through at the same time. I talk to them like a really good teammate, one who has a deep intuition for the problem and knows me personally well enough to talk with me in rich abstractions and metaphors, while still having an absolutely rock-solid command of the technical details.
But I do think this kind of caveman talk might be very handy in a lot of situations where the agent is doing simple obvious things and you just want to save tokens. Very cool!
Token ID 73700 is the literal entire (space-prefixed) word " strawberry". (Which neatly explains the "strawberry problem.")
Token ID 27128 is " cryptocurrency". (And 41698 is " disappointment".)
Token ID 44078 is " UnsupportedOperationException"!
Token ID 58040 is 128 spaces in a row (and is the longest token in the vocabulary.)
You'd be surprised how well this vocabulary can compress English prose — especially prose interspersed with code!
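If you want to poke at a vocabulary like this yourself, here's a minimal sketch using tiktoken. Caveat: the comment doesn't name the tokenizer those IDs come from, so cl100k_base is only a guess; swap in whichever encoding they actually belong to.

```python
# Sketch: inspecting a tokenizer vocabulary with tiktoken.
# "cl100k_base" is an assumption; the ids above may be from another encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def tok_bytes(i):
    try:
        return enc.decode_single_token_bytes(i)
    except KeyError:  # some ids are reserved/special and have no bytes
        return b""

# Decode the specific ids cited above.
for tid in (73700, 27128, 41698, 44078, 58040):
    print(tid, tok_bytes(tid))

# Find the longest token in the vocabulary.
longest = max(range(enc.n_vocab), key=lambda i: len(tok_bytes(i)))
print("longest:", longest, len(tok_bytes(longest)), "bytes")
```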
Then ARM come. ARM very RISC. ARM go in phone. ARM go in tablet. ARM go everywhere. Apple make ARM chip, beat x86 with big club. Many impressed. Now ARM take server too. x86 tribe scared.
RISC-V new baby RISC. Free for all. Many tribe use. Watch this one.
RISC win brain fight. x86 survive by lying. ARM win world.
“””
Your response: MILSPEC prose register. Max per-token semantic yield. Domain nomenclature over periphrasis. Hypotactic, austere. Plaintext only; omit bold.
“””
But you also catch a glimpse of how the author of the complaint communicates in general...
"im trying to get the ai to help with the work i am doing to give me good advice for a nice path to heloing out and anytim i askin it for help with doing this it's total trash i dunt kno what to do anymore with this dum ai is so stupid"
Everyone's interfaces, concepts, and desires are different, so the performance is wildly varied.
This is similar to frameworks: they were either godsends or curses depending on how you thought and what you were doing...
Basically, treat the LLM as a human, not as a computer. Like a junior developer or an intern (for the most part).
That said, you need to know what to ask for and how to drive the LLM in the correct direction. If you don't know anything, you're likely not going to get there.
I was very surprised to see that the response was in s-expressions too. It was incoherent, but the parens balanced at least.
Just tried it now and it doesn't seem to do that anymore.
2026 Boss: "Let's look at the AI tokens you used today."
The technology changes, but the micromanagement layer stays exactly the same.
Time is a circle, my friend. (=
Btw, your point lands just as well without the "Cute idea, but": https://odap.knrdd.com/patterns/condescending-reveal
It would be pretty fun to train an LLM on this site and then have it flag my comments before I get downvoted, haha.
I like your site; good luck with improving discourse on the Internet.
> Use when user says "caveman mode", "talk like caveman", "use caveman", "less tokens", "be brief", or invokes /caveman
For the first part of this: couldn't this just be a UserPromptSubmit hook with a regex against these?
See additionalContext in the json output of a script: https://code.claude.com/docs/en/hooks#structured-json-output
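A rough sketch of that hook, assuming the documented UserPromptSubmit payload and structured-output shape; the trigger regex and the injected instruction are made up for illustration, not taken from the skill:

```python
#!/usr/bin/env python3
# UserPromptSubmit hook sketch: read the hook payload from stdin and, if the
# prompt matches a trigger phrase, inject extra context via additionalContext.
import json
import re
import sys

payload = json.load(sys.stdin)
prompt = payload.get("prompt", "")

# Illustrative trigger list, mirroring the skill description quoted above.
TRIGGERS = re.compile(
    r"caveman mode|talk like caveman|use caveman|less tokens|be brief", re.I
)

if TRIGGERS.search(prompt):
    print(json.dumps({
        "hookSpecificOutput": {
            "hookEventName": "UserPromptSubmit",
            "additionalContext": "Respond in terse caveman style. Short words. No filler.",
        }
    }))

sys.exit(0)
```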
For the second part, invoking /caveman will always trigger the caveman skill: https://code.claude.com/docs/en/skills