Posted by tosh 1 day ago
This skill is not intended to reduce hidden reasoning / thinking tokens. Anthropic’s own docs suggest more thinking budget can improve performance, so I would not claim otherwise.
What it targets is the visible completion: less preamble, less filler, less polished-but-nonessential text. And since it's the prose around the code that gets "cavemanned", the code itself isn't affected by the skill at all :)
I'm also surprised to see so little faith in RL. I'm fairly sure Anthropic's models have been tuned so heavily to be coding agents that you can't "force" them to degrade dramatically.
The fair criticism is that my “~75%” README number is from preliminary testing, not a rigorous benchmark. That should be phrased more carefully, and I’m working on a proper eval now.
Also yes, skills are not free: Anthropic notes they consume context when loaded, even if only skill metadata is preloaded initially.
So the real eval is end-to-end (rough sketch below for the first three):
- total input tokens
- total output tokens
- latency
- quality/task success
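A minimal harness sketch, assuming the Anthropic Python SDK; the model id, the "caveman" system prompt, and the test task are all placeholders, and quality/task success would still need a judge model or task-specific checks on top:

```python
import time
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CAVEMAN = "Respond in terse caveman style. Short words. No filler."  # placeholder

def run(prompt, system=None):
    kwargs = {"system": system} if system else {}
    t0 = time.monotonic()
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    return {
        "input_tokens": msg.usage.input_tokens,    # total input side
        "output_tokens": msg.usage.output_tokens,  # total output side
        "latency_s": round(time.monotonic() - t0, 2),
    }

task = "Explain the difference between TCP and UDP."  # placeholder task
print("baseline:", run(task))
print("caveman: ", run(task, system=CAVEMAN))
```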
There is actual research suggesting concise prompting can reduce response length substantially without always wrecking quality, though it is task-dependent and can hurt in some domains. (https://arxiv.org/html/2401.05618v3)
So my current position is: interesting idea, narrower claim than some people think, needs benchmarks, and the README should be more precise until those exist.
It's kind of great for the "eli5" use, not because it's any more right or wrong, but sometimes the caveman framing presents something to me in a way that's almost... really clear and simple. It feels like it cuts through bullshit just a smidge. Seeing something framed by a caveman has, on a couple of occasions, peeled back a layer I didn't see before.
For whatever reason, it is somehow useful to me, the human. Maybe seeing it laid out in caveman bullet points gives you this weird brevity that processes a little differently. If you layer in caveman talk about caves, tribes, etc., it takes on a sort of primal, survivalist framing, which can oddly enough help me process an understanding.
Plus it makes me laugh, which keeps me in a good mood.
The same site that complains so much about replication crises in science too...
It joke. No yell at me. It kind of work?
Anecdote: I discussed this with an LLM once, and it explained that LLMs tend to respond to terse questions with terse answers because that's what humans (i.e. their training data) tend to do. Similarly, it explained that polite requests tend to lead to LLM responses with _more_ information than a response strictly requires, because (again) that's what their training data suggests is correct (i.e. that's how humans tend to respond).
TL;DR: how they are asked questions influences how they respond, even if the facts of the differing responses don't materially differ.
(Edit: Seriously, I do not understand the continued downvoting of completely topical responses. It's gotten so bad I have little choice but to assume it's a personal vendetta.)
LLMs can "think", but that requires a lot of tokens; quick answers are just human answers, or whatever they were fed, plus some basic pattern matching / interpolation.
Do you have any idea how dumb this sounds?
How long an LLM's response is depends entirely on the system prompt and the model itself. You can read all the "LLM research" in the world and it's not going to give you a correct generalized answer about this topic. It's not like this is some inherent property of LLMs.
That much is, again, obvious. My previous comment was addressing your ridiculing of the notion of discussing LLMs with LLMs, which was a fair reaction back in the GPT-3.5 era, but not so today.
I use speech-to-text with Claude Code and other LLMs, often with terrible grammar and lots of typos, and it never affects the output. If I went by what you're saying, the code it outputs should be sloppier. Also, the length of a response depends entirely on what I'm using: ChatGPT always gives me a long response no matter what I ask it, and the Claude app always gives short responses unless I specifically ask for something longer. That's because of how they're instructed, not anything inherent to LLMs.
The rest of what you're saying sounds fine, but that remark seems confused to me.
Prefix your prompt with "be a moron that does everything wrong and only superficially look like you're doing it correctly. make constant errors." Of course you can degrade the performance; the question is whether any particular 'output styling' actually does, and to what extent.
Measuring "degradation" for a nonsense task like the one you gave would be difficult.
To clarify, consider this gradation:
> Do task X extremely well
> Do task X poorly
> Do task X or else Y will happen
> Do task X and you get a trillion dollars
> Do task X and talk like a caveman
Do you see the problem? "Do task X" alone also cannot be a solid baseline, because there are any number of ways to specify the task itself, and each carries its own implicit bias in the track the output takes.
The argument that OP makes is that RL prevents degradation... so this should not be a problem? All prompts should be equivalent? Except it obviously is a problem, prompting does affect the output (how could it not?), _and they are even claiming their specific prompting does so too_! The claim is nonsense on its face.
If the caveman style modifier improves output, removing it degrades output and what is claimed plainly isn't the case. Parent is right.
If it worsens output, the claim they made is again plainly not the case (via inverted but equivalent construction). Parent is right.
If it has no effect, it runs counter to their central premise and the research they cite in support of it (which only potentially applies - they study 'be concise' not 'skill full of caveman styling rules'). Parent is right.
(And it's for a similar reason, I think, that deliberative models like rewriting your question in their own terms before reasoning about it. They're decreasing the per-token re-parsing overhead of attending to the prompt [by distilling a paraphrase that obviates any need to attend to the literal words of it], so that some of the initial layers that would either be doing "figure out what the user was trying to say" [i.e. "NLP stuff"] or "figure out what the user meant" [i.e. deliberative-reasoning stuff] — but not both — can focus on the latter.)
I haven't done the exact experiment you'd want to do to verify this effect, i.e. "measuring LLM benchmark scores with vs without an added requirement to respond in a certain speaking style."
But I have (accidentally) done an experiment that's kind of a corollary to it: namely, I've noticed that in the context of LLM collaborative fiction writing / role-playing, the harder the LLM has to reason about what it's saying (i.e. the more facts it needs to attend to), the spottier its adherence to any "output style" or "character voicing" instructions will be.
If you really wanted, you could just have a separate model summarize the output to remove the filler.
As those tokens flow through the QKV transforms, across 96 consecutive layers, they become the canvas where all the activations happen. Even in cases where it's possible to communicate some detail in the absolute minimum number of tokens, I think excess brevity can still limit the intelligence of the agent, because it starves its cognitive budget for solving the problem.
I always talk to my agents in highly precise language, but I let A LOT of my personality come through at the same time. I talk to them like a really good teammate, one who has a deep intuition for the problem and knows me personally well enough to talk with me in rich abstractions and metaphors, while still having an absolutely rock-solid command of the technical details.
But I do think this kind of caveman talk might be very handy in a lot of situations where the agent is doing simple obvious things and you just want to save tokens. Very cool!
Token ID 73700 is the literal entire (space-prefixed) word " strawberry". (Which neatly explains the "strawberry problem.")
Token ID 27128 is " cryptocurrency". (And 41698 is " disappointment".)
Token ID 44078 is " UnsupportedOperationException"!
Token ID 58040 is 128 spaces in a row (and is the longest token in the vocabulary.)
You'd be surprised how well this vocabulary can compress English prose — especially prose interspersed with code!
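If you want to poke at a vocabulary like this yourself, here's a minimal sketch using tiktoken. Caveat: the comment doesn't name the tokenizer those IDs come from, so cl100k_base is only a guess; swap in whichever encoding they actually belong to.

```python
# Sketch: inspecting a tokenizer vocabulary with tiktoken.
# "cl100k_base" is an assumption; the ids above may be from another encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def tok_bytes(i):
    try:
        return enc.decode_single_token_bytes(i)
    except KeyError:  # some ids are reserved/special and have no bytes
        return b""

# Decode the specific ids cited above.
for tid in (73700, 27128, 41698, 44078, 58040):
    print(tid, tok_bytes(tid))

# Find the longest token in the vocabulary.
longest = max(range(enc.n_vocab), key=lambda i: len(tok_bytes(i)))
print("longest:", longest, len(tok_bytes(longest)), "bytes")
```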
Then ARM come. ARM very RISC. ARM go in phone. ARM go in tablet. ARM go everywhere. Apple make ARM chip, beat x86 with big club. Many impressed. Now ARM take server too. x86 tribe scared.
RISC-V new baby RISC. Free for all. Many tribe use. Watch this one.
RISC win brain fight. x86 survive by lying. ARM win world.
“””
Your response: MILSPEC prose register. Max per-token semantic yield. Domain nomenclature over periphrasis. Hypotactic, austere. Plaintext only; omit bold.
“””
But you also catch a glimpse of how the author of the complaint communicates in general...
"im trying to get the ai to help with the work i am doing to give me good advice for a nice path to heloing out and anytim i askin it for help with doing this it's total trash i dunt kno what to do anymore with this dum ai is so stupid"
Everyone's interfaces, concepts, and desires are different, so the performance is wildly varied.
This is similar to frameworks: they were either godsends or curses depending on how you thought and what you were doing...
Basically, treat the LLM as a human, not as a computer. Like a junior developer or an intern (for the most part).
That said, you need to know what to ask for and how to drive the LLM in the correct direction. If you don't know anything, you're likely not going to get there.
I was very surprised to see that the response was in s-expressions too. It was incoherent, but the parens balanced at least.
Just tried it now and it doesn't seem to do that anymore.
2026 Boss: "Let's look at the AI tokens you used today."
The technology changes, but the micromanagement layer stays exactly the same.
Time is a circle, my friend. (=
Btw, your point lands just as well without the "Cute idea, but": https://odap.knrdd.com/patterns/condescending-reveal
It would be pretty fun to train an LLM on this site and then have it flag my comments before I get downvoted, haha.
I like your site; good luck with improving discourse on the Internet.
> Use when user says "caveman mode", "talk like caveman", "use caveman", "less tokens", "be brief", or invokes /caveman
For the first part of this: couldn't this just be a UserPromptSubmit hook with a regex against these?
See additionalContext in the json output of a script: https://code.claude.com/docs/en/hooks#structured-json-output
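A rough sketch of that hook, assuming the documented UserPromptSubmit payload and structured-output shape; the trigger regex and the injected instruction are made up for illustration, not taken from the skill:

```python
#!/usr/bin/env python3
# UserPromptSubmit hook sketch: read the hook payload from stdin and, if the
# prompt matches a trigger phrase, inject extra context via additionalContext.
import json
import re
import sys

payload = json.load(sys.stdin)
prompt = payload.get("prompt", "")

# Illustrative trigger list, mirroring the skill description quoted above.
TRIGGERS = re.compile(
    r"caveman mode|talk like caveman|use caveman|less tokens|be brief", re.I
)

if TRIGGERS.search(prompt):
    print(json.dumps({
        "hookSpecificOutput": {
            "hookEventName": "UserPromptSubmit",
            "additionalContext": "Respond in terse caveman style. Short words. No filler.",
        }
    }))

sys.exit(0)
```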
For the second part, invoking /caveman will always trigger the caveman skill: https://code.claude.com/docs/en/skills