Posted by anabranch 12 hours ago

Anonymous request-token comparisons from Opus 4.6 and Opus 4.7 (tokens.billchambers.me)
440 points | 450 comments | page 3
throwatdem12311 10 hours ago|
Price is now getting more in line with the actual cost. The models are dumber, slower, and more expensive than what we’ve been paying up until now. OpenAI will do it too, maybe a bit less to avoid pissing people off after seeing the backlash to Anthropic’s move here. Or maybe they won’t make it dumber, but they’ll increase the price while making a dumber mode the baseline so you’re encouraged to pay more. The free ride is over. Hope you have 30k burning a hole in your pocket to buy a beefy machine to run your own model. I hear Mac Studios are good for local inference.
fathermarz 10 hours ago||
I have been seeing this messaging everywhere and I have not noticed this. I have had the inverse with 4.7 over 4.6.

I think people aren’t reading the system cards when they come out. They explicitly explain your workflow needs to change. They added more levels of effort and I see no mention of that in this post.

Did y’all forget Opus 4? That was not that long ago, and Claude was essentially unusable then. We are at peak wizardry right now and no one is talking positively. It’s all doom and gloom around here these days.

gck1 4 hours ago||
> They explicitly explain your workflow needs to change

How about - don't break my workflow unless the change is meaningful?

While we're at it, either make y in x.y mean "groundbreaking", or "essentially same, but slightly better under some conditions". The former justifies workflow adjustments, the latter doesn't.

RevEng 5 hours ago||
I have used nothing but Sonnet and composer for a year and they work fine. LLMs were certainly not unusable before and Opus is certainly not necessary, especially considering the cost. People get excited by new records on benchmarks but for most day to day work the existing models are sufficient and far more efficient.
atleastoptimal 8 hours ago||
The whole version naming for models is very misleading. 4 and 4.1 seem to come from a different "line" than 4.5 and 4.6, and likewise 4.7 seems like a new shape of model altogether. They aren't linear stepwise improvements, but I think overall 4.7 is generally "smarter" just based on conversational ability.
napolux 11 hours ago||
Token consumption is huge compared to 4.6, even for smaller tasks. Just from "reasoning" after my first prompt this morning, I burned over 50% of the 5-hour quota.
jimkleiber 11 hours ago||
I wonder if this is like when a restaurant introduces a new menu to increase prices.

Is Opus 4.7 that significantly different in quality that it should use that much more in tokens?

I like Claude and Anthropic a lot, and hope it's just some weird quirk in their tokenizer or whatnot, just seems like something changed in the last few weeks and may be going in a less-value-for-money direction, with not much being said about it. But again, could just be some technical glitch.

hopfenspergerj 11 hours ago|
You can't accidentally retrain a model to use a different tokenizer. It changes the input vectors to the model.
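To illustrate the point: a model's embedding table is indexed by token IDs from one specific vocabulary, so the same text maps to entirely different ID sequences (and counts) under a different tokenizer. A toy sketch with made-up vocabularies (nothing to do with Anthropic's actual tokenizer):

```python
# Toy greedy longest-match tokenizer over a tiny, hypothetical vocabulary.
# Embeddings trained against one vocabulary are meaningless under another:
# the IDs (and even the number of tokens) differ for the same text.
def tokenize(text, vocab):
    ids, i = [], 0
    while i < len(text):
        for end in range(len(text), i, -1):  # try the longest piece first
            piece = text[i:end]
            if piece in vocab:
                ids.append(vocab[piece])
                i = end
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

vocab_a = {"un": 0, "usable": 1, " ": 2, "model": 3}
vocab_b = {"u": 0, "n": 1, "usable": 2, " ": 3, "mo": 4, "del": 5}

print(tokenize("unusable model", vocab_a))  # [0, 1, 2, 3]       -> 4 tokens
print(tokenize("unusable model", vocab_b))  # [0, 1, 2, 3, 4, 5] -> 6 tokens
```

Same string, different ID sequences and different token counts, which is why swapping the tokenizer means retraining, not a quiet config change.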
jimkleiber 4 hours ago||
I appreciate you saying that. I think sometimes with AI conversations I wade into them without knowing the precise definitions of the terms; I'll try to be more careful next time. Thank you.
Frannky 4 hours ago||
My subscription was up for renewal today. I gave it a shot with OpenCode Go + Xiaomi model. So far, so good—I can get stuff done the same way it seems.
bobjordan 11 hours ago||
I've spent the past 4+ months building an internal multi-agent orchestrator for coding teams. Agents communicate through a coordination protocol we built, and all inter-agent messages plus runtime metrics are logged to a database.

Our default topology is a two-agent pair: one implementer and one reviewer. In practice, that usually means Opus writing code and Codex reviewing it.

I just finished a 10-hour run with 5 of these teams in parallel, plus a Codex run manager. Total swarm: 5 Opus 4.7 agents and 6 Codex/GPT-5.4 agents.

Opus was launched with:

`CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=35 claude --dangerously-skip-permissions --model 'claude-opus-4-7[1M]' --effort high --thinking-display summarized`

Codex was launched with:

`codex --dangerously-bypass-approvals-and-sandbox --profile gpt-5-4-high`

What surprised me was usage: after 10 hours, both my Claude Code account and my Codex account had consumed 28% of their weekly capacity from that single run.

I expected Claude Code usage to be much higher. Instead, on these settings and for this workload, both platforms burned the same share of weekly budget.

So from this datapoint alone, I do not see an obvious usage-efficiency advantage in switching from Opus 4.7 to Codex/GPT-5.4.
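For scale, here's the back-of-envelope the 28% figure implies, assuming quota burn is roughly linear in run time (my assumption, not something the platforms document):

```python
# 28% of weekly quota consumed by one 10-hour, 5-agent run (per the numbers
# above). Assuming linear burn, how many such runs fit in a week?
quota_per_run = 0.28
run_hours = 10
opus_agents = 5

runs_per_week = 1 / quota_per_run                      # ~3.6 runs to hit 100%
agent_hours = runs_per_week * run_hours * opus_agents  # ~179 Opus agent-hours
print(f"{runs_per_week:.1f} runs/week, ~{agent_hours:.0f} Opus agent-hours/week")
```

Roughly three and a half runs of this size per week per account, on either platform, if the burn rate holds.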

pitched 11 hours ago|
I just switched fully into Codex today, off of Claude. The higher usage limits were one factor but I’m also working towards a custom harness that better integrates into the orchestrator. So the Claude TOS was also getting in the way.
ausbah 12 hours ago||
is it really unthinkable that another oss/local model will be released by deepseek, alibaba, or even meta that once again give these companies a run for their money
zozbot234 11 hours ago||
> is it really unthinkable that another oss/local model will be released by deepseek, alibaba, or even meta that once again give these companies a run for their money

Plenty of OSS models being released as of late, with GLM and Kimi arguably being the most interesting for the near-SOTA case ("give these companies a run for their money"). Of course, actually running them locally for anything other than very slow Q&A is hard.

rectang 11 hours ago|||
For my working style (fine-grained instructions to the agent), Opus 4.5 is basically ideal. Opus 4.6 and 4.7 seem optimized for more long-running tasks with less back and forth between human and agent; but for me Opus 4.6 was a regression, and it seems like Opus 4.7 will be another.

This gives me hope that even if future versions of Opus continue to target long-running tasks and get more and more expensive while being less and less appropriate for my style, a competitor can build a model akin to Opus 4.5 that suits my workflow, optimizing for other factors like cost.

DeathArrow 8 hours ago||
Have you tried GLM 5.1?
amelius 11 hours ago|||
I'm betting on a company like Taalas making a model that is perhaps less capable but 100x as fast, where you could have dozens of agents looking at your problem from all different angles simultaneously, and so still get better results, faster.
andai 11 hours ago|||
Yeah, it's a search problem. When verification is cheap, reducing success rate in exchange for massively reducing cost and runtime is the right approach.
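The tradeoff reduces to simple expected-value arithmetic: if each attempt succeeds independently with probability p, the expected number of attempts is 1/p (geometric distribution), so expected cost is (cost per attempt) / p. A sketch with made-up numbers:

```python
# Retry-until-success under cheap verification: expected cost is
# cost_per_attempt / p_success (geometric distribution of attempts).
# The costs and success rates below are illustrative, not measured.
def expected_cost(cost_per_attempt, p_success):
    return cost_per_attempt / p_success

fast_cheap = expected_cost(cost_per_attempt=1, p_success=0.3)    # ~3.3 units
slow_strong = expected_cost(cost_per_attempt=50, p_success=0.9)  # ~55.6 units
print(fast_cheap, slow_strong)
```

With these (invented) numbers, a model 50x cheaper per attempt wins even at a third of the success rate; the comparison flips only when verification itself gets expensive or failures leave costs behind.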
never_inline 11 hours ago||
You're underestimating the algorithmic complexity of such brute forcing, and the indirect cost of the brittle code produced by inferior models.
100ms 9 hours ago|||
I'm excited for Taalas, but the worry with that suggestion is that it would blow out energy per net unit of work, which kills a lot of Taalas' buzz. Still, it's inevitable: if you make something an order of magnitude faster, folks will just come along and feed it an order of magnitude more work. I hope the middle ground with Taalas is a cottage industry of LLM hosts with small-to-mid-sized budgets hosting last-gen models for quite cheap. Although if they're packed to max utilization with all the new workloads they enable, latency might not be much better than what we already have today.
embedding-shape 11 hours ago|||
Nothing is unthinkable. I could imagine a Transformers V2 that looks completely different, maybe iterations on Mamba turn out fruitful, or countless other scenarios.
pitched 11 hours ago|||
Now that Anthropic have started hiding the chain of thought tokens, it will be a lot harder for them
zozbot234 11 hours ago||
Anthropic and OpenAI never showed the true chain of thought tokens. Ironically, that's something you only get from local models.
slowmovintarget 11 hours ago|||
Qwen released a new model the same day (3.6). The headline was kind of buried by Anthropic's release, though.

https://news.ycombinator.com/item?id=47792764

casey2 9 hours ago||
This regression put Anthropic behind Chinese models actually.
nickvec 4 hours ago||
For all intents and purposes, aren't the "token change" and "cost change" metrics effectively the same thing?
gck1 8 hours ago|
Anthropic is playing a strange game. It's almost like they want you to cancel the subscription if you're an active user and only subscribe if you only use it once per month to ask what the weather in Berlin is.

First they introduce a policy to ban third party clients, but the way it's written, it affects claude -p too, and 3 months later, it's still confusing with no clarification.

Then they hide model's thinking, introduce a new flag which will still show summaries of thinking, which they break again in the next release, with a new flag.

Then they silently cut the usage limits to the point where the exact same usage you're used to consumes 40% of your weekly quota in 5 hours, but not only do they stay silent for two entire weeks, they actively gaslight users by saying they didn't change anything, only to announce later that they did, indeed, change the limits.

Then they serve a lobotomized model for an entire week before they drop 4.7, again, gaslighting users that they didn't do that.

And then this.

Anthropic has lost all credibility at this point and I will not be renewing my subscription. If they can't provide services under a price point, just increase the price or don't provide them.

EDIT: forgot "adaptive thinking", so add that too. Which essentially means "we decide when we can allocate resources for thinking tokens based on our capacity, or in other words - never".
