Posted by anabranch 6 days ago

Anonymous request-token comparisons from Opus 4.6 and Opus 4.7 (tokens.billchambers.me)
490 points | 488 comments
BrianneLee011 6 days ago|
We should clarify 'Scaling up' here. Does higher token consumption actually correlate with better accuracy, or are we just increasing overhead?
ben8bit 6 days ago||
Makes me think the model might not actually be smarter, just more token-dependent.
hirako2000 6 days ago|
Asking a seller to sell less.

That's an incentive that's difficult to reconcile with the user's benefit.

To keep this business running they do need to invest to make the best model, period.

It happens to be exactly what Anthropic's strategy is. That and great tooling.

subscribed 6 days ago||
But they're clearly oversubscribed, massively.

And they're selling less and less (suddenly a 5-hour window lasts 1 hour on tasks similar to ones it lasted 5 hours on a week ago), so IMO they're scamming.

I hope many people are taking notes and will turn up the heat soon.

hirako2000 4 days ago||
I agree. I'm rather pointing out that the whole strategy dictates the outcome.

Anthropic has to keep racing ahead and be seen as offering the best frontier models.

It isn't optimal: the models cost them disproportionately too much to sell at a profitable price. So they keep feeding the hype and pushing costs higher, hoping there won't be too much heat and they'll get away with it.

I wouldn't like to be a leader at such a company, but their pay keeps them in line.

nmeofthestate 6 days ago||
Is this a weird way of saying Opus got "cheaper" somehow from 4.6 to 4.7?
l5870uoo9y 6 days ago||
My impression is that the reverse is true when upgrading from GPT-5 to GPT-5.4; it uses fewer tokens(?).
andai 6 days ago|
But with the same tokenizer, right?

The difference here is Opus 4.7 has a new tokenizer which converts the same input text to a higher number of tokens. (But it costs the same per token?)

> Claude Opus 4.7 uses a new tokenizer, contributing to its improved performance on a wide range of tasks. This new tokenizer may use roughly 1x to 1.35x as many tokens when processing text compared to previous models (up to ~35% more, varying by content), and /v1/messages/count_tokens will return a different number of tokens for Claude Opus 4.7 than it did for Claude Opus 4.6.

> Pricing remains the same as Opus 4.6: $5 per million input tokens and $25 per million output tokens.

ArtificialAnalysis reports that 4.7 significantly reduced output tokens, though, and was overall ~10% cheaper to run their evals.

I don't know how well that translates to Claude Code usage though, which I think is extremely input heavy.
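
If you want to measure the inflation on your own prompts, the count_tokens endpoint makes the comparison easy. Rough sketch with the Python SDK (the model IDs here are my guesses, check the docs for the real ones):

    # Compare how the 4.6 and 4.7 tokenizers count the same text.
    # Model IDs are assumptions on my part -- substitute the real ones.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
    messages = [{"role": "user", "content": open("prompt.txt").read()}]

    for model in ("claude-opus-4-6", "claude-opus-4-7"):
        count = client.messages.count_tokens(model=model, messages=messages)
        print(model, count.input_tokens)

Since input pricing is unchanged, the ratio of the two counts is also the ratio of your input cost.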

silverwind 6 days ago||
Still worth it imho for important code, but it shows that they're hitting a ceiling while trying to improve the model, which they're trying to work around by making it more token-inefficient.
blahblaher 6 days ago||
Conspiracy time: they released a new version just so they could increase the price without people complaining so much ("see, this is a new model version, so we NEED to increase the price"), similar to how SaaS companies tack on some shit to the product so that they can increase prices.
willis936 6 days ago|
The result is the same: they lose their brand of producing quality output. However, the more clever the maneuver they try to pull off, the clearer it is to their customers that they are not earning trust. That's what will matter at the end of this. Poor leadership at Claude.
operatingthetan 6 days ago||
They are trying to pull a rabbit out of a hat. Not surprising that this is their SOP, given that AI in concept is an attempt to do the very same thing.
macinjosh 6 days ago||
Opus 4.7 seems smarter, not wiser. More knowledge, maybe, but less grit. It has often been asking me to wrap up or just be happy with the current state, instead of working a problem out.
eezing 6 days ago||
Not sure if this equates to more spend. Smarter models make fewer mistakes and thus fewer round trips.
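Back-of-envelope with completely made-up numbers (the 35% token inflation is from the tokenizer quote upthread; the round-trip counts are invented):

    # Toy break-even math: more tokens per call vs. fewer round trips.
    # All numbers are hypothetical, for illustration only.
    tokens_per_call_46 = 100_000
    tokens_per_call_47 = int(tokens_per_call_46 * 1.35)  # new tokenizer

    round_trips_46 = 3  # fix, re-fix, re-fix
    round_trips_47 = 2  # fewer mistakes, fewer retries

    print(tokens_per_call_46 * round_trips_46)  # 300000 tokens total
    print(tokens_per_call_47 * round_trips_47)  # 270000 tokens, ~10% less

If the retry count doesn't actually drop, though, you just eat the full 35%.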
lucid-dev 6 days ago|
Um, I keep getting an "invalid request" error despite trying my prompt in the various formats provided in the examples.

It looks like you don't allow testing of anything beyond a certain token size.

Which makes your test kind of pointless, because if you are chatting about AI with something that's only a few hundred tokens, the data you're collecting is pretty minimal and specific, not something that's generally applicable or relevant to wider users outside of that specific context.
