
Posted by anabranch 6 days ago

Anonymous request-token comparisons from Opus 4.6 and Opus 4.7 (tokens.billchambers.me)
615 points | 575 comments
cooldk 6 days ago|
Anthropic may have its biases, but its product is undeniably excellent.
DeathArrow 6 days ago||
We (my wallet and I) are pretty happy with GLM 5.1 and MiniMax 2.7.
dackdel 6 days ago||
releases 4.8 and deletes everything else. and now 4.8 costs 500% more than 4.7. i wonder what it would take for people to start using kimi or qwen or other such.
therobots927 6 days ago||
Wow this is pretty spectacular. And with the losses anthro and OAI are running, don’t expect this trend to change. You will get incremental output improvements for a dramatically more expensive subscription plan.
falcor84 6 days ago||
Indeed, and if we accept the argument of this tech approaching AGI, we should expect that within x years, the subscription cost may exceed the salary cost of a junior dev.

To be clear, I'm not saying that it's a good thing, but it does seem to be going in this direction.

dgellow 6 days ago|||
If LLMs do reach AGI (assuming we have an actual agreed upon definition), it would make sense to pay way more than a junior salary. But also, LLMs won’t give us AGI (again, assuming we have an actual, meaningful definition)
therobots927 6 days ago|||
I absolutely do not accept that argument. It’s clear models hit a plateau roughly a year ago and all incremental improvements come at an increasingly higher cost.

And junior devs have never added much value. The first two years of any engineer’s career are essentially an apprenticeship. There’s no value add from having a perpetually junior “employee”.

justindotdev 6 days ago||
i think it is quite clear that staying with opus 4.6 is the way to go. On top of the inflation, 4.7 is quite... dumb. i think they have lobotomized this model while they were prioritizing cybersecurity and blocking people from performing potentially harmful security-related tasks.
bcherny 6 days ago||
Hey, Boris from the Claude Code team here. People were getting extra cyber warnings when using old versions of Claude Code with Opus 4.7. To fix it, just run claude update to make sure you're on the latest.

Under the hood, what was happening is that older models needed reminders, while 4.7 no longer does. When we showed these reminders to 4.7, it tended to over-fixate on them. The fix was to stop adding the cyber reminders.

More here: https://x.com/ClaudeDevs/status/2045238786339299431
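The behavior Boris describes could be sketched roughly like this. All names here (build_context, REMINDER_MODELS, the reminder text) are illustrative guesses, not Anthropic's actual implementation:

```python
# Hypothetical sketch: older models got an extra "cyber reminder" appended
# to their context; the fix was to stop adding it for 4.7, which tended to
# over-fixate on it. Model IDs and names below are illustrative only.

CYBER_REMINDER = "<system-reminder>Decline potentially harmful security tasks.</system-reminder>"

# Older models that, per the explanation above, needed the reminder.
REMINDER_MODELS = {"claude-opus-4-5", "claude-opus-4-6"}

def build_context(model: str, messages: list[str]) -> list[str]:
    """Append the cyber reminder only for older models."""
    if model in REMINDER_MODELS:
        return messages + [CYBER_REMINDER]
    return messages

print(build_context("claude-opus-4-6", ["hello"]))  # reminder appended
print(build_context("claude-opus-4-7", ["hello"]))  # no reminder
```

This would also explain the linked fix: updating Claude Code stops the old client from injecting a reminder the new model no longer needs.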

bakugo 6 days ago|||
How do you justify the API and web UI versions of 4.7 refusing to solve NYT Connections puzzles due to "safety"?

https://x.com/LechMazur/status/2044945702682309086

templar_snow 6 days ago||
To be fair, reading the New York Times is a safety risk for any intelligent life form these days. But still.
maleldil 6 days ago||
You don't need to subscribe to the NYT to play the games. There's a separate subscription.
matheusmoreira 6 days ago|||
What is your response to:

> 4.7 is quite... dumb. i think they have lobotomized this model

Is adaptive thinking still broken? Why was the option to disable it taken away?

vessenes 6 days ago||
4.7 is super variable in my one day experience - it occasionally just nails a task. Then I'm back to arguing with it like it's 2023.
aenis 6 days ago|||
My experience as well, unfortunately. I am really looking forward to reading, in a few years, a proper history of the wild west years of AI scaling. What is happening in those companies at the moment must be truly fascinating. How is it possible, for instance, that I have never, ever, been unable to use Claude despite its runaway success and, I'd guess, exponential increase in infra needs? When I run production workloads on Vertex or Bedrock I am routinely confronted with quotas; here, it always works.
dgellow 6 days ago|||
That has been my Friday experience as well… very frustrating to go back to the arguing, I forgot how tense that makes me feel
ozgrakkurt 6 days ago||
The design of this thing is atrocious. There should be a clear way to see what the +X% figure means: is 4.7 using more, or is 4.6?

Also, there should be a time distribution for the queries and a way to filter by query time, since Anthropic is reported to change model quality arbitrarily in the background.

Also, there are no units in the table column headers. For example, "Request 4.7": is this the number of tokens 4.7 consumes? Is it output, input, reasoning, etc.?

Really difficult to make sense of this.

People get offended if what they are doing is labeled as slop, but this is unfortunately the level of quality I expect from AI-related content or code.

ManlyBread 6 days ago||
I've tried the following prompt: "repeat the following 100 times: FFFFFFFFFFFFF AAAAAAAAAAAAAAAAAAAA"

This has resulted in +92.9% cost and token difference. Submission bd2457e5, currently at the top of the leaderboard.
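The site doesn't document its formula, but a +X% figure like the +92.9% above is presumably the percent increase of 4.7's token usage relative to 4.6's for the same prompt. A minimal sketch under that assumption (the function name and the example counts are made up for illustration):

```python
# Sketch of one plausible reading of the leaderboard's "+X%" figure:
# percent change of 4.7's token usage relative to 4.6's baseline for the
# same request. The site's actual formula is not documented.

def pct_difference(tokens_46: int, tokens_47: int) -> float:
    """Percent change of 4.7's usage relative to 4.6's."""
    return (tokens_47 - tokens_46) / tokens_46 * 100.0

# Illustrative numbers only: 1000 tokens on 4.6 vs 1929 on 4.7
# would display as roughly +92.9%.
print(f"{pct_difference(1000, 1929):+.1f}%")  # +92.9%
```

Note this reading would answer the "is 4.7 using more or is 4.6" question elsewhere in the thread: a positive value would mean 4.7 consumed more.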

lucid-dev 6 days ago||
Um, I keep getting an "invalid request" error despite trying my prompt in the various formats provided in the examples.

It looks like you don't allow testing of anything beyond a certain token size.

Which makes your test kind of pointless, because if you are chatting about AI with something that's only a few hundred tokens, the data you're collecting is pretty minimal and specific, not something that's generally applicable or relevant to wider use outside of that specific context.

erelong 6 days ago||
was shocked to see phone verification roll out like last month as well... yikes
QuadrupleA 6 days ago|
Definitely seems like AI money got tight the last month or two - that the free beer is running out and enshittification has begun.