Posted by HellsMaddy 16 hours ago

Claude Opus 4.6(www.anthropic.com)
1907 points | 805 comments
ra 11 hours ago|
Why are Anthropic such a horrible company to deal with?
danielbln 6 hours ago|
Care to elaborate?
ra 3 hours ago||
Obscure billing, unreachable customer support gatekept by an overzealous chatbot, no transparency about inclusions or changes to inclusions over time... just from recent experience.
winterrx 16 hours ago||
Agentic search benchmarks are a big gap up. Let's see Codex release later today.
osti 16 hours ago||
Somehow regresses on SWE bench?
lkbm 16 hours ago||
I don't know how these benchmarks work (do you do a hundred runs? A thousand runs?), but 0.1% seems like noise.
SubiculumCode 16 hours ago|||
That benchmark is pretty saturated, tbh. A "regression" of such small magnitude could mean many different things or nothing at all.
usaar333 16 hours ago||
I'd interpret that as rounding error, i.e. effectively unchanged.

SWE-bench seems really hard once you're above 80%.
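A rough way to sanity-check the "noise" claim above: SWE-bench Verified has 500 problems, so a single run can only move in increments of one problem, and the sampling error of a pass rate dwarfs a 0.1% delta. A minimal sketch (the 80% pass rate is an assumed score in the range the thread discusses, not a reported number):

```python
import math

# Hypothetical setup: an eval with n problems and pass rate p.
# n=500 matches SWE-bench Verified's problem count; p=0.80 is assumed.
n = 500
p = 0.80

# Smallest possible score change from one run: one problem flips.
step = 1 / n

# Standard error of a binomial proportion: sqrt(p(1-p)/n).
se = math.sqrt(p * (1 - p) / n)

print(f"one problem is worth {step:.1%} of the score")  # 0.2%
print(f"standard error of the score: {se:.1%}")          # 1.8%
```

So a 0.1% difference is smaller than a single problem's worth of score, and well inside one standard error.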

Squarex 16 hours ago||
It's not a great benchmark anymore... starting with it being primarily Python/Django... the industry should move to something more representative.
usaar333 16 hours ago||
OpenAI has; they don't even mention a score for gpt-5.3-codex.

On the other hand, it is their own verified benchmark, which is telling.

m-hodges 16 hours ago||
> In Claude Code, you can now assemble agent teams to work on tasks together.
nprz 16 hours ago|
I was just reading about Steve Yegge's Gas Town[0], it sounds like agent orchestration is now integrated into Claude Code?

[0]https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16d...

simianwords 16 hours ago||
Important: API cost of Opus 4.6 and 4.5 are the same - no change in pricing.
rob 16 hours ago||
System Card: https://www-cdn.anthropic.com/0dd865075ad3132672ee0ab40b05a5...
niobe 12 hours ago||
Is there a good technical breakdown somewhere of all these benchmarks that get used to market the latest and greatest LLMs? Preferably impartial.
Aztar 12 hours ago|
I just ask Claude, and ask for sources for each one.
niobe 6 hours ago||
Reminds me of how if you make a complaint against a lawyer or a judge it's evaluated by lawyers and judges.
kingstnap 16 hours ago||
I was hoping for a Sonnet as well but Opus 4.6 is great too!
paxys 16 hours ago||
Hmm, all leaks had said this would be Claude 5. Wonder if it was a last-minute demotion due to performance. Would explain the few days' delay as well.
trash_cat 16 hours ago||
I think the naming schemes are quite arbitrary at this point. Going to 5 would come with massive expectations that reality wouldn't meet.
mrandish 15 hours ago|||
After the negative reactions to GPT 5, we may see model versioning that asymptotically approaches the next whole number without ever reaching it. "New for 2030: Claude 4.9.2!"
esafak 7 hours ago||
Or approaching a magic number like e (Metafont) or π (TeX).
Squarex 15 hours ago|||
The standard used to be that a major version means a new base model / full retrain... but now it's arbitrary, I guess.
cornedor 16 hours ago|||
Leaks were mentioning Sonnet 5, and I guess later (a combination of) Opus 4.6.
scrollop 15 hours ago||
Sonnet 5 was mentioned initially.
sanufar 16 hours ago|
Works pretty nicely for research still, not seeing a substantial qualitative improvement over Opus 4.5.