
Posted by mfiguiere 14 hours ago

An update on recent Claude Code quality reports (www.anthropic.com)
650 points | 497 comments
Implicated 7 hours ago|
Just as a note to CC fans/users here since I had an opportunity to do so... I tested resuming a session that was stale at 950k tokens after returning from a full day or so of being idle, thus a fully empty quota/session.

Resuming it cost 5% of the current session and 1% of the weekly session on a max subscription.
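Taking the comment's numbers at face value, the implied quota sizes fall out of simple division. This is just back-of-envelope arithmetic on the figures above, not Anthropic's actual limits, which aren't published in token terms:

```python
# Implied budgets from the comment's own numbers: a ~950k-token stale session
# consumed 5% of the session quota and 1% of the weekly quota on a Max plan.
# Everything below is derived from those three figures, nothing official.

stale_context_tokens = 950_000   # size of the resumed session (from the comment)
session_fraction = 0.05          # 5% of the current session quota
weekly_fraction = 0.01           # 1% of the weekly quota

implied_session_budget = stale_context_tokens / session_fraction
implied_weekly_budget = stale_context_tokens / weekly_fraction

print(f"implied session budget: ~{implied_session_budget:,.0f} tokens")
print(f"implied weekly budget:  ~{implied_weekly_budget:,.0f} tokens")
```

If the reported percentages are accurate, that would suggest a session budget on the order of 19M tokens and a weekly budget around 95M, though rounding in the displayed percentages makes these rough at best.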

nopurpose 4 hours ago||
Weren't there reports that quality decreased when using non-CC harnesses too? Nothing in the blog post can explain that.
sutterd 11 hours ago||
What kind of performance are people getting now? I was running 4.7 yesterday and it did a remarkably bad job. I recreated my repo state exactly and ran the same starting task with 4.5 (which I have preferred to 4.6). It was even worse, by a large margin. It is likely my task was difficult or poorly posed, but I still have some idea of what 4.5 should have done on it. This was not it. What experiences are other people having with 4.7? How about with other model versions, if they are trying them? (In both cases, I ran on max effort, for whatever that is worth.)
pxc 12 hours ago||
One of Anthropic's ostensible ethical goals is to produce AI that is "understandable" as well as exceptionally "well-aligned". It's striking that some of the same properties that make AI risky also just make it hard to consistently deliver a good product. It occurs to me that if Anthropic really makes some breakthroughs in those areas, everyone will feel it in terms of product quality, whether they're worried about grandiose/catastrophic predictions or not.

But right now it seems like, in the case of (3), these systems are really sensitive and unpredictable. I'd characterize that as an alignment problem, too.

VadimPR 12 hours ago||
Appreciate the honesty from the team.

At the same time, I personally find prioritizing quality over quantity of output to be the better strategy. Ten partially buggy features really aren't as good as three quality ones.

jwpapi 11 hours ago||
Those are exactly the kind of issues you run into when your app is AI-coded: you build one thing and kill something else.

You have too many benchmarks, and the wrong ones.

deaux 9 hours ago||
They had this ready and timed it for the GPT 5.5 announcement. Zero chance it's a coincidence.
wg0 5 hours ago||
A heavily vibe-coded CLI would have tons of issues, regularly.

LLMs over-edit, and it's a known problem.

ankit219 10 hours ago||
An interesting question is why these optimizations were pushed so aggressively in the first place, especially given that this was during a 2x promotion they were running themselves, presumably without seeing any slowdown in demand.
zem 9 hours ago|
ugh, caching based on idle time is horrible for my usage; since claude is fairly slow and doesn't really have much of a daily quota anyway, I often tell it to do something, wander off, and come back to check on it when I next think of it. I always vaguely assumed that my session would not "detect" the intervening time, since it was all async. I guess from a global perspective time-based cache eviction makes sense.