Posted by mfiguiere 2 days ago

An update on recent Claude Code quality reports (www.anthropic.com)
933 points | 721 comments
setnone 2 days ago|
Good on them for resolving all three issues, but is it any good again?
alxndr13 2 days ago|
For me at least, yes. Just wrote that to coworkers this afternoon. It behaves way more "stable" in terms of quality, and I don't have the feeling of the model getting way worse after 100k tokens of context or so.

What I notice: after 300k there's some slight quality drop, but I just make sure to compact before that threshold.

ritonlajoie 2 days ago||
Yesterday CC created a FastAPI /healthz endpoint and told me it's the gold standard (with the ending z). Today I stopped my Max sub and will be trying Codex.
wrxd 2 days ago||
To be fair, that’s a Google convention. Have a look at z-pages.
jesse_dot_id 2 days ago||
This is fairly normal.
psubocz 2 days ago||
> All three issues have now been resolved as of April 20 (v2.1.116).

The latest in Homebrew is 2.1.108, so not fixed, and I don't see Opus 4.7 on the models list... Is Homebrew a second-class citizen, or am I in the B group?

maxrev17 2 days ago||
Please for the love of god just put the max price plan up like 4x or 5x in cost and make it actually work.
jruz 2 days ago||
Too late bro, switched to Codex I’m done with your bullshit.
rishabhaiover 2 days ago||
Boris gaslit us about the quality-related incidents for weeks, not acknowledging these problems.
throwaway2027 2 days ago|
Maybe he didn't know, or they were still figuring it out, which is fine; they're engineers who can get things wrong sometimes. But the communication felt lackluster, and being on the receiving end sucks when you had a reliable setup that then degrades. There is a reason people don't upgrade software and say "if it works, don't fix it." That's obviously not an option for Anthropic when they want to keep improving the product, so they need good measurement tools and quick rollbacks, even if properly "benchmarking" LLMs could prove difficult.
rishabhaiover 2 days ago||
I agree, but one can admit the situation instead of outright rejecting the claims. My own mistake is to have become so hopelessly dependent on them.
powera 2 days ago||
I'm not sure they've found/understand it yet. My two main theories:

1. A bunch of people who started new Claude Code codebases in December are now working with larger codebases, causing more context. Claude reads a lot of code files and doesn't effectively prune them from the context as far as I can tell. I find myself having to hint Claude regularly about what files to read (and not read) to avoid having 75k tokens of unrelated files in the context window.

2. Claude Code tries to do more now, for the benefit of people who don't know exactly what they want. The trade-off is that it's worse at doing exactly what people want, when they do know. The "small fix" becomes a large endeavor for Claude.
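The context-bloat point in theory 1 can be made concrete with a quick audit of which files would cost the most tokens if read whole. A rough sketch, assuming the common ~4-characters-per-token heuristic (not Claude's actual tokenizer) and an illustrative set of file extensions:

```python
import os

CHARS_PER_TOKEN = 4  # crude average for English text and code

def estimate_tokens(path):
    """Approximate token cost of reading a file into context."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return len(f.read()) // CHARS_PER_TOKEN

def largest_files(root, exts=(".py", ".ts", ".go"), top=10):
    """List the top N files by estimated token cost under root."""
    sizes = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            if name.endswith(exts):
                full = os.path.join(dirpath, name)
                sizes.append((estimate_tokens(full), full))
    return sorted(sizes, reverse=True)[:top]
```

Files near the top of that list are the ones worth explicitly steering the agent away from.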

PeakScripter 2 days ago||
They should really test everything thoroughly and then make it available to general public to avoid these issues!!
YetAnotherNick 2 days ago||
Why don't they monitor average prompt and response token length (both cached and uncached) per interaction? Seems like this could have caught all their previous unnoticed degradations.

Also a bit surprised they don't have any automated quality check. They could run something like SWE-bench before each release. Both of these seem like basic things even for a startup, let alone a product generating billions in revenue.
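The per-interaction monitoring proposed above amounts to a rolling average with a drift alarm. A minimal sketch, where the 1000-sample window and 20% drift threshold are made-up parameters for illustration:

```python
from collections import deque

class TokenDriftMonitor:
    """Track rolling average token usage per interaction and flag
    when it drifts too far from a frozen baseline."""

    def __init__(self, window=1000, drift_pct=0.20):
        self.window = deque(maxlen=window)  # oldest samples fall off
        self.drift_pct = drift_pct
        self.baseline = None

    def record(self, prompt_tokens, response_tokens):
        self.window.append(prompt_tokens + response_tokens)

    def average(self):
        return sum(self.window) / len(self.window) if self.window else 0.0

    def freeze_baseline(self):
        """Call once a release is considered healthy."""
        self.baseline = self.average()

    def drifted(self):
        """True if the rolling average moved >drift_pct from baseline."""
        if not self.baseline:
            return False
        return abs(self.average() - self.baseline) / self.baseline > self.drift_pct
```

A release pipeline could freeze the baseline on a known-good build and page someone when drifted() flips, which would have surfaced a silent change in context handling.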

Rapzid 2 days ago|
> On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode.

Translation: To reduce the load on our servers.
