Posted by mfiguiere 12 hours ago

An update on recent Claude Code quality reports (www.anthropic.com)
615 points | 483 comments
jameson 11 hours ago|
> "In combination with other prompt changes, it hurt coding quality, and was reverted on April 20"

Do researchers know the correlation between various aspects of a prompt and the response?

To me at least, an LLM appears to be a wildly random function that is difficult to rely on. Traditional systems have structured inputs and outputs, and we can trace how the system produced its output. That doesn't appear to be the case for LLMs, where inputs and outputs can be any text.

Anecdotally, I had a difficult time working with open source models at a social media firm, and something as simple as wrapping the example JSON structure in ```, adding a newline, or changing the wording I used wildly changed accuracy.
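One way to pin this down is to measure it: generate minor formatting variants of the same prompt and compare how often each produces parseable output. This is a minimal sketch of such a harness; `call_model` is a made-up stub standing in for a real model API so the example runs standalone, and the variant names are illustrative.

```python
import json

SCHEMA_EXAMPLE = '{"name": "string", "score": 0}'

def build_variants(example: str) -> dict[str, str]:
    """Minor formatting variants of the same prompt."""
    base = "Return JSON matching this example: "
    return {
        "plain":   base + example,
        "fenced":  base + "\n```\n" + example + "\n```",
        "newline": base + "\n" + example,
    }

def call_model(prompt: str) -> str:
    # Placeholder: a real harness would call the model API here.
    return '{"name": "alice", "score": 3}'

def json_accuracy(prompt: str, trials: int = 5) -> float:
    """Fraction of responses that parse as valid JSON."""
    ok = 0
    for _ in range(trials):
        try:
            json.loads(call_model(prompt))
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / trials

scores = {name: json_accuracy(p)
          for name, p in build_variants(SCHEMA_EXAMPLE).items()}
print(scores)
```

With a real model behind `call_model`, the spread between variants is exactly the accuracy swing described above.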

munk-a 11 hours ago||
It's also important to realize that Anthropic has recently struck several deals with PE firms to use their software. So Anthropic pays the PE firm, which then forces its managed firms to subscribe to Anthropic.

The artificial creation of demand is also a concerning sign.

rfc_1149 10 hours ago||
The third bug is the one worth dwelling on. Dropping thinking blocks every turn instead of just once is the kind of regression that only shows up in production traffic. A unit test for "idle-threshold clearing" would assert "was thinking cleared after an hour of idle" (yes) without asserting "is thinking preserved on subsequent turns" (no). The invariant is negative space.

The real lesson is that an internal message-queuing experiment masked the symptoms in their own dogfooding. Dogfooding only works when the eaten food is the shipped food.

afro88 2 hours ago|
Experienced engineers who know the codebase and system well, and who have enough time to consider the problem properly, would likely catch this case.

But if we're vibing... This is the kind of bug that should make it back into a review agent's/skill's instructions in a more generic form: if something is done to the message history, check that there are tests verifying subsequent turns still work as expected.

But yeah, you'd have to piss off a bunch of users in prod first to discover the blind spot.

nopurpose 3 hours ago||
Weren't there reports that quality decreased when using non-CC harnesses too? Nothing in the blog post can explain that.
voxelc4L 5 hours ago||
I’ve stuck to the non-1M context Opus 4.6 and it works really well for me, even with ongoing context compression. I honestly couldn’t deal with the 1M context change and then the compounding token-devouring nonsense of 4.7. I sincerely hope Anthropic is seeing all of this and taking note. They have their work cut out for them.
lifthrasiir 11 hours ago||
Is it just me, or has the reset cycle of usage limits been randomly updated? I originally had the reset point at around 00:00 UTC tomorrow and it was somehow delayed to 10:00 UTC tomorrow, regardless of when I started to use Claude in this cycle. My friends also reported very random delays, as much as ~40 hours, with seemingly no other reason. Is this another bug on top of the other bugs? :-S
someone4958923 11 hours ago|
"This isn’t the experience users should expect from Claude Code. As of April 23, we’re resetting usage limits for all subscribers."
lifthrasiir 11 hours ago||
I know that. I'm saying that the cycle reset is neither what it used to be (starting at the very first usage) nor what it might reasonably be (retaining the previous cycle reset timing).
jongleberry 11 hours ago||
It seems to be the same cycle for everyone now, not based on first usage. I saw a reddit thread on this from someone who had multiple accounts that all had the same cycle.
Implicated 6 hours ago||
Just a note to CC fans/users here, since I had an opportunity to test this: I resumed a session that had gone stale at 950k tokens after a full day or so of being idle, i.e. with a fully empty quota/session.

Resuming it cost 5% of the current session and 1% of the weekly session on a max subscription.

arjie 11 hours ago||
Useful update. It would be useful to me to switch to a nightly/release cycle, but I can see why they don't: they want to be able to move fast, and it's not like I'm going to churn over these errors. I can only imagine the benchmark runs are prohibitively expensive, slow, or not using their standard harness, because a weekly run would be a good smoke test. At the least, they'd know the trade-offs they're making.

Many of these things have bitten me too. Firing off a request that is slow because the context has been kicked out of cache, and getting zero cache hits (which makes everything way more expensive), so it makes sense they would try to avoid this. I tried skipping tool calls and thinking as well, and it made the agent much stupider. These all seem like natural things to try. Pity.
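The cache-eviction cost hit is easy to see with rough arithmetic. The prices below are made-up placeholders, not Anthropic's actual rates; the point is only the ratio between cached and uncached input tokens on a long agent context.

```python
# Assumed illustrative prices: cache-hit input tokens roughly 10x
# cheaper than fresh input tokens.
PRICE_INPUT = 3.00 / 1_000_000   # $/token, uncached input (assumed)
PRICE_CACHED = 0.30 / 1_000_000  # $/token, cache-hit input (assumed)

def turn_cost(context_tokens: int, cached_fraction: float) -> float:
    """Cost of one turn given how much of the context prefix hits cache."""
    cached = int(context_tokens * cached_fraction)
    fresh = context_tokens - cached
    return cached * PRICE_CACHED + fresh * PRICE_INPUT

ctx = 200_000  # a long agent context (hypothetical)
warm = turn_cost(ctx, cached_fraction=0.95)  # most of the prefix cached
cold = turn_cost(ctx, cached_fraction=0.0)   # evicted: zero cache hits
print(f"warm ${warm:.3f} vs cold ${cold:.3f}")
```

Under these assumed prices a cold turn costs several times a warm one, which is why an eviction shows up as both slow and expensive.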

WhitneyLand 12 hours ago||
Did they not address how adaptive thinking has played into all of this?
sutterd 9 hours ago|
What kind of performance are people getting now? I was running 4.7 yesterday and it did a remarkably bad job. I recreated my repo state exactly and ran the same starting task with 4.5 (which I have preferred to 4.6). It was even worse, by a large margin. It's likely my task was difficult or poorly posed, but I still have some idea of what 4.5 should have done on it, and this was not it. What experiences are other people having with 4.7? How about with other model versions, if you're trying them? (In both cases, I ran on max effort, for whatever that's worth.)