An update on recent Claude Code quality reports

Posted by mfiguiere 9 hours ago

An update on recent Claude Code quality reports(www.anthropic.com)

565 points | 430 commentspage 2

Robdel12 9 hours ago|

Wow, bad enough for them to actually publish something and not cryptic tweets from employees.

Damage is done for me though. Even just one of these things (messing with adaptive thinking) is enough for me to not trust them anymore. And then their A/B testing this week on pricing.

saghm 8 hours ago||

The A/B testing is by far the most objectionable thing from them so far in my opinion, if only because of how terrible it would be for something like that to be standard for subscriptions. I'd argue that it's not even A/B testing of pricing but silently giving a subset of users an entirely different product than they signed up for; it would be like if 2% of Netflix customers had full-screen ads pop up and cover the videos randomly throughout a show. Historically the only thing stopping companies from extraordinarily user-hostile decisions has been public outcry, but limiting it to a small subset of users seems like it's intentionally designed to try to limit the PR consequences.

lifthrasiir 8 hours ago||

The best possible situation that I can imagine is that Anthropic just wanted to measure how much value does Claude Code have for Pro users and didn't mean to change the plan itself (so those users would get CC as a "bonus"), but that alone is already questionable to start with.

polishdude20 4 hours ago|||

Bruce here from the Twitter team.

I got finally fired.

mannanj 9 hours ago||

so who do you trust and go to? (NotClearlySo)OpenAI?

carlgreene 8 hours ago|||

I "subconsciously" moved to codex back in mid Feb from CC and it's been so freaking awesome. I don't think it's as good at UI, but man is it thorough and able to gather the right context to find solutions.

I use "subconsciously" in quotes because I don't remember exactly why I did it, but it aligns with the degradation of their service so it feels like that probably has something to do with it even though I didn't realize it at the time.

cageface 4 hours ago|||

Codex does better if you ask it to take screenshots and critique its own UI work and iterate. It rarely one-shots something I like but it can get there in steps.

GenerWork 8 hours ago||||

Anthropic definitely takes the cake when it comes to UI related activities (pulling in and properly applying Figma elements, understanding UI related prompts and properly executing on it, etc), and I say this as a designer with a personal Codex subscription.

snissn 8 hours ago||||

it's been frustrating how bad it is at UI. I'm starting to test out using their image2 for UI and then handing it to codex to build out the images into code and I'm impressed and relieved so far

cmrdporcupine 7 hours ago|||

Codex isn't great at UI, but you might find Gemini is competent enough as an adjunct. I've had some luck with that.

simlevesque 9 hours ago||||

I went with MiniMax. The token plans are over what I currently need, 4500 messages per 5h, 45000 messages per week for 40$. I can run multiple agents and they don't think for 5-10 minutes like Sonnet did. Also I can finally see the thinking process while Anthropic chose to hide it all from me.

I'm using Zed and Claude Code as my harnesses.

Robdel12 9 hours ago||||

At the moment, yeah. If Google ever figures out how to build an agentic model, I would use them as well.

However you feel about OpenAI, at least their harness is actually open source and they don’t send lawyers after oss projects like opencode

IncreasePosts 7 hours ago||

Is Gemini cli not an agentic model? Or are you just saying it's built poorly? Gemini 2.5 didn't really work for me but Gemini 3 seems fairly solid

cmrdporcupine 7 hours ago||

Gemini fairs poorly at tool use, even in its own CLI and even in Antigravity. It gets into a mess just editing source files, it's tragic because it's actually not a bad model otherwise.

parliament32 6 hours ago||||

Self-hosted models are the one true path.

bensyverson 8 hours ago||||

Anecdotally, I know many people who have supplemented Claude with Codex, and are experimenting with models such as GLM 5.1, Kimi, Qwen, etc.

irthomasthomas 8 hours ago|||

I like chutes because they always use the full weights, and prompts are encrypted with TEE.

puppystench 8 hours ago||

The Claude UI still only has "adaptive" reasoning for Opus 4.7, making it functionally useless for scientific/coding work compared to older models (as Opus 4.7 will randomly stop reasoning after a few turns, even when prompted otherwise). There's no way this is just a bug and not a choice to save tokens.

mattew 6 hours ago|

It was odd that there was no mention of the forced adaptive reasoning in the article. My guess is they don't have enough compute to do anything else here.

lherron 5 hours ago||

Are they also going to refund all the extra usage api $$$ people spent in the last month?

Also I don’t know how “improving our Code Review tool” is going to improve things going forward, two of the major issues were intentional choices. No code review is going to tell them to stop making poor and compromising decisions.

zem 4 hours ago||

this is one reason i will not pay for extra usage - it is an incentive for them to be inefficient, or at least to not spend any effort on improving my token usage efficiency.

dallen33 5 hours ago|||

No, they will not.

FireBeyond 1 hour ago||

Even for all of us plan users, where we got barely any use from our plan because we'd destroy our 5h and 1w usage limits, also unlikely, after all they have an out of "your usage limits are guaranteed to be 5x of Pro users" (who are also being screwed).

Of course, all their vibe coding is being done with effectively infinite tokens, so...

vintagedave 6 hours ago||

> Today we are resetting usage limits for all subscribers.

I asked for this via support, got a horrible corporate reply thread, and eventually downgraded my account. I'm using Codex now as we speak. I could not use Claude any more, I couldn't get anything done.

Will they restore my account usage limits? Since I no longer have Max?

Is that one week usage restored, or the entire buggy timespan?

sowbug 5 hours ago|

[dead]

nickdothutton 8 hours ago||

I presume they don't yet have a cohesive monetization strategy, and this is why there is such huge variability in results on a weekly basis. It appears that Anthropic are skipping from one "experiment" to another. As users we only get to see the visible part (the results). Can't design a UI that indicates the software is thinking vs frozen? Does anyone actually believe that?

slashdave 4 hours ago|

Compute is limited worldwide. No amount of money can make these compute platforms appear overnight. They are buying time because the only other option is to stop accepting customers.

skeledrew 5 hours ago||

Some of these changes and effects seriously affect my flow. I'm a very interactive Claude user, preferring to provide detailed guidance for my more serious projects instead of just letting them run. And I have multiple projects active at once, with some being untouched for days at a time. Along with the session limits this feels like compounding penalties as I'm hit when I have to wait for session reset (worse in the middle of a long task), when I take time to properly review output and provide detailed feedback, when I'm switching among currently active projects, when I go back to a project after a couple days or so,... This is honestly starting to feel untenable.

cedws 7 hours ago||

>On April 16, we added a system prompt instruction to reduce verbosity

In practice I understand this would be difficult but I feel like the system prompt should be versioned alongside the model. Changing the system prompt out from underneath users when you've published benchmarks using an older system prompt feels deceptive.

At least tell users when the system prompt has changed.

elAhmo 7 hours ago|

Its also kinda funny they have to rely on system prompt to control verbosity itself.

esafak 1 hour ago||

It's cheaper than retraining the model.

dataviz1000 9 hours ago||

This is the problem with co-opting the word "harness". What agents need is a test harness but that doesn't mean much in the AI world.

Agents are not deterministic; they are probabilistic. If the same agent is run it will accomplish the task a consistent percentage of the time. I wish I was better at math or English so I could explain this.

I think they call it EVAL but developers don't discuss that too much. All they discuss is how frustrated they are.

A prompt can solve a problem 80% of the time. Change a sentence and it will solve the same problem 90% of time. Remove a sentence it will solve the problem 70% of the time.

It is so friggen' easy to set up -- stealing the word from AI sphere -- a TEST HARNESS.

Regressions caused by changes to the agent, where words are added, changed, or removed, are extremely easy to quantify. It isn’t pass/fail. It’s whether the agent still solves the problem at the same percentage of the time it consistently has.

arjie 8 hours ago||

The word is not co-opted. A harness is just supportive scaffolding to run something. A test harness is scaffolding to run tests against software, a fuzz harness is scaffolding to run a fuzzer against the software, and so on. I've seen it being used in this manner many times over the past 15 years. It's the device that wraps your software so you can run it repeatedly with modifications of parameters, source code, or test condition.

dataviz1000 7 hours ago||

> A harness is just supportive scaffolding to run something.

Thank you for the perfect explanation.

Last week in my confusion about the word because Anthropic was using test, eval, and harness in the same sentence so I thought Anthropic made a test harness, I used Google asking "in computer science what is a harness". It responded only discussing test harnesses which solidified my thinking that is what it is.

I wish Google had responded as clearly you did. In my defense, we don't know if we understand something unless we discuss it.

thesz 7 hours ago||

To have some confidence in consistency of results (p-value), one has to start from cohort of around 30, if I remember correctly. This is 1.5 orders of magnitude increase of computing power needed to find (absence of) consistent changes of agent's behavior.

dataviz1000 7 hours ago||

I apologize for the potato quality of these links, however, I have been working tirelessly to wrap my head how to reason about how agents and LLM models work. They are more than just a black box.

The first tries to answer what happens when I give the models harder and harder arithmetic problems to the point Sonnet will burn 200k tokens for 20minutes. [0]

The other is a very deep dive into the math of a reasoning model in the only way I could think to approach it, with data visualizations, seeing the computation of the model in real time in relation to all the parts.[1]

Two things I've learned are that the behavior of an agent that will reverse engineer any website and the behavior of an agent that does arithmetic are the same. Which means the probability that either will solve their intended task is the same for the given agent and task -- it is a distribution. The other, is that models have a blind spot, therefore creating a red team adversary bug hunter agent will not surface a bug if the same model originally wrote the code.

Understanding that, knowing that I can verify at the end or use majority of votes (MoV), using the agents to automate extremely complicated tasks can be very reliable with an amount of certainty.

[0] https://adamsohn.com/reliably-incorrect/

[1] https://adamsohn.com/grpo/

voxelc4L 2 hours ago||

I’ve stuck to the non-1M context Opus 4.6 and it works really well for me, even with on-going context compression. I honestly couldn’t deal with the 1M context change and then the compounding token devouring nonsense of 4.7 I sincerely hope Anthropic is seeing all of this and taking note. They have their work cut out for them.

lukebechtel 8 hours ago|

Some people seem to be suggesting these are coverups for quantization...

Those who work on agent harnesses for a living realize how sensitive models can be to even minor changes in the prompt.

I would not suspect quantization before I would suspect harness changes.

More comments...