An update on recent Claude Code quality reports

Posted by mfiguiere 15 hours ago

An update on recent Claude Code quality reports (www.anthropic.com)

706 points | 523 comments | page 6

ankit219 12 hours ago|

An interesting question is why these optimizations were pushed so aggressively in the first place, especially given that this was when they were running a 2x promotion of their own accord, presumably without seeing any slowdown in demand.

wg0 7 hours ago||

A heavily vibe coded CLI would have tons of issues, regularly.

LLMs over-edit, and it's a known problem.

zem 10 hours ago||

ugh, caching based on idle time is horrible for my usage anyway; since claude is both fairly slow and doesn't really have much of a daily quota anyway I often tell it to do something and then wander off and come back to check on it when I next think about it. I always vaguely assumed that my session would not "detect" the intervening time anyway since it was all async. I guess from a global perspective time-based cache eviction makes sense.
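For anyone unsure what idle-time eviction means in practice, here is a minimal sketch. It is not Anthropic's implementation: the class name `IdleCache`, the `ttl` parameter, and the injectable `clock` are all assumptions made for illustration. An entry survives as long as it keeps being read within `ttl` seconds of its last use, and is dropped on the first read after a longer gap.

```python
import time
from typing import Callable, Optional


class IdleCache:
    """Hypothetical cache whose entries expire after `ttl` idle seconds."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock
        # Maps key -> (time of last access, cached value).
        self._entries = {}

    def put(self, key: str, value: str) -> None:
        self._entries[key] = (self.clock(), value)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        last_used, value = entry
        now = self.clock()
        if now - last_used > self.ttl:
            # Idle for too long: evict and report a miss.
            del self._entries[key]
            return None
        # A hit refreshes the idle timer.
        self._entries[key] = (now, value)
        return value
```

Under this scheme, wandering off for longer than `ttl` and then resuming is exactly the case that misses, even though nothing about the session itself changed.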

Alifatisk 15 hours ago||

It’s incredible how forgiving you guys are with Anthropic and their errors. Especially considering you pay a high price for their service and receive lower quality than expected.

saghm 15 hours ago||

At least personally, it feels like the choices are the one that's okay with being used for mass surveillance and autonomous weapons targeting, the one that's on track to get acquired by the AI company that dragged its feet in getting around to stopping people from making child porn with it, the one that nobody seems to use from Google, and the one that everyone complains about but also still seems to be using because it at least sometimes works well. At this point I've opted out of personal LLM coding by canceling my subscription (although my employer still has subscriptions and wants us to keep using them, so I'll presumably keep using Claude there) but if I had to pick one to spend my own money on I'd still go with Claude.

scblock 15 hours ago||

A valid choice, a moral choice, is none of the above.

goldfish_gemma4 8 hours ago||

[dead]

ed_elliott_asc 15 hours ago|||

I pay for 20x max and get so much more value out of it than I pay.

Avicebron 15 hours ago|||

It's still night and day the difference in quality between chatgpt5.4 and opus 4.7. Heck even on Perplexity where 5.4 is included in Pro vs 4.7 which is behind the max plan or whatever, I will pick sonnet 4.6 over the 5.4 offering and it's consistently better. I don't love Anthropic, I don't have illusions about them as a business.

But if a tool is better, it's better.

wahnfrieden 15 hours ago||

You aren’t getting the 5.4 experience for code if you’re not using it in the Codex harness

scottyah 15 hours ago|||

It's fairly small issues for an amazing product, and the company is just a few years old and growing rapidly. Also, they are leading a powerful technological revolution and their competitors are known to have multiple straight up evil tendencies. A little degradation is not an issue.

arnvald 14 hours ago|||

What's the alternative? Are you suggesting other LLM providers don't charge a high price? Or that they don't make mistakes? Or that they provide better quality?

We're talking about dynamically developed products, something that most people would have considered impossible just 5 years ago. A non-deterministic product that's very hard to test. Yes, Anthropic makes mistakes, models can get worse over time, their ToS change often. But again, is Gemini/GPT/Grok a better alternative?

timmg 14 hours ago|||

> It’s incredible how forgiving you guys are with Anthropic and their errors.

Ironically, I was thinking the exact opposite. This is bleeding edge stuff and they keep pushing new models and new features. I would expect issues.

I was surprised at how much complaining there is -- especially coming from people who have probably built and launched a lot of stuff and know how easy it is to make mistakes.

mlinsey 15 hours ago|||

The consumer surplus is quite high. Even with the regressions in this postmortem, performance was above the models last fall, when I was gladly paying for my subscription and thought it was net saving me time.

That said, there is now much better competition with Codex, so there's only so much rope they have now.

AntiUSAbah 15 hours ago|||

Because it is still good though.

If you have a good product, you are more understanding. And getting worse doesn't mean it's no longer valuable, only that the price/value ratio went down. But Opus 4.5 was significantly better and only came out in November.

There was no price increase at that time, so for the same money we get better models. Opus 4.6 again feels significantly better, though.

Also moving fastish means having more/better models faster.

I do know plenty of people, though, who use opencode or pi and openrouter and switch models a lot more often.

lukasus 15 hours ago|||

At the time you wrote your comment there were 4 other comments, all of them very negative towards Anthropic and the blog post in question. How did you reach this conclusion?

lukan 15 hours ago|||

Confused as well. I rather supposed Anthropic had some standing for saying no to Trump and being declared a national security threat, but the anger they got, and people leaving again for OpenAI, who gladly said yes to autonomous killing AI, did astonish me a bit. I also had weird things happening with my usage limits and was not happy about it. But it is still very useful to me, and I only pay for the pro plan.

sunaookami 14 hours ago||

> I rather supposed Anthropic had some standing for saying no to Trump and being declared a national security threat

I never understood why people cheered for Anthropic then when they happily work together with Palantir.

unselect5917 15 hours ago|||

HN glazes anthropic every single time I see it come up. This is as obvious as HN's political bias.

operatingthetan 14 hours ago|||

I don't think Anthropic has to inform their customers of every change they make, but they should have with this one.

jgbuddy 15 hours ago|||

Anthropic actually not so bad. Anthropic models code good, usually. Price not so high compared to time to do it by self.

OsrsNeedsf2P 15 hours ago|||

Look at any criticism of Mythos. Some members on HN are defending it tooth and nail, despite it not being released

fastball 15 hours ago|||

What high price? I pay $200/m for an insane number of tokens.

oytis 15 hours ago|||

Remember Louis CK talking about Wi-Fi on an airplane? People are dealing with highly experimental technology here

tempest_ 15 hours ago|||

A lot of people are provided their access through work.

They don't actually pay the bill or see it.

mystraline 15 hours ago|||

Exactly. They've done now like 6 rug-pulls.

Idiots keep throwing money at real-time enshittification and "I am changing the terms. Pray I do not change them further".

And yes, I am absolutely calling people who keep getting screwed and paying for more 'service' idiots.

And Anthropic has proved that they will pay for less and less. So, why not fuck them over and make more company money?

natdempk 15 hours ago||

As an end-user, I feel like they're kind of over-cooking and under-describing the features and behavior of what is a tool at the end of the day. Today the models are in a place where the context management, reasoning effort, etc. all need to be very stable to work well.

The thing about session resumption changing the context of a session by truncating thinking is a surprise to me, I don't think that's even documented behavior anywhere?

It's interesting to look at how many bugs are filed on the various coding agent repos. Hard to say how many are real / unique, but quantities feel very high and not hard to run into real bugs rapidly as a user as you use various features and slash commands.

kristianc 13 hours ago||

To think we'd have known about this in advance if they'd just have open sourced Claude Code, rather than them being forced into this embarrassing post mortem. Sunlight is the best disinfectant.

KronisLV 14 hours ago||

This reads like good news! They probably still lost a bunch of users due to the negative public sentiment and not responding quickly enough, but at least they addressed it with a good bit of transparency.

xlayn 15 hours ago||

If anthropic is doing this as a result of "optimizations" they need to stop doing that and raise the price. The other thing, there should be a way to test a model and validate that the model is answering exactly the same each time. I have experienced twice... when a new model is going to come out... the quality of the top dog one starts going down... and bam.. the new model is so good.... like the previous one 3 months ago.

The other thing, when anthropic turns on lazy claude... (I want to coin here the term Claudez for the version of claude that's lazy.. Claude zzZZzz = Claudez) that thing is terrible... you ask the model for something... and it's like... oh yes, that will probably depend on memory bandwidth... do you want me to search that?...

YES... DO IT... FRICKING MACHINE..

joshstrange 14 hours ago||

It's incredibly frustrating when I've spelled out in CLAUDE.md that it should SSH to my dev server to investigate things I ask it to and it regularly stops working with a message of something like:

> Next steps are to run `cat /path/to/file` to see what the contents are

Makes me want to pull my hair out. I've specifically told you to go do all the read-only operations you want out on this dev server yet it keeps forgetting and asking me to do something it can do just fine (proven by it doing it after I "remind" it).

That and "Auto" mode really are grinding my gears recently. Now, after a Planning session my only option is to use Auto mode and I have to manually change it back to "Dangerously skip permissions". I think these are related, since the times I've let it run in "Auto" mode are when it gives up/gets stuck more often.

Just the other day it was in Auto mode (by accident) and I told it:

> SSH out to this dev server, run `service my_service_name restart` and make sure there are no orphans (I was working on a new service and the start/stop scripts). If there are orphans, clean them up, make more changes to the start/stop scripts, and try again.

And it got stuck in some loop/dead-end, telling me I should do it and that it didn't want to run commands out on a "Shared Dev server" (even though I had specifically told it that this was not a shared server).

The fact that Auto mode burns more tokens _and_ is so dumb is really a kick in the pants.

marcyb5st 14 hours ago|||

Apart from Anthropic nobody knows how much the average user costs them. However the consensus is "much more than that".

If they have to raise prices to stop hemorrhaging money, would you be willing to pay 1000 bucks a month for a max plan? Or $100 per 1M output tokens (playing numberWang here, but the point stands).

If I have to guess they are trying to get balance sheet in order for an IPO and they basically have 3 ways of achieving that:

1. Raising prices like you said, but the user drop could be catastrophic for the IPO itself and so they won't do that

2. Dumb the models down (basically decreasing their cost per token)

3. Send fewer tokens (i.e., capping thinking budgets aggressively).

2 and 3 are palatable because, even if they annoy the technical crowd, investors still see a big number of active users with a positive margin for each.

CamperBob2 10 hours ago||

$1000/mo for guaranteed functionality >= Opus 4.6 at its peak? Yes, I'd probably grumble a bit and then whip out the credit card.

I'm not a heavy LLM user, and I've never come anywhere near the limits of the $200/month plan I'm already subscribed to. But when I do use it, I want the smartest, most relentless model available, operating at the highest performance level possible.

Charge what it takes to deliver that, and I'll probably pay it. But you can damned well run your A/B tests on somebody else.

dgellow 14 hours ago|||

I would love if agents would act way more like tools/machines and NOT try to act as if they were humans

Keeeeeeeks 15 hours ago|||

https://marginlab.ai/ (no affiliation)

There are a number of projects working on evals that can check how 'smart' a model is, but the methodology is tricky.

One would want to run the exact same prompt, every day, at different times of the day, but if the eval prompt(s) are complex, the frontier lab could have a 'meta-cognitive' layer that looks for repetitive prompts, and either: a) feeds the model a pre-written output to give to the user, or b) dumbs down output for that specific prompt.

Both cases defeat the purpose in different ways, and make a consistent gauge difficult. And it would make sense for them to do that since you're 'wasting' compute compared to the new prompts others are writing.

hex4def6 14 hours ago||

I think you could alter the prompt in subtle ways; a period becomes an ellipsis, extra commas, synonyms, occasional double-spaces, etc.

Enough that the prompt is different at a token-level, but not enough that the meaning changes.

It would be very difficult for them to catch that, especially if the prompts were not made public.

Run the variations enough times per day, and you'd get some statistical significance.

I guess the fuzzy part is judging the output.
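A rough sketch of the perturbation idea, in Python. Everything here is an assumption for illustration: the tiny `SYNONYMS` table, the probabilities, and the function names `perturb` and `variants` are made up, and a real harness would need a far larger synonym list and some check that the meaning really is unchanged.

```python
import random

# Hypothetical synonym table; a real harness would need a much larger one.
SYNONYMS = {"quick": "fast", "show": "display", "error": "failure"}


def perturb(prompt: str, rng: random.Random) -> str:
    """Return a copy of `prompt` that differs only on the surface."""
    out = []
    for word in prompt.split(" "):
        core = word.rstrip(".,")
        tail = word[len(core):]
        # Sometimes swap a word for a listed synonym.
        if core in SYNONYMS and rng.random() < 0.5:
            core = SYNONYMS[core]
        # Sometimes turn a period into an ellipsis.
        if tail == "." and rng.random() < 0.5:
            tail = "..."
        out.append(core + tail)
    # Sometimes use a double space between words.
    pieces = [out[0]]
    for word in out[1:]:
        pieces.append("  " if rng.random() < 0.2 else " ")
        pieces.append(word)
    return "".join(pieces)


def variants(prompt: str, n: int, seed: int = 0, max_tries: int = 200) -> list:
    """Collect up to `n` distinct variants, none identical to `prompt`."""
    rng = random.Random(seed)
    found = []
    for _ in range(max_tries):
        candidate = perturb(prompt, rng)
        if candidate != prompt and candidate not in found:
            found.append(candidate)
        if len(found) == n:
            break
    return found
```

Each variant would then be sent as its own request, and the scores compared across days. Judging those outputs is still the open part.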

JyB 12 hours ago||

This specifically is super annoying.

rebolek 11 hours ago||

> On April 16, we added a system prompt instruction to reduce verbosity.

What verbosity? Most of the time I don’t know what it’s doing.

whalesalad 7 hours ago|

They don’t either.

hirako2000 9 hours ago|

In other words we did the right things, but we understand feedback, oh and bugs happen.
