Posted by mfiguiere 2 days ago

An update on recent Claude Code quality reports(www.anthropic.com)
756 points | 571 comments
gnegggh 2 days ago|
Not the first time. Still not showing thinking, are we?
gilrain 2 days ago||
Hi Boris, random observer here. Would you consider apologizing to the community for mistakenly closing tickets related to this and then wrongly keeping them closed when, internally, you realized they were legitimate?

I think an apology for that incident would go a long way.

KronisLV 2 days ago||
This reads like good news! They probably still lost a bunch of users due to the negative public sentiment and not responding quickly enough, but at least they addressed it with a good bit of transparency.
xlayn 2 days ago||
If Anthropic is doing this as a result of "optimizations", they need to stop doing that and raise the price. The other thing: there should be a way to test a model and validate that it answers exactly the same each time. I've experienced it twice... when a new model is about to come out... the quality of the top-dog one starts going down... and bam... the new model is so good... like the previous one was 3 months ago.

The other thing: when Anthropic turns on lazy Claude... (I want to coin the term Claudez here for the version of Claude that's lazy... Claude zzZZzz = Claudez) that thing is terrible... you ask the model for something... and it's like... oh yes, that will probably depend on memory bandwidth... do you want me to search that?...

YES... DO IT... FRICKING MACHINE..

joshstrange 2 days ago||
It's incredibly frustrating when I've spelled out in CLAUDE.md that it should SSH to my dev server to investigate things I ask it to and it regularly stops working with a message of something like:

> Next steps are to run `cat /path/to/file` to see what the contents are

Makes me want to pull my hair out. I've specifically told you to go do all the read-only operations you want out on this dev server yet it keeps forgetting and asking me to do something it can do just fine (proven by it doing it after I "remind" it).

That and "Auto" mode really are grinding my gears recently. Now, after a Planning session my only option is to use Auto mode, and I have to manually change it back to "Dangerously skip permissions". I think these are related, since the times I've let it run in "Auto" mode are when it gives up/gets stuck more often.

Just the other day it was in Auto mode (by accident) and I told it:

> SSH out to this dev server, run `service my_service_name restart` and make sure there are no orphans (I was working on a new service and the start/stop scripts). If there are orphans, clean them up, make more changes to the start/stop scripts, and try again.

And it got stuck in some loop/dead-end, telling me I should do it myself because it didn't want to run commands out on a "Shared Dev server" (even though I had specifically told it this was not a shared server).

The fact that Auto mode burns more tokens _and_ is so dumb is really a kick in the pants.

marcyb5st 2 days ago|||
Apart from Anthropic, nobody knows how much the average user costs them. The consensus, however, is "much more than what they pay".

If they have to raise prices to stop hemorrhaging money, would you be willing to pay 1000 bucks a month for a Max plan? Or $100 per 1M output tokens? (Playing Numberwang here, but the point stands.)

If I had to guess, they are trying to get their balance sheet in order for an IPO, and they basically have 3 ways of achieving that:

1. Raising prices like you said, but the user drop could be catastrophic for the IPO itself and so they won't do that

2. Dumbing the models down (basically decreasing their cost per token)

3. Sending fewer tokens (i.e. capping thinking budgets aggressively).

2 and 3 are palatable because, even if they annoy the technical crowd, investors still see a big number of active users with a positive margin on each.

CamperBob2 2 days ago||
$1000/mo for guaranteed functionality >= Opus 4.6 at its peak? Yes, I'd probably grumble a bit and then whip out the credit card.

I'm not a heavy LLM user, and I've never come anywhere near the limits of the $200/month plan I'm already subscribed to. But when I do use it, I want the smartest, most relentless model available, operating at the highest performance level possible.

Charge what it takes to deliver that, and I'll probably pay it. But you can damned well run your A/B tests on somebody else.

dgellow 2 days ago|||
I would love it if agents acted much more like tools/machines and did NOT try to act as if they were human.
Keeeeeeeks 2 days ago|||
https://marginlab.ai/ (no affiliation)

There are a number of projects working on evals that can check how 'smart' a model is, but the methodology is tricky.

One would want to run the exact same prompt, every day, at different times of day, but if the eval prompt(s) are complex, the frontier lab could have a 'meta-cognitive' layer that looks for repetitive prompts and either: a) feeds the model a pre-written output to give to the user, or b) dumbs down the output for that specific prompt.

Both cases defeat the purpose in different ways, and make a consistent gauge difficult. And it would make sense for them to do that since you're 'wasting' compute compared to the new prompts others are writing.

hex4def6 2 days ago||
I think you could alter the prompt in subtle ways: a period becomes an ellipsis, extra commas, synonyms, occasional double spaces, etc.

Enough that the prompt is different at a token-level, but not enough that the meaning changes.

It would be very difficult for them to catch that, especially if the prompts were not made public.

Run the variations enough times per day, and you'd get some statistical significance.

I guess the fuzzy part is judging the output.
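The perturbation idea above could be sketched roughly like this. This is a minimal illustration, not any existing eval harness: the edit list and helper names are assumptions, and a real setup would pair the variants with an output-scoring step.

```python
import random

# Illustrative meaning-preserving edits: each changes the prompt at the
# token level without changing what it asks for.
PERTURBATIONS = [
    lambda s: s.replace(".", "...", 1),   # a period becomes an ellipsis
    lambda s: s.replace(", ", " , ", 1),  # extra spacing around a comma
    lambda s: s.replace(" ", "  ", 1),    # an occasional double space
    lambda s: s + " ",                    # trailing whitespace
]

def perturb(prompt: str, seed: int) -> str:
    """Apply one seeded, meaning-preserving edit to the prompt."""
    rng = random.Random(seed)
    return rng.choice(PERTURBATIONS)(prompt)

def daily_variants(prompt: str, n: int = 8) -> list[str]:
    """Generate n token-distinct variants for one day's eval runs."""
    return [perturb(prompt, seed) for seed in range(n)]

variants = daily_variants("Summarize this function, then list edge cases.")
```

Each variant tokenizes differently from the canonical prompt, so simple exact-match or hash-based fingerprinting on the provider side would not flag them as repeats; judging the outputs for equivalence remains the hard, fuzzy part.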

JyB 2 days ago||
This specifically is super annoying.
tdg5 2 days ago||
I missed the part about the refunds…
einrealist 2 days ago||
Is 'refactoring Markdown files' already a thing?
ireadmevs 2 days ago|
Read Claude’s skill to create other skills and you’ll see that this ship has already sailed

https://skills.sh/anthropics/skills/skill-creator

2001zhaozhao 2 days ago||
How about just not changing the harness abruptly in the first place? Make new system prompt changes "experimental" first so you can gather feedback.
davidfstr 2 days ago||
Good on Anthropic for giving an update & token refund, given the recent rumors of an inexplicable drop in quality. I applaud the transparency.
scuderiaseb 2 days ago|
Opus 4.7 was released a week ago, and at that point all limits were reset, so this was very beneficial to them: basically everyone's weekly limit was about to be reset anyway.
hirako2000 2 days ago||
In other words: we did the right things, but we understand the feedback, and oh, bugs happen.
throwaway2027 2 days ago|
Cool but I switched to Codex for the time being.