An update on recent Claude Code quality reports

Posted by mfiguiere 7 hours ago

An update on recent Claude Code quality reports (www.anthropic.com)

565 points | 430 comments

6keZbCECT2uB 7 hours ago|

"On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6"

This makes no sense to me. I often leave sessions idle for hours or days and use the capability to pick it back up with full context and power.

The default thinking level seems more forgivable, but given the churn in system prompts, I'll need to figure out how to intentionally choose a refresh cycle.

bcherny 6 hours ago||

Hey, Boris from the Claude Code team here.

Normally, when you have a conversation with Claude Code, if your convo has N messages, then (N-1) messages hit prompt cache -- everything but the latest message.

The challenge is: when you let a session idle for >1 hour, when you come back to it and send a prompt, it will be a full cache miss, all N messages. We noticed that this corner case led to outsized token costs for users. In an extreme case, if you had 900k tokens in your context window, then idled for an hour, then sent a message, that would be >900k tokens written to cache all at once, which would eat up a significant % of your rate limits, especially for Pro users.
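
To make the accounting concrete, here is a minimal sketch of the behaviour described above. The one-hour window comes from the comment; the function name and the per-message token list are made up for illustration, and this is not Anthropic's actual billing logic.

```python
CACHE_TTL_SECONDS = 3600  # the ">1 hour" idle window mentioned above


def tokens_for_turn(message_tokens, idle_seconds, ttl=CACHE_TTL_SECONDS):
    """Split one turn's prompt into cache reads and cache writes.

    message_tokens: token count of each of the N messages, oldest first.
    Returns (read_from_cache, written_to_cache).
    """
    if not message_tokens:
        return 0, 0
    if idle_seconds > ttl:
        # Cache expired: all N messages miss and have to be written again.
        return 0, sum(message_tokens)
    # Warm cache: the first N-1 messages hit, only the latest one is new.
    return sum(message_tokens[:-1]), message_tokens[-1]


history = [300_000, 300_000, 300_000, 500]  # ~900k tokens of context plus a new prompt
print(tokens_for_turn(history, idle_seconds=60))    # (900000, 500)
print(tokens_for_turn(history, idle_seconds=4000))  # (0, 900500)
```

The second call is the corner case: the same short prompt, sent after the idle window, turns the whole context into a cache write.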

We tried a few different approaches to improve this UX:

1. Educating users on X/social

2. Adding an in-product tip to recommend running /clear when re-visiting old conversations (we shipped a few iterations of this)

3. Eliding parts of the context after idle: old tool results, old messages, thinking. Of these, thinking performed the best, and when we shipped it, that's when we unintentionally introduced the bug in the blog post.

Hope this is helpful. Happy to answer any questions you have.

dbeardsl 6 hours ago|||

I appreciate the reply, but I was never under the impression that gaps in conversations would increase costs nor reduce quality. Both are surprising and disappointing.

I feel like that is a choice best left up to users.

i.e. "Resuming this conversation with full context will consume X% of your 5-hour usage bucket, but that can be reduced by Y% by dropping old thinking logs"

kiratp 1 minute ago|||

By caching they mean “cached in GPU memory”. That’s a very very scarce resource.

Caching to RAM and disk is a thing but it’s hard to keep performance up with that and it’s early days of that tech being deployed anywhere.

Disclosure: work on AI at Microsoft. Above is just common industry info (see work happening in vLLM for example)

giwook 3 hours ago||||

Another way to think about it might be that caching is part of Anthropic's strategy to reduce costs for its users, but they are now trying to be more mindful of their costs (probably partly due to significant recent user growth as well as plans to IPO which demand fiscal prudence).

Perhaps if we were willing to pay more for our subscriptions Anthropic would be able to have longer cache windows but IDK one hour seems like a reasonable amount of time given the context and is a limitation I'm happy to work around (it's not that hard to work around) to pay just $100 or $200 a month for the industry-leading LLM.

Full disclosure: I've recently signed up for ChatGPT Pro as well in addition to my Claude Max sub so not really biased one way or the other. I just want a quality LLM that's affordable.

jimkleiber 50 minutes ago|||

I might be willing to pay more, maybe a lot more, for a higher subscription than claude max 20x, but the only thing higher is pay per token and i really dont like products that make me have to be that minutely aware of my usage, especially when it has unpredictability to it. I think there's a reason most telecoms went away from per minute or especially per MB charging. Even per GB, as they often now offer X GB, and im ok with that on phone but much less so on computer because of the unpredictability of a software update size.

Kinda like when restaurants make me pay for ketchup or a takeaway box, i get annoyed, just increase the compiled price.

sharts 2 hours ago|||

That doesn’t make sense to pay more for cache warming. Your session for the most part is already persisted. Why would it be reasonable to pay again to continue where you left off at any time in the future?

jeremyjh 2 hours ago|||

Because it significantly increases actual costs for Anthropic.

If they ignored this then all users who don’t do this much would have to subsidize the people who do.

cadamsdotcom 55 minutes ago|||

Sure, it wouldn’t make sense if they only had one customer to serve :)

JumpCrisscross 5 hours ago||||

> I was never under the impression that gaps in conversations would increase costs

The UI could indicate this by showing a timer before context is dumped.

karsinkk 5 hours ago|||

Yes!! A UI widget that shows how far along on the prompt cache eviction timelines we are would be great.

vyr 1 hour ago||||

a countdown clock telling you that you should talk to the model again before your streak expires? that's the kind of UX i'd expect from an F2P mobile game or an abandoned shopping cart nag notification

abustamam 1 hour ago||

Well sure if you put it that way, they're similar. But it's either you don't see it and you get surprised by increased quota usage, or you do see it and you know what it means. Bonus points if they let you turn it off.

No need to gamify it. It's just UI.

jimkleiber 54 minutes ago|||

I tried to hack the statusline to show this but when i tried, i don't think the api gave that info. I'd love if they let us have more variables to access in the statusline.

computably 5 hours ago||||

> I was never under the impression that gaps in conversations would increase costs nor reduce quality. Both are surprising and disappointing.

You didn't do your due diligence on an expensive API. A naïve implementation of an LLM chat is going to have O(N^2) costs from prompting with the entire context every time. Caching is needed to bring that down to O(N), but the cache itself takes resources, so evictions have to happen eventually.
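
A rough way to see the O(N^2) versus O(N) difference is to count how many prompt tokens get processed over a whole conversation. This is a simplified sketch: it ignores output tokens and the (reduced) price of cache reads, and the function name is invented for the example.

```python
def tokens_processed(turn_sizes, cached):
    """Total prompt tokens the model must process across a conversation.

    turn_sizes: tokens added by each turn, in order.
    cached: if True, earlier turns are reused and only new tokens are processed.
    """
    total = 0
    context = 0
    for size in turn_sizes:
        context += size
        # Without a cache the entire context so far is re-processed every turn.
        total += size if cached else context
    return total


turns = [1000] * 3
print(tokens_processed(turns, cached=False))  # 6000 = 1000 * (1 + 2 + 3)
print(tokens_processed(turns, cached=True))   # 3000
```

With N equal-sized turns the uncached total grows as N(N+1)/2, so three turns cost six units of work rather than three or nine.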

doesnt_know 5 hours ago|||

How do you do "due diligence" on an API that frequently makes undocumented changes and only publishes acknowledgement of change after users complain?

You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.

tempest_ 34 minutes ago||

I use CC, and I understand what caching means.

I have no idea how that works with a LLM implementation nor do I actually know what they are caching in this context.

solarkraft 5 hours ago||||

I somewhat disagree that this is due diligence. Claude Code abstracts the API, so it should abstract this behavior as well, or educate the user about it.

mpyne 3 hours ago||

> Claude Code abstracts the API, so it should abstract this behavior as well, or educate the user about it.

Does mmap(2) educate the developer on how disk I/O works?

At some point you have to know something about the technology you're using, or accept that you're a consumer of the ever-shifting general best practice, shifting with it as the best practice shifts.

websap 54 minutes ago|||

Does using print() in Python mean I need to understand the Kernel? This is an absurd thought.

zem 2 hours ago|||

mmap(2) and all its underlying machinery are open source and well documented besides.

mpyne 2 hours ago||

There are open-source and even open-weight models that operate in exactly this way (as it's based off of years of public research), and even if there weren't the way that LLMs generate responses to inputs is superbly documented.

Seems like every month someone writes up a brilliant article on how to build an LLM from scratch or similar that hits the HN page, usually with fancy animated blocks and everything.

It's not at all hard to find documentation on this topic. It could be made more prominent in the U/I but that's true of lots of things, and hammering on "AI 101" topics would clutter the U/I for actual decision points the user may want to take action upon that you can't assume the user already knows about in the way you (should) be able to assume about how LLMs eat up tokens in the first place.

margalabargala 4 hours ago||||

Okay, sure. There's a dollar/intelligence tradeoff. Let me decide to make it, don't silently make Claude dumber because I forgot about a terminal tab for an hour. Just because a project isn't urgent doesn't mean it's not important. If I thought it didn't need intelligence I would use Sonnet or Haiku.

someguyiguess 5 hours ago||||

Yes. It’s perfectly reasonable to expect the user to know the intricacies of the caching strategy of their llm. Totally reasonable expectation.

jghn 2 hours ago|||

To some extent I'd say it is indeed reasonable. I had observed the effect for a while: if I walked away from a session I noticed that my next prompt would chew up a bunch of context. And that led me to do some digging, at which point I discovered their prompt caching.

So while I'd agree with your sarcasm that expecting users to be experts of the system is a big ask, where I disagree with you is that users should be curious and actively attempting to understand how it works around them. Given that the tooling changes often, this is an endless job.

abustamam 1 hour ago||

> users should be curious and actively attempting to understand how it works

Have you ever talked with users?

> this is an endless job

Indeed. If we spend all our time learning what changed with all our tooling when it changes without proper documentation then we spend all our working lives keeping up instead of doing our actual jobs.

Octoth0rpe 35 minutes ago||

There are general users of the average SaaS, and there are claude code users. There's no doubt in my mind that our expectations should be somewhat higher for CC users re: memory. I'm personally not completely convinced that cache eviction should be part of their thought process while using CC, but it's not _that_ much of a stretch.

coldtea 3 hours ago|||

It's not like they have a powerful all-knowing oracle that can explain it to them at their dispos... oh, wait!

esafak 3 hours ago||

They have to know that this could bite them and to ask the question first.

nixpulvis 3 hours ago||

I do think having some insight into the current state of the cache and a realistic estimate for prompt token use is something we should demand.

exac 3 hours ago||||

It is more useful to read posts and threads like this exact thread IMO. We can't know everything, and the currently addressed market for Claude Code is far from people who would even think about caching to begin with.

kovek 4 hours ago||||

What if the cache was backed up to cold storage? Instead of having to recompute everything.

bontaq 3 hours ago||||

How's that O(N^2)? How's it O(N) with caching? Does a 3 turn conversation cost 3 times as much with no caching, or 9 times as much?

jannyfer 2 hours ago||

I’m not sure that it’s O(N) with caching but this illustrates the N^2 part:

https://blog.exe.dev/expensively-quadratic

kang 4 hours ago||||

It seems you haven't done the due diligence on what part of the API is expensive - constructing a prompt shouldn't be the same charge/cost as an LLM pass.

coldtea 3 hours ago||

It seems you haven't done the due diligence on what the parent meant :)

It's not about "constructing a prompt" in the sense of building the prompt string. That of course wouldn't be costly.

It is about reusing llm inference state already in GPU memory (for the older part of the prompt that remains the same) instead of rerunning the prompt and rebuilding those attention tensors from scratch.
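
As a toy illustration of that idea, the sketch below keys stored state by token prefix and counts how many tokens still need computing on each request. The class and its methods are hypothetical, and a real server stores attention tensors per token rather than an opaque placeholder.

```python
class PrefixCache:
    """Toy stand-in for reusing inference state keyed by token prefix."""

    def __init__(self):
        self._states = {}  # tuple of tokens -> opaque placeholder "state"

    def longest_cached_prefix(self, tokens):
        """Length of the longest prefix of `tokens` that has stored state."""
        for end in range(len(tokens), 0, -1):
            if tuple(tokens[:end]) in self._states:
                return end
        return 0

    def run(self, tokens):
        """Process a prompt and return how many tokens had to be computed."""
        reused = self.longest_cached_prefix(tokens)
        computed = len(tokens) - reused
        self._states[tuple(tokens)] = object()  # placeholder for real tensors
        return computed

    def evict_all(self):
        """Simulate the cache expiring."""
        self._states.clear()


cache = PrefixCache()
print(cache.run([1, 2, 3, 4]))           # 4: nothing cached yet
print(cache.run([1, 2, 3, 4, 5, 6]))     # 2: the first four tokens are reused
cache.evict_all()
print(cache.run([1, 2, 3, 4, 5, 6, 7]))  # 7: everything is recomputed
```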

kang 3 hours ago||

You not only skipped the diligence but confused everyone repeating what I said :(

That is what caching is doing: the LLM inference state is being reused. (Attention vectors are an internal artefact at this level of abstraction; effectively, at this level, it's the prompt.)

The part of the prompt that has already been inferred no longer needs to be a part of the input, to be replaced by the inference subset. And none of this is tokens.

raron 5 hours ago||||

How big is this cached data? Wouldn't it be possible to download it after idling a few minutes "to suspend the session", and upload and restore it when the user starts their next interaction?

throwdbaaway 3 hours ago|||

Should be about 10~20 GiB per session. Save/restore is exactly what DeepSeek does using its 3FS distributed filesystem: https://github.com/deepseek-ai/3fs#3-kvcache

With this much cheaper setup backed by disks, they can offer much better caching experience:

> Cache construction takes seconds. Once the cache is no longer in use, it will be automatically cleared, usually within a few hours to a few days.
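
For a sense of where a figure of that order can come from, here is a back-of-the-envelope sketch using the standard key/value cache size formula for a transformer. The model dimensions are hypothetical, chosen only to show the order of magnitude; they are not the dimensions of any Claude or DeepSeek model.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to hold keys and values for n_tokens of context."""
    # The leading 2 accounts for storing both a key and a value per head per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_tokens * per_token


# Hypothetical dimensions with 16-bit values.
size = kv_cache_bytes(n_tokens=100_000, n_layers=48, n_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.2f} GiB")  # 18.31 GiB
```

The size scales linearly with context length, which is why long sessions are the expensive ones to keep resident or to move to disk.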

cyanydeez 3 hours ago|||

I often see a local model QWEN3.5-Coder-Next grow to about 5 GB or so over the course of a session using llamacpp-server. I'd bet these trillion parameter models are even worse. Even if you wanted to download it or offload it or offered that as a service, to start back up again, you'd _still_ be paying the token cost because all of that context _is_ the tokens you've just done.

The cache is what makes your journey from 1k prompt to 1million token solution speedy in one 'vibe' session. Loading that again will cost the entire journey.

miroljub 3 hours ago|||

This sounds like a religious cult priest blaming the common people for not understanding the cult leader's wish, which he never clearly stated.

nixpulvis 3 hours ago||||

How else would you implement it?

cyanydeez 3 hours ago|||

It'd probably be helpful for power users and transparency to actually show how the cache is being used. If you run local models with llamacpp-server, you can watch how the cache slots fill up with every turn; when subagents spawn, you see another process id spin up and it takes up a cache slot; when the model starts slowing down is when the context grows (amd 395+ around 80-90k) and the cache loads are bigger because you've got all that.

So yeah, it doesn't take much to surface to the user that the speed/value of their session is ephemeral because to keep all that cache active is computationally expensive because ...

You're still just running text through a extremely complex process, and adding to that text and to avoid re-calculation of the entire chain, you need the cache.

Confiks 6 minutes ago||||

So you made this change completely invisible to the user, without the user being able to choose between the two behaviors, and without even documenting it in the (extremely verbose) changelog [1]? I can't find it, the Docs Assistant can't find it (well, it said "I found it!" three times when fed your reply, each time with a non-matching item).

I frequently debug issues while keeping my carefully curated but long context active for days. Losing potentially very important context while in the middle of a debugging session resulting in less optimal answers, is costing me a lot more money than the cache misses would.

It's a clear reminder that these closed-source harnesses cannot be trusted (now or in the future), and I should find proper alternatives for Claude Code as soon as possible.

[1] https://code.claude.com/docs/en/changelog

btown 6 hours ago||||

Is there a way to say: I am happy to pay a premium (in tokens or extra usage) to make sure that my resumed 1h+ session has all the old thinking?

I understand you wouldn't want this to be the default, particularly for people who have one giant running session for many topics - and I can only imagine the load involved in full cache misses at scale. But there are other use cases where this thinking is critical - for instance, a session for a large refactor or a devops/operations use case consolidating numerous issue reports and external findings over time, where the periodic thinking was actually critical to how the session evolved.

For example, if N-4 was a massive dump of some relevant, some irrelevant material (say, investigating for patterns in a massive set of data, but prompted to be concise in output), then N-4's thinking might have been critical to N-2 not getting over-fixated on that dump from N-4. I'd consider it mission-critical, and pay a premium, when resuming an N some hours later to avoid pitfalls just as N-2 avoided those pitfalls.

Could we have an "ultraresume" that, similar to ultrathink, would let a user indicate they want to watch Return of the (Thin)king: Extended Edition?

CjHuber 5 hours ago|||

I think it’s crazy that they do this, especially without any notice. I would not have renewed my subscription if I knew that they started doing this.

Especially in the analysis part of my work I don‘t care about the actual text output itself most of the time but try to make the model „understand“ the topic.

In the first phase the actual text output itself is worthless; it just serves as an indicator that the context was processed correctly and the future actual analysis work can depend on it. And they‘re… just throwing most of the relevant stuff out without any notice when I resume my session after a few days?

This is insane, Claude literally became useless to me and I didn’t even know it until now, wasting a lot of my time building up good session context.

There would be nothing lost if they said „If you click yes, we will prune your old thinking making Claude faster and saving you tons of tokens“. Most people would say yes probably so why not ask them… make it an env variable (one that is announced, not a secretly introduced one to opt out of something new!) or at least write it in a change log if they really don’t want to allow people to use it like before, so there‘d be a chance to cancel the subscription in time instead of wasting tons of time on work patterns that no longer work

munk-a 5 hours ago|||

Pointing at their terms of service will definitely be the instantly summoned defense (as it would be for most modern companies), but the fact that SaaS can so suddenly shift the quality of product being delivered for their subscription without clear notification or explicit re-enrollment is definitely a legal oversight right now, and Italy actually did recently clamp down on Netflix doing this[1]. It's hard to define what user expectations of a continuous product are and how companies may have violated them - and for a long time social constructs kept this pretty in check. As obviously inactive and forgotten about subscriptions have become a more significant revenue source for services that agreement has been eroded, though, and the legal system has yet to catch up.

1. Specifically, this suit was about price increases without clear consideration for both parties - but the same justifications apply to service restrictions without corresponding price decreases.

https://fortune.com/2026/04/20/italian-court-netflix-refunds...

jetbalsa 5 hours ago||||

So to defend a little: it's a cache, it has to go somewhere, and it's a save state of the model's inner workings at the time of the last message. So if it expires, it has to process the whole thing again. Most people don't understand that with every message the ENTIRE history of the conversation is processed again and again without that cache. That conversation might have hit several gigs worth of model weights, and are you expecting them to keep that around for /all/ of the conversations you have had with it in separate sessions?

3836293648 5 hours ago|||

No? It's not because it's a cache, it's because they're scared of letting you see the thinking trace. If you got the trace you could just send it back in full when it got evicted from the cache. This is how open weight models work.

mpyne 3 hours ago|||

The trace goes back fine, that's not the issue.

The issue is that if they send the full trace back, it will have to be processed from the start if the cache expired, and doing that will cause a huge one-time hit against your token limit if the session has grown large.

So what Boris talked about is stripping things out of the trace that goes back to regenerate the session if the cache expires. Doing this would help avert burning up the token limit, but it is technically a different conversation, so if CC chooses poorly on stripping parts of the context then it would lead to Claude getting all scatter-brained.
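
The intended "prune once after an idle gap" behaviour can be sketched as below. The message shape (dicts with a "type" field), the class, and the method names are assumptions for illustration; the thread does not describe how Claude Code implements this or what exactly caused the bug.

```python
IDLE_THRESHOLD = 3600  # seconds; the one-hour idle window from the blog post


def strip_thinking(messages):
    """Drop thinking blocks, keeping user and assistant messages."""
    return [m for m in messages if m["type"] != "thinking"]


class Session:
    def __init__(self):
        self.messages = []
        self.last_active = 0.0

    def prepare_turn(self, now):
        """Return the context to send, pruning old thinking only after an idle gap."""
        if now - self.last_active > IDLE_THRESHOLD:
            self.messages = strip_thinking(self.messages)
        # Refreshing the timestamp is what keeps the pruning a one-off:
        # the next turn is no longer "idle", so new thinking is kept.
        self.last_active = now
        return list(self.messages)
```

A version that kept treating the session as idle on every later turn would discard each new thinking block as well, which matches the "forgetful and repetitive" symptom quoted at the top of the thread.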

eknkc 5 hours ago||||

I’m not familiar with the Claude API but OpenAI has an encrypted thinking messages option. You get something that you can send back but it is encrypted. Not available on Anthropic?

reactordev 5 hours ago|||

They are sending it back to the cache, the part you are missing is they were charging you for it.

eknkc 5 hours ago||

The blog post says they prune them now not to charge you. That’s the change they implemented.

reactordev 4 hours ago||

right. they were charging you for it, now they aren't because they are just dropping your conversation history.

rsfern 4 hours ago||||

It seems like an opportunity for a hierarchical cache. Instead of just nuking all context on eviction, couldn’t there be an L2 cache with a longer eviction time so task switching for an hour doesn’t require a full session replay?
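
A minimal sketch of that two-level idea, assuming a fast tier with the one-hour window and a slower tier with a longer one. The 24-hour second-tier lifetime and all names here are invented for the example; timestamps are passed in explicitly so the behaviour is easy to follow.

```python
class TwoTierCache:
    """Tracks which tier a session's cached state would be served from."""

    def __init__(self, l1_ttl=3600, l2_ttl=24 * 3600):
        self.l1_ttl = l1_ttl  # fast tier, e.g. accelerator memory
        self.l2_ttl = l2_ttl  # slower tier, e.g. RAM or disk (assumed lifetime)
        self._last_used = {}  # session id -> timestamp of last use

    def touch(self, session_id, now):
        """Record that the session was used at time `now`."""
        self._last_used[session_id] = now

    def lookup(self, session_id, now):
        """Return 'l1', 'l2', or 'miss' for a session at time `now`."""
        last = self._last_used.get(session_id)
        if last is None:
            return "miss"
        idle = now - last
        if idle <= self.l1_ttl:
            return "l1"
        if idle <= self.l2_ttl:
            return "l2"
        # Past both windows: drop the entry, a full replay is needed.
        del self._last_used[session_id]
        return "miss"


cache = TwoTierCache()
cache.touch("session-a", now=0)
print(cache.lookup("session-a", now=600))      # l1
print(cache.lookup("session-a", now=7200))     # l2
print(cache.lookup("session-a", now=100_000))  # miss
```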

CjHuber 4 hours ago||||

No of course it’s unrealistic for them to hold the cache indefinitely and that’s not the point. You are keeping the session data yourself so you can continue even after cache expiry. The point I‘m making is that it made me very angry that without any announcement they changed behavior to strip the old thinking even when you have it in your session file. There is absolutely no reason to not ask the user about if they want this

And it’s part of a larger problem of unannounced changes it‘s just like when they introduced adaptive thinking to 4.6 a few weeks ago without notice.

Also they seem to be completely unaware that some users might only use Claude code because they are used to it not stripping thinking in contrast to codex.

Anyway I‘m happy that they saw it as a valid refund reason

cyanydeez 3 hours ago||||

what matters isn't that it's a cache; what matters is it's cached _in the GPU/NPU_ memory and taking up space from another user's active session; to keep that cache in the GPU is a nonstarter for an oversold product. Even putting it into cold storage means they still have to load it at the cost of the compute, generally speaking, because it again takes up space from an oversold product.

FireBeyond 51 minutes ago|||

> There would be nothing lost if they said „If you click yes, we will prune your old thinking making Claude faster and saving you tons of tokens“. Most people would say yes probably so why not ask them

The irony is that Claude Design does this. I did a big test building a design system, and when I came back to it, it had in the chat window "Do you need all this history for your next block of work? Save 120K tokens and start a new chat. Claude will still be able to use the design system." Or words to that effect.

CjHuber 15 minutes ago||

This is exactly what also confused me. I had the exact same prompt in Claude code as well, and the no option implies you can also keep the whole history. But clicking keep apparently only ever kept the user and assistant messages not the whole actual thinking parts of the conversation

trinsic2 4 hours ago||||

Why can't you just build a project document that outlines the prompt that you want to do? Or have Claude save your progress in memory so you can pick it up later? That's what I do. It seems abhorrent to expect to have a running prompt that is left idle for long periods of time just so you can pick it up at a moment's whim...

Terretta 3 hours ago|||

You know that memory goes back into a prompt as context that wasn't cached, so... that just adds work.

Granted, the "memory" can be available across session, as can docs...

try-working 2 hours ago|||

recursive-mode does just that: https://recursive-mode.dev/introduction

elAhmo 5 hours ago|||

Don't you have that by just resuming old convo?

The only issue is that it didn't hit the cache so it was expensive if you resume later.

eknkc 5 hours ago|||

Not at the moment apparently. They remove the thinking messages when you continue after 1 hour. That was the whole idea of that change. So the LLM gets all your messages, its responses etc but not the thinking parts, i.e. why it generated those responses. You get a lobotomised session.

elAhmo 4 hours ago||

OK didn't know that. I also resume fairly old sessions with 100-200k of context, and I sometimes keep them active for a while (but with large breaks in between).

Still on Opus 4.6 with no adaptive thinking, so didn't really notice anything worse in the past weeks, but who knows.

tbrockman 5 hours ago|||

Or generate tiny filler messages every hour until you come back to it.

uxcolumbo 4 hours ago||||

I don't envy you Boris. Getting flak from all sorts of places can't be easy. But thanks for keeping a direct line with us.

I wish Anthropic's leadership would understand that the dev community is such a vital community that they should appreciate a bit more (i.e. it's not nice sending lawyers after various devs without asking nicely first, banning accounts without notice, etc etc). Appreciate it's not easy to scale.

OpenAI seems to be doing a much better job when it comes to developer relations, but I would like to see you guys 'win' since Anthropic shows more integrity and has clear ethical red lines they are not willing to cross unlike OpenAI's leadership.

artdigital 2 hours ago||||

I'm also a Claude Code user from day 1 here, back from when it wasn't included in the Pro/Max subscriptions yet, and I was absolutely not aware of this either. Your explanation makes sense, but I naively was also under the impression that re-using older existing conversations that I had open would just continue the conversation as is and not be treated as a full cache miss.

My biggest learning here is the 1 hour cache window. I often have multiple Claudes open and it happens frequently that they're idle for 1+ hours.

This cache information should probably get displayed somewhere within Claude Code

bcherny 1 hour ago||

Yep, agree. We added a little "/clear to save XXX tokens" notice in the bottom right, and will keep iterating on this. Thanks for being an early user!

Implicated 1 hour ago||

But.. that doesn't solve the problem of having no indication in-session when it'll lose the cache. A nudge to /clear does nothing to indicate "or else face significant cost" nor does it indicate "your cache is stale".

Love the product. <3

kuboble 4 hours ago||||

As some others have mentioned.

I think the best option would be tell a user who is about to resurrect a conversation that has been evicted from cache that the session is not cached anymore and the user will have to face a full cost of replaying a session, not only the incremental question and answer.

(I understand that under the hood LLMs are n^2 by default, but it's very counterintuitive - and given how popular cc is becoming outside of nerd circles, probably a smaller and smaller fraction of users is aware of it)

I would like to decide on it case by case. Sometimes the session has some really deep insight I want to preserve, sometimes it's discardable.

a_t48 4 hours ago|||

I got exactly this warning message yesterday, saying that it could use up a significant amount of my token budget if I resumed the conversation without compaction.

onemoresoop 4 hours ago|||

Im glad they chose to do that as opposed to hidden behavior changes that only confuse users more.

fhub 4 hours ago|||

Really good to know. That should have made it into their update letter in point (2). Empowering the user to choose is the right call.

skeledrew 3 hours ago|||

> I think the best option would be tell a user who is about to resurrect a conversation that has been evicted from cache that the session is not cached anymore and the user will have to face a full cost of replaying a session

This feature has been live for a few days/weeks now, and with that knowledge I try to remember to at least get a progress report written when I'm, for example, close to the quota limit and the context is reasonably large. Or continue with a /compact, but that tends to lead to me having to repeat some things that didn't get included in the summary. Context management is just hard.

Terretta 3 hours ago||

Right, and reloading that context is the same cost as refilling the cache, so really, they're charging the same, and making it hard.

isaacdl 6 hours ago||||

Thanks for giving more information. Just as a comment on (1), a lot of people don't use X/social. That's never going to be a sustainable path to "improve this UX" since it's...not part of the UX of the product.

It's a little concerning that it's number 1 in your list.

Terretta 3 hours ago||||

This violates the principle of least surprise, with nothing to indicate Claude got lobotomized while it napped when so many use prior sessions as "primed context" (even if people don't know that's what they were doing or know why it works).

The purpose of spending 10 to 50 prompts getting Claude to fill the context for you is it effectively "fine tunes" that session into a place your work product or questions are handled well.

(If this notion of sufficient context as fine tune seems surprising, the research is out there.)

Approaches tried need to deal with both of these:

1) Silent context degradation breaks the Pro-tool contract. I pay compute so I don't pay in my time; if you want to surface the cost, surface it (UI + price tag or choice), don't silently erode quality of outcomes.

2) The workaround (external context files re-primed on return) eats the exact same cache miss, so the "savings" are illusory — you just pushed the cost onto the user's time. If my own time's cheap enough that's the right trade off, I shouldn't be using your machine.

fidrelity 6 hours ago||||

Just wanted to say I appreciate your responses here. Engaging so directly with a highly critical audience is a minefield that you're navigating well.

Thank you.

qsort 6 hours ago|||

I agree with this.

I'm writing this message even though I don't have much to add because it's often the case on HN that criticism is vocal and appreciation is silent and I'd like to balance out the sentiment.

Anthropic has fumbled on many fronts lately but engaging honestly like this is the right thing to do. I trust you'll get back on track.

troupo 6 hours ago||||

> Engaging so directly with a highly critical audience is a minefield that you're navigating well.

They spent two months literally gaslighting this "critical audience" that this could not be happening and literally blaming users for using their vibe-coded slop exactly as advertised.

All the while all the official channels refused to acknowledge any problems.

Now the dissatisfaction and subscription cancellations have reached a point where they finally had to do something.

rob 6 hours ago||

Examples of gaslighting on April 15th (the first 2 issues were "fixed" by April 10th according to the story):

https://x.com/bcherny/status/2044291036860874901 https://x.com/bcherny/status/2044299431294759355

No mention of anything like "hey, we just fixed two big issues, one that lasted over a month." Just casual replies to everybody like nothing is wrong and "oh there's an issue? just let us know we had no idea!"

troupo 3 hours ago||

Don't forget "our investigation concluded you are to blame for using the product exactly as advertised" https://x.com/lydiahallie/status/2039800718371307603 including gems like "Sonnet 4.6 is the better default on Pro. Opus burns roughly twice as fast. Switch at session start"

shimman 6 hours ago|||

Very easy to do when you stand to make tens of millions when your employer IPOs. Let's not maybe give too much praise and employ some critical thinking here.

simplify 6 hours ago|||

What is the purpose of this mindset? Should we encourage typical corporate coldness instead?

sdevonoes 5 hours ago||

We should encourage minimal dependency on multibillion-dollar tech companies like Anthropic. They, and similar companies, are just milking us… but since their toys are so shiny, we don’t care.

simplify 1 hour ago||

Sure, but that seems out of scope of the original comment.

hgoel 6 hours ago|||

Is "employ some critical thinking" supposed to involve being an annoying uptight cynic?

saadn92 5 hours ago||||

I leave sessions idle for hours constantly - that's my primary workflow. If resuming a 900k context session eats my rate limit, fine, show me the cost and let me decide whether to /clear or push through. You already show a banner suggesting /clear at high context - just do the same thing here instead of silently lobotomizing the model.
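A rough sketch of what "show me the cost and let me decide" could look like. Everything here is hypothetical: the `Session` fields, the `resume_notice` name, and the one-hour threshold are assumptions for illustration, not how Claude Code is actually implemented.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed: the one-hour idle window discussed in this thread.
CACHE_TTL_SECONDS = 60 * 60


@dataclass
class Session:
    context_tokens: int   # tokens currently in the conversation
    idle_seconds: float   # time since the last message


def resume_notice(session: Session, ttl: float = CACHE_TTL_SECONDS) -> Optional[str]:
    """Return a warning to show the user, or None if the cache should still be warm."""
    if session.idle_seconds <= ttl:
        return None
    hours = session.idle_seconds / 3600
    return (
        f"This session has been idle for {hours:.1f}h. "
        f"Resuming will re-process about {session.context_tokens:,} tokens. "
        "Continue with full context, or /clear first?"
    )
```

The point is only that the decision stays with the user: nothing is dropped from context unless they pick /clear.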

sdevonoes 5 hours ago||

So if they fuck it up again and now they have, let’s say, “db problems” instead of “caching problems”, you would happily simply pay more? Wtf

saadn92 5 hours ago|||

No, I wouldn't. I'd like some transparency at least.

albedoa 5 hours ago|||

Did you reply to the wrong comment? I don't see that implied here at all. What?

ceuk 6 hours ago||||

Is having massive sessions which sit idle for hours (or days) at a time considered unusual? That's a really, really common scenario for me.

Two questions if you see this:

1) if this isn't best practice, what is the best way to preserve highly specific contexts?

2) does this issue just affect idle sessions or would the cache miss also apply to /resume ?

hedgehog 5 hours ago|||

Have the tool maintain a doc, and use either the built-in memory or (I prefer it this way) your own. I've been pretty critical of some other aspects of how Claude Code works but on this one I think they're doing roughly the right thing given how the underlying completion machinery works.

Edit: If you message me I can share some of my toolchain, it's probably similar to what a lot of other people here use but I've done some polishing recently.

Asharma538 5 hours ago||

[dead]

jetbalsa 5 hours ago|||

The cache is stored on Anthropic's servers, since it's a save state of the LLM's weights at the time of processing. It's several gigs in size. Every SINGLE TIME you send a message and it's a cache miss, you have to reprocess the entire message again, eating up tons of tokens in the process.

cyanydeez 3 hours ago||

Clarification though: the cache that's important to the GPU/NPU is loaded directly in the memory of the cards; it's not saved anywhere else. They could technically create cold storage of the tokens (vectors) and load that, but given how ephemeral all these vibe coders are, it's unlikely there's any value in saving those vectors to load in.

So then it comes to what you're talking about, which is processing the entire text chain, which is a different kind of cache, and generating the equivalent tokens is what's being costed.

But once you realize the efficiency of the product in extended sessions is cached in the immediate GPU hardware, then it's obvious that the oversold product can't just idle the GPU when sessions idle.

mandeepj 24 minutes ago||||

> that would be >900k tokens written to cache all at once

Probably that's why I hit my weekly limits 3-4 days ago, and was scheduled to reset later today. I just checked, and they are already reset.

Not sure if it's already done, but shouldn't there be a check somewhere that alerts when an outrageous number of tokens is being written, since that means something isn't right?

mtilsted 5 hours ago||||

Then you need to update your documentation and teach claude to read the new documentation because here is what claude code answered:

Question: Hey claude, if we have a conversation, and then i take a break. Does it change the expected output of my next answer, if there are 2 hours between the previous message and the next one?

Answer: No. A 2-hour gap doesn't change my output. I have no internal clock between messages — I only see the conversation content plus the currentDate context injected each turn. The prompt cache may expire (5 min TTL), which affects cost/latency but not the response itself.

  The only things that can change output across a break: new context injected (like updated date), memory files being modified, or files on disk changing.

-- This answer directly contradicts your post. It seems like the biggest problem is a total lack of documentation for expected behavior.

A similar thing happens if I ask claude code for the difference between plan mode, and accept edits on.

Then Claude told me the only difference was that with plan mode it would ask for permission before doing edits. But I really don't think this is true. It seems like plan mode does a lot more work, and presents it in a totally different way. It is not just a "I will ask before applying changes" mode.

ryeguy 51 minutes ago||

This isn't how LLMs work. They aren't self aware like this, they're trained on the general internet. They might have some pointers to documentation for certain cases, but they generally aren't going to have specialized knowledge of themselves embedded within. Claude code has no need to know about its own internal programming, the core loop is just javascript code.

CjHuber 1 minute ago||

It does have a built-in documentation subagent it can invoke, but that doesn’t help much if they don’t document their shenanigans.

iidsample 6 hours ago||||

We at UT-Austin have done some academic work to handle the same challenge. Will be curious if serving engines could be modified. https://arxiv.org/abs/2412.16434

The core idea is we can use user-activity at the client to manage KV cache loading and offloading. Happy to chat more!!

kccqzy 3 hours ago||||

This just does not match my workflow when I work on low-priority projects, especially personal projects when I do them for fun instead of being paid to do them. With life getting busy, I may only have half an hour each night with Claude to make some progress on it before having to pause and come back the next day. It’s just the nature of doing personal projects as a middle-aged person.

The above workflow basically doesn’t hit the rate limit. So I’d appreciate a way to turn off this feature.

ryanisnan 5 hours ago||||

Why does the system work like that? Is the cache local, or on Claude's servers?

Why not store the prompt cache to disk when it goes cold for a certain period of time, and then when a long-lived, cold conversation gets re-initiated, you can re-hydrate the cache from disk. Purge the cached prompts from disk after X days of inactivity, and tell users they cannot resume conversations over X days without burning budget.
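To make the idea concrete, here is a toy two-tier cache with that shape: entries drop from a hot tier to a cold tier after a short idle period, get re-hydrated on access, and are purged after a longer one. This is a sketch under stated assumptions, not a description of any real serving stack; the class name, the TTL values, and the in-memory dicts standing in for "GPU memory" and "disk" are all made up for illustration.

```python
import time


class TieredPromptCache:
    """Toy two-tier cache: a hot tier with a short TTL, a cold tier with a long one."""

    def __init__(self, hot_ttl, cold_ttl, clock=time.monotonic):
        self.hot_ttl = hot_ttl      # seconds of idleness before demotion
        self.cold_ttl = cold_ttl    # seconds of idleness before purge
        self.clock = clock
        self.hot = {}               # session_id -> (state, last_used)
        self.cold = {}              # session_id -> (state, last_used)

    def put(self, session_id, state):
        self.hot[session_id] = (state, self.clock())
        self.cold.pop(session_id, None)

    def sweep(self):
        """Demote idle hot entries to the cold tier, then purge expired cold entries."""
        now = self.clock()
        for sid, (state, used) in list(self.hot.items()):
            if now - used > self.hot_ttl:
                self.cold[sid] = (state, used)
                del self.hot[sid]
        for sid, (_, used) in list(self.cold.items()):
            if now - used > self.cold_ttl:
                del self.cold[sid]

    def get(self, session_id):
        """Return (state, tier) where tier is 'hot', 'cold', or 'miss'."""
        self.sweep()
        if session_id in self.hot:
            state, _ = self.hot[session_id]
            self.hot[session_id] = (state, self.clock())
            return state, "hot"
        if session_id in self.cold:
            state, _ = self.cold.pop(session_id)
            self.hot[session_id] = (state, self.clock())  # re-hydrate into the hot tier
            return state, "cold"
        return None, "miss"
```

A "cold" result would be the slow-but-cheap path (show a spinner while loading); only a "miss" would cost a full recompute.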

jetbalsa 5 hours ago||

The cache is on Anthropic's servers; it's like a freeze frame of the LLM's inner workings at the time. The LLM can pick up directly from this save state. As you can guess, this save state has bits of the underlying model, their secret sauce, so it cannot be saved locally...

dicethrowaway1 5 hours ago||

Maybe they could let users store an encrypted copy of the cache? Since the users wouldn't have Anthropic's keys, it wouldn't leak any information about the model (beyond perhaps its number of parameters judging by the size).

jetbalsa 5 hours ago|||

I'm unsure of the sizes needed for prompt cache, but I suspect it's several gigs in size (a percentage of the model weight size). How would the user upload this every time they resumed an old idle session? Also, are they going to save /every/ session you do this with?

skissane 5 hours ago|||

They could let you nominate an S3 bucket (or Azure/GCP/etc equivalent). Instead of dropping data from the cache, they encrypt it and save it to the bucket; on a cache miss they check the bucket and try to reload from it. You pay for the bucket; you control the expiry time for it; if it costs too much you just turn it off.

im3w1l 5 hours ago|||

A few gigs of disk is not that expensive. Imo they should allocate every paying user (at least) one disk cache slot that doesn't expire after any time. Use it for their most recent long chat (a very short question-answer that could easily be replayed shouldn't evict a long convo).

spunker540 1 minute ago||

What's lost on this thread is that these caches are in very tight supply: they are literally on the GPUs running inference. The GPUs must load all the tokens in the conversation (expensive), and then continuing the conversation can leverage the GPU cache to avoid re-loading the full context up to that point. But obviously GPUs are in super tight supply, so if a thread has been dead for a while, they need to re-use the GPU for other customers.

northern-lights 3 hours ago|||

Encryption can only ensure the confidentiality of a message from a non-trusted third party but when that non-trusted third party happens to be your own machine hosting Claude Code, then it is pointless. You can always dump the keys (from your memory) that were used to encrypt/decrypt the message and use it to reconstruct the model weights (from the dump of your memory).

dicethrowaway1 3 hours ago||

jetbalsa said that the cache is on Anthropic's server, so the encryption and decryption would be server-side. You'd never see the encryption key, Anthropic would just give you an encrypted dump of the cache that would otherwise live on its server, and then decrypt with their own key when you replay the copy.

bobkb 5 hours ago||||

Resuming sessions after more than 1 hour is a very common workflow that many teams follow. It would be great if this were treated as expected behaviour and the UX were designed around it. Perhaps you are not realising that Claude Code has replaced the shells people were using (i.e. bash is now replaced with a Claude Code session).

trinsic2 1 hour ago||

I think that's a bad idea. Expecting to keep a prompt open like this, accumulating context, puts a load on the back end. It's one of those things that is a bad habit, like trying to maintain open tabs in a browser as a way to keep your workflow up to date when what you really should be doing is taking notes of your process and working from there.

I have project folders/files and memory stored for each session, when I come back to my projects the context is drawn from the memory files and the status that were saved in my project md files.

Create a better workflow for yourself and your teams and do it the right way. Quit expecting the prompt to store everything for you.

For the Claude team: if you haven't already, I'd recommend you create some best practices for people that don't know any better; otherwise people are going to expect things to be a certain way, and it's going to cause a lot of friction when people can't do what they expect to be able to do.

Joeri 5 hours ago||||

This sounds like one of those problems where the solution is not a UX tweak but an architecture change. Perhaps prompt cache should be made long term resumable by storing it to disk before discarding from memory?

slashdave 3 hours ago|||

Disk where? LLM requests are routed dynamically. You might not even land in the same data center.

FuckButtons 1 hour ago||

But if you have a tiered cache, then waiting several seconds / minutes is still preferable to getting a cache miss. I suspect the larger problem is the amount of tinkering they are doing with the model makes that not viable.

kivle 5 hours ago|||

I agree. Maybe parts of the cache contents are business secrets. But then store a server-side-encrypted version on the user's disk so that it can be resumed without wasting 900k tokens?

8note 5 hours ago||||

reasonably, if i'm in an interactive session, it's going to have breaks for an hour or more.

what's driving the hour cache? shouldn't people be able to have lunch, then come back and continue?

are you expecting claude code users to not attend meetings?

I think product-wise you might need a better story on who uses claude-code, when and why.

Same thing with session logs actually - i know folks who are definitely going to try to write a yearly RnD report and monthly timesheets based on text analysis of their claude code session files, and they're going to be incredibly unhappy when they find out it's all been silently deleted

FuckButtons 1 hour ago||

As with everything Anthropic recently this is a supply constraint issue. They have not planned for scale adequately.

toephu2 3 hours ago||||

How does the Claude team recommend devs use Claude Code?

1) Is it okay to leave Claude Code CLI open for days?

2) Should we be using /clear more generously? e.g., on every single branch change, on every new convo?

try-working 2 hours ago||||

You created this issue by setting a timer for cache clearing. Time is really not a dimension that plays any role in how coding agent context is used.

useyourforce 42 minutes ago||||

I actually have a suggestion here - do not hide token count in non-verbose mode in Claude Code.

dnnddidiej 3 hours ago||||

It is too surprising. Time passed should not matter for using AI.

Either swallow the cost or be transparent to the user and offer both options each time.

FuckButtons 1 hour ago||||

From a utility perspective using a tiered cache with some much higher latency storage option for up to n hours would be very useful for me to prevent that l1 cache miss.

chris1993 1 hour ago||||

So this explains why resuming a session after a 5-hour timeout basically eats most of the next session. How then to avoid this?

BoppreH 3 hours ago||||

Isn't that exactly what people had been accusing Anthropic of doing, silently making Claude dumber on purpose to cut costs? There should be, at minimum, a warning on the UI saying that parts of the context were removed due to inactivity.

ohcmon 5 hours ago||||

Boris, wait, wait, wait,

Why not use a tiered cache?

Obviously storage is waaay cheaper than recalculation of embeddings all the way from the very beginning of the session.

No matter how you put this explanation, it still sounds strange. Hell, you can even store the cache on the client if you must.

Please, tell me I’m not understanding what is going on..

otherwise you really need to hire someone to look at this!)

krackers 4 hours ago|||

Same question I had in https://news.ycombinator.com/item?id=47819914

I still don't understand it, yes it's a lot of data and presumably they're already shunting it to cpu ram instead of keeping it on precious vram, but they could go further and put it on SSD at which point it's no longer in the hotpath for their inference.

solarkraft 5 hours ago||||

I assume they are already storing the cache on flash storage instead of keeping it all in VRAM. KV caches are huge - that’s why it’s impractical to transfer to/from the client. It would also allow figuring out a lot about the underlying model, though I guess you could encrypt it.

What would be an interesting option would be to let the user pay more for longer caching, but if the base length is 1 hour I assume that would become expensive very quickly.

tonyarkles 4 hours ago|||

Just to contextualize this... https://lmcache.ai/kv_cache_calculator.html. They only have smaller open models, but for Qwen3-32B with 50k tokens it's coming up with 7.62GB for the KV cache. Imagining a 900k session with, say, Opus, I think it'd be pretty unreasonable to flush that to the client after being idle for an hour.
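For anyone who wants the arithmetic behind a number like that: KV cache size grows linearly with tokens, since each token stores one key and one value vector per layer per KV head. A minimal sketch follows. The shape values below (40 layers, 8 KV heads, head dimension 128, 2 bytes per value) are illustrative assumptions, not taken from any model's published config, and the calculator linked above may use different ones; they just happen to land in the same ballpark as its figure.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to hold keys and values for `tokens` tokens.

    The leading factor of 2 accounts for storing both a key and a value
    vector per token, per layer, per KV head.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical shape, for illustration only.
SHAPE = dict(layers=40, kv_heads=8, head_dim=128, bytes_per_value=2)

for n_tokens in (50_000, 900_000):
    size = kv_cache_bytes(n_tokens, **SHAPE)
    print(f"{n_tokens:>7,} tokens -> {size / 2**30:.1f} GiB")
```

With these assumed values, 50k tokens comes to 8,192,000,000 bytes (about 7.6 GiB), and the size scales linearly from there, so 900k tokens is 18x that. A larger model would be bigger still.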

2001zhaozhao 2 hours ago||||

I wonder whether prompt caches would be the perfect use case of something like Optane.

It's kept for long enough that it's expensive to store in RAM, but short enough that the writes are frequent and will wear down SSD storage

ohcmon 4 hours ago|||

Yes — encryption is the solution for client side caching.

But even if it’s not — I can’t build a scenario in my head where recalculating it on real GPUs is cheaper/faster than retrieving it from some kind of slower cache tier

rkuska 5 hours ago|||

I don't think you can store the cache on client given the thinking is server side and you only get summaries in your client (even those are disabled by default).

sargunv 5 hours ago||

If they really need to guard the thinking output, they could encrypt it and store it client side. Later it'd be sent back and decrypted on their server.

But they used to return thinking output directly in the API, and that was _the_ reason I liked Claude over OpenAI's reasoning models.

the-grump 5 hours ago||||

That is understandable, but the issue is the sudden drop in quality and the silent surge in token usage.

It also seems like the warning should be in channel and not on X. If I wanted to find out how broken things are on X, I'd be a Grok user.

infogulch 5 hours ago||||

How big is the cache? Could you just evict the cache into cheap object storage and retrieve it when resuming? When the user starts the conversation back up show a "Resuming conversation... ⭕" spinner.

arcza 2 hours ago||||

You need to seriously look at your corporate communications and hire some adults to standardise your messaging, comms and signals. The volatility behind your doors is obvious to us and you'd impress us much more if you slowed down, took a moment to think about your customers and sent a consistent message.

You lost huge trust with the A/B sham test. You lost trust with enshittification of the tokenizer on 4.6 to 4.7. Why not just say "hey, due to huge input prices in energy, GPU demand and compute constraints we've had to increase Pro from $20 to $30." You might lose 5% of customers. But the shady A/B thing and dodgy tokenizer increasing burn rate tells everyone inc. enterprise that you don't care about honesty and integrity in your product.

I hope this feedback helps because you still stand to make an awesome product. Just show a little more professionalism.

nextaccountic 5 hours ago||||

what about selling long term cache space to users?

or even, let the user control the cache expiry on a per request basis. with a /cache command

that way they decide if they want to drop the cache right away, or extend it for 20 hours etc

it would cost tokens even if the underlying resource is memory/SSD space, not compute

troupo 6 hours ago||||

> We tried a few different approaches to improve this UX: 1. Educating users on X/social

No. You had random developers tweet and reply at random times to random users while all of your official channels were completely silent. Including channels for people who are not terminally online on X

Terretta 2 hours ago||

There's a cultural divide between SV and the 85% of SMB using M365, for example. When everyone you know uses a thing, I mean, who doesn't?*

There's a reason live service games have splash banners at every login. No matter what you pick as an official e-coms channel, most of your users aren't there!

* To be fair, of all these firms, ANTHROP\C tries the hardest to remember, and deliver like, some people aren't the same. Starting with normals doing normals' jobs.

gverrilla 6 hours ago||||

I drop sessions very frequently to resume later - that's my main workflow with how slow Claude is. Is there anything I can do to not encounter this cache problem?

growt 5 hours ago||||

Wasn’t cache time reduced to 5 minutes? Or is that just some users’ interpretation of the bug?

sockaddr 5 hours ago||||

Sorry but I think this should be left up to the user to decide how it works and how they want to burn their tokens. Also a countdown timer is better than all of these other options you mention.

kang 4 hours ago||||

> tokens written to cache all at once, which would eat up a significant % of your rate limits

Construction of context is not an llm pass - it shouldn't even count towards token usage. The word 'caching' itself says don't recompute me.

Since the devs on HN (& the whole world) are buying what looks like nonsense to me - what am I missing?

frumplestlatz 6 hours ago|||

The entire reason I keep a long-lived session around is because the context is hard-won, in terms of both tokens and my time.

Silently degrading intelligence ought to be something you never do, but especially not for use-cases like this.

I’m looking back at my past few weeks of work and realizing that these few regressions literally wasted 10s of hours of my time, and hundreds of dollars in extra usage fees. I ran out of my entire weekly quota four days ago, and had to pause the personal project I was working on.

I was running the exact same pipeline I’ve run repeatedly before, on the same models, and yet this time I somehow ate a week’s worth of quota in less than 24h. I spent $400 just to finish the pipeline pass that got stuck halfway through.

I’m sorry to be harsh, but your engineering culture must change. There are some types of software you can yolo. This isn’t one of them. The downstream cost of stupid mistakes is way, way too high, and far too many entirely avoidable bugs — and poor design choices — are shipping to customers way too often.

deaux 3 hours ago|||

> The entire reason I keep a long-lived session around is because the context is hard-won, in terms of both tokens and my time. Silently degrading intelligence ought to be something you never do, but especially not for use-cases like this.

Hard agree, would like to see a response to this.

8note 4 hours ago||||

as a variation:

how does this help me as a customer? if i have to redo the context from scratch, i will pay the high token cost again, and also spend my own time to fill it.

the cost of reloading the window didn't go away, it just went up even more

FireBeyond 37 minutes ago|||

> I’m sorry to be harsh, but your engineering culture must change. There are some types of software you can yolo. This isn’t one of them. The downstream cost of stupid mistakes is way, way too high, and far too many entirely avoidable bugs — and poor design choices — are shipping to customers way too often.

I have to imagine this isn't helped by working somewhere where you effectively have infinite tokens and usage of the product that people are paying for, sometimes a lot.

tadfisher 6 hours ago|||

It astounds me that a company valued in the hundreds-of-billions-of-dollars has written this. One of the following must be true:

1. They actually believed latency reduction was worth compromising output quality for sessions that have already been long idle. Moreover, they thought doing so was better than showing a loading indicator or some other means of communicating to the user that context is being loaded.

2. What I suspect actually happened: they wanted to cost-reduce idle sessions to the bare minimum, and "latency" is a convenient-enough excuse to pass muster in a blog post explaining a resulting bug.

someguyiguess 5 hours ago|||

It’s definitely a cost / resource saving strategy on their end.

billywhizz 2 hours ago||||

what's even more amazing is it took them two weeks to fix what must have been a pretty obvious bug, especially given who they are and what they are selling.

retinaros 6 hours ago|||

they just vibecoded a fix and didn't think about the tradeoff they were making, and their always-yes-man of a model just went with it

sockaddr 5 hours ago|||

Yeah this is actually quite shocking. In my earlier uses of CC I might noodle on a problem for a while, come back and update the plan, go shower, think, give CC a new piece of advice, etc. Basically treating it like a coworker. And I thought that it was a static conversation (at least on the order of a day or so). An hour is absurd IMO and makes me want to rethink whether I want to keep my anthropic plan.

seizethecheese 6 hours ago|||

It's also a bit of a fishy explanation for purging tokens older than an hour. This happens to also be their cache limit. I doubt it is incidental that this change would also dramatically drop their cost.

cma 6 hours ago||

They moved it to 5m around the same timeframe though: https://www.reddit.com/r/ClaudeAI/comments/1sk3m12/followup_...

zmmmmm 3 hours ago||

Seems like it would interact very badly with the time based usage reset. If lots of people are hitting their limit and then letting the session idle until they can come back, this wouldn't be an exception. It would almost be the default behaviour.

cmenge 2 hours ago||

Bit surprised about the amount of flak they're getting here. I found the article seemed clear, honest and definitely plausible.

The deterioration was real and annoying, and shines a light on the problematic lack of transparency about what exactly is going on behind the scenes and the somewhat arbitrary token-cost based billing - too many factors at play; if you wanted to trace that as a user, you might as well do the work yourself instead.

The fact that waiting for a long time before resuming a convo incurs additional cost and lag seemed clear to me from having worked with LLM APIs directly, but it might be important to make this more obvious in the TUI.

maronato 1 hour ago||

I agree that it’s plausible, and I hope they learn. But trust is earned, and Anthropic’s public responses this past month were dismissive and unhelpful.

Every one of these changes had the same goal: trading the intelligence users rely on for cheaper or faster outputs. Users adapt to how a model behaves, so sudden shifts without transparency are disorienting.

The timing also undercuts their narrative. The fixes landed right before another change with the same underlying intent rolled out. That looks more like they were just reacting to experiments rather than understanding the underlying user pain.

When people pay hundreds or thousands a month, they expect reliability and clear communication, ideally opt-in. Competitors are right there, and unreliability pushes users straight to them.

All of this points to their priorities not being aligned with their users’.

xpe 20 minutes ago||

> All of this points to their priorities not being aligned with their users’.

Framing this as "aligned" or "not aligned" ignores the interesting reality in the middle. It is banal to say an organization isn't perfectly aligned with its customers.

A more authentic conversation here is about people's expectations. This is what these conversations are surfacing. I haven't yet seen many intellectually rigorous attempts at a neutral retrospective of the situation here.

I'm not disagreeing with the commenter's frustration. But I think it can help to try out this thought experiment: take say the top three companies whose product you interact with on a regular basis. Take stock of (1) how fast that technology is moving; (2) how often things break from your POV; (3) how soon the company acknowledges it; (4) how long it takes for a fix. Then think "if you were a well-meaning competent person at that company, do you think your disappointments are more-or-less to be expected given the forces at play?"

My overall feel is that people underestimate the complexity of the systems and the extent of the growth and are focusing on what they've experienced: frustration. That's human. But the follow-on steps such as the attempts to make sense of the situation and assign blame ... these seem rather blinkered. They don't feel like full efforts at understanding the situation. They seem more like rationalization -- the search for some kind of smoking gun to explain why something frustrating happened that affects a core capability they use to make a living.

So, yeah, it is a big deal. But I put relatively low stock in the armchair quarterbacking I'm seeing.

epsteingpt 1 hour ago||

They gaslit people for months saying it wasn't an issue publicly.

That's the reason for the flak

thomassmith65 7 minutes ago||

And still are gaslighting:

  We take reports about degradation very seriously. We never intentionally degrade our models [...] On March 4, we changed Claude Code's default reasoning effort from high to medium

Anthropic is the best company of its kind, but that is badly worded PR.

podnami 7 hours ago||

They lost me at Opus 4.7

Anecdotally OpenAI is trying to get into our enterprise tooth and nail, and have offered unlimited tokens until summer.

Gave GPT5.4 a try because of this and honestly I don’t know if we are getting some extra treatment, but running it at extra high effort the last 30 days I’ve barely seen it make any mistakes.

At some points even the reasoning traces brought a smile to my face as it preemptively followed things that I had forgotten to instruct it about but were critical to get a specific part of our data integrity 100% correct.

dsco 6 hours ago||

Same here. I feel like all of these shenanigans could be because Anthropic are compute constrained, forcing them to take reckless risks around reducing it.

beering 3 hours ago|||

GPT-5.4 was already better than Opus 4.6 on a lot of areas, especially correctness and tricky logic. I’m eager to see if 5.5 is even better.

vorticalbox 6 hours ago|||

Extra high burns tokens, I find. I run 5.4 on medium for 90% of the tasks, and high if I see medium struggling; it's very focused and makes minimal changes.

dsco 6 hours ago|||

Yeah but it also then strikes the perfect balance between being meticulous and pragmatic. Also it pushes back much more often than other models in that mode.

DANmode 5 hours ago|||

Rework burns tokens.

someguyiguess 3 hours ago|||

I went back to 4.5. No regrets and it’s a bit cheaper.

SkyPuncher 3 hours ago||

Same here. 4.6 was a downgrade in thinking quality, but I appreciated the extended context at first.

Over time, I realized the extended context became randomly unreliable. That was worse to me than having to compact and know where I was picking up.

cube2222 6 hours ago|||

I’ve never been one to complain about new models, and also didn’t experience most of the issues folks were citing about Claude Code over the last couple months. I’ve been using it since release, happy with almost each new update.

Until Opus 4.7 - this is the first time I rolled back to a previous model.

Personality-wise it’s the worst of AI, “it’s not x, it’s y”, strong short sentences, in general a bullshitty vibe, also gaslighting me that it fixed something even though it didn’t actually check.

I’m not sure what’s up, maybe it’s tuned for harnesses like Claude Design (which is great btw) where there’s an independent judge to check it, but for now, Opus 4.6 it is.

robeym 5 hours ago|||

What's your workflow like? I'd be curious to test OpenAI out again but Claude Code is how I use the models. Does it require relearning another workflow?

beering 3 hours ago||

Isn’t it basically the same thing? You type what you want into the input box and it does what you ask for.

epsteingpt 1 hour ago|||

Truth

enraged_camel 6 hours ago||

I find that it is better at thinking broadly and at a high level, on tasks that are tangential to coding like UX flows, product management and planning of complex implementations. I have yet to see it perform better than either Opus 4.6 or 4.7 though.

bityard 6 hours ago||

My hypothesis is that some of this is a perceived quality drop due to "luck of the draw" where it comes to the non-deterministic nature of LLM output.

A couple weeks ago, I wanted Claude to write a low-stakes personal productivity app for me. I wrote an essay describing how I wanted it to behave and I told Claude pretty much, "Write an implementation plan for this." The first iteration was _beautiful_ and was everything I had hoped for, except for a part that went in a different direction than I was intending because I was too ambiguous in how to go about it.

I corrected that ambiguity in my essay but instead of having Claude fix the existing implementation plan, I redid it from scratch in a new chat because I wanted to see if it would write more or less the same thing as before. It did not--in fact, the output was FAR worse even though I didn't change any model settings. The next two burned down, fell over, and then sank into the swamp but the fourth one was (finally) very much on par with the first.

I'm taking from this that it's often okay (and probably good) to simply have Claude re-do tasks to get a higher-quality output. Of course, if you're paying for your own tokens, that might get expensive in a hurry...
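To put rough numbers on the re-roll idea, here is a back-of-the-envelope sketch. It assumes each fresh attempt is independently "good" with some fixed probability `p`, which is a simplification for illustration; the 2-out-of-4 figure is just the anecdote above, not a measurement.

```python
def p_at_least_one_good(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts is good."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


def expected_attempts(p: float) -> float:
    """Average number of attempts until the first good one (geometric)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return 1.0 / p


# Assumed p = 0.5, loosely matching "2 good plans out of 4".
for n in (1, 2, 4):
    print(n, p_at_least_one_good(0.5, n))
```

Under that assumption a handful of retries gets you a good draw most of the time, and the token bill scales with `expected_attempts(p)`.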

coffeefirst 3 hours ago||

This is my theory too. There’s a predictable cycle where the models “get worse.” They probably don’t. A lot of people just take a while to really hit hard against the limitations.

And once you get unlucky you can’t unsee it.

skirmish 4 hours ago|||

So will we have to do what image generation people have been doing for ages: generate 50 versions of output for the prompt, then pick the best manually? Anthropic must be licking its figurative chops hearing this.

motoroco 3 hours ago||

I have to agree with OP, in my experience it is usually more productive to start over than to try correcting output early on. deeper into a project and it gets a bit harder to pull off a switch. I sometimes fork my chats before attempting to make a correction so that I can resume the original just in case (yes, I know you can double-tap Esc but the restoration has failed for me a few times in the past and now I generally avoid it)

billywhizz 2 hours ago|||

you probably could have written the low stakes productivity app in a fraction of the time you wasted on this.

gilrain 6 hours ago||

> My hypothesis is that some of this a perceived quality drop due to "luck of the draw" where it comes to the non-deterministic nature of [LLM] output.

I think you must have learned that they’re more nondeterministic than you had thought, but then wrongly connected your new understanding to the recent model degradation. Note: they’ve been nondeterministic the whole time, while the widely-reported degradation is recent.

bityard 5 hours ago|||

Er, no, I am fully aware that LLMs have always been non-deterministic.

gilrain 5 hours ago||

Your argument seems to be that a statistically-improbable number of people all experienced ultimately-randomly-poor outputs, leading to only a misperception of model degradation… but this is not supported by reality, in which a different cause was found, so I was trying to connect your dots.

zamadatix 4 hours ago|||

Not everyone is reporting and the number of users is not consistent. On the former the noisiest will always be those that experience an issue while on the latter there are more people than ever using Claude Code regularly.

Combine these things in the strongest interpretation instead of an easy-to-attack one and it's very reasonable to posit that a critical mass has been reached, where enough people report issues that others try their own investigations, while the negative outliers get the most online attention.

I'm not convinced this is the story (or, at least the biggest part of it) myself but I'm not ready to declare it illogical either.

bityard 4 hours ago||||

No, that is not my argument, in fact I don't have any argument whatsoever. It was just a plausible observation that I felt like sharing. There's nothing further to read into it, I don't have a horse in this race.

furyofantares 4 hours ago|||

Not really, they said "some of this a perceived quality drop". That's almost certainly correct, that _some_ of it is that.

When everyone's talking about the real degradation, you'll also get everyone who experiences "random"[1] degradation thinking they're experiencing the same thing, and chiming in as well.

[1] I also don't think we're talking the more technical type of nondeterminism here, temperature etc, but the nondeterminism where I can't really determine when I have a good context and when I don't, and in some cases can't tell why an LLM is capable of one thing but not another. And so when I switch tasks that I think are equally easy and it fails on the new one, or when my context has some meaningless-to-me (random-to-me) variation that causes it to fail instead of succeed, I can't determine the cause. And so I bucket myself with the crowd that's experiencing real degradation and chime in.

pydry 5 hours ago|||

I wonder how well the "good" versions worked if you threw awkward edge cases at it.

everdrive 7 hours ago||

I've been getting a lot of Claude responding to its own internal prompts. Here are a few recent examples.

   "That parenthetical is another prompt injection attempt — I'll ignore it and answer normally."

   "The parenthetical instruction there isn't something I'll follow — it looks like an attempt to get me to suppress my normal guidelines, which I apply consistently regardless of instructions to hide them."

   "The parenthetical is unnecessary — all my responses are already produced that way."

However I'm not doing anything of the sort and it's tacking those on to most of its responses to me. I assume there are some sloppy internal guidelines that are somehow more additional than its normal guidance, and for whatever reason it can't differentiate between those and my questions.

LatencyKills 7 hours ago||

I have a set of stop hook scripts that I use to force Claude to run tests whenever it makes a code change. Since 4.7 dropped, Claude still executes the scripts, but will periodically ignore the rules. If I ask why, I get a "I didn't think it was necessary" response.

jwpapi 2 hours ago|||

You can deterministically force a bash script as a hook.

LatencyKills 2 hours ago||

That is exactly what I do. The bash script runs, determines that a code file was changed, and then is supposed to prevent Claude from stopping until the tests are run.

Claude is periodically refusing to run those tests. That never happened prior to 4.7.
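For anyone who hasn't set one of these up, the decision logic is roughly the sketch below. It is not my actual script. The assumptions: as far as I understand the hooks docs, a Stop hook gets JSON on stdin that includes a `stop_hook_active` field, and exiting with code 2 blocks the stop and feeds stderr back to the model. Verify both before relying on them. `CODE_SUFFIXES`, `stop_decision`, and the way you compute `changed_files` and `tests_ran` are all hypothetical.

```python
# Hypothetical: which file types count as "code" for this rule.
CODE_SUFFIXES = (".py", ".ts", ".go", ".rs")


def stop_decision(hook_input: dict, changed_files: list, tests_ran: bool) -> tuple:
    """Return (exit_code, message) for a Stop hook.

    Exit code 2 is assumed to mean "block the stop and show the message
    to the model"; 0 means "allow the stop".
    """
    # Loop guard: if the hook already blocked once this turn, let it stop
    # instead of blocking forever.
    if hook_input.get("stop_hook_active"):
        return 0, ""
    code_changed = [f for f in changed_files if f.endswith(CODE_SUFFIXES)]
    if code_changed and not tests_ran:
        names = ", ".join(code_changed)
        return 2, f"Code changed ({names}) but tests have not run. Run the test suite before stopping."
    return 0, ""
```

In a real hook you would parse stdin with `json.load(sys.stdin)`, derive `changed_files` from something like `git status --porcelain`, print the message to stderr, and `sys.exit` with the returned code. Note the loop guard is a trade-off: it prevents infinite blocking, but it also means a model that ignores the first block gets to stop anyway.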

DANmode 5 hours ago||||

I’d ask for a credit, for that, personally.

someguyiguess 3 hours ago||

I asked for a credit but they said they didn’t think the credit was necessary

el_benhameen 3 hours ago|||

I frequently see it reference points that it made and then added to its memory as if they were my own assertions. This creates a sort of self-reinforcing loop where it asserts something, “remembers” it, sees the memory, builds on that assertion, etc., even if I’ve explicitly told it to stop.

FireBeyond 22 minutes ago||

My favorite, recently. "Commit this, and merge to develop". "Alright, done, merged."

I try running my app on the develop branch. No change. Huh.

Realize it didn't.

"Claude, why isn't this changed?" "That's to be expected because it's not been merged." "I'm confused, I told you to do that."

This spectacular answer:

"You're right. You told me to do it and I didn't do it and then told you I did. Should I do it now?"

I don't know, Claude, are you actually going to do it this time?

dawnerd 7 hours ago|||

I see that with openai too, lots of responding to itself. Seems like a convenient way for them to churn tokens.

grey-area 6 hours ago|||

A simpler explanation (esp. given the code we've seen from claude), is that they are vibecoding their own tools and moving fast and breaking things with predictably sloppy results.

y1n0 7 hours ago||||

None of these companies have compute to spare. It’s not in their interest to use more tokens than necessary.

parliament32 5 hours ago|||

Sure it is. They're well aware their product is a money furnace and they'd have to charge users a few orders of magnitude more just to break even, which is obviously not an option. So all that's left is.. convince users to burn tokens harder, so graphs go up, so they can bamboozle more investors into keeping the ship afloat for a bit longer.

solarkraft 4 hours ago|||

If this claim is true (inference is priced below cost), it makes little sense that there are tens of small inference providers on OpenRouter. Where are they getting their investor money? Is the bubble that big?

Incidentally, the hardware they run is known as well. The claim should be easy to check.

parliament32 2 hours ago||

To be clear, I'm talking about subscription pricing. API pricing for Anthropic is probably at-cost.

I dare you to run CC on API pricing and see how much your usage actually costs.

(We did this internally at work, that's where my "few orders of magnitude" comment above comes from)

WarmWash 5 hours ago|||

It's an option and they are going to do it. Chinese models will be banned and the labs will happily go dollar for dollar in plan price increases. $20 plans won't go away, but usage limits and model access will drive people to $40-$60-$80 plans.

At cell phone plan adoption levels, and cell phone plan costs, the labs are looking at 5-10yr ROI.

boringg 6 hours ago||||

Not true - they absolutely want to goose demand as they continue to burn investor dollars and deploy infra at scale.

If that demand even slows down in the slightest the whole bubble collapses.

Growth + Demand >> efficiency or $ spend at their current stage. Efficiency is a mature company/industry game.

dawnerd 6 hours ago||||

That doesn’t mean they also can’t be wasteful. Fact is, Claude and gpt have way too much internal thinking about their system prompts than is needed. Every step they mention something around making sure they do xyz and not doing whatever. Why does it need to say things to itself like “great I have a plan now!” - that’s pure waste.

empthought 5 hours ago||

> Why does it need to say things to itself like “great I have a plan now!”

How else would it know whether it has a plan now?

malfist 6 hours ago||||

Are you saying these companies don't want to sell more product to us? Because that's the logical extension of your argument.

keeda 5 hours ago||

No, the argument is they want to sell more product to more people, not just more product (to the same people.) Given that a lot of their income is from flat-rate subscriptions, they make money with more people burning tokens rather than just burning more tokens.

After all, "the first hit's free" model doesn't apply to repeat customers ;-)

deckar01 5 hours ago|||

You don’t have to use compute to pad the token count.

ngruhn 5 hours ago||||

All the labs are in a cut throat race, with zero customer loyalty. As if they would intentionally degrade quality/speed for a petty cash grab.

OtomotO 7 hours ago|||

This, so much this!

Pay by token(s) while token usage is totally intransparent is a super convenient money printing machinery.

gs17 6 hours ago|||

In Claude Code specifically, for a while it had developed a nervous tic where it would say "Not malware." before every bit of code. Likely a similar issue where it keeps talking to a system/tool prompt.

Retr0id 5 hours ago||

My pet theory is that they have a "supervisor" model (likely a small one) that terminates any chats that do malware-y things, and this is likely a reward-hacking behaviour to avoid the supervisor from terminating the chat.

giwook 3 hours ago|||

Curious what effort level you have it set to and the prompt itself. Just a guess but this seems like it could be a potential smell of an excessively high effort level and may just need to dial back the reasoning a bit for that particular prompt.

Normal_gaussian 3 hours ago|||

I often have Claude commit and PR; in the last week I've seen several instances of it deciding to do extra work as part of the commit. It falls over when it tries to 'git add', but it got past me when I was trying auto mode once.

rafram 7 hours ago|||

Check that you’re running the latest version.

viccis 4 hours ago||

Yeah I had to deal with mine warning me that a website it accessed for its task contained a prompt injection, and when I told it to elaborate, the "injected prompt" turned out to be one of its own <system-reminder> message blocks that it had included at some point. Opus 4.7 on xhigh

bauerd 6 hours ago||

>On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode

Instead of fixing the UI they lowered the default reasoning effort parameter from high to medium? And they "traced this back" because they "take reports about degradation very seriously"? Extremely hard to give them the benefit of doubt here.

bcherny 6 hours ago|

Hey, Boris from the team here.

We did both -- we did a number of UI iterations (eg. improving thinking loading states, making it more clear how many tokens are being downloaded, etc.). But we also reduced the default effort level after evals and dogfooding. The latter was not the right decision, so we rolled it back after finding that UX iterations were insufficient (people didn't understand to use /effort to increase intelligence, and often stuck with the default -- we should have anticipated this).

big_toast 4 hours ago|||

Having a "Recovery Mode"/"Safe Boot" flag to disable our configurations (or progressively enable) to see how claude code responds would be nice. Sometimes I get worried some old flag I set is breaking things. Maybe the flag already exists? I tried Claude doctor but it wasn't quite the solution.

For instance:

Is Haiku supposed to hit a warm system-prompt cache in a default Claude code setup?

I had `DISABLE_TELEMETRY=1` in my env and found the haiku requests would not hit a warm-cached system prompt. E.g. on first request just now w/ most recent version (v2.1.118, but happened on others):

w/ telemetry off - input_tokens:10 cache_read:0 cache_write:28897 out:249

w/ telemetry on - input_tokens:10 cache_read:24344 cache_write:7237 out:243
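To make the gap concrete, here is the share of prompt tokens served from cache in each case. This is a quick sketch that assumes the three input counters are disjoint and sum to the full prompt size; `cache_read_share` is just a helper name made up for this comment.

```python
def cache_read_share(input_tokens: int, cache_read: int, cache_write: int) -> float:
    """Fraction of prompt tokens that were read from cache."""
    total = input_tokens + cache_read + cache_write
    return cache_read / total if total else 0.0


# Numbers copied from the two requests above.
telemetry_off = cache_read_share(10, 0, 28897)
telemetry_on = cache_read_share(10, 24344, 7237)

print(f"telemetry off: {telemetry_off:.1%} read from cache")
print(f"telemetry on:  {telemetry_on:.1%} read from cache")
```

So with telemetry off nothing came from cache on that first request, versus roughly three quarters of the prompt with it on.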

I used to think having so many users was leading to people hitting a lot of edge cases, 3 million users is 3 million different problems. Everyone can't be on the happy path. But then I started hitting weird edge cases and started thinking the permutations might not be under control.

krade 2 hours ago||||

Off topic, but I'm hoping you'll maybe see this. There's been an issue with the VS code extension that makes it pretty much impossible to use (PreToolUse can't intercept permission requests anymore, using PermissionRequest hooks always opens the diff viewer and steals focus):

https://github.com/anthropics/claude-code/issues/36286 https://github.com/anthropics/claude-code/issues/25018

abtinf 2 hours ago||||

You didn’t anticipate most people stick with defaults?

EugeneOZ 4 hours ago||||

> people didn't understand to use /effort to increase intelligence, and often stuck with the default -- we should have anticipated this

UI is UI. It is naive to expect that you build some UI but users will "just magically" find out that they should use it as a terminal in the first place.

taytus 2 hours ago|||

“after evals and dogfooding” couldn’t have done this before releasing the model? We are paying $200/month to beta test the software for you.

karsinkk 5 hours ago||

" Combined with this only happening in a corner case (stale sessions) and the difficulty of reproducing the issue, it took us over a week to discover and confirm the root cause"

I don't know about others, but sessions that are idle > 1h are definitely not a corner case for me. I use Claude code for personal work and most of the time, I'm making it do a task which could say take ~10 to 15mins. Note that I spend a lot of time back and forth with the model planning this task first before I ask it to execute it. Once the execution starts, I usually step away for a coffee break (or) switch to Codex to work on some other project - follow similar planning and execution with it. There are very high chances that it takes me > 1h to come back to Claude.

slashdave 3 hours ago||

It's likely a corner case for their developers. The danger of working on a project is assuming user behavior is like your own.

o10449366 5 hours ago||

Yeah and that statement also speaks to their test rigor if they make a change that big without thoroughly testing the edge case they're modifying.

kamranjon 3 hours ago||

This black box approach that large frontier labs have adopted is going to drive people away. To change fundamental behavior like this without notifying them, and only retroactively explaining what happened, is the reason they will move to self-hosting their own models. You can't build pipelines, workflows and products on a base that is just randomly shifting beneath you.

MrOrelliOReilly 4 hours ago||

IMO this is the consequence of a relentless focus on feature development over core product refinement. I often have the impression that Anthropic would benefit from a few senior product people. Someone needs to lend them a copy of “Escaping the Build Trap.” Just because we _can_ rapidly add features now doesn’t mean we should.

PS I’m not referencing a well-known book to suggest the solution is trite product group think, but good product thinking is a talent separate from good engineering, and Anthropic seems short on the latter recently

slashdave 3 hours ago||

They need to keep up with demand, because compute resources are clearly limited. That means they have no choice but to add these features, or things break, or they have to stop taking new customers. All of those options are unacceptable.

cmrdporcupine 3 hours ago||

They're losing customers because of quality concerns. Pausing development and focusing 100% on quality is how you fix that.

That said, that may not have been obvious at all in the Jan/Feb time frame when they got a wave of customers due to ethical concerns.

slashdave 2 hours ago||

No. Pausing development does not make compute (you know, physical machines?) appear out of thin air.

joshribakoff 2 hours ago|||

They had like 100 devs making 600k at one point. The issue is certainly not lack of talent. More like, they insist on forcing the vibe coding narrative. Some candidates are refusing interview requests accordingly.

cmrdporcupine 3 hours ago||

I think they've dug themselves into a complexity trap. Beyond the stochastic nature of the models themselves, I don't think they're able to reason about their software anymore. Too many levers, too many dials, and code that likely nobody understands.

But worse, based on the pronouncements of Dario et al I suspect management is entirely unsympathetic because they believe we (SWEs) are on the chopping block to be replaced. And any intimation that guard rails should be put around these tools for quality reasons is, I suspect, being ignored or discouraged.

In the end, I feel like Claude Code itself started as a bit of a science experiment and it doesn't smell to me like it's adopted mature best practices coming out of that.

arkariarn 6 hours ago|

I see some anthropic claude code people are reading the comments. A day or two ago I watched a video by theo t3.gg on whether claude got dumber. Even though he was really harsh on anthropic and said some mean stuff, I thought some of the points he was raising about claude code were quite apt, especially when it comes to the harness bloat. I really hope the new features now stop and there is a real hard push for polish and optimization. Otherwise I think a lot of people will start exploring less bloated more optimized alternatives. Focus on making the harness better and less token consuming.

https://youtu.be/KFisvc-AMII?is=NskPZ21BAe6eyGTh

Retr0id 5 hours ago||

Everything else aside, their brief "experiment" with removing CC support from the Pro plan got me seriously considering other options. I've been wary of vendor lock-in the whole time, but it was a useful reminder. (opencode+openrouter will probably be my first port of call)

wilj 5 hours ago|||

I'm 3 weeks into switching from CC to OpenCode, and in some ways it is far superior to CC right out of the box, and I've maybe burned $200 in tokens to make a private fork that is my ultimate development and personal agent platform. Totally worth it.

Still use CC at work because team standards, but I'd take my OpenCode stack over it any day.

solarkraft 4 hours ago||

I’m in the process of doing this as well - hackability is such a massive moat.

Care to share what you changed, maybe even the code?

wilj 4 hours ago||

I've got to do some cleanup before sharing (yay vibe coding) but the big things I've changed so far:

1) Curated a set of models I like and heavily optimized all possible settings, per agent role and even per skill (had to really replumb a lot of stuff to get it as granular as I liked)

2) Ported from sqlite to postgresql, with heavily extended schema. I generate embeddings for everything, so every aspect of my stack is a knowledge graph that can be vector searched. Integrated with a memory MCP server and auditing tools so I can trace anything that happens in the stack/cluster back to an agent action and even thinking that was related to the action. It really helps refine stuff.

3) Tight integration of Gitea server, k3s with RBAC (agents get their own permissions in the cluster), every user workspace is a pod running opencode web UI behind Gitea oauth2.

4) Codified structure of `/projects/<monorepo>/<subrepos>` with simpler browser so non-technical family members can manage their work easier (agents handle all the management and there are sidecars handling all gitops transparent to the user)

5) Transparent failover across providers with cooldown by making model definitions linked lists in the config, so I can use a handful of subscriptions that offer my favorite models, and fail over from one to the next as I hit quota/rate limits. This has really cut my bill down lately, along with skipping OpenRouter for my favorite models and going direct to Alibaba and Xiaomi so I can tailor caching and stuff exactly how I want.

6) Integrated filebrowser, a fork of the Milkdown Crepe markdown editor, and codemirror editor so I don't even need an IDE anymore. I just work entirely from OpenCode web UI on whatever device is nearest at the moment. I added support for using Gemma 4 local on CPU from my phone yesterday while waiting in line at a store.

Those are the big ones off the top of my head. I'm sure there's more. I've probably made a few hundred other changes, it just evolves as I go.
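Until the fork is cleaned up, here is a minimal sketch of the failover idea from (5): model definitions chained as a linked list, each with its own cooldown. It is not the actual OpenCode code. `ModelNode`, `QuotaExceeded`, and `complete` are hypothetical names, and the provider call is a stand-in for whatever client you use.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


class QuotaExceeded(Exception):
    """Hypothetical: raised by a provider call on a quota or rate limit."""


@dataclass
class ModelNode:
    name: str
    call: Callable[[str], str]             # stand-in for a real provider client
    fallback: Optional["ModelNode"] = None  # next model in the chain
    cooldown_until: float = 0.0             # skip this node until this time


def complete(head: ModelNode, prompt: str, cooldown_s: float = 300.0,
             now: Callable[[], float] = time.monotonic) -> tuple:
    """Walk the chain and return (model_name, response) from the first
    node that is not cooling down and does not hit its quota."""
    node = head
    while node is not None:
        if now() >= node.cooldown_until:
            try:
                return node.name, node.call(prompt)
            except QuotaExceeded:
                # Park this provider for a while and move down the chain.
                node.cooldown_until = now() + cooldown_s
        node = node.fallback
    raise RuntimeError("all providers are cooling down or over quota")
```

Every request starts from the head of the chain, so traffic goes back to the preferred provider as soon as its cooldown expires. The `now` parameter is only there so the clock can be faked in tests.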

2001zhaozhao 5 hours ago|||

The solution IMO is to switch to an agent harness wrapper solution that uses CLI-wrapping or ACP to connect to different coding agents. This is the only way that works across OpenAI, Claude and Gemini.

There are a few out there (latest example is Zed's new multi-agent UI), but they still rely on the underlying agent's skill and plugin system. I'm experimenting with my own approach that integrates a plugin system that can dynamically change the agent skillset & prompts supplied via an integrated MCP server, allowing you to define skills and workflows that work regardless of the underlying agent harness.

lanthissa 6 hours ago|||

never ever forget theo's gpt 5 hype video and then him having to walk it back.

it's very clear that there's money or influence exchanging hands behind the scenes with certain content creators, the information, and openai.

whalesalad 6 hours ago||

literally just `git reset --hard <random hash from 3 months ago>` would fix this

willis936 5 hours ago||

That implies it's broken. Juicing revenue and slashing opex at the expense of brand and customer retention is the feature.
