Posted by HellsMaddy 8 hours ago

Claude Opus 4.6 (www.anthropic.com)
1570 points | 670 comments
ck_one 4 hours ago|
Just tested the new Opus 4.6 (1M context) on a fun needle-in-a-haystack challenge: finding every spell in all Harry Potter books.

All 7 books come to ~1.75M tokens, so they don't quite fit yet. (At this rate of progress, mid-April should do it.) For now you can fit the first 4 books (~733K tokens).

Results: Opus 4.6 found 49 out of 50 officially documented spells across those 4 books. The only miss was "Slugulus Eructo" (a vomiting spell).

Freaking impressive!

golfer 3 hours ago||
There are lots of websites that list the spells; it's well documented. Could Claude simply be regurgitating knowledge from the web? Example:

https://harrypotter.fandom.com/wiki/List_of_spells

ck_one 3 hours ago||
It didn't use web search. But for sure it has some internal knowledge already. It's not a perfect needle-in-the-haystack problem, but Gemini Flash was much worse when I tested it last time.
viraptor 3 hours ago|||
If you want to really test this, search/replace the names with your own random ones and see if it lists those.

Otherwise, LLMs have most of the books memorised anyway: https://arstechnica.com/features/2025/06/study-metas-llama-3...
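
A minimal sketch of that substitution, assuming you have the concatenated book text and a list of canonical spell names in local files (the paths and the nonsense-word generator are made up for illustration):

  import json, random, re, string

  books = open("hp_books_1-4.txt", encoding="utf-8").read()
  spells = open("spells.txt", encoding="utf-8").read().splitlines()

  def nonsense_word(n=9):
      # Pronounceable-ish gibberish so the replacement still looks like a spell name.
      return "".join(random.choice(string.ascii_lowercase) for _ in range(n)).capitalize()

  mapping = {spell: nonsense_word() for spell in spells}
  for spell, fake in mapping.items():
      # Whole-word, case-insensitive replacement so "Expelliarmus!" still matches.
      books = re.sub(rf"\b{re.escape(spell)}\b", fake, books, flags=re.IGNORECASE)

  open("hp_books_substituted.txt", "w", encoding="utf-8").write(books)
  json.dump(mapping, open("spell_mapping.json", "w"), indent=2)

If the model then lists the nonsense names, it actually read the context; if it lists the originals, it's recalling training data.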

ribosometronome 2 hours ago||
Couldn't you just ask the LLM which 50 (or 49) spells appear in the first four Harry Potter books without the data for comparison?
viraptor 2 hours ago|||
It's not going to be as consistent. It may get bored of listing them (you know how you can ask for many examples and get 10 in response?), or omit some minor ones for other reasons.

By replacing the names with something unique, you'll get much more certainty.

Grimblewald 1 hour ago|||
It might not work well, but by navigating to a very Harry Potter-dominant part of latent space through preconditioning on the books, you make it more likely to get good results. For example, take a base model and prompt it with "what follows is the book 'X'": it may or may not regurgitate the book correctly. Give it a chunk of the first chapter and let it regurgitate from there, and you tend to get fairly faithful recovery, especially for things on Gutenberg.

So it might be that, by preconditioning latent space toward the Harry Potter world, you make it far more probable that the full spell list gets regurgitated from online resources that were also in the training data, while asking naively might get it sometimes and sometimes not.

The books act like a hypnotic trigger, so this may not represent a generalized skill. Hence why replacing the names with random words would help clarify: if you still get the original spells, regurgitation is confirmed; if it finds the replaced names, it could be doing what we think it's doing. An even better test would be to replace all spell references AND jumble the chapters around. That way it can't even "know" where to "look" for the spell names from training.

joshmlewis 3 hours ago||||
I think the OP was implying that it's probably already baked into its training data. No need to search the web for that.
obirunda 54 minutes ago||||
This underestimates how much of the Internet is actually compressed into, and is an integral part of, the model's weights. Gemini 2.5 can recite over 75% of the first Harry Potter book verbatim.
Trasmatta 1 hour ago||||
Do the same experiment in the Claude web UI. And explicitly turn web searches off. It got almost all of them for me over a couple of prompts. That stuff is already in its training data.
soulofmischief 2 hours ago||||
The only worthwhile version of this test involves previously unseen data that could not have been in the training set. Otherwise the results could be inaccurate to the point of being harmful.
eek2121 3 hours ago|||
Honestly? My advice would be to cook up something custom! You don't need to write all the text yourself. Maybe have AI spew out a bunch of text, or take obscure existing text and insert hidden phrases here or there.

Shoot, I'd even go so far as to write a script that takes in a bunch of text, reorganizes the sentences, and outputs them in a random order with the secrets mixed in. Kind of like a "Where's Waldo?", but for text.
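
A rough sketch of that kind of generator, assuming a plain-text source file and a couple of invented "secrets" to hide (all names here are hypothetical):

  import random

  text = open("obscure_source.txt", encoding="utf-8").read()
  secrets = [
      "The silver key is under the third floorboard.",
      "Agent Wren's passphrase is 'marmalade-seven'.",
  ]

  # A naive sentence split is fine for a throwaway test corpus.
  sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
  random.shuffle(sentences)

  # Drop each secret at a random position in the shuffled text.
  for secret in secrets:
      sentences.insert(random.randrange(len(sentences) + 1), secret)

  open("haystack.txt", "w", encoding="utf-8").write(" ".join(sentences))

Since both the shuffle and the secrets are new, memorization can't help the model; it has to actually read the haystack.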

Just a few casual thoughts.

I'm actually thinking about coming up with some interesting coding exercises that I can run across all models. I know we already have benchmarks, however some of the recent work I've done has really shown huge weak points in every model I've run them on.

clhodapp 2 hours ago||
Having AI spew it might suffer from the fact that the spew itself is influenced by the AI's weights. I think your best bet would be to use a new human-authored work that was released after the model's training cutoff.
xiomrze 4 hours ago|||
Honest question, how do you know if it's pulling from context vs from memory?

If I use Opus 4.6 with Extended Thinking (Web Search disabled, no books attached), it answers with 130 spells.

petercooper 3 hours ago|||
One possible trick could be to search and replace them all with nonsense alternatives then see if it extracts those.
andai 3 hours ago||
That might actually boost performance since attention pays attention to stuff that stands out. If I make a typo, the models often hyperfixate on it.
ozim 3 hours ago||||
Exactly, there was a study where they tried to make an LLM reproduce an HP book word for word, giving it the first sentences and letting it cook.

Basically, with some tricks they managed to get ~99% word for word. The tricks were needed to bypass the safeguards that are in place for exactly this reason: to stop people from retrieving training material.

pron 3 hours ago|||
This reminds me of https://en.wikipedia.org/wiki/Pierre_Menard,_Author_of_the_Q... :

> Borges's "review" describes Menard's efforts to go beyond a mere "translation" of Don Quixote by immersing himself so thoroughly in the work as to be able to actually "re-create" it, line for line, in the original 17th-century Spanish. Thus, Pierre Menard is often used to raise questions and discussion about the nature of authorship, appropriation, and interpretation.

ck_one 3 hours ago|||
Do you remember how to get around those tricks?
djhn 3 hours ago||
This is the paper: https://arxiv.org/abs/2601.02671

Grok and Deepmind IIRC didn’t require tricks.

eek2121 2 hours ago||
This really makes me want to try something similar with content from my own website.

I shut it down a while ago because bot traffic overtook human traffic. The site had quite a bit of human traffic (enough to bring in a few hundred bucks a month in ad revenue, and a few hundred more in subscription revenue), but the AI scrapers really started ramping up, and the only way I could realistically have continued would have been to pay a lot more for hosting/infrastructure.

I had put a ton of time into building out content...thousands of hours, only to have scrapers ignore robots.txt, bypass Cloudflare (they didn't have any AI products at the time), and overwhelm my measly infrastructure.

Even now, with the domain pointed at NOTHING, it gets almost 100,000 hits a month. There is NO SERVER on the other end. It is a dead link. The stats come from Cloudflare, where the domain name is hosted.

I'm curious if there are any lawyers who'd be willing to take someone like me on contingency for a large copyright lawsuit.

camdenreslink 1 hour ago||
The new Cloudflare products for blocking bots and AI scrapers might be worth a shot if you put so much work into the content.
ck_one 3 hours ago||||
When I tried it without web search (so only internal knowledge), it missed ~15 spells.
clanker_fluffer 3 hours ago|||
What was your prompt?
meroes 4 hours ago|||
What is this supposed to show exactly? Those books have been fed into LLMs for years, and there's likely even specific RLHF on extracting spells from HP.
muzani 3 hours ago|||
There was a time when I put the EA-Nasir text into base64 and asked AI to convert it. Remarkably, it identified the correct text but pulled the most popular translation rather than the one I gave it.
majewsky 42 minutes ago||
Sucks that you got a really shitty response to your prompt. If I were you, the model provider would be receiving my complaint via clay tablet right away.
rvz 3 hours ago|||
> What is this supposed to show exactly?

Nothing.

You can be sure that this was already present in the training data of PDFs, books, and websites that Anthropic scraped to train Claude on; hence 'documented'. This is why tests like the one the OP just did are meaningless.

Such "benchmarks" are performative for VCs, who don't ask why the research and testing isn't done independently but is almost always done by the labs' own in-house researchers.

siwatanejo 39 minutes ago|||
> All 7 books come to ~1.75M tokens

How do you know? Each word is one token?

koakuma-chan 22 minutes ago||
You can download the books and run them through a tokenizer. I did that half a year ago and got ~2M.
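
For a rough count: Anthropic doesn't ship its tokenizer, so something like OpenAI's tiktoken only gives a ballpark figure, but it's close enough to see whether the books fit in a context window (file names here are placeholders):

  # pip install tiktoken
  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")  # proxy tokenizer; Claude's own will differ a bit

  total = 0
  for path in ["hp_book_1.txt", "hp_book_2.txt", "hp_book_3.txt", "hp_book_4.txt"]:
      with open(path, encoding="utf-8") as f:
          total += len(enc.encode(f.read()))

  print(f"~{total:,} tokens across the four books")

The exact number shifts with the tokenizer, but the order of magnitude is what matters for whether the books fit.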
zamadatix 4 hours ago|||
To be fair, I don't think "Slugulus Eructo" (the name) is actually in the books. This is what's in my copy:

> The smug look on Malfoy’s face flickered.

> “No one asked your opinion, you filthy little Mudblood,” he spat.

> Harry knew at once that Malfoy had said something really bad because there was an instant uproar at his words. Flint had to dive in front of Malfoy to stop Fred and George jumping on him, Alicia shrieked, “How dare you!”, and Ron plunged his hand into his robes, pulled out his wand, yelling, “You’ll pay for that one, Malfoy!” and pointed it furiously under Flint’s arm at Malfoy’s face.

> A loud bang echoed around the stadium and a jet of green light shot out of the wrong end of Ron’s wand, hitting him in the stomach and sending him reeling backward onto the grass.

> “Ron! Ron! Are you all right?” squealed Hermione.

> Ron opened his mouth to speak, but no words came out. Instead he gave an almighty belch and several slugs dribbled out of his mouth onto his lap.

sobjornstad 2 hours ago|||
I have a vague recollection that it might come up named as such in Half-Blood Prince, written in Snape's old potions textbook?

In support of that hypothesis, the Fandom site lists it as “mentioned” in Half-Blood Prince, but it says nothing else and I'm traveling and don't have a copy to check, so not sure.

zamadatix 22 minutes ago||
Hmm, I don't get a hit for "slugulus" or "eructo" (case insensitive) in any of the 7. Interestingly, two mentions of "vomit" are in book 6, but neither in reference to slugs (plenty of Slughorn, of course!). Book 5 was the only other one where a related hit came up:

> Ron nodded but did not speak. Harry was reminded forcibly of the time that Ron had accidentally put a slug-vomiting charm on himself. He looked just as pale and sweaty as he had done then, not to mention as reluctant to open his mouth.

There could be something with regional variants but I'm doubtful as the Fandom site uses LEGO Harry Potter: Years 1-4 as the citation of the spell instead of a book.

Maybe the real LLM is the universe and we're figuring this out for someone on Slacker News a level up!

ck_one 3 hours ago|||
Then it's fair that it didn't find it.
kybernetikos 2 hours ago|||
I recently got junie to code me up an MCP for accessing my calibre library. https://www.npmjs.com/package/access-calibre

My standard test for that was "Who ends up with Bilbo's buttons?"

muzani 3 hours ago|||
There's a benchmark which works similarly but they ask harder questions, also based on books https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/o...

I guess they have to add more questions as these context windows get bigger.

dwa3592 2 hours ago|||
Have another LLM (Gemini, ChatGPT) make up 50 new spells, insert those, and test. And maybe report back here :)
dom96 2 hours ago|||
I often wonder how much of the Harry Potter books were used in the training. How long before some LLM is able to regurgitate full HP books without access to the internet?
bartman 3 hours ago|||
Have you by any chance tried this with GPT 4.1 too (also 1M context)?
LanceJones 3 hours ago|||
Assuming this experiment involved isolating the LLM from its training set?
irishcoffee 2 hours ago|||
The top comment is about finding bastardized Latin words from children's books. The future is here.
Geste 2 hours ago||
I'll have some of that coffee too. It's quite a sad time we're living in, where this counts as a proper use of our limited resources.
guluarte 3 hours ago|||
You can get the same result just by asking Opus/GPT; it's probably internalized knowledge from Reddit or similar sites.
ck_one 3 hours ago||
If you just ask it you don't get the same result. Around 13 spells were missing when I just prompted Opus 4.6 without the books as context.
TheRealPomax 2 hours ago|||
That doesn't seem like a super useful test for a model that's optimized for programming?
IhateAI 41 minutes ago|||
like I often say, these tools are mostly useful for people to do magic tricks on themselves (and to convince C-suites that they can lower pay, and reduce staff if they pay Anthropic half their engineering budget lmao )
adarsh2321 3 hours ago||
[dead]
gizmodo59 7 hours ago||
GPT-5.3 Codex https://openai.com/index/introducing-gpt-5-3-codex/ crushes it with 77.3% on Terminal-Bench. The shortest-lived lead ever: less than 35 minutes. What a time to be alive!
wasmainiac 7 hours ago||
Dumb question: can these benchmarks be trusted when model performance tends to vary depending on the time of day and the load on OpenAI's servers? How do I know I'm not getting a severe penalty for chatting at the wrong time? Or even, are the models at their best right after launch, then slowly eroded to more economical settings once the hype wears off?
tedsanders 6 hours ago|||
We don't vary our model quality with time of day or load (beyond negligible non-determinism). It's the same weights all day long with no quantization or other gimmicks. They can get slower under heavy load, though.

(I'm from OpenAI.)

wasmainiac 3 hours ago|||
Thanks for the response, I appreciate it. I do notice variation in quality throughout the day. I use it primarily for searching documentation since it's faster than Google in most cases; often it is on point, but it also seems off at times, inaccurate or shallow maybe. In some cases I just end the session.
nl 3 hours ago||
Usually I find this kind of variation is due to context management.

Accuracy can decrease at large context sizes. OpenAI's compaction handles this better than anyone else, but it's still an issue.

If you are seeing this kind of thing start a new chat and re-run the same query. You'll usually see an improvement.

repeekad 33 minutes ago||
This is called context rot
GorbachevyChase 35 minutes ago||||
Hi Ted. I think that language models are great, and they’ve enabled me to do passion projects I never would have attempted before. I just want to say thanks.
Trufa 5 hours ago||||
Can you be more specific than this? Does it vary over time, from the launch of a model through the next few months, beyond tinkering and optimization?
tedsanders 4 hours ago|||
Yeah, happy to be more specific. No intention of making any technically true but misleading statements.

The following are true:

- In our API, we don't change model weights or model behavior over time (e.g., by time of day, or weeks/months after release)

- Tiny caveats include: there is a bit of non-determinism in batched non-associative math that can vary by batch / hardware, bugs or API downtime can obviously change behavior, heavy load can slow down speeds, and this of course doesn't apply to the 'unpinned' models that are clearly supposed to change over time (e.g., xxx-latest). But we don't do any quantization or routing gimmicks that would change model weights.

- In ChatGPT and Codex CLI, model behavior can change over time (e.g., we might change a tool, update a system prompt, tweak default thinking time, run an A/B test, or ship other updates); we try to be transparent with our changelogs (listed below) but to be honest not every small change gets logged here. But even here we're not doing any gimmicks to cut quality by time of day or intentionally dumb down models after launch. Model behavior can change though, as can the product / prompt / harness.

ChatGPT release notes: https://help.openai.com/en/articles/6825453-chatgpt-release-...

Codex changelog: https://developers.openai.com/codex/changelog/

Codex CLI commit history: https://github.com/openai/codex/commits/main/

Trufa 2 hours ago|||
I ask unironically then: am I imagining that models are great when they launch and degrade over time?

I've had this perceived experience so many times, and while of course it's almost impossible to be objective about this, it just seems so in-your-face.

I don't rule out novelty, getting used to it, and other psychological factors. Do you have any takes on this?

jason_oster 49 minutes ago||
You might be susceptible to the honeymoon effect. If you have ever felt a dopamine rush when learning a new programming language or framework, this might be a good indication.

Once the honeymoon wears off, the tool is the same, but you get less satisfaction from it.

Just a guess! Not trying to psychoanalyze anyone.

jychang 4 hours ago||||
What about the juice variable?

https://www.reddit.com/r/OpenAI/comments/1qv77lq/chatgpt_low...

tedsanders 4 hours ago|||
Yep, we recently sped up default thinking times in ChatGPT, as now documented in the release notes: https://help.openai.com/en/articles/6825453-chatgpt-release-...

The intention was purely making the product experience better, based on common feedback from people (including myself) that wait times were too long. Cost was not a goal here.

If you still want the higher reliability of longer thinking times, that option is not gone. You can manually select Extended (or Heavy, if you're a Pro user). It's the same as at launch (though we did inadvertently drop it last month and restored it yesterday after Tibor and others pointed it out).

tgrowazay 4 hours ago|||
Isn’t that just how many steps at most a reasoning model should do?
ComplexSystems 4 hours ago|||
Do you ever replace ChatGPT models with cheaper, distilled, quantized, etc ones to save cost?
jghn 4 hours ago||
He literally said no to this in his GP post
joshvm 5 hours ago|||
My gut feeling is that performance is more heavily affected by harnesses which get updated frequently. This would explain why people feel that Claude is sometimes more stupid - that's actually accurate phrasing, because Sonnet is probably unchanged. Unless Anthropic also makes small A/B adjustments to weights and technically claims they don't do dynamic degradation/quantization based on load. Either way, both affect the quality of your responses.

It's worth checking different versions of Claude Code, and updating your tools if you don't do it automatically. Also run the same prompts through VS Code, Cursor, Claude Code in terminal, etc. You can get very different model responses based on the system prompt, what context is passed via the harness, how the rules are loaded and all sorts of minor tweaks.

If you make raw API calls and see behavioural changes over time, that would be another concern.

zamadatix 4 hours ago||||
I appreciate you taking the time to respond to these kinds of questions the last few days.
derwiki 2 hours ago||||
Has this always been the case?
Someone1234 5 hours ago||||
Specifically including routing (i.e. which model you route to based on load/ToD)?

PS - I appreciate you coming here and commenting!

hhh 5 hours ago||
There is no routing with the API, or when you choose a specific model in ChatGPT.
fragmede 2 hours ago|||
I believe you when you say you're not changing the model file loaded onto the H100s or whatever, but there's something going on, beyond just being slower, when the GPUs are heavily loaded.
clbrmbr 1 hour ago||
I do wonder about reasoning effort.
Corence 6 hours ago||||
It is a fair question. I'd expect the numbers are all real. Competitors are going to rerun the benchmark with these models to see how the model is responding and succeeding on the tasks and use that information to figure out how to improve their own models. If the benchmark numbers aren't real their competitors will call out that it's not reproducible.

However it's possible that consumers without a sufficiently tiered plan aren't getting optimal performance, or that the benchmark is overfit and the results won't generalize well to the real tasks you're trying to do.

mrandish 3 hours ago||
> I'd expect the numbers are all real.

I think a lot of people are concerned due to 1) significant variance in performance being reported by a large number of users, and 2) We have specific examples of OpenAI and other labs benchmaxxing in the recent past (https://grok.com/share/c2hhcmQtMw_66c34055-740f-43a3-a63c-4b...).

It's tricky because there are so many subtle ways in which "the numbers are all real" could be technically true in some sense, yet still not reflect what a customer will experience (eg harnesses, etc). And any of those ways can benefit the cost structures of companies currently subsidizing models well below their actual costs with limited investor capital. All with billions of dollars in potential personal wealth at stake for company employees and dozens of hidden cost/performance levers at their disposal.

And it doesn't even require overt deception on anyone's part. For example, the teams doing benchmark testing of unreleased new models aren't the same people as the ops teams managing global deployment/load balancing at scale day-to-day. If there aren't significant ongoing resources devoted to specifically validating that those two things remain in sync, they'll almost certainly drift apart. And it won't be anyone's job to even know it's happening until a meaningful number of important customers complain or sales start to fall. Of course, if an unplanned deviation causes costs to rise over budget, it's a high-priority bug to be addressed. But if the deviation goes the other way and costs are a little lower than expected, no one's getting a late-night incident alert. This isn't even a dig at OpenAI in particular, it's just the default state of how large orgs work.

ifwinterco 6 hours ago||||
On benchmarks GPT 5.2 was roughly equivalent to Opus 4.5 but most people who've used both for SWE stuff would say that Opus 4.5 is/was noticeably better
CraigJPerry 5 hours ago|||
There's an extended thinking mode for GPT 5.2; I forget the name of it right at this minute. It's super slow - a 3-minute Opus 4.5 prompt takes circa 12 minutes to complete in 5.2 on that super extended thinking mode - but it is not a close race in terms of results: GPT 5.2 wins by a handy margin in that mode. It's just too slow to be usable interactively though.
ifwinterco 4 hours ago||
Interesting, sounds like I definitely need to give the GPT models another proper go based on this discussion
elAhmo 6 hours ago||||
I mostly used Sonnet/Opus 4.x in the past months, but 5.2 Codex seemed to be on par or better for my use case in the past month. I tried a few models here and there but always went back to Claude, but with 5.2 Codex for the first time I felt it was very competitive, if not better.

Curious to see how things will be with 5.3 and 4.6

georgeven 6 hours ago||||
Interesting. Everyone in my circle said the opposite.
MadnessASAP 2 hours ago|||
My experience is that Codex follows directions better but Claude writes better code.

ChatGPT-5.2-Codex follows directions to ensure a task [bead](https://github.com/steveyegge/beads) is opened before starting a task and to keep it updated, almost to a fault. Claude-Opus-4.5, with the exact same directions, forgets about it within a round or two. Similarly, I had a project that required very specific behaviour from a couple of functions; it was documented in a few places, including comments at the top and bottom of the function. Codex was very careful in ensuring the function worked as documented. Claude decided it was easier to do the exact opposite: it rewrote the function, the comments, and the documentation to say it now did the opposite of what was previously there.

If I believed an LLM could be spiteful, I would've believed it on that second one. I certainly felt some after I realised what it had done. The comment literally said:

  // Invariant regardless of the value of X, this function cannot return Y
And it turned it into:

  // Returns Y if X is true
planckscnst 1 hour ago||
That's so strange. I found GPT to be abysmal at following instructions to the point of unusability for any direction-heavy role. I have a common workflow that involves an orchestrator that pretty much does nothing but follow some simple directions [1]. GPT flat-out cannot do this most basic task.

[1]: https://github.com/Vibecodelicious/llm-conductor/blob/main/O...

krzyk 5 hours ago|||
It probably depends on programming language and expectations.
ifwinterco 4 hours ago||
This is mostly Python/TS for me... what Jonathan Blow would probably call not "real programming" but it pays the bills

They can both write fairly good idiomatic code but in my experience opus 4.5 is better at understanding overall project structure etc. without prompting. It just does things correctly first time more often than codex. I still don't trust it obviously but out of all LLMs it's the closest to actually starting to earn my trust

SatvikBeri 4 hours ago|||
I pretty consistently heard people say Codex was much slower but produced better results, making it better for long-running work in the background, and worse for more interactive development.
smcleod 4 hours ago||||
I don't think much from OpenAI can be trusted tbh.
aaaalone 6 hours ago||||
At the end of the day you test it on your own use cases anyway, but benchmarks are a great initial hint as to whether it's worth testing out.
cyanydeez 6 hours ago||||
When do you think we should run this benchmark? Friday, 1pm? Monday 8AM? Wednesday 11AM?

I definitely suspect all these models are being degraded during heavy loads.

j_maffe 6 hours ago||
This hypothesis is tested regularly by plenty of live benchmarks. The services usually don't decay in performance.
thinkingtoilet 4 hours ago|||
We know OpenAI already got caught getting benchmark data and tuning their models to it. So the answer is a hard no. I imagine over time it gives a general view of the landscape and improvements, but take it with a large grain of salt.
tedsanders 1 hour ago|||
Are you referring to FrontierMath?

We had access to the eval data (since we funded it), but we didn't train on the data or otherwise cheat. We didn't even look at the eval results until after the model had been trained and selected.

rvz 3 hours ago|||
The same thing happened with Meta researchers and Llama 4; it shows what can go wrong when 'independent' researchers begin to game AI benchmarks. [0]

You always have to question these benchmarks, especially when the in-house researchers can potentially game them if they wanted to.

Which is why it must be independent.

[0] https://gizmodo.com/meta-cheated-on-ai-benchmarks-and-its-a-...

purplerabbit 7 hours ago|||
The lack of broad benchmark reports in this makes me curious: Has OpenAI reverted to benchmaxxing? Looking forward to hearing opinions once we all try both of these out
MallocVoidstar 6 hours ago||
The -codex models are only for 'agentic coding', nothing else.
dingnuts 6 hours ago||
[dead]
nharada 7 hours ago|||
That's a massive jump, I'm curious if there's a materially different feeling in how it works or if we're starting to reach the point of benchmark saturation. If the benchmark is good then 10 points should be a big improvement in capability...
jkelleyrtp 7 hours ago||
Claude's SWE-bench is 80.8 and Codex's is 56.8.

Seems like 4.6 is still all-around better?

gizmodo59 7 hours ago|||
It's SWE-bench Pro, not SWE-bench Verified. The Verified benchmark has stagnated.
joshuahedlund 7 hours ago||
Any ideas why verified has stagnated? It was increasing rapidly and then basically stopped.
Snuggly73 7 hours ago||
It has been pretty much a benchmark for memorization for a while; there is a paper on the subject somewhere.

SWE-bench Pro public is newer, but it's not live, so it will slowly get memorized as well. The private dataset is more interesting, as are the results there:

https://scale.com/leaderboard/swe_bench_pro_private

Rudybega 4 hours ago|||
You're comparing two different benchmarks. Pro vs Verified.
pjot 8 hours ago||
Claude Code release notes:

  > Version 2.1.32:
     • Claude Opus 4.6 is now available!
     • Added research preview agent teams feature for multi-agent collaboration (token-intensive feature, requires setting
     CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1)
     • Claude now automatically records and recalls memories as it works
     • Added "Summarize from here" to the message selector, allowing partial conversation summarization.
     • Skills defined in .claude/skills/ within additional directories (--add-dir) are now loaded automatically.
     • Fixed @ file completion showing incorrect relative paths when running from a subdirectory
     • Updated --resume to re-use --agent value specified in previous conversation by default.
     • Fixed: Bash tool no longer throws "Bad substitution" errors when heredocs contain JavaScript template literals like ${index + 1}, which
     previously interrupted tool execution
     • Skill character budget now scales with context window (2% of context), so users with larger context windows can see more skill descriptions
     without truncation
     • Fixed Thai/Lao spacing vowels (สระ า, ำ) not rendering correctly in the input field
     • VSCode: Fixed slash commands incorrectly being executed when pressing Enter with preceding text in the input field
     • VSCode: Added spinner when loading past conversations list
neuronexmachina 8 hours ago|
> Claude now automatically records and recalls memories as it works

Neat: https://code.claude.com/docs/en/memory

I guess it's kind of like Google Antigravity's "Knowledge" artifacts?

bityard 6 hours ago|||
If it works anything like the memories on Copilot (which have been around for quite a while), you need to be pretty explicit about it being a permanent preference for it to be stored as a memory. For example, "Don't use emoji in your response" would only be relevant for the current chat session, whereas this is more sticky: "I never want to see emojis from you, you sub-par excuse for a roided-out spreadsheet"
flutas 5 hours ago|||
It's a lot more iffy than that IME.

It's very happy to throw a lot into the memory, even if it doesn't make sense.

9dev 5 hours ago|||
> you sub-par excuse for a roided-out spreadsheet

That’s harsh, man.

om8 7 hours ago||||
Is there a way to disable it? Sometimes I value the agent not having knowledge it can use to cut corners.
nerdsniper 6 hours ago|||
90-98% of the time I want the LLM to only have the knowledge I gave it in the prompt. I'm actually kind of scared that I'll wake up one day and the web interface for ChatGPT/Opus/Gemini will pull information from my prior chats.
vineyardmike 4 hours ago|||
All of these providers support this feature. I don't know about ChatGPT, but the rest are opt-in. I imagine with Gemini it'll be default-on soon enough, since it's consumer focused. Claude does constantly nag me to enable it though.
pdntspa 4 hours ago||||
They already do this.

I've had Claude reference prior conversations when I'm trying to get technical help on thing A, and it will ask me if this conversation is because of thing B that we talked about in the immediate past.

sanxiyn 1 hour ago||
You can disable this at Settings > Capabilities > Memory > Search and reference chats.
hypercube33 6 hours ago||||
I'm fairly sure OpenAI/GPT does pull prior information in the form of its memories
nerdsniper 6 hours ago||
Ah, that could explain why I've found myself using it the least.
sharifhsn 6 hours ago|||
Gemini has this feature but it’s opt-in.
kzahel 5 hours ago|||
Claude told me he can disable it by putting instructions in the MEMORY.md file to not use it. So only a soft disable AFAIK and you'd need to do it on each machine.
4b11b4 4 hours ago||||
I understand everyone's trying to solve this problem but I'm envisioning 1 year down the line when your memory is full of stuff that shouldn't be in there.
codethief 7 hours ago||||
Are we sure the docs page has been updated yet? Because that page doesn't say anything about automatic recording of memories.
neuronexmachina 6 hours ago||
Oh, quite right. I saw people mention MEMORY.md online and I assumed that was the doc for it, but it looks like it isn't.
pdntspa 4 hours ago||||
I thought it was already doing this?

I asked Claude UI to clear its memory a little while back and hoo boy CC got really stupid for a couple of days

kzahel 5 hours ago|||
I looked into it a bit. It stores memories near where it stores the JSONL session history. It's per-project (and specific to the machine). Claude pretty aggressively and frequently writes stuff in there. It uses MEMORY.md as sort of an index, and will write out other files for other topics (linking to them from the main MEMORY.md file).

It gives you a convenient way to say "remember this bug for me, we should fix it tomorrow". I'll be playing around with it more for sure.

I asked Claude to give me a TLDR (condensed from its system prompt):

----

Persistent directory at ~/.claude/projects/{project-path}/memory/, persists across conversations

MEMORY.md is always injected into the system prompt; truncated after 200 lines, so keep it concise

Separate topic files for detailed notes, linked from MEMORY.md

What to record: problem constraints, strategies that worked/failed, lessons learned

Proactive: when I hit a common mistake, check memory first - if nothing there, write it down

Maintenance: update or remove memories that are wrong or outdated

Organization: by topic, not chronologically

Tools: use Write/Edit to update (so you always see the tool calls)

ra7 3 hours ago||
> Persistent directory at ~/.claude/projects/{project-path}/memory/, persists across conversations

I create a git worktree, start Claude Code in that tree, and delete after. I notice each worktree gets a memory directory in this location. So is memory fragmented and not combined for the "main" repo?

vardalab 13 minutes ago||
Yes, I noticed the same thing, and Claude told me that it's going to be deleted. I will have it improve the skill that is part of our worktree cleanup process to consolidate that memory into the main memory if there's anything useful.
surajkumar5050 5 hours ago||
I think two things are getting conflated in this discussion.

First: marginal inference cost vs total business profitability. It’s very plausible (and increasingly likely) that OpenAI/Anthropic are profitable on a per-token marginal basis, especially given how cheap equivalent open-weight inference has become. Third-party providers are effectively price-discovering the floor for inference.

Second: model lifecycle economics. Training costs are lumpy, front-loaded, and hard to amortize cleanly. Even if inference margins are positive today, the question is whether those margins are sufficient to pay off the training run before the model is obsoleted by the next release. That’s a very different problem than “are they losing money per request”.

Both sides here can be right at the same time: inference can be profitable, while the overall model program is still underwater. Benchmarks and pricing debates don’t really settle that, because they ignore cadence and depreciation.

IMO the interesting question isn’t “are they subsidizing inference?” but “how long does a frontier model need to stay competitive for the economics to close?”

jmalicki 5 hours ago||
I suspect they're marginally profitable on API cost plans.

But the max 20x usage plans I am more skeptical of. When we're getting used to $200 or $400 costs per developer to do aggressive AI-assisted coding, what happens when those costs go up 20x? what is now $5k/yr to keep a Codex and a Claude super busy and do efficient engineering suddenly becomes $100k/yr... will the costs come down before then? Is the current "vibe-coding renaissance" sustainable in that regime?

slopusila 3 hours ago||
After the models get good enough to replace coders, they will be able to start raising subscription prices back up.
jmalicki 2 hours ago||
At $100k/yr the joke that AI means "actual Indians" starts to make a lot more sense... it is cheaper than the typical US SWE, but more than a lot of global SWEs.
HPMOR 1 hour ago||
No - because the AI will be superhuman. No human, even at $1mm a year, would be competitive with a corresponding $100k/yr AI subscription.

See, people get confused. They think you can charge __less__ for software because it's automation. The truth is you can charge MORE, because it's high quality and consistent, once the output is good. Software is worth MORE than a corresponding human, not less.

jmalicki 1 hour ago|||
I am unsure if you're joking or not, but you do have a point. But it's not about quality it's about supply and demand. There are a ton of variables moving at once here and who knows where the equilibrium is.
IhateAI 39 minutes ago|||
You're delusional, stop talking to LLMs all day.
raincole 5 hours ago|||
> the interesting question isn’t “are they subsidizing inference?”

The interesting question is if they are subsidizing the $200/mo plan. That's what is supporting the whole vibecoding/agentic coding thing atm. I don't believe Claude Code would have taken off if it were token-by-token from day 1.

(My baseless bet is that they are, but not by much, and that the price will eventually rise by perhaps 2x but not 10x.)

BosunoB 5 hours ago|||
Dario said this in a podcast somewhere. The models themselves have so far been profitable if you look at their lifetime costs and revenue. Annual profitability just isn't a very good lens for AI companies because costs all land in one year and the revenue all comes in the next. Prolific AI haters like Ed Zitron make this mistake all the time.
jmalicki 5 hours ago|||
Do you have a specific reference? I'm curious to see hard data and models.... I think this makes sense, but I haven't figured out how to see the numbers or think about it.
BosunoB 4 hours ago||
I was able to find the podcast. Question is at 33:30. He doesn't give hard data but he explains his reasoning.

https://youtu.be/mYDSSRS-B5U

majewsky 32 minutes ago||
> He doesn't give hard data

And why is that? Should they not be interested in sharing the numbers to shut up their critics, esp. now that AI detractors seem to be growing mindshare among investors?

jmatthiass 4 hours ago|||
In his recent appearance on NYT Dealbook, he definitely made it seem like inference was sustainable, if not flat-out profitable.

https://www.youtube.com/live/FEj7wAjwQIk

rstuart4133 4 hours ago|||
> It’s very plausible (and increasingly likely) that OpenAI/Anthropic are profitable on a per-token marginal basis

There are many places that will not use models running on hardware provided by OpenAI / Anthropic. That is true of my (the Australian) government at all levels. They will only use models running in Australia.

Consequently AWS (and I presume others) will run models supplied by the AI companies for you in their data centres. They won't be doing that at a loss, so the price will cover the marginal cost of the compute plus renting the model. I know from devs using and deploying the service that demand outstrips supply. Ergo, I don't think there is much doubt that they are making money from inference.

waffletower 2 hours ago||
In the case of Anthropic: they host on AWS while their models are also accessible via AWS APIs, so the infrastructure between the two is likely considerably shared, particularly as caching configuration and API limitations are nearly identical between the Anthropic and Bedrock APIs when invoking Anthropic models. It is likely a mutually beneficial arrangement which does not necessarily hinder Anthropic revenue.
w10-1 4 hours ago||
"how long does a frontier model need to stay competitive"

Remember "worse is better". The model doesn't have to be the best; it just has to be mostly good enough, and used by everyone -- i.e., where switching costs would be higher than any increase in quality. Enterprises would still be on Java if the operating costs of native containers weren't so much cheaper.

So it can make sense to be ok with losing money with each training generation initially, particularly when they are being driven by specific use-cases (like coding). To the extent they are specific, there will be more switching costs.

simonw 8 hours ago||
The bicycle frame is a bit wonky but the pelican itself is great: https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe...
stkai 7 hours ago||
Would love to find out they're overfitting for pelican drawings.
fdeage 2 hours ago|||
OpenAI claims not to: https://x.com/aidan_mclau/status/1986255202132042164
andy_ppp 6 hours ago||||
Yes. Raccoon on a unicycle? Magpie on a pedalo?
throw310822 4 hours ago|||
Correct horse battery staple:

https://claude.ai/public/artifacts/14a23d7f-8a10-4cde-89fe-0...

ta988 4 hours ago||
no staple?
iwontberude 3 hours ago||
it looks like a bodge wire
_kb 3 hours ago|||
Platypus on a penny farthing.
theanonymousone 4 hours ago||||
Even if not intentionally, it is probably leaking into training sets.
fragmede 6 hours ago|||
The estimation I did 4 months ago:

> there are approximately 200k common nouns in English, and then we square that, we get 40 billion combinations. At one second per, that's ~1200 years, but then if we parallelize it on a supercomputer that can do 100,000 per second that would only take 3 days. Given that ChatGPT was trained on all of the Internet and every book written, I'm not sure that still seems infeasible.

https://news.ycombinator.com/item?id=45455786

eli 6 hours ago|||
How would you generate a picture of Noun + Noun in the first place in order to train the LLM with what it would look like? What's happening during that 1 estimated second?
metalliqaz 4 hours ago|||
it's pelicans all the way down
Terretta 5 hours ago|||
This is why everyone trains their LLM on another LLM. It's all about the pelicans.
AnimalMuppet 4 hours ago|||
But you need to also include the number of prepositions. "A pelican on a bicycle" is not at all the same as "a pelican inside a bicycle".

There are estimated to be 100 or so prepositions in English. That gets you to 4 trillion combinations.
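
The back-of-the-envelope arithmetic, using the same assumed figures as the quoted estimate (200k nouns, 100 prepositions, 100k renders per second); the pair figure comes out closer to five days than three, but the order of magnitude holds:

  nouns = 200_000        # rough count of common English nouns
  prepositions = 100     # rough count of English prepositions
  rate = 100_000         # hypothetical renders per second across a big cluster

  pairs = nouns * nouns                    # "pelican bicycle"-style pairs
  triples = nouns * prepositions * nouns   # "pelican ON a bicycle"-style phrases

  print(f"{pairs:.1e} pairs   -> {pairs / rate / 86_400:.1f} days at {rate}/s")
  print(f"{triples:.1e} triples -> {triples / rate / 86_400 / 365:.1f} years at {rate}/s")

So adding prepositions pushes the brute-force idea from days to years.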

gcanyon 7 hours ago|||
One aspect of this is that apparently most people can't draw a bicycle much better than this: they get the elements of the frame wrong, mess up the geometry, etc.
arionmiles 6 hours ago|||
There's a research paper from the University of Liverpool, published in 2006, where researchers asked people to draw bicycles from memory, showing how much people overestimate their understanding of basic things. It was a very fun and short read.

It's called "The science of cycology: Failures to understand how everyday objects work" by Rebecca Lawson.

https://link.springer.com/content/pdf/10.3758/bf03195929.pdf

devilcius 4 hours ago|||
There’s also a great art/design project about exactly this. Gianluca Gimini asked hundreds of people to draw a bicycle from memory, and most of them got the frame, proportions, or mechanics wrong. https://www.gianlucagimini.it/portfolio-item/velocipedia/
rcxdude 5 hours ago|||
A place I worked at used it as part of an interview question (it wasn't some pass/fail thing to get it 100% correct, and was partly a jumping-off point to a different question). This was in a city where nearly everyone uses bicycles as everyday transportation. It was surprising how many supposedly mechanically focused people who rode a bike every day, even rode a bike to the interview, would draw a bike that would not work.
gcanyon 4 hours ago|||
I wish I had interviewed there. When I first read that people have a hard time with this I immediately sat down without looking at a reference and drew a bicycle. I could ace your interview.
throwuxiytayq 4 hours ago|||
This is why at my company in interviews we ask people to draw a CPU diagram. You'd be surprised how many supposedly-senior computer programmers would draw a processor that would not work.
niobe 4 hours ago|||
If I was asked that question in an interview to be a programmer I'd walk out. How many abstraction layers either side of your knowledge domain do you need to be an expert in? Further, being a good technologist of any kind is not about having arcane details at the tip of your frontal lobe, and a company worth working for would know that.
duped 1 hour ago||
I mean gp is clearly a joke but

A fundamental part of the job is being able to break down problems from large to small, reason about them, and talk about how you do it, usually with minimal context or without deep knowledge in all aspects of what we do. We're abstraction artists.

That question wouldn't be fundamentally different than any other architecture question. Start by drawing big, hone in on smaller parts, think about edge cases, use existing knowledge. Like bread and butter stuff.

I much more question your reaction to the joke than using it as a hypothetical interview question. I actually think it's good. And if it filters out people that have that kind of reaction then it's excellent. No one wants to work with the incurious.

selcuka 1 hour ago||||
Poe's Law [1]:

> Without a clear indicator of the author's intent, any parodic or sarcastic expression of extreme views can be mistaken by some readers for a sincere expression of those views.

[1] https://en.wikipedia.org/wiki/Poe%27s_law

gedy 4 hours ago||||
That's reasonable in many cases, but I've had situations like this for senior UI and frontend positions, and they don't ask UI or frontend questions; they ask their pet low-level questions instead. Some even snort that it's softball to ask UI questions, or say "they use whatever". It's like, yeah, no wonder your UI is shit and now you're hiring to clean it up.
rsc 4 hours ago|||
Raises hand.
gnatolf 7 hours ago||||
Absolutely. A technically correct bike is very hard to draw in SVG without going overboard in details
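
For a sense of what "technically correct" has to include, here's a hand-made sketch (coordinates invented, Python just assembling the SVG string) of the minimum skeleton: two wheels, the diamond frame (seat tube, top tube, down tube, head tube, chain stay, seat stay), a fork, a crank, a saddle, and handlebars:

  hubs = {"rear": (60, 160), "front": (240, 160)}
  bb = (140, 160)        # bottom bracket (crank axle)
  seat = (120, 90)       # seat cluster (top of the seat tube)
  head_top = (215, 95)   # top of the head tube
  head_bot = (222, 120)  # bottom of the head tube

  members = [
      (bb, seat),                  # seat tube
      (bb, head_bot),              # down tube
      (seat, head_top),            # top tube
      (head_top, head_bot),        # head tube
      (hubs["rear"], bb),          # chain stay
      (hubs["rear"], seat),        # seat stay
      (head_bot, hubs["front"]),   # fork
      (seat, (116, 72)),           # seat post
      ((104, 72), (128, 72)),      # saddle
      (head_top, (212, 80)),       # stem
      ((212, 80), (232, 80)),      # handlebar
  ]

  svg = ['<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 300 220" '
         'stroke="black" stroke-width="4" fill="none">']
  for x, y in hubs.values():
      svg.append(f'  <circle cx="{x}" cy="{y}" r="40"/>')          # wheels
  svg.append(f'  <circle cx="{bb[0]}" cy="{bb[1]}" r="10"/>')      # crank
  for (x1, y1), (x2, y2) in members:
      svg.append(f'  <line x1="{x1}" y1="{y1}" x2="{x2}" y2="{y2}"/>')
  svg.append("</svg>")
  print("\n".join(svg))

Even that crude skeleton is a dozen deliberate decisions about geometry, which is exactly where the models (and most humans) go wrong.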
falloutx 6 hours ago|||
It's not. There are thousands of examples on the internet, though the good SVG sites do have paywalls.

https://www.freepik.com/free-photos-vectors/bicycle-svg

jefftk 5 hours ago|||
Several of those have incorrect frames:

https://www.freepik.com/free-vector/cyclist_23714264.htm

https://www.freepik.com/premium-vector/bicycle-icon-black-li...

Or missing/broken pedals:

https://www.freepik.com/premium-vector/bicycle-silhouette-ic...

https://www.freepik.com/premium-vector/bicycle-silhouette-ve...

http://freepik.com/premium-vector/bicycle-silhouette-vector-...

gnatolf 4 hours ago|||
From smaller to larger nitpick, there's basically something wrong with all of the first 15 or so of these drawings. Thanks for agreeing :)
RussianCow 5 hours ago|||
I'm not positive I could draw a technically correct bike with pen and paper (without a reference), let alone with SVG!
nateglims 5 hours ago||||
I just had an idea for an RLVR startup.
cyanydeez 6 hours ago|||
Yes, but obviously AGI will solve this by, _checks notes_ more TerraWatts!
hackernudes 6 hours ago|||
The word is terawatts unless you mean earth-based watts. OK then, it's confirmed, data centers in space!
seanhunter 6 hours ago|||
…in space!
franze 5 hours ago|||
Here's the animated version: https://claude.ai/public/artifacts/3db12520-eaea-4769-82be-7...
gryfft 5 hours ago||
That's hilarious. It's so close!
einrealist 8 hours ago|||
They trained for it. That's the +0.1!
zahlman 5 hours ago|||
Do you find that word choices like "generate" (as opposed to "create", "author", "write" etc.) influence the model's success?

Also, is it bad that I almost immediately noticed that both of the pelican's legs are on the same side of the bicycle, but I had to look up an image on Wikipedia to confirm that they shouldn't have long necks?

Also, have you tried iterating prompts on this test to see if you can get more realistic results? (How much does it help to make them look up reference images first?)

simonw 3 hours ago||
I've stuck with "Generate an SVG of a pelican riding a bicycle" because it's the same prompt I've been using for over a year now and I want results that are sort-of comparable to each other.

I think when I first tried this I iterated a few times to get to something that reliably output SVG, but honestly I didn't keep the notes I should have.
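
For anyone who wants to run it themselves, the whole benchmark is that one prompt; a minimal sketch with the Anthropic Python SDK (the model id is my guess for the new release, check the current model list):

  # pip install anthropic ; expects ANTHROPIC_API_KEY in the environment
  import anthropic

  client = anthropic.Anthropic()
  message = client.messages.create(
      model="claude-opus-4-6",  # assumed model id
      max_tokens=4096,
      messages=[{"role": "user", "content": "Generate an SVG of a pelican riding a bicycle"}],
  )

  # Save whatever comes back and open it in a browser.
  with open("pelican.svg", "w", encoding="utf-8") as f:
      f.write(message.content[0].text)

One caveat: the model may wrap the SVG in prose or a code fence, so in practice you'd strip that off first.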

etwigg 4 hours ago|||
If we do get paperclipped, I hope it is of the "cycling pelican" variety. Thanks for your important contribution to alignment Simon!
athrowaway3z 7 hours ago|||
This benchmark inspired me to have codex/claude build a DnD battlemap tool with svg's.

They got surprisingly far, but I did need to iterate a few times to have it build tools that would check for things like: don't put walls on roads or water.

What I think might be the next obstacle is self-knowledge. The new agents seem to have picked up ever more vocabulary about their context and compaction, etc.

As a next benchmark you could try having 1 agent and tell it to use a coding agent (via tmux) to build you a pelican.

hoeoek 8 hours ago|||
This really is my favorite benchmark
eaf7e281 8 hours ago|||
There's no way they actually work on training this.
margalabargala 7 hours ago|||
I suspect they're training on this.

I asked Opus 4.6 for a pelican riding a recumbent bicycle and got this.

https://i.imgur.com/UvlEBs8.png

WarmWash 7 hours ago|||
It would be way, way better if they were benchmaxxing this. The pelican in the image (both images) has arms. Pelicans don't have arms, and a pelican riding a bike would use its wings.
ryandrake 6 hours ago|||
Having briefly worked in the 3D Graphics industry, I don't even remotely trust benchmarks anymore. The minute someone's benchmark performance becomes a part of the public's purchasing decision, companies will pull out every trick in the book--clean or dirty--to benchmaxx their product. Sometimes at the expense of actual real-world performance.
seanhunter 6 hours ago|||
Pelicans don’t ride bikes. You can’t have scruples about whether or not the image of a pelican riding a bike has arms.
jevinskie 6 hours ago||
Wouldn’t any decent bike-riding pelican have a bike tailored to pelicans and their wings?
actsasbuffoon 4 hours ago|||
Sure, that’s one solution. You could also Isle of Dr Moreau your way to a pelican that can use a regular bike. The sky is the limit when you have no scruples.
cinntaile 6 hours ago|||
Now that would be a smart chat agent.
mrandish 7 hours ago||||
Interesting that it seems better. Maybe something about adding a highly specific yet unusual qualifier focusing attention?
riffraff 6 hours ago|||
perhaps try a penny farthing?
KeplerBoy 7 hours ago||||
There is no way they are not training on this.
collinmanderson 7 hours ago|||
I suspect they have generic SVG drawing that they focus on.
fragmede 6 hours ago|||
The people that work at Anthropic are aware of simonw and his test, and people aren't unthinking data-driven machines. However valid his test is or isn't, a better score on it is convincing. If it gets, say, 1,000 people to use Claude Code over Codex, how much would that be worth to Anthropic?

$200 * 1,000 = $200k/month.

I'm not saying they are, but to claim with such certainty that they aren't, when money is on the line? Unless you have some insider knowledge you'd like to share with the rest of the class, that seems like a questionable conclusion.

beemboy 5 hours ago|||
Isn't there a point at which it trains itself on these various outputs, or someone somewhere draws one and feeds it into the model so as to pass this benchmark?
bityard 6 hours ago|||
Well, the clouds are upside-down, so I don't think I can give it a pass.
copilot_king_2 7 hours ago|||
I'm firing all of my developers this afternoon.
RGamma 7 hours ago|||
Opus 6 will fire you instead for being too slow with the ideas.
insane_dreamer 5 hours ago|||
Too late. You’ve already been fired by a moltbot agent from your PHB.
nine_k 6 hours ago|||
I suppose the pelican must be now specifically trained for, since it's a well-known benchmark.
7777777phil 7 hours ago|||
best pelican so far would you say? Or where does it rank in the pelican benchmark?
mrandish 7 hours ago||
In other words, is it a pelican or a pelican't?
canadiantim 5 hours ago||
You’ve been sitting on that pun just waiting for it to take flight
nubg 8 hours ago|||
What about the Pelo2 benchmark? (the gray bird that is not gray)
6thbit 5 hours ago|||
do you have a gif? i need an evolving pelican gif
Kye 1 hour ago||
A pelican GIF in a Pelican(TM) MP4 container.
risyachka 6 hours ago|||
Pretty sure at this point they train it on pelicans
ares623 8 hours ago|||
Can it draw a different bird on a bike?
simonw 8 hours ago||
Here's a kākāpō riding a bicycle instead: https://gist.github.com/simonw/19574e1c6c61fc2456ee413a24528...

I don't think it quite captures their majesty: https://en.wikipedia.org/wiki/K%C4%81k%C4%81p%C5%8D

zahlman 4 hours ago||
Now that I've looked it all up, I feel like that's much more accurate to a real kākāpō than the pelican is to a real pelican. It's almost as if it thinks a pelican is just a white flamingo with a different beak.
DetroitThrow 8 hours ago|||
The ears on top are a cute touch
iujasdkjfasf 4 hours ago|||
[dead]
behnamoh 7 hours ago|||
[flagged]
smokel 6 hours ago|||
I'll bite. The benchmark is actually pretty good. It shows in an extremely comprehensible way how far LLMs have come. Someone not in the know has a hard time understanding what 65.4% means on "Terminal-Bench 2.0". Comparing some crappy pelicans on bicycles is a lot easier.
blibble 5 hours ago||
it ceases to be a useful benchmark of general ability when you post it publicly for them to train against
quinnjh 6 hours ago|||
The field is advancing so fast that it's hard to do real science, as there will be a new SOTA by the time you're ready to publish results. I think this is a combination of that and people having a laugh.

Would you mind sharing which benchmarks you think are useful measures for multimodal reasoning?

techpression 5 hours ago||
A benchmark only tests what the benchmark is doing; the goal is to make that task correlate with actually valuable things. Graphics benchmarks are a good example: it's extremely hard to know what you will get in a game by looking at 3DMark scores, it varies by a lot. Making an SVG of a single thing doesn't help much unless that ability applies to all SVG tasks.
fullstackchris 4 hours ago||
[flagged]
dang 3 hours ago||
Personal attacks are not allowed on HN. No more of this, please.
legitster 8 hours ago||
I'm still not sure I understand Anthropic's general strategy right now.

They are doing these broad marketing programs trying to take on ChatGPT for "normies". And yet their bread and butter is still clearly coding.

Meanwhile, Claude's general use cases are... fine. For generic research topics, I find that ChatGPT and Gemini run circles around it: in the depth of research, the type of tasks it can handle, and the quality and presentation of the responses.

Anthropic is also doing all of these goofy things to try to establish the "humanity" of their chatbot - giving it rights and a constitution and all that. Yet it weirdly feels the most transactional out of all of them.

Don't get me wrong, I'm a paying Claude customer and love what it's good at. I just think there's a disconnect between what Claude is and what their marketing department thinks it is.

tgtweak 8 hours ago||
Claude itself (outside of code workflows) actually works very well for general-purpose chat. I have a few non-technical friends that have moved over from ChatGPT after some side-by-side testing, and I've yet to see one go back - which is good, since Claude circa 8 months ago was borderline unusable for anything but coding on the API.
dimgl 22 minutes ago|||
I don't get what's so difficult to understand. They have ambitions beyond just coding. And Claude is generally a good LLM. Even beyond just the coding applications.
eaf7e281 7 hours ago|||
I kinda agree. Their model just doesn't feel "daily" enough. I would use it for any "agentic" tasks and for using tools, but definitely not for day to day questions.
lukebechtel 7 hours ago|||
Why? I use it for all and love it.

That doesn't mean you have to, but I'm curious why you think it's behind in the personal assistant game.

legitster 7 hours ago|||
I have three specific use cases where I try both but ChatGPT wins:

- Recipes and cooking: ChatGPT just has way more detailed and practical advice. It also thinks outside of the box much more, whereas Claude gets stuck in a rut and sticks very closely to your prompt. And ChatGPT's easier to understand/skim writing style really comes in useful.

- Travel and itinerary: Again, ChatGPT can anticipate details much more, and give more unique suggestions. I am much more likely to find hidden gems or get good time-savers than Claude, which often feels like it is just rereading Yelp for you.

- Historical research: ChatGPT wins on this by a mile. You can tell ChatGPT has been trained on actual historical texts and physical books. You can track long historical trends, pull examples and quotes, and even give you specific book or page(!) references of where to check the sources. Meanwhile, all Claude will give you is a web search on the topic.

aggie 6 hours ago||
How does #3 square with Anthropic's literal warehouse full of books we've seen from the copyright case? Did OpenAI scan more books? Or did they take a shadier route of training on digital books despite copyright issues, but end up with a deeper library?
legitster 3 hours ago|||
I have no idea, but I suspect there's a difference between using books to train an LLM to reproduce text/writing styles and being able to actually recall knowledge in said books.
rolisz 6 hours ago|||
I think they bought the books after they were caught pirating them and lost that case (because they pirated, not because of copyright).
eaf7e281 4 hours ago||||
It's hard to say. Maybe it has to do with the way Claude responds or the lack of "thinking" compared to other models. I personally love Claude and it's my only subscription right now, but it just feels weird compared to the others as a personal assistant.
lukebechtel 2 hours ago||
Oh, I always use opus 4.5 thinking mode. Maybe that's the diff.
FergusArgyll 1 hour ago|||
My 2 cents:

All the labs seem to do very different post-training. OpenAI focuses on search: if it's set to thinking, it will search 30 websites before giving you an answer. Claude regularly doesn't search at all, even for questions where it obviously should. Its post-training seems more focused on "reasoning" or planning, things that are useful in programming, where the real bottleneck is not writing code but thinking about how you'll integrate it later, and where search is mostly useless. But for non-coding, day-to-day stuff ("what's the news with x", "how to improve my bread", "cheap tasty pizza") or even medical questions, you really just want a distillation of the internet plus some thought.

solarkraft 7 hours ago||||
But that’s what makes it so powerful (yeah, mixing model and frontend discussion here yet again). I have yet to see a non-DIY product that can so effortlessly call tens of tools by different providers to satisfy your request.
quietsegfault 3 hours ago|||
Claude is far superior for daily chat. I have to work hard to get it to not learn how to work around various bad behaviors I have but don’t want to change.
Squarex 5 hours ago|||
Claude sucks at non-English languages. Gemini and ChatGPT are much better; Grok is the worst. I am a native Czech speaker and Claude makes up words, and Grok sometimes responds in Russian. So while I love it for coding, it's unusable for general purpose for me.
9dev 5 hours ago|||
> Grok sometimes responds in Russian

Geopolitically speaking this is hilarious.

Squarex 4 hours ago||
The voice mode sounded like a Ukrainian trying to speak Czech. I don’t think it means anything.
khendron 1 hour ago||||
Claude is helping me learn French right now. I am using it as a supplementary tutor for a class I am taking. I have caught it in a couple of mistakes, but generally it seems to be working pretty well.
jorl17 4 hours ago||||
Claude is quite good at European Portuguese in my limited tests. Gemini 3 is also very good. ChatGPT is just OK and keeps code-switching all the time, it's very bizarre.

I used to think of Gemini as the lead in terms of Portuguese, but recently subjectively started enjoying Claude more (even before Opus 4.5).

In spite of this, ChatGPT is what I use for everyday conversational chat because it has loads of memories there, because of the top of the line voice AI, and, mostly, because I just brainstorm or do 1-off searches with it. I think effectively ChatGPT is my new Google and first scratchpad for ideas.

kuboble 4 hours ago|||
Claude code (opus) is very good in Polish.

I sometimes vibe code in Polish and it's as good as with English for me. It speaks natural, native-level Polish.

I used Opus to translate thousands of strings in my app into Polish, Korean, and two Chinese dialects. The Polish one is great, and the others are also good according to my customers.

altern8 1 hour ago||
Your game is amazing!

I wish there was a "Reset" button to go back to the original position.

Where are you in Poland?

redox99 1 hour ago|||
Why would I even use Claude for asking something in their web UI, considering that it chips away at my Claude Code usage limit?

Their limit system is so bad.

derwiki 1 hour ago||
It feels very similar to how Lyft positioned themselves against Uber. (And we know how that played out)
blibble 8 hours ago||
> We build Claude with Claude. Our engineers write code with Claude Code every day

well that explains quite a bit

jsheard 8 hours ago||
CC has >6000 open issues, despite their bot auto-culling them after 60 days of inactivity. It was ~5800 when I looked just a few days ago so they seem to be accelerating towards some kind of bug singularity.
dkersten 5 hours ago|||
Just anecdotally, each release seems to be buggier than the last.

To me, their claim that they are vibe coding Claude code isn’t the flex they think it is.

I find it harder and harder to trust anthropic for business related use and not just hobby tinkering. Between buggy releases, opaque and often seemingly glitches rate limits and usage limits, and the model quality inconsistency, it’s just not something I’d want to bet a business on.

zahlman 4 hours ago||
I think I would be much more frightened if it were working well.
ifwinterco 4 hours ago||
Exactly, thank goodness it's still a bit rubbish in some aspects
tgtweak 8 hours ago||||
plot twist, it's all claude code instances submitting bug reports on behalf of end users.
trescenzi 1 hour ago|||
I literally hit a claude code bug today, tried to use claude desktop to debug it which didn't help and it offered to open a bug report for me. So yes 100%. Some of the titles also make it pretty clear they are auto submitted. This is my favorite which was around the top when I was creating my bug report 3 hours ago and is now 3 pages back lol.

> Unable to process - no bug report provided. Please share the issue details you'd like me to convert into a GitHub issue title

https://github.com/anthropics/claude-code/issues/23459

accrual 7 hours ago|||
It's Claude, all the way down.
elAhmo 6 hours ago||||
Insane to think that a relatively simple CLI tool has so many open issues...
emilsedgh 5 hours ago|||
It's not really a simple CLI tool, though; it's really interactive.
trymas 5 hours ago||||
What’s so simple about it?
elAhmo 5 hours ago||
I said relatively simple. It is mostly an API interface with Anthropic models, with tool calling on top of it, very simple input and output.
brookst 4 hours ago|||
With extensibility via plugins, MCP (stdio and http), UI to prompt the user for choices and redirection, tools to manage and view context, and on and on.

It is not at all a small app, at least as far as UX surface area. There are, what, 40ish slash commands? Each one is an opportunity for bugs and feature gaps.

everforward 2 hours ago||
I would still call that small, maybe medium. emacs is huge as far as CLI tools go, awk is large because it implements its own language (apparently capable of writing Doom in). `top` probably has a similar number of interaction points, something like `lftp` might have more between local and remote state.

The complex and magic parts are around finding contextual things to include, and I'd be curious how many are that vs "forgot to call clear() in the TUI framework before redirecting to another page".

9dev 5 hours ago|||
I’m pretty certain you haven’t used it yet(to its fullest extent) then. Claude Code is easily one of the most complex terminal UIs I have seen yet.
dvfjsdhgfv 4 hours ago||
Could you explain why? When I think about complex TUIs, I think about things we were building with Turbo Vision in the 90s.
gorbypark 4 hours ago||
I’m going to buck the trend and say it’s really not that complex. AFAIK they are using Ink, which is React with a TUI renderer.

Cue the "I could build it in a weekend" vibes: I built my own agent TUI using the OpenAI agent SDK and Ink. Of course it's not as fleshed out as Claude, but it supports git worktrees for multi-agent, slash commands, human-in-the-loop prompts, etc. If I point it at the Anthropic models it more or less produces results as good as the real Claude TUI.

I actually “decompiled” the Claude tools and prompts and recreated them. As of 6 months ago Claude was 15 tools, mostly pretty basic (list dir, read file, write file, bash, etc.) with some very clever prompts, especially the task tool it uses to do the quasi-planning-mode task bullets (even when not in planning mode).

Honestly the idea of bringing this all together with an affordable monthly service and obviously some seriously creative “prompt engineers” is the magic/hard part (and making the model itself, obviously).

dwaltrip 5 hours ago|||
sips coffee… ahh yes, let me find that classic Dropbox rsync comment
paxys 7 hours ago|||
Half of them were probably opened yesterday during the Claude outage.
anematode 7 hours ago||
Nah, it was at like 5500 before.
raincole 8 hours ago|||
It explains how important dogfooding is if you want to make an extremely successful product.
jama211 8 hours ago|||
It’s extremely successful, not sure what it explains other than your biases
blibble 8 hours ago|||
Microsoft's products are also extremely successful

they're also total garbage

simianwords 7 hours ago|||
but they have the advantage of already being a big company. Anthropic is new and there's no reason for people to use it
kuboble 3 hours ago|||
The tool is an absolutely fantastic coding assistant. That's why I use it.

The amount of non-critical bugs all over the place is at least an order of magnitude larger than in any software I've ever used daily.

Plenty of built-in /commands don't work. Sometimes it accepts keystrokes with 1-second delays. It often scrolls hundreds of lines in the console after each keystroke. Every now and then it crashes completely and is unrecoverable (I once gave up and installed a fresh WSL). When you ask it a question in plan mode, it is somewhat of an art to find the answer, because after answering it will dump the whole current plan (three screens of text).

And just in general, the technical feel of the TUI is that of a vibe-coded project that got too big to control.

derwiki 1 hour ago||
I think this might be a harbinger of what we should expect for software quality in the next decade
Izikiel43 4 hours ago|||
What if management gives them a reason? You can imagine what those reasons might be.
holoduke 5 hours ago|||
Claude is by far the most popular and best assistant currently available for a developer.
wavemode 5 hours ago||
Okay, and Windows is by far the most popular desktop operating system.

Discussions are pointless when the parties are talking past each other.

pluralmonad 5 hours ago||
Popular meaning lots of people like it or that it is relatively widespread? Polio used to be popular in the latter way.
quietsegfault 3 hours ago||
I like windows, it’s fine. I like MacOS better. I like Linux. None of them are garbage or unusable.
blibble 3 hours ago||
have you used Windows 11?

file explorer takes 5 seconds to open

acedTrex 6 hours ago||||
Something being successful and something being a high quality product with good engineering are two completely different questions.
mvdtnz 7 hours ago|||
Anthropic has perhaps the most embarrassing status page history I have ever seen. They are famous for downtime.

https://status.claude.com/

ronsor 7 hours ago|||
As opposed to other companies which are smart enough not to report outages.
tavavex 6 hours ago||
So, there are only two types of companies: ones that have constant downtime, and ones that have constant downtime but hide it, right?
Sebguer 6 hours ago||
Basically, yes.
Computer0 6 hours ago||||
The competition doesn't currently have all 99's - https://status.openai.com/
djeastm 5 hours ago||||
The best way to use Claude's models seems to be some other inference provider (either OpenRouter or directly)
derwiki 1 hour ago||||
Shades of Fail Whale
dimgl 7 hours ago|||
And yet people still use them.
cedws 7 hours ago|||
The sandboxing in CC is an absolute joke, it's no wonder there's an explosion of sandbox wrappers at the moment. There's going to be a security catastrophe at some point, no doubt about it.
gjsman-1000 8 hours ago|||
Also explains why Claude Code is a React app outputting to a Terminal. (Seriously.)
krystofbe 5 hours ago|||
I did some debugging on this today. The results are... sobering.

Memory comparison of AI coding CLIs (single session, idle):

  | Tool        | Footprint | Peak   | Language      |
  |-------------|-----------|--------|---------------|
  | Codex       | 15 MB     | 15 MB  | Rust          |
  | OpenCode    | 130 MB    | 130 MB | Go            |
  | Claude Code | 360 MB    | 746 MB | Node.js/React |
That's a 24x to 50x difference for tools that do the same thing: send text to an API.

vmmap shows Claude Code reserves 32.8 GB virtual memory just for the V8 heap, has 45% malloc fragmentation, and a peak footprint of 746 MB that never gets released, classic leak pattern.

On my 16 GB Mac, a "normal" workload (2 Claude sessions + browser + terminal) pushes me into 9.5 GB swap within hours. My laptop genuinely runs slower with Claude Code than when I'm running local LLMs.

I get that shipping fast matters, but building a CLI with React and a full Node.js runtime is an architectural choice with consequences. Codex proves this can be done in 15 MB. Every Claude Code session costs me 360+ MB, and with MCP servers spawning per session, it multiplies fast.
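
If anyone wants to reproduce the comparison on their own machine, here's a rough sketch of the sampling side (the PID and poll interval are arbitrary; the vmmap/fragmentation details still need `vmmap` or Instruments on macOS):

  // memwatch.ts - poll a process's resident memory via `ps` (macOS/Linux)
  // Usage (one option): npx tsx memwatch.ts <pid>
  import { execSync } from "node:child_process";

  function rssMiB(pid: number): number {
    // `ps -o rss=` prints the resident set size in KiB
    const kib = parseInt(execSync(`ps -o rss= -p ${pid}`).toString().trim(), 10);
    return Math.round(kib / 1024);
  }

  const pid = Number(process.argv[2]);
  setInterval(() => {
    console.log(`${new Date().toISOString()}  pid=${pid}  rss=${rssMiB(pid)} MiB`);
  }, 5_000);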

atonse 4 hours ago|||
Jarred Sumner (bun creator, bun was recently acquired by Anthropic) has been working exclusively on bringing down memory leaks and improving performance in CC the last couple weeks. He's been tweeting his progress.

This is just regular tech debt that happens from building something to $1bn in revenue as fast as you possibly can, optimize later.

They're optimizing now. I'm sure they'll have it under control in no time.

CC is an incredible product (so is codex but I use CC more). Yes, lately it's gotten bloated, but the value it provides makes it bearable until they fix it in short time.

bdangubic 4 hours ago||
if I had a dollar for each time I heard “until they fix it in short time” I’d have Elon money
badlogic 26 minutes ago||||
OpenCode is not written in Go. It's TS on Bun, with OpenTUI underneath which is written in Zig.
Weryj 5 hours ago||||
I believe they use https://bun.com/ Not Node.js
slopusila 3 hours ago|||
why do you care about uncommitted virtual memory? that's practically infinite
jama211 8 hours ago||||
There’s nothing wrong with that, except it lets ai skeptics feel superior
everforward 2 hours ago|||
There are absolutely things wrong with that, because React was designed to solve problems that don't exist in a TUI.

React fixes issues with the DOM being too slow to fully re-render the entire webpage every time a piece of state changes. That doesn't apply in a TUI, you can re-render TUIs faster than the monitor can refresh. There's no need to selectively re-render parts of the UI, you can just re-render the entire thing every time something changes without even stressing out the CPU.

It brings in a bunch of complexity that doesn't solve any real issues beyond the devs being more familiar with React than a TUI library.

RohMin 7 hours ago||||
https://www.youtube.com/watch?v=LvW1HTSLPEk

I thought this was a solid take

jdthedisciple 6 hours ago||
interesting
overgard 4 hours ago||||
I haven't looked at it directly, so I can't speak to quality, but it's a pretty weird way to write a terminal app.
3836293648 7 hours ago||||
Oh come on. It's massively wrong. It is always wrong. It's not always wrong enough to be important, but it doesn't stop being wrong
vntok 5 hours ago||
You should elaborate. What are your criteria and why do you think they should matter to actual users?
exe34 7 hours ago|||
I use AI and I can call AI slop shit if it smells like shit.
krona 7 hours ago||||
Sounds like a web developer defined the solution a year before they knew what the problem was.
thehamkercat 8 hours ago||||
Same with opencode and gemini, it's disgusting

Codex (by openai ironically) seems to be the fastest/most-responsive, opens instantly and is written in rust but doesn't contain that many features

Claude opens in around 3-4 seconds

Opencode opens in 2 seconds

Gemini-cli is an abomination which opens in around 16 second for me right now, and in 8 seconds on a fresh install

Codex takes 50ms for reference...

--

If their models are so good, why are they not rewriting their own React-in-a-CLI BS in C++ or Rust for a 100x performance improvement (not kidding, it really is that much)?

g947o 7 hours ago|||
Great question, and my guess:

If you build React in C++ and Rust, even if the framework is there, you'll likely need to write your components in C++/Rust. That is a difficult problem. There are actually libraries out there that allow you to build web UI with Rust, although they are for web (+ HTML/CSS) and not specifically CLI stuff.

So someone needs to create such a library that is properly maintained and such. And you'll likely develop slower in Rust compared to JS.

These companies don't see a point in doing that. So they just use whatever already exists.

shoeb00m 7 hours ago|||
Opencode wrote their own TUI library in Zig, and then built a SolidJS library on top of that.

https://github.com/anomalyco/opentui

g947o 4 hours ago||
This has nothing to do with React style UI building.
Philpax 7 hours ago||||
Those Rust libraries have existed for some time:

- https://github.com/ratatui/ratatui

- https://github.com/ccbrown/iocraft

- https://crates.io/crates/dioxus-tui

g947o 4 hours ago||
Where is React? These are TUI libraries, which are not the same thing
Philpax 4 hours ago||
iocraft and dioxus-tui implement the React model, or derivatives of it.
pdntspa 4 hours ago|||
and why do they need react...
Philpax 4 hours ago||
That's actually relatively understandable. The React model (not necessarily React itself) of compositional reactive one-way data binding has become dominant in UI development over the last decade because it's easy to work with and does not require you to keep track of the state of a retained UI.

Most modern UI systems are inspired by React or a variant of its model.

azinman2 7 hours ago||||
Why does it matter if Claude Code opens in 3-4 seconds if everything you do with it can take many seconds to minutes? Seems irrelevant to me.
RohMin 7 hours ago|||
I guess with ~50 years of CPU advancements, 3-4 seconds for a TUI to open makes it seem like we lost the plot somewhere along the way.
strange_quark 7 hours ago||
Don’t forget they’ve also publicly stated (bragged?) about the monumental accomplishment of getting some text in a terminal to render at 60fps.
mbesto 6 hours ago||||
This is exactly the type of thing that AI code writers don't do well - understand the prioritization of feature development.

Some developers say 3-4 seconds are important to them, others don't. Who decides what the truth is? A human? ClawdBot?

wahnfrieden 7 hours ago|||
Because when the agent is taking many seconds to minutes, I am starting new agents instead of waiting or switching to non-agent tasks
shoeb00m 7 hours ago||||
codex cli is missing a bunch of ux features like resizing on terminal size change.

Opencode's core is actually written in zig, only ui orchestration is in solidjs. It's only slightly slower to load than neo-vim on my system.

https://github.com/anomalyco/opentui

bdangubic 3 hours ago||||
50ms to open and then 2hrs to solve a simple problem vs 4s to open and then 5m to solve a problem, eh?
wahnfrieden 7 hours ago|||
Codex team made the right call to rewrite its TypeScript to Rust early on
sweetheart 7 hours ago||||
React's core is agnostic when it comes to the actual rendering interface. It's just all the fancy algos for diffing and updating the underlying tree. Using it for rendering a TUI is a very reasonable application of the technology.
skydhash 6 hours ago||
The terminal UI is not a tree structure that you can diff. It's a 2D grid of character cells, where every manipulation is a stream of text. Refreshing or diffing that makes no sense.
HarHarVeryFunny 3 hours ago|||
IMO diffing might have made sense to do here, but that's not what they chose to do.

What's apparently happening is that React tells Ink to update (re-render) the UI "scene graph", and Ink then generates a new full-screen image of how the terminal should look, then passes this screen image to another library, log-update, to draw to the terminal. log-update draws these screen images by a flicker-inducing clear-then-redraw, which it has now fixed by using escape codes to have the terminal buffer and combine these clear-then-redraw commands, thereby hiding the clear.

An alternative solution, rather than using the flicker-inducing clear-then-redraw in the first place, would have been just to do terminal screen image diffs and draw the changes (which is something I did back in the day for fun, sending full-screen ASCII digital clock diffs over a slow 9600baud serial link to a real terminal).
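
To make the diff idea concrete, a rough sketch (not what Ink/log-update actually do, just the alternative I'm describing): compare the previous and next screen images row by row, and only rewrite rows that changed, using the standard cursor-position (ESC[row;colH) and erase-to-end-of-line (ESC[K) escapes.

  // Sketch: redraw only the rows that changed between two terminal "frames".
  // prev/next are arrays of already-formatted rows for the whole screen.
  function drawDiff(prev: string[], next: string[]): void {
    let out = "";
    const rows = Math.max(prev.length, next.length);
    for (let row = 0; row < rows; row++) {
      if ((prev[row] ?? "") === (next[row] ?? "")) continue; // unchanged: leave it alone
      out += `\x1b[${row + 1};1H`;          // move cursor to column 1 of this row (1-based)
      out += (next[row] ?? "") + "\x1b[K";  // write new content, erase the rest of the line
    }
    process.stdout.write(out);              // single write per frame, no clear, no flicker
  }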

skydhash 2 hours ago||
Any diff would require to have a Before and an After. Whatever was done for the After can be done to directly render the changes. No need for the additional compute of a diff.
HarHarVeryFunny 1 hour ago||
Sure, you could just draw the full new screen image (albeit a bit inefficient if only one character changed), and no need for the flicker-inducing clear before draw either.

I'm not sure what the history of log-update has been or why it does the clear-before-draw. Another simple alternative to the pre-clear would have been just to clear to end of line (ESC[0K) after each partial line drawn.

Longwelwind 5 hours ago||||
When doing advanced terminal UI, you might at some point have to layout content inside the terminal. At some point, you might need to update the content of those boxes because the state of the underlying app has changed. At that point, refreshing and diffing can make sense. For some, the way React organizes logic to render and update an UI is nice and can be used in other contexts.
skydhash 5 hours ago||
How big is the UI state that it makes sense to bring in React and the related accidental complexity? I'm ready to bet that no TUI has that big of a state.
bizzleDawg 5 hours ago||||
Only in the same way that the pixels displayed in a browser are not a tree structure that you can diff - the diffing happens at a higher level of abstraction than what's rendered.

Diffing and only updating the parts of the TUI which have changed does make sense if you consider the alternative is to rewrite the entire screen every "frame". There are other ways to abstract this; e.g. a library like tqdm for Python may well have a significantly simpler abstraction than Claude's tree for storing what it's going to update next for its progress bar widget, but it also provides a much simpler interface.

To me it seems more fair game to attack it for being written in JS than for using a particular "rendering" technique to minimise updates sent to the terminal.

skydhash 5 hours ago||
Most UI libraries store state in a tree of components. And if you're creating a custom widget, they will give you a 2D context for the drawing operations. Using React makes sense in those cases because what you're diffing is state; the UI library then renders as usual, which is usually done via compositing.

The terminal does not have a render phase (or an update-state phase). You either refresh the whole screen (flickering) or control where to update manually (custom engine, may flicker locally). And any updates are sequential (moving the cursor and then sending what is to be displayed), not all at once like 2D pixel rendering.

So most TUIs only update when there's an event to do so, or at a frequency much lower than 60fps. This is why top and htop have a setting for that, and why other TUI software offers a keybind to refresh and reset its rendering engine.

sweetheart 3 hours ago|||
The "UI" is indeed represented in memory in tree-like structure for which positioning is calculated according to a flexbox-like layout algo. React then handles the diffing of this structure, and the terminal UI is updated according to only what has changed by manually overwriting sections of the buffer. The CLI library is called Ink and I forget the name of the flexbox layout algo implementation, but you can read about the internals if you look at the Ink repo.
tayo42 8 hours ago||||
Is this a react feature or did they build something to translate react to text for display in the terminal?
sbarre 7 hours ago|||
React, the framework, is separate from react-dom, the browser rendering library. Most people think of those two as one thing because they're the most popular combo.

But there are many different rendering libraries you can use with React, including Ink, which is designed for building CLI TUIs.
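
For a sense of what that looks like in practice, a minimal Ink sketch (the counter itself is just an illustration, not anything from Claude Code):

  // counter.tsx - React components rendered to the terminal via Ink
  import React, { useEffect, useState } from "react";
  import { render, Box, Text } from "ink";

  function Counter() {
    const [n, setN] = useState(0);
    useEffect(() => {
      const timer = setInterval(() => setN((v) => v + 1), 1000);
      return () => clearInterval(timer);
    }, []);
    // Box/Text are Ink's primitives; layout works like flexbox, not HTML
    return (
      <Box borderStyle="round" padding={1}>
        <Text color="green">Re-rendered {n} times</Text>
      </Box>
    );
  }

  render(<Counter />);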

skydhash 6 hours ago||
Anyone that knows a bit about terminals would already know that using React is not a good solution for TUI. Terminal rendering is done as a stream of characters which includes both the text and how it displays, which can also alter previously rendered texts. Diffing that is nonsense.
9dev 5 hours ago||
You’re not diffing that, though. The app keeps a virtual representation of the UI state in a tree structure that it diffs on, then serializes that into a formatted string to draw to the output stream. It's not about limiting the amount of characters redrawn (that would indeed be nonsense), but handling separate output regions effectively.
pkkim 7 hours ago||||
They used Ink: https://github.com/vadimdemedes/ink

I've used it myself. It has some rough edges in terms of rendering performance but it's nice overall.

tayo42 7 hours ago||
Thats pretty interesting looking, thanks!
embedding-shape 7 hours ago||||
Not a built-in React feature. The idea has been around for quite some time; I came across it initially with https://github.com/vadimdemedes/ink back in 2022 sometime.
tayo42 7 hours ago|||
i had claude make a snake clone and fix all the flickering in like 20 minutes with the library mentioned lol
CooCooCaCha 8 hours ago||||
It’s really not that crazy.

React itself is a frontend-agnostic library. People primarily use it for writing websites but web support is actually a layer on top of base react and can be swapped out for whatever.

So they’re really just using react as a way to organize their terminal UI into components. For the same reason it’s handy to organize web ui into components.

dreamteam1 5 hours ago||
And some companies use it to write start menus.
CamperBob2 7 hours ago|||
> Also explains why Claude Code is a React app outputting to a Terminal. (Seriously.)

Who cares, and why?

All of the major providers' CLI harnesses use Ink: https://github.com/vadimdemedes/ink

quietsegfault 3 hours ago|||
What does it explain, oh snark master supreme?
spruce_tips 8 hours ago|||
Ah yes, explains why it takes 3 seconds for a new chat to load after I click new chat in the macOS app.
exe34 7 hours ago||
Can Claude fix the flicker in Claude yet?
nickstinemates 6 hours ago||
[flagged]
losvedir 6 hours ago|||
Oh, is that what the issue is? I've seen the "flicker" thing as a meme, but as someone who uses Claude Code I've never noticed. I use ghostty mostly, so maybe it's not an issue with ghostty? Or maybe I just haven't noticed it.
nickstinemates 6 hours ago||
Yes it's people using bad tools on underpowered machines as far as I have seen
winrid 5 hours ago||
Happens with Konsole sometimes on an 8th gen i7. This cpu can run many instances of intellij just fine, but somehow this TUI manages to be slow sometimes. Codex is fine, so no good argument exists really.
hkt 5 hours ago|||
Blaming the terminal seems a little backwards. Perhaps the application could take responsibility for being compatible with common terminals?
nickstinemates 11 minutes ago||
I have no dog in the fight.
Someone1234 8 hours ago||
Does anyone with more insight into the AI/LLM industry happen to know if the cost to run them in normal user workflows is falling? The reason I'm asking is that "agent teams", while a cool concept, are largely constrained by the economics of running multiple LLM agents (i.e. plans/API calls that make this practical at scale are expensive).

A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers, and I don't know if that has changed with more efficient hardware/software improvements/caching.

simonw 8 hours ago||
The cost per token served has been falling steadily over the past few years across basically all of the providers. OpenAI dropped the price they charged for o3 to 1/5th of what it was in June last year thanks to "engineers optimizing inferencing", and plenty of other providers have found cost savings too.

Turns out there was a lot of low-hanging fruit in terms of inference optimization that hadn't been plucked yet.

> A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers

Where did you hear that? It doesn't match my mental model of how this has played out.

cootsnuck 7 hours ago|||
I have not seen any reporting or evidence at all that Anthropic or OpenAI is able to make money on inference yet.

> Turns out there was a lot of low-hanging fruit in terms of inference optimization that hadn't been plucked yet.

That does not mean the frontier labs are pricing their APIs to cover their costs yet.

It can both be true that it has gotten cheaper for them to provide inference and that they still are subsidizing inference costs.

In fact, I'd argue that's way more likely, given that has been precisely the go-to strategy for highly competitive startups for a while now: price low to pump adoption and dominate the market, worry about raising prices for financial sustainability later, and burn through investor money until then.

What no one outside of these frontier labs knows right now is how big the gap is between current pricing and eventual pricing.

chis 7 hours ago|||
It's quite clear that these companies do make money on each marginal token. They've said this directly and analysts agree [1]. It's less clear that the margins are high enough to pay off the up-front cost of training each model.

[1] https://epochai.substack.com/p/can-ai-companies-become-profi...

m101 6 hours ago|||
It’s not clear at all because model training upfront costs and how you depreciate them are big unknowns, even for deprecated models. See my last comment for a bit more detail.
simonw 3 hours ago|||
They are obviously losing money on training. I don't think they are selling inference for less than what it costs them to serve those tokens, though.

That really matters. If they are making a margin on inference they could conceivably break even no matter how expensive training is, provided they sign up enough paying customers.

If they lose money on every paying customer, then building great products that customers want to pay for will just make their financial situation worse.

ACCount37 4 hours ago|||
By now, model lifetime inference compute is >10x model training compute, for mainstream models. Further amortized by things like base model reuse.
emp17344 3 hours ago||||
Sure, but if they stop training new models, the current models will be useless in a few years as our knowledge base evolves. They need to continually train new models to have a useful product.
magicalist 6 hours ago||||
> They've said this directly and analysts agree [1]

chasing down a few sources in that article leads to articles like this at the root of claims[1], which is entirely based on information "according to a person with knowledge of the company’s financials", which doesn't exactly fill me with confidence.

[1] https://www.theinformation.com/articles/openai-getting-effic...

simonw 3 hours ago||
"according to a person with knowledge of the company’s financials" is how professional journalists tell you that someone who they judge to be credible has leaked information to them.

I wrote a guide to deciphering that kind of language a couple of years ago: https://simonwillison.net/2023/Nov/22/deciphering-clues/

9cb14c1ec0 6 hours ago|||
It's also true that their inference costs are being heavily subsidized. For example, if you factor Oracle's debt into OpenAI's revenue, they would be incredibly far underwater on inference.
NitpickLawyer 7 hours ago||||
> they still are subsidizing inference costs.

They are for sure subsidising costs on the all-you-can-prompt packages ($20/$100/$200 per month). They do that mostly for data gathering, and to a smaller degree for user retention.

> evidence at all that Anthropic or OpenAI is able to make money on inference yet.

You can infer that from what 3rd party inference providers are charging. The largest open models atm are dsv3 (~650B params) and kimi2.5 (1.2T params). They are being served at $2-3/Mtok. That's the sonnet / gpt-mini / gemini3-flash price range. You can make some educated guesses that they get some leeway for model size at the $10-15/Mtok prices for their top-tier models. So if they are inside some sane model sizes, they are likely making money off of token-based APIs.

slopusila 3 hours ago||
most of those subscriptions go unused. I barely use 10% of mine

so my unused tokens compensate for the few heavy users

aenis 24 minutes ago||
Thanks!

I hope my unused gym subscription pays back the good karma :-)

mrandish 7 hours ago||||
> I have not see any reporting or evidence at all that Anthropic or OpenAI is able to make money on inference yet.

Anthropic planning an IPO this year is a broad meta-indicator that internally they believe they'll be able to reach break-even sometime next year on delivering a competitive model. Of course, their belief could turn out to be wrong but it doesn't make much sense to do an IPO if you don't think you're close. Assuming you have a choice with other options to raise private capital (which still seems true), it would be better to defer an IPO until you expect quarterly numbers to reach break-even or at least close to it.

Despite the willingness of private investment to fund hugely negative AI spend, the recently growing twitchiness of public markets around AI ecosystem stocks indicates they're already worried prices have exceeded near-term value. It doesn't seem like they're in a mood to fund oceans of dotcom-like red ink for long.

WarmWash 6 hours ago||
IPO'ing is often what you do to give your golden investors an exit hatch to dump their shares on the notoriously idiotic and hype driven public.
barrkel 7 hours ago|||
> evidence at all that Anthropic or OpenAI is able to make money on inference yet.

The evidence is in third party inference costs for open source models.

replwoacause 1 hour ago||||
My experience trying to use Opus 4.5 on the Pro plan has been terrible. It blows up my usage very very fast. I avoid it altogether now. Yes, I know they warn about this, but it's comically fast how quickly it happens.
nubg 8 hours ago||||
> "engineers optimizing inferencing"

are we sure this is not a fancy way of saying quantization?

bityard 6 hours ago|||
When MP3 became popular, people were amazed that you could compress audio to 1/10th its size with minor quality loss. A few decades later, we have audio compression that is much better and higher-quality than MP3, and they took a lot more effort than "MP3 but at a lower bitrate."

The same is happening in AI research now.

simonw 2 hours ago||||
The o3 optimizations were not quantization, they confirmed this at the time.
embedding-shape 8 hours ago||||
Or distilled models, or just slightly smaller models but same architecture. Lots of options, all of them conveniently fitting inside "optimizing inferencing".
esafak 6 hours ago||||
Someone made a quality tracker: https://marginlab.ai/trackers/claude-code/
jmalicki 7 hours ago|||
A ton of GPU kernels are hugely inefficient. Not saying the numbers are realistic, but look at the 100s of times of gain in the Anthropic performance takehome exam that floated around on here.

And if you've worked with pytorch models a lot, having custom fused kernels can be huge. For instance, look at the kind of gains to be had when FlashAttention came out.

This isn't just quantization, it's actually just better optimization.

Even when it comes to quantization, Blackwell has far better quantization primitives and new floating point types that support row or layer-wise scaling that can quantize with far less quality reduction.

There is also a ton of work in the past year on sub-quadratic attention for new models that gets rid of a huge bottleneck, but like quantization can be a tradeoff, and a lot of progress has been made there on moving the Pareto frontier as well.

It's almost like when you're spending hundreds of billions on capex for GPUs, you can afford to hire engineers to make them perform better without just nerfing the models with more quantization.

Der_Einzige 7 hours ago||
"This isn't X, it's Y" with extra steps.
jmalicki 6 hours ago||
I'm flattered you think I wrote as well as an AI.
nubg 4 hours ago||
lmao
sumitkumar 7 hours ago|||
It seems it is true for gemini because they have a humongous sparse model but it isn't so true for the max performance opus-4.5/6 and gpt-5.2/3.
Aurornis 7 hours ago|||
> A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers

This gets repeated everywhere but I don't think it's true.

The company is unprofitable overall, but I don't see any reason to believe that their per-token prices are below the marginal cost of computing those tokens.

It is true that the company is unprofitable overall when you account for R&D spend, compensation, training, and everything else. This is a deliberate choice that every heavily funded startup should be making, otherwise you're wasting the investment money. That's precisely what the investment money is for.

However I don't think using their API and paying for tokens has negative value for the company. We can compare to models like DeepSeek where providers can charge a fraction of the price of OpenAI tokens and still be profitable. OpenAI's inference costs are going to be higher, but they're charging such a high premium that it's hard to believe they're losing money on each token sold. I think every token paid for moves them incrementally closer to profitability, not away from it.

3836293648 7 hours ago|||
The reports I remember show that they're profitable per-model, but overlap R&D so that the company is negative overall. And therefore will turn a massive profit if they stop making new models.
schnable 5 hours ago|||
* stop making new models and people keep using the existing models, not switch to a competitor still investing in new models.
trcf23 6 hours ago|||
Doesn’t it also depend on averaging with free users?
runarberg 7 hours ago|||
I can see a case for omitting R&D when talking about profitability, but omitting training makes no sense. Training is what makes the model; omitting it is like omitting the cost of running a car manufacturer's production facility. If AI companies stop training they will stop producing models, and they will run out of products to sell.
vidarh 5 hours ago|||
The reason for this is that training cost scales with the model and the training cadence, not usage, so they will hope to scale the number of inference tokens sold by increasing use and/or slow the training cadence as competitors are also forced to aim for overall profitability.

It is essentially a big game of venture capital chicken at present.

Aurornis 6 hours ago|||
It depends on what you're talking about

If you're looking at overall profitability, you include everything

If you're talking about unit economics of producing tokens, you only include the marginal cost of each token against the marginal revenue of selling that token

runarberg 4 hours ago||
I don't understand the logic. Without training there is no model, so the marginal cost of each token means nothing on its own. The more you train, the better the model, and (presumably) the more customer interest you will gain. Unlike R&D, you will always have to train new models if you want to keep your customers.

To me this looks likes some creative bookkeeping, or even wishful thinking. It is like if SpaceX omits the price of the satellites when calculating their profits.

nodja 6 hours ago|||
> A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers, and I don't know if that has changed with more efficient hardware/software improvements/caching.

This is obviously not true, you can use real data and common sense.

Just look up a similar sized open weights model on openrouter and compare the prices. You'll note the similar sized model is often much cheaper than what anthropic/openai provide.

Example: Let's compare claude 4 models with deepseek. Claude 4 is ~400B params so it's best to compare with something like deepseek V3 which is 680B params.

Even if we compare the cheapest claude model to the most expensive deepseek provider we have claude charging $1/M for input and $5/M for output, while deepseek providers charge $0.4/M and $1.2/M, a fifth of the price, you can get it as cheap as $.27 input $0.4 output.

As you can see, even if we skew things heavily in favor of claude, the story is clear: claude token prices are much higher than they could've been. The difference in prices is because anthropic also needs to pay for training costs, while openrouter providers just need to worry about making serving the models profitable. Deepseek is also not as capable as claude, which also puts downward pressure on its prices.

There's still a chance that anthropic/openai models are losing money on inference: for example, if they're somehow much larger than expected (the 400B param number is not official, just speculative from how it performs). And this is only taking into account API prices; subscriptions and free users will of course skew the real profitability numbers, etc.

Price sources:

https://openrouter.ai/deepseek/deepseek-v3.2-speciale

https://claude.com/pricing#api

Someone1234 5 hours ago||
> This is obviously not true, you can use real data and common sense.

It isn't "common sense" at all. You're comparing several companies losing money, to one another, and suggesting that they're obviously making money because one is under-cutting another more aggressively.

LLM/AI ventures are all currently under-water with massive VC or similar money flowing in, they also all need training data from users, so it is very reasonable to speculate that they're in loss-leader mode.

nodja 5 hours ago||
Doing some math in my head, buying the GPUs at retail price, it would probably take around half a year to make the money back, probably more depending on how expensive electricity is in the area you're serving from. So I don't know where this "losing money" rhetoric is coming from. It's probably harder to source the actual GPUs than to make money off them.
suddenlybananas 2 hours ago||
electricity
zozbot234 8 hours ago|||
> i.e. plans/API calls that make this practical at scale are expensive

Local AIs make agent workflows a whole lot more practical. Making the initial investment in a good homelab/on-prem facility will effectively become a no-brainer given the advantages in privacy and reliability, and you don't have to fear rugpulls or VCs playing the "lose money on every request" game, since you know exactly how much you're paying in power costs for your overall load.

vbezhenar 6 hours ago|||
I don't care about privacy and I didn't have much problems with reliability of AI companies. Spending ridiculous amount of money on hardware that's going to be obsolete in a few years and won't be utilized at 100% during that time is not something that many people would do, IMO. Privacy is good when it's given for free.

I would rather spend money on some pseudo-local inference (when cloud company manages everything for me and I just can specify some open source model and pay for GPU usage).

slopusila 3 hours ago|||
On-prem economics don't work because you can't batch requests, unless you are able to run 100 agents at the same time, all the time.
zozbot234 38 minutes ago||
> unless you are able to run 100 agents at the same time all the time

Except that newer "agent swarm" workflows do exactly that. Besides, batching requests generally comes with a sizeable increase in memory footprint, and memory is often the main bottleneck especially with the larger contexts that are typical of agent workflows. If you have plenty of agentic tasks that are not especially latency-critical and don't need the absolutely best model, it makes plenty of sense to schedule these for running locally.

Havoc 8 hours ago|||
Saw a comment earlier today about google seeing a big (50%+) fall in Gemini serving cost per unit across 2025 but can’t find it now. Was either here or on Reddit
mattddowney 8 hours ago||
From Alphabet 2025 Q4 Earnings call: "As we scale, we’re getting dramatically more efficient. We were able to lower Gemini serving unit costs by 78% over 2025 through model optimizations, efficiency and utilization improvements." https://abc.xyz/investor/events/event-details/2026/2025-Q4-E...
Havoc 4 hours ago||
Thanks! That's the one
m101 6 hours ago|||
I think actually working out whether they are losing money is extremely difficult for current models but you can look backwards. The big uncertainties are:

1) how do you depreciate a new model? What is its useful life? (Only know this once you deprecate it)

2) how do you depreciate your hardware over the period you trained this model? Another big unknown and not known until you finally write the hardware off.

The easy thing to calculate is whether you are making money actually serving the model. And the answer is almost certainly yes they are making money from this perspective, but that’s missing a large part of the cost and is therefore wrong.

3abiton 8 hours ago|||
It's not just that. Everyone is complacent with the utilization of AI agents. I have been using AI for coding for quite a while, and most of my "wasted" time is correcting its trajectory and guiding it through the thinking process. It's very fast iterations but it can easily go off track. Claude's family are pretty good at doing chained task, but still once the task becomes too big context wise, it's impossible to get back on track. Cost wise, it's cheaper than hiring skilled people, that's for sure.
lufenialif2 8 hours ago||
Cost wise, doesn’t that depend on what you could be doing besides steering agents?
cyanydeez 6 hours ago||
Isn't the quote something like: "If these LLMs are so good at producing products, where are all those products?"
KaiserPro 7 hours ago|||
Gemini-pro-preview is on ollama and requires an H100, which is ~$15-30k. Google are charging $3 per million tokens. Supposedly it's capable of generating between 1 and 12 million tokens an hour.

Which is profitable, but not by much.
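
Rough back-of-envelope with those numbers (the power figure and utilisation are my guesses, not anything Google has published):

  // All inputs are assumptions: ~$15-30k H100, $3/M tokens, 1-12M tokens/hour.
  function monthsToPayOffGpu(mTokPerHour: number): number {
    const gpuCost = 25_000;       // midpoint of the $15-30k range
    const pricePerMTok = 3;       // $3 per million tokens
    const powerPerHour = 0.15;    // ~1 kW at ~$0.15/kWh, all-in guess
    const marginPerHour = pricePerMTok * mTokPerHour - powerPerHour;
    return gpuCost / marginPerHour / (24 * 30);
  }
  console.log(monthsToPayOffGpu(1));   // low end of the claimed range: ~12 months at full utilisation
  console.log(monthsToPayOffGpu(12));  // high end: ~1 month

That ignores networking, staff, and idle time, which is why the low end of the range looks profitable but not by much.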

grim_io 4 hours ago||
What do you mean it's on ollama and requires h100? As a proprietary google model, it runs on their own hardware, not nvidia.
KaiserPro 4 hours ago||
Sorry, a lack of context:

https://ollama.com/library/gemini-3-pro-preview

You can run it on your own infra. Anthropic and OpenAI are running off Nvidia, and so are Meta (well, supposedly they had custom silicon; I'm not sure if it's capable of running big models) and Mistral.

however if google really are running their own inference hardware, then that means the cost is different (developing silicon is not cheap...) as you say.

simonw 2 hours ago|||
You can't run Gemini 3 Pro Preview on your own infrastructure. Ollama sell access to cloud models these days. It's a little weird and confusing.
zozbot234 3 hours ago|||
That's a cloud-linked model. It's about using ollama as an API client (for ease of compatibility with other uses, including local), not running that model on local infra. Google does release open models (called Gemma) but they're not nearly as capable.
Bombthecat 7 hours ago|||
That's why anthropic switched to tpu, you can sell at cost.
WarmWash 6 hours ago||
These are intro prices.

This is all straight out of the playbook. Get everyone hooked on your product by being cheap and generous.

Raise the price to backpay what you gave away plus cover current expenses and profits.

In no way shape or form should people think these $20/mo plans are going to be the norm. From OpenAI's marketing plan, and a general 5-10 year ROI horizon for AI investment, we should expect AI use to cost $60-80/mo per user.

esafak 2 hours ago||
The models in 5-10 years are going to be unimaginably good. $100/month will be a bargain for knowledge workers, if they survive.
replwoacause 1 hour ago||
I feel like I can't even try this on the Pro plan because Anthropic has conditioned me to understand that even chatting lightly with the Opus model blows up usage and locks me out. So if I would normally use Sonnet 4.5 for a day's worth of work but I wake up and ask Opus a couple of questions, I might as well just forget about doing anything with Claude for the rest of the day lol. But so far I haven't had this issue with ChatGPT. Their 5.2 model (haven't tried 5.3) worked on something for 2 FREAKING HOURS and I still haven't run into any limits. So yeah, Opus is out for me now unfortunately. Hopefully they make the Sonnet model better though!
greenavocado 1 hour ago|
That's why you use Opus for detailed planning docs and weaker models for implementation & RAG for more focused implementation
replwoacause 1 hour ago||
Exactly. I barely had a chance to kick the tires the couple of times I did this before it exploded my usage. I don't just chat with it casually. The questions I asked were a part of an overall planning strategy, which was never allowed to get off the ground on my tiny Pro plan.
rahulroy 4 hours ago|
They are also giving away $50 extra pay as you go credit to try Opus 4.6. I just claimed it from the web usage page[1]. Are they anticipating higher token usage for the model or just want to promote the usage?

[1] https://claude.ai/settings/usage

zamadatix 3 hours ago||
"Page not found" for me. I assume this is for currently paying accounts only or something (my subscription hasn't been active for a while), which is fair.
thunfischtoast 4 hours ago||
Thanks for the tip!