Posted by aray07 8 hours ago
To me, it is hard to reject this hypothesis today. The fact that Anthropic is rapidly trying to raise prices may betray that their recent lead comes at the cost of dramatically higher operating costs. Their gross margins this past quarter will be an important data point on this.
I think the tendency for model-assessment graphs to put the log of cost per token on the x-axis (e.g., Artificial Analysis' site) has obscured this dynamic.
(* explained at https://news.ycombinator.com/item?id=26998308)
The power dynamics are also vastly against me. I represent a fraction of my employer's labour, but my employer represents 100% of my income.
That dynamic is totally inverted with AI. You are a rounding error on their revenue sheet, but they have a monopoly on your work throughput. How do you budget a workforce that could turn 20% more expensive overnight?
This is why there are a ton of corps running open-source models in house... known costs, known performance, upgrade as you see fit. The consumer backlash against 4o was noted by a few orgs, and they saw the writing on the wall: they didn't want to develop against a platform built on quicksand (see the open web, apps on Facebook, and a host of other examples).
There are people out there making smart AI business decisions, to have control over performance and costs.
If you've got something to share I'd love to see it.
I'd also flip your framing on its head. One of the advantages of human labor over agents is accountability. Someone needs to own the work at the end of the day, and the incentive alignment is stronger for humans given that there is a real cost to being fired.
I think we're reaching the point where more developers need to start right-sizing the model and effort level to the task. It was easy to get comfortable with using the best model at the highest setting for everything for a while, but as the models continue to scale and reasoning token budgets grow, that's no longer a safe default unless you have unlimited budgets.
I welcome the idea of having multiple points on this curve that I can choose from, depending on the task. I'd welcome an option for an even larger model that I could pull out for complex and important tasks, even if I had to let it run for 60 minutes in the background and it made my entire 5-hour token quota disappear in one question.
I know not everyone wants this mental overhead, though. I predict we'll see more attempts at smart routing to different models depending on the task, along with the predictable complaints from everyone when the results are less than predictable.
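For the curious, here's roughly what I mean by routing, as a minimal sketch. The model names and the complexity heuristic are made-up placeholders, not anyone's actual product tiers:

```python
# A minimal sketch of task-based model routing. Model names and the
# complexity heuristic are illustrative assumptions, not real tiers.

def estimate_complexity(task: str) -> int:
    """Crude heuristic: score a task description by size and risk keywords."""
    score = len(task) // 200  # longer specs tend to mean bigger tasks
    for keyword in ("refactor", "migrate", "architecture", "concurrency"):
        if keyword in task.lower():
            score += 2
    return score

def pick_model(task: str) -> str:
    """Route to the cheapest model that plausibly handles the task."""
    score = estimate_complexity(task)
    if score <= 1:
        return "small-fast-model"   # boilerplate, renames, docstrings
    if score <= 4:
        return "mid-tier-model"     # well-scoped features, tests
    return "frontier-model"         # gnarly, cross-cutting work

print(pick_model("Rename a config field and update call sites"))          # small
print(pick_model("Migrate the persistence layer, refactor concurrency"))  # frontier
```

The failure-prone part is obviously the heuristic; a real router would presumably use a cheap classifier model rather than keyword matching.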
For a while I used Cerebras Code for 50 USD a month, with them running a GLM model and giving you millions of tokens per day. It did a lot of heavy lifting in a software migration I was doing at the time (and made it DOABLE in the first place), BUT there were about 10 different places where the migration got fucked up and had to be fixed manually: files left over after refactoring (worse, basically duplicated ones), some constants and routes that are dead code, some development pages that weren't removed when they were superseded by others, and so on.
I would say that Claude Code, throwing Opus at most problems (and using Sonnet or Haiku for sub-agents on simple, well-specified tasks), is actually way better, simply because it fucks things up less often and review iterations at least catch when things are going wrong like that. Worse models (pretty much every one that I can afford to launch locally, even ones that need ~80 GB of VRAM, in the context of an org wanting to self-host stuff) will be confidently wrong and place time bombs in your codebases that you won't even be aware of unless you pay close attention to everything, even when the task was rote bullshit that any model worth its salt should have resolved with 0 issues.
My fear is that models that would let me truly be as productive as I want with any degree of confidence might be Mythos tier and the economics of that just wouldn't work out.
For handing work off to an LLM in large chunks, picking the best model available is the only way to go right now.
I’m curious how to even do it. I have no idea how to choose which model to use in advance of a given task, regardless of the mental overhead.
And unless you can predict perfectly what you need, there’s going to be some overuse due to choosing the wrong model and having to redo some work with a better model, I assume?
Even EMs and TPMs are assigning people based on their previous experience, which generally boils down to "i've seen this task before and I know what's involved," "this task is small, and I know what's involved," or "this task is too big and needs to be understood better."
That's how things worked pre-AI, and old problems are new problems again.
When you run any bigger project, you have senior folks who tackle the hardest parts of it, experienced folks who can churn out massive amounts of code, junior folks who target smaller/simpler/better-scoped problems, etc.
We don't default to telling the most senior engineer "you solve all of those problems." But they're often involved in evaluating, scoping down, breaking down, supervising, correcting, etc.
There are tons of analogies and decades of industry experience to apply here.
I'm not saying that can't be done, but taking a large task that hasn't been broken down needs, you guessed it, a powerful agent. that's your senior engineer who can figure out the rote parts, the medium parts, and the thorny parts.
the goal isn't to have an engineer do that. we should still be throwing powerful agents at a problem, they should just be delegating the work more efficiently.
throwing either an engineer or an agent at unexplored work means you have to delegate it to the most experienced resource, or suffer the consequences.
So there's a push for them to increase revenue per user, which brings us closer to the real cost of running these models.
At that point you are beholden to your shareholders and no longer can eschew profit in favor of ethics.
Unfortunately, I think this is the beginning of the end of Anthropic and Amodei being a company and CEO you could actually get behind and believe were trying to do "the right thing".
It will become an increasingly cutthroat competition between Anthropic and OpenAI (and perhaps Google eventually, if they can close the gap between their frontier models and Claude/GPT) to win market share and revenue.
Perhaps Amodei will eventually leave Anthropic too and start yet another AI startup because of Anthropic's seemingly inevitable prioritization of profit over safety.
Just like if Boeing were able to release a supersonic plane tomorrow that was also twice as efficient: it'd destroy any airline that was deep in debt for its current, now-worthless planes.
No, not really: you can issue two classes of shares, where the founders hold a class with more voting power while other shareholders get a class with less.
Facebook and Google have done something similar.
A publicly traded company is legally obligated to go against the global good.
Call me an optimist, but I'm still holding out hope that Amodei is and still can do the right thing. That hope is fading fast though.
So no matter what, if you do something lots of people like (and hence compensate you for), you will be evil.
It's a very interesting quirk of human intuition.
Can't blame someone who comes to such a conclusion about money and power.
Yet here they are, often considered one of the most evil companies on Earth. That's the interesting quirk.
Can you explain what you mean by this? I disagree but I don't understand how you think Google did this so I am very curious.
For my part, I started using the internet before Google, and I strongly hold the opinion that Google's greatest contribution to the internet was utterly destroying its peer to peer, free, open exchange model by being the largest proponent of centralizing and corporatizing the web.
I was about to call it reselling, but so many startups with their fingers in the tech-startup pie offer containerised cloud compute akin to a loss leader. It harks back to the old days of buying clock time on a mainframe, except you're getting it for free for a while.
Like, Apple computers are already quite pricey -- $1000 or $2000 or so for a decent one. But you can spec up one that’s a bit better (not really that much better) and they’ll charge you $10K, $20K, $30K. Some customers want that and many are willing to pay for it.
Is there an equivalent ultra-high-end LLM you can have if you’re willing to pay? Or does it not exist because it would cost too much to train?
I guess at the time that was GPT-4.5. I don't think people used it a lot because it was crazy expensive, and not that much better than the rest of the crop.
So, for agentic workflows (ones where the model gets feedback from tools, etc.), fast enough is important.
I'd be rather surprised if they are still doing business by then.
I'm guessing we're gonna have a world like working on cars: most people won't have expensive tools (e.g., a full hydraulic lift) for personal stuff; they'll have to make do with lesser tools.
i bought a $3k AMD 395+ during the Sam Altman price hike and it's got a local model that readily accomplishes menial tasks.
there's a ceiling to these price hikes because open weights will keep popping up as competitors try to advertise their wares.
sure, we forgo some capabilities, but there's definitely not that much cash in proprietary models given their indeterminacy
Or they are just not willing to burn obscene levels of capital like OpenAI.
The key question is how well a given model does the work, which is a lot harder to measure. But I think token costs are still an order of magnitude below the point where a US-based developer using AI for coding should be asking questions about price; at current price points, the cost/benefit question is dominated by what makes the best use of your limited time as an engineer.
We already shipped 3 things this year built using Claude. The biggest one was porting two native apps into one React Native app, which was originally estimated as a 6-7 month project for a 9-FTE team and ended up being a 2-month project with 2 people. To me, the economic value of a Claude subscription used right is in the range of 10-40k EUR, depending on the type of work and the developer driving it. If Anthropic jacked the prices 100x today, I'd still buy the licenses for my guys.
Edit: OK, if they charged 20k per month per seat I'd also start benchmarking the alternatives and local models, but for my business case, running a 700M budget, Claude brings disproportionate benefits: not just time saved in developer costs, but also faster shipping times, reduced friction between various product and business teams, and so on. For the first time we generally say 'yes' to whichever frivolities our product teams come up with, and that's a nice feeling.
In my experience, even Claude 4.6's output can't be trusted blindly: it'll write flawed code, then write tests that test that flawed code, giving a false sense of confidence and accomplishment, only for the flaws to be revealed on closer inspection later.
Additionally, it's an age-old fact that code is always easier to write than to read and understand (even prior to AI, and even if you were the original author yourself), so I'm not so sure this much generative output from probabilistic models would be so flawless that nobody needs to read and understand that code.
Too good to be true.
I remember what website security was like before frameworks like Django and RoR added default security features. I think we will see something similar with coding agents: they'll just run skills/checks/MCPs with performance, security, resource management, and so on built in.
I have done this myself. For all apps I build, I have linters, static code analyzers, etc. running at the end of each session. It's a cheap default in a very strict mode, and it cleans up most of the obvious stuff almost for free.
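Concretely, the end-of-session hook is something like this sketch. It assumes Python tooling; ruff and mypy here stand in for whatever linters/analyzers your stack actually uses:

```python
# A sketch of the "run checks at the end of every agent session" pattern.
# Assumes ruff and mypy are installed; swap in your own tools and flags.
import subprocess
import sys

CHECKS = [
    ["ruff", "check", "."],        # fast linter, strict-ish defaults
    ["mypy", "--strict", "src/"],  # static type checking
]

def run_checks() -> int:
    """Run every check; return the number of failing tools."""
    failures = 0
    for cmd in CHECKS:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            failures += 1
    return failures

if __name__ == "__main__":
    sys.exit(1 if run_checks() else 0)
```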
Let's say you don't review. Those two extra months probably turn into four extra months of finding bugs and such. Still 8 man-months vs 54.
Of course this is all assuming that the original estimates were correct. IME building stuff using AI in greenfield projects is gold. But using AI in brownfield projects is only useful if you primarily use AI to chat to your codebase and to make specific scoped changes, and not actually make large changes.
So a service run at a loss now could be high-margin on new chips in a year. We also don't really know that they are losing money on the $200/month subscriptions, just that they are compute-constrained.
If prices increase, it might be because of a supply crunch rather than unit economics.
It is like comparing an 8K display to a 16K display: at normal viewing distance the difference is imperceptible, but 16K comes at a significant premium.
The same applies to intelligence. Sure, some users might register a meaningful bump, but if 99% can't tell the difference in their day-to-day work, does it matter?
A 20-30% cost increase needs to deliver a proportional leap in perceivable value.
This is also why I don't see the models getting commoditized anytime soon - the dimensionality of LLM output that is economically relevant keeps growing linearly for coding (therefore the possibility space of LLM outputs grows exponentially) which keeps the frontier nontrivial and thus not commoditized.
In contrast, there is not much demand for 100 page articles written by LLMs in response to basic conversational questions, therefore the models are basically commoditized at answering conversational questions because they have already saturated the difficulty/usefulness curve.
Doubt. Yes, there was a point at which it suddenly became useful to write code in a general sense. But I have seen almost no improvement in the departments of architecting, operations, and gaslighting. In fact, gaslighting has gotten worse: entire outputs based on a wrong assumption that it hid, almost intentionally. And I had to create very dedicated, non-agentic tools to combat this.
And all of this with latest Opus line.
Lately I've been wondering too just how large these proprietary "ultra powerful frontier models" really are. It wouldn't shock me if the default models are actually just some kind of crazy MoE thing with only a very small number of active params but a huge pool of experts to draw from for world knowledge.
I am getting 10 tok/sec on a 27B Qwen3.5 (thinking, Q4, 18GB) on an M4/32GB Mac Mini. It's slow.
For a 9B (much smaller, non-thinking) I am getting 30 tok/sec, which is fast enough for regular use if you need something from the training data (like how to use grep, or Hemingway's favorite cocktail).
I'm using LM Studio, which is very easy and free (as in beer).
If I can get the performance I'm seeing out of free models on a 6-year-old MacBook Pro M1, it's a sign of things to come.
Frontier models will have their place for 1) extensive integrations and tooling and 2) massive context windows. But I could see a very real local-first near future where a good portion of compute and inference is run locally and only goes to a frontier model as needed.
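A minimal sketch of what that local-first pattern could look like, assuming something like LM Studio's OpenAI-compatible server (mentioned upthread) on its default localhost:1234. The escalation logic and the frontier call are placeholders:

```python
# Local-first inference: try the local model, escalate to a hosted frontier
# model only when needed. The escalation condition is an assumption.
import requests

LOCAL_URL = "http://localhost:1234/v1/chat/completions"

def ask_local(prompt: str) -> str | None:
    """Query the local OpenAI-compatible server; None if it's unavailable."""
    try:
        resp = requests.post(
            LOCAL_URL,
            json={"messages": [{"role": "user", "content": prompt}]},
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
    except requests.RequestException:
        return None  # local server down or overloaded

def ask(prompt: str, needs_frontier: bool = False) -> str:
    if not needs_frontier:
        answer = ask_local(prompt)
        if answer is not None:
            return answer
    # Placeholder: call your hosted frontier model here as needed.
    raise NotImplementedError("escalate to a frontier API")
```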
If Claude understood what you meant without you having to over-explain, it would be an improvement.
For coding though, there is kind of no limit to the complexity of software. The more invariants and potential interactions the model can be aware of, the better presumably. It can handle larger codebases. Probably past the point where humans could work on said codebases unassisted (which brings other potential problems).
For summarizing creative writing, I've found Opus and Gemini 3 Pro are still only okay, and actively bad once the text gets over 15K tokens or so.
A lot of long context and attention improvements have been focused on Needle in a Haystack type scenarios, which is the opposite of what summarization needs.
And it's not that they "don't notice"; it's that they physically can't distinguish finer angular separation.
You raise a good point: what's a good metric for LLM performance? There are all the benchmarks out there, but aren't they one-and-done, usually at release? Who keeps checking the performance of those models? At this point it's just by feel. People say models have been dumbed down, and that's it.
I think the actual future is open source models. Problem is, they don't have the huge marketing budget Anthropic or OpenAI does.
It doesn't matter if a model is, e.g., 30% cheaper to use than another (token-wise) if I need to burn 2x more tokens to get the same acceptable result.
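The arithmetic, spelled out with the numbers above, in normalized units:

```python
# Effective cost = price per token x tokens burned.
cheap = 0.70 * 2.0  # 30% cheaper per token, but 2x the tokens
base  = 1.00 * 1.0  # baseline model
print(cheap, base)  # 1.4 vs 1.0 -> the "cheaper" model costs 40% more overall
```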
Until it's making 100k decisions a day and many are dependent on previous results.
It's not necessarily a single discrete point, I think. In my experience, it's tied to the quality/power of your harness and tooling. More powerful tooling has revealed differences between models that previously weren't easy to notice. This matches your display analogy: I'm essentially saying that the point at which display-resolution improvements become imperceptible depends on how far you sit.
I was always wondering where that breaking point for cost/performance is for displays. I use 4K at 27" and it's noticeably much better for text than 1440p at 27", but I have no idea whether the next (and final) stop is 6K or 8K.
I switched to the Studio Display XDR and it is noticeably better than my 4k displays and my 1440p displays feel positively ancient and near unusable for text.
You mean a couple of years ago?
claude code on opus continuously = whole bill. different measurement.
haiku 4.5 is good enough for fanout. opus earns it on synthesis where you need long context + complex problem solving under constraints
https://docs.github.com/fr/copilot/reference/ai-models/suppo...
At 7.5x for 4.7, heck no. It isn't even clear it is an upgrade over Opus 4.6.
Opus 4.5 and 4.6 will be removed very soon.
So what is your contingency plan?
> Over the coming weeks, Opus 4.7 will replace Opus 4.5 and Opus 4.6 in the model picker for Copilot Pro+.
> This model is launching with a 7.5× premium request multiplier as part of promotional pricing until April 30th
TBF, it's a rumour that they are switching to per-token price in May, but it's from an insider (apparently), and seeing how good of a deal the current per-request pricing is, everyone expects them to bump prices sometime soon or switch to per-token pricing.
It's a very good model for a very good price
When pushed it did the ol' "whoopsie, silly me"; turned out the hallucination had been flagged by the agent and ignored by Opus.
Makes it hard to trust it, which sucks as it's a heavy part of my workflow.
The final calculation assumes that Opus 4.7 uses the exact same trajectory + reasoning output as Opus 4.6. I have not verified this, but I assume it's not the case, given that Opus 4.7 on Low thinking is strictly better than Opus 4.6 on Medium, etc., etc.
Progress. /s
> Progress. /s
pretty much, lmao. my theory is 4.6 started thinking less to save compute for 4.7 release. but who knows what's going on at anthropic
People at Anthropic, of course
"given that Opus 4.7 on Low thinking is strictly better than Opus 4.6 on Medium, etc., etc.”
Opus 4.7 in general is more expensive for similar usage. Now we can argue that it provides better performance all else being equal, but I haven't been able to see that.
https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-...
1. In my own use since April 1 this month, very heavy coding:
> 472.8K input tokens + 299.3M cached
> 2.2M output tokens
My workloads generate ~5x more output than input, and output tokens cost 5x more per token... output dominates my bill at roughly 25x the cost of input. (Even more so when you consider cache hits!) If Opus 4.7 was more efficient with reasoning (and thus output), I'd likely save considerable money (were I paying per-token).
2. Anthropic's benchmarks DO show strictly better results (granted, they are Anthropic's benchmarks, so salt may be needed): https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-...
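For what it's worth, the point-1 numbers roughly check out as a back-of-envelope, assuming output tokens cost ~5x fresh input per token (in line with published pricing ratios):

```python
# Rough cost ratio from the usage above: output vs fresh (uncached) input,
# in units where a fresh input token costs 1 and an output token costs 5.
input_tok = 472_800      # fresh input tokens
output_tok = 2_200_000   # output tokens

print(round(output_tok * 5 / input_tok))  # ~23, i.e. roughly the "25x" claimed
```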
I think a big issue with the industry right now is it's constantly chasing higher performing models and that comes at the cost of everything else. What I would love to see in the next few years is all these frontier AI labs go from just trying to create the most powerful model at any cost to actually making the whole thing sustainable and focusing on efficiency.
The GPT-3 era was a taste of what the future could hold, but those models were toys compared to what we have today. We saw real gains during the GPT-4 / Claude 3 era, where they could start being used as tools but required quite a bit of oversight. Now, in the GPT-5 / Claude 4 era, I don't really think we need to go much further; we should start focusing on efficiency and sustainability.
What I would love the industry to start focusing on in the next few years is not on the high end but the low end. Focus on making the 0.5B - 1B parameter models better for specific tasks. I'm currently experimenting with fine-tuning 0.5B models for very specific tasks and long term I think that's the future of AI.
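For anyone wanting to try the same, here's a minimal LoRA fine-tuning sketch for a ~0.5B model. The base model name, data file, and hyperparameters are illustrative placeholders, not a recommendation:

```python
# A minimal LoRA fine-tuning sketch for a small task-specific model.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "Qwen/Qwen2.5-0.5B"  # any ~0.5B base model works the same way
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Low-rank adapters keep the trainable parameter count tiny.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16,
                                         task_type="CAUSAL_LM"))

# One JSONL record per training example: {"text": "..."}
data = load_dataset("json", data_files="task_examples.jsonl")["train"]
data = data.map(lambda ex: tok(ex["text"], truncation=True, max_length=512))

Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tok, mlm=False),
).train()
```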
If you can forgive the obviously-AI-generated writing, [CPUs Aren't Dead](https://seqpu.com/CPUsArentDead) makes an interesting point on AI progress: Google's latest, smallest Gemma model (Gemma 4 E2B), which can run on a cell phone, outperforms GPT-3.5-turbo. Granted, this factoid is based on `MT-Bench` performance, a benchmark from 2023 which I assume to be both fully saturated and leaked into the training data for modern LLMs. However, cross-referencing [Artificial Analysis' Intelligence Index](https://artificialanalysis.ai/models?models=gemma-4-e2b-non-...) suggests that indeed the latest 2B open-weights models are capable of matching or beating 175B models from 3-4 years ago. Perhaps more impressive, [Gemma 4 E4B matches or beats GPT-4o](https://artificialanalysis.ai/models?models=gemma-4-e4b%2Cge...) on many benchmarks.
If this trend continues, perhaps we'll have the capabilities of today's best models available to reasonably run on our laptops!
I'm not seeing that in my testing, but these opinions are all vibe based anyway.
I personally think the whole "the newest model is crazy! You've gotta use X (insert most expensive model)" thing is just FOMO, and marketing-prone people parroting whatever they've seen in the news or online.
Surely you can see that the first lab to solve this gains a massive advantage?
Human psychology is surprisingly similar, and the same pattern shows up across domains.
I haven't bought Pringles chips in years; even the box now is nothing like it was. Thinner. Shorter. I imagine how far from the top the slices stack up.
Today we are almost on non-speaking terms. I'm asking it to do some simple stuff and he's making incredibly stupid mistakes:
> This is the third time that I have to ask you to remove the issue that was there for more than 20 hours. What is going on here?
and at the same time the compacting is firing like crazy (which adds ~4-minute delays every 1-15 minutes):

| # | Time     | Gap before | Session span | API calls |
|---|----------|------------|--------------|-----------|
| 1 | 15:51:13 | 8s         | <1m          | 1         |
| 2 | 15:54:35 | 48s        | 37m          | 51        |
| 3 | 16:33:33 | 2s         | 19m          | 42        |
| 4 | 16:53:44 | 1s         | 9m           | 30        |
| 5 | 17:04:37 | 1s         | 17m          | 30        |
# — sequential compaction event number, ordered by time.
Time — timestamp of the first API call in the resumed session, i.e. when the new context (carrying the compaction summary) was first sent to the model.
Gap before — time between the last API call of the prior session and the first call of this one. Includes any compaction processing time plus user think time between the two sessions.
Session span — how long this compaction-resumed session ran, from its first API call to its last before the next compaction (or end of session).
API calls — total number of API requests made during this resumed session. Each tool use, each reply, each intermediate step = one request.
Bottom line, I will probably stay on Sonnet until they fix all these issues.

I'm curious: how does using more tokens save compute?
Too many signs: the sudden jump in TPS (the biggest smoking gun for me), the new tokenizer, commentary about Project Mythos from Anthropic employees, etc.
It looks like their new Sonnet was good enough to be labeled Opus and their new Opus was good enough to be labeled Mythos.
They'll probably continue post-training and release a more polished version as Opus 5
both Anthropic and OpenAI quantize their models a few weeks after release. they'd never admit it out loud, but it's more or less common knowledge now. no one has enough compute.
Tons of conspiracy theories and accusations.
I've never seen any compelling studies (or even raw data) to back any of it up.
https://arxiv.org/pdf/2307.09009
But of course, this isn't a written statement by a corporate spokesperson. I don't think breweries make such statements when they water down their beer either.
The only misprediction it makes is that AI is creating the brain-dead user base...
You have to hook your customers before you reel them in!
https://www.netflix.com/gb/title/70264888?s=a&trkid=13747225...
> You're right, that was a shit explanation. Let me go look at what V1 MTBL actually is before I try again.
> Got it — I read the V1 code this time instead of guessing. Turns out my first take was wrong in an important way. Let me redo this in English.
:facepalm:
Does the LLM even keep a (self-accessible) record of previous internal actions to make this assertion believable, or is this yet another confabulation?
The weird thing is, yesterday I asked it to test and report back on a 30+ commit branch for a PR, and it did that flawlessly.
Claude and other LLMs do not have a gender; they are not a “he”. Your LLM is a pile of weights, prompts, and a harness; anthropomorphising like this is getting in the way.
You're experiencing what happens when you sample repeatedly from a distribution. Given enough samples, the probability of an eventual bad session approaches 100%.
Just clear the context, roll back, and go again. This is part of the job.
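The arithmetic behind "eventually approaches 100%": if each session independently derails with probability p (the 5% below is an assumed figure, just for illustration), the chance of at least one bad session in n sessions is 1 - (1 - p)^n:

```python
# Probability of at least one derailed session in n independent sessions.
p = 0.05  # assumed per-session derailment probability
for n in (10, 50, 100):
    print(n, round(1 - (1 - p) ** n, 3))
# 10 -> 0.401, 50 -> 0.923, 100 -> 0.994
```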
> max: Max effort can deliver performance gains in some use cases, but may show diminishing returns from increased token usage. This setting can also sometimes be prone to overthinking. We recommend testing max effort for intelligence-demanding tasks.
> xhigh (new): Extra high effort is the best setting for most coding and agentic use cases
Ref: https://platform.claude.com/docs/en/build-with-claude/prompt...