I think this is the first time we've had a third minor version bump on a frontier Anthropic model. (I count the 0.5s as major here, because they've been issued non-sequentially and also corresponded to massive capability leaps, e.g., Sonnet 3.5, Opus 4.5.)
So now the Opus 4.5 family has successors 4.6, 4.7, and 4.8, each posting fairly modest claimed gains. My own experience w/ 4.6 and 4.7 is that I don't firmly grasp any capability improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.
Maybe my own tastes are saturated now (it's smarter than me?) and I'll never again perceive model progress. Maybe the incrementalism is such that I'd notice immediately if my 4.7 workflows were redirected now to 4.5.
Difficult spot for the labs to be in because, if they have a stronger product, I'd prefer they release it and that I can use it.
But as this dynamic continues, the improvements are going to be less and less legible for end-users, who will complain about the churn-without-payoff, even when the payoff may actually be real.
There's orders of magnitude of low hanging juice to squeeze out of smaller models.
It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years (design not certain, probably unlikely).
It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.
As far as reasoning is concerned, with the recent GRAM release, there may be 4 orders of magnitude of reasoning to tack on to smaller models.
Think about that... Google, OpenAI, Anthropic could train a 30B GRAM-based model in days - and it could potentially have better local reasoning than the best model available today at >1T params... They could upgrade that to a ~600B MoE model in days to have general trivia knowledge rivaling the best models...
You just can't train a 1T+ parameter model that fast. It's a giant "if" how much GRAM turns out to improve things, but it's unlikely to be trivial or nothing.
Larger models can already sort of tell you anything. They're never going to get everything right unless they stop being LLMs.
There's just not a lot of juice left to squeeze for Gemini to tell you exactly how tall Ke$ha is or when the last time Britney Spears went to jail was...
(G)enerative (R)ecursive re(A)soning (M)odels. They really wanted the acronym.
"Analysis" was right there.
Which will be pretty rare.
I agree, but with their urgent IPO-driven need to keep increasing prices, the frontier vendors now have every incentive to maintain the perception that frontier performance requires endless >$200K racks of unobtanium GPUs and RAM. While they'd love to reduce their actual costs, they'd only want to do it to the extent they are certain they can keep it secret. Otherwise, they can't maintain and keep increasing their prices. And post-IPO audited reporting makes keeping that secret even harder.
Game theory-wise, they probably don't want their armies of leading researchers optimizing frontier performance, at least in any way that would further accelerate the relative price/perf of smaller models or self/cloud-hosting. While they know the open source models will always improve, they still win as long as enough customers demand the latest frontier and the open source lag remains constant.
They profit most in a world where a few frontier labs stay far in front, drag-racing each other and expending vast capital. It keeps their customers reliant and paying top dollar while keeping low-cost alternatives farther back. They probably much prefer competing with a couple other frontier labs who have similar astronomical costs and biz models, than a world where self or cloud-hosted open-source models start closing the gap enough to start commoditizing their business.
I don't disagree, but how much of this ends up being distillation? I can't help but imagine that 4.8 was probably trained in part by leveraging Mythos.
If the very large models turn out to be very expensive to run relative to the benefits, it's possible that they could end up still being trained, but ultimately used as a tool to create smaller models that are nearly as effective.
I'm curious if someone here with a stronger background in the space has a similar intuition or not.
The latter is much better (since you can clean up, review, update responses and filter your datasets).
I suspect nobody is doing real student teacher distillation, it’s just easier to do a bunch of training on the same giant corpus then post train on the synthetic corpus with its reasoning traces etc. (which might have been generated by a bigger better LLM)
A lot, so you can bet tens of millions are flowing to congress to have distillation declared illegal before this happens. And then it'll happen anyway.
A lab can train a large model, and then distill a smaller model from it that retains the majority of the useful capability.
I don't know well enough if there's any benefit of that over just training the smaller model directly, but I'll bet there are some times where that is useful. I could easily see it being easier to do the initial pre-training on a larger model but be able to distill everything useful down into a smaller model, essentially filtering out a lot of noise in the process.
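For anyone unfamiliar with what "real" teacher-student distillation means (as opposed to post-training on synthetic text), here is a minimal sketch of the classic soft-target loss: the student is pushed to match the teacher's temperature-softened output distribution. The logits and temperature below are made-up illustrative values, and nothing here is a claim about what any lab actually does.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical next-token logits over a 3-token vocabulary.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # prefers a different token

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

The loss is zero when the student reproduces the teacher's distribution and grows as they diverge. The alternative discussed in this thread, training on teacher-generated text, only ever sees sampled tokens, not the full distribution.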
You don't need distillation. They already have the training sets.
It's MLA + MoE + Medusa (a better version of Speculative Decoding) + 1.58b (possibly - maybe nothing) + GRAM (which will almost certainly not turn out to be a nothing burger, but no one has quickly turned this around yet to prove it).
And even that would be rich as an accusation from SOTA labs that depend on explicitly disregarding the intellectual property in millions of pieces of training data.
On the architecture side, I'm a lot more interested in attention residuals than anything else. It's one of those things that seems obvious in hindsight, and Kimi has proven it at scale.
Yes, variants typically 2-3x less good...
Same with speculative decoding... They all do something, but there are known techniques that are substantially better - that just weren't known when they started development of the previous models.
MTP will still be highly valuable for interactive use of course.
- this gets reinvented/rediscovered constantly under different names
- it can't be trained very well (right now, will change)
- massive theoretical improvements over current models (log_2(vocabsize)=17, residual stream dim is thousands of dimensions, recursivity means more information bandwidth by ~3 OoM)
- BUT it can't be interpreted or aligned <- this is why no one uses it and no one talks about it. the idea is 100% obvious to all the frontier labs and there is a good reason why it isn't used
I follow this stuff closely, I think I know what I'm talking about (edited for formatting)
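To make the bandwidth claim concrete, here is a back-of-the-envelope sketch. The vocabulary size, hidden width, and activation precision are my assumptions, picked only to match "log_2(vocabsize)=17" and "thousands of dimensions". It compares raw bits only, which is an upper bound and not a measure of usable information.

```python
import math

# All three constants are illustrative assumptions, not figures from any model card.
VOCAB_SIZE = 131_072   # 2**17, so one sampled token carries at most 17 bits
D_MODEL = 4096         # residual stream width ("thousands of dimensions")
BITS_PER_DIM = 16      # e.g. 16-bit activations

bits_per_token = math.log2(VOCAB_SIZE)
bits_per_hidden_state = D_MODEL * BITS_PER_DIM
ratio = bits_per_hidden_state / bits_per_token
orders_of_magnitude = math.log10(ratio)

print(f"token: {bits_per_token:.0f} bits, hidden state: {bits_per_hidden_state} bits")
print(f"ratio: {ratio:.0f}x (~{orders_of_magnitude:.1f} orders of magnitude)")
```

Under these assumptions, feeding the hidden state back instead of a sampled token widens the raw channel by a bit over 3 orders of magnitude, which is where the "~3 OoM" figure plausibly comes from.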
What are the different names? I haven't seen this before.
> - it cant be trained very well (right now, will change)
If you're sure it will change, then why are you certain that it hasn't yet, and if it's proven a 5000x boost in reasoning... why aren't they exploring this path more aggressively?
> the idea is 100% obvious to all the frontier labs and there is a good reason why it isn't used
Surely someone is willing to take a 5000x boost in reasoning on a small research model... None of them have even tried anything resembling this AFAIK. It does not seem like something 100% obvious to them.
Those things being untrainable at scale is why they aren't around. Alignment is an afterthought.
As long as it's giving the right outputs, who cares what's in latent space?
If the model thinks in latent space: "God I wish these people would die," and constantly does the right thing, who cares?
Additionally, if one of its latent spaces that it never explores is a psychopath -> who cares? The path never gets taken...
That's a lot of harmless people walking around with crazy thoughts...
There's still several orders of magnitude of improvement that are almost certainly left - it's just not clear how much is left on the frontier end.
Most people will be very glad to pay Anthropic, OpenAI, Google etc $200 a month to get things done 20x faster than they could IF they had a $8000 MacBook and could theoretically do it locally.
Some people would pay $200 a month forever not to have to open the terminal one time...
Furthermore, if looking at the results takes 10 minutes, that same 1 hour task only sees a 3x improvement. And so on.
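The arithmetic here is Amdahl's-law flavored: the human review step does not speed up, so it caps the overall gain. A rough sketch follows; the 10-minute model runtime is my assumption (it is one set of numbers that reproduces the 3x figure), and the function name is made up for illustration.

```python
def effective_speedup(manual_minutes, model_minutes, review_minutes):
    """Speedup over doing the task by hand once review time is included."""
    return manual_minutes / (model_minutes + review_minutes)

# Assumed: a 1-hour task the model finishes in 10 minutes, plus 10 minutes of review.
print(effective_speedup(60, 10, 10))   # 3.0

# Same task with a 20x faster model (3 minutes): review time now dominates.
print(round(effective_speedup(60, 3, 10), 2))
```

However fast the model gets, the speedup in this example can never exceed 60 / 10 = 6x while review still takes 10 minutes.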
No, most people will not pay $200 for an LLM subscription. Some software developers do. Also, at $200/month, you are much better off getting the MacBook, assuming token output speed is the same or at least reasonable.
LLMs aren't productive enough for the average person right now to justify dropping $200 on them. They'll need to be more capable and integrated and even so...
What insight do you have to make this claim?
I've repeatedly given local models non-trivial projects that involve research and coding which they've successfully completed with minimal intervention from me (almost exclusively in the domain of reviewing the results). Again, nothing comparable with current SotA, but definitely tasks I could not have given SotA models last year (without agent harness).
Now that pure progress from these models seems to have slowed down, we're seeing a ton of options for both making models more efficient and other tools that help improve them (everything from agent harnesses to RLVR).
That's just looking at "what can small do today", when you look at what's possible with larger open models that are still much smaller than SotA from the major providers, their performance is extremely close to SotA, enough that for personal projects I'll just use Kimi instead of any anthropic offerings.
So it's not terribly hard to imagine a solution in the middle happening within a few years. We still have tons to learn about optimal sizes of these models and how to build them with maximal efficiency (and we've already seen a lot of recent improvements in this space).
What happens if you run last year's model in a SOTA harness? IME, the quality of the harness has a much more significant impact on the quality of the result once you get past the initial hump of “can it do anything at all”.
I think multiple SLMs driven by an orchestration framework (harness or otherwise) will ultimately displace LLMs. Right now we're in the era of diminishing returns with respect to LLM gains. Moving the needle by a few percentage points doesn't excite as many people anymore, and with "reasoning" capabilities there's no reason why small distributed models can't be run more efficiently, especially if/when we start to see gains in modularized context management solutions.
A smaller model with better context today can outperform a model with 100x more parameters with bad or diluted context.
2. MoE (already abundant) + MLA (mostly memory efficiency, not quality) + Medusa (speed, not quality) + GRAM (5000-10,000x better reasoning in an extremely small model) + 1.58b (unclear if it will have the impact Microsoft first claimed - but possibly 5x).
the last?!? I'm excited to see :) I'll take the other side of that since llms are so new
Honestly, there is nothing in my head that Claude cannot handle. Maybe it can be more this or that but I can already barely exploit Opus 4.7.
And I'm using DeepSeek 4 Pro for my personal use and while it's a little behind, it's not that far.
I think the situation can be very dangerous for US AI companies because if current models are already capable of doing mostly anything, nobody will want to get to the next model, even if it's 10x better. OTOH, open source models like DeepSeek are doing mostly the same work for 1/10 of the price.
Also the more I play with Pi, the more I think LLMs are already not kept back by their own capabilities but by the lack of agency we allow them to have. There is more value today in a capable harness for current LLMs than in a better LLM.
I think what gp said was the improvements are incremental, and we haven't seen a big revolutionary change in 2-3 years, and the pace is slowing down.
One idea is that maybe it could figure out how many L's are in the word "google" [1]
Or, maybe which days of the week have a "d" in their spelling [2].
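For reference, the ground truth for both probes is a couple of lines of ordinary code, which is part of why these failures get cited so often. A quick sketch, assuming case-insensitive matching:

```python
word = "google"
l_count = word.lower().count("l")

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
days_with_d = [day for day in DAYS if "d" in day.lower()]

print(f'"{word}" contains {l_count} "l"')
print(f'{len(days_with_d)} of {len(DAYS)} days contain a "d"')
```

Every English day name ends in "day", so the second question is a trick: the answer is all seven.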
There's a lot of room for improving the smaller models at many levels of the stack.
i think it'll be more like we get 1-10T models and then distill those down into smaller models, though
It seems like the best small models today are all distilled from bigger models
Moreover, I hypothesize Claude Opus 4.7 and now 4.8 are a distillation of Claude Mythos
- why'd a quantum computer help running an LLM?
- of course there'd be need for frontier companies - nobody else has the resources to train frontier models.
You can, but it's not as useful as you might think.
It needs to at least understand 1 human language to understand your intent to implement features.
If GRAM turns out to be a 5000x multiplier for local reasoning, you could theoretically train a 500M parameter model on just a programming language to understand stack traces to fix bugs and be incredibly powerful.
But most people also want it to understand human language to implement features as well.
Because then it can't just understand React and JavaScript - it needs to understand thousands of commonly used dependencies, the DOM, CSS, HTML, etc...
And for that you need A LOT more parameters than you might expect.
You can definitely get a ~3B active parameter model that can run comfortably on today's hardware to be VERY good at coding once all of the SOTA architectures are added to a single model - especially if we get better tool calling to give models better context per language.
You could get 100x performance if you feed the models ideal context... So a 3B model today can perform almost as good as ~300B model if you give it really good context vs flood it with mostly garbage it doesn't need across your repository.
Boomer comparison, but I remember the 8 bit computer era when the hardware was what it was so the later games of that era used hardware better than previous ones.
We have so many ways of optimizing:
- continuously creating more and better training data
- increasing parameters to 20/50/100TB
- We're still waiting for Mythos access
- We're still waiting for a Mythos distillation (I haven't heard any rumors that a distilled version of Mythos is out)
- Reinforcement learning and evolutionary algorithms have only started to appear
- If a small 30GB Model can do stuff, these models can also be used as teachers for the big ones
- We have not seen yet specialized models at all. Like a coding java german expert model. Why? Even with MoE architecture, you still need to have these layers around
- Research for Diffusion and other models is still in progress
- Nvidia just announced/showed a 7x speedup on inferencing for Nemotron
- Multitoken prediction became available just a few weeks ago
- Compute is only now getting into a range where they can run many more, cheaper experiments (see the Google IO 2026 announcement)
- World models are showing great progress and we do not know yet what they will bring to the table
- They are probably not finetuning/fixing all areas in parallel. I would argue that Anthropic focuses most of its efforts on coding and agentic work. Google for sure does subagent and agentic optimizations too. Plenty of areas are just not touched, I would say, because they don't have the capacity
- We see more and more multimodal models (these also consume compute)
- The N-Gram paper and co.: I have not seen all of these things in Chinese open models
- We don't even know yet what Meta is doing, but we do know they restarted their efforts again
- Anthropic's models got a lot better, benchmark-wise, at denying nonsense asks. They are learning how to get rid of or reduce hallucinations
- We are in the middle of the biggest reinforcement loop, with all the training data we give them day to day, and it's not clear at all if they already use this data in their training and at what stage.
- We do expect bigger models to be able to comprehend deeper concepts / broader code bases. Big companies with huge code bases probably are waiting for this
- There will also be continuous progress in harnesses, which on its own is not part of LLM progress (fair), but these harnesses do get better when you finetune a model to be optimized for a harness
- ChatGPT's image model 2.0 got noticeably better and came out just a month ago
I suspect, based on hardware requirements and progress on hardware infrastructure alone, that the industry wants to go to 100T models, and we do not know yet what this will mean. I could see that we might skip the normal transformer and find other relevant architectures.
Just a week ago there was a research paper about parallel input and output streams which has not been explored enough.
There was also a research paper where they showed that an LLM can compute things. It will take time to see where this leads.
I don't think the focus on GRAM and facts is so relevant. It's about context and context handling, not just some facts.
If you look at things like Mythic AI and the recent wurtzite ferroelectric nitrides breakthrough from the University of Michigan, huge performance and efficiency gains through new compute-in-memory approaches are around the corner.
And that will get us up to two orders of magnitude more parameters.
It's also plausible to me that before we get all the way to 100T we find some recipe of efficient state synchronization, goal sharing or something so that we are able to get higher collective IQ by combining fast distributed predictive subnetworks.
They said the words!!!!!
https://www.anthropic.com/glasswing
I've seen the tickets generated by the model that have trickled to my team. They are legitimate, but I can’t speak to model improvement because it's a pilot program.
Mythos is a bunch of likely overhyped claims at this point. A few experts who looked into the claimed results weren't that impressed.
We have these breathless conversations about the new AI frontier at the peril of losing sight of reality and our own human potential.
Are you sure that humans can?
Didn't a SOTA model recently prove a mathematical theorem, one that had escaped mathematicians for 80 years?
Maybe a human "novel" invention is just a good interpolation from the datapoints (knowledge) fed to the human.
And how is that anything other than synthesis? Do we pull concepts out of thin air?
I’d be surprised tbh. Investors don’t want to hear “everyone else is still training models and seeing improvements, but we don’t want to participate in the arms race anymore.” They want monumental leaps every quarter or two because they have sunk unholy amounts of money into these companies/products.
The whole idea of “hyper scale” doesn’t jibe with caution or otherwise slowing down.
The whole ecosystem will twist and evolve, and the big companies will be left begging for corporate subscriptions.
I finally caved when I realized I could build a PC, for myself, with dual video cards that I wanted, which can play games that I like and run models that I want, without worrying about giving my payment info to someone I don’t trust, or invoking token anxiety that I don’t want.
I am ready to bet against this. Scores on knowledge benchmarks like SimpleQA aren't increasing for small models.
> It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.
Well, for one, we know for certain there is Mythos, which is meaningfully better. And I think there is a lot of juice left to squeeze from Mythos-class models.
Do we?
Have you used it?
What is "meaningfully" better? It's not 3-4 orders of magnitude better. That is definitely happening for smaller models.
Meaningful in the sense that it could find security vulnerabilities in browsers and kernels that >99% of engineers couldn't find.
Model intelligence and knowledge aren't necessarily directly related. If we can pack greater intelligence and agency at the cost of it forgetting factoids, that would actually be a good thing. We don't need LLMs to memorize facts, we need them to learn how to interact with the world such that they can find the facts that are necessary and surface them to the user.
If we could distill all of the knowledge out of an LLM and just be left with a very agentic model that only knows facts in its context, I think some very interesting stuff would happen.
There isn't a clear definition of what is knowledge and what is intelligence. Is being able to write in C knowledge? Is knowing undefined behaviour in that knowledge?
6 is for sure happening...
As is Gemini 4.
It's less certain there will be a Gemini 5 or GPT 7 any time soon that is a true next "generation" and not just an iteration. They will almost certainly call something Gemini 5 and GPT 7...
First you say there won't be a new generation. Now you're saying there will be more. Oh well, I'll stop responding here
You clearly did not read my first comment or the second, or clearly disagree on what a generation is.
My conspiracy theory is that Apple recognizes this.
I don't think that's a conspiracy theory. AFAIK, it's their stated AI policy...
haven't verified, but attributed to Askell: "I just think that... there's this idea that you're always giving the models a personality and a persona, because they are talking like people and they are trained on human data. And I think my worry has been: if you train them to be excessively corrigible and to see that as their persona, in people I think this actually has a lot of negative broader traits. As in, if you met someone and it was just like, "oh yeah, they would literally do anything," a follower — you know, if a person just tells them something and they just fully defer, they don't bother thinking about it at all — I'm just a bit worried about how that might end up generalizing, especially if models are going to be playing a more active role in the world."
https://www.anthropic.com/research/persona-selection-model
https://www.anthropic.com/research/assistant-axis
https://www.anthropic.com/research/emergent-misalignment-rew...
https://www.anthropic.com/research/emotion-concepts-function
/model claude-opus-4-6
For this session and permanently (in shell):
export ANTHROPIC_MODEL=claude-opus-4-6
My 2¢, I personally feel like all of the productivity gains since 4.5's release (in November 2025!!) have come from improvements to the harnesses (cc, cursor cli, codex, opencode, whatever) AND from the context window expansion from 200k to 1M.
But the actual "raw" intelligence of the model / ability to make good decisions feels like it has plateaued since 4.5. 4.6 was maybe a small improvement, but hard to differentiate from in-context-learning with the 1M window. 4.7 if anything felt like a regression in wisdom for me and my coworkers, with it consistently making worse/lazier decisions.
Honestly... not that dramatically. Each release is much more marginal. And the quoted official benchmarks don't translate very well into the real world.
4.7 regressed hard in some ways. But a compounding factor too is that the claude code harness seems to nerf the model after a few months. Probably to reduce token use.
So far 4.8 seems less verbose but we'll see in practice what it translates into meaningfully.
There's a sweet spot of complexity for low importance tasks where it's just big enough I don't want to do it and just simple enough to have opus plan/delegate/review with another model. So possibly model improvements will grow this window, but currently I don't do much in there.
They mention more granular control of effort, 'dynamic workflows' and more speed controls ("fast mode"). While they position them as user features, they also sound like the kinds of knobs Anthropic will need to twiddle on the back-end to balance costs, margins, ARR, and user growth vs retention post-IPO to hit key metrics in quarterly reporting.
Data at https://gertlabs.com/rankings
I've actually intentionally switched back to 4.5. I hated 4.7 so much that I decided to jump back all the way to 4.5.
Now that I've been using 4.5 for a few weeks, I find it significantly more reliable but a bit more forgetful than 4.6/4.7. I'm okay with that because it's really easy to identify this forgetfulness and nudge it.
I found 4.7's adaptive thinking to be extremely unreliable. It seems to overcorrect on the current message without considering the difficulty of the overall problem. I wonder if 4.8 will improve on that.
This one change will probably solve 80% of the problems you have noticed.
I normally have only one session going at once though.
Still, the context window is sometimes too small for my usage.
It might be saturated for smaller scopes of work, but it’s not hard to see the cracks when you scale up what you ask of SOTA models/agents.
One example, to try and single shot prompt coding a ChatGPT equivalent chatbot.
Sure, it will spit something out, but the feature depth, UX subtleties, backend integration, and lots of pragmatic engineering decisions along the way will just not be baked in.
Another example is building a C compiler from scratch which Anthropic showed is still a struggle to do.
Not that these specific examples are important, but just to point out that scaling up expectations shows the cracks.
It’s not just a model problem of course, better agents, orchestration features (like Dynamic Workflows mentioned in the post), all need to continue to evolve.
At what point my CS degree becomes totally useless is an open question.
It also seems to be helpless at effort levels < xhigh, I turn to Sonnet when simpler tasks are needed.
I'm hoping they recreate the magic of 4.5, but at this point it's as much about the quality of the harness and the memory and efficiency of the tools as it is about the models.
In my experience, Opus 4.0 was fantastic, a major jump from 3.7. It was creative, super slow and expensive, and would sometimes forget what it was doing, but it got the job done.
With 4.1 they made it much faster, so a lot of infra improvements.
4.5 was when it could work on longer tasks and didn't make a lot of the obvious mistakes of 4.0. I think this was about the time Opus went mainstream and Anthropic's compute crisis began, so instead of making the model better they tried to optimize it to reduce cost.
4.6 was such a bad model, they switched to adaptive thinking and it had so many bugs. poor api design, benchmaxxed and poor real-world results. i switched back to 4.5.
4.7 they just fixed the bugs they added in 4.6. Better than 4.5.
haven't fully tested 4.8 yet.
A few days? A few weeks? Longer?
However, a company releases a new AI model and within hours users are confidently proclaiming how much smarter it is than previous versions.
btw where do they tell you how they trained the model?
Are the dividing lines around personality? Working domains? Opinionated software stuff?
Who knows?
https://platform.claude.com/docs/en/about-claude/pricing
```
Model | Base Input Tokens | 5m Cache Writes | 1h Cache Writes | Cache Hits & Refreshes | Output Tokens
Claude Opus 4.8 | $5 / MTok | $6.25 / MTok | $10 / MTok | $0.50 / MTok | $25 / MTok
Claude Opus 4.7 | $5 / MTok | $6.25 / MTok | $10 / MTok | $0.50 / MTok | $25 / MTok
Claude Opus 4.6 | $5 / MTok | $6.25 / MTok | $10 / MTok | $0.50 / MTok | $25 / MTok
Claude Opus 4.5 | $5 / MTok | $6.25 / MTok | $10 / MTok | $0.50 / MTok | $25 / MTok
Claude Opus 4.1 | $15 / MTok | $18.75 / MTok | $30 / MTok | $1.50 / MTok | $75 / MTok
Claude Opus 4 (deprecated) | $15 / MTok | $18.75 / MTok | $30 / MTok | $1.50 / MTok | $75 / MTok
Claude Sonnet 4.6 | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok
Claude Sonnet 4.5 | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok
Claude Sonnet 4 (deprecated) | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok
Claude Haiku 4.5 | $1 / MTok | $1.25 / MTok | $2 / MTok | $0.10 / MTok | $5 / MTok
Claude Haiku 3.5 (retired, except on Bedrock and Vertex AI) | $0.80 / MTok | $1 / MTok | $1.60 / MTok | $0.08 / MTok | $4 / MTok
```
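Every row in that table appears to follow the same pattern relative to the base input price: 1.25x for 5m cache writes, 2x for 1h cache writes, 0.1x for cache hits, and 5x for output. That is inferred from the numbers pasted above, not from any official statement. A quick sketch that checks it against a few of the rows:

```python
import math

# (base input, 5m cache write, 1h cache write, cache hit, output) in $ per MTok,
# copied from the pasted table.
ROWS = {
    "Claude Opus 4.8":   (5.00, 6.25, 10.00, 0.50, 25.00),
    "Claude Opus 4.1":   (15.00, 18.75, 30.00, 1.50, 75.00),
    "Claude Sonnet 4.6": (3.00, 3.75, 6.00, 0.30, 15.00),
    "Claude Haiku 4.5":  (1.00, 1.25, 2.00, 0.10, 5.00),
    "Claude Haiku 3.5":  (0.80, 1.00, 1.60, 0.08, 4.00),
}

# Multipliers inferred from the table (an assumption, not a documented rule).
MULTIPLIERS = (1.0, 1.25, 2.0, 0.1, 5.0)

def follows_pattern(prices):
    """True if every column equals base input price times its multiplier."""
    base = prices[0]
    return all(math.isclose(p, base * m) for p, m in zip(prices, MULTIPLIERS))

for model, prices in ROWS.items():
    print(model, follows_pattern(prices))
```

So the only real price change in the list is the base rate: Opus dropped from $15 to $5 per MTok at 4.5 and has stayed flat through 4.8.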
It didn't make a splash like a new open source release would have.
You won't, really.
You don't have to correct it dozens of times a day!? Really?
This felt particularly visible during the 4.6 era, when people said that 4.6 felt dumber, and I remember someone doing some analysis that sort of proved that models were getting dumber over time.
This has the benefit of costing the company less to run on a standard subscription while also making the next model "feel" better by comparison when it drops to the public.
Again, I am not sure whether this is the case; I'm merely proposing something that I feel is within the realm of possibility.
This is a refreshing attitude!
I've also verified that you can now turn off adaptive thinking in the web UI, which is great. I've had a lot of problems with thinking not triggering and the model producing sub-par output. Glad we can finally turn it off. (I hope being able to turn off adaptive thinking is new, if I could have turned it off at any time that would be embarrassing)
[1] https://code.claude.com/docs/en/model-config#adaptive-reason...
> Opus 4.7 and later always use adaptive reasoning. The fixed thinking budget mode and `CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING` do not apply to them.
More importantly for me, though, is how CC will respond to 4.6-"only" flags for thinking. For now, it doesn't seem to clobber my setup.
Well, I think the attitude is that costs are allowed to escalate faster and more steeply than the features delivered. From that perspective, semantic versioning is a handy tool for adjusting pricing strategies. IMHO, it (versioning) only makes sense for open-source projects, where you can clearly see the actual changes made with each version upgrade. Anything else is more than a little suspicious…
Same cost/token, more token usage.
4.8 is also 2x more expensive for a "modest" performance bump. How refreshing.
This is just cope.
Where are you seeing it's 2x more expensive? https://platform.claude.com/docs/en/about-claude/pricing
Probably more interesting than the 4.8 release.
This suggests that they're doing the same thing with Mythos now and the Mythos we get will be nerfed in that department?
Or more precisely, I think they'll have two versions of Mythos, and the scary one will probably continue to require a lot of paperwork.
Sonnet and Haiku look real outclassed for the price with current Chinese competition.
Hope this isn’t the case and that the normal average Joes of the world don’t get policed out of access.
But in general, what does the average Joe need Opus for that Sonnet or Haiku can't do for them? Better is better.
Unless it's so expensive that we can't realistically use it for anything, I wouldn't complain about getting at least that. I would also rather have the actual model, but that's a useful application of it (and I'm probably not going to afford using it for much more).
Although, mental safety gymnastics aside, getting the most intelligence to normal people at the lowest cost seems like the most ethical thing a big lab could do.
Going around and granting different tiers of intelligence to different insiders, friends, or companies is majorly problematic long-term.
Heck, right now no one even knows or believes that the “Opus 4.8” tokens you buy today will come from the same “Opus 4.8” just 3 days from now.
this one [0] notes one run cost $20k to run but another cost $50.
The fact that they haven't released it yet suggests a cost/margins issue to me more than anything else. Short term, I'll probably keep using Anthropic, but my long-term bet is that locally-served models win, if only because the quest for profitability will probably lead to intentionally-nerfed / enshittified frontier models.
At other vendors, ad placement within LLM responses is either coming or already here. Anthropic's handling of OpenClaw shows they're willing to engage in anti-competitive behavior, and the courts are not in a hurry to stop them. Why would I pay them $200 a month for such treatment when a $2K box does what I need locally?
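The subscription-vs-hardware argument comes down to a break-even point. Here is a rough sketch using the figures thrown around in this thread ($200/month, a $2K box, an $8000 MacBook). It deliberately ignores electricity, resale value, and any difference in model quality or speed, so treat it as a lower bound on the payback time.

```python
def breakeven_months(hardware_cost_usd, monthly_subscription_usd):
    """Months of subscription fees needed to equal the hardware purchase price."""
    return hardware_cost_usd / monthly_subscription_usd

print(breakeven_months(2_000, 200))   # 10.0
print(breakeven_months(8_000, 200))   # 40.0
```

The $2K box pays for itself in under a year on these numbers; the $8000 machine takes more than three years, which is long enough that the hardware will likely be outdated first.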
https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304...
The high one is notably better - the bicycle frame is the correct shape, unlike thinking level low.
For comparison, here's Opus 4.7: https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c...
https://www.gianlucagimini.it/portfolio-item/velocipedia/
Turns out even humans can be pretty bad at drawing bicycles :)
https://tools.simonwillison.net/markdown-svg-renderer#url=ht...
For max I used 25 input tokens and 17,167 output tokens, which cost me 43 cents! https://www.llm-prices.com/#it=25&ot=17167&ic=5&oc=25&sel=cl...
No, the handlebar is wrong. The handlebar is rotating the frame instead of rotating the front wheel. The handlebar should be mounted on the same line as the front wheel.
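As a quick sanity check on that figure, here is a minimal sketch of the arithmetic. It assumes the `ic=5` and `oc=25` values in the link above are prices in dollars per million input and output tokens; the function name is made up for illustration.

```python
def request_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost of one request, given prices in dollars per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Assumption: $5 per million input tokens, $25 per million output tokens,
# read off the ic/oc parameters in the llm-prices.com link.
cost = request_cost_usd(25, 17_167, 5.0, 25.0)
print(f"${cost:.4f}")  # $0.4293
```

Nearly all of the cost comes from the output tokens; the 25 input tokens contribute about a hundredth of a cent.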
Hopefully 4.9 will read my comments :)
UPDATE: My mistake, the API does support max. I added a max one at the bottom of this page (cost 43 cents): https://tools.simonwillison.net/markdown-svg-renderer#url=ht...
But not the best/not the worst is somewhat subjective, so not sure how well that would work.
...but that pelican's little helmet is adorable.
https://bsky.app/profile/senko.net/post/3mmwnrkwboc2v
The prompt was: Create a simple but functional real time strategy (RTS) game similar to old WarCraft, StarCraft or Command & Conquer games. The player should be able to build buildings, create units, gather resources and should uncover the whole map. No AI or multiplayer needed. Use simple but nice-looking graphics. No sound. Implement everything in HTML/CSS/JS, everything in a single file (you can use 3rd-party js or css libraries/frameworks via CDN).
OpenAI solves tasks with about 50% fewer output tokens.
https://artificialanalysis.ai/?intelligence=coding-index&int...
There's like 8 million benchmarks. Every release, every model randomly picks 5-10 where they win in everything except 1, to make it look like they aren't randomly cherry picking benchmarks they probably benchmaxxed for.
I built it for myself, to test which models to use via OpenRouter for my n8n agents. Currently actually still using gpt-5.3-codex for many things, as its pricing is really good in production (due to how their token caching works).
Gemini models still have the best intelligence (when asked any question, they're the most likely to get it right), but in production they still have many failure modes[1].
[0]: https://aibenchy.com
At least, until hopefully one day HN has a rule about accounts that derive 99.9999% of their engagement with the site from shilling a personal project.
There are many benchmarks, each for a specific use case, but on those the difference between models seems to come down to a point or so at the top (93% vs 92%).
I think that tracks, but still, it was refreshing to see a benchmark that actually helps me form better opinions.
Surprised about Mimo v2.5: in artificial-analysis and other benchmarks, the difference between Mimo and DeepSeek seems very slight, yet a lot of the focus (hype?) is on DeepSeek.
But Mimo seems like an interesting model, and they are offering some crazy discounts too.
DeepSeek is valuable to the research community because of how open they are, but it's absolutely crazy to think how Xiaomi basically pulled up out of nowhere with Mimo, given that they didn't have anything until quite recently.
Either way, an interesting benchmark, and a plus point for giving golang decent representation, equal to Python/TypeScript.
I think there is a set of tasks, resembling normal benchmarks, where open source models are absolutely fine, and a very small fraction of more technical tasks where the benchmark that you linked is the better indicator. So it depends on the scale of complexity, but it's good to see how models compete given enough complexity. Definitely fascinating.
I would be interested to see more models compete on this test. The current range is still a bit limited compared to other benchmarks, but OSS models like Kimi/Mimo seem to be only 3-4 months (6 at most) behind closed source.
Of the metrics they reported for 4.7, for 4.8 they excluded BrowseComp, CharXiv Reasoning, CyberGym, GPQA Diamond, MCP Atlas, MMMLU, SWE-bench Verified. The last 4 were almost always mentioned in previous Opus releases.
I doubt Anthropic internally sets as a goal to improve this or that benchmark - it's just a way to visualize progress. They probably have much more complex metrics internally.
I personally feel that Anthropic doesn't understand what this means for the frontier labs, and moreover that they might be the only frontier lab that doesn't.
1. Google dropped Gemini 3.5 Flash at IO, delaying the release of 3.5 Pro for a bit (they have said it's coming). They also released a refreshed Antigravity, and drew special attention to how cheaply they were able to build their toy operating system to play Doom (less than $1000 IIRC).
2. OpenAI has dumped everything into Codex, is offering double the token limits for the next few weeks IIRC, and is offering business discounts. Their head of Codex has tweeted that 5.5 is "extremely efficient", implying that they aren't actually losing money on any of this.
3. DeepSeek and other Chinese labs have dropped token pricing to the floor, in some situations by as much as 99%.
4. Anthropic releases the next generation of Opus, their most expensive public model, without changing its price. In the background, they hype up Mythos, an even more expensive model.
Anthropic has screwed up where they need to be making investments, and the cracks are starting to show. They've marginally underinvested in the Sonnet line of models for almost a year now, and they've critically underinvested in product. Anthropic made bets on the story of the second half of 2026 being: ultra-frontier, ultra-intelligence. In reality, what's shaping up is that the story will be: Companies rolling back AI spend, efficiency, "95% as good for 15% the price", sophisticated high quality harnesses, cheaper models. Anthropic isn't ready for this world.
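For a rough sense of what "95% as good for 15% the price" implies, here is a crude back-of-the-envelope sketch. It assumes quality can be treated as a single linear score relative to the frontier model, which is a big simplification; the function name is hypothetical.

```python
def quality_per_dollar(relative_quality, relative_price):
    """Quality delivered per unit of spend, relative to a baseline of 1.0."""
    return relative_quality / relative_price

# Assumption: the cheaper model scores 0.95 of the frontier model's quality
# at 0.15 of its price, as in the quoted slogan.
ratio = quality_per_dollar(0.95, 0.15)
print(f"{ratio:.2f}x")  # 6.33x
```

On that simplified view the cheaper model delivers a bit over six times the quality per dollar, which only matters if the last 5% is not the part you are paying for.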
This is a lack of imagination. If you use these models heavily enough, pretty soon you'll hit the edges of their capabilities. The smarter among us are collecting these problems into a personal benchmark and using that to judge model capability. I think this is the right approach, and dare I say, even better than generic benchmarks. To me, it matters less what the benchmark says, and more what my particular problems are.
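A personal benchmark like this doesn't need much machinery. Below is a minimal sketch of one possible shape: a list of your own problems, each with a pass/fail check, run against any callable that maps a prompt to a response. Everything here (`Case`, `run_benchmark`, `stub_model`, the two toy cases) is hypothetical; in practice you would swap `stub_model` for a real API call and the toy cases for problems the models have actually failed for you.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Case:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True if the response is acceptable


def run_benchmark(model: Callable[[str], str], cases: List[Case]) -> Dict[str, bool]:
    """Run every case through the model and record pass/fail per case."""
    return {case.name: bool(case.check(model(case.prompt))) for case in cases}


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of cases passed (0.0 for an empty benchmark)."""
    return sum(results.values()) / len(results) if results else 0.0


# Toy cases standing in for real problems collected from daily use.
CASES = [
    Case("sum", "What is 17 + 25? Answer with the number only.",
         lambda out: out.strip() == "42"),
    Case("reverse", "Reverse the string 'abc'. Answer with the string only.",
         lambda out: out.strip() == "cba"),
]


def stub_model(prompt: str) -> str:
    """Placeholder model: gets the arithmetic right, fails the reversal."""
    return "42" if "17 + 25" in prompt else "abc"


results = run_benchmark(stub_model, CASES)
print(results, pass_rate(results))
```

Keeping the checks as plain functions makes it easy to rerun the same set whenever a new model version ships and compare pass rates side by side.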
I keep trying to switch to something else but I keep coming back. (Typically after a few days of giving a new model an honest go, and finding myself constantly asking Sonnet to fix its output... Yes, even Sonnet wins on this front! They really do have some kind of special sauce.)
I'm not where most of their money comes from though, and I don't know how universal my experience is.
No idea why you’d say they have critically underinvested in product when Claude Code dominates and they’ve also released popular tools like Cowork and integrations for Microsoft products at an incredibly rapid pace.
Cost is becoming more of a factor, and no doubt they’ll work on that. There’s no reason to think they won’t be able to release cheaper models if they optimize for that rather than improving performance.
It feels like the only way to push the limits of newer models is with really long context questions that require reasoning. Any short request will naturally just be within the distribution of all the recent models so there isn't a performance difference there.
I think the near future is looking like a bunch of business-critical tasks that scale infinitely with better reasoning, all being done on whatever the most advanced model is at a high cost. Trading stocks, running a business, looking for tax dodges, writing high-performance code. These are all things where there's a tangible return on each jump in reasoning.
In our work we asked several frontier AIs to come up with an API we needed. We compared Opus 4.7 and GPT-5.5 (among others). Opus 4.7 came up with the most creative and intelligent API design that pleasantly surprised us, especially given that GPT-5.5 was passing it on various coding benchmarks.
What I noticed is that we don't have a common benchmark to measure "creativity" and "ingenuity", and in some ways such a benchmark would conflict with the common IFBench benchmark. Yet this is a very important skill when designing systems. I'm glad to see Anthropic putting thought into it, and would love to see a public benchmark for this that other models could compare themselves to.
[1] https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0...
So for now it's planning/architecture/strategy -> Opus. Pure coding -> GPT.
It helps with agentic coding that GPT is much roomier with the tokens you get.
I'm sure it will get fixed eventually/soon, just annoying to update and have your workflow break.