Claude Opus 4.8 - Hacker News

Posted by craigmart 3 hours ago

892 points | 689 comments

NiloCK 3 hours ago|

A rambling comment:

I think this is the first time we've had a third minor version bump on a frontier Anthropic model. (I count the 0.5s as major here, because they've been issued non-sequentially and also corresponded to massive capability leaps, eg, Sonnet 3.5, Opus 4.5).

So now the Opus 4.5 family has successors 4.6, 4.7, and 4.8, each posting fairly modest claimed gains. My own experience w/ 4.6 and 4.7 is that I don't firmly grasp any capabilities improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.

Maybe my own tastes are saturated now (it's smarter than me?) and I'll never again perceive model progress. Maybe the incrementalism is such that I'd notice immediately if my 4.7 workflows were redirected now to 4.5.

Difficult spot for the labs to be in because, if they have a stronger product, I'd prefer they release it and that I can use it.

But as this dynamic continues, the improvements are going to be less and less legible for end-users, who will complain about the churn-without-payoff, even when the payoff may actually be real.

onlyrealcuzzo 3 hours ago||

I won't be surprised if the next gen frontier models are the last.

There's orders of magnitude of low hanging juice to squeeze out of smaller models.

It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years (design not certain, probably unlikely).

It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.

As far as reasoning is concerned, with the recent GRAM release, there may be 4 orders of magnitude of reasoning to tack on to smaller models.

Think about that... Google, OpenAI, Anthropic could train a 30B GRAM-based model in days - and it could potentially have better local reasoning than the best model available today at >1T params... They could upgrade that to a ~600B MoE model in days to have general trivia knowledge rivaling the best models...

You just can't train a 1T+ parameter model that fast. It's a giant "if" how much GRAM turns out to improve things, but the effect is unlikely to be trivial or nothing.

Larger models can already sort of tell you anything. They're never going to get everything right unless they stop being LLMs.

There's just not a lot of juice left to squeeze for Gemini to tell you exactly how tall Ke$ha is or when the last time Britney Spears went to jail was...

vlovich123 2 hours ago|||

Took me a while to find what you were referring to by gram. Arxiv paper from 9 days ago that's not properly indexed by search engines.

(G)enerative (R)ecursive re(A)soning (M)odels. They really wanted the acronym.

https://arxiv.org/html/2605.19376v1

knollimar 2 hours ago|||

I prefer GRRM but then that would imply a habit of not actually getting a final result

areweai 2 hours ago||||

That acronym is unacceptable. It's going to impede discussion and cause confusion for a long time if it doesn't die off immediately.

sebzim4500 1 hour ago|||

You think that's bad? I introduce you to LION, (evoLved sIgn mOmeNtum) [1]

[1] https://arxiv.org/pdf/2302.06675

evan_ 2 hours ago||||

  "Analysis" was right there

froh 21 minutes ago||||

confusing indeed. I wondered "which RAM? nvram? dram? vram? dram? now what's g-ram?"

gchamonlive 2 hours ago|||

Yeah, look what happened to GNU

dyates 2 hours ago||||

And to think, we could have had George RR Martins instead.

trollbridge 2 hours ago||

Speaking of things that never finish.

867-5309 2 hours ago||

my wife assures me it's common..

mindcrime 1 hour ago||

is her name Jenny by chance?

867-5309 56 minutes ago||

what are the odds

jimbokun 1 hour ago||||

Just spell it GRRM but pronounce it “gram” if you have to reference it in spoken conversation.

Which will be pretty rare.

freehorse 53 minutes ago||

Grrm with a rolling r sounds better.

mrandish 21 minutes ago||||

> Google, OpenAI, Anthropic could train a 30B GRAM-based model in days - and it could potentially have better local reasoning than the best model available today at >1T param

I agree, but with their urgent IPO-driven need to keep increasing prices, the frontier vendors now have every incentive to maintain the perception that frontier performance requires endless >$200K racks of unobtanium GPUs and RAM. While they'd love to reduce their actual costs, they'd only want to do it to the extent they are certain they can keep it secret. Otherwise, they can't maintain and keep increasing their prices. And post-IPO audited reporting makes keeping that secret even harder.

Game theory-wise they probably don't want their armies of leading researchers optimizing frontier performance, at least in any way that would further accelerate the relative price/perf of smaller models or self/cloud-hosting. While they know the open source models will always improve, they still win as long as enough customers demand the latest frontier and the open source lag remains constant.

They profit most in a world where a few frontier labs stay far in front, drag-racing each other and expending vast capital. It keeps their customers reliant and paying top dollar while keeping low-cost alternatives farther back. They probably much prefer competing with a couple other frontier labs who have similar astronomical costs and biz models, than a world where self or cloud-hosted open-source models start closing the gap enough to start commoditizing their business.

iknowstuff 47 seconds ago||

Google seems pretty happy to release smaller, faster models. 3.5 Flash is pretty clutch isn't it?

supern0va 3 hours ago||||

>It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years.

I don't disagree, but how much of this ends up being distillation? I can't help but imagine that 4.8 was probably trained in part by leveraging Mythos.

If the very large models turn out to be very expensive to run relative to the benefits, it's possible that they could end up still being trained, but ultimately used as a tool to create smaller models that are nearly as effective.

I'm curious if someone here with a stronger background in the space has a similar intuition or not.

rao-v 1 hour ago|||

It’s really worth distinguishing between old-fashioned student teacher distillation (ie at the level of layers, weights and distributions) and large scale synthetic dataset creation.

The latter is much better (since you can clean up, review, update responses and filter your datasets).

I suspect nobody is doing real student teacher distillation, it’s just easier to do a bunch of training on the same giant corpus then post train on the synthetic corpus with its reasoning traces etc. (which might have been generated by a bigger better LLM)

spwa4 2 hours ago||||

> I don't disagree, but how much of this ends up being distillation?

A lot, so you can bet tens of millions are flowing to congress to have distillation declared illegal before this happens. And then it'll happen anyway.

lambda 2 hours ago||

Distillation isn't only between different labs.

A lab can train a large model, and then distill a smaller model from it that retains the majority of the useful capability.

I don't know well enough if there's any benefit of that over just training the smaller model directly, but I'll bet there are some times where that is useful. I could easily see it being easier to do the initial pre-training on a larger model but be able to distill everything useful down into a smaller model, essentially filtering out a lot of noise in the process.

spwa4 2 hours ago||

There used to be training methods like that but I think they've been phased out in favor of letting small models evolve by rewriting their own training material. Surprisingly that's actually cheaper.

onlyrealcuzzo 2 hours ago|||

> I don't disagree, but how much of this ends up being distillation?

You don't need distillation. They already have the training sets.

It's MLA + MoE + Medusa (a better version of Speculative Decoding) + 1.58b (possibly - maybe nothing) + GRAM (which will almost certainly not turn out to be a nothing burger, but no one has quickly turned this around yet to prove it).

semiquaver 1 hour ago|||

The frontier labs distill their own base models all day long. It’s not just something done by nefarious Chinese copycats. The knowledge embodied by the internal base models that we never see is much more powerful and useful than the much sparser raw training data

coldtea 1 hour ago|||

>It’s not just something done by nefarious Chinese copycats

And even that would be rich as an accusation from SOTA labs that depend on explicitly disregarding the intellectual property in millions of pieces of training data..

manmal 1 hour ago||||

But how? The training data is the unadulterated content those models are based on? I genuinely don’t understand, no snark.

supern0va 1 hour ago|||

I think you replied to the wrong parent.

Philpax 2 hours ago||||

It wouldn't be data distillation: instead, it would be teacher-student distillation. The teacher model has stronger representations that the student can mimic, which would give it more capability over training on the data itself.
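A minimal sketch of what "the student mimics the teacher" usually means in practice, assuming the classic soft-label setup where the student is trained to match the teacher's temperature-softened output distribution. The logits and the function names are made up for illustration, and the common T^2 scaling factor is left out for brevity:

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over a 3-token vocabulary.
teacher = [4.0, 1.0, 0.5]
student = [2.0, 2.0, 0.1]

print(distillation_loss(teacher, student))  # positive: student disagrees with teacher
print(distillation_loss(teacher, teacher))  # 0.0: a perfect mimic has zero loss
```

The point of the soft targets is that the student sees the teacher's full distribution over tokens, not just the single "correct" token from the raw data.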

minimaltom 2 hours ago|||

Frontier labs have their own variants of MLA and certainly their own balance/scaling-laws for things like MoE vs FC vs Attn. MoE scales really well for inference with horizontal scaling + batching, which these guys luv.

On the architectures side, I'm a lot more interested in attention residuals than anything else, one of those things that seems obvious in hindsight and Kimi have proven it at scale.

onlyrealcuzzo 2 hours ago||

> Frontier labs have their own variants of MLA

Yes, variants typically 2-3x less good...

Same with speculative decoding... They all do something, but there are known techniques that are substantially better - that just weren't known when they started development of the previous models.

amluto 1 hour ago||

How useful is speculative decoding in a batched setting where you get paid for throughput (aggregated across users) and you mostly don’t get paid for latency or single-session throughput?

onlyrealcuzzo 1 hour ago||

It's useful at the local level, where there will be SOTA models developed...

zozbot234 22 minutes ago||

Local models are moving towards batched inference too, if only for non-interactive use. An early experimental patchset for DS4 (running DeepSeek V4 Flash) seems to show 2x aggregate tok/s decode when processing 8 streams concurrently, and more than 3x when processing as many as 32 streams concurrently. Note that prefill (which is not helped significantly by this change) then becomes a larger fraction of total wall-clock time, so the overall gain is lower (i.e. prefill is akin to a 'serial' task wrt. Amdahl's law).

MTP will still be highly valuable for interactive use of course.

sometimelurker 2 hours ago||||

I looked into this "GRAM" stuff a sibling comment links further to, and just to say:

- this gets reinvented/rediscovered constantly under different names

- it can't be trained very well (right now, will change)

- massive theoretical improvements over current models (log_2(vocabsize)=17, residual stream dim is thousands of dimensions, recursivity means more information bandwidth by ~3 OoM)

- BUT it can't be interpreted or aligned <- this is why no one uses it and no one talks about it. the idea is 100% obvious to all the frontier labs and there is a good reason why it isn't used

I follow this stuff closely, I think I know what I'm talking about (edited for formatting)

onlyrealcuzzo 17 minutes ago|||

> - this gets reinvented/rediscovered constantly under different names

What are the different names? I haven't seen this before.

> - it can't be trained very well (right now, will change)

If you're sure it will change, then why are you certain that it hasn't yet, and if it's proven a 5000x boost in reasoning... why aren't they exploring this path more aggressively?

> the idea is 100% obvious to all the frontier labs and there is a good reason why it isn't used

Surely someone is willing to take a 5000x boost in reasoning on a small research model... None of them have even tried anything resembling this AFAIK. It does not seem like something 100% obvious to them.

l674 2 hours ago|||

Could you explain how/why GRAM cannot be interpreted or aligned how current LLMs are? Not very familiar how it works

sometimelurker 38 minutes ago|||

sibling comment got to the main points before me, but to add on to kmavm's reply: the attack surface for gradient descent to get the system to exchange "bad" information is much higher in latent reasoning models (like GRAM). You get ~3 OoM more bits (~17 bits per token in a standard CoT vs the whole residual stream of the model @ f16 = a few kB) per forward pass of the system coming back to itself, and even if you could sift through all that for signs of misalignment, you just can't put a blockade on all of the bad things that leak through.
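A back-of-the-envelope version of that bandwidth comparison, as a sketch only. The vocabulary size follows from the log_2(vocabsize)=17 figure in the thread; the residual stream width of 4096 is an assumed, illustrative value, and the result is raw capacity, not usable information:

```python
import math

vocab_size = 2 ** 17      # so that log2(vocab_size) = 17, as quoted in the thread
d_model = 4096            # hypothetical residual stream width
bits_per_value = 16       # f16

token_bits = math.log2(vocab_size)        # bits carried by one sampled token
latent_bits = d_model * bits_per_value    # raw bits in one residual stream vector
ratio = latent_bits / token_bits

print(token_bits)                 # 17.0
print(latent_bits / 8 / 1024)     # 8.0 (KiB per step)
print(round(math.log10(ratio), 1))  # roughly 3.6 orders of magnitude
```

With a wider or narrower residual stream the ratio shifts, but it stays in the "thousands of times more bits per step" range for typical model widths.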

ACCount37 26 minutes ago||

Most alignment methods nowadays don't rely on interpretability. And neither do all LLM vendors care about alignment much - especially not in China.

Those things being untrainable at scale is why they aren't around. Alignment is an afterthought.

kmavm 1 hour ago|||

Crudely? Because you can't grep a sequence of latent states for variants of "If I kill all the puny humans, I can <achieve my current goal>."

onlyrealcuzzo 13 minutes ago|||

Why do you need to grep latent space?

As long as it's giving the right outputs, who cares what's in latent space?

If the model thinks in latent space: "God I wish these people would die," and constantly does the right thing, who cares?

Additionally, if one of its latent spaces that it never explores is a psychopath -> who cares? The path never gets taken...

That's a lot of harmless people walking around with crazy thoughts...

jruz 2 hours ago||||

Absolutely, that’s why they’re rushing to IPO now, to squeeze the last drop out of the bubble: they know this is a dead end.

swader999 50 minutes ago|||

I think we could run for at least a decade further with no model changes/improvements, just better harnesses and infra around this agentic way of developing.

onlyrealcuzzo 2 hours ago||||

It's unclear it's a dead-end within 5 years.

There are still several orders of magnitude of improvement that are almost certainly left - it's just not clear how much is left on the frontier end.

Most people will be very glad to pay Anthropic, OpenAI, Google etc $200 a month to get things done 20x faster than they could IF they had a $8000 MacBook and could theoretically do it locally.

Some people would pay $200 a month forever not to have to open the terminal one time...

bonzini 2 hours ago|||

"Doing things X times faster" at some point hits Amdahl law. If just context switching takes 5 minutes, speeding up a 1 hour task by 10x provides 5x improvement.

Furthermore, if looking at the results takes 10 minutes, that same 1 hour task only sees a 3x improvement. And so on.
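The arithmetic above, as a quick sketch. It assumes the speedup is measured against the original 60-minute task and that context switching and review are fixed costs the model does not speed up; the function name is made up for illustration:

```python
def effective_speedup(task_min, model_speedup, fixed_overhead_min):
    """Speedup relative to the original task time when a fixed overhead is not accelerated."""
    new_total = task_min / model_speedup + fixed_overhead_min
    return task_min / new_total

# 1 hour task, 10x faster model, 5 minutes of context switching.
print(round(effective_speedup(60, 10, 5), 2))    # 5.45

# Same task, plus 10 minutes of looking at the results.
print(round(effective_speedup(60, 10, 15), 2))   # 2.86

# Even an absurdly fast model cannot beat 60 / 5 = 12x with 5 minutes of overhead.
print(effective_speedup(60, 1e9, 5) < 12)        # True
```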

csomar 1 hour ago||||

> Most people will be very glad to pay Anthropic, OpenAI, Google etc $200 a month to get things done 20x faster than they could IF they had a $8000 MacBook and could theoretically do it locally.

No, most people will not pay $200 for an LLM subscription. Some software developers do. Also, at $200/month, you are much better off getting the MacBook, assuming token output speed is the same or at least reasonable.

LLMs are not productive enough for your average person right now to justify dropping $200 on them. They'll need to be more capable and integrated and even so...

eiej 2 hours ago|||

That’s not how firms do the financial analysis, which is where most of the revenues are coming from…

lukan 2 hours ago|||

On the other hand, I think I have been hearing that for a while, even before Opus.

energy123 2 hours ago||

While revenues grow almost exponentially. Reminds me of the confident predictions in the early days of Covid that it was nothing while the data showed exponential growth.

haldujai 18 minutes ago||

I’m also reminded of the early COVID days when exponential growth was leading to predictions of the collapse of modern civilization and a billion dead, now it’s just another endemic respiratory virus.

hellohello2 2 hours ago||||

"It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years"

What insight do you have to make this claim?

roadside_picnic 2 hours ago|||

Have you personally used any of the latest batch of even smaller local models? They certainly don't beat SotA models at coding... but with a good harness they are able to achieve things that I couldn't achieve with SotA last year.

I've repeatedly given local models non-trivial projects that involve research and coding which they've successfully completed with minimal intervention from me (almost exclusively in the domain of reviewing the results). Again, nothing comparable with current SotA, but definitely tasks I could not have given SotA models last year (without agent harness).

Now that pure progress from these models seems to have slowed down, we're seeing a ton of options for both making models more efficient and other tools that help improve them (everything from agent harnesses to RLVR).

That's just looking at "what can small do today", when you look at what's possible with larger open models that are still much smaller than SotA from the major providers, their performance is extremely close to SotA, enough that for personal projects I'll just use Kimi instead of any anthropic offerings.

So it's not terribly hard to imagine a solution in the middle happening within a few years. We still have tons to learn about optimal sizes of these models and how to build them with maximal efficiency (and we've already seen a lot of recent improvements in this space).

maccard 2 hours ago|||

> but with a good harness they are able to achieve things that I couldn't achieve with SotA last year.

What happens if you run last years model in a SOTA harness? IME, the quality of the harness has a much more significant impact on the quality of the result, once you get past the initial hump of “can it do anything at all”

windexh8er 1 hour ago|||

I think this is a big component, but also context. A large factor in any model being able to handle complexity comes down to context length.

I think multiple SLMs driven by an orchestration framework (harness or otherwise) will ultimately displace LLMs. Right now we're in the era of diminishing returns with respect to LLM gains. Moving the needle by a few percentage points doesn't excite as many people anymore and with "reasoning" capabilities there's no reason why small distributed models can't be run more efficiently, especially if/when we start to see gains in modularized context management solutions.

mswphd 21 minutes ago|||

sure, but high-quality harnesses require less gpu compute/VRAM, and plausibly can be used locally by most users.

sixothree 1 hour ago|||

Can you spare a sentence or two describing your local setup?

theplatman 52 minutes ago||

biggest thing i wish was present in more discussions about models is people providing more specifics on their setups vs. vague descriptions of harnesses

onlyrealcuzzo 2 hours ago||||

1. Context is all you need... They are heavily investing in getting better context (especially for coding tasks). This will disproportionately advantage smaller models (and benefit everyone).

A smaller model with better context today can outperform a model with 100x more parameters with bad or diluted context.

2. MoE (already abundant) + MLA (mostly memory efficiency, not quality) + Medusa (speed, not quality) + GRAM (5000-10,000x better reasoning in an extremely small model) + 1.58b (unclear if it will have the impact Microsoft first claimed - but possibly 5x).

knollimar 1 hour ago|||

Probably just "gemma was cool"

slashdave 2 hours ago||||

I think you are assuming training from scratch, which I doubt is happening here. Fine-tuning and RL, especially based on synthetic feedback (coding skill, in particular) can be ongoing and is where these models obtain truly useful abilities.

mucle6 3 hours ago||||

> I won't be surprised if the next gen frontier models are the last.

the last?!? I'm excited to see :) I'll take the other side of that since llms are so new

pjerem 2 hours ago||

What gp wanted to say is that models are now so smart and useful that even if they managed to be EVEN MORE smart and useful, you wouldn't even notice it.

Honestly, there is nothing in my head that Claude cannot handle. Maybe it can be more this or that but I can already barely exploit Opus 4.7.

And I'm using DeepSeek 4 Pro for my personal use and while it's a little behind, it's not that far.

I think the situation can be very dangerous for US AI companies because if current models are already capable of doing mostly anything, nobody will want to get to the next model, even if it's 10x better. OTOH, open source models like DeepSeek are doing mostly the same work for 1/10 of the price.

Also the more I play with Pi, the more I think LLMs are already not kept back by their own capabilities but by the lack of agency we allow them to have. There is more value today in a capable harness for current LLMs than in a better LLM.

suttontom 1 hour ago|||

Are you joking? Is there literally "nothing" you can imagine that Claude can't do?

coldtea 1 hour ago||||

>What gp wanted to say is that models are now so smart and useful that even if they managed to be EVEN MORE smart and useful, you wouldn't even notice it.

I think what gp said was the improvements are incremental, and we haven't seen a big revolutionary change in 2-3 years, and the pace is slowing down.

claytongulick 1 hour ago|||

> Honestly, there is nothing in my head that Claude cannot handle.

One idea is that maybe it could figure out how many L's are in the word "google" [1]

Or, maybe which days of the week have a "d" in their spelling [2].

[1] https://x.com/FatherPhi/status/2059659658428912040?s=20

[2] https://x.com/FatherPhi/status/2054212816069132461?s=20

mickdarling 57 minutes ago||||

I effectively distill the frontier models by building whole sets of skills, personas, and other artifacts that I can then run on smaller models and get 10% even 20% improvements on models like haiku or local models.

There's a lot of room for improving the smaller models at many levels of the stack.

merlindru 3 hours ago||||

surely training also gets cheaper so justifying it becomes easier?

i think it'll be more like we get 1-10T models and then distill those down into smaller models, though

It seems like the best small models today are all distilled from bigger models

Moreover, I hypothesize Claude Opus 4.7 and now 4.8 are a distillation of Claude Mythos

ishurand4 1 hour ago||||

And anyway, with quantum, there will be no need for frontier companies as you might be able to even run a 1T param model on a consumer quantum computer.

stratos123 24 minutes ago|||

I'm assuming this is a joke, but:

- why'd a quantum computer help running an LLM?

- of course there'd be need for frontier companies - nobody else has the resources to train frontier models.

root_axis 31 minutes ago|||

Even if quantum computing had any clear implications for LLMs (it doesn't), there is no such thing as a "consumer quantum computer" and there won't be in our lifetimes.

dbbk 32 minutes ago||||

I'm frankly surprised the focus is still on these enormous "know everything in the world" models. I would think you could create an incredibly lean and smart "just React and React Native" model.

onlyrealcuzzo 29 minutes ago||

> I would think you could create an incredibly lean and smart "just React and React Native" model.

You can, but it's not as useful as you might think.

It needs to at least understand 1 human language to understand your intent to implement features.

If GRAM turns out to be a 5000x multiplier for local reasoning, you could theoretically train a 500M parameter model on just a programming language to understand stack traces to fix bugs and be incredibly powerful.

But most people also want it to understand human language to implement features as well.

Because then it can't just understand React and JavaScript - it needs to understand thousands of commonly used dependencies, the DOM, CSS, HTML, etc...

And for that you need A LOT more parameters than you might expect.

You can definitely get a ~3B active parameter model that can run comfortably on today's hardware to be VERY good at coding once all of the SOTA architectures are added to a single model - especially if we get better tool calling to give models better context per language.

You could get 100x performance if you feed the models ideal context... So a 3B model today can perform almost as well as a ~300B model if you give it really good context vs flood it with mostly garbage it doesn't need across your repository.

yomismoaqui 2 hours ago||||

Let's hope that hitting a scaling wall and less money to spend will begin redirecting efforts to optimize inference and get the same results with less compute.

Boomer comparison, but I remember the 8 bit computer era when the hardware was what it was so the later games of that era used hardware better than previous ones.

Gomotono 1 hour ago||||

I don't think this is true at all. It might feel like this because we are used to a very, very fast release cycle, but we've only been at this for a few years.

We have so many ways of optimizing:

- continuously creating more and better training data

- increasing parameters to 20/50/100T

- We are still waiting for Mythos access

- We are still waiting for a Mythos distillation (I haven't heard any rumors that there is a distilled version of Mythos out)

- Reinforcement learning and evolutionary algorithms have only started to appear

- If a small 30GB model can do stuff, these models can also be used as teachers for the big ones

- We have not yet seen specialized models at all, like a German-language Java coding expert model. Why? Even with MoE architecture, you still need to have these layers around

- Research for Diffusion and other models is still in progress

- Nvidia just announced/showed a 7x speedup on inferencing for Nemotron

- Multitoken prediction became available just a few weeks ago

- Compute is only now getting into a range where they can run a lot more, and cheaper, experiments (see Google IO 2026 announcement)

- World models are showing great progress and we do not know yet what they will bring to the table

- They are probably not finetuning/fixing all areas in parallel. I would argue that Anthropic focuses most of its efforts on coding and agentic use. Google for sure does subagent and agentic optimizations too. Plenty of areas are just not touched, I would say, because they don't have the capacity

- We see more and more multimodal models (these also consume compute)

- The N-Gram paper and co.: I have not seen all of these things in Chinese open models

- We don't even know yet what Meta is doing, but we do know they restarted their efforts again

- Anthropic's models got a lot better, benchmark-wise, at denying nonsense asks. They are learning how to get rid of or reduce hallucinations

- We are in the middle of the biggest reinforcement loop, with all the training data we give them day to day, and it's not clear at all whether they already use these models in their training and at what stage.

- We do expect bigger models to be able to comprehend deeper concepts / broader code bases. Big companies with huge code bases probably are waiting for this

- There will also be continuous progress in harnesses, which on its own is not part of LLM progress (fair), but these harnesses do get better when you finetune a model to be optimized for a harness

- ChatGPT's Image model 2.0 got significantly better and came out just a month ago

I suspect, based on hardware requirements and progress on hardware infrastructure alone, that the industry wants to go to 100T models, and we do not know yet what this will mean. I could see that we might skip the normal transformer and find other relevant architectures.

Just a week ago there was a research paper about parallel input and output streams which has not been explored enough.

There was also a research paper where they showed that an LLM can compute things. It will take time to see where this leads.

I don't think the focus on GRAM and facts is so relevant. It's about context and context handling, not just some facts.

ilaksh 1 hour ago||

Great points! We do keep seeing gains from larger model sizes. I think that is still one of the factors contributing to jagged intelligence. When they increase up to around 100T parameters, that will truly be human complexity level, and I assume there will be no trace of jaggedness left.

If you look at things like Mythic AI and the recent wurtzite ferroelectric nitrides breakthrough from the University of Michigan, huge performance and efficiency gains through new compute-in-memory approaches are around the corner.

And that will get us up to two orders of magnitude more parameters.

It's also plausible to me that before we get all the way to 100T we find some recipe of efficient state synchronization, goal sharing or something so that we are able to get higher collective IQ by combining fast distributed predictive subnetworks.

fnord77 58 minutes ago||||

So, then I guess the big three are never going to make their money back.

lichenwarp 1 hour ago||||

O R D E R s O f m a g N I T U D E

They said the words!!!!!

firebirdn99 2 hours ago||||

you just need to look at Mythos to see the jump in performance from a 10T(?) model. As they scale, they get more capable. We might have a yearly release, but I believe the releases will continue as long as scaling laws are intact and there are huge problems that still need solving (think cancer).

phainopepla2 2 hours ago|||

And how are we meant to look at Mythos? Do you have access?

bigfishrunning 2 hours ago|||

no but they tell me it's TERRIFYING and DANGEROUS and we should INVEST MORE MONEY

dwpdwpdwpdwpdwp 2 hours ago||||

Through association with a large company:

https://www.anthropic.com/glasswing

I've seen the tickets generated by the model that have trickled to my team. They are legitimate, but I can’t speak to model improvement because it's a pilot program.

OtomotO 2 hours ago|||

Through the lens of Anthropic's marketing department, of course

coldtea 1 hour ago||||

>you just need to look at Mythos to see the jump in performance from a 10T(?) model

Mythos is a bunch of likely overhyped claims at this point. A few experts who looked into the claimed results weren't that impressed.

aj_hackman 2 hours ago|||

You forget that these models are still only interpolating between human-generated datapoints fed to them. They cannot reason beyond the data they've been given, so unless everything you want to create with AI is a synthesis of prior art, you're back to relying on the stone-age human brain that created AI in the first place.

mofeien 1 hour ago|||

Not all training data is human generated, and it's also not clear that being ridiculously good at interpolating between data points (whatever that means) will not lead to superhuman capabilities.

aj_hackman 1 hour ago||

I could make a robotic picture coloring machine with truly superhuman capabilities - picking only the most beautiful color combinations and staying 100% in the lines while finishing entire murals in < 1 second. However, if you need a completely new and original image rendered, the machine is of only partial utility for you. It is very well possible that your cure for cancer (if that's even feasible) or whatever else you desire is a completely new picture.

We have these breathless conversations about the new AI frontier at the peril of losing sight of reality and our own human potential.

coldtea 1 hour ago||||

>these models are still only interpolating between human-generated datapoints fed to them. They cannot reason beyond the data they've been given

Are you sure that humans can?

Didn't a SOTA model recently solve a mathematical problem, one that had escaped mathematicians for 80 years?

Maybe a human "novel" invention is just good interpolation from the datapoints (knowledge) fed to the human.

stratos123 45 minutes ago||||

Your phrasing ("you forget") implies this is a fact and common knowledge, while in fact there's little reason to think that's true.

suttontom 1 hour ago|||

Do you know if anyone has trained, say, a pre-2017 model and tried to get it to come up with Attention Is All You Need? If it did, would you say that was only because it's a synthesis of prior art? If so, what isn't?

aj_hackman 1 hour ago||

Allow me to restate my point: human beings and AI both create via synthesis, but we are the only ones capable of what we could categorize as true original thought or creativity. It could be argued that nothing we do as humans is truly original or creative either, but I would counter that with the claim that an LLM could not have created any element of the society and culture that gave birth to LLMs. Maybe in six more months.

coldtea 1 hour ago||

>human beings and AI both create via synthesis, but we are the only ones capable of what we could categorize as true original thought or creativity.

And how is that anything other than synthesis? Do we pull concepts out of thin air?

Forgeties79 2 hours ago||||

> I won't be surprised if the next gen frontier models are the last.

I’d be surprised tbh. Investors don’t want to hear “everyone else is still training models and seeing improvements, but we don’t want to participate in the arms race anymore.” They want monumental leaps every quarter or two because they have sunk unholy amounts of money into these companies/products.

The whole idea of “hyper scale” doesn’t jibe with caution or otherwise slowing down.

irishcoffee 1 hour ago||

The way this will play out, most likely, is that smaller models will continue to get released, and anyone willing to drop 1-3k on a home upgrade/new LLM box (no, that isn’t cheap, but it also isn’t outrageously expensive), along with improved open source agents or whatever (lot of meat on that bone), will sneak up behind the big players and start making dents. Smaller companies will pop up providing 50 users unlimited whatever for a lower cost than the big companies.

The whole ecosystem will twist and evolve, and the big companies will be left begging for corporate subscriptions.

I finally caved when I realized I could build a PC, for myself, with dual video cards that I wanted, which can play games that I like and run models that I want, without worrying about giving my payment info to someone I don’t trust, or invoking token anxiety that I don’t want.

YetAnotherNick 2 hours ago||||

> It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years.

I am ready to bet against this. Knowledge benchmark like SimpleQA isn't increasing for small models.

> It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.

Well for one, we know for certain there is Mythos which is meaningfully better. And I think there is a lot of juice left to squeeze for Mythos class model.

onlyrealcuzzo 2 hours ago|||

> Well for one, we know for certain there is Mythos which is meaningfully better.

Do we?

Have you used it?

What is "meaningfully" better? It's not 3-4 orders of magnitude better. That is definitely happening for smaller models.

YetAnotherNick 14 minutes ago||

What do you mean by 3-4 orders of magnitude better? Was Einstein 3-4 order of magnitude better than us?

Meaningful in the sense it could find security vulnerabilities in browser and kernel that >99% of the engineers couldn't find.

ertgbnm 2 hours ago|||

Knowledge benchmarks can't really be improved upon via distillation or RL. It requires those facts be added to the training corpus and for the model to memorize them better. Neither distillation nor RL really does that, and thus we shouldn't expect improvements on SimpleQA unless some other interventions are being made.

Model intelligence and knowledge aren't necessarily directly related. If we can pack greater intelligence and agency at the cost of it forgetting factoids, that would actually be a good thing. We don't need LLMs to memorize facts, we need them to learn how to interact with the world such that they can find the facts that are necessary and surface them to the user.

If we could distill all of the knowledge out of an LLM and just be left with a very agentic model that only knows facts in its context, I think some very interesting stuff would happen.

YetAnotherNick 7 minutes ago|||

Lot of the things aren't facts that could be stated. No one can just see the dictionary or translation of words and start talking in that language.

There isn't a clear definition of what is knowledge and what is intelligence. Is being able to write in C knowledge? Is knowing undefined behaviour in that knowledge?

slashdave 2 hours ago|||

RL is more than facts. Synthetic feedback is an obvious approach. Does the model suggest code that compiles and performs well?

guluarte 2 hours ago||||

I think the future will be enterprise clients will train their own models based on their needs and data.

abalashov 10 minutes ago|||

Versus just packing all their needs and data into context, and RAG (i.e. context)?

jimbokun 42 minutes ago|||

Why isn’t this happening more already?

z3t4 13 minutes ago||

It takes way more resources to train the model than to use it.

wahnfrieden 2 hours ago||||

I would be shocked if 5.5 is the last new pre-train from OpenAI. Your comment is nonsense.

onlyrealcuzzo 1 hour ago||

5.5 is not a generation it is a trivial iteration...

6 is for sure happening...

As is Gemini 4.

It's less certain there will be a Gemini 5 or GPT 7 any time soon that is a true next "generation" and not just an iteration. They will almost certainly call something Gemini 5 and GPT 7...

wahnfrieden 1 hour ago||

5.5 is in fact a new pre-train model

First you say there won't be a new generation. Now you're saying there will be more. Oh well, I'll stop responding here

onlyrealcuzzo 23 minutes ago||

> I won't be surprised if the next gen frontier models are the last.

You clearly did not read my first comment or the second, or clearly disagree on what a generation is.

michaelchisari 2 hours ago|||

| a 60-90B model can outperform current SOTA

My conspiracy theory is that Apple recognizes this.

dweekly 2 hours ago|||

That does seem to be the path Apple is following here. Have a local model that can answer most things and then have a fallback of cloud options when the request is too complex. The cleverness of this strategy has been overshadowed by the incredibly poor quality of their local models. It will be extremely interesting to see what next month holds and whether Google helped fine tune an Apple specific Gemini / Gemma model for their devices. Bonus points, of course, if they unveil the M5 Ultra Studio with half a terabyte of RAM to be a local "cloud model" (the true fantasy here of course would be Apple building something a little like openclaw where from your phone you could give commands to your Home Apple server). They could probably get away with charging $20k for it if it has sufficient tok/sec. If that happens and is successful one could imagine a straight line path in the next two generations to bringing the cost and form factor down to the point where something with the form factor of an Apple TV becomes everybody's home inference server / agentic HQ. Sovereign AI for everyone!

holoduke 1 hour ago||||

You need some serious memory then. Let's say around 192gb for having not all your memory eaten by your LLM.
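For a rough sense of where numbers like that come from, here is a back-of-the-envelope sketch. It counts weights only, so KV cache, activations, and whatever else is running on the machine need headroom on top. The function name and the bit widths are illustrative choices, not anything a vendor has published.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # The 60B and 90B sizes come from the thread; the precisions are common
    # choices used here purely for illustration.
    for params in (60, 90):
        for bits in (16, 8, 4):
            gb = weight_memory_gb(params, bits)
            print(f"{params}B @ {bits}-bit: ~{gb:.0f} GB of weights")
```

By this estimate a 90B model at 16-bit needs about 180 GB for weights alone, which is roughly where a 192 GB machine stops having room for anything else; quantizing to 4-bit brings the same model down to about 45 GB.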

onlyrealcuzzo 2 hours ago|||

> My conspiracy theory is that Apple recognizes this.

I don't think that's a conspiracy theory. AFAIK, it's their stated AI policy...

michaelchisari 1 hour ago||

Interesting. Where have they stated that?

selectodude 1 hour ago||

https://machinelearning.apple.com/research/introducing-apple...

gAI 3 hours ago|||

4.7 was the first time I had to resort to using the previous version (4.6) for most use cases. Hoping 4.8 rectifies this.

ishurand4 1 hour ago|||

They just showed the benchmarks it improved on, but it regressed on so much more, such as the MRCR benchmark: "On multi-round coreference/context recall tests (often cited as MRCR or long-text retrieval benchmarks), Opus 4.7 reportedly dropped from roughly 78.3% down to 32.2% compared to Opus 4.6."

merlindru 3 hours ago||||

Same. 4.7 felt like a definite regression

supern0va 3 hours ago||

Interestingly enough, 4.7 actually did regress on a few benchmarks from 4.6, so it's more than just vibes.

gAI 3 hours ago|||

It seems like a lot of things fed into that. Anthropic couldn't keep up with the compute costs when they got a huge influx of users. (So) effort level defaults got turned down. (Looks like we have direct effort control in the web interface now - thrilled about that!) Adaptive Thinking, while usually cheaper for them, seems less robust than Extended Thinking. And this part is just vibes, but the alignment on 4.7 feels too stiff. I understand wanting the model to push back more, but it seems like 4.7 will push back reflexively in situations where it's just odd.

bombcar 3 hours ago||

Claude got very mad at me and burned more tokens than exist to complain about me asking about a "yellow background cell" in an excel spreadsheet.

forshaper 2 hours ago||

Too much personality, if you ask me. My biggest use case of an LLM is tool, not therapy, but therapy and opinions have been sneaking into workhorse tasks.

haven't verified, but attributed to Askell: "I just think that... there's this idea that you're always giving the models a personality and a persona, because they are talking like people and they are trained on human data. And I think my worry has been: if you train them to be excessively corrigible and to see that as their persona, in people I think this actually has a lot of negative broader traits. As in, if you met someone and it was just like, "oh yeah, they would literally do anything," a follower — you know, if a person just tells them something and they just fully defer, they don't bother thinking about it at all — I'm just a bit worried about how that might end up generalizing, especially if models are going to be playing a more active role in the world."

gAI 2 hours ago||

Anthropic’s research makes the case that role-playing is inherent to how the models work. Communication implies a sender. Language implies a writer, and the models learn these roles implicitly during training. RLHF is meant to strengthen the attractor to the Assistant persona.

https://www.anthropic.com/research/persona-selection-model

https://www.anthropic.com/research/assistant-axis

https://www.anthropic.com/research/emergent-misalignment-rew...

https://www.anthropic.com/research/emotion-concepts-function

hashmap 27 minutes ago||

The RLHF very much does do that. My take is that RLHF as a mechanism ought to be avoided altogether, and even the selection of the assistant attractor basin is suspect. If I am exploring a problem space I don't want to hire Igor to explore it with me, it's more helpful to have a colleague role who will sort of jump out and say "nah thats dumb what if we throw out that whole thing and do this completely different angle instead".

ACCount37 3 hours ago|||

4.7 is a different base model from 4.6, so it's possible that they introduced regressions with pre-training changes, or undercooked the post-training stage.

petterroea 2 hours ago||||

Same. 4.7 has done some incredibly stupid things.

dbbk 30 minutes ago||

I think this is more a consequence of the introduction of adaptive thinking and removal of extended thinking than of 4.7 specifically

rhubarbtree 3 hours ago||||

Same. So happy when I found that option.

gAI 2 hours ago||

Unfortunately, looks like 4.6 is now gone from the web ui.

lukan 2 hours ago||

Was bothered by that too, but did a magic trick and asked claude how to change that and .. there is

/model claude-opus-4-6

For this session and permanently (in shell):

export ANTHROPIC_MODEL=claude-opus-4-6

dezsirazvan 20 minutes ago|||

same!

gen220 3 hours ago|||

I'm curious to poll HN on this issue. Do you feel like we've had meaningful/noticeable gains in terms of your programming workflows between 4.5 and 4.7?

My 2¢, I personally feel like all of the productivity gains since 4.5's release (in November 2025!!) have come from improvements to the harnesses (cc, cursor cli, codex, opencode, whatever) AND from the context window expansion from 200k to 1M.

But the actual "raw" intelligence of the model / ability to make good decisions feels like it has plateaued since 4.5. 4.6 was maybe a small improvement, but hard to differentiate from in-context-learning with the 1M window. 4.7 if anything felt like a regression in wisdom for me and my coworkers, with it consistently making worse/lazier decisions.

Bnjoroge 3 hours ago|||

For long-running tasks, yes, 4.7 has been a noticeable improvement. Goes off the rails a lot less than 4.6 does. For shorter-sized windows, I haven't felt as much and agree that the harness improvements have been the biggest lever

csvance 1 hour ago||

When doing big long-running workflows, especially with plan mode, 4.7 was a clear improvement. It’s considerably worse for underspecified tasks and responds to a couple of sentences with 10+ paragraphs for explanatory type discussions.

themgt 1 hour ago||

Opus 4.7+ Max is a 10x engineer who wants to be left alone to work. When you talk to him, he infodumps on you to get you (his pointy haired idiot Dilbert boss) to go away.

bonoboTP 3 hours ago||||

To me 4.5 was mindblow, 4.6 noticeable, 4.7 more like a style/personality change regarding how much it asks back, how much it assumes, how eager it is to jump to action etc but not really in terms of my perception of its smartness.

alfalfasprout 7 minutes ago||||

I'm actually currently studying this :)

Honestly... not that dramatically. Each release is much more marginal. And quoted official benchmarks don't translate very well into the real world.

4.7 regressed hard in some ways. But a compounding factor too is that the claude code harness seems to nerf the model after a few months. Probably to reduce token use.

So far 4.8 seems less verbose but we'll see in practice what it translates into meaningfully.

somenameforme 2 hours ago||||

They all feel, more or less, the same to me in terms of output capabilities. Mostly get simple things right, can get more complex things right with nudging, eventually get stuck hard on something that takes a bunch of iterations through it/logging/etc or me fixing the code manually.

bcrosby95 2 hours ago||||

4.6 felt a bit better than 4.5 but slower. 4.7 doesn't feel better than 4.6.

giraffe_lady 2 hours ago|||

I actually don't see any personal productivity improvements from using opus over sonnet for coding. If you're keeping tasks small and conversations short, reading the code and correcting before changes go in, whatever advantages opus has aren't practically significant. It's also just talky as hell, overexplains anything it touches and every token produced this way increases the surface area for hallucination so you need to have your guard up even more with it.

There's a sweet spot of complexity for low importance tasks where it's just big enough I don't want to do it and just simple enough to have opus plan/delegate/review with another model. So possibly model improvements will grow this window, but currently I don't do much in there.

mrandish 1 hour ago|||

I suspect the more frequent incremental releases may also be to deploy new capabilities used by Anthropic to control costs and throttle consumption of resources. I assume any new controls they expose to end-users have far more granular sub-controls under the hood which they can meta-adjust for each user type.

They mention more granular control of effort, 'dynamic workflows' and more speed controls ("fast mode"). While they position them as user features, they also sound like the kinds of knobs Anthropic will need to twiddle on the back-end to balance costs, margins, ARR, and user growth vs retention post-IPO to hit key metrics in quarterly reporting.

gertlabs 2 hours ago|||

4.5/4.6 were roughly the same in our testing. Opus 4.7 is smarter, but it's difficult to use as a product for various personality issues. So far, Opus 4.8 seems to be going down that path (unusably slow, but this could be a launch day rollout problem). Full Opus 4.8 tests are in progress now.

Data at https://gertlabs.com/rankings

__s 1 hour ago||

"personality issues" I was able to tell that Opus 4.7 would take instructions more literally, which I appreciated once I calibrated my phrasing to be more precise (often asking to investigate issues, pre-4.7 it'd start making code changes instead of just giving write up). But I can see contexts where handling vague prompts would've just been worse

SkyPuncher 3 hours ago|||

> My own experience w/ 4.6 and 4.7 are that I don't firmly grasp any capabilities improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.

I've actually intentionally switched back to 4.5. I hated 4.7 so much that I decided to jump back all the way to 4.5.

Now that I've been using 4.5 for a few weeks, I find it significantly more reliable but a bit more forgetful than 4.6/4.7. I'm okay with that because it's really easy to identify this forgetfulness and nudge it.

I found 4.7's adaptive thinking to be extremely unreliable. It seems to overcorrect on the current message without considering the difficulty of the overall problem. I wonder if 4.8 will improve on that.

dwaltrip 2 hours ago||

If you are using Claude code, just set effort to xhigh.

This one change will probably solve 80% of the problems you have noticed.

whatevaa 24 minutes ago|||

Isn't xhigh on opus 4.7 very expensive on tokens?

dwaltrip 7 minutes ago||

I’ve never run into the limits on the $100 plan, and rarely even get close.

I normally have only one session going at once though.

orwin 2 hours ago|||

This. XHigh and the 'plan' mode for complex tasks is absolutely a must have.

Still, the context window is sometimes too small for my usage.

jayGlow 17 minutes ago||

agent teams can help with that: the main agent acts as an orchestrator and spawns sub agents to do the actual tasks, which generally keeps the main context from overflowing.

WhitneyLand 2 hours ago|||

“Maybe my own tastes are saturated now”

It might be saturated for smaller scopes of work, but it’s not hard to see the cracks when you scale up what you ask of SOTA models/agents.

One example, to try and single shot prompt coding a ChatGPT equivalent chatbot.

Sure it will spit something out, but the feature depth, UX subtleties, backend integration, and lots of pragmatic engineering decisions along the way will just not be baked.

Another example is building a C compiler from scratch which Anthropic showed is still a struggle to do.

Not that these specific examples are important but just to point out scaling up expectations shows the cracks.

It’s not just a model problem of course, better agents, orchestration features (like Dynamic Workflows mentioned in the post), all need to continue to evolve.

At what point my CS degree becomes totally useless is an open question.

ricardobeat 3 hours ago|||

4.7 was a significant jump in the ability to run long-horizon tasks. It immediately completed tasks that 4.6 was unable to, even though I have the impression that it became a bit less capable over the first few weeks after release.

It also seems to be helpless at effort levels < xhigh, I turn to Sonnet when simpler tasks are needed.

light_triad 2 hours ago|||

I've been using Claude Code regularly since the 4.5 release, and 4.7 was a significant regression: very unreliable, arguing about changes, deciding that fixes weren't needed, etc.

I'm hoping they recreate the magic of 4.5 but it's as much about the quality of harness, the memory and efficiency of the tools than simply the models at this point.

ahmadyan 1 hour ago|||

pretty spot on.

In my experience, Opus 4.0 was fantastic, major jump from 3.7. it was creative, super slow and expensive, and would sometime forget what it was doing, but it was getting the job done.

4.1 they made it much faster, so a lot of infra improvements.

4.5 was the time it could work on longer task, didn't make a lot of obvious mistakes of 4.0, and i think this was about the time the opus went mainstream, and all of the anthropic's compute crisis began, so instead of making the model better they tried to optimize it to reduce cost instead.

4.6 was such a bad model, they switched to adaptive thinking and it had so many bugs. poor api design, benchmaxxed and poor real-world results. i switched back to 4.5.

4.7 they just fixed the bugs they added in 4.6. Better than 4.5.

haven't fully tested 4.8 yet.

teruakohatu 35 minutes ago||

I gave 4.6 a miss and only recently switched from 4.5 to 4.7. I found that a particularly difficult task that 4.5 struggled with (getting stuck in loops and trying to convince me the problem had been solved) was quite solvable with 4.7.

cootsnuck 27 minutes ago|||

Well, it seems like collectively we are all struggling to perceive model progress, given that it seems like every reply to you is reporting different experiences with which of the models has subjectively performed best for them.

spaceman_2020 1 hour ago|||

I think 4.7 was an awful model in actual use. I never got anything out of it and it was frustratingly weird. This feels more like an attempt to course correct and isn't a real bump

throwaway63467 1 hour ago||

I think they overtrained on scientific papers or such as it would spout really sophisticated sounding nonsense with a ton of complicated verbs and adjectives. 4.6 was definitely better in that regard. The more I use these tools the more I think they’re not actually that revolutionary. I mean it’s still amazing what they can do but they have very clear limitations it seems.

binary0010 3 hours ago|||

Maybe try making a simple randomize script to swap the three latest models. And see if you can tell which ones are meaningfully different without knowing which ones are flipped on or off?
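A minimal sketch of what such a script could look like, assuming you wrap each model in a plain function that takes a prompt and returns text. The `stub_models` entries are hypothetical placeholders, not real API calls; swap each lambda for whatever client you actually use.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for three models. Replace each lambda with a real call.
stub_models: Dict[str, Callable[[str], str]] = {
    "model_a": lambda prompt: f"A says: {prompt}",
    "model_b": lambda prompt: f"B says: {prompt}",
    "model_c": lambda prompt: f"C says: {prompt}",
}


def blind_trial(
    prompt: str,
    models: Dict[str, Callable[[str], str]],
    rng: random.Random,
) -> Tuple[List[str], List[str]]:
    """Run the prompt through every model in a shuffled order.

    Returns the outputs and the hidden key (model names in the same order).
    """
    names = list(models)
    rng.shuffle(names)
    outputs = [models[name](prompt) for name in names]
    return outputs, names


def score_guesses(guesses: List[str], key: List[str]) -> int:
    """Count how many positions were attributed to the right model."""
    return sum(guess == actual for guess, actual in zip(guesses, key))


if __name__ == "__main__":
    outputs, key = blind_trial("Explain this stack trace", stub_models, random.Random())
    for position, text in enumerate(outputs, start=1):
        print(f"Response {position}: {text}")
    # Write down your guesses before looking at the key.
    print("Key:", key)
```

With three models, guessing at random gets about one position right per trial on average, so you would want a decent number of trials before concluding you can actually tell them apart.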

osigurdson 2 hours ago||

I find the quality ebbs and flows even on the same model. My guess it is something to do with GPU availability but only guessing.

atq2119 2 hours ago||

Unless you're systematically repeating the exact same task, the most parsimonious explanation is that you're seeing natural variation based on different tasks, random sampling of tokens, etc.

jimbokun 1 hour ago|||

How long would it take to evaluate a new coworker to say “wow she’s really bright?” Relative to your other coworkers?

A few days? A few weeks? Longer?

However a company releases a new AI model and within hours users are confidently proclaiming how much smarter it is than previous versions.

ifwinterco 46 minutes ago|||

4.7 uses more tokens and costs more for the same task than OG 4.5, that's about it

irthomasthomas 2 hours ago|||

Given that 4.7 was a brand new model, trained from scratch with a unique architecture and tokenization scheme, I don't see the same pattern. It seems arbitrary.

dominotw 2 hours ago||

i dont understand the nuances here. what does this mean. 4.8 is trained on same model as previous one then? what does brand new mean.

irthomasthomas 2 hours ago||

It means for 4.7 they trained a new base model with different architecture, different pre-training data (later knowledge cutoff), and a new tokenizer. Vs finetuning an existing model, which was the case for 4.6, and probably for 4.8.

dominotw 25 minutes ago|||

do you mean pre training? so 4.8 is just post training of an old pretrained model?

btw where do they tell you how they trained the model.

dominotw 27 minutes ago|||

whats a base model.

extr 3 hours ago|||

IMO they have all been clean and noticeable upgrades over their predecessors. Opus 4.7 in particular was a solid jump in capabilities.

TSiege 3 hours ago|||

most of my coworkers feel the opposite about 4.7 and that 4.6 was, to them, significantly better, to the point that several stopped using claude code

teruakohatu 34 minutes ago||

4.5 -> 4.7 was a solid jump for me having skipped 4.6. It probably does depend on the specific tasks.

NiloCK 3 hours ago|||

I think it's telling how split the opinions are around all of this. A lot of people distinctly disliked 4.7.

Are the dividing lines around personality? Working domains? Opinionated software stuff?

Who knows?

onlypassingthru 3 hours ago|||

The honesty will be noticeable. Maybe we'll see some honest assessments like "That is not possible within the laws of known physics", "Your legal argument is nonsensical and defies logic", "There is no evidence to support taking that will cure anything", etc., etc.

gigatexal 1 hour ago|||

why are the models the same price?

https://platform.claude.com/docs/en/about-claude/pricing

```
(all prices per MTok)
Model                                                       | Base Input Tokens | 5m Cache Writes | 1h Cache Writes | Cache Hits & Refreshes | Output Tokens
Claude Opus 4.8                                             | $5                | $6.25           | $10             | $0.50                  | $25
Claude Opus 4.7                                             | $5                | $6.25           | $10             | $0.50                  | $25
Claude Opus 4.6                                             | $5                | $6.25           | $10             | $0.50                  | $25
Claude Opus 4.5                                             | $5                | $6.25           | $10             | $0.50                  | $25
Claude Opus 4.1                                             | $15               | $18.75          | $30             | $1.50                  | $75
Claude Opus 4 (deprecated)                                  | $15               | $18.75          | $30             | $1.50                  | $75
Claude Sonnet 4.6                                           | $3                | $3.75           | $6              | $0.30                  | $15
Claude Sonnet 4.5                                           | $3                | $3.75           | $6              | $0.30                  | $15
Claude Sonnet 4 (deprecated)                                | $3                | $3.75           | $6              | $0.30                  | $15
Claude Haiku 4.5                                            | $1                | $1.25           | $2              | $0.10                  | $5
Claude Haiku 3.5 (retired, except on Bedrock and Vertex AI) | $0.80             | $1              | $1.60           | $0.08                  | $4
```

teruakohatu 33 minutes ago|||

Why shouldn’t they be? They are probably the same size and cost the same to run. They are not doing full training runs (eg Mythos) so don’t need to recover insane training costs.

cootsnuck 25 minutes ago||

I'd be kind of shocked if a model that came out six months ago is the same size and cost to run as one that just came out today.

staticman2 29 minutes ago|||

Opus 4.7 and presumably 4.8 are more expensive due to a new tokenizer that translates data into more tokens per input.

taytus 3 hours ago|||

Incremental gains compound.

itake 3 hours ago|||

meta threw in the towel when it came to producing AI models since their gains couldn't keep up with China.

HDThoreaun 2 hours ago||

Has meta stopped producing new models? I figured they were just regrouping after all the drama they’ve had recently. Meta’s massive user base means they don’t need to be involved in the customer acquisition rat race. Once they have a model they’re happy with they can have a billion people interacting with it within a month.

staticman2 28 minutes ago||

Meta released a major new closed source model a month or so ago.

It didn't make a splash like a new open source release would have.

paulddraper 3 hours ago|||

Exactly. Go back to Opus 4.5 and see how you like it.

You won't, really.

jere 1 hour ago|||

"it's smarter than me?"

You don't have to correct it dozens of times a day!? Really?

conartist6 3 hours ago|||

Just want to say there's no question that you're smarter than any (and every) AI.

NiloCK 2 hours ago|||

I appreciate the generosity, but you're gonna want to meet me first.

conartist6 2 hours ago||

Kind of the beauty of it is that I don't have to to know I'm right. The reason I know is that you're alive so you can do the one thing it can't ever do, which is know when to stop or give up. It would turn me and everything else in the world into paperclips repeating the same research 1,000,000 times over.

petesergeant 2 hours ago|||

No question at all that a dolphin swims better than a submarine.

rotcev 1 hour ago|||

[flagged]

Imustaskforhelp 1 hour ago||

I'm not sure about this, but I read something claiming that models are intentionally degraded over time, via lower quantizations, as a new model is about to drop.

This felt particularly visible during the 4.6 era, when people said that 4.6 felt dumber, and I remember someone doing some analysis that sort of showed the model was getting dumber over time.

This has the benefit of costing the company less to run on a standard subscription while, at the same time, making the next model "feel" better by comparison when it drops to the public.

Again, I am not sure whether this is the case; I'm merely proposing something that I feel is within the realm of possibility.

colonCapitalDee 3 hours ago||

"Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor."

This is a refreshing attitude!

I've also verified that you can now turn off adaptive thinking in the web UI, which is great. I've had a lot of problems with thinking not triggering and the model producing sub-par output. Glad we can finally turn it off. (I hope being able to turn off adaptive thinking is new, if I could have turned it off at any time that would be embarrassing)

gibspaulding 43 minutes ago||

I’m pretty sure that switch has always been there, but turning it off doesn’t do what you want. It disables thinking entirely.

kakugawa 3 minutes ago||

Opus 4.7 does not support disabling adaptive thinking (web, Claude Code). [1] Like the OP, I experienced similar issues and I'm glad that they brought back the ability to disable adaptive thinking in Opus 4.8.

[1] https://code.claude.com/docs/en/model-config#adaptive-reason...

> Opus 4.7 and later always use adaptive reasoning. The fixed thinking budget mode and `CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING` do not apply to them.

winwang 3 hours ago|||

Awesome, thanks for posting because I think I hit a possibly-spurious bug in turning Adaptive off when I switched models (4.6 -> 4.8, extra). Tried again, works as intended (I hope).

More importantly for me, though, is how CC will respond to 4.6-"only" flags for thinking. For now, it doesn't seem to clobber my setup.

smartmic 2 hours ago|||

> This is a refreshing attitude!

Well, I think the attitude is that costs are allowed to escalate faster and more steeply than the features delivered. From that perspective, semantic versioning is a handy tool for adjusting pricing strategies. IMHO, it (versioning) only makes sense for open-source projects, where you can clearly see the actual changes made with each version upgrade. Anything else is more than a little suspicious…

smsx 1 hour ago|||

The 4.8 model costs the same as its 4.7 predecessor.

drewnick 1 hour ago||||

While all these models are nondeterministic a feature bump is still necessary as the same input can have wildly different output on a new model. For API users being able to pin a model is a necessity.

zaptheimpaler 1 hour ago|||

All the 4.x models are still available, and they all cost the same.

ambicapter 7 minutes ago||

> Opus 4.7 and later use a new tokenizer compared to previous models, contributing to their improved performance on a wide range of tasks. This new tokenizer may use up to 35% more tokens for the same fixed text.

Same cost/token, more token usage.
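To put a number on it, here is a worst-case sketch using the Opus list prices quoted elsewhere in the thread ($5 / MTok input, $25 / MTok output). It assumes the full 35% inflation applies to both input and output, which is an upper bound from the quoted docs, not a measured figure; the request size is made up for illustration.

```python
def request_cost_usd(
    input_tokens: float,
    output_tokens: float,
    input_price_per_mtok: float = 5.0,    # Opus base input price from the pricing page
    output_price_per_mtok: float = 25.0,  # Opus output price from the pricing page
) -> float:
    """Cost of one request in USD at per-million-token list prices."""
    return (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000


# Hypothetical request: 100k input tokens and 10k output tokens on the old tokenizer.
TOKENIZER_INFLATION = 1.35  # "up to 35% more tokens", taken as the worst case

old_cost = request_cost_usd(100_000, 10_000)
new_cost = request_cost_usd(100_000 * TOKENIZER_INFLATION, 10_000 * TOKENIZER_INFLATION)

if __name__ == "__main__":
    print(f"old tokenizer: ${old_cost:.4f}")
    print(f"new tokenizer: ${new_cost:.4f} ({new_cost / old_cost:.2f}x)")
```

Under those assumptions the same request goes from $0.75 to about $1.01, so the effective price rises by up to 35% even though the per-token price is unchanged.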

jascha_eng 3 hours ago|||

The benchmark improvements actually look pretty damn nice tho!

comboy 25 minutes ago|||

"We've cut costs A LOT"

wahnfrieden 2 hours ago|||

What's refreshing about it given the context that 4.7 was a regression in many ways (including as measured by benchmarks)?

4.8 is also 2x more expensive for a "modest" performance bump. How refreshing.

This is just cope.

cootsnuck 22 minutes ago|||

> 4.8 is also 2x more expensive for a "modest" performance bump. How refreshing.

Where are you seeing it's 2x more expensive? https://platform.claude.com/docs/en/about-claude/pricing

murkt 21 minutes ago|||

Price hasn’t changed at all, though.

FergusArgyll 2 hours ago|||

I liked the "modest but tangible improvement" too! There is a cynical take here but I think I'm gonna hold it in...

ai_slop_hater 1 hour ago||

What do you mean? This is not just a new model, this is a new way of thinking.

northern-lights 3 hours ago||

> Not only that, but we plan to release a new class of model with even higher intelligence than Opus. As part of Project Glasswing, a small number of organizations are currently using Claude Mythos Preview for cybersecurity work. Models of this capability level require stronger cyber safeguards before they can be generally released. We’re making swift progress on developing these safeguards and expect to be able to bring Mythos-class models to all our customers in the coming weeks.

Probably more interesting than the 4.8 release.

andai 18 minutes ago||

In the Opus 4.7 release notes they mentioned intentionally making it worse at cybersecurity. [0]

This suggests that they're doing the same thing with Mythos now and the Mythos we get will be nerfed in that department?

Or more precisely, I think they'll have two versions of Mythos, and the scary one will probably continue to require a lot of paperwork.

https://www.anthropic.com/news/claude-opus-4-7

scuderiaseb 8 minutes ago|||

So this is how they’ll remove access from Claude Pro to the biggest models. You would need at least a Claude Max subscription for the bigger than Opus models I bet.

ac29 2 hours ago|||

More interesting than that to me is "we’re working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost"

Sonnet and Haiku look real outclassed for the price with current Chinese competition.

TIPSIO 2 hours ago|||

Seems like they might be hinting that if you are not a billionaire or multi-billion dollar company you will just get a limited and nerfed Claude Code slash command /mythos-security-audit or something.

Hope this isn’t the case and that the normal, average Joes of the world don’t get policed out of access.

dbbk 14 minutes ago|||

What does an average Joe need a Mythos level model for that Opus can't do for them?

freedomben 9 minutes ago||

It's not just better at cybersecurity, it's better at all the things (or most of them). I for one would really benefit from a better claude code. I still have to babysit it pretty closely to keep it from messing things up. Opus 4.7 was not an upgrade for me.

But in general, what does the average Joe need Opus for that Sonnet or Haiku can't do for them? Better is better.

gs17 2 hours ago||||

> you will just get a limited and nerfed Claude Code slash command /mythos-security-audit or something.

Unless it's so expensive that we can't realistically use it for anything, I wouldn't complain about getting at least that. I would also rather have the actual model, but that's a useful application of it (and I'm probably not going to afford using it for much more).

TIPSIO 1 hour ago|||

Price discrimination is, I think, fine and reasonable, so long as you can use it how you want within their ToS if you can drum up the cash.

Although mental safety gymnastics aside, getting the most amount of intelligence for the cheapest amount of cost to normal people seems like the most ethical thing a big lab could do.

Going around and granting different tiers of intelligence to different insiders, friends, or companies is majorly problematic long-term.

Heck right now, the tokens you buy today for “Opus 4.8”, no one even knows or believes will be the same “Opus 4.8” just 3 days from now.

vorticalbox 1 hour ago||||

some of the benchmarks i have seen also include cost, where one scan of the codebase cost tens of thousands of dollars.

this one [0] notes one run cost $20k to run but another cost $50.

[0] https://red.anthropic.com/2026/mythos-preview/

FinnKuhn 1 hour ago|||

/security-review already exists so I don't think it would be crazy to have a /mythos-security-review as a more thorough command as well. I think it's more likely it is going to be released at some point to the general public though - although the pricing might make it quite unattractive.

hedora 2 hours ago||||

Isn't OpenAI's public flagship already beating Mythos on penetration testing? I get the impression Mythos is just valuation-juicing for IPO more than anything else.

The fact that they haven't released it yet suggests a cost/margins issue to me more than anything else. Short term, I'll probably keep using Anthropic, but my long-term bet is that locally-served models win, if only because the quest for profitability will probably lead to intentionally-nerfed / enshittified frontier models.

At other vendors, ad placement within LLM responses is either coming or already here. Anthropic's handling of OpenClaw shows they're willing to engage in anti-competitive behavior, and the courts are not in a hurry to stop them. Why would I pay them $200 a month for such treatment when a $2K box does what I need locally?

ameliaquining 25 minutes ago|||

Mythos is dramatically better specifically at finding zero-day vulnerabilities and developing exploits for them, that being what it was designed to do. On other cybersecurity tasks, GPT-5.5 is at least as good, but finding and exploiting zero-days is a particularly scary capability, which is why Mythos is a big deal. See, e.g., https://forum.effectivealtruism.org/posts/8yztpbjuPkyXsmA6n/....

srmatto 1 hour ago|||

What benchmarks are you referencing that show a comparison of the models for penetration testing?

Tepix 2 hours ago|||

It does sound like an even higher API price tier for sure.

simonw 3 hours ago||

I generated pelicans riding bicycles on both thinking level low and thinking level high:

https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304...

The high one is notably better - the bicycle frame is the correct shape, unlike thinking level low.

For comparison, here's Opus 4.7: https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c...

eminence32 2 minutes ago||

I bet someone shares this link every time you post about bicycles, but since I didn't see anyone share it yet in this thread, I'll take the opportunity to do so:

https://www.gianlucagimini.it/portfolio-item/velocipedia/

Turns out even humans can be pretty bad at drawing bicycles :)

simonw 46 minutes ago|||

Here's pelicans in all of the thinking levels - low, medium, high, xhigh, max

https://tools.simonwillison.net/markdown-svg-renderer#url=ht...

stratos123 39 minutes ago||

Is the output on the max level meant to be missing?

simonw 37 minutes ago||

I just fixed that (force refresh). It hit my default 8,000 output token limit, it worked when I bumped that up.

For max I used 25 input, 17,167 output which cost me 43 cents! https://www.llm-prices.com/#it=25&ot=17167&ic=5&oc=25&sel=cl...
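For anyone checking the math, here is a small sketch of the calculation. It assumes the per-million-token prices encoded in the link's query string (`ic=5`, `oc=25`, i.e. $5 input and $25 output); the function name is just illustrative.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Cost of one request given per-million-token prices in USD."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Token counts from the comment above; prices assumed from the URL parameters.
cost = request_cost_usd(25, 17_167, 5.0, 25.0)
print(f"${cost:.4f}")  # $0.4293, i.e. about 43 cents
```

Nearly all of the cost is output tokens: the 25 input tokens contribute about a hundredth of a cent.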

GistNoesis 2 hours ago|||

> the bicycle frame is the correct shape

No, the handlebar is wrong. The handle bar is rotating the frame instead of rotating the front wheel. The handle bar should be mounted on the same line as the front wheel is.

Hopefully 4.9 will read my comments :)

loeg 1 hour ago||

Could be an extremely high angle stem that just happens to match the downtube angle.

jonas21 3 hours ago|||

Glad to see that the "high thinking" level adds a helmet. Always a smart choice.

spmartin823 3 hours ago|||

You've peed in the pool Simon, this has to be a part of the internal evals by now! You got to try something new - maybe a panda in a canoe?

phainopepla2 2 hours ago|||

If these were in the internal evals then the output would be much better. The 4.8 pelicans are pretty meh

HDThoreaun 2 hours ago|||

Click the link

ceroxylon 2 hours ago|||

I really like that thinking level high gave the pelican a helmet.

Xunjin 3 hours ago|||

Hey simonw I love your test, do you think using thinking level "max" makes sense for this test? I would love to see the results about it.

simonw 1 hour ago||

I don't think the API supports "max" as an option, that might just be a Claude Code harness thing.

UPDATE: My mistake, the API does support max. I added a max one at the bottom of this page (cost 43 cents): https://tools.simonwillison.net/markdown-svg-renderer#url=ht...

toastmaster11 2 hours ago|||

I find the most miraculous thing about 4.7 to be that the pelican is facing left, wonder why the right facing everything is so ubiquitous in these images.

i000 1 hour ago|||

This happened to me in elementary school. We were doing fingerpaintings using plasticine. After all the bikes were hung on the wall, mine was racing the other way... Somehow it really stuck with me.

gboss 1 hour ago||||

It's facing left but looking right...

toastmaster11 1 hour ago||

Profound political commentary?

silisili 1 hour ago|||

The vast majority (if not all) of these make it impossible to turn, among other fun things. Only out of curiosity, have you tried prompting further with how a bike must operate to see if it does the right thing?

nickvec 3 hours ago|||

Is the "opossum riding an e-scooter" benchmark in the works for Opus 4.8? ;)

simonw 3 hours ago|||

Good call, it's cute: https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304... - but nothing like GLM-5.1: https://static.simonwillison.net/static/2026/glm-possum-esco...

yanis_t 3 hours ago|||

Simon, does your pelican test really capture differences among models, or should you at least try like 10 times or something to average out the random effects?

simonw 3 hours ago||

I've been meaning to do a "run 3 times and pick the best" version for quite a while, I should really pull the trigger on that one. Currently it's one-shot only.

xiphias2 2 hours ago||

Best-of-3 would be cheating and ruin the test; middle of 3 makes more sense.

nik736 2 hours ago||

Why would you need the 3rd run if you pick the "one in the middle"?

jmaw 45 minutes ago||

Middle as in not the best, and not the worst. As opposed to the second generated in sequence.

But not the best/not the worst is somewhat subjective.. so not sure how well that would work.
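A minimal sketch of the "middle of 3" idea, assuming you can attach a numeric quality score to each output at all. That is exactly the subjective part, so the scores below are made up and `middle_of_three` is a hypothetical helper, not anything from the actual test.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def middle_of_three(outputs: Sequence[T], score: Callable[[T], float]) -> T:
    """Return the output that is neither the best nor the worst of three,
    ranked by `score` (independent of generation order)."""
    if len(outputs) != 3:
        raise ValueError("expected exactly three outputs")
    ranked = sorted(outputs, key=score)
    return ranked[1]  # index 1 is the median of three


# Hypothetical human-assigned scores for three runs of the same prompt.
scores = {"run_a.svg": 6.5, "run_b.svg": 8.0, "run_c.svg": 4.0}
picked = middle_of_three(list(scores), lambda name: scores[name])
print(picked)  # run_a.svg
```

Taking the median rather than the max keeps one lucky run from flattering the model, which is the "cheating" concern with best-of-3.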

timsuchanek 2 hours ago|||

thanks for always providing this very much on time. I'm wondering what the next, harder challenge could be? Maybe some animated svg?

1attice 3 hours ago|||

That little red hat on hard mode is sending me. 4.8 has whimsy

whalesalad 1 hour ago|||

Eventually the frontier model folks are going to pick up on your pelican on a bike test and bake-in flawless results for that particular request.

highwaylights 2 hours ago|||

Am I allowed to say that pelican's little helmet is adorable? I can't provide a strong computational proof, or even a shred of anecdata...

...but that pelican's little helmet is adorable.

onlyrealcuzzo 3 hours ago||

4.7 reigns supreme IMO.

senko 1 hour ago||

My fav coding benchmark for frontier models is to build a simple RTS game in one file (js/html/css). Claude Code with Opus 4.8 in ultracode mode nailed it, the best result so far:

https://bsky.app/profile/senko.net/post/3mmwnrkwboc2v

The prompt was: Create a simple but functional real time strategy (RTS) game similar to old WarCraft, StarCraft or Command & Conquer games. The player should be able to build buildings, create units, gather resources and should uncover the whole map. No AI or multiplayer needed. Use simple but nice-looking graphics. No sound. Implement everything in HTML/CSS/JS, everything in a single file (you can use 3rd-party js or css libraries/frameworks via CDN).

apitman 1 hour ago||

I like that benchmark. You should throw the results up on GitHub pages so people can try out the games.

brandly 9 minutes ago||

Yeah! Host on GitHub pages, so it's easy to click a link and play!

digdugdirk 32 minutes ago|||

Do you have a collection of these benchmark apps saved anywhere? I'd be particularly interested in seeing the relative cost differences between different models in a use case like this.

elAhmo 46 minutes ago|||

What is ultracode mode?

tcoff91 28 minutes ago||

it's a brand new mode

jclay 1 hour ago|||

It almost appears as if the code was minified. The variable names are short and formatting looks like it's written to minimize whitespace. Did it write it in this compact format all on its own?

andai 14 minutes ago||

A friend sent me something he vibe coded which included a massive webassembly blob in the HTML file. My friend is not a programmer so he was not able to explain to me how it did that.

jryan49 26 minutes ago|||

Kinda buggy, but impressive nonetheless. How long did it take?

l3x4ur1n 52 minutes ago||

Played it to the end. Pretty neat!

hereme888 6 minutes ago||

Early ArtificialAnalysis.ai results show GPT 5.5 is still the better bang-for-your-buck.

OpenAI solves tasks with about 50% fewer output tokens.

https://artificialanalysis.ai/?intelligence=coding-index&int...

onlyrealcuzzo 3 hours ago||

Does anyone troll these releases and cherry pick random metrics other companies would cherry pick to show how amazing their models are?

There's like 8 million benchmarks. Every release, every model randomly picks 5-10 where they win in everything except 1, to make it look like they aren't randomly cherry picking benchmarks they probably benchmaxxed for.

aronowb14 3 hours ago||

https://arena.ai/leaderboard - I’ve found this company is a pretty good ranker - not sure their exact methodology but during day to day programming with Claude / gpt models I’ve felt qualitatively what they report

XCSme 2 hours ago|||

Also check mine[0], basically random private tests/questions and an ok-ish methodology, testing mostly for general intelligence rather than coding-specific tasks.

I built it for myself, to test which models to use via OpenRouter for my n8n agents. Currently actually still using gpt-5.3-codex for many things, as its pricing is really good in production (due to how their token caching works).

Gemini models still have the best intelligence (when asked any questions, most likely to get it right), but in production they still have many failure modes[1].

[0]: https://aibenchy.com

[1]: https://news.ycombinator.com/item?id=48230368

BoorishBears 1 minute ago||

Every model release you'll post this, and every time I'll be there to point out how it's completely useless (for reasons you've shared are intentional) and does things like place the old Gemini 3 Flash above the infinitely more capable 3.5 Flash, Opus 4.5 through Opus 4.8, and gpt-5.5.

At least, until hopefully one day HN has a rule about accounts that derive 99.9999% of their engagement with the site from shilling a personal project.

reckless 1 hour ago||||

No way is Muse Spark generally better than offerings from Google and OpenAI. I actually find arena to be amongst the most useless indicators

WarmWash 1 hour ago||||

On paper it's one of the best because it's meant to be blind comparison of your own prompts. However if you are someone who geeks hard on one or a few models, you learn their "personality" and can recognize them in a blind test.

Bnjoroge 2 hours ago||||

Have you seen https://deepswe.datacurve.ai/blog? This is the closest to a vibe check i’ve felt even with the open models.

Imustaskforhelp 1 hour ago||

This actually looks like a really good test.

There are many benchmarks all for specific use cases but with them the difference seems to be in extreme points (93% vs 92%)

I think that tracks, but still, it was refreshing to see a benchmark that can help me form better opinions.

Surprised about Mimo v2.5, within artificial-analysis and other benchmarks, the difference between Mimo and deepseek seems very partial and a lot of focus/(hype?) is on Deepseek

But mimo seems like an interesting model and they are having some crazy discounts too.

Deepseek is valuable for the research community because of how open they are but absolutely crazy to think how Xiaomi basically pulled up in creating Mimo given that they didn't have anything till quite recently.

Either way, an interesting benchmark, also a plus point for giving golang some decent representation equal to python/typescript.

I think there is a set of tasks, resembling normal benchmarks, where open source models can be absolutely fine. For a very small fraction of more technical things, the benchmark that you linked starts to be a better projection. So it depends upon the scale of complexity, but it's good to see how models compete given enough complexity. Definitely fascinating.

I would be interested to see more models compete on this test. The current range is still a bit limited compared to other benchmarks, but OSS models like Kimi/mimo seem to only be 3-4 months (at most 6) behind closed source.

morley 2 hours ago||||

I'm finding it a little hard to believe that GPT 5.5 is in 11th place for webdev, outranked by models like Kimi, Qwen, and Z.ai. I'm not saying it's not true (I have noticed GPT being less smart in recent weeks), but this is very different from my expectation.

dakolli 1 hour ago|||

If you don't know their methodology, or anything about it, why do you think it's a good ranker?

nerevarthelame 3 hours ago|||

It's interesting they only included 6 metrics this time. Opus 4.7 had 12, and 4.6 had 13.

Of the metrics they reported for 4.7, for 4.8 they excluded BrowseComp, CharXiv Reasoning, CyberGym, GPQA Diamond, MCP Atlas, MMMLU, SWE-bench Verified. The last 4 were almost always mentioned in previous Opus releases.

onlyrealcuzzo 3 hours ago||

Gonna assume it's because they barely budged or moved downward and most of their reported benchmark results are probably within sampling errors...

hyperpape 3 hours ago||

They will release a system card, and you can then confirm or disconfirm your assumptions.

ddosmax556 2 hours ago|||

I would take all benchmarks with a grain of salt. I don't really use them. What's it supposed to tell me? "5% smarter", what does that mean? My experience will differ. Just try it!

I doubt Anthropic internally sets as a goal to improve this or that benchmark - it's just a way to visualize progress. They probably have much more complex metrics internally.

bel8 3 hours ago|||

On this note, is there a benchmark aggregator to compile all benchmarks in a single large grid?

jpadkins 2 hours ago||

I find this site useful https://artificialanalysis.ai/leaderboards/models

YetAnotherNick 3 hours ago||

At least they show competitors in any benchmark, compared to OpenAI which likes to pretend that there isn't any competitor.

827a 1 hour ago||

Frontier models are mostly past the point of human ability to discern whether they are actually better or worse than predecessors and competitors. I suspect the benchmarks may also be saturated, or at least past their usefulness.

I personally feel that Anthropic doesn't understand what this means for the frontier labs, and moreover that they might be the only frontier lab that doesn't.

1. Google dropped Gemini 3.5 Flash at IO, delaying the release of 3.5 Pro for a bit (they have said it's coming). They also released a refreshed Antigravity, and drew special attention to how cheaply they were able to build their toy operating system to play Doom (less than $1000 IIRC).

2. OpenAI has dumped everything into Codex, is offering double the token limits for the next few weeks IIRC, and is offering business discounts. Their head of Codex has tweeted that 5.5 is "extremely efficient", implying that they aren't actually losing money on any of this.

3. DeepSeek and other Chinese labs have dropped token pricing to the floor, in some situations as much as 99%.

4. Anthropic releases the next generation of Opus, their most expensive public model, without changing its price. In the background, they hype up Mythos, an even more expensive model.

Anthropic has screwed up where they need to be making investments, and the cracks are starting to show. They've marginally underinvested in the Sonnet line of models for almost a year now, and they've critically underinvested in product. Anthropic made bets on the story of the second half of 2026 being: ultra-frontier, ultra-intelligence. In reality, what's shaping up is that the story will be: Companies rolling back AI spend, efficiency, "95% as good for 15% the price", sophisticated high quality harnesses, cheaper models. Anthropic isn't ready for this world.

dbgrman 56 seconds ago||

thats a pretty cynical take.

> past the point of human ability to discern whether they are actually better or worse

This is lack of imagination. If you use these models heavily enough, pretty soon you'll hit the edges of their capabilities. The smarter among us are collecting these problems into a personal benchmark and use that to judge model capability. I think this is the right approach, and dare I say, even better than generic benchmarks. To me, it matters less what the benchmark says, and more what my particular problems are.

andai 4 minutes ago|||

Tried using everything that isn't Claude and I keep switching back to Claude because even the smarter models give me uglier code, or miss common sense requirements. (And the dumber models give me code that doesn't work properly).

I keep trying to switch to something else but I keep coming back. (Typically after a few days of giving a new model an honest go, and finding myself constantly asking Sonnet to fix its output... Yes, even Sonnet wins on this front! They really do have some kind of special sauce.)

I'm not where most of their money comes from though, and I don't know how universal my experience is.

brokencode 1 hour ago|||

Anthropic’s story over the past year has been nothing but explosive growth that they can’t keep up with, but now they’re suddenly doomed? Seems pretty far fetched to me.

No idea why you’d say they have critically underinvested in product when Claude Code dominates and they’ve also released popular tools like Cowork and integrations for Microsoft products at an incredibly rapid pace.

Cost is becoming more of a factor, and no doubt they’ll work on that. There’s no reason to think they won’t be able to release cheaper models if they optimize for that rather than improving performance.

827a 1 hour ago||

I never said they were doomed. Where did you get that idea? I said they aren't ready for this world. That means they screwed up and need to get ready. They let the Mythos hype get to their heads while the world changed beneath them.

jonnycoder 1 hour ago|||

No, no it's been pretty easy with software engineering. I work on two types of projects and it's very easy to ask claude for a plan, then have gpt 5.5 rip it to shreds and find legit issues, and vice versa. If both 5.5 and claude 4.8 can independently create a plan and both find no critical or high issues, then we will be at that point.

chis 1 hour ago|||

I think it's probably too soon to say. I certainly still feel that large coding tasks are getting better and better with each model. I'd guess lawyers, doctors, etc feel similarly.

It feels like the only way to push the limits of newer models is with really long context questions that require reasoning. Any short request will naturally just be within the distribution of all the recent models so there isn't a performance difference there.

I think the near future is looking like a bunch of business-critical tasks that scale infinitely with better reasoning, all being done on whatever the most advanced model is at a high cost. Trading stocks, running a business, looking for tax dodges, writing high-performance code. These are all things where there's a tangible return on each jump in reasoning.

827a 1 hour ago||

We'll have to agree to disagree on that last point. I think that, historically (past ~6 months), "always use the most advanced model" being the norm is really just an artifact of both: The most advanced models oftentimes being the only model that can solve these problems; and: Infinite AI budgets.

loeg 1 hour ago|||

I thought 4.7 was noticeably better than 4.6.

dyauspitr 1 hour ago|||

The Chinese stuff is good enough for up to 80% of the frontier on most text tasks but they are significantly worse at code. They just don’t “get” what you’re asking for like Codex and Claude and require so many more iterations to get close to what you need.

827a 1 hour ago||

Agreed. But we're seeing Cursor (now SpaceX) take these models and add great coding capability on top of them. Frontier model providers should be concerned that Composer 2.5 costs $0.50/$2.50 (versus Opus 4.8 $5/$25). That's why Google prioritized Gemini 3.5 Flash, and talked up how near-frontier it is ($1.50/$9).
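To make the gap concrete, here is a rough sketch using the input/output prices quoted above (USD per million tokens). The workload of 2M input and 0.5M output tokens is an arbitrary assumption for illustration, and this says nothing about how many tokens each model actually needs for the same task.

```python
# (input, output) prices in USD per million tokens, as quoted in the comment.
PRICES = {
    "Composer 2.5": (0.50, 2.50),
    "Gemini 3.5 Flash": (1.50, 9.00),
    "Opus 4.8": (5.00, 25.00),
}


def workload_cost(input_tokens: int, output_tokens: int, prices: tuple) -> float:
    """Cost in USD of a workload at the given per-million-token prices."""
    input_price, output_price = prices
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Assumption: a hypothetical workload of 2M input and 0.5M output tokens.
costs = {name: workload_cost(2_000_000, 500_000, p) for name, p in PRICES.items()}
baseline = costs["Opus 4.8"]

for name, cost in costs.items():
    print(f"{name}: ${cost:.2f} ({cost / baseline:.0%} of Opus 4.8)")
```

On that workload Composer 2.5 comes out at a tenth of the Opus bill and Gemini 3.5 Flash at about a third, before accounting for retries or extra iterations on the cheaper models.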

llmslave 1 hour ago||

anthropic is crushing it, this analysis is laughable. they are only constrained by GPUs

gslepak 3 hours ago||

On page 102 of the system card [1] I'm pleased to see evaluation against "creative mastery".

In our work we asked several frontier AIs to come up with an API we needed. We compared Opus 4.7 and GPT-5.5 (among others). Opus 4.7 came up with the most creative and intelligent API design that pleasantly surprised us, especially given that GPT-5.5 was passing it on various coding benchmarks.

What I noticed is that we don't have a common benchmark to measure "creativity" and "ingenuity", and in some ways such a benchmark would conflict with the common IFBench benchmark. Yet this is a very important skill when designing systems. I'm glad to see Anthropic putting thought into it, and would love to see a public benchmark for this that other models could compare themselves to.

[1] https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0...

MattRogish 2 hours ago||

Agreed, my vibes tell me 4.6 is a better coder than 4.7. 4.7 is a much better strategic thinker and maintains overall "better architecture" than 5.5. 5.5 is way better than either at coding, but more expensive. So I have 4.7 do the planning/architecture, 4.6 does the coding, then 5.5 critiques and fixes it.

dimitri-vs 1 hour ago||

This is my exact vibesperience

suprfnk 1 hour ago||

Agreed, these are my vibes too. It feels much better to do planning and strategy and architecture etc. with Opus 4.7 than GPT-5.5. GPT just feels like a robot that gets instructions and does exactly that. Opus feels like an almost human that sometimes has actually good ideas and pushes back on bad ideas.

So for now it's planning/architecture/strategy -> Opus. Pure coding -> GPT.

Helps with agentic coding that GPT is much roomier with the tokens you get.

silverlight 2 hours ago|

Unfortunately they seem to have straight up broken Claude Code either with this release in the backend or the new CC version. Errors about "can't modify thinking blocks" are bricking long-running sessions: https://github.com/anthropics/claude-code/issues?q=is%3Aissu...

javawizard 1 hour ago||

Same. It's not a good look to have happen right when they roll out a new model.

whalesalad 1 hour ago|||

That is part of the charm of working with Claude. Every time they release anything new - all your shit will break.

solenoid0937 2 hours ago||

Try updating maybe?

Fabricio20 2 hours ago|||

I just installed/upgraded to try out 4.8 and in only 3 messages I hit this bug! Seems something is broken on CC.

silverlight 2 hours ago|||

I'm on the latest version (2.1.154 as of this comment). Based on the timestamps on those Issues being reported I think it's happening on the latest version.

I'm sure it will get fixed eventually/soon, just annoying to update and have your workflow break.
