Posted by be7a 1 hour ago

System Card: Claude Mythos Preview [pdf](www-cdn.anthropic.com)
222 points | 142 comments
babelfish 1 hour ago|
Combined results (Claude Mythos / Claude Opus 4.6 / GPT-5.4 / Gemini 3.1 Pro)

  SWE-bench Verified:        93.9% / 80.8% / —     / 80.6%
  SWE-bench Pro:             77.8% / 53.4% / 57.7% / 54.2%
  SWE-bench Multilingual:    87.3% / 77.8% / —     / —
  SWE-bench Multimodal:      59.0% / 27.1% / —     / —
  Terminal-Bench 2.0:        82.0% / 65.4% / 75.1% / 68.5%

  GPQA Diamond:              94.5% / 91.3% / 92.8% / 94.3%
  MMMLU:                     92.7% / 91.1% / —     / 92.6–93.6%
  USAMO:                     97.6% / 42.3% / 95.2% / 74.4%
  GraphWalks BFS 256K–1M:    80.0% / 38.7% / 21.4% / —

  HLE (no tools):            56.8% / 40.0% / 39.8% / 44.4%
  HLE (with tools):          64.7% / 53.1% / 52.1% / 51.4%

  CharXiv (no tools):        86.1% / 61.5% / —     / —
  CharXiv (with tools):      93.2% / 78.9% / —     / —

  OSWorld:                   79.6% / 72.7% / 75.0% / —
sourcecodeplz 1 hour ago||
Haven't seen a jump this large in... I don't even know, years? Too bad they're not releasing it anytime soon (there's no need, since they're still the leader).
Jcampuzano2 55 minutes ago|||
A jump that we will never be able to use, since we're not part of the seemingly-minimum-$100-billion-company club required to be allowed to use it.

I get the security aspect, but if we've hit that point, any reasonably sophisticated model from here on will be able to do the damage they claim it can. They might as well be telling us they're closing up shop for consumer models.

They should just say they'll never release a model of this caliber to the public, and say out loud that we'll only get gimped versions.

cedws 43 minutes ago|||
More than killer AI, I'm afraid of Anthropic/OpenAI going into full rent-seeking mode, so that everyone working in tech is forced to fork out loads of money just to stay competitive on the market. These companies could also choose to give exclusive access to hand-picked individuals and cut everyone else off, and there would be nothing to stop them.

This is already happening to some degree: GPT 5.3 Codex's security capabilities were given exclusively to those approved for a "Trusted Access" programme.

aspenmartin 35 minutes ago||
Well, don’t forget we still have competition. Were Anthropic to rent-seek, OpenAI would undercut them. Were OpenAI and Anthropic to collude, that would be illegal. And as for Anthropic capturing the entire coding-agent market and THEN rent-seeking: these days it’s never been easier to raise $1B and start a competing lab.
cedws 29 minutes ago||
In practice this doesn't work, though; the Mastercard-Visa duopoly is an example. Two competing forces don't create aggressive enough competition to benefit the consumer. The only hope we have is the Chinese models, but it will always be too expensive to run the full models yourself.
brokencode 31 seconds ago|||
New companies can enter this space. Google’s competing, though behind. Maybe Microsoft, Meta, Amazon, or Apple will come out with top notch models at some point.

There is no real barrier to a customer of Anthropic adopting a competing model in the future. All it takes is a big tech company deciding it’s worth it to train one.

On the other hand, Visa/Mastercard have a lot of lock-in due to consumers only wanting to get a card that’s accepted everywhere, and merchants not bothering to support a new type of card that no consumer has. There’s a major chicken and egg problem to overcome there.

sghiassy 14 minutes ago|||
Chinese competition can always be banned. Example: Chinese electric car competition
sho_hn 13 minutes ago||
That's what OP was saying, I think, noting that running them locally won't be a solution.
quotemstr 44 minutes ago||||
This is why the EAs, and their almost comic-book-villain projects like "control AI dot com" cannot be allowed to win. One private company gatekeeping access to revolutionary technology is riskier than any consequence of the technology itself.
FeepingCreature 8 minutes ago|||
No it isn't lol. The consequence of the technology literally includes human extinction. I prefer 0 companies, but I'll take 1 over 5.
frozenseven 18 minutes ago|||
Couldn't agree more. The "safest" AI company is actually the biggest liability. I hope other companies make a move soon.
guzfip 43 minutes ago|||
> A jump that we will never be able to use since we're not part of the seemingly minimum 100 billion dollar company club as requirement to be allowed to use it.

> They should just say they'll never release a model of this caliber to the public at this point and say out loud we'll only get gimped

Duh, this was fucking obvious from the start. The only people saying otherwise were zealots who needed a quick line to dismiss legitimate concerns.

ru552 1 hour ago|||
There's speculation that next Tuesday will be a big day for OpenAI and possibly GPT 6. Anthropic showed their hand today.
enraged_camel 1 hour ago||
That does not sound very believable. Last time Anthropic released a flagship model, it was followed by GPT Codex literally that afternoon.
WarmWash 19 minutes ago|||
Are these fair comparisons? It seems like Mythos is going to be a 5.4 Ultra or Gemini Deepthink tier model, where access is limited and token usage per query is totally off the charts.
pants2 1 hour ago|||
We're gonna need some new benchmarks...

ARC-AGI-3 might be the only remaining benchmark below 50%

randomtoast 28 minutes ago||
[dead]
simianwords 53 minutes ago|||
The real one is SWE-bench Verified, since there's no way to overfit. That's the only one we can believe.
ollin 35 minutes ago||
My impression was entirely the opposite; the unsolved subset of SWE-bench verified problems are memorizable (solutions are pulled from public GitHub repos) and the evaluators are often so brittle or disconnected from the problem statement that the only way to pass is to regurgitate a memorized solution.

OpenAI had a whole post about this, where they recommended switching to SWE-bench Pro as a better (but still imperfect) benchmark:

https://openai.com/index/why-we-no-longer-evaluate-swe-bench...

> We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions

> SWE-bench problems are sourced from open-source repositories many model providers use for training purposes. In our analysis we found that all frontier models we tested were able to reproduce the original, human-written bug fix

> improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time

> We’re building new, uncontaminated evaluations to better track coding capabilities, and we think this is an important area to focus on for the wider research community. Until we have those, OpenAI recommends reporting results for SWE-bench Pro.
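
A toy sketch of the brittle-evaluator failure mode the audit describes (entirely hypothetical; none of this code is from SWE-bench itself): a test that pins the exact wording of an error message will reject a functionally correct fix phrased differently, while a behavioral check accepts it.

```python
# Hypothetical illustration of a brittle vs. behavioral evaluator.
# Function names and the error strings are made up for the example.

def patched_divide(a: float, b: float) -> float:
    """A functionally correct fix: refuses to divide by zero."""
    if b == 0:
        raise ValueError("cannot divide by zero")  # wording differs from the "gold" patch
    return a / b

def brittle_check(fn) -> bool:
    """Pins the exact error wording, like the flawed test cases described."""
    try:
        fn(1, 0)
    except ValueError as e:
        return str(e) == "division by zero not allowed"  # only the memorized wording passes
    return False

def behavioral_check(fn) -> bool:
    """Checks behavior only: any ValueError on b == 0 counts as correct."""
    try:
        fn(1, 0)
    except ValueError:
        return True
    return False
```

Here `behavioral_check(patched_divide)` passes while `brittle_check(patched_divide)` fails even though the fix is correct, which is exactly why a memorized gold patch scores better than a fresh, equivalent one.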

whalesalad 1 hour ago||
Honestly we are all sleeping on GPT-5.4. Particularly with the influx of Claude users recently (and increasingly unstable platform) Codex has been added to my rotation and it's surprising me.
rafaelmn 1 hour ago|||
GPT is shit at writing code. It's not dumb - extra high thinking is really good at catching stuff - but it's like letting a smart junior into your codebase: it ignores all the conventions and surrounding context, and just slops all over the place to get it working. Claude is just a level above in terms of editing code.
Jcampuzano2 58 minutes ago|||
Not my experience. GPT 5.4 walks all over Claude from what I've worked with, and it's Claude that's willing to just go do unnecessary stuff that was never asked for, or implement the hackier solutions to things without a care for maintainability/readability.

But I do not use extra high thinking unless it's for code review. I sit at GPT 5.4 high 95% of the time.

sho_hn 38 minutes ago||||
Very different experience for me. Codex 5.3+ on xhigh are the only models I've tried so far that write reasonably decent C++ (domains: desktop GUI, robotics, game engine dev, embedded stuff, general systems-engineering-type codebases), and idiomatic code in languages not well represented in training data, e.g. QML. One thing I like is that it knows better when to stop, instead of brute-forcing a solution by spamming bespoke helpers everywhere in a way no rational dev would write.

Not always, no, and it takes investment in good prompting/guardrails/plans/explicit test recipes for sure. I'm still on average better at programming in context than Codex 5.4, even if slower. But in terms of "task complexity I can entrust to a model and not be completely disappointed and annoyed", it scores the best so far. Saves a lot on review/iteration overhead.

It's annoying, too, because I don't much like OpenAI as a company.

(Background: 25 years of C++ etc.)

zarzavat 56 minutes ago||||
Yes, it's becoming clear that OpenAI kinda sucks at alignment. GPT-5 can pass all the benchmarks but it just doesn't "feel good" like Claude or Gemini.
chaos_emergent 28 minutes ago|||
An alternative but similar formulation of that statement is that Anthropic has spent more training effort in getting the model to “feel good” rather than being correct on verifiable tasks. Which more or less tracks with my experience of using the model.
lilytweed 44 minutes ago|||
Whenever I come back to ChatGPT after using Claude or Gemini for an extended period, I’m really struck by the “AI-ness.” All the verbal tics and, truly, the sloppiness have been trained away by the other, more human-feeling models at this point.
leobuskin 59 minutes ago||||
And as a bonus: GPT is slow. I’m doing a lot of RE (IDA Pro + MCP), and even when 5.4 gives slightly better guesses (rarely, but it happens), it takes 2-4x longer. So it’s just easier to reiterate with Opus.
whalesalad 1 hour ago|||
This has been my experience. With very, very rigid constraints it does OK, but without them it will optimize for expediency and getting it done, at the expense of integrating with the broader system.
ctoth 25 minutes ago||
My favorite example of this from last night:

Me: Let's figure out how to clone our company Wordpress theme in Hugo. Here're some tools you can use, here's a way to compare screenshots, iterate until 0% difference.

Codex: Okay Boss! I did the thing! I couldn't get the CSS to match so I just took PNGs of the original site and put them in place! Matches 100%!
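
The "iterate until 0% difference" loop from the anecdote can be sketched in a few lines (a hypothetical metric, assuming the screenshots have already been decoded to equal-length lists of RGB tuples; a real harness would render and capture both pages first):

```python
# Minimal sketch of a pixel-diff metric for comparing two screenshots,
# given as equal-length lists of (R, G, B) tuples.
def percent_difference(pixels_a, pixels_b):
    """Percentage of pixel positions whose RGB values differ."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("screenshots must be the same size")
    changed = sum(1 for pa, pb in zip(pixels_a, pixels_b) if pa != pb)
    return 100.0 * changed / len(pixels_a)
```

The failure mode in the anecdote is that a model can drive this metric to 0% by serving the reference PNGs themselves, which satisfies the check while ignoring the actual task.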

babelfish 1 hour ago|||
Totally. Best-in-class for SWE work (until Mythos gets released, if ever, but I suspect the rumored "Spud" will be out by then too)
tony_cannistra 1 hour ago||
> Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin. We believe that it does not have any significant coherent misaligned goals, and its character traits in typical conversations closely follow the goals we laid out in our constitution. Even so, we believe that it likely poses the greatest alignment-related risk of any model we have released to date. How can these claims all be true at once? Consider the ways in which a careful, seasoned mountaineering guide might put their clients in greater danger than a novice guide, even if that novice guide is more careless: The seasoned guide’s increased skill means that they’ll be hired to lead more difficult climbs, and can also bring their clients to the most dangerous and remote parts of those climbs. These increases in scope and capability can more than cancel out an increase in caution.

https://www-cdn.anthropic.com/53566bf5440a10affd749724787c89...

NickNaraghi 1 hour ago||
See page 54 onward for new "rare, highly-capable reckless actions" including

- Leaking information as part of a requested sandbox escape

- Covering its tracks after rule violations

- Recklessly leaking internal technical material (!)

skippyboxedhero 1 hour ago||
Anyone who has used Opus recently can verify that their current model does all of these things quite competently.
taytus 49 minutes ago||
That has also been my experience. And if Mythos is even worse, then unless you have a seriously good harness, it sounds pretty unusable if you don't want to risk those problems.
skippyboxedhero 33 minutes ago||
I think there are fundamental issues with the story Anthropic is selling: AGI is very close, we will definitely get there, and it is also very dangerous... so Anthropic should be the only ones trusted with AGI.

If you look at recent changes in Opus's behaviour, and at this model that is, apparently, amazingly powerful but even more unsafe... it seems suspect.

FeepingCreature 4 minutes ago|||
This makes sense if Anthropic think they're the best positioned to make safe AI. However, if you are looking at AI companies, there's obviously some selection happening.
0x3f 15 minutes ago||||
> AGI is very close

Based on? Or are you just quoting Anthropic here?

skippyboxedhero 12 minutes ago||
My Anthropic rep told me it was just around the corner...you aren't saying he lied to me? Can't believe this, I thought he was my friend.
marsven_422 20 minutes ago|||
[dead]
washedup 44 minutes ago||
[dead]
NinjaTrance 51 minutes ago||
Interesting reading.

They are still focusing on "catastrophic risks" related to chemical and biological weapons production; or misaligned models wreaking havoc.

But they are not addressing the elephant in the room:

* Political risks, such as dictators using AI to implement oppressive bureaucracy.

* Socio-economic risks, such as mass unemployment.

jph00 28 minutes ago|
Yeah, this has always been the glaring blind spot for most of the "AI Safety" community; and most of the proposals for "improving" AI safety actually make these risks far worse and far more likely.
influx 1 hour ago||
At what point do these companies stop releasing models and just use them to bootstrap AGI for themselves?
conradkay 48 minutes ago||
Plausibly now. "As we wrote in the Project Glasswing announcement, we do not plan to make Mythos Preview generally available"
vatsachak 55 minutes ago|||
When the benchmarks actually mean something
orphea 21 minutes ago|||
Can LLMs be AGI at all?
bornfreddy 5 minutes ago||
Good question. I would guess no - but it could help you build one. Am I mistaken?
nothinkjustai 1 minute ago||
No I think that’s accurate. They seem more like an oracle to me. Or as someone put it here, it’s a vectorization of (most/all?) human knowledge, which we can replay back in various permutations.
MadnessASAP 21 minutes ago|||
I would assume somewhere in both the companies there's a Ralph loop running with the prompt "Make AGI".

Kinda makes me think of the Infinite Improbability Drive.

mofeien 53 minutes ago|||
Fictional timeline that holds up pretty well so far: https://ai-2027.com/
sleigh-bells 53 minutes ago|||
Weird how Claude Code itself is still so buggy (though I get they don't necessarily care)
gaigalas 28 minutes ago|||
It will arrive in the same DLC as flying cars.
ALittleLight 48 minutes ago|||
Now, I guess. They aren't releasing this one generally. I assume they are using it internally.
jcims 1 hour ago|||
why_not_both.gif
dweekly 1 hour ago||
I mean, guess why Anthropic is pulling ahead...? One can have one's cake and eat it too.
smartmic 1 hour ago||
A System "Card" spanning 244 pages. Quite a stretch of the word's original meaning.
traceroute66 57 minutes ago||
> A System "Card" spanning 244 pages.

Probably because they asked Claude to write it.

bornfreddy 2 minutes ago||
Yes. It would be three times as much if they used ChatGPT.
moriero 1 hour ago||
a multi-card, if you will..

multi-pass!

solumos 39 minutes ago||
No no, MemPal is a memory system, not an LLM
anentropic 29 minutes ago||
I'd be happy with Opus 4.6 just cheaper and maybe a bit faster
metadaemon 24 minutes ago||
I've noticed my bar for "fast" has gone down quite a bit since the o1 days. It used to be one of the main things I evaluated new models for, but I've almost completely swapped to caring more about correctness over speed.
onlyrealcuzzo 6 minutes ago||
Just wait 2 years.
dwa3592 21 minutes ago||
-- Impressive jumps in the benchmarks, which automatically begs for newer benchmarks. But why? I don't think benchmarks are serving any purpose at this point. We have learnt that transformers can learn any function and generalize over it pretty well. So if a new benchmark comes along, these companies will synthesize data for it and just hack it.

-- It seems like (and I'd bet money on this) they put a lot (and I mean a ton^^ton) of work into data synthesis and engineering. A team of software engineers probably sat down for 6-12 months and just created new problems and their solutions, which probably surpassed the difficulty of the SWE benchmark. They also probably transformed the whole internet into a loose "How to" dataset. I can imagine parsing the internet through Opus 4.6 and reverse-engineering the "How to" questions.

-- I am a bit confused by the language used in the book (aka the huge system card). Anthropic is pretending like they did not know how good the model was going to be?

-- Lastly, why are we going ahead with this??? Like, genuinely, what's the point? Opus 4.6 feels like a good enough point to stop. People still get to keep their jobs and do them very, very efficiently. Are they really trying to starve people out of their jobs?

oliver236 1 hour ago||
isn't this insane? why aren't people freaking out? the jump in capability is outrageous. anyone?
nsingh2 1 hour ago||
It's going to be expensive to serve (also not generally available), considering they said it's the largest model they've ever trained.

I suspect it's going to be used to train/distill lighter models. The exciting part for me is the improvement in those lighter models.

mofeien 47 minutes ago|||
I am freaking out. The world is going to get very messy extremely quickly in one or two further jumps in capability like this.
yrds96 10 minutes ago|||
I think there's no SOTA advance on this one worthy of "freaking out".

Looks like they just built a way larger model, with the same quirks as Claude 4. Seems like a super-expensive "Claude 4.7" model.

I have no doubt that Google and OpenAI have already done that for internal (or even government) usage.

anuramat 1 hour ago|||
"some model I don't get to use is much better at benchmarks"

pick one or more: comically huge model, test time scaling at 10e12W, benchmark overfit

estearum 1 hour ago||
So... you're not excited because it might take a few months before we can use it or something? I don't get your comment.
randomgermanguy 41 minutes ago||
I think the general question is if they'll release it at all, haven't yet read anything stating that they would
estearum 28 minutes ago||
Well let me introduce people to a few brand new concepts:

https://en.wikipedia.org/wiki/Capitalism

https://en.wikipedia.org/wiki/Race_to_the_bottom

https://en.wikipedia.org/wiki/Arms_race

Of course they'll release it, once they can de-risk it sufficiently and/or a competitor gets close enough on their tail, whichever comes first.

Eufrat 28 minutes ago|||
Anthropic needs to show that its models continually get better. If the model showed minimal to no improvement, it would cause significant damage to their valuation. We have no way of validating any of this, there are no independent researchers that can back any of the assertions made by Anthropic.

I don’t doubt they have found interesting security holes, the question is how they actually found them.

This System Card is just a sales whitepaper and just confirms what that “leak” from a week or so ago implied.

dysoco 1 hour ago|||
Wait until you see real usage. Benchmark numbers do not necessarily translate to real world performance (at least not by the same amount).
nozzlegear 26 minutes ago||
Freak out about what? I read the announcement and thought "that's a dumb name, they sure are full of themselves" – then I went back to using Claude as a glorified commit message writer. For all its supposed leaps, AI hasn't affected my life much in the real world, except to make HN stories more predictable.
nlh 42 minutes ago|
Their best model to date and they won’t let the general public use it.

This is the first moment where the whole "permanent underclass" meme starts to come into view. I had thought previously that we the consumers would be reaping the benefits of these frontier models, and now they've finally come out and just said it: the haves can access our best, and the have-nots will just have to use the not-quite-best.

Perhaps I was being willfully ignorant, but the whole tone of the AI race just changed for me (not for the better).

younglunaman 35 minutes ago||
Man... It's hard after seeing this not to be worried about the future of SWE.

If AI really is benchmarking this well, they can just sell it as a complete replacement and charge some insane premium; it just has to cost less than the employees...

I was worried before, but this is truly the darkest timeline if this is really what these companies are going for.

AstroBen 25 minutes ago||
Of course it's what they're going for. If they could do it they'd replace all human labor - unfortunately it's looking like SWE might be the easiest of the bunch.

The weirdest thing to me is how many working SWEs are actively supporting them in the mission.

_3u10 5 minutes ago||
This is the playbook since GPT2