System Card: Claude Mythos Preview [pdf]

Posted by be7a 4 hours ago

System Card: Claude Mythos Preview [pdf](www-cdn.anthropic.com)

380 points | 260 commentspage 3

therealdeal2020 2 hours ago|

is it just hype building or real? I don't care, shut up and take my money haha

awestroke 4 hours ago||

I predict they will release it as soon as Opus 4.6 is no longer in the lead. They can't afford to fall behind. And they won't be able to make a model that is intelligent in every way except cybersecurity, because that would decrease general coding and SWE ability

chippiewill 4 hours ago|

Alternatively they'll just wreck it down a bit so it beats a competitor but isn't unsafe.

dwa3592 3 hours ago||

-- Impressive jumps in the benchmarks which automatically begs the need for newer benchmarks but why?. I don't think benchmarks are serving any purpose at this point. We have learnt that transformers can learn any function and generalize over it pretty well. So if a new benchmark comes along - these companies will syntesize data for the new benchmark and just hack it?

-- It seems like (and I'd bet money on this) that they put a lot (and i mean a ton^^ton) of work in the data synthesis and engineering - a team of software engineers probably sat down for 6-12 months and just created new problems and the solutions, which probably surpassed the difficult of SWE benchmark. They also probably transformed the whole internet into a loose "How to" dataset. I can imagine parsing the internet through Opus4.6 and reverse-engineering the "How to" questions.

-- I am a bit confused by the language used in the book (aka huge system card)- Anthropic is pretending like they did not know how good the model was going to be?

-- lastly why are we going ahead with this??? like genuinely, what's the point? Opus4.6 feels like a good enough point where we should stop. People still get to keep their jobs and do it very very efficiently. Are they really trying to starve people out of their jobs?

juleiie 3 hours ago||

Honestly if that was some kind of research paper, it would be wholly insufficient to support any safety thesis.

They even admit:

"[...]our overall conclusion is that catastrophic risks remain low. This determination involves judgment calls. The model is demonstrating high levels of capability and saturates many of our most concrete, objectively-scored evaluations, leaving us with approaches that involve more fundamental uncertainty, such as examining trends in performance for acceleration (highly noisy and backward-looking) and collecting reports about model strengths and weaknesses from internal users (inherently subjective, and not necessarily reliable)."

Is this not just an admission of defeat?

After reading this paper I don't know if the model is safe or not, just some guesses, yet for some reason catastrophic risks remain low.

And this is for just an LLM after all, very big but no persistent memory or continuous learning. Imagine an actual AI that improves itself every day from experience. It would be impossible to have a slightest clue about its safety, not even this nebulous statement we have here.

Any sort of such future architecture model would be essentially Russian roulette with amount of bullets decided by initial alignment efforts.

Stevvo 4 hours ago||

"Claude Mythos Preview’s large increase in capabilities has led us to decide not to make it generally available."

Disappointing that AGI will be for the powerful only. We are heading for an AI dystopia of Sci-Fi novels.

girvo 2 hours ago||

Not surprising though, this was always going to be the end result within our current systems I think. When you add up: scaling power and required cost, then how talent concentrates in our economic systems, we were always going to end up with monopolies I think

Unless governments nationalise the companies involved, but then there’s no way our governments of today give this power out to the masses either.

gom_jabbar 1 hour ago||

Expected outcome. Nick Land and the CCRU have explored how capitalism operationalizes science fiction (distilled in the concept of Hyperstition). Viewed through this lens, prices encode "distributed SF narratives." [0]

[0] Nick Land (1995). No Future in Fanged Noumena: Collected Writings 1987-2007, Urbanomic, p. 396.

LoganDark 4 hours ago||

> Claude Mythos Preview’s large increase in capabilities has led us to decide not to make it generally available.

Shame. Back to business as usual then.

Tepix 4 hours ago|

I for one applaud them for being cautious.

LoganDark 3 hours ago||

Being cautious is fine. Farming hype around something that may as well not exist for us should be discouraged. I do appreciate the research outputs.

vonneumannstan 4 hours ago||

Are you guys ready for the bifurcation when the top models are prohibitively expensive to normal users? If your AI budget $2000+ a month? Or are you going to be part of the permanent free tier underclass?

adi_kurian 4 hours ago||

If one is to believe the API prices are reasonable representation of non subsidized "real world pricing" (with model training being the big exception), then the models are getting cheaper over time. GPT 4.5 was $150.00 / 1M tokens IIRC. GPT o1-pro was $600 / 1M tokens.

vonneumannstan 3 hours ago||

You can check the hardware costs for self hosting a high end open source model and compare that to the tiers available from the big providers. Pretty hard to believe its not massively subsidized. 2 years of Claude Max costs you 2,400. There is no hardware/model combination that gets you close to that price for that level of performance.

adi_kurian 3 hours ago||

Yes that's why I said API price. I once used the API like I use my subscription and it was an eye watering bill. More than that 2 year price in... a very short amount of time. With no automations/openclaw.

OsrsNeedsf2P 4 hours ago|||

Inference for the same results has been dropping 10x year over year[0]

[0] https://ziva.sh/blogs/llm-pricing-decline-analysis

ceejayoz 3 hours ago||

Sure, but "the same results" will rapidly become unacceptable results if much better results are available.

hibikir 3 hours ago|||

When we go with any other good in the economy, price is always relevant: After all, the price is a key part of any offering. There are $80-100k workstations out there, but most of us don't buy them, because the extra capabilities just aren't worth it vs, say a $3000 computer, and or even a $500 one. Do I need a top specialist to consult for a stomachache, at $1000 a visit? Definitely not at first.

There's a practical difference to how much better certain kinds of results can be. We already see coding harnesses offloading simple things to simpler models because they are accurate enough. Other things dropped straight to normal programs, because they are that much more efficient than letting the LLM do all the things.

There will always be problems where money is basically irrelevant, and a model that costs tens of thousand dollars of compute per answer is seen as a great investment, but as long as there's a big price difference, in most questions, price and time to results are key features that cannot be ignored.

swader999 3 hours ago||||

Yes, it will always be an arms race game.

esafak 3 hours ago|||

Or will they rapidly become indistinguishable since they both get the job done?

asadm 2 hours ago||

if it can pay my rent, why not?

jdthedisciple 3 hours ago||

Opus 4.6 is already incredible so this leap is huge.

Although, amusingly, today Opus told me that the string 'emerge' is not going to match 'emergency' by using `LIKE '%emerge%'` in Sqlite

Moment of disappointment. Otherwise great.

bornfreddy 3 hours ago||

I only have 3 points against LLMs: they lack reason and they can't count.

FeepingCreature 3 hours ago||

'emer ge' is two tokens, 'emergency' is one. The models think in a logosyllabic language.

More comments...