Posted by be7a 3 hours ago
Any benchmarks where we constraint something like thinking time or power use?
Even if this were released no way to know if it’s the same quant.
Mythos preview has higher accuracy with fewer tokens used than any previous Claude model. Though, the fact that this incredibly strong result was only presented for BrowseComp (a kind of weird benchmark about searching for hard to find information on the internet) and not for the other benchmarks implies that this result is likely not the same for those other benchmarks.
Although, amusingly, today Opus told me that the string 'emerge' is not going to match 'emergency' by using `LIKE '%emerge%'` in Sqlite
Moment of disappointment. Otherwise great.
A month ago I might have believed this, now I assume that they know they can't handle the demand for the prices they're advertising.
I remember when OpenAI created the first thinking model with o1 and there were all these breathless posts on here hyperventilating about how the model had to be kept secret, how dangerous it was, etc.
Fell for it again award. All thinking does is burn output tokens for accuracy, it is the AI getting high on its own supply, this isn't innovation but it was supposed to super AGI. Not serious.
“All that phenomenon X does is make a tradeoff of Y for Z”
It sounds like you’re indignant about it being called thinking, that’s fine, but surely you can realize that the mechanism you’re criticizing actually works really well?
I've read that about Llama and Stable Diffusion. AI doomers are, and always have been, retarded.
https://arxiv.org/html/2402.06664v1
Like think carefully about this. Did they discover AGI? Or did a bunch of investors make a leveraged bet on them "discovering AGI" so they're doing absolutely anything they can to make it seem like this time it's brand new and different.
If we're to believe Anthropic on these claims, we also have to just take it on faith, with absolutely no evidence, that they've made something so incredibly capable and so incredibly powerful that it cannot possibly be given to mere mortals. Conveniently, that's exactly the story that they are selling to investors.
Like do you see the unreliable narrator dynamic here?
Also they just hit a $30B run-rate, I don't think they're that needy for new hype cycles.
What do you find surprising here?
The point is that this whole "the model is too powerful" schtick is a bunch of smoke and mirrors. It serves the valuation.
Do you don't believe that the vulnerabilities found by these agents are serious enough to warrant staggered release?
Sorry kid.
You must see some value, or are you in a situation where you're required to test / use it, eg to report on it or required by employer?
(I would disagree about the code, the benefits seem obvious to me. But I'm still curious why others would disagree, especially after actively using them for years.)
I don't think the issue is with the model, it is with the implication that AGI is just around the corner and that is what is required for AI to be useful...which is not accurate. The more grey area is with agentic coding but my opinion (one that I didn't always hold) is that these workflows are a complete waste of time. The problem is: if all this is true then how does the CTO justify spending $1m/month on Anthropic (I work somewhere where this has happened, OpenAI got the earlier contract then Cursor Teams was added, now they are adding Anthropic...within 72 hours of the rollout, it was pulled back from non-engineering teams). I think companies will ask why they need to pay Anthropic to do a job they were doing without Anthropic six months ago.
Also, the code is bad. This is something that is non-obvious to 95% of people who talk about AI online because they don't work in a team environment or manage legacy applications. If I interview somewhere and they are using agentic workflow, the codebase will be shit and the company will be unable to deliver. At most companies, the average developer is an idiot, giving them AI is like giving a monkey an AK-47 (I also say this as someone of middling competence, I have been the monkey with AK many times). You increase the ability to produce output without improving the ability to produce good output. That is the reality of coding in most jobs.
AI isn't good enough to replace a competent human, it is fast enough to make an incompetent human dangerous.
Anthropic is burning through billions of VC cash. if this model was commercially viable, it would've been released yesterday.
There's a practical difference to how much better certain kinds of results can be. We already see coding harnesses offloading simple things to simpler models because they are accurate enough. Other things dropped straight to normal programs, because they are that much more efficient than letting the LLM do all the things.
There will always be problems where money is basically irrelevant, and a model that costs tens of thousand dollars of compute per answer is seen as a great investment, but as long as there's a big price difference, in most questions, price and time to results are key features that cannot be ignored.
Disappointing that AGI will be for the powerful only. We are heading for an AI dystopia of Sci-Fi novels.
Unless governments nationalise the companies involved, but then there’s no way our governments of today give this power out to the masses either.
[0] Nick Land (1995). No Future in Fanged Noumena: Collected Writings 1987-2007, Urbanomic, p. 396.
Shame. Back to business as usual then.
They even admit:
"[...]our overall conclusion is that catastrophic risks remain low. This determination involves judgment calls. The model is demonstrating high levels of capability and saturates many of our most concrete, objectively-scored evaluations, leaving us with approaches that involve more fundamental uncertainty, such as examining trends in performance for acceleration (highly noisy and backward-looking) and collecting reports about model strengths and weaknesses from internal users (inherently subjective, and not necessarily reliable)."
Is this not just an admission of defeat?
After reading this paper I don't know if the model is safe or not, just some guesses, yet for some reason catastrophic risks remain low.
And this is for just an LLM after all, very big but no persistent memory or continuous learning. Imagine an actual AI that improves itself every day from experience. It would be impossible to have a slightest clue about its safety, not even this nebulous statement we have here.
Any sort of such future architecture model would be essentially Russian roulette with amount of bullets decided by initial alignment efforts.