Posted by mindingnever 21 hours ago

Will It Mythos?(swelljoe.com)
305 points | 216 comments | page 3
irthomasthomas 15 hours ago|
I find this interesting:

  …no model performed better with an Agent, a couple performed worse, and time/tokens/costs were consistently much higher with the agent in the loop, for some reason.
Someone should build a harness where features are only added if they are proven net positive to outcomes.
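
A minimal sketch of what such a gate might look like, assuming each run yields a score and a cost. The function name `keep_feature`, the thresholds, and the sample numbers are all hypothetical and are not taken from the article's harness:

```python
from statistics import mean
from typing import Sequence


def keep_feature(
    baseline_scores: Sequence[float],
    feature_scores: Sequence[float],
    baseline_cost: float,
    feature_cost: float,
    min_gain: float = 0.0,        # hypothetical: required improvement in mean score
    max_cost_ratio: float = 1.5,  # hypothetical: tolerated cost multiplier
) -> bool:
    """Keep a harness feature only if it improves outcomes at an acceptable cost."""
    gain = mean(feature_scores) - mean(baseline_scores)
    cost_ratio = feature_cost / baseline_cost
    return gain > min_gain and cost_ratio <= max_cost_ratio


# Made-up numbers: the feature adds nothing to the score and triples the cost.
agent_in_loop = keep_feature(
    baseline_scores=[4, 5, 4],
    feature_scores=[4, 4, 4],
    baseline_cost=1.0,
    feature_cost=3.0,
)
```

With these illustrative inputs the feature would be rejected, which is the outcome the quoted result suggests for the agent loop. A real gate would also need repeated runs to separate a genuine gain from noise.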
StizzurpXDD 19 hours ago||
This just shows that Google needs to double down on its AI models fast. Even open-source Chinese models are beating 3.1 Pro and 3.5 Flash in almost everything.
SwellJoe 11 hours ago||
Gemma 4 beat Gemini 3.1 Pro, as well. In a later replication test I haven't published yet, it found more bugs than all other models (somewhat inconsistently) when given multiple attempts. So, it seems like they are doing real work, but on making models efficient rather than making them bigger. Gemma 4 12b is the most effective vision model I've tested, including models several times its size.
linzhangrun 18 hours ago||
Google said they would bring 3.5 Pro this month. I've been waiting for a month now.
tomcam 6 hours ago||
I feel like this could achieve techempower-level legend status
rbbydotdev 12 hours ago||
I find it ironic: we now have to use lesser models, which potentially write MORE buggy code, rather than greater models, which would let you write LESS buggy code. It's paradoxical.
yapyap 12 hours ago|
I wouldn’t agree that there’s a paradox to be found in what you're proposing
wiz21c 13 hours ago||
As a European, it's funny to read those stories about Fable and not be able to check for myself. It feels like being a kid watching other kids playing with nicer toys.
x187463 13 hours ago|
If it makes you feel any better, nobody is playing with the toys, now.
wiz21c 13 hours ago||
ah, you're right, I thought it was disabled only for "export", but the PR explicitly says everybody:

https://www.anthropic.com/news/fable-mythos-access

GL26 18 hours ago||
Frankly, after testing out Fable last week, it was just a bigger sink of tokens than anything else. The tokens it consumed weren't worth the steps it saved me compared to using Opus 4.8.
lukaslalinsky 10 hours ago|
As much as I hate to say this, I think it is user error. Fable is very to the point, much more so than any other Anthropic model. I found it cheaper to use Fable than to use Opus for the same task, but to achieve that, it needs to be given a targeted task.
jonplackett 17 hours ago||
I thought the whole point was that it doesn’t need to be pointed at the problem. That’s a much easier problem to solve. Also you eliminate 10000 false positives.
SwellJoe 17 hours ago|
They were not pointed at the problem. You're reading the section about corpus selection and mixing it up with the benchmark rules.

And, false positives are reported in the results.

FartyMcFarter 18 hours ago||
Is the title a reference to "will it blend"?
fb03 3 hours ago||
omg, core memory unlocked
DGCA 14 hours ago||
That is the question
ryangg 15 hours ago|
The leaderboard sorting is very misleading, gpt-5.5-pro only found 2 while mimo-v2.5-pro found 4.5 out of 9 cases.
b-zee 13 hours ago||
Mentioned directly under the table:

> Note GPT 5.5 Pro is at the top of the leaderboard only because it blew through $100 budget after only completing four cases, so 2/4 is 50%. And, a couple of other results, both Qwen models, are skewed upward in the detect % ranking because of failure to complete all cases.
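
To make the arithmetic behind that note concrete, here is a rough sketch of how the two ways of computing detect % diverge. The figures (2 found in 4 completed cases, 4.5 found in 9) come from the comments above; the `Result` class, its field names, and the assumption that mimo-v2.5-pro completed all 9 cases are illustrative only:

```python
from dataclasses import dataclass


@dataclass
class Result:
    model: str
    found: float     # bugs detected, as reported in the thread
    completed: int   # cases the model finished before stopping
    total: int = 9   # cases in the benchmark, per the thread

    @property
    def rate_completed(self) -> float:
        # Detect % over the cases the model actually finished.
        return self.found / self.completed if self.completed else 0.0

    @property
    def rate_total(self) -> float:
        # Detect % over every case; unfinished cases count as misses.
        return self.found / self.total


results = [
    Result("gpt-5.5-pro", found=2, completed=4),
    Result("mimo-v2.5-pro", found=4.5, completed=9),  # completed=9 is assumed
]

by_completed = sorted(results, key=lambda r: r.rate_completed, reverse=True)
by_total = sorted(results, key=lambda r: r.rate_total, reverse=True)
```

Over completed cases both models sit at 50% (2/4 and 4.5/9), so the order is decided by the tie-break. Over all nine cases gpt-5.5-pro drops to 2/9, roughly 22%, and mimo-v2.5-pro ranks first.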

SwellJoe 11 hours ago||
Yeah, I'm not super happy with the chart sorting order, but trying to balance all the information is challenging. I chose not to include partials (right place, inaccurate bug description, so it smelled something funny but didn't quite understand it) in the sort order, but maybe should.

And, it does feel wrong that the unrealistically expensive model that no one in their right mind would use for anything but the most critical tasks (and even then, a committee of ten of the best alternatives would cost half as much) is at the top. But, GPT 5.5 Pro did find a bug nobody else found among the four cases it got to, hinting at some real difference. It may be closer to Mythos than others, but at an absurd price. It'd cost tens of thousands of dollars to audit all the files in a large codebase, versus maybe fifty bucks for MiMo or DeepSeek.
