
Posted by mindingnever 1 day ago

Will It Mythos?(swelljoe.com)
306 points | 218 comments | page 5
fsadsadsdasdas 22 hours ago|
Truth is stranger than fiction.
bottlepalm 22 hours ago||
Surprise... someone downplaying Mythos/Fable who didn't actually use it. Plenty of comments here say the contrary, including mine: in my own experience, Fable was easily a step change in capability over Opus, figuring things out while reverse engineering binaries that Opus plain couldn't find.
SwellJoe 22 hours ago|
Who are you talking about? I don't believe I have downplayed anything? And, I did briefly use Fable. It was excellent for general coding but it was blocked before I could benchmark it. I kinda suspect it would refuse this task, though. I never had access to Mythos.
davedx 21 hours ago|
I don't understand the article.

"I’d say this benchmark answers with a resounding, “Maybe.”

Mythos maybe really is better than the other current models at finding security bugs"

Yet in the results, I don't see Mythos?

It seems like a really well researched article with lots of results for other models, yet the title seems to be clickbait because the results don't contain Mythos, do they?

olmo23 21 hours ago||
> Yet in the results, I don't see Mythos?

Mythos is the 100% against which the other models are compared.

scotty79 21 hours ago||
Bugs the other models were benchmarked on are from the corpus that Mythos found. So Mythos might have 100% in this benchmark.

Although the benchmark had a $100 budget cap and rudimentary tooling, so probably a bit less than 100%.

GPT-5.5-pro attempted only 4 problems out of 9 before the budget ran out and got 2 of them right.

It's a shame that the author didn't try GPT-5.5-pro on all 9 just for completeness, perhaps on a subscription to save money.

SwellJoe 20 hours ago|||
Also, with regard to tools, I originally ran a batch of several models in a full-featured agent (and whatever tools the agent provides), and they didn't perform better than the basic minimal harness with just read and grep. They chewed more tokens but didn't find more bugs. I'm currently doing tests with more advanced tools, like tree-sitter, so the model can better understand execution and data flow, and semgrep (which is almost cheating, since it finds bugs on its own, but worth a try since models can still be useful in helping rule out false positives and suggest mitigations). When I've got time for it, I'll also give them a full dev environment with compiler, debugger, and maybe fuzzer, and a loop that iterates through a security bug hunting checklist (since a single prompt and context window can't handle that much complexity at once).
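
For readers wondering what a "minimal harness with just read and grep" might look like, here is a rough sketch of two such tools in Python. It is an illustration only, not the harness used for the benchmark: the function names `read_file` and `grep`, their parameters, and the output format are all assumptions made up for this example.

```python
import re
from pathlib import Path
from typing import List, Optional


def read_file(root: Path, rel_path: str, start: int = 1,
              end: Optional[int] = None) -> str:
    """Return lines start..end (1-indexed, inclusive) with line-number prefixes.

    Hypothetical tool: refuses paths that resolve outside the source tree.
    """
    base = root.resolve()
    path = (root / rel_path).resolve()
    if not path.is_relative_to(base):  # Python 3.9+
        raise ValueError("path escapes the source tree")
    lines = path.read_text(errors="replace").splitlines()
    last = len(lines) if end is None else min(end, len(lines))
    first = max(start, 1)
    return "\n".join(f"{n}: {lines[n - 1]}" for n in range(first, last + 1))


def grep(root: Path, pattern: str, max_hits: int = 50) -> List[str]:
    """Search every file under root for a regex; return 'file:line: text' hits.

    Hypothetical tool: max_hits caps the output so one broad pattern
    cannot flood the model's context window.
    """
    regex = re.compile(pattern)
    hits: List[str] = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="replace")
        except OSError:
            continue  # unreadable file: skip rather than abort the search
        for n, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                rel = path.relative_to(root).as_posix()
                hits.append(f"{rel}:{n}: {line.strip()}")
                if len(hits) >= max_hits:
                    return hits
    return hits
```

A model would call something like `grep(root, r"strcpy\(")` to locate candidate sites and then `read_file(root, "a.c", 1, 40)` to inspect the surrounding code. The point of keeping the tools this small is that the model has to do the reasoning itself.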
scotty79 15 hours ago||
We can't really know in what manner Mythos was used to find these bugs, right?
SwellJoe 14 hours ago||
Right. I noted that in the post. Some of the information out of Anthropic indicates dumb loops, sometimes, but some hint at a more sophisticated harness and process for some of the Mythos bug hunts. But, nothing specific.

I've been doing more benchmarks with additional tools, with no silver bullet revealing itself thus far.

SwellJoe 21 hours ago|||
At the time a GPT subscription didn't include Pro usage in the rolling limits. It was billed at API rates. Does it now?

If anyone wants to fund the other five cases (~$125), I'll run them. I find that an unrealistic cost, though...simply not useful data. I'm certainly not going to spend $23 per file to audit a project with hundreds or thousands of files. I don't know anyone who would.

Also note that it was a $100 cap per model, and the next most expensive model was GPT 5.5 at a 20th of the price per case, about ten bucks for the whole batch.
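
A back-of-the-envelope sketch of that arithmetic in Python, using only the approximate figures quoted in this thread (about $23 per case for Pro, a $100 cap, 9 cases). The function names and the 1,000-file project size are assumptions for illustration, not numbers from the benchmark.

```python
def cases_within_budget(cost_per_case: float, budget: float) -> int:
    """How many whole cases fit under a fixed budget cap."""
    return int(budget // cost_per_case)


def total_cost(cost_per_unit: float, units: int) -> float:
    """Cost of running `units` cases (or files) at a flat per-unit price."""
    return cost_per_unit * units


PRO_COST_PER_CASE = 23.0   # approximate figure quoted above
BUDGET_CAP = 100.0         # per-model cap
TOTAL_CASES = 9

attempted = cases_within_budget(PRO_COST_PER_CASE, BUDGET_CAP)
remaining = TOTAL_CASES - attempted
print("cases attempted under the cap:", attempted)
print("remaining cases:", remaining)
print("rough cost of the remaining cases:",
      total_cost(PRO_COST_PER_CASE, remaining))

# Hypothetical project size, chosen only to show how the cost scales.
print("rough cost for a 1,000-file audit:",
      total_cost(PRO_COST_PER_CASE, 1000))

# "A 20th of the price per case" for the next most expensive model.
cheaper_per_case = PRO_COST_PER_CASE / 20
print("cheaper model, whole batch:",
      round(total_cost(cheaper_per_case, TOTAL_CASES), 2))
```

At roughly $23 a case, four cases fit under the cap, which matches the four Pro attempts mentioned above. The same per-file price across a thousand files lands in the tens of thousands of dollars.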

scotty79 18 hours ago||
I have a ~$100/mo sub, and I have Pro in the chat app and Extra High in Codex for GPT-5.5.

I think on a sub, tokens might be 100 times cheaper.

The quota is also generous in my opinion. I can vibecode a lot most days of the week and not run out.

SwellJoe 13 hours ago||
But GPT 5.5 on extra high is not Pro. When I looked into it, Pro was not available for agentic use via any rolling limits plan. But, I'll look again into whether there's some reasonable way to complete the test for GPT Pro.
scotty79 12 hours ago||
Ah, right. Sorry, my mistake. I have access to it in chat but not in Codex.