Posted by mindingnever 23 hours ago

Will It Mythos?(swelljoe.com)
305 points | 217 comments | page 4
ryangg 17 hours ago|
The leaderboard sorting is very misleading: gpt-5.5-pro found only 2, while mimo-v2.5-pro found 4.5 out of 9 cases.
b-zee 14 hours ago||
Mentioned directly under the table:

> Note GPT 5.5 Pro is at the top of the leaderboard only because it blew through $100 budget after only completing four cases, so 2/4 is 50%. And, a couple of other results, both Qwen models, are skewed upward in the detect % ranking because of failure to complete all cases.
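
To make the sorting issue concrete, here is a minimal Python sketch using only the numbers quoted in this thread (2 of 4 completed cases for gpt-5.5-pro, 4.5 of 9 for mimo-v2.5-pro). The `Result` class, its field names, and the assumption that mimo-v2.5-pro completed all 9 cases are illustrative, not taken from the benchmark's code.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical record for one leaderboard row (names are assumptions)."""
    model: str
    found: float     # detections as quoted in the thread (4.5 taken as given)
    completed: int   # cases finished before the budget ran out
    total: int = 9   # cases in the benchmark, per the thread

    @property
    def pct_of_completed(self) -> float:
        # Detect rate over only the cases the model actually got to.
        return self.found / self.completed

    @property
    def pct_of_total(self) -> float:
        # Detect rate over all cases; unfinished cases count as misses.
        return self.found / self.total


results = [
    Result("gpt-5.5-pro", found=2, completed=4),
    Result("mimo-v2.5-pro", found=4.5, completed=9),  # completion assumed
]

# Two possible sort orders for the same data.
by_completed = sorted(results, key=lambda r: r.pct_of_completed, reverse=True)
by_total = sorted(results, key=lambda r: r.pct_of_total, reverse=True)

for r in results:
    print(f"{r.model}: {r.pct_of_completed:.0%} of completed, "
          f"{r.pct_of_total:.0%} of total")
```

Dividing by completed cases gives both models 50%, so any tie-break decides the order; dividing by all 9 cases drops gpt-5.5-pro to about 22% and puts mimo-v2.5-pro first.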

SwellJoe 12 hours ago||
Yeah, I'm not super happy with the chart sorting order, but trying to balance all the information is challenging. I chose not to include partials (right place, inaccurate bug description, so it smelled something funny but didn't quite understand it) in the sort order, but maybe should.

And, it does feel wrong that the unrealistically expensive model that no one in their right mind would use for anything but the most critical tasks (and even then, a committee of ten of the best alternatives would cost half as much) is at the top. But, GPT 5.5 Pro did find a bug nobody else found among the four cases it got to, hinting at some real difference. It may be closer to Mythos than others, but at an absurd price. It'd cost tens of thousands of dollars to audit all the files in a large codebase, versus maybe fifty bucks for MiMo or DeepSeek.

wald3n 21 hours ago||
The benchmark fills an interesting niche, but the methods need work considering how many caveats are included in the results.
SwellJoe 20 hours ago|
And, I also said in the post that I'm still working on it.
catigula 14 hours ago||
What year are we in?

>I am skeptical of the reasons given publicly, I suspect it’s really just so much more expensive to operate than their current models that they don’t want to offer it broadly, yet, given the difficulty they’ve had growing capacity to keep up with use. But, are they telling the truth about how good it is at finding security vulnerabilities or is it just more hype?

Meanwhile,

1. Mythos is banned by the government per reality.

2. The NSA said it hacked all of their systems in hours per multiple sources.

3. The Five Eyes spy agencies said we're about to have an AI global catastrophe in a few months per the Guardian.

SwellJoe 11 hours ago|
The post was published on May 30, and written over a few days before that. Well before Fable was banned. And, before the NSA hacking thing. But, I am skeptical of the AI global catastrophe; it still feels like a mix of marketing hype and reality, and it can be difficult to separate the two when it comes from the hype men who run the AI companies.
mixmastamyk 21 hours ago||
Could someone point the thing at Ventoy please?
guessmyname 21 hours ago||
This Ventoy? → https://github.com/ventoy/Ventoy
RobertSponge 21 hours ago||
What’s with Ventoy?
mixmastamyk 11 hours ago||
Works at a very low level of course, pre-OS, filled with binary blobs, perfect for an xz-style or supply chain attack. I’ve not seen any confirmation, so just speculation.

Has not been famous enough so far to have someone invest in an audit, so this would probably be cheaper.

https://news.ycombinator.com/item?id=44810281

mcoliver 21 hours ago||
Gemini / Antigravity didn't use to be this hamstrung. Something changed within the past couple of months that makes security work very difficult. Even auditing or securing your own code now requires an insane amount of prompt engineering, which is utterly ridiculous and did not use to be required.
SwellJoe 20 hours ago|
Gemini CLI actually had an extension explicitly for security tasks: https://github.com/gemini-cli-extensions/security

But, Gemini CLI is deprecated. So, I tried to use Antigravity and it simply refused.

Weirdly, Gemma 4 has proven to be excellent at this task in subsequent tests. The best in its size/class. So, not everybody at Google is determined to break Google models for security work.

holoduke 21 hours ago||
Yesterday I wanted to delete records from a database on my own SSH server. It refused to do so, no matter what I prompted. Very annoying.
reinitctxoffset 22 hours ago||
Opus 4 class models are terrifying at infosec. They tie their shoelaces together on other things, but don't fuck with them on that. It's a savant thing.

A cursory reading of the model card shows Mythos/Fable is a fine tune on Project Zero with some steering on persistence.

But I think it's a valuable lesson: advertise your product as a nuclear weapon while microdosing at Lighthaven to enough Davos attendees and sooner or later? Someone is going to evaluate the claim from a chair where you act first and nuance later.

Wild that Amodei's blog and pod circuit are the greatest IPO risk.

eru 22 hours ago|
> Opus 4 class models are terrifying at infosec. They tie their shoelaces together on other things, but don't fuck with them on that. It's a savant thing.

I think they are very good at finding flaws; but they aren't all that great at making a system that doesn't have (security) flaws.

tptacek 22 hours ago|||
What makes you say that? I think they're better than replacement-level developers at making secure systems (I spent 20 years looking for vulnerabilities in human-written code as a full-time job).
eru 22 hours ago|||
See https://news.ycombinator.com/item?id=48640533 for some further elaboration.

These models are definitely a lot better than your run-of-the-mill human developer at finding security flaws in existing systems. I'm agnostic about how good they are at actually making a secure system. Probably better, too, for two reasons:

- humans are really terrible

- the model probably has an easier time picking up special purpose tools you can use to write proven secure systems

I don't think Mythos can write secure C code, either. Practically no one can. (At least not directly. See how seL4 is officially written in C; but they didn't just set out to carefully write secure C code directly; C just happens to be an intermediate language they use.)

sscaryterry 22 hours ago|||
Agreed. In the right hands, they can perform magic.
reinitctxoffset 22 hours ago|||
You are not wrong, but there's an asymmetry here: run adversarial self play and low-pass filter.
eru 22 hours ago||
Mostly right. However there's an extra assumption I didn't explicitly state:

Almost all existing real-world software is full of holes and security flaws. Mythos is better than humans at uncovering many of them, especially because its time is a lot cheaper than that of top-tier human experts (and even of mid- and low-tier human experts).

Especially when these systems are written in notoriously unreliable languages like C.

I don't think Mythos is especially good at writing systems that are free of security problems. Essentially the only way we know to get there is by proving your software correct.

In principle, you can even prove C correct, but in practice you'll want to write your system from the ground up to be proven correct instead of adding that property after the fact; and for that you'll most likely also want to pick a language that supports this better.

See https://en.wikipedia.org/wiki/SeL4 for a noteworthy example.
