Small models also found the vulnerabilities that Mythos found

Posted by dominicq 4 hours ago

Small models also found the vulnerabilities that Mythos found (aisle.com)

391 points | 116 comments | page 2

chirau 3 hours ago|

Their isolation approach is totally different from Mythos's approach, though. Mythos had to evaluate whole codebases rather than isolated sections. It's like saying one dog walked into the Amazon jungle and found a tennis ball, and then another team isolated a 1 square kilometer area that they knew the ball was definitely in and found the same ball.

hakanderyal 35 minutes ago||

Even that would be a more meaningful test. They basically coated the ball with a strong smell, then they prepped the dog with that smell, then set it loose in a 5x5 meter area.

"Our tests gave models the vulnerable function directly, often with contextual hints (e.g., "consider wraparound behavior")."

kennywinker 3 hours ago||

I don’t think Mythos can ingest an entire codebase into context, so it’s spinning off sub-agents to process chunks. Which supports their thesis: the harness is the moat. The tooling is what’s important; the model is far, far less important.
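As a rough illustration of what "process chunks" could mean, here is a minimal sketch that greedily packs source files into chunks under a size budget, one chunk per sub-agent. The file names, sizes, and budget are made up, and nothing here is claimed to be how Mythos or any real harness splits code.

```python
def chunk_by_budget(units, budget):
    """Greedily pack (name, size) units into chunks whose total size stays within budget.

    A unit larger than the budget ends up in a chunk of its own.
    """
    chunks, current, used = [], [], 0
    for name, size in units:
        # Close the current chunk if adding this unit would exceed the budget.
        if current and used + size > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        chunks.append(current)
    return chunks


# Hypothetical files with sizes in tokens; the budget is an assumed context limit.
files = [("a.c", 400), ("b.c", 700), ("c.c", 300), ("d.c", 1500), ("e.c", 200)]
for chunk in chunk_by_budget(files, budget=1000):
    print(chunk)
```

A real harness would more likely split along function or module boundaries than by raw size, which is where most of the tooling effort would go.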

bhouston 3 hours ago||

Mythos was clear it was one agent per chunk. But these positive confirming results do not actually disprove anything about Mythos, because they are only one side of the discriminator challenge: you got positives, but we do not know your false positive rate or your false negative rate.
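To see why both rates matter, here is a back-of-the-envelope sketch of what a scan would report at scale. All numbers are hypothetical and not taken from either write-up.

```python
def expected_findings(n_functions, n_vulnerable, true_positive_rate, false_positive_rate):
    """Expected true alarms, false alarms, and precision for one scan."""
    true_alarms = n_vulnerable * true_positive_rate
    false_alarms = (n_functions - n_vulnerable) * false_positive_rate
    total = true_alarms + false_alarms
    precision = true_alarms / total if total else None
    return true_alarms, false_alarms, precision


# Assumed scenario: 100,000 functions, 10 of them vulnerable, a detector that
# catches every real bug but also flags 1% of the safe functions.
true_alarms, false_alarms, precision = expected_findings(100_000, 10, 1.0, 0.01)
print(f"true alarms: {true_alarms:.0f}, false alarms: {false_alarms:.0f}, precision: {precision:.1%}")
```

Under those assumptions the detector finds all 10 bugs but buries them in roughly 1,000 false alarms, so a perfect hit rate on known positives says little about usefulness.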

kennywinker 3 hours ago||

In TFA they talk a fair bit about how different models perform wrt false positives:

“The results show something close to inverse scaling: small, cheap models outperform large frontier ones.”

coppsilgold 2 hours ago||

LLMs are wordsmith oracles. A lot of effort went into trying to coax interactive intelligence from them but the truth is that you could have probably always harnessed the base models directly to do very useful things. The instruct tuned models give your harness even more degrees of freedom.

A while ago, the autoresearch[1] harness went viral, yet it's but a highly simplified version of AlphaEvolve[2][3][4].

In the cybersecurity context, you can envision a clever harness that probes every function in a codebase for vulnerabilities, then bubbles the candidates up to their call sites (and probes whether the vulnerability can be triggered from there) and then all the way to an interface (such as a syscall) where a potential exploit can be manifested. And those would be the low-hanging fruit; other vulnerabilities may require the interplay of multiple functions, or race conditions.
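A minimal sketch of the bubbling-up step, assuming a call graph is already available. The graph, the function names, and `probe` (a stand-in for a model call) are all hypothetical, and this only checks reachability from an entry point rather than re-probing at each call site.

```python
from collections import deque

# Hypothetical call graph: caller -> callees. Names are invented for illustration.
CALL_GRAPH = {
    "sys_read": ["vfs_read"],
    "vfs_read": ["copy_chunk"],
    "copy_chunk": ["checked_len"],
    "debug_dump": ["format_row"],
}
ENTRY_POINTS = {"sys_read"}  # externally reachable interfaces, e.g. syscalls


def probe(function_name):
    """Stand-in for a model call that flags a function as suspicious."""
    return function_name in {"checked_len", "format_row"}


def callers_of(graph):
    """Invert caller -> callees into callee -> callers."""
    reverse = {}
    for caller, callees in graph.items():
        for callee in callees:
            reverse.setdefault(callee, []).append(caller)
    return reverse


def path_to_entry(candidate, graph, entry_points):
    """Breadth-first walk up the callers; return a path to an entry point, or None."""
    reverse = callers_of(graph)
    queue = deque([[candidate]])
    seen = {candidate}
    while queue:
        path = queue.popleft()
        if path[-1] in entry_points:
            return path
        for caller in reverse.get(path[-1], []):
            if caller not in seen:
                seen.add(caller)
                queue.append(path + [caller])
    return None


def triage(graph, entry_points):
    """Probe every function, then map each flagged one to its path to an interface."""
    functions = set(graph) | {c for callees in graph.values() for c in callees}
    candidates = sorted(f for f in functions if probe(f))
    return {c: path_to_entry(c, graph, entry_points) for c in candidates}


for name, path in triage(CALL_GRAPH, ENTRY_POINTS).items():
    print(name, "->", path)
```

Here `checked_len` is reachable from `sys_read` and would be escalated, while `format_row` has no path to an entry point and would be deprioritized.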

[1] <https://github.com/karpathy/autoresearch>

[2] <https://deepmind.google/blog/alphaevolve-a-gemini-powered-co...>

[3] <https://arxiv.org/abs/2506.13131>

[4] <https://github.com/algorithmicsuperintelligence/openevolve>

slibhb 1 hour ago||

The best way to think of Anthropic's communication about Mythos is as advertising. It's basically "our model is too smart to release", which suggests they're ahead of OpenAI (without proof).

pardon_me 24 minutes ago|

The whole company is like that. If things were as amazing as advertised, they wouldn't even need to advertise. Or to release models to the public at all.

npilk 1 hour ago||

Wouldn't this mean we're even more cooked? I've seen this page cited a few times as evidence that Mythos is no big deal, but if true then the same big deal is already out there with other models today.

davebren 1 hour ago|

As cooked as we were pre-LLMs knowing that security exploits are relatively easy to learn about online and use, yet things keep chugging along.

dominicq 1 hour ago||

This would just speed up the discovery -> patch cycle, at least until such time that all the low hanging fruit (=represented in training data) is patched.

Though another possibility would be that since LLMs generate so much code, the LLM vulnerability discovery would just keep chugging along and we'd simply settle for the same amount of potential vulns, same relative vulnerability-exploit-patch dynamics, though higher in absolute numbers.

abel_ 2 hours ago||

This misses the broader ongoing trend. For a few million dollars, of course you can create a startup that builds tools it can use to more efficiently find code vulnerabilities. And of course you can do this with weaker models with scaffolds that incorporate lots of human understanding. The difference now is that you don't need an expensive team, nor a bunch of human heuristics, nor a million dollars. The requisite cost and skill are falling rapidly.

bhouston 3 hours ago||

This is quite misleading.

If you isolate the positive cases and then ask a tool to label them and it labels them all positive, that doesn't prove anything. This is a one-sided test and it is really easy to write a tool that passes it -- just always return true!

You need to test your tool on both positive and negative cases and check if it is accurate on both.

If you don't, you could end up with hundreds or thousands of false positives when using this on real-world samples.

The real test is to use it to find new real bugs in the midst of a large code base.
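A minimal sketch of the point, using placeholder snippets rather than real code: a detector that always returns true looks perfect on a positives-only set and useless once negatives are included.

```python
def always_true(_code):
    """The degenerate detector: flags everything as vulnerable."""
    return True


def rates(detector, samples):
    """samples: list of (code, is_vulnerable). Returns (recall, false_positive_rate)."""
    tp = sum(1 for code, label in samples if label and detector(code))
    fp = sum(1 for code, label in samples if not label and detector(code))
    positives = sum(1 for _, label in samples if label)
    negatives = len(samples) - positives
    recall = tp / positives if positives else None
    fpr = fp / negatives if negatives else None  # undefined without negatives
    return recall, fpr


# Hypothetical labelled snippets; the strings are placeholders.
known_vulnerable = [("snippet_a", True), ("snippet_b", True)]
known_safe = [("snippet_c", False), ("snippet_d", False), ("snippet_e", False)]

print(rates(always_true, known_vulnerable))               # (1.0, None)
print(rates(always_true, known_vulnerable + known_safe))  # (1.0, 1.0)
```

On the positives-only set the false positive rate cannot even be computed, which is exactly the gap in a one-sided test.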

operatingthetan 3 hours ago||

My theory is that Mythos is basically just Opus with revised context window handling and more compute thrown at it. So while it will be a step forward, it is probably primarily hype.

appcustodian2 2 hours ago|

N model is basically just N-1 model with revised context window handling and more compute thrown at it

amazingamazing 3 hours ago||

Did Mythos isolate the code to begin with? Without a clear methodology that can be attempted with another model, the whole thing is meaningless.

bhouston 3 hours ago||

They did do one agent per code chunk, yes. But the key is that their agent had to identify when there was a vulnerability and when there wasn't. This "small model" test only had to label the known positive cases as positive -- which any function that simply returns "true" can do. This whole test setup is annoying because it proves nothing.

aniceperson 3 hours ago|||

To be fair, the last post I saw from Anthropic on finding a Linux kernel vulnerability was a while loop per file, prompting "there is a vulnerability here, find it". More important than that, no frontier model can keep the entire Linux kernel in context, so there definitely is code isolation, either explicit or implicit (the model itself delegates smaller chunks of code to subagents).

loeg 3 hours ago||

No. How would it? Before the vulns were identified by Mythos, no one knew what the relevant portion to isolate was.

yalogin 2 hours ago||

Intuitively, every existing model has already been trained on all code, all reported vulnerabilities, and all security papers. So they all have the capability. Small models fall short because they may not be able to find a vulnerability that spans a long function chain, but for the most part they should suffice too.

Of course, I say this without any knowledge of what Mythos is doing or how it’s different. I am sure it’s somehow different.

nomel 1 hour ago|

Not intuitive at all. Not all models are equally capable, just because they had the same training data. The model architecture (as a whole) is very important. To reduce capability, you can reduce layers, tool use, thinking, quantize it, etc. This is trivially proven by a cursory glance in the rough direction of any set of benchmarks (or actual use).

Using small models as a classifier ("there might be a vulnerability here") is probably reasonable, if you have a model capable of proving it. There are many companies attempting this without the verification step, resulting in AI vulnerability checkers being banned left and right because of the nonsense noise.
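A minimal sketch of that flag-then-verify shape. Both stages are toy string checks standing in for a small-model classifier and a stronger verifier; the function names and snippets are hypothetical.

```python
def two_stage_scan(functions, cheap_flag, verify):
    """Report only candidates that the cheap stage flags and the verifier confirms."""
    report = []
    for name, source in functions.items():
        if not cheap_flag(source):
            continue  # most code never reaches the expensive stage
        evidence = verify(source)
        if evidence is not None:
            report.append((name, evidence))
    return report


# Toy stand-ins. A real cheap_flag would be a small-model call and a real
# verify would try to produce a reproducer; both are assumptions here.
def cheap_flag(source):
    return "memcpy" in source


def verify(source):
    has_copy = "memcpy(dst, src, len)" in source
    has_check = "if (len" in source
    return "copy without a length check" if has_copy and not has_check else None


functions = {
    "parse_header": "memcpy(dst, src, len);",
    "copy_name": "if (len <= cap) memcpy(dst, src, len);",
    "add": "return a + b;",
}
print(two_stage_scan(functions, cheap_flag, verify))
```

The cheap stage flags two of the three functions, but only `parse_header` survives verification and gets reported, which is what keeps the noise away from maintainers.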

throwaway13337 2 hours ago|

So there are two competing narratives:

1. Mythos uniquely is able to find vulnerabilities that other LLMs cannot practically.

2. All LLMs could already do this but no one tried the way Anthropic did.

The truth is one of these. And it comes down to whether the comparison is apples to apples. Since we don't know the exact specifics of how either test was performed, we lack a way of knowing absolutely.

So I guess, like so many things today, we get to pick the truth we find most comfortable personally.
