Posted by dominicq 5 hours ago

Small models also found the vulnerabilities that Mythos found (aisle.com)
542 points | 157 comments
yalogin 3 hours ago|
Intuitively, every existing model has already been trained on all public code, all reported vulnerabilities, and all security papers. So they all have the capability. Small models fall short because they may not be able to find a vulnerability that spans a large function chain, but for the most part they should suffice too.

Of course I say this without any knowledge of what mythos is doing or how it’s different. I am sure it’s somehow different

nomel 3 hours ago|
Not intuitive at all. Not all models are equally capable, just because they had the same training data. The model architecture (as a whole) is very important. To reduce capability, you can reduce layers, tool use, thinking, quantize it, etc. This is trivially proven by a cursory glance in the rough direction of any set of benchmarks (or actual use).

Using small models as a classifier ("there might be a vulnerability here") is probably reasonable, if you have a model capable of proving it. There are many companies attempting this without the verification step, resulting in AI vulnerability checkers being banned left and right for the nonsense noise.
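The flag-then-verify split described above can be sketched as a two-stage pipeline. Everything here is a hypothetical illustration: both "model" calls are stand-in stubs (simple string heuristics), not real inference APIs.

```python
# Hypothetical sketch of a flag-then-verify pipeline: a cheap "small model"
# pass flags candidate snippets, and only flagged candidates go to an
# expensive "verifier". Both model calls are invented stubs for illustration.

def small_model_flags(snippet: str) -> bool:
    """Cheap classifier stub: flag anything touching risky memory APIs."""
    risky = ("strcpy", "memcpy", "sprintf", "gets")
    return any(token in snippet for token in risky)

def verifier_confirms(snippet: str) -> bool:
    """Expensive verifier stub: pretend only unbounded copies are provable."""
    return "strcpy" in snippet or "gets" in snippet

def triage(snippets):
    """Return only candidates that survive both stages."""
    flagged = [s for s in snippets if small_model_flags(s)]
    return [s for s in flagged if verifier_confirms(s)]

corpus = [
    "strcpy(dst, src);",      # flagged, and the verifier confirms it
    "memcpy(dst, src, n);",   # flagged, but the verifier rejects it
    "return a + b;",          # never flagged
]
print(triage(corpus))  # ['strcpy(dst, src);']
```

The point of the second stage is exactly the noise problem: the cheap pass alone would have reported the `memcpy` line too.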

npilk 3 hours ago||
Wouldn't this mean we're even more cooked? I've seen this page cited a few times as evidence that Mythos is no big deal, but if true then the same big deal is already out there with other models today.
davebren 2 hours ago|
As cooked as we were pre-LLMs knowing that security exploits are relatively easy to learn about online and use, yet things keep chugging along.
dominicq 2 hours ago||
This would just speed up the discovery -> patch cycle, at least until such time that all the low hanging fruit (=represented in training data) is patched.

Though another possibility would be that since LLMs generate so much code, the LLM vulnerability discovery would just keep chugging along and we'd simply settle for the same amount of potential vulns, same relative vulnerability-exploit-patch dynamics, though higher in absolute numbers.

mrifaki 4 hours ago||
Finding vulns in a large codebase is a search problem with a huge negative space, and what aisle measured is classification accuracy on ground-truth positives. Those are different tasks, so a model that correctly labels a pre-isolated vulnerable function tells me almost nothing about that model's ability to surface the same function out of a million lines of unrelated code under a realistic triage budget.

The experiment I'd want to see is running each of the small models as an unsupervised scanner across full FreeBSD, returning the top-k suspicious functions per model, and computing precision at recall levels that correspond to real analyst triage budgets. If Mythos's findings show up in the small models' top 100, I'd call that meaningful, but if they only surface under 10k false positives then the cost advantage collapses, because analyst triage time is more expensive than frontier model compute to begin with.
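The metric behind a "top-k under a triage budget" evaluation is just precision@k. A minimal sketch, with invented function ids, scores, and ground-truth labels:

```python
# Minimal precision@k sketch for the scanner evaluation described above.
# Function ids and ground truth are invented for illustration.

def precision_at_k(ranked_ids, true_vuln_ids, k):
    """Fraction of the top-k ranked candidates that are real vulnerabilities."""
    top_k = ranked_ids[:k]
    hits = sum(1 for fid in top_k if fid in true_vuln_ids)
    return hits / k

# Scanner output: function ids sorted by suspicion score, highest first.
ranked = ["f17", "f03", "f88", "f42", "f09", "f55"]
# Ground truth: functions that actually contain the known bugs.
truth = {"f03", "f42", "f99"}

print(precision_at_k(ranked, truth, k=4))  # 2 of the top 4 are real -> 0.5
```

An analyst triage budget of k=100 per model, as suggested above, would just be `precision_at_k(ranked, truth, k=100)` over the full ranking.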

The second thing I keep coming back to is that the $20k Mythos number is a search budget, not a model cost. Small models at one hundredth the per-token price don't give us one hundredth the total budget when the search process is the same shape: I still run thousands of iterations, and the real issue for autonomous vuln research is how fast the reward signal converges. The aisle post doesn't touch any of this.
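The budget-versus-cost distinction above can be made concrete with back-of-the-envelope arithmetic. The $20k figure is from the comment; the per-iteration prices are invented for illustration, and everything is in integer cents to keep the arithmetic exact:

```python
# Back-of-the-envelope: a 100x cheaper per-token price only helps if
# iteration count is the bottleneck. If the reward signal converges after a
# fixed number of iterations, the saving is per-iteration cost, not "100x
# more search". Per-iteration prices are hypothetical.

budget_cents = 20_000 * 100   # the $20k search budget, in cents
frontier_iter_cents = 200     # hypothetical: $2.00 per frontier-model iteration
small_iter_cents = 2          # hypothetical: 1/100th the price

# Iterations each model affords if you spend the whole budget on search:
print(budget_cents // frontier_iter_cents)  # 10000
print(budget_cents // small_iter_cents)     # 1000000

# But if the search converges after ~10k iterations regardless of model,
# the real comparison is total cost at convergence:
iters_to_converge = 10_000
print(iters_to_converge * frontier_iter_cents / 100)  # $20000.0
print(iters_to_converge * small_iter_cents / 100)     # $200.0
```

Under these made-up numbers the small model wins on cost at convergence, but only if it converges in a comparable number of iterations, which is exactly the open question.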

_pdp_ 2 hours ago||

  find ./ \( -name '*.c' -o -name '*.cpp' \) -exec agent.sh -p "can you spot any vulnerabilities in {}" \;
elzbardico 4 hours ago||
I think that Mythos's mojo probably comes from a lot of post-training on this kind of task.

I occasionally pick up contract work doing coding annotation to make some quick extra money, and a few months ago one of the projects was heavily focused on spotting common memory access bugs in C and C++.

Retr0id 5 hours ago||
And what about the false-positive rate?
dataflow 4 hours ago|
Yeah, this is the critical question. If the model ends up flagging too much, triaging its output could end up being as much work as a manual read of the code.
nickdothutton 4 hours ago||
PoC or GTFO should apply to AI models too, or the false-positive rate will overwhelm.
TacticalCoder 4 hours ago||
I don't dispute the fact that it's more than cool that we have a new tool to find security exploits (and do many other things), but... a big shout-out to OpenBSD?

We're literally talking about the biggest computers on the planet ever, trained with the biggest amount of data ever available to a system, with the biggest investment ever made by man or close to it and...

The subtlest security bug it could find required going 28 years into the past to find a...

Denial-of-service?

A freaking DoS? Not a remote root exploit. Not a local exploit.

Just a DoS? And it had to go into 28-year-old code to find that?

So kudos, hats off, deep bow not to Mythos but to OpenBSD? Just a bit, no!?

JackYoustra 5 hours ago|
> Isolated the relevant code

I mean isn't that most of it? If you put a snippet of code in front of me and said "there's probably a vulnerability here" I could probably spend a few hours (a much lower METR time!) and find it. It's a whole other ballgame to ask me with no context to come up with an exploit.

kennywinker 5 hours ago|
Sure. But it’s a computer. You can run “there’s probably a vulnerability here” as many times as you like. And it’s easier and cheaper to run it many times with a small open model than a big frontier model.

It also sounds like that is how Mythos works too. Which makes sense: the Linux kernel is too big to fit in context.

JackYoustra 5 hours ago||
No, it sounds like Mythos is just doing parallel trajectories. That's pretty distinct!