Posted by HieronymusBosch 1 day ago
A bug is a bug. A “potential vulnerability” is a bug. A vulnerability is a bug verified as having security implications, via a proof of concept or other substantial evidence.
Words matter. Bugs matter. It’s important to fix large numbers of bugs, just as it always has been, and as has always been done. Let that be impressive on its own, because it IS impressive.
Mythos didn’t write 271 PoCs for vulnerabilities and demonstrate code-path reachability with security implications. Mythos found 271 valid bugs. Let that be enough.
> As additional context, we apply security severity ratings from critical to low to indicate the urgency of a bug:
> * sec-critical and sec-high are assigned to vulnerabilities that can be triggered with normal user behavior, like browsing to a web page. We make no technical difference between these, but sec-critical bugs are reserved for issues that are publicly disclosed or known to be exploited in the wild.
> * sec-moderate is assigned to vulnerabilities that would otherwise be rated sec-high but require unusual and complex steps from the victim.
> * sec-low is assigned to bugs that are annoying but far from causing user harm (e.g., a safe crash).
> Of the 271 bugs we announced for Firefox 150: 180 were sec-high, 80 were sec-moderate, and 11 were sec-low.
Mozilla uses the term "vulnerability" even for sec-high, even though they say right below that it doesn't mean the same thing as a practical exploit. And on their definitional page, they classify even sec-low as "vulnerabilities" [2].
Words are tools that get their utility from collective meaning. I'd be interested in where you received your semantics from, and whether they match up with or diverge from Mozilla's.
[1] https://hacks.mozilla.org/2026/05/behind-the-scenes-hardenin...
[2] https://wiki.mozilla.org/Security_Severity_Ratings/Client
In general, I would say that our use of "vulnerability" lines up with what jerrythegerbil calls "potential vulnerability". (In cases with a POC, we would likely use the word "exploit".) Our goal is to keep Firefox secure. Once it's clear that a particular bug might be exploitable, it's usually not worth a lot of engineering effort to investigate further; we just fix it. We spend a little while eyeballing things for the purpose of sorting into sec-high, sec-moderate, etc., and to help triage incoming bugs, but if there's any real question, we assume the worst and move on.
So were all 271 bugs exploitable? Absolutely not. But they were all security bugs according to the normal standards that we've been applying for years.
(Partial exception: there were some bugs that might normally have been opened up, but were kept hidden because Mythos wasn't public information yet. But those bugs would have been marked sec-other, and not included in the count.)
So if you think we're guilty of inflating the number of "real" vulnerabilities found by Mythos, bear in mind that we've also been consistently inflating the baseline. The spike in the Firefox Security Fixes by Month graph is very, very real: https://hacks.mozilla.org/2026/05/behind-the-scenes-hardenin...
If you look closely at, say, this patch, you might get a sense of what I mean (although the real cleverness is in the testcase, which we have not made public): https://hg-edge.mozilla.org/integration/autoland/rev/c29515d...
What is the point of keeping it private? I'd bet that feeding this patch to Opus and asking it to look for the specific TOCTOU issue fixed by the patch would make it come up with a testcase sooner or later.
The code before the patch does not look obviously wrong. Now some more lines have been added, but would you say it looks less obviously wrong, or more obviously correct?
It seems that the invariants needed here live either in some people's heads or in some document that is not referenced.
Reading the code for the first time, the immediate question is: "What other lines might be missing? How can I tell?"
If the "obviously correct" level of the code does not increase for a human reviewer, how is it ensured that a similar problem will not arise in the future? Or do we need more LLMs to tell us which other lines need to be added?
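For anyone who hasn't run into the pattern before, here is a minimal generic TOCTOU sketch in Rust. It's a filesystem race, purely illustrative and unrelated to the actual patch; the function name is made up:

```rust
use std::fs;
use std::io;

fn read_if_regular_file(path: &str) -> io::Result<String> {
    let meta = fs::symlink_metadata(path)?; // time of check
    if !meta.is_file() {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "not a regular file"));
    }
    // Window: another process can swap `path` for a symlink here,
    // so the check above proves nothing about the read below.
    fs::read_to_string(path) // time of use
}

fn main() -> io::Result<()> {
    println!("{}", read_if_regular_file("/etc/hostname")?);
    Ok(())
}
```

The defining feature is that each step looks locally correct; the bug lives in the unstated invariant that nothing changes between the two operations.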
I did get Opus to do an audit for similar problems elsewhere, to supplement the investigations that we were already doing by hand. It initially thought it found something, but when asked to produce a testcase, it thought for 20 minutes and admitted defeat. I suspect that the difference between Opus and Mythos is in small edges like this: if Mythos is smart enough to spot why Opus's discovery didn't work a little bit faster, and it can waste less time chasing down red herrings, then it's more likely to find a real bug within the limits of a context window. It's not that Opus completely lacks some capability, it's that it has trouble chaining all the pieces together consistently.
I've assumed I could send an agent using a publicly available model bug hunting in a codebase like this and get tons of results, assuming I wanted to burn the tokens, so it's really unclear to me whether the Mythos hype is justified or if it's just an easy button (and subsidized tokens?) to do what is already possible.
So the best answer I can give is: I dunno, maybe it's possible to find bugs like this using Opus, but if so, where are they? Did nobody think to try "please find the bug in this code" pre-Mythos? I've done enough auditing with Opus to be convinced that it can be a good assistant to somebody who already knows what they're doing, but in practice the big wave of AI-discovered bugs started with Mythos.
I'm sure lots of people have assumed they could send a publicly available model bug hunting and find things. I have not noticed a huge amount of success. We've had some very nice correctness bugs reported, but skimming through the list of security bugs I've fixed recently, the AI-related ones all seem to be Mythos.
My best guess is that Mythos is just enough better along just enough axes that its hit rate on finding potential bugs and filtering out the real ones from the hallucinations is good enough to matter. Like, there's no obvious qualitative difference between 3.6 kg of uranium-232 and 3.8 kg of uranium-232, just a small quantitative increase. But if you form both of them into spheres, only one of them has reached critical mass. Can you do something clever to reach critical mass with 3.6 kg of uranium? Maybe! But needing to do something clever is a non-trivial barrier in itself.
I'm going to disagree in the specific case of Firefox. First, although it has diverged a long way from its roots, Mozilla still has the community project ideal in its DNA. Enough, at least, that I stumbled while reading the clause "from outside" -- if you're finding and reporting actual relevant security bugs, you're already on the inside. SpiderMonkey in particular still has a good amount of code being written and even maintained by non-employees. (Examples: Temporal and LoongArch64 JIT support).
Second, the bug bounty program still exists[0] and is being used. If someone were sitting on a pile of AI-discovered exploits, that pile has monetary value which is rapidly draining away the longer the exploits go unreported.[1] That's an incentive to put in the work to report them properly.
Third, I agree that finding bugs is likely not the bottleneck. Validating them is. With previous models, the false positive rate was too high so they required too much work to whittle down to the valid ones. A PoC is a very strong signal that a bug is valid, and that's where I just don't believe you: without a really good harness, I don't think Opus was good enough to find very many bugs with PoCs. It could find some, just not very many.[2]
[0] For now. It remains to be seen how it will adapt to the AI age. For the moment, it hasn't been severely nerfed like Google's.
[1] One could make the argument that people who are inexpert enough to only be able to poke an AI to find bugs are also the people more likely to sell them on the black market rather than disclose them. It seems plausible. Still, some people would be disclosing, and not many were filing quality bugs pre-Mythos. Some were, but it was a trickle compared to post-Mythos.
[2] Also note that I personally, as a SpiderMonkey developer, don't find a huge amount of value in the AI-generated patches that accompany these bug reports. Sometimes they're useful to better illustrate the problem, especially since the AI's problem analysis is usually subtly wrong in important ways. They can be a decent starting point for a real patch. But I'll still need to go through my own process of figuring out what the right fix is, even in the handful of cases where I end up with the same thing the AI did.
I was wondering this too. By working directly with tech companies and (one assumes) subsidizing tokens, they're empowering the people on the inside who absolutely want to have the bugs fixed.
Who outside of Mozilla is going to pay and spend the effort to find Firefox bugs? Sure, some hobbyists and contributors might, but they don't have the institutional knowledge of the codebase that can help guide an agent's prompts, nor do they have strong incentives to try to report them, nor do they necessarily have the time to craft good bug reports that stand out from the slop reports.
My assumption would be that most people working to discover bugs this way in Firefox are interested in using them rather than getting them fixed, so maintainers wouldn't necessarily even know the degree to which it was already happening.
We have many outside contributors who have successfully submitted security bugs and received payments.
At Mozilla, but not everywhere: exploits are a subset of vulnerabilities are a subset of bugs.
Much as GitHub calls everything an "issue" and GitLab a "work item".
I'm genuinely curious what "types" of implementation mistakes these were, e.g. whether they were library usage bugs, state management bugs, control flow bugs, etc.
Would love to see a writeup about these findings. Maybe Mythos is hinting that better fuzzing tools are needed?
In this particular sense, AI tends to find bugs that are closer to what we'd see from a human researcher reading the code. Fuzz bugs are often more "here's a seemingly innocuous sequence of statements that randomly happen to collide three corner cases in an unexpected way".
Outside of SpiderMonkey, my understanding is that many of the best vulnerabilities were in code that is difficult to fuzz effectively for whatever reason.
That being said, I think there's a lot of potential for synergy here: if LLMs make writing code easier, that includes fuzzers, so maybe fuzzers will also end up finding a lot more bugs. I saw somebody on Twitter say they used an LLM to write a fuzzer for Chrome and found a number of security bugs that they reported.
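For a sense of how little scaffolding such a fuzzer needs, here is a minimal cargo-fuzz target sketch in Rust; `my_parser::parse` is a hypothetical stand-in for whatever code is under test:

```rust
// fuzz/fuzz_targets/parse.rs: a libFuzzer-style target via cargo-fuzz.
#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|data: &[u8]| {
    // Throw raw bytes at the code under test; sanitizers and
    // assertions do the actual bug detection.
    let _ = my_parser::parse(data);
});
```

Writing the harness was never the hard part for experts, but a model that can stamp these out cheaply lowers the cost of covering code that nobody ever got around to fuzzing.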
Security fixes are mentioned in the Release Notes [b], pointing to a completely different document [d].
Perhaps sometimes a bug is 'just' a bug, and not a vulnerability.
[a] https://bugzilla.mozilla.org/show_bug.cgi?id=2034980 ; "Can't highlight image scans in Firefox 150+"
[b] https://www.firefox.com/en-CA/firefox/150.0.2/releasenotes/
[c] https://bugzilla.mozilla.org/show_bug.cgi?id=2024918
[d] https://www.mozilla.org/en-US/security/advisories/mfsa2026-4...
That’s not evident in what you pasted at all.
What you pasted says
> sec-critical and sec-high are assigned to vulnerabilities that can be triggered with normal user behavior […] We make no technical difference between these […] sec-critical bugs are reserved for issues that are publicly disclosed or known to be exploited in the wild.
> sec-low is assigned to bugs that are annoying but far from causing user harm (e.g., a safe crash).
From this one infers that the "180 were sec-high" bugs found are actually exploitable (just not known to be exploited in the wild), and are NOT mere annoying bugs.
The difference between 180 and 271 does nothing to deflate the significance, or lack thereof, of the implication re: Mythos.
For us this is substantial enough evidence to consider it a security vulnerability at that point, unless shown otherwise, and it has always been this way (also for fuzzing bugs).
But report [1] says that "Some of these bugs showed evidence of memory corruption...", which implies that the majority of these (which include the 271 bugs from Mythos) don't have such evidence at all. Am I misunderstanding something?
> For us this is substantial enough evidence to consider it a security vulnerability at that point
Mythos is supposed to be pretty good at writing actual exploits, so (as I understand it) there shouldn't be any serious problem with checking whether a bug is a vulnerability or not.
[1] https://www.mozilla.org/en-US/security/advisories/mfsa2026-3...
This is just the standard sentence we've been using for years. It has nothing to do with Mythos, and for Mythos almost all bugs show evidence of memory corruption (we do have a handful of bugs in JS IPC / JS Actors; one is in the blog post).
> Mythos is supposed to be pretty good at writing actual exploits, so (as I understand it) there shouldn't be any serious problem with checking whether a bug is a vulnerability or not.
Yes, but if we have a choice between writing exploits and scanning more source, potentially finding more bugs, then of course we prioritize the latter.
I'm guessing a bit, but for example: out of bounds reads are not memory corruption. Assertion failures in debug builds are also usually not memory corruption, and I'd guess that many of these bugs were found through assertions. (Some parts of Firefox like the SpiderMonkey JS engine make heavy use of assertions, and that's the biggest signal used for defect validation. An assertion firing is almost always treated as a real and serious problem. Though with our harness, Opus and Mythos try to come up with an exploit PoC anyway.)
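As a generic illustration of why assertions are such a strong signal (a Rust sketch, not SpiderMonkey code; the function is hypothetical):

```rust
/// # Safety
/// Callers must guarantee `i < v.len()`.
unsafe fn byte_at_unchecked(v: &[u8], i: usize) -> u8 {
    // Fires loudly in debug builds; compiled out in release builds.
    debug_assert!(i < v.len(), "caller violated i < v.len()");
    // If the invariant is violated, this is a silent out-of-bounds
    // read in release builds unless a sanitizer is watching.
    unsafe { *v.get_unchecked(i) }
}

fn main() {
    let v = [1u8, 2, 3];
    // Fine: the invariant holds. Passing 3 would trip the debug assertion.
    println!("{}", unsafe { byte_at_unchecked(&v, 2) });
}
```

A firing debug assertion pinpoints the violated invariant at the moment of violation, which is much cheaper to triage than the downstream memory corruption it prevents.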
My only source for this is personal experience, and no, I can't share any evidence of it.
But if you ask it to get you a shell it’ll probably tell you to get lost.
I think the word you're looking for is exploit?
https://hacks.mozilla.org/2026/05/behind-the-scenes-hardenin...
So while Mythos certainly is real, I think you could do the same with Deepseek Pro, GPT 5.5, etc...
When I hear that "we found X bugs using some new tool", where the standard for bugs is low and doesn't necessarily require user impact in realistic scenarios, I think to myself: duh! You went looking for bugs, of course you found them.
For a sufficiently complicated product, in my experience, you don't have to look far.
That's the "'No Way to Prevent This,' Says Only Nation Where This Regularly Happens" of unsafe languages.
There are huge swathes of problems we know how to categorically prevent, but some people won't do it because they're more comfortable believing it was never preventable than accepting any culpability for not preventing it previously.
This new post makes it pretty clear that this was all bolted on top of their existing fuzzing infrastructure, and really just used to get more and better initial hits for a very skilled team to look at. I assume Anthropic was giving them a very good deal on inference for the positive PR, but I believe these other reports and suspect Mozilla did not really need them.
The skill required to find and then create zero-days is quickly approaching the floor.
Then they loop over a codebase like this. This way you always point a model at a 'known' bug. And I assume a smaller context window helps with quality.
Not entirely sure it's obviously proprietary.
It's better because it actually lists a sample of Bugzilla reports that were made public. This topic was discussed previously (36 comments two weeks ago: https://news.ycombinator.com/item?id=47885042), but the part about bug reports being made public is brand new.
This isn’t sarcasm. Firefox deserves to be used more. Most people I know don’t use it because “Chrome does almost everything better”, and Firefox can’t compete with the other browsers’ roadmaps.
Totally agree. I even go as far as choosing which websites I make purchases on depending on whether they work in FF, or writing to support occasionally to tell them the site isn't supported or a feature isn't working properly, and that fixing it would be appreciated.
I know it pretty much always goes nowhere, but I feel it's what I can do to keep the browser somehow on the radar.
Part of the problem is that when they stop working on fixing bugs, they start doing Mr Robot things... We just want a web browser. Nobody asked for Pocket, or AI...
If they use AI to fix all the bugs, then what else is there for them to do, other than maintain syntax compatibility with the various languages they build with? They're just going to go back to making the browser trash again.
(Don't worry- I use the system browser for any site I don't fully trust.)
If Mozilla created some proprietary LLM or harness that they used internally to outpace Chrome that may be a different story, though I also don't see that happening.
Same with my wife: after I explained things to her and she understood how different the internet experience can be, it's now her primary browser.
So please don't frame the argument as 'here is a crappy underdog, but please use it because monopoly is bad and Google is a bit evil'. It's a first-class experience in everything I have ever thrown at it. Triple that on mobile: by far the best and most useful mobile experience, bar none.
FF works for me.
I ran Zoom in my Firefox desktop browser for a while, but it tended to overheat my laptop. Other things overheat it too, so I don't know how much was specific to Zoom on Firefox.
I just checked. Still gives me the option ("Join from browser" in a less highlighted option, trying to drive you to their native client I guess.)
That's a really good use case for LLMs. It also applies to things like finding proofs in Lean and creating test stimulus. In both cases you know automatically whether the output is good, and it doesn't really matter if it isn't.
That isn't the case for most bugs, and definitely isn't the case for actually fixing bugs.
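To make the Lean half of that concrete: a proof artifact is self-checking, so it doesn't matter how the model produced it. A trivial example, leaning on `Nat.add_comm` from the Lean 4 library:

```lean
-- Machine-checkable output: either this elaborates or it doesn't,
-- so verifying an LLM's proof attempt is automatic.
theorem add_comm' (a b : Nat) : a + b = b + a := Nat.add_comm a b
```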
Firefox is written in several languages; only about 25% of it is in C++, but every single one of these issues seems to touch the C++.
Sure, but surely AddressSanitizer would also detect the same problems in the C or Rust, which together also make up about 25% of Firefox, so... ?
From what I can tell, a lot of these bugs were hardly C++-specific, they just happened in C++ code. Even the most secure Rust can't magically catch things like TOCTOU issues.
I suppose it depends on what the word "magically" means. A TOCTOU race happens because you imagined things wouldn't change, but they did; and in Rust you actually do write fewer patterns with this mistake, because of the mutable-xor-aliased rule. If we have at least one immutable reference to a Goose, then Rust isn't OK with anybody mutating the Goose: your safe Rust can't do that, and unsafe Rust mustn't do that. So the TOCTOU race caused by "oops, I forgot somebody else might change the Goose" is less likely, because you were made to wrestle with this problem during design; the safe Rust where you just forgot about this doesn't compile.
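A minimal sketch of that rule in action; the compiler rejects this program:

```rust
struct Goose { feathers: u32 }

fn main() {
    let mut goose = Goose { feathers: 100 };
    let checked = &goose.feathers; // time of check: shared borrow
    goose.feathers = 0;            // error[E0506]: cannot assign to
                                   // `goose.feathers` because it is borrowed
    println!("still {checked}");   // time of use keeps the borrow alive
}
```

The check-to-use window simply can't stay open across a mutation in safe Rust, which is the sense in which the design pressure is real even if it isn't magic.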
And I presume you can run AddressSanitizer with Rust, but given that Rust is memory-safe by default, it's only going to find issues in `unsafe` code, which is a tiny, tiny fraction of most code. Google had a blog post a few months ago where they managed to put some actual numbers on this, because they almost shipped one Rust memory safety bug.
Some of this is tempered if the pattern is that Mythos finds bugs mostly in dusty old C++ but the rates are much, much lower in newer C++, the reverse of Google's earlier finding for human researchers.
The answer is both of those. They didn't ask for bugs in the Rust code because it wouldn't have found any. They've explicitly set it up to only look for memory safety bugs. It's not going to find any in a memory safe language.
Read this: https://blog.google/security/rust-in-android-move-fast-fix-t...
Exactly the same as using the memory-safe subset of Python or Java.
The 70% number Google claims is either BS or Google-specific, as other projects have reported far lower numbers.
No, there are simply too few memory safety bugs in Rust projects for AI to find any. It found 271 bugs in Firefox, so you're talking about around 0.3 bugs found in the same amount of Rust.
> The 70% number Google claims is either BS or Google-specific, as other projects have reported far lower numbers.
The post I linked didn't mention 70%, so I guess you didn't read it. And if you're talking about the "70% of C/C++ security bugs are due to memory safety" stat, then no, it isn't bullshit. The same (or a very similar) number has been found by numerous companies and projects. Not that that stat is relevant here.
Curl reported 40%, and more recently that dropped to about 20% of issues caused by their use of C, even with the requirement to stick to old C89. OpenBSD reported 30%. I assume the 70% either has to do with C++ or, more likely, there is a huge selection bias.
I mean, it's not supposed to find any in the unsafe language either, but that's why it was used.
Firefox not only uses unstable Rust features (via the same exemption mechanism Linux uses: trained professionals, closed course, do not attempt at home); it also presumably has some volume of its own explicitly unsafe Rust. So there's no reason this could not be checked, and what makes the difference here is whether it was or was not.
No, it is supposed to find them in C++, because we all know humans are fallible and it's super easy to write memory errors in C++.
The whole point of Rust is that the borrow checker is infallible (pretty much anyway).
> it also presumably has some volume of its own explicitly unsafe Rust
"Some volume" is so tiny as to be irrelevant. There's no point going to this effort if Rust memory safety vulnerabilities are 1000 times less frequent than in C++.
That number is not made up. See https://blog.google/security/rust-in-android-move-fast-fix-t...
I'd like to understand whether Rust was skipped because they assumed it would be fine, skipped purely as happenstance, or in fact tested and found not to be a problem. I don't like assuming things when I could measure instead.
Ha yes.
> Strict No LLM / No AI Policy
> No LLMs for issues.
> No LLMs for pull requests.
> No LLMs for comments on the bug tracker, including translation. English is encouraged, but not required. You are welcome to post in your native language and rely on others to have their own translation tools of choice to interpret your words.
If they would accept issues filed by AI or written by AI, they should edit their policy to say that.
If you’re saying their philosophy is compatible with LLM issues, I agree and I think they should change their policy to reflect that.
I eventually left and wound up at Mozilla where there were a number of /* flawfinder ignore */ comments scattered throughout the code.
My guess is that Mythos just ignored the "flawfinder ignore" directives and reported the known vulnerabilities in the code.
I wonder if these models will get good + cheap enough so that people rarely reach for static analysis.
Using LLM coding tools to stay on top of static analysis tool output works very well and adding some guard rails that enforce that there are no issues is probably a good idea. Just like adding CI checks to make sure everything is clean.
As for false positives, it depends on the tool. I tend to avoid tools that generate mostly noise. Most of these tools allow you to disable rules if they produce a lot of noise. Or you can just tell the LLM to fix all the issues. When it's cheaper to fix things than to argue with the rule, just fix it. That used to be really expensive when you had to do that manually. Now it isn't.
I recently did this to an Ansible code base that I needed to refresh after not touching it for a few years. It had hundreds of ansible-lint issues, mostly deprecation warnings and some other non-fatal warnings. Ten minutes later I had zero. Mostly they probably weren't very serious ones, but it's a form of technical debt. If you have to fix hundreds of warnings manually, you are probably not going to do it. But if you can wave a magic wand and it all goes away, why not? I adjusted the guard rails so it now always runs ansible-lint and fixes any issues. It only takes a few seconds extra.
I maintain a static analysis tool used in Firefox's CI. False positives have to be fixed or annotated as non-problems in order for you to land a patch in our tree. That means permitting zero positives (false or true), which is a strict threshold. This is a conscious tradeoff; it requires weakening the analysis and accepting some false negatives (missed bugs) in order to keep the signal-to-noise ratio high enough that people don't just ignore the tool, annotate everything away, or stop running it. Nearly all static analysis tools have to do this balancing act.
AI, as commonly used, is given more leeway. It's kind of fundamental that it must be allowed to hallucinate false positives; that's the source of much of its power. Which means you need layers of verification and validation on top of it. It can be slow, you'll never be able to say "it catches 100% of the errors of this particular form: ...", and yet it catches so much stuff.
Data point: my analysis didn't cover one case that I erroneously thought was unlikely to produce true positives (real bugs), and was more complex to implement than seemed worth the trouble. Opus or Mythos, I'm not sure which, started reporting vulnerabilities stemming from that case, so I scrambled and extended the analysis to cover the gap. It took me long enough to implement that by the time I had a full scan of the source tree, Claude had found every important problem that it reported. The static analysis found several others, and I still honestly don't know whether any of them could ever be triggered in practice.
I still think there's value in the static analysis. Some of those occurrences of the problematic pattern might be reachable now through paths too tricky for the AI to construct. Some of them might turn into real problems when other code changes. It seems worth having fixes for all of them now for both possibilities, and also for the lesser reason of not wanting the AI to waste time trying to exploit them. At the same time, clearly the cost/benefit balance has shifted.
They could also team up: if I relax my standards and allow my analysis to write an additional warnings report of suspected problems, with the clear expectation that they might be false alarms, then I could feed that list to an AI to validate them. Essentially, feed slop to the slop machine and have it nondeterministically filter out the diamonds in the rough.
Food for thought...
It's for detecting a specific situation: you grab a pointer to a GC-managed object, call something that might possibly trigger a GC even if it probably won't, and then use the pointer. (The GC might collect that object, or it might move it somewhere else.)
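Here's the shape of the hazard as a Rust sketch, with a made-up `Heap`/`Obj` arena standing in for the real GC heap (the actual analysis runs over Firefox's C++, where the pointer would be an unrooted GC pointer):

```rust
struct Obj { value: u64 }
struct Heap { objects: Vec<Box<Obj>> }

impl Heap {
    // Stand-in for "anything that might allocate, and therefore might
    // collect": here it simply frees everything.
    fn maybe_gc(&mut self) { self.objects.clear(); }
}

// This compiles; executing it is a use-after-free, which is exactly
// the pattern the analysis flags.
fn hazard(heap: &mut Heap) -> u64 {
    let p: *const Obj = &*heap.objects[0]; // grab a pointer to a managed object
    heap.maybe_gc();                       // call something that might trigger GC
    unsafe { (*p).value }                  // use the (now possibly dangling) pointer
}
```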
Claude is pretty good at weaponizing these UAFs.
I think these harnesses are _using_ static analysis tools, and probably will continue to do that.
From what I understand, that is a recipe for getting quickly banned by commercial LLM providers?