Posted by todsacerdoti 18 hours ago
https://blog.mozilla.org/en/firefox/hardening-firefox-anthro...
https://www.wsj.com/tech/ai/send-us-more-anthropics-claude-s...
That’s a different kind of productivity but equally valuable.
I couldn't bring myself to switch to the (even) more press-releasey title.
2. I wouldn't say LLMs are "better" than other fuzzers. Someone would need to measure findings/cost for that. But many LLMs do work at a higher level than most fuzzers, as they can generate plausible-looking source code.
I think the practical move is to combine them: have an LLM produce multi-step flows or corpora and seed a fuzzer with them, or use the model to script Playwright or Puppeteer scenarios that reproduce deep state transitions and then let coverage-guided fuzzing mutate around those seeds. Expect tradeoffs, though: LLM outputs hallucinate plausible but untriggerable exploit chains and generate a lot of noisy candidates, so you still need sanitizers, deterministic replay, and manual validation, while fuzzers demand instrumentation and long runs to actually reach complex stateful behavior.
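The seeding idea above can be sketched in a few lines. This is a minimal illustration, not any particular fuzzer's API: the "LLM output" is a hardcoded stand-in for model-generated testcases, and the mutation step is a simplified AFL-style havoc pass. A real setup would point libFuzzer or AFL++ at the resulting corpus directory.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <random>
#include <string>
#include <vector>

// AFL-style havoc step: flip a bit, insert a byte, or delete a byte,
// a few times per variant. Real fuzzers layer many more strategies.
std::vector<uint8_t> mutate(std::vector<uint8_t> buf, std::mt19937& rng) {
    std::uniform_int_distribution<int> opDist(0, 2);
    int steps = 1 + static_cast<int>(rng() % 4);
    for (int step = 0; step < steps; ++step) {
        switch (opDist(rng)) {
        case 0: // bit flip
            if (!buf.empty())
                buf[rng() % buf.size()] ^= uint8_t(1u << (rng() % 8));
            break;
        case 1: // insert a random byte
            buf.insert(buf.begin() + rng() % (buf.size() + 1),
                       uint8_t(rng() % 256));
            break;
        case 2: // delete a byte
            if (buf.size() > 1)
                buf.erase(buf.begin() + rng() % buf.size());
            break;
        }
    }
    return buf;
}

// Write each LLM-produced seed plus a handful of mutated variants into
// a corpus directory for a coverage-guided fuzzer to start from.
void writeCorpus(const std::filesystem::path& dir,
                 const std::vector<std::string>& seeds,
                 int variantsPerSeed = 8) {
    std::filesystem::create_directories(dir);
    std::mt19937 rng(0); // fixed seed for reproducible corpora
    for (size_t i = 0; i < seeds.size(); ++i) {
        std::ofstream(dir / ("seed_" + std::to_string(i)), std::ios::binary)
            .write(seeds[i].data(),
                   static_cast<std::streamsize>(seeds[i].size()));
        std::vector<uint8_t> base(seeds[i].begin(), seeds[i].end());
        for (int j = 0; j < variantsPerSeed; ++j) {
            auto v = mutate(base, rng);
            std::ofstream(dir / ("seed_" + std::to_string(i) + "_var_" +
                                 std::to_string(j)),
                          std::ios::binary)
                .write(reinterpret_cast<const char*>(v.data()),
                       static_cast<std::streamsize>(v.size()));
        }
    }
}
```

The division of labor is the point: the model contributes structurally valid, deep-state seeds it is good at, while coverage feedback does the blind exploration around them that models are bad at.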
That's because there were none. All bugs came with verifiable testcases (crash tests) that crashed the browser or the JS shell.
For the JS shell, as with fuzzing, a small fraction of these bugs were in the shell itself (i.e. testing-only) -- but according to our fuzzing guidelines, these are not false positives, and they will also be fixed.
There's some nuance here. I fixed a couple of shell-only Anthropic issues. At least mine were cases where the shell-only testing functions created situations that are impossible to create in the browser. Or at least, after spending several days trying, I managed to prove to myself that it was just barely impossible. (And it had been possible until recently.)
We do still consider those bugs and fix them one way or the other -- if the bug really is unreachable, then the testing function can be weakened (and assertions added to make sure it doesn't become reachable in the future). For the actual cases here, it was easier and better to fix the bug and leave the testing function in place.
We love fuzz bugs, so we try to structure things to make invalid states as brittle as possible so the fuzzers can find them. Assertions are good for this, as are testing functions that expose complex or "dangerous" configurations that would otherwise be hard to set up just by spewing out bizarre JS code or whatever. It causes some level of false positives, but it greatly helps the fuzzers find not only the bugs that are there, but also the ones that will be there in the future.
(Apologies for amusing myself with the "not only X, but also Y" writing pattern.)
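The pattern described above -- assertions that make invalid states brittle, plus testing-only hooks that expose "dangerous" configurations -- looks roughly like this. A generic sketch with illustrative names, not SpiderMonkey's actual code; in a real engine the testing hook would only be compiled into fuzzing/testing builds.

```cpp
#include <cassert>
#include <cstddef>

// Toy bump allocator standing in for some engine-internal structure.
class BumpBuffer {
    size_t capacity_;
    size_t used_ = 0;

public:
    explicit BumpBuffer(size_t capacity) : capacity_(capacity) {}

    // Returns the offset of the newly allocated region.
    size_t allocate(size_t n) {
        size_t offset = used_;
        used_ += n;
        // Brittle invariant: any overcommit trips the assertion at once,
        // so a fuzzer that reaches this state crashes visibly instead of
        // corrupting memory silently.
        assert(used_ <= capacity_ && "BumpBuffer overcommitted");
        return offset;
    }

    // Shell-only testing function: shrink capacity under live data to
    // force edge cases that are hard to reach from ordinary JS input.
    // Its own assertion keeps it from manufacturing states that should
    // be impossible even in testing.
    void testingShrinkCapacity(size_t newCapacity) {
        assert(newCapacity >= used_ && "cannot shrink below live data");
        capacity_ = newCapacity;
    }
};
```

The assertion is what converts "the bug that will be there in the future" into an immediate, reproducible crash the moment a fuzzer finds a path into the invalid state.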
Did you also test on old source code, to see if it could find the vulnerabilities that were already discovered by humans?
“Our first step was to use Claude to find previously identified CVEs in older versions of the Firefox codebase. We were surprised that Opus 4.6 could reproduce a high percentage of these historical CVEs”
Then, with each model having a different training epoch, you end up with no useful comparison for deciding whether new models are improving the situation. I don't doubt they are; I'm just not sure this is a way to show it.
I think it was curl that closed its bug bounty program due to AI spam.
The curl situation was completely different because as far as I know, these bugs were not filed with actual testcases. They were purely static bugs and those kinds of reports eat up a lot of valuable resources in order to validate.
The level of AI spam for Firefox security submissions is a lot lower than the curl people have described. I'm not sure why that is. Maybe the size of the code base and the higher bar to submitting issues plays a role.
> Opus 4.6 is currently far better at identifying and fixing vulnerabilities than at exploiting them. This gives defenders the advantage. And with the recent release of Claude Code Security in limited research preview, we’re bringing vulnerability-discovery (and patching) capabilities directly to customers and open-source maintainers.
> But looking at the rate of progress, it is unlikely that the gap between frontier models’ vulnerability discovery and exploitation abilities will last very long. If and when future language models break through this exploitation barrier, we will need to consider additional safeguards or other actions to prevent our models from being misused by malicious actors.
> We urge developers to take advantage of this window to redouble their efforts to make their software more secure. For our part, we plan to significantly expand our cybersecurity efforts, including by working with developers to search for vulnerabilities (following the CVD process outlined above), developing tools to help maintainers triage bug reports, and directly proposing patches.
1. Accompanying minimal test cases
2. Detailed proofs-of-concept
3. Candidate patches
This is most similar to fuzzing -- in fact, it could be considered another variant of fuzzing -- so I'll compare to that. Good fuzzing also provides minimal test cases. The Anthropic ones were not only minimal but well-commented, with a description of what each test was up to and why.

The detailed descriptions of what it thought the bug was were useful, even though they were the typical AI-generated descriptions: 80% right and 20% totally off base but plausible-sounding. Normally I don't pay a lot of attention to a bug filer's speculations as to what is going wrong, since they rarely have the context to make a good guess, but Claude's were useful and served as a better starting point than my usual "run it under a debugger and trace out what's happening" approach. As usual with AI, you have to be skeptical and not get suckered in by things that sound right but aren't, but that's not hard when you have a reproducible test case provided and you yourself can compare Claude's explanations with reality.

The candidate patches were kind of nice. I suspect they were more useful for validating and improving the bug reports (and these were very nice bug reports): if you're making a patch based on the description of what's going wrong, then that description can't be too far off base if the patch fixes the observed problem. The patches didn't attempt to be any wider in scope than they needed to be for the reported bug, so I ended up writing my own. But I'd rather they not guess at the "right" fix; that's just another place to go wrong.
I think the "proofs-of-concept" were the attempts to use the test case to get as close to an actual exploit as possible? I think those would be more useful to an organization that is doubtful of the importance of bugs. Particularly in SpiderMonkey, we take any crash or assertion failure very seriously, and we're all pretty experienced in seeing how seemingly innocuous problems can be exploited in mind-numbingly complicated ways.
The Anthropic bug reports were excellent, better even than our usual internal and external fuzzing bugs and those are already very good. I don't have a good sense for how much juice is left to squeeze -- any new fuzzer or static analysis starts out finding a pile of new bugs, but most tail off pretty quickly. Also, I highly doubt that you could easily achieve this level of quality by asking Claude "hey, go find some security bugs in Firefox". You'd likely just get AI slop bugs out of that. Claude is a powerful tool, but the Anthropic team also knew how to wield it well. (They're not the only ones, mind.)