Posted by todsacerdoti 18 hours ago
https://blog.mozilla.org/en/firefox/hardening-firefox-anthro...
https://www.wsj.com/tech/ai/send-us-more-anthropics-claude-s...
That’s a different kind of productivity but equally valuable.
I couldn't bring myself to switch to the (even) more press-releasey title.
2. I wouldn't say LLMs are "better" than other fuzzers. Someone would need to measure findings/cost for that. But many LLMs do work at a higher level than most fuzzers, as they can generate plausible-looking source code.
I think the practical move is to combine them: have an LLM produce multi-step flows or corpora and seed a fuzzer with them, or use the model to script Playwright or Puppeteer scenarios that reproduce deep state transitions and then let coverage-guided fuzzing mutate around those seeds. Expect tradeoffs, though: LLM outputs hallucinate plausible but untriggerable exploit chains and generate a lot of noisy candidates, so you still need sanitizers, deterministic replay, and manual validation, while fuzzers demand instrumentation and long runs to actually reach complex stateful behavior.
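The seeding idea above can be sketched in a few lines. This is a minimal illustration, not any particular fuzzer's API: the "LLM output" is a hardcoded stand-in for model-generated testcases, and the mutation step is a simplified AFL-style havoc pass. A real setup would point libFuzzer or AFL++ at the resulting corpus directory.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <random>
#include <string>
#include <vector>

// AFL-style havoc step: flip a bit, insert a byte, or delete a byte,
// a few times per variant. Real fuzzers layer many more strategies.
std::vector<uint8_t> mutate(std::vector<uint8_t> buf, std::mt19937& rng) {
    std::uniform_int_distribution<int> opDist(0, 2);
    int steps = 1 + static_cast<int>(rng() % 4);
    for (int step = 0; step < steps; ++step) {
        switch (opDist(rng)) {
        case 0: // bit flip
            if (!buf.empty())
                buf[rng() % buf.size()] ^= uint8_t(1u << (rng() % 8));
            break;
        case 1: // insert a random byte
            buf.insert(buf.begin() + rng() % (buf.size() + 1),
                       uint8_t(rng() % 256));
            break;
        case 2: // delete a byte
            if (buf.size() > 1)
                buf.erase(buf.begin() + rng() % buf.size());
            break;
        }
    }
    return buf;
}

// Write each LLM-produced seed plus a handful of mutated variants into
// a corpus directory for a coverage-guided fuzzer to start from.
void writeCorpus(const std::filesystem::path& dir,
                 const std::vector<std::string>& seeds,
                 int variantsPerSeed = 8) {
    std::filesystem::create_directories(dir);
    std::mt19937 rng(0); // fixed seed for reproducible corpora
    for (size_t i = 0; i < seeds.size(); ++i) {
        std::ofstream(dir / ("seed_" + std::to_string(i)), std::ios::binary)
            .write(seeds[i].data(),
                   static_cast<std::streamsize>(seeds[i].size()));
        std::vector<uint8_t> base(seeds[i].begin(), seeds[i].end());
        for (int j = 0; j < variantsPerSeed; ++j) {
            auto v = mutate(base, rng);
            std::ofstream(dir / ("seed_" + std::to_string(i) + "_var_" +
                                 std::to_string(j)),
                          std::ios::binary)
                .write(reinterpret_cast<const char*>(v.data()),
                       static_cast<std::streamsize>(v.size()));
        }
    }
}
```

The division of labor is the point: the model contributes structurally valid, deep-state seeds it is good at, while coverage feedback does the blind exploration around them that models are bad at.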
That's because there were none. All bugs came with verifiable testcases (crash tests) that crashed the browser or the JS shell.
For the JS shell, as with fuzzing, a small fraction of these bugs were in the shell itself (i.e. testing-only) -- but according to our fuzzing guidelines, these are not false positives, and they will also be fixed.
There's some nuance here. I fixed a couple of shell-only Anthropic issues. At least mine were cases where the shell-only testing functions created situations that are impossible to create in the browser. Or at least, after spending several days trying, I managed to prove to myself that it was just barely impossible. (And it had been possible until recently.)
We do still consider those bugs and fix them one way or the other -- if the bug really is unreachable, then the testing function can be weakened (and assertions added to make sure it doesn't become reachable in the future). For the actual cases here, it was easier and better to fix the bug and leave the testing function in place.
We love fuzz bugs, so we try to structure things to make invalid states as brittle as possible so the fuzzers can find them. Assertions are good for this, as are testing functions that expose complex or "dangerous" configurations that would otherwise be hard to set up just by spewing out bizarre JS code or whatever. It causes some level of false positives, but it greatly helps the fuzzers find not only the bugs that are there, but also the ones that will be there in the future.
(Apologies for amusing myself with the "not only X, but also Y" writing pattern.)
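The pattern described above -- assertions that make invalid states brittle, plus testing-only hooks that expose "dangerous" configurations -- looks roughly like this. A generic sketch with illustrative names, not SpiderMonkey's actual code; in a real engine the testing hook would only be compiled into fuzzing/testing builds.

```cpp
#include <cassert>
#include <cstddef>

// Toy bump allocator standing in for some engine-internal structure.
class BumpBuffer {
    size_t capacity_;
    size_t used_ = 0;

public:
    explicit BumpBuffer(size_t capacity) : capacity_(capacity) {}

    // Returns the offset of the newly allocated region.
    size_t allocate(size_t n) {
        size_t offset = used_;
        used_ += n;
        // Brittle invariant: any overcommit trips the assertion at once,
        // so a fuzzer that reaches this state crashes visibly instead of
        // corrupting memory silently.
        assert(used_ <= capacity_ && "BumpBuffer overcommitted");
        return offset;
    }

    // Shell-only testing function: shrink capacity under live data to
    // force edge cases that are hard to reach from ordinary JS input.
    // Its own assertion keeps it from manufacturing states that should
    // be impossible even in testing.
    void testingShrinkCapacity(size_t newCapacity) {
        assert(newCapacity >= used_ && "cannot shrink below live data");
        capacity_ = newCapacity;
    }
};
```

The assertion is what converts "the bug that will be there in the future" into an immediate, reproducible crash the moment a fuzzer finds a path into the invalid state.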
Did you also test on old source code, to see if it could find the vulnerabilities that were already discovered by humans?
“Our first step was to use Claude to find previously identified CVEs in older versions of the Firefox codebase. We were surprised that Opus 4.6 could reproduce a high percentage of these historical CVEs”
Then, with each model having a different training epoch, you end up with no useful comparison for deciding whether new models are improving the situation. I don't doubt they are; I'm just not sure this is a way to show it.
I think it was curl that closed its bug bounty program due to AI spam.
The curl situation was completely different because as far as I know, these bugs were not filed with actual testcases. They were purely static bugs and those kinds of reports eat up a lot of valuable resources in order to validate.
The level of AI spam for Firefox security submissions is a lot lower than the curl people have described. I'm not sure why that is. Maybe the size of the code base and the higher bar to submitting issues plays a role.
> Opus 4.6 is currently far better at identifying and fixing vulnerabilities than at exploiting them. This gives defenders the advantage. And with the recent release of Claude Code Security in limited research preview, we’re bringing vulnerability-discovery (and patching) capabilities directly to customers and open-source maintainers.
> But looking at the rate of progress, it is unlikely that the gap between frontier models’ vulnerability discovery and exploitation abilities will last very long. If and when future language models break through this exploitation barrier, we will need to consider additional safeguards or other actions to prevent our models from being misused by malicious actors.
> We urge developers to take advantage of this window to redouble their efforts to make their software more secure. For our part, we plan to significantly expand our cybersecurity efforts, including by working with developers to search for vulnerabilities (following the CVD process outlined above), developing tools to help maintainers triage bug reports, and directly proposing patches.
1. Accompanying minimal test cases
2. Detailed proofs-of-concept
3. Candidate patches
This is most similar to fuzzing -- in fact, it could be considered another variant of fuzzing -- so I'll compare to that. Good fuzzing also provides minimal test cases. The Anthropic ones were not only minimal but well-commented, with a description of what each test was up to and why.

The detailed descriptions of what it thought the bug was were useful, even though they were the typical AI-generated descriptions: 80% right and 20% totally off base but plausible-sounding. Normally I don't pay a lot of attention to a bug filer's speculations as to what is going wrong, since they rarely have the context to make a good guess, but Claude's were useful and served as a better starting point than my usual "run it under a debugger and trace out what's happening" approach. As usual with AI, you have to be skeptical and not get suckered in by things that sound right but aren't, but that's not hard when you have a reproducible test case provided and you yourself can compare Claude's explanations with reality.

The candidate patches were kind of nice. I suspect they were more useful for validating and improving the bug reports (and these were very nice bug reports): if you're making a patch based on the description of what's going wrong, then that description can't be too far off base if the patch fixes the observed problem. The patches didn't attempt to be any wider in scope than they needed to be for the reported bug, so I ended up writing my own. But I'd rather they not guess at the "right" fix; that's just another place to go wrong.
I think the "proofs-of-concept" were the attempts to use the test case to get as close to an actual exploit as possible? I think those would be more useful to an organization that is doubtful of the importance of bugs. Particularly in SpiderMonkey, we take any crash or assertion failure very seriously, and we're all pretty experienced in seeing how seemingly innocuous problems can be exploited in mind-numbingly complicated ways.
The Anthropic bug reports were excellent, better even than our usual internal and external fuzzing bugs and those are already very good. I don't have a good sense for how much juice is left to squeeze -- any new fuzzer or static analysis starts out finding a pile of new bugs, but most tail off pretty quickly. Also, I highly doubt that you could easily achieve this level of quality by asking Claude "hey, go find some security bugs in Firefox". You'd likely just get AI slop bugs out of that. Claude is a powerful tool, but the Anthropic team also knew how to wield it well. (They're not the only ones, mind.)