The math on the site linked here as a source for this claim is incorrect. The author of that site assumes that scrapers will keep track of the access tokens for a week, but most internet-wide scrapers don't do so. The whole purpose of Anubis is to be expensive for bots that repeatedly request the same site multiple times a second.
When reviewing it I noticed that the author carried the common misunderstanding that "difficulty" in proof of work is simply the number of leading zero bytes in a hash, which limits the granularity to coarse powers of 256 (each extra zero byte multiplies the expected work by 256). I realize that some of this is the cost of working in JavaScript, but the hottest code path seems to be written extremely inefficiently:
for (;;) {
  const hashBuffer = await calculateSHA256(data + nonce);
  const hashArray = new Uint8Array(hashBuffer);

  // Check whether the hash starts with the required number of zero bytes.
  let isValid = true;
  for (let i = 0; i < requiredZeroBytes; i++) {
    if (hashArray[i] !== 0) {
      isValid = false;
      break;
    }
  }

  if (isValid) break; // this nonce satisfies the difficulty
  nonce++;            // otherwise try the next one
}
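For what it's worth, the difficulty check doesn't have to be byte-granular at all. Here is a minimal sketch of a bit-level check (the function and parameter names are made up; this is not Anubis's actual code), which would let the operator pick any work factor between the powers of 256:

// Sketch only: count leading zero bits of the hash and compare against a
// bit-level difficulty. hashArray is a Uint8Array as above.
function hasLeadingZeroBits(hashArray, difficultyBits) {
  let zeroBits = 0;
  for (const byte of hashArray) {
    if (byte === 0) { zeroBits += 8; continue; }
    zeroBits += Math.clz32(byte) - 24; // leading zero bits inside this byte
    break;
  }
  return zeroBits >= difficultyBits;
}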
It wouldn’t be an exaggeration to say that a native implementation of the hash loop with even a hair of optimization could reduce the “proof of work” to being less time-intensive than the SSL handshake.

Proof of work can't function as a counter-abuse challenge even if you assume that the attackers have no advantage over the legitimate users (e.g. both are running exactly the same JS implementation of the challenge). The economics just can't work. The core problem is that the attackers pay in CPU time, which is fungible and incredibly cheap, while the real users pay in user-observable latency, which is hellishly expensive.
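To put very rough numbers on that asymmetry (every figure below is an assumption, not a measurement): say solving one challenge costs about a CPU-second and a vCPU rents for roughly $0.05 an hour.

// Back-of-envelope sketch; all constants here are assumptions.
const cpuSecondsPerChallenge = 1;      // assumed solve time per challenge
const dollarsPerVcpuHour = 0.05;       // assumed cloud/spot price
const pagesScraped = 1_000_000;

const attackerCost = pagesScraped * cpuSecondsPerChallenge * (dollarsPerVcpuHour / 3600);
console.log(attackerCost.toFixed(2));  // ~13.89 dollars for a million challenges
// Meanwhile every legitimate visitor eats that same ~1 second as pure latency.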
Specifically for Firefox [1] they switch to the JavaScript fallback because that's actually faster [2] (probably because of overhead):
> One of the biggest sources of lag in Firefox has been eliminated: the use of WebCrypto. Now whenever Anubis detects the client is using Firefox (or Pale Moon), it will swap over to a pure-JS implementation of SHA-256 for speed.
[0] https://developer.mozilla.org/en-US/docs/Web/API/SubtleCrypt...
[1] https://github.com/TecharoHQ/anubis/blob/main/web/js/algorit...
[2] https://github.com/TecharoHQ/anubis/releases/tag/v1.22.0
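A minimal sketch of what that kind of switch could look like (the UA test and the pure-JS fallback name are assumptions, not the actual Anubis code):

// Sketch: pick a SHA-256 implementation per engine. pureJsSha256 stands in
// for a bundled JS fallback like the one the release notes describe.
async function sha256(message) {
  const bytes = new TextEncoder().encode(message);
  const isGecko = /Firefox|PaleMoon|Goanna/i.test(navigator.userAgent);
  if (!isGecko && globalThis.crypto?.subtle) {
    return new Uint8Array(await crypto.subtle.digest('SHA-256', bytes)); // WebCrypto path
  }
  return pureJsSha256(bytes); // pure-JS fallback, reportedly faster on Gecko
}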
Why is this inefficient?
The point of the article is that if the scraper is sufficiently motivated, Anubis is not going to do much anyway, and if the scraper doesn't care, the same result can be achieved without annoying your actual users.
Am I missing something here? All this does is set an unencrypted cookie and reload the page, right?
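If that reading is right, the whole client-side step amounts to something on the order of this (a guess at the mechanism; the cookie name and lifetime are made up):

// Hypothetical sketch of a "prove you run JS" gate: set a cookie and reload;
// the server then waves through any request that presents it.
document.cookie = 'js_check=1; path=/; max-age=604800; SameSite=Lax';
location.reload();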
Anubis isn’t some conspiracy to show you pictures of anime catgirls, it’s a desperate attempt to stave off bot-driven downtime. Many admins who install it do so reluctantly, because obviously it is annoying to have a delay when you access a website. Nobody is doing that for fun.
(There are probably a few people who install it not to protect against scraper DDoS, but due to ideological opposition to AI scrapers. IMHO this is fruitless, as the more intelligent scrapers will find ways around it without calling attention to themselves. Anubis makes almost no sense on a static personal blog.)
E.g. if you open this in a browser, you’ll get the challenge: https://code.ffmpeg.org/FFmpeg/FFmpeg/commit/13ce36fef98a3f4...
But if you run this, you get the page content straight away:
curl https://code.ffmpeg.org/FFmpeg/FFmpeg/commit/13ce36fef98a3f4e6d8360c24d6b8434cbb8869b
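Presumably that's because the default policy only challenges requests whose User-Agent looks like a browser; if so, sending one should bring the challenge back (untested, and the UA string is just an example):

curl -A "Mozilla/5.0" https://code.ffmpeg.org/FFmpeg/FFmpeg/commit/13ce36fef98a3f4e6d8360c24d6b8434cbb8869b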
I’m pretty sure this gets abused by AI scrapers a lot. If you’re running Anubis, take a moment to configure it properly, or better, put together something that’s less annoying for your visitors, like the OP.

https://foundation.wikimedia.org/wiki/Policy:Wikimedia_Found...
I know because, as a matter of practice, I do not send one. Like I do with most www sites, I used Wikipedia for many years without ever sending a UA header. Never had a problem.
I read the www text-only: no graphical browser, no JavaScript.
[0]: https://en.wikipedia.org/wiki/User-Agent_header#Format_for_h...
[1]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
In practice, it hasn't been an issue for many months now, so I'm not sure why you're so sure. Disabling Anubis takes servers down; allowing curl bypass does not. What makes you assume that aggressive scrapers that don't want to identify themselves as bots will willingly identify themselves as bots in the first place?
With footnote:
"I don’t know if they have any good competition, but “Cloudflare” here refers to all similar bot protection services."
That's the crux. Cloudflare is the default; no one seems to bother taking the risk with a competitor, for some reason. Competitors seem to exist, but when asked, people can't even name them.
(For what it's worth I've been using AWS Cloudfront but I had to think a moment to remember its name.)
In the ongoing arms race, we're likely to see simple things like this sort of check result in a handful of detection systems that look for "set a cookie" or at least "open the page in headless chrome and measure the cookies."
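A sketch of the scraper side of that arms race using Puppeteer (the URL, the wait time, and the reuse strategy are all assumptions):

// Sketch: let a real browser engine pass the cookie check once, then replay
// the harvested cookies from a cheap HTTP client for the bulk of the crawl.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.org/', { waitUntil: 'networkidle0' });
  await new Promise(r => setTimeout(r, 5000)); // let any challenge script set its cookie
  const cookies = await page.cookies();        // harvest whatever got set
  console.log(cookies);
  await browser.close();
})();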
Does anyone have any proof of this?
I mean, they have access to a mind-blowing amount of computing resources, so using a fraction of that to improve the quality of the data is nothing to them; they have this fundamental belief (because it's convenient for their situation) that scale is everything, so why not run JS too? Heck, if they have to run a container with a full browser, not even headless, they will.
Navigate, take screenshots, etc.; it has something like 30 tools in it alone.
Now we can just run real browsers with LLMs attached. Idk how you even think about defeating that.
Admittedly, this is no different than the kinds of ways Anubis is hostile to those same users, truly a tragedy of the commons.
It makes (a) visual advertising and (b) tracking viable
I read the www text-only, with no auto-loading of resources (images, etc.), and I see no ads.
I work with ffmpeg so I have to access their bugtracker and mailing list site sometimes. Every few days, I'm hit with the Anubis block. And 1/3 - 1/5 of the time, it fails completely. The other times, it delays me by a few seconds. Over time, this has turned me sour on the Anubis project, which was initially something I supported.
That quote is a strong indication that he sees it this way.
Sounds like maybe it'll be fixed soon, though.
It's like airline check-in. Are we inconvenienced? Yes. Who is there to blame? Probably not the airline or the company that provides the services. Probably the people who want to fly without a ticket or bring in explosives.
As long as the Anubis project and the people behind it don't try to play both sides and don't make the LLM situation worse (mafia-racket style), I think if it works, it works.
I can't fully articulate it, but I feel like there is some game-theory aspect of the current design that's just not compatible with reality.
I have a personal website that sometimes doesn't get an update for a year. Still, bots make up the majority of visitors. (Not so much that I would need countermeasures, but still.) Most bot visits could be avoided with such a scheme.
That's how Technorati worked.
Why doesn't one company do it and then resell the data? Is it a legal/liability issue? If you scrape, it's a legal grey area, but if you sell what you scrape, it's clearly copyright infringement?
Edit: In fact, this whole idea is so stupid that I'm forced to consider whether it's just a DDoS in the original sense: scrape everything so hard it goes down, just so that your competitors can't.