Posted by flexagoon 1 day ago

You Don't Need Anubis (fxgn.dev)
171 points | 157 comments
uqers 1 day ago|
> Unfortunately, the price LLM companies would have to pay to scrape every single Anubis deployment out there is approximately $0.00.

The math on the site linked here as a source for this claim is incorrect. The author of that site assumes that scrapers will keep track of the access tokens for a week, but most internet-wide scrapers don't do so. The whole purpose of Anubis is to be expensive for bots that repeatedly request the same site multiple times a second.

drum55 1 day ago||
The "cost" of executing the JavaScript proof of work is fairly irrelevant, the whole concept just doesn't make sense with a pessimistic inspection. Anubis requires the users to do an irrelevant amount of sha256 hashes in slow javascript, where a scraper can do it much faster in native code; simply game over. It's the same reason we don't use hashcash for email, the amount of proof of work a user will tolerate is much lower than the amount a professional can apply. If this tool provides any benefit, it's due to it being obscure and non standard.

When reviewing it, I noticed that the author carries the common misunderstanding that "difficulty" in proof of work is simply the number of leading zero bytes in a hash, which limits the achievable difficulty levels to coarse powers of two. I realize that some of this is the cost of working in JavaScript, but the hottest code path seems to be written extremely inefficiently:

    for (;;) {
      const hashBuffer = await calculateSHA256(data + nonce);
      const hashArray = new Uint8Array(hashBuffer);

      // Check whether the first `requiredZeroBytes` bytes of the hash are zero.
      let isValid = true;
      for (let i = 0; i < requiredZeroBytes; i++) {
        if (hashArray[i] !== 0) {
          isValid = false;
          break;
        }
      }

      if (isValid) break; // this nonce meets the difficulty target
      nonce++;            // otherwise advance the nonce and hash again (continuation paraphrased)
    }
It wouldn’t be an exaggeration to say that a native implementation of this, with even a hair of optimization, could make the “proof of work” less time-intensive than the SSL handshake.
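
As a hedged illustration of the granularity point (a sketch, not the actual Anubis code): counting leading zero bits instead of whole bytes lets the difficulty be tuned in steps of 2x rather than 256x:

    // Sketch: difficulty expressed in leading zero *bits* of the hash,
    // so each extra bit of difficulty only doubles the expected work.
    function hasLeadingZeroBits(hashArray, bits) {
      const fullBytes = Math.floor(bits / 8);
      for (let i = 0; i < fullBytes; i++) {
        if (hashArray[i] !== 0) return false;
      }
      const remainder = bits % 8;
      if (remainder === 0) return true;
      // Check only the top `remainder` bits of the next byte.
      return (hashArray[fullBytes] >> (8 - remainder)) === 0;
    }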
jsnell 1 day ago|||
That is not a productive way of thinking about it, because it will lead you to the conclusion that all you need is a smarter proof of work algorithm. One that's GPU-resistant, ASIC-resistant, and native-code-resistant. That's not the case.

Proof of work can't function as a counter-abuse challenge even if you assume that the attackers have no advantage over the legitimate users (e.g. both are running exactly the same JS implementation of the challenge). The economics just can't work. The core problem is that the attackers pay in CPU time, which is fungible and incredibly cheap, while the real users pay in user-observable latency, which is hellishly expensive.
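
To put rough, assumed numbers on that asymmetry (a back-of-the-envelope sketch, not a measurement): at cloud prices of a few cents per vCPU-hour, even a challenge that burns a full second of CPU per page costs a scraper only a few dollars per million pages, while it costs every human visitor a full second of waiting:

    // Back-of-the-envelope sketch with assumed prices, not measurements.
    const vcpuPricePerHour = 0.03;   // USD, rough cloud spot-price assumption
    const challengeSeconds = 1;      // assumed CPU time per challenge
    const pages = 1_000_000;
    const scraperCost = (pages * challengeSeconds / 3600) * vcpuPricePerHour;
    console.log(scraperCost.toFixed(2)); // ≈ 8.33 USD for a million pages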

aniviacat 23 hours ago||||
They do use SubtleCrypto digest [0] in secure contexts, which does the hashing natively.

Specifically for Firefox [1], they switch to the pure-JavaScript fallback because that's actually faster [2] (probably because of overhead):

> One of the biggest sources of lag in Firefox has been eliminated: the use of WebCrypto. Now whenever Anubis detects the client is using Firefox (or Pale Moon), it will swap over to a pure-JS implementation of SHA-256 for speed.

[0] https://developer.mozilla.org/en-US/docs/Web/API/SubtleCrypt...

[1] https://github.com/TecharoHQ/anubis/blob/main/web/js/algorit...

[2] https://github.com/TecharoHQ/anubis/releases/tag/v1.22.0
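
For reference, the native path boils down to a call like this (a sketch of the general pattern, not the exact Anubis source):

    // Hash a string with the browser's built-in SHA-256 (secure contexts only).
    async function calculateSHA256(value) {
      const bytes = new TextEncoder().encode(value);
      return await crypto.subtle.digest("SHA-256", bytes);
    }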

xena 19 hours ago||||
If you can optimize it, I would love that as a pull request! I am not a JS expert.
gruez 18 hours ago|||
>but the hottest code path seems to be written extremely inefficiently.

Why is this inefficient?

tptacek 1 day ago|||
Right, but that's the point. It's not that the idea is bad. It's that PoW is the wrong fit for it. Internet-wide scrapers don't keep state? Ok, then force clients to do something that requires keeping state. You don't need to grind SHA2 puzzles to do that; you don't need to grind anything at all.
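
One hedged sketch of that idea (not anything tptacek or Anubis specifies): hand the client a signed token it must store and present on every later request. Verifying it is nearly free for the server, and the only thing it demands from the client is state:

    // Minimal sketch (Node.js, hypothetical names): a signed token the client
    // must keep and echo back as a cookie. No hashing puzzle, just state.
    const crypto = require("crypto");
    const SECRET = process.env.CHALLENGE_SECRET; // server-side secret, assumed set

    function issueToken(clientId) {
      const payload = `${clientId}.${Date.now()}`;
      const sig = crypto.createHmac("sha256", SECRET).update(payload).digest("hex");
      return `${payload}.${sig}`; // set as a cookie on the first visit
    }

    function verifyToken(token) {
      const i = token.lastIndexOf(".");
      const payload = token.slice(0, i);
      const sig = token.slice(i + 1);
      const expected = crypto.createHmac("sha256", SECRET).update(payload).digest("hex");
      if (sig.length !== expected.length) return false;
      return crypto.timingSafeEqual(Buffer.from(sig), Buffer.from(expected));
    }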
valicord 1 day ago|||
The point is that the scrapers can easily bypass this if they cared to do so
uqers 1 day ago||
How so?
valicord 15 hours ago|||
The parent comment was: "The author of that site assumes that scrapers will keep track of the access tokens for a week, but most internet-wide scrapers don't do so." There's no technical reason why they wouldn't reuse those tokens; they don't do so today because they don't care. If Anubis gets enough adoption to cause meaningful inconvenience, the scrapers would just start caching the tokens to amortize the cost.

The point of the article is that if the scraper is sufficiently motivated, Anubis is not going to do much anyway, and if the scraper doesn't care, the same result can be achieved without annoying your actual users.

tecoholic 1 day ago|||
Hmm… by setting the verified=1 cookie on every request to the website?

Am I missing something here? All this does is set an unencrypted cookie and reload the page right?
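
If the check really is just a static cookie as described, then presumably a scraper could send it directly and skip the reload entirely, e.g. (hypothetical URL):

  curl -H 'Cookie: verified=1' https://example.com/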

notpushkin 1 day ago||
They could, but if this is slightly different from site to site, they’ll have to either do this for every site (annoying but possible if your site is important enough), or go ahead and run JS (which... I thought they do already, with plenty of sites still being SPAs?)
rezonant 1 day ago||
I would be highly surprised if most of these bots aren't already running JavaScript; I'm confused by this unquestioned notion that they don't.
iamnothere 18 hours ago||
All the critics here miss the point. Anubis has worked to stop DDoS-level scraping against a number of production sites, especially self-hosted source repos and forums. If it stops working, then either Anubis contributors will come up with a fix, site devs will find their own fix, or the sites under attack will be shut down. It’s an arms race in which there is no permanent solution; each escalation will of course be easily bypassed (in theory) until the majority of the attackers find that further adaptations are not worth the additional revenue, or until there is no further defense possible.

Anubis isn’t some conspiracy to show you pictures of anime catgirls, it’s a desperate attempt to stave off bot-driven downtime. Many admins who install it do so reluctantly, because obviously it is annoying to have a delay when you access a website. Nobody is doing that for fun.

(There are probably a few people who install it not to protect against scraper DDoS, but due to ideological opposition to AI scrapers. IMHO this is fruitless, as the more intelligent scrapers will find ways around it without calling attention to themselves. Anubis makes almost no sense on a static personal blog.)

notpushkin 1 day ago||
My favourite thing about Anubis is that (in the default configuration) it bypasses the actual challenge altogether if you set the User-Agent header to curl.

E.g. if you open this in a browser, you’ll get the challenge: https://code.ffmpeg.org/FFmpeg/FFmpeg/commit/13ce36fef98a3f4...

But if you run this, you get the page content straight away:

  curl https://code.ffmpeg.org/FFmpeg/FFmpeg/commit/13ce36fef98a3f4e6d8360c24d6b8434cbb8869b
I’m pretty sure this gets abused by AI scrapers a lot. If you’re running Anubis, take a moment to configure it properly, or better, put together something that’s less annoying for your visitors, like the OP suggests.
xena 19 hours ago||
This was a tactical decision I made in order to avoid breaking well-behaved automation that properly identifies itself. I have been mocked endlessly for it. There is no winning.
seba_dos1 17 hours ago|||
The winning condition does not need to consider people who write before they think.
ranger_danger 8 hours ago|||
How is a curl user-agent automatically a well-behaved automation?
fragmede 8 minutes ago||
One assumes it is a human, running curl manually, from the command line on a system they're authorized to use. It's not wget -r.
rezonant 1 day ago|||
It only challenges user agents with "Mozilla" in their name, by design, because user agents that don't include it are already identifiable. If Anubis makes the bots change their user agents, it has done its job, as that traffic can now be addressed directly.
samlinnfer 21 hours ago|||
This has basically been Wikipedia's bot policy for a long, long time. If you run a bot, you should identify it via the User-Agent.

https://foundation.wikimedia.org/wiki/Policy:Wikimedia_Found...

1vuio0pswjnm7 8 hours ago||
It's only recently, within the last three months IIRC, that Wikipedia started requiring a UA header

I know because as a matter of practice I do not send one. Like I do with most www sites, I used Wikipedia for many years without ever sending a UA header. Never had a problem

I read the www text-only, no graphical browser, no Javascript

hshdhdhehd 1 day ago||||
What if every request from the bot has a different UA?
skylurk 23 hours ago|||
Success. The goal is to differentiate users and bots who are pretending to be users.
trenchpilgrim 22 hours ago|||
Then you can tell the bots apart from legitimate users through normal WAF rules, because browsers froze the UA a while back.
hsbauauvhabzb 22 hours ago|||
Can you explain what you mean by this? Why Mozilla specifically and not WebKit or similar?
gucci-on-fleek 22 hours ago||
Due to weird historical reasons [0] [1], every modern browser's User-Agent starts with "Mozilla/5.0", even if the browser has nothing to do with Firefox.

[0]: https://en.wikipedia.org/wiki/User-Agent_header#Format_for_h...

[1]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
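
For example, a current Chrome on Windows identifies itself with something along these lines (version numbers vary):

  Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36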

seba_dos1 17 hours ago||
> I’m pretty sure this gets abused by AI scrapers a lot.

In practice, it hasn't been an issue for many months now, so I'm not sure why you're so sure. Disabling Anubis takes servers down; allowing curl bypass does not. What makes you assume that aggressive scrapers that don't want to identify themselves as bots will willingly identify themselves as bots in the first place?

praptak 1 day ago||
There are reasons to choose the slightly annoying solution on purpose though. I'm thinking of a political statement along the lines of "We have a problem with asshole AI companies, and here's how they make everyone's life slightly worse."
weinzierl 23 hours ago||
"Unfortunately, Cloudflare is pretty much the only reliable way to protect against bots."

With footnote:

"I don’t know if they have any good competition, but “Cloudflare” here refers to all similar bot protection services."

That's the crux. Cloudflare is the default; no one seems to bother taking the risk with a competitor, for some reason. Competitors seem to exist, but when asked, people can't even name them.

(For what it's worth, I've been using AWS CloudFront, but I had to think a moment to remember its name.)

Avamander 17 hours ago|
It's actually not that reliable either, given a bit of effort. Only their paid offerings actually give you tools to properly defend against intentional attacks.
indrora 1 day ago||
The problem is that increasingly, they are running JS.

In the ongoing arms race, we're likely to see simple checks like this one met by a handful of detection systems that look for "set a cookie" patterns, or that at least open the page in headless Chrome and inspect the cookies it sets.

moebrowne 22 hours ago||
> increasingly, they are running JS.

Does anyone have any proof of this?

xena 19 hours ago||
I'm seeing more big botnets hosted on Alibaba Cloud, Huawei Cloud, and one on Tencent Cloud that run Headless Chrome. IP space blocks have been the solution there. I currently have a thread open with Tencent Cloud abuse where they've been begging me to not block them by default.
ranger_danger 8 hours ago||
I don't consider cloud IP blocks a solution. We use Amazon WorkSpaces, and many sites often block or restrict access just because our IPs appear to be from Amazon. There are also a good number of legitimate VPN users that are on cloud IPs.
utopiah 1 day ago||
> increasingly, they are running JS.

I mean, they have access to a mind-blowing amount of computing resources, so why not spend a fraction of them on improving the quality of the data, and why not run JS too? They have this fundamental belief (because it's convenient for their situation) that scale is everything. Heck, if they have to run a container with a full browser, not even headless, they will.

typpilol 22 hours ago||
Chrome even released a DevTools MCP that gives any LLM full tool access to do anything in the browser.

Navigation, screenshots, etc. It has like 30 tools in it alone.

Now we can just run real browsers with LLMs attached. Idk how you even think about defeating that.

katdork 19 hours ago||
I don't like this solution because it is hostile to those who use tools such as uMatrix or NoScript in their browser, who use TUI browsers (e.g. chawan, lynx, w3m, ...), or who have disabled JavaScript outright.

Admittedly, this is no different from the ways Anubis is hostile to those same users; truly a tragedy of the commons.

1vuio0pswjnm7 7 hours ago|
Whether intentional or not, there is an obvious benefit to the website operator in forcing users to expose themselves to images and JavaScript by requiring the use of particular software, e.g. a popular graphical browser from a company providing advertising services (Google, Apple, etc.) or partnering with one (Mozilla):

It makes (a) visual advertising and (b) tracking viable

I read the www text-only, no auto-loading of resources (images, etc.), and I see no ads

yellow_lead 1 day ago||
Anubis should be something that doesn't inconvenience all the real humans that visit your site.

I work with FFmpeg, so I have to access their bug tracker and mailing list site sometimes. Every few days, I'm hit with the Anubis block, and 1/3 to 1/5 of the time it fails completely. The other times, it delays me by a few seconds. Over time, this has soured me on the Anubis project, which was initially something I supported.

xena 19 hours ago||
I've finally found a ruleset that works for that fwiw. The newest release has that fix.
yellow_lead 19 hours ago||
Thank you!
xena 18 hours ago||
No problem. I wish I had found it sooner, but between doing this nights and weekends while working a full time job, trying to help my husband find a new job, navigating the byzantine nightmare that is sales to education institutions, and other things I have found out that I hate, I have not had a lot of time to actually code things. I wish I could afford to work on this full time. Government grants have not gone through because I don't have the metrics they need. Probably gonna have to piss people off to get the bare minimum of metrics that I need in order to justify why I should get those grants.
opan 23 hours ago|||
I only had issues with it on GNOME's bug tracker and could work around it with a UA change; meanwhile, Cloudflare challenges are often unpassable in qutebrowser no matter what I do.
mariusor 23 hours ago|||
I don't understand the hate when people look at a countermeasure against unethical shit and complain about it instead of being upset at the unethical shit. And it's funny when it's the other way around, like cookie banners being blamed on the GDPR rather than on the scumminess of some web operators.
elashri 21 hours ago|||
I don't understand why some people don't realize that you can be upset about a status quo in which both sides of the equation suck. You can hate a thing and also the countermeasure someone deploys against it. These are not mutually exclusive.
mariusor 21 hours ago||
I didn't see the parent being upset about both sides on this one. I don't see it implied anywhere that they even considered it.
elashri 21 hours ago||
> which was initially something I supported.

That quote is a strong indication that they see it this way.

yellow_lead 19 hours ago||
Yup, I'm against the AI scraping. But personally for me, the equation breaks when I'm getting delays and errors when just visiting a bug tracker.

Sounds like maybe it'll be fixed soon though

m4rtink 20 hours ago|||
Also the Anubis mascot is very cute! ;-)
throwaway290 1 day ago|||
I understand why ffmpeg does it. No one is expected to pay for it. Until this age of LLMs, when bot traffic became dominant on the web, the ffmpeg site was probably an acceptable expense. But they probably don't want to be an unpaid data provider for big LLM operators who get to extract a few bucks from their users.

It's like airport check-in. Are we inconvenienced? Yes. Who is to blame? Probably not the airline or the company providing the service. Probably the people who want to fly without a ticket or bring in explosives.

As long as the Anubis project and the people on it don't try to play both sides and don't make the LLM situation worse (mafia-racket style), I think if it works, it works.

TJSomething 21 hours ago||
I know it's beside the point, but I think a chunk of the reason for many of the security measures in airports is that creating the appearance of security increases people's willingness to fly.
bakql 22 hours ago||
[flagged]
trenchpilgrim 22 hours ago||
Unfortunately in countries like Brazil and India, where a majority of humans collectively live, better computers are taxed at extremely high rates and are practically unaffordable.
bakql 22 hours ago||
[flagged]
paweladamczuk 23 hours ago||
The internet in its current form, where I can theoretically ping any web server on Earth from my bedroom, doesn't seem sustainable. I think it will have to end at some point.

I can't fully articulate it, but I feel like there is some game-theory aspect of the current design that's just not compatible with reality.

noAnswer 19 hours ago|
Years ago, wasn't there a proposal from Google or the like to have push notifications for search engines? Instead of the bots checking over and over again whether there is something new, you would inform them about it. I think that would be a fair middle ground: you don't DDoS us, and in exchange we inform you in a timely manner when there is something new. (Bots would need a way to subscribe themselves.)

I have a personal website that sometimes doesn't get an update for a year. Still, bots make up the majority of visitors. (Not so much that I would need countermeasures, but still.) Most bot visits could be avoided with such a scheme.

redwall_hp 12 hours ago|||
Ah, so blog pingbacks are new again. https://en.wikipedia.org/wiki/Pingback

That's how Technorati worked.

ranger_danger 8 hours ago|||
The problem I see with this approach is that it enables website operators to stop alerting bots completely, and then the bots' customers will complain that sites aren't updated and won't care that the site owner is blocking them.
geokon 1 day ago|
Big picture, why does everyone scrape the web?

Why doesn't one company do it and then resell the data? Is it a legal/liability issue? If you scrape, it's a legal grey area, but if you sell what you scrape, it's clearly copyright infringement?

utopiah 1 day ago|
My bet is that they believe https://commoncrawl.org isn't good enough and, precisely as you are suggesting, the "rest" is where their competitive advantage might stem from.
ccgreg 8 hours ago|||
Most academic AI research and AI startups find Common Crawl adequate for what they're doing. Common Crawl also has a lot of not-AI usage.
Jackson__ 22 hours ago|||
Thinking that there is anything worth scraping after the LLM apocalypse is pure hubris imo. It is slop city out there, and unless you have an impossibly perfect classifier to detect it, 99.9% of all the great new "content" you scrape will be AI-written.

E: In fact, this whole idea is so stupid that I am forced to consider whether it is just a DDoS in the original sense: scrape everything so hard it goes down, just so that your competitors can't.
