Posted by flexagoon 11/2/2025

You Don't Need Anubis (fxgn.dev)
177 points | 170 comments | page 2
geokon 11/2/2025|
Big picture, why does everyone scrape the web?

Why doesn't one company do it and then resell the data? Is it a legal/liability issue? If you scrape it's a legal grey area, but if you sell what you scrape it's clearly copyright infringement?

utopiah 11/2/2025|
My bet is that they believe https://commoncrawl.org isn't good enough and that, precisely as you are suggesting, the "rest" is where their competitive advantage might stem from.
ccgreg 11/2/2025|||
Most academic AI research and AI startups find Common Crawl adequate for what they're doing. Common Crawl also has a lot of not-AI usage.
fragmede 11/3/2025||||
I think that there are lots of people who are working from "first principles" and haven't even heard of Common Crawl, or don't know how to use it.
Jackson__ 11/2/2025|||
Thinking that there is anything worth scraping past the llm-apocalypse is pure hubris imo. It is slop city out there, and unless you have an impossibly perfect classifier to detect it, 99.9% of all the great new "content" you scrape will be AI written.

E: In fact this whole idea is so stupid that I am forced to consider if it is just a DDoS in the original sense. Scrape everything so hard it goes down, just so that your competitors can't.

gucci-on-fleek 11/2/2025||
> But it still works, right? People use Anubis because it actually stops LLM bots from scraping their site, so it must work, right?

> Yeah, but only because the LLM bots simply don’t run JavaScript.

I don't think that this is the case, because when Anubis itself switched from a proof-of-work to a different JavaScript-based challenge, my server got overloaded, but switching back to the PoW solution fixed it [0].

I also semi-hate Anubis since it required me to add JS to a website that used none before, but (1) it's the only thing that stopped the bot problem for me, (2) it's really easy to deploy, and (3) very few human visitors are incorrectly blocked by it (unlike Captchas or IP/ASN bans that have really high false-positive rates).

[0]: https://github.com/TecharoHQ/anubis/issues/1121

gbuk2013 11/2/2025||
The Caddy config in the parent article uses status code 418. This is cute, but wouldn't this break search engine indexing? Why not use a 307 code instead?
flexagoon 11/2/2025|
I use this for a personal Redlib instance, so search indexing is not important. I don't know whether this would allow indexing even with a 307 status code - maybe you just need to add an exception for Googlebot.
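(A minimal sketch of that exception in Go rather than Caddy, purely for illustration: requests from a made-up list of AI-crawler user agents get a 418, while Googlebot is let through so indexing keeps working. This is not the article's actual config.)

    package main

    import (
        "net/http"
        "strings"
    )

    // Illustrative user-agent substrings only; adjust to taste.
    var blockedAgents = []string{"GPTBot", "ClaudeBot", "CCBot", "Bytespider"}

    func guard(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            ua := r.UserAgent()
            // Let Googlebot through so the site stays indexable.
            if !strings.Contains(ua, "Googlebot") {
                for _, agent := range blockedAgents {
                    if strings.Contains(ua, agent) {
                        http.Error(w, "I'm a teapot", http.StatusTeapot) // 418
                        return
                    }
                }
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("hello"))
        })
        http.ListenAndServe(":8080", guard(mux))
    }
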
defraudbah 11/2/2025||
[flagged]
m4rtink 11/2/2025||
Working as intended! ;-)
viaoktavia 11/2/2025||
[dead]
agnishom 11/2/2025||
Exactly. I don't understand what computation you can afford to do in 10 seconds on a small number of cores that bots running in large data centers cannot.
juliangmp 11/2/2025|
The point of Anubis isn't to make scraping impossible, but to make it more expensive.
agnishom 11/2/2025||
By how much? I don't understand the cost model here at all.
eqvinox 11/2/2025||
AIUI the idea is to rate-limit each "solution". A normal human's browser only needs to "solve" once. An LLM crawler either needs to slow down (= objective achieved) or solve the puzzle n times to get n × the request rate.
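(For concreteness, a minimal sketch of this kind of hash puzzle, assuming a SHA-256 search for a digest with a given number of leading zero hex digits. The real Anubis challenge differs in its details, but the cost shape is the same: a browser pays once per session, a crawler pays once per rate-limited "solution".)

    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "strings"
    )

    // solve searches for a nonce such that SHA-256(challenge + nonce) starts
    // with `difficulty` zero hex digits. Each extra digit multiplies the
    // expected client-side work by 16; server-side verification stays a
    // single hash.
    func solve(challenge string, difficulty int) (int, string) {
        target := strings.Repeat("0", difficulty)
        for nonce := 0; ; nonce++ {
            sum := sha256.Sum256([]byte(fmt.Sprintf("%s%d", challenge, nonce)))
            digest := hex.EncodeToString(sum[:])
            if strings.HasPrefix(digest, target) {
                return nonce, digest
            }
        }
    }

    func main() {
        nonce, digest := solve("example-challenge", 4)
        fmt.Println(nonce, digest)
    }
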
agnishom 11/4/2025||
Let's say that adding Anubis adds 10 seconds of extra compute for the bot when it tries to access my website. Will this be enough to deter the bot/scraper?
GabrielTFS 11/10/2025||
Empirical evidence appears to show that it is ¯\_(ツ)_/¯
jchw 11/2/2025||
I was briefly messing around with Pangolin, which is supposed to be a self-hosted Cloudflare Tunnels sort of thing. Pretty cool.

One thing I noticed, though, was that the Digital Ocean Marketplace image asks you if you want to install something called Crowdsec, which is described as a "multiplayer firewall". While it is a paid service, it appears there is a community offering that is well-liked enough. I actually was really wondering what downsides it has (except for the obvious one, which is that you are definitely trading some user privacy in service of security), but at least in principle the idea seems like a nice middle ground between Cloudflare and nothing, if it works and the business model holds up.

bootsmann 11/2/2025|
Not sure Crowdsec is fit for this purpose. It's more a fail2ban replacement than a DDoS challenge.
jchw 11/2/2025||
One of the main ways that Cloudflare is able to avoid presenting CAPTCHAs to a lot of people while still filtering tons of non-human traffic is exactly that, though: just having a boatload of data across the Internet.
andersmurphy 11/2/2025||
So I don't use Cloudflare, but I only serve clients that support brotli and have a valid cookie. All the actual content comes down an SSE connection. I haven't had any problems with bots on my $5 VPS.

What I realised recently is that, for non-user browsers, my demos are effectively zip bombs.

Why?

Because I stream each frame, and each frame is around 180 kB uncompressed (compressed frames can be as small as 13 bytes). This is fine, as the user's browser doesn't hold onto the frames.

But, a crawler will hold onto those frames. Very quickly this ends up being a very bad time for them.

Of course, there's nothing of value to scrape, so it's mostly pointless. But I found it entertaining that some scummy crawler is getting nuked by checkboxes [1].

[1]: https://checkboxes.andersmurphy.com
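(A rough sketch of that shape, assuming a plain Go net/http SSE handler; the brotli requirement, the cookie check and the actual frame content from the comment above are left out.)

    package main

    import (
        "fmt"
        "net/http"
        "strings"
        "time"
    )

    // frames streams large server-sent events indefinitely. A browser renders
    // and discards each event, but a naive crawler that buffers the whole
    // response body accumulates it without bound.
    func frames(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Type", "text/event-stream")
        w.Header().Set("Cache-Control", "no-cache")

        flusher, ok := w.(http.Flusher)
        if !ok {
            http.Error(w, "streaming unsupported", http.StatusInternalServerError)
            return
        }

        frame := strings.Repeat("x", 180*1024) // ~180 kB per uncompressed frame
        for {
            select {
            case <-r.Context().Done():
                return
            case <-time.After(100 * time.Millisecond):
                fmt.Fprintf(w, "data: %s\n\n", frame)
                flusher.Flush()
            }
        }
    }

    func main() {
        http.HandleFunc("/frames", frames)
        http.ListenAndServe(":8080", nil)
    }
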

echelon 11/2/2025||
This whole thing is pointless.

OpenAI Atlas defeats all of this by being the user's web browser. They got between you and the user you're trying to serve content to, and they slurp up everything the user browses to return it for training.

The firewall is now moot.

The bigger AI company, Google, has already been doing this for decades. They were the middlemen between your reader and you, and that position is unassailable. Without them, you don't have readers.

At this point, the only people you're keeping out with LLM firewalls are the smaller players, which further entrenches the leaders.

OpenAI and Google want you to block everybody else.

happyopossum 11/2/2025||
> Google, has already been doing this for decades

Do you have any proof, or even circumstantial evidence to point to this being the case?

If Chrome actually scraped every site you ever visited and sent it off to Google, it'd be trivially simple to find some indication of that in network traffic, or heck - even in the Chromium code.

echelon 11/2/2025||
Sorry, I mean they sit in the middle of the customer relationship.

Who would dare block Google Search from indexing their site?

The relationship is adversarial, but necessary.

ranger_danger 11/2/2025|||
> Who would dare block Google Search from indexing their site?

People who don't want to be indexed. Or found at all.

Dylan16807 11/2/2025|||
Is it confirmed that site loads go into the training database?

But for anyone whose main concern is their server staying up, Atlas isn't a problem. It's not doing a million extra loads.

heavyset_go 11/2/2025||
> Is it confirmed that site loads go into the training database?

Would you trust OpenAI if they told you it doesn't?

If you would, would you also trust Meta to tell you if its multibillion dollar investment was trained on terabytes of pirated media the company downloaded over BitTorrent?

viraptor 11/2/2025||
We don't have to trust it or not. If there's such a claim, surely someone can point to at least a pcap file with an unknown connection, or to some decompiled code. Otherwise it's just a conspiracy theory.
_flux 11/2/2025|||
Surely the data must go to the OpenAI servers; how else would they use LLMs on it? We cannot see whether that data ends up in the training data.

Personally, I would just believe what they say for the time being; there would be backlash in doing something else, possibly a legal one.

viraptor 11/2/2025||
I think the original claim was about something different. "Is it confirmed that site loads..." - I read it as the author talking about general browsing, not just explicit questions with the page as context.
heavyset_go 11/2/2025|||
Whatever is included in context is in OpenAI's control from that point forward, and you just have to trust them not to do anything with it.

That isn't a conspiracy theory, it's fundamentally how interfacing with 3rd party hosted LLMs works.

seba_dos1 11/2/2025|||
The "LLM firewall" is usually there so AI companies don't take the server down, not to prevent model training (that's just an acceptable side effect).
_flux 11/2/2025|||
As I understand it, the main point of Anubis is to reduce the costs caused by (AI company) bots, and agent-generated load is still a lot less than simply spidering the complete website; it might actually be quite close to what a user would manually browse.

Unless the user asks something that requires visiting many pages, I suppose. For example, Google Gemini was pretty helpful in finding out the typical price ranges and dishes of the coffee shops in a local shopping centre, as the information was far from being on just a single page.

masklinn 11/2/2025||
> This whole thing is pointless.

It's definitely pointless if you completely miss the point of it.

> OpenAI Atlas defeats all of this by being a user's web browser. They got between you and the user you're trying to serve content, and they slurp up everything the user browses to return it back for training.

Cool. Anubis' fundamental purpose is not to prevent all bot access though, as clearly spelled out in its overview:

> This program is designed to help protect the small internet from the endless storm of requests that flood in from AI companies.

OpenAI Atlas piggybacking on the user's normal browsing is not within the remit of Anubis, because it's not going to take a small site down or dramatically increase hosting costs.

> At this point, the only people you're keeping out with LLM firewalls are the smaller players

Oh no, who will think of the small assholes?

greatgib 11/2/2025||
Just a personal observation: when I want to see a page and instead have to face a stupid 3s nag screen like the one from Anubis, I'm very pissed off and pushed even more to bypass the website when possible and get the info I want directly from an LLM or a search engine.

It's kind of a self-fulfilling prophecy: you make the visitor experience worse, which becomes its own justification for why getting the content from an LLM is wanted and needed.

All of that because, in the current lambda/cloud computing world, it has become very expensive to process even a few requests.

robinsonb5 11/2/2025||
Unfortunately the choice isn't between sites with something like Anubis and sites with free and unencumbered access. The choice is between putting up with Anubis and the sites simply going away.

A web forum I read regularly has been playing whack-a-mole with LLM scrapers for much of this year, with multiple weeks-long periods where the swarm-of-locusts would make the site inaccessible to actual users.

The admins tried all manner of blocks, including ultimately banning entire countries' IP ranges, all to no avail.

The forum's continued existence depends on being able to hold off abusive crawlers. Having to see half-a-second of the Anubis splashscreen occasionally is a small price to pay for keeping it alive.

greatgib 11/2/2025||
[flagged]
pushcx 11/2/2025|||
The scrapers will not attempt to discover and use an efficient representation. They will attempt to hit every URL they can discover on a site, and they'll do it at a rate of hundreds of hits per second, from enough IPs that each only requests at a rate of 1/minute. It's rude to talk down to people for not implementing a technique that you can't get scrapers to adopt, and for matching their investment in performance to their needs instead of accurately predicting years beforehand that traffic would dramatically change.
xena 11/2/2025|||
I challenge you to take a critical look at the performance of things like PHPBB and see how even naive scraping brings commonly deployed server CPUs to their knees.
eqvinox 11/2/2025|||
If you don't feel like understanding that the thing to be pissed off about here is the AI crawlers, we don't feel like understanding your displeasure about the Anubis wall either. The choices are either the Anubis wall or nothing. This isn't theoretical; I've been involved in this decision: we had to either close off the service entirely or put [something like] Anubis in front of it.

> have to face a 3s stupid nagscreens like the one of anubis, I'm very pissed off and pushed even more to bypass the website when possible to get the info I want directly from llm or search engine.

Most (freely accessible) LLMs will take more than 3s to "think". Why are you pissed off about Anubis, but not the slow LLM? And then you have to double check the LLM anyway...

> All of that because in the current lambda/cloud computing word, it became very expensive to process only a few requests.

You're making some very arrogant assumptions here. FOSS repos and bugtrackers are generally not lambda/cloud hosted.

redwall_hp 11/2/2025||
There are a lot of phpBB/XenForo/Discourse/etc. forums out there too that get slammed hard by those, and many cases of them just shutting down rather than eating much higher hosting costs. Which, of course, further pushes online communities into the hands of corporations like Reddit and Facebook.

Most of them simply throw one of those tools on a VPS or similar, which is perfect for their community size, and then it falls over under LLM companies' botnets DDoSing it.

DanOpcode 11/2/2025|||
I agree. I think it gives a bad impression when I need to see the anime Anubis girl before the page loads. Codeberg.org often shows me the nag screen, and it has worsened my impression of their service.
tptacek 11/2/2025|
This came up before (and this post links to the Tavis Ormandy post that kicked up the last firestorm about Anubis) and without myself shading the intent or the execution on Anubis, just from a CS perspective, I want to say again that the PoW thing Anubis uses doesn't make sense.

Work functions make sense in password hashes because they exploit an asymmetry: attackers will guess millions of invalid passwords for every validated guess, so the attacker bears most (really almost all) of the cost.

Work functions make sense in antispam systems for the same reason: spam "attacks" rely on the cost of an attempt being so low that it's efficient to target millions of victims in the expectation of just one hit.

Work functions make sense in Bitcoin because they function as a synchronization mechanism. There's nothing actually valorous about solving a SHA2 puzzle, but the puzzles give the whole protocol a clock.

Work functions don't make sense as a token tax; there's actually the opposite of the antispam asymmetry there. Every bot request to a web page yields tokens to the AI company. Legitimate users, who far outnumber the bots, are actually paying more of a cost.

None of this is to say that a serious anti-scraping firewall can't be built! I'm fond of pointing to how Youtube addressed this very similar problem, with a content protection system built in Javascript that was deliberately expensive to reverse engineer and which could surreptitiously probe the precise browser configuration a request to create a new Youtube account was using.

The next thing Anubis builds should be that, and when they do that, they should chuck the proof of work thing.

mariusor 11/2/2025||
With all due respect, almost all I see in this thread is people looking down their noses at a proven solution and giving advice instead of doing the work. I can see how you are a _very important person_ with bills to pay and money to make, but at least have the humility to understand that the solution we got is better than the solution that could be better if only there were someone else to think of it and build it.
tptacek 11/2/2025|||
You can't moralize a flawed design into being a good one.
mariusor 11/2/2025||
How about into a "good enough one"?
tptacek 11/2/2025||
Look, I don't care if you run Anubis. I'm not against "Anubis". I'm interested in the computer science of the current Anubis implementation. It's not great. It doesn't make sense. Those are descriptive observations, and you can't moralize them into being false; you need to present an actual argument.
mariusor 11/2/2025||
This is not me being aggro because you're picking on my favourite project; I dislike Anubis for more or less the same complaints you see in this thread. I don't want JavaScript on otherwise static sites, I don't like the anime girl, etc. What I don't agree with is people like you pontificating about what an inferior solution it is, and *how* obvious that should be for everybody, while failing to provide any better alternatives. So, I guess what I'm trying to say is: put up or shut up.
tptacek 11/2/2025||
Sorry, but I really can't think of anything less interesting to debate than how a computer science argument makes you feel about how it might make someone else feel.
mariusor 11/3/2025||
I don't know in how many more different ways I can say it, but I'm not inviting you to debate, I'm inviting you to write a better tool and make it accessible for free.
yumechii 11/2/2025|||
[dead]
mariusor 11/2/2025||
It's weird that you get offended by something which was not directed at you.

"The work" is providing those better alternatives to anubis, that everyone in this thread except for Xe seem to know all about.

The humility is about accepting the fact that the solution works for some people, the small site operators that get hammered by DDoSes and unethical LLM over-crawling, despite not being perfect. And if that inconveniences you as a user of those sites - which I imagine is what you mean by "user backlash" - the solution for you is to stop going there, not to talk down at them for doing something about an issue that impacts them.

yumechii 11/2/2025||
How am I offended? Did I accuse you of anything? I didn't even accuse Anubis of anything. You asked for the work, so I posted the work and evidence to ground the discussion in "work", as you demanded.
mariusor 11/2/2025||
I repeat: "the work" is to make a better thing than Anubis, not to provide a proof of concept that it can be beaten. :)
yumechii 11/2/2025||
You criticized what you identified as "advice" for not providing work within your scope (which you clarified as "make a better thing than Anubis"), so why should I suddenly have to meet your scope of "work" for my criticism of your advice to be valid this time? Showing a negative result is also work.
mariusor 11/2/2025||
If you're operating your reasoning in a moral framework where helping the bad agents is a good outcome, then you'd be right. I personally do not, however.
yumechii 11/2/2025||
If your moral framework says that supporting a nominally good "solution" with no evidence (where is your evidence for your assertion that the solution is "proven"?) is "a good outcome", while pointing out, with evidence, that the solution is flawed is somehow not, then you'd be right. I personally do not share your nominal goodness compass, however.
mariusor 11/2/2025||
Codeberg and sourcehut[1] have both blogged about Anubis decreasing the load on their servers at the beginning of the year, when this saga started. Since then, one or both have moved to different solutions, but that was not due to ineffectiveness but rather to the requirement for JavaScript.

[1] https://sourcehut.org/blog/2025-04-15-you-cannot-have-our-us...

yumechii 11/3/2025||
Empirical evidence is more robust than anecdotal evidence.

Also, a lower "server load" has nothing to do with the system being collectively "a good outcome" that justifies labeling criticism as supporting "the bad guys".

gucci-on-fleek 11/2/2025|||
> Work functions don't make sense as a token tax; there's actually the opposite of the antispam asymmetry there. Every bot request to a web page yields tokens to the AI company. Legitimate users, who far outnumber the bots, are actually paying more of a cost.

Agreed, residential proxies are far more expensive than compute, yet the bots seem to have no problem obtaining millions of residential IPs. So I'm not really sure why Anubis works—my best guess is that the bots have some sort of time limit for each page, and they haven't bothered to increase it for pages that use Anubis.

> with a content protection system built in Javascript that was deliberately expensive to reverse engineer and which could surreptitiously probe the precise browser configuration a request to create a new Youtube account was using.

> The next thing Anubis builds should be that, and when they do that, they should chuck the proof of work thing.

They did [0], but it doesn't work [1]. Of course, the Anubis implementation is much simpler than YouTube's, but (1) Anubis doesn't have dozens of employees who can test hundreds of browser/OS/version combinations to make sure that it doesn't inadvertently block human users, and (2) it's much trickier to design an open-source program that resists reverse-engineering than a closed-source program, and I wouldn't want to use Anubis if it went closed-source.

[0]: https://anubis.techaro.lol/docs/admin/configuration/challeng...

[1]: https://github.com/TecharoHQ/anubis/issues/1121

tptacek 11/2/2025||
Google's content-protection system didn't simply make sure you could run client-side Javascript. It implemented an obfuscating virtual machine that, if I'm remembering right (I may be getting some of the details blurred with Blu-ray's BD+ scheme), built up a hash input of runtime artifacts. As I understand it, it was one person's work, not the work of a big team. The "source code" we're talking about here is client-side Javascript.

Either way: what Anubis does now --- just from a CS perspective, that's all --- doesn't make sense.

Gander5739 11/2/2025||
But YouTube can still be scraped with yt-dlp, so apparently it wasn't enough.
tptacek 11/2/2025||
Preventing that wasn't the objective of the content-protection system. You'll have to go read up on it.