Posted by evacchi 4/12/2025

Anubis Works (xeiaso.net)
319 points | 208 comments | page 2
throwaway150 4/12/2025|
Looks cool. But please help me understand: what's to stop AI companies from solving the challenge, completing the proof of work, and scraping websites anyway?
crq-yml 4/12/2025||
It's a strategy to redefine the doctrine of information warfare on the public Internet from maneuver (leveraged and coordinated usage of resources to create relatively greater effects) towards attrition (resources are poured in indiscriminately until one side capitulates).

Individual humans don't care about a proof-of-work challenge if the information is valuable to them - many web sites already load slowly through a combination of poor coding and spyware ad-tech. But companies care, because that changes their ability to scrape from a modest cost of doing business into a money pit.

In the earlier periods of the web, scraping wasn't necessarily adversarial because search engines and aggregators were serving some public good. In the AI era it's become belligerent - a form of raiding and repackaging without credit. Proof of work as a deterrent was proposed to fight spam decades ago (Hashcash), but it's only now that it has really needed to become weaponized.

marginalia_nu 4/12/2025|||
The problem with scrapers in general is the asymmetry of compute resources involved in generating versus requesting a website. You can likely make millions of HTTP requests with the compute required to generate the average response.

If you make it more expensive to request documents at scale, you make this type of crawling prohibitively expensive. On a small scale it really doesn't matter, but if you're casting an extremely wide net and re-fetching the same documents hundreds of times, it really does matter - even if you have a big VC budget.

Nathanba 4/13/2025|||
Yes, but the scraper only has to solve it once and it gets cached too, right? Surely it gets cached, otherwise it would be too annoying for humans on phones too. I guess it depends on whether scrapers are just simple curl clients or full headless browsers, but I seriously doubt that Google-tier LLM scrapers rely on site content loading statically without JS.
ndiddy 4/13/2025|||
AI companies have started using a technique to evade rate limits where they will have a swarm of tens of thousands of scraper bots using unique residential IPs all accessing your site at once. It's very obvious in aggregate that you're being scraped, but when it's happening, it's very difficult to identify scraper vs. non-scraper traffic. Each time a page is scraped, it just looks like a new user from a residential IP is loading a given page.

Anubis helps combat this because even if the scrapers upgrade to running automated copies of full-featured web browsers that are capable of solving the challenges (which means it costs them a lot more to scrape than it currently does), their server costs would balloon even further because each time they load a page, it requires them to solve a new challenge. This means they use a ton of CPU and their throughput goes way down. Even if they solve a challenge, they can't share the cookie between bots because the IP address of the requestor is used as part of the challenge.
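
For illustration, binding the client IP into the challenge might look roughly like the following Go sketch; the input fields, the separator, and the weekly rotation value are assumptions for the example, not Anubis's actual implementation.

  // Hypothetical sketch: binding a PoW challenge to the client IP so a solved
  // cookie cannot be shared across a botnet. Not Anubis's exact scheme.
  package main

  import (
      "crypto/sha256"
      "encoding/hex"
      "fmt"
  )

  // challengeFor derives a per-client challenge. Because the client IP is
  // hashed in, a token minted for one IP is useless from another.
  func challengeFor(clientIP, userAgent, serverSecret string, week int) string {
      input := fmt.Sprintf("%s|%s|%s|%d", clientIP, userAgent, serverSecret, week)
      sum := sha256.Sum256([]byte(input))
      return hex.EncodeToString(sum[:])
  }

  func main() {
      fmt.Println(challengeFor("203.0.113.7", "Mozilla/5.0", "server-secret", 2887))
  }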

Nathanba 4/13/2025|||
Tens of thousands of scraper bots for a single site? Is that really the case? I would have assumed that maybe 3-5 bots send, let's say, 20 requests per second in parallel to scrape. Sure, they might eventually start trying different IPs and bots if their others are timing out, but ultimately it's still the same end result: all they will realize is that they have to increase the timeout and use headless browsers to cache results, and the entire protection is gone. But yes, I think for big bot farms it will be a somewhat annoying cost increase. This should really be combined with the Cloudflare captcha to make it even more effective.
marginalia_nu 4/13/2025|||
A lot of the worst offenders seem to be routing the traffic through a residential botnet, which means that the traffic really does come from a huge number of different origins. It's really janky and often the same resources are fetched multiple times.

Saving and re-using the JWT cookie isn't that helpful, as you can effectively rate limit using the cookie as identity, so to reach the same request rates you see now they'd still need to solve hundreds or thousands of challenges per domain.

Hasnep 4/13/2025|||
If you're sending 20 requests per second from one IP address you'll hit rate limits quickly, that's why they're using botnets to DDoS these websites.
vhcr 4/13/2025|||
Until someone writes the proof of work code for GPUs and it runs 100x faster and cheaper.
marginalia_nu 4/13/2025|||
A big part of the problem with these scraping operations is how poorly implemented they are. They could get much cheaper gains simply by cleaning up how they operate - not redundantly fetching the same documents hundreds of times, and so on.

Regardless of how they solve the challenges, creating an incentive to be efficient is a victory in itself. GPUs aren't cheap either, especially not if you're renting them via a browser farm.

runxiyu 4/13/2025|||
Anubis et al. are also looking into alternative algorithms. There seems to be consensus that SHA-256 PoW is not appropriate
genewitch 4/13/2025||
There are lots of other options, but you want hashes that use lots of RAM. Stuff like scrypt used to be the go-to, but I am sure there are better ones now.
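
As a rough illustration of what a memory-hard alternative could look like, here is a Go sketch using golang.org/x/crypto/scrypt; the cost parameters and the leading-zero rule are assumptions for the example, not anything Anubis currently ships.

  // Hypothetical memory-hard PoW check using scrypt instead of plain SHA-256.
  // With N=32768, r=8 each attempt needs roughly 32 MiB of RAM, which blunts
  // the GPU advantage compared to a bare SHA-256 loop.
  package pow

  import (
      "encoding/binary"
      "encoding/hex"
      "strings"

      "golang.org/x/crypto/scrypt"
  )

  // Verify reports whether scrypt(challenge, nonce) starts with `difficulty`
  // zero hex digits. Parameters are illustrative, not tuned recommendations.
  func Verify(challenge string, nonce uint64, difficulty int) (bool, error) {
      salt := make([]byte, 8)
      binary.BigEndian.PutUint64(salt, nonce)
      key, err := scrypt.Key([]byte(challenge), salt, 32768, 8, 1, 32)
      if err != nil {
          return false, err
      }
      return strings.HasPrefix(hex.EncodeToString(key), strings.Repeat("0", difficulty)), nil
  }
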
Hakkin 4/13/2025||||
It sets a cookie with a JWT verifying you completed the proof-of-work, along with metadata about the origin of the request; the cookie is valid for a week. This is as far as Anubis goes - once you have this cookie you can do whatever you want on the site. For now it seems like enough to stop a decent portion of web crawlers.

You can do more underneath Anubis using the JWT as a sort of session token, though, like rate limiting on a per-proof-of-work basis: if a client using token X makes more than Y requests in a period of time, invalidate the token and force them to generate a new one. This would force them to either crawl slowly or use many times more resources to crawl your content.
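
A rough sketch of that idea - using the solved-PoW token purely as an identity for rate limiting - might look like this in Go; the type names and the X/Y budget are placeholders, not part of Anubis.

  // Hypothetical per-token rate limiter: each solved PoW yields a token, and
  // exceeding the budget is the signal to revoke it and force a fresh solve.
  package ratelimit

  import (
      "sync"
      "time"
  )

  type bucket struct {
      count   int
      resetAt time.Time
  }

  type Limiter struct {
      mu      sync.Mutex
      buckets map[string]*bucket
      limit   int           // Y requests...
      window  time.Duration // ...per period
  }

  func New(limit int, window time.Duration) *Limiter {
      return &Limiter{buckets: map[string]*bucket{}, limit: limit, window: window}
  }

  // Allow reports whether a request under this token may proceed. A false
  // return is where you would invalidate the JWT and re-issue a challenge.
  func (l *Limiter) Allow(token string) bool {
      l.mu.Lock()
      defer l.mu.Unlock()
      b, ok := l.buckets[token]
      if !ok || time.Now().After(b.resetAt) {
          l.buckets[token] = &bucket{count: 1, resetAt: time.Now().Add(l.window)}
          return true
      }
      b.count++
      return b.count <= l.limit
  }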

FridgeSeal 4/13/2025|||
It seems a good chunk of the issue with these modern LLM scrapers is that they do _none_ of the normal “sane” things: caching content, respecting rate limits, using sitemaps, bothering to track crawl depth properly, etc.
charcircuit 4/13/2025|||
If you make it prohibitively expensive, almost no regular user will want to wait for it.
xboxnolifes 4/13/2025|||
Regular users usually aren't page-hopping 10 pages per second. A regular user usually makes around 100 times fewer requests than that.
pabs3 4/13/2025||
I tend to get blocked by HN when opening lots of comment pages in tabs with Ctrl+click.
xboxnolifes 4/13/2025||
Yes, HN has a fairly strict slow down policy for commenting. But, that's irrelevant to the context.
pabs3 4/13/2025||
I meant to say article pages not comment pages, but ack.
bobmcnamara 4/13/2025|||
Exponential backoff!
ndiddy 4/12/2025|||
This makes it much more expensive for them to scrape because they have to run full web browsers instead of limited headless browsers without full Javascript support like they currently do. There's empirical proof that this works. When GNOME deployed it on their Gitlab, they found that around 97% of the traffic in a given 2.5 hour period was blocked by Anubis. https://social.treehouse.systems/@barthalion/114190930216801...
dragonwriter 4/13/2025||
> This makes it much more expensive for them to scrape because they have to run full web browsers instead of limited headless browsers without full Javascript support like they currently do. There's empirical proof that this works.

It works in the short term, but the more people that use it, the more likely it is that scrapers start running full browsers.

SuperNinKenDo 4/13/2025|||
That's the point. An individual user doesn't lose sleep over using a full browser - that's exactly how they use the web anyway. But for an LLM scraper or similar, this greatly increases costs on their end, and thereby partially rebalances the power/cost imbalance. At the very least, it encourages innovations that make the scrapers externalise costs less - not rescraping things over and over again just because you're too lazy and the weight of doing so is borne by somebody else, not you. It's an incentive correction for the commons.
sadeshmukh 4/13/2025||||
Which are more expensive - you can't run as many, especially with Anubis.
perching_aix 4/12/2025|||
Nothing. The idea is instead that at scale the expense of solving the challenges becomes too great.
userbinator 4/13/2025|||
This is basically the DRM wars again. Those who have vested interests in mass crawling will have the resources to blast through anything, while the legit users get subjected to more and more draconian measures.
SuperNinKenDo 4/13/2025||
I'll take this over a Captcha any day.
userbinator 4/13/2025||
CAPTCHAs don't need JS, nor does asking a question that an LLM can't answer but a human can.

Proof-of-work selects for those with the computing power and resources to do it. Bitcoin and all the other cryptocurrencies show what happens when you place value on that.

fc417fc802 4/14/2025||
You can provide visitors a choice.

> Your visit has been flagged. Please select: Login, PoW, Cloudflare, Google.

ronsor 4/13/2025||
I know companies that already solve it.
wredcoll 4/13/2025|||
I mean... knowing how to solve it isn't the trick, it's doing it a million times a minute for your firehose scraper.
udev4096 4/13/2025||
Anubis adds a cookie named `within.website-x-cmd-anubis-auth` which scrapers can reuse to avoid solving the challenge more than once. Just have a fleet of servers whose sole purpose is to extract the cookie after solving the challenges and make sure all of them stay valid. It's not a big deal.
fc417fc802 4/14/2025||
Requests are associated with the cookie, meaning you can trace and block or rate limit as necessary. The cost of solving the PoW is the cost of establishing a new session; if you get blocked, you have to solve again.
creata 4/13/2025|||
Why is spending all that CPU time to scrape the handful of sites that use Anubis worth it to them?
vhcr 4/13/2025||
Because it's not a lot of CPU: you only have to solve it once per website, and the default policy difficulty of 16 for bots is worthless because you can just change your user agent and get a difficulty of 4.
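
To make the difficulty numbers concrete: assuming difficulty counts required leading zero hex digits of a SHA-256 digest (a common convention for this kind of PoW, not verified against Anubis's source), each extra unit multiplies the expected number of hash attempts by 16. A client-side solve under that assumption might look like this sketch.

  // Hypothetical client-side solver for a "leading zero hex digits" rule.
  // Difficulty 4 averages ~65,536 hashes; each extra digit costs 16x more.
  package main

  import (
      "crypto/sha256"
      "encoding/hex"
      "fmt"
      "strconv"
      "strings"
  )

  func solve(challenge string, difficulty int) uint64 {
      prefix := strings.Repeat("0", difficulty)
      for nonce := uint64(0); ; nonce++ {
          sum := sha256.Sum256([]byte(challenge + strconv.FormatUint(nonce, 10)))
          if strings.HasPrefix(hex.EncodeToString(sum[:]), prefix) {
              return nonce
          }
      }
  }

  func main() {
      fmt.Println(solve("example-challenge", 4))
  }
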
mushufasa 4/15/2025||
I looked through the documentation and I've come across a couple sites using this already.

Genuine question: why not leverage the proof-of-work challenge literally into mining that generates some revenue for a website? Not a new idea, but when I looked at the docs it didn't seem like this challenge was tied to any monetary coin value.

This is coming from someone who is NOT a big crypto person, but it strikes me that this would be a much better way to monetize organic high-quality content in this day and age. Basically the idea that Brave browser started with, meeting its moment.

I'm sure Xe has already considered this. Do they have a blog post about this anywhere?

snvzz 4/13/2025||
My Amiga 1200 hates these tools.

It is really sad that the worldwide web has been taken to the point where this is needed.

pabs3 4/13/2025||
Recently I heard of a site blocking bot requests with a message telling the bot to download the site via Bittorrent instead.

Seems like a good solution to the badly behaved scrapers, and I feel like the web needs to move away from the client-server model towards a swarm model like Bittorrent anyway.

seba_dos1 4/13/2025|
If these stupid bots would just learn to clone git repos instead of crawling through GitLab UI pages, it would already be helpful.
deknos 4/13/2025||
I wish there was also tunnel software (client+server) where:

* the server appears on the outside as an HTTPS server/reverse proxy
* the server supports self-signed certificates or Let's Encrypt
* when a client goes to a certain (sub)site or route, HTTP auth can be used
* after HTTP auth, all traffic tunneled over that subsite/route is protected against traffic analysis, for example like obfsproxy does it

Does anyone know something like that? I am tempted to ask xeiaso to add such features, but I do not think his tool is meant for that...
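
The obfuscation layer is the hard part, but the "HTTPS reverse proxy with HTTP auth on a route" portion of that wishlist can be sketched in plain Go; the upstream address, route, credentials, and certificate paths below are all placeholders.

  // Hypothetical sketch of the reverse-proxy + HTTP auth pieces only; it does
  // not obfuscate traffic the way obfsproxy does.
  package main

  import (
      "crypto/subtle"
      "log"
      "net/http"
      "net/http/httputil"
      "net/url"
  )

  func main() {
      upstream, _ := url.Parse("http://127.0.0.1:8080") // placeholder backend
      proxy := httputil.NewSingleHostReverseProxy(upstream)

      http.HandleFunc("/tunnel/", func(w http.ResponseWriter, r *http.Request) {
          user, pass, ok := r.BasicAuth()
          if !ok ||
              subtle.ConstantTimeCompare([]byte(user), []byte("alice")) != 1 ||
              subtle.ConstantTimeCompare([]byte(pass), []byte("hunter2")) != 1 {
              w.Header().Set("WWW-Authenticate", `Basic realm="tunnel"`)
              http.Error(w, "unauthorized", http.StatusUnauthorized)
              return
          }
          proxy.ServeHTTP(w, r)
      })

      // Works with self-signed or Let's Encrypt certs; paths are placeholders.
      log.Fatal(http.ListenAndServeTLS(":443", "cert.pem", "key.pem", nil))
  }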

rollcat 4/13/2025||
Your requirements are quite specific, and HTTP servers are built to be generic and flexible. You can probably put something together with nginx and some Lua, aka OpenResty: <https://openresty.org/>

> his

I believe it's their.

deknos 4/13/2025||
oops, yes, sorry, their.
immibis 4/13/2025||
Tor's Webtunnel?
deknos 4/13/2025||
But I do not want to go OVER Tor, I just want a service over clearnet? Or is this something else? Do you have a URL?
immibis 4/13/2025||
I presume the protocol can be separated from Tor itself and I also presume this standalone thing doesn't exist yet.

In any situation, you're going to need some custom client code to route your traffic through the tunnel you opened, so I'm not sure why the login page that opens the tunnel needs to be browser-compatible?

dmtfullstack 4/13/2025||
Humans are served by bots. Any bot requesting traffic is doing so on behalf of a human somewhere.

What is the problem with bots asking for traffic, exactly?

Context of my perspective: I am a contractor for a team that hosts thousands of websites on a Kubernetes cluster. All of the websites are on a storage cluster (combination of ZFS and Ceph) with SATA and NVMe SSDs. The machines in the storage cluster and also the machines the web endpoints run on have tons of RAM.

We see a lot of traffic from what are obviously scraping bots. They haven't caused any problems.

Tarq0n 4/14/2025|
Ok? Not everyone has the same resources or technical sophistication.
udev4096 4/13/2025||
PoW captchas are not new. What's different with Anubis? How can it possibly prevent "AI" scrapers if the bots have enough compute to solve the PoW challenge? AI companies have quite a lot of GPUs at their disposal, and I wouldn't be surprised if they used them to get around PoW captchas.
relistan 4/13/2025|
The point is to make it expensive to crawl your site. Anyone determined to do so is not blocked. But why would they be determined to do so for some random site? The value to the AI crawler likely does not match the cost to crawl it. It will just move on to another site.

So the point is not to be faster than the bear. It’s to be faster than your fellow campers.

genewitch 4/13/2025||
Why not have them hash pow for btc then?
sprremix 4/13/2025||
Why must everything involve $'s?
genewitch 4/13/2025||
Because there's a lot of rhetoric about how this "balances the imbalance between serving a request and making that request", and if we're having them do SHA-256, why not have them do sha256(sha256(data + random nonce)) and potentially earn the site owner some money?
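
For reference, the Bitcoin-style double hash being suggested is just the construction below; actually earning anything with it would additionally require producing valid block headers that meet the network's current difficulty target, which is far beyond what a page-load challenge provides.

  // Bitcoin-style double SHA-256 over data plus a nonce. On its own this
  // earns nothing; it only shows the hash construction the comment refers to.
  package main

  import (
      "crypto/sha256"
      "encoding/binary"
      "encoding/hex"
      "fmt"
  )

  func doubleSHA256(data []byte, nonce uint32) string {
      buf := make([]byte, len(data)+4)
      copy(buf, data)
      binary.LittleEndian.PutUint32(buf[len(data):], nonce)
      first := sha256.Sum256(buf)
      second := sha256.Sum256(first[:])
      return hex.EncodeToString(second[:])
  }

  func main() {
      fmt.Println(doubleSHA256([]byte("example header"), 42))
  }
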
matt3210 4/13/2025||
A package which includes the cool artwork would be awesome
xena 4/13/2025|
You mean with the art assets extracted?

  $ mkdir -p ./tmp/anubis/static && anubis --extract-resources=./tmp/anubis/static
babuloseo 4/13/2025||
Nice will try to deploy to my sites after I eat some mac and cheese
matt3210 4/13/2025|
Very nice work!