Posted by todsacerdoti 4/19/2025
The first involved requests from 300,000 unique IPs in a span of a few hours. I analyzed them and found that ~250,000 were from Brazil. I'm used to using ASNs to block network ranges sending this kind of traffic, but in this case they were spread thinly over 6,000+ ASNs! I ended up blocking all of Brazil (sorry).
A few days later this same web server was on fire again. I performed the same analysis on IPs and found a similar number of unique addresses, but spread across Turkey, Russia, Argentina, Algeria and many more countries. What is going on?! Eventually I think I found a pattern to identify the requests, in that they were using ancient Chrome user agents. Chrome 40, 50, 60 and up to 90, all released 5 to 15 years ago. Then, just before I could implement a block based on these user agents, the traffic stopped.
In both cases the traffic from datacenter networks was limited because I already rate limit a few dozen of the larger ones.
Sysadmin life...
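For anyone fighting the same pattern, here's a minimal sketch (my own illustration, not the setup described above) of that user-agent heuristic in Python; the version cutoff and sample strings are arbitrary assumptions:

```python
# Rough sketch of the heuristic above: flag requests whose User-Agent claims an
# ancient Chrome major version. The cutoff is an illustrative assumption.
import re

ANCIENT_CHROME_CUTOFF = 100  # hypothetical threshold; tune for your own traffic
CHROME_RE = re.compile(r"Chrome/(\d+)\.")

def looks_like_ancient_chrome(user_agent: str) -> bool:
    """Return True if the UA claims a Chrome major version below the cutoff."""
    match = CHROME_RE.search(user_agent or "")
    return bool(match) and int(match.group(1)) < ANCIENT_CHROME_CUTOFF

if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.85 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    ]
    for ua in samples:
        print(looks_like_ancient_chrome(ua), ua[:60])
```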
It's a reverse proxy that presents a PoW challenge to every new visitor. It shifts the initial cost of accessing your server's resources back onto the client. Assuming your uplink can handle 300k clients requesting a single 70 kB web page, it should solve most of your problems.
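For anyone curious what such a challenge looks like, here's a minimal hashcash-style sketch in Python, in the same spirit as these proxies but not any particular tool's actual implementation; the 20-bit difficulty is an arbitrary assumption:

```python
# Minimal sketch of a SHA-256 proof-of-work challenge: cheap for the server to
# issue and verify, costly for each new client to solve.
import hashlib
import os

DIFFICULTY_BITS = 20  # illustrative; higher = more client CPU per challenge

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def issue_challenge() -> str:
    """Server side: hand the new visitor a random challenge string."""
    return os.urandom(16).hex()

def solve(challenge: str) -> int:
    """Client side: brute-force a nonce; this is where the visitor burns CPU."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash, so verification stays cheap."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS

if __name__ == "__main__":
    c = issue_challenge()
    n = solve(c)         # roughly a second or a few in pure Python at 20 bits
    print(verify(c, n))  # True
```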
For science, can you estimate your peak QPS?
Crazy that I remember the HN post where Anubis's blog post first appeared. I always thought it was a bit funny with the anime, and that it was born out of frustration with (I think AWS?) AI scrapers that wouldn't follow the general rules and kept hammering his git server until it went down. I didn't expect it to blow up all the way to ... the UN.
It was frustration at AWS' Alexa team and their abuse of the commons. Amusingly, if they had replied to my email before I wrote my shitpost of an implementation, this all could have turned out vastly differently.
Also didn't expect you to respond to my comment xD
While reading this comment I went through the slow realization that you are the creator of Anubis, and I had such a smile when I realized you had replied to me.
Also, this project is really nice, but I actually want to ask something. I haven't read the Anubis docs, but could the proof of work be put to use rather than wasted? (I know I might get downvoted because I'm going to mention cryptocurrency, but the Nano currency requires a proof of work for each transaction, so if Anubis did its proof of work to Nano's standards, then theoretically that work could at least be somewhat useful.)
Looking forward to your comment!
The only way I see anything like that incorporated is a folding@home kind of thing that could help humanity as a whole.
Of course, if someone makes it work like you suggested, and it catches on, I will personally haunt your dreams forever. Don't give them any ideas.
It's a pretty effective attack because you get large numbers of individual browsers to contribute. Hosting providers don't care, so unless the site owners are technical enough to filter it out, the attacks can keep going for quite a while.
If they work with Referrer Policy, they should be able to mask themselves fairly well - the ones I saw back then did not.
Digging it up: https://www.washingtonpost.com/news/the-switch/wp/2015/04/10...
Very similar indeed. The attacks I witnessed were easy to block once you identified the patterns (the referrer was visible and they used predictable ?_=... query parameters to try and bypass caches), but very effective otherwise.
I suppose in the event of a hot war, the Internet will be cut quickly to defend against things like the "Great Cannon".
So what are potential solutions? We're somehow still stuck with CAPTCHAs, a 25-year-old concept that wastes millions of human hours and billions in infra costs [0].
How can we enable beneficial automation while protecting against abusive AI crawlers?
As for letting well-behaved crawlers in, I've had an idea for something like DKIM for crawlers. It should be possible to set up a fairly cheap cryptographic scheme that gives crawlers a persistent identity that can't be forged.
Basically, put a header containing a string with today's date, the crawler's IP, and a domain name, followed by a cryptographic signature of that string. The domain has a TXT record with a public key for verifying the identity. It's cheap because the server really only needs to verify the string once, and the crawler only needs to regenerate it once per day.
With that in place, crawlers can crawl with their reputation at stake. The big problem with these rogue scrapers is that they're basically impossible to identify or block, which means they have no incentive to behave well.
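A minimal sketch of that scheme, assuming the Python `cryptography` package and Ed25519 keys; the DNS TXT lookup is left as a comment and all names are illustrative:

```python
# The crawler signs "date|ip|domain" once a day; the site operator fetches the
# public key from a TXT record at that domain (lookup omitted here) and
# verifies the header once per crawler per day.
import base64
import datetime

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)

def make_header(private_key: Ed25519PrivateKey, crawler_ip: str, domain: str) -> str:
    """Crawler side: build the identity header, regenerated once per day."""
    claim = f"{datetime.date.today().isoformat()}|{crawler_ip}|{domain}"
    signature = private_key.sign(claim.encode())
    return f"{claim}|{base64.b64encode(signature).decode()}"

def verify_header(header: str, public_key: Ed25519PublicKey, remote_ip: str) -> bool:
    """Server side: check date, claimed IP, and signature against the published key."""
    claim_date, claimed_ip, domain, sig_b64 = header.split("|")
    claim = f"{claim_date}|{claimed_ip}|{domain}"
    if claim_date != datetime.date.today().isoformat() or claimed_ip != remote_ip:
        return False
    try:
        public_key.verify(base64.b64decode(sig_b64), claim.encode())
        return True
    except InvalidSignature:
        return False

if __name__ == "__main__":
    key = Ed25519PrivateKey.generate()
    header = make_header(key, "203.0.113.7", "crawler.example.com")
    print(verify_header(header, key.public_key(), "203.0.113.7"))  # True
```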
It wouldn't work to prevent the type of behavior described in the linked story.
Not to mention that it's unknown if these are actually from AI companies, or from people pretending to be AI companies. You can set anything as your user agent.
It's more appropriate to mention the specific issue one has with the crawlers, like "they request things too quickly" or "they're overloading my server". From there, it is easier to come to a solution than from just "I hate AI". For example, one would realize that things like Anubis have existed forever; they're just called DDoS protection, specifically the kind that uses proof-of-work schemes (e.g. https://github.com/RuiSiang/PoW-Shield).
This also shifts the discussion away from something that adds to the discrimination against scraping in general, and more towards what is actually the issue: overloading servers, or in other words, DDoS.
It's just like how not all DDoSes are actually hackers or bots. Sometimes a server just can't take the traffic of a large site flooding in. But the result is the same until someone investigates.
Did you mean "against"?
Running SHA hash calculations for a second or so once every week is not bad for users, but with scrapers constantly starting new sessions, they end up spending most of their time running useless JavaScript, slowing them down significantly.
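To put rough numbers on that asymmetry, here's my own back-of-the-envelope, assuming a one-second challenge and illustrative session counts:

```python
# Back-of-the-envelope on the cost asymmetry of a one-second PoW challenge.
# The session counts are illustrative assumptions, not measurements.
CHALLENGE_SECONDS = 1.0

user_sessions_per_week = 1          # a returning visitor with a persistent cookie
scraper_sessions_per_day = 500_000  # a scraper starting a fresh session per request

user_cost = CHALLENGE_SECONDS * user_sessions_per_week
scraper_cost_hours = CHALLENGE_SECONDS * scraper_sessions_per_day / 3600

print(f"User: {user_cost:.0f} CPU-second per week")
print(f"Scraper: {scraper_cost_hours:.0f} CPU-hours per day")  # ~139 hours/day
```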
The most effective alternative to proof of work calculations seems to be remote attestation. The downside is that you're getting captchas if you're one of the 0.1% who disable secure boot and run Linux, but the vast majority of web users will live a captcha free life. This same mechanism could in theory also be used to authenticate welcome scrapers rather than relying on pure IP whitelists.
The broad idea is to use zero knowledge proofs with certification. It sort of flips the public key certification system and adds some privacy.
To get this into place, the powers in charge would need to be swayed.
It won't fully solve the problem, but with the problem relatively identified, you must then ask why people are engaging in this behavior. Answer: money, for the most part. Therefore, follow the money and identify the financial incentives driving this behavior. This leads you pretty quickly to a solution most people would reject out-of-hand: turn off the financial incentive that is driving the enshittification of the web. Which is to say, kill the ad-economy.
Or at least better regulate it while also levying punitive damages significant enough to both dissuade bad actors and encourage entities to view data breaches (or the potential therein) and "leakage"[0] as something that should actually be effectively secured against. After all, there are some upsides to the ad-economy that, without it, would present some hard challenges (e.g., how many people are willing to pay for search? what happens to the vibrant sphere of creators of all stripes who are incentivized by the ad-economy? etc.).
Personally, I can't imagine this would actually happen. Pushback from monied interests aside, most people have given up on the idea of data privacy or personal ownership of their data, if they ever even cared in the first place. So, in the absence of any willingness to do something about the incentive for this malign behavior, we're left with few good options.
0: https://news.ycombinator.com/item?id=43716704 (see comments on all the various ways people's data is being leaked/leached/tracked/etc)
I say we ask Google Analytics to count an AI crawler as a real view. Let’s see who’s most popular.
I imagine that e.g. Youtube would be happy to agree with this. Not that it would turn them against AI generally.
On a separate note, I believe open web scraping has been a massive benefit to the internet on net, and almost entirely positive pre-2021. Web scraping & crawling enables search engines, services like the Internet Archive, walled-garden-busting (like Invidious, yt-dlp, and Nitter), mashups (Spotube, IFTTT, and Plaid would have been impossible to bootstrap without web scraping), and all kinds of interesting data science projects (e.g. scraping COVID-19 stats from local health departments to patch together a picture of viral spread for epidemiologists).
Or do you want a central authority that decides who can do new search engines?
Why are Anubis-type mitigations a half-measure?
It's a half-measure because:
1. You're slowing down scrapers, not blocking them. They will still scrape your site content in violation of robots.txt.
2. Scrapers with more compute than IP proxies will not be significantly bottlenecked by this.
3. This may lead to an arms race where AI companies respond by beefing up their scraping infrastructure, necessitating more difficult PoW challenges, and so on. The end result of this hypothetical would be a more inconvenient and inefficient internet for everyone, including human users.
To be clear: I think Anubis is a great tool for website operators, and one of the best self-hostable options available today. However, it's a workaround for the core problem that we can't reliably distinguish traffic from badly behaving AI scrapers from legitimate user traffic.
Cloudflare (https://developers.cloudflare.com/cache/troubleshooting/alwa...) tags the Internet Archive as operating from 207.241.224.0/20 and 208.70.24.0/21, so disabling the bot-prevention framework on connections from there should be enough.
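A minimal sketch of such an allow-list check, using Python's ipaddress module and the two ranges quoted above; the function name is illustrative:

```python
# Skip the challenge for the Internet Archive's published ranges
# (taken from the comment above).
from ipaddress import ip_address, ip_network

INTERNET_ARCHIVE_RANGES = [
    ip_network("207.241.224.0/20"),
    ip_network("208.70.24.0/21"),
]

def should_skip_challenge(remote_ip: str) -> bool:
    """Return True if the client IP falls inside an allow-listed range."""
    addr = ip_address(remote_ip)
    return any(addr in net for net in INTERNET_ARCHIVE_RANGES)

print(should_skip_challenge("207.241.229.215"))  # True (inside the /20)
print(should_skip_challenge("203.0.113.7"))      # False
```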
New actors have the right to emerge.
Kagi is welcome to scrape from their IP addresses. Other bots that behave are fine too (Huawei and various other Chinese bots don't and I've had to put an IP block on those).
There's no rule that you have to let anyone in who claims to be a web crawler.
The truth is that I sympathize with the people trying to use mobile connections to bypass such a cartel.
What Cloudflare is doing now is worse than the web crawlers themselves and the legality of blocking crawlers with a monopoly is dubious at best.
It is still a pretty good lay-of-the-land.
https://www.trendmicro.com/vinfo/us/security/news/vulnerabil...
People are jumping to conclusions a bit fast over here. Yes, technically it's possible, but this kind of behavior would be relatively easy to spot, because the app would have to make direct connections to the website it wants to scrape.
Your calculator app, for instance, connecting to CNN.com ...
iOS has an App Privacy Report where one can check what connections are made by each app, how often, when the last one happened, etc.
Android by Google doesn't have such a useful feature, of course, but you can run a third-party firewall like PCAPdroid, which I highly recommend.
macOS: Little Snitch.
Windows: Fort Firewall.
Not everyone runs these apps, obviously, only the nerdiest like myself, but we're also the kind of people who would report an app using our devices to build what is, in fact, a zombie or bot network.
I'm not saying it's necessarily false but imo it remains a theory until proven otherwise.
Privacy reports do not include that information. They include broad areas of information the app claims to gather. There is zero connection between those claimed areas and what the app actually does unless app review notices something that doesn't match up. But none of that information is updated dynamically, and it has never actually included the domains the app connects to. You may be confusing it with the old domain declarations for less secure HTTP connections. Once the connections met the system standards you no longer needed to declare it.
Go to a data conference like Neudata and you will see. You can have scraped data from user devices, real-time locations, credit card, Google analytics, etc.
How often is the average calculator-app user checking their Privacy Report? My guess: not often!
That happens from time to time; the latest was no more than two weeks ago, when it was shown that many apps were able to read the list of all other apps installed on an Android device, and that Google refused to fix it.
Do you really believe that an app used to make your device part of a bot network wouldn't be posted over here?
^ edit: my mistake, the server logs I mentioned were from the author's prior blog post on this topic, linked at the top of TFA: https://jan.wildeboer.net/2025/02/Blocking-Stealthy-Botnets/
AKA "why do Cloudflare and Google make me fill out these CAPTCHAs all day"
I don't know why Play Protect/MS Defender/whatever Apple has for antivirus don't classify apps that embed such malware as such. It's ridiculous that this is allowed to go on when detection is so easy. I don't know a more obvious example of a trojan than an SDK library making a user's device part of a botnet.
This is even worse with CG-NAT if you don't have IPv6 to solve the CG-NAT problem.
I don't think the data they collect is used to train anything these days. Cloudflare is using AI generated images for CAPTCHAs and Google's actual CAPTCHAs are easier for bots than humans at this point (it's the passive monitoring that makes it still work a little bit).
Personally, I think the "network sharing" software bundled with apps should fall into the category of potentially unwanted applications along with adware and spyware. All of the above "tag along" with something the user DID want to install, and quietly misuse the user's resources. Proxies like this definitely have an impact for metered/slow connections - I'm tempted to start Wireshark'ing my devices now to look for suspicious activity.
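If anyone wants to do the same spot check, here's a rough sketch assuming scapy and capture privileges; it just tallies outbound TCP destinations so unexpected hosts stand out:

```python
# Count destination IPs of TCP traffic so a "calculator app" talking to
# unexpected hosts becomes visible. Requires scapy and capture privileges.
from collections import Counter

from scapy.all import IP, TCP, sniff

destinations = Counter()

def tally(packet):
    if IP in packet and TCP in packet:
        destinations[packet[IP].dst] += 1

sniff(filter="tcp", prn=tally, count=2000)  # capture 2,000 packets, then stop

for dst, hits in destinations.most_common(15):
    print(f"{hits:6d}  {dst}")
```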
There should be a public repository of apps known to have these shady behaviours. Having done some light web scraping for archival/automation before, it's a pity that it'll become collateral damage in the anti-AI-botfarm fight.
I wouldn't mind reading a comprehensive report on SOTA with regard to bot-blocking.
Sure, there's Anubis (although someone elsethread called it a half-measure, and I'd like to know why), there are CAPTCHAs, there's relying on a monopoly (Cloudflare, etc.) that probably also wants to run its own bots at some point, but what else is there?
Anything incorporating anything like this is malware.
In most cases they are used for conducting real financial crimes, but the police investigators are also aware that there is a very low chance that sophisticated fraud is committed directly from a residential IP address.