I am kind of surprised how many sites seem to want/need this. I get the slow git pages problem for some of the git servers that are super deep, lack caches, serve off slow disks, etc.
UNESCO surprised me some. The sub-site in question is pretty big - it has thousands of documents - but the content is static, so this should be trivial to serve. So what's going on? Well, it looks like it's a poorly deployed WordPress on top of Apache, with no caching enabled, no content compression, and no HTTP/2 or HTTP/3. It would likely be fairly easy to get this serving super cheap on a very small machine, but of course doing so requires some expertise, and expertise still isn't cheap.
Sure you could ask an LLM, but they still aren't good at helping when you have no clue what to ask - if you don't even really know the site is slower than it should be, why would you even ask? You'd just hear about things getting crushed and reach for the furry defender.
Sure, but at the same time, the number of people with the expertise to set up Anubis (not that it's particularly hard, but I mean: to even be aware that it exists) is surely even lower than the number of people with WordPress administration experience, so I'm still surprised.
If I were to guess, the reasons for not touching WordPress were unrelated: not wanting to touch a brittle instance, organizational permissions, or maybe the admins just assumed that WP was already configured well.
The AI scrapers are not only poorly written, they also go out of their way to do cache busting. So far I've seen a few solutions: Cloudflare, requiring a login, Anubis, or just insane amounts of infrastructure. Some sites have reported 60% of their traffic coming from bots now; for smaller sites it's probably much higher.
My guess is that these tools tend to be targeted at mid-sized sites — the sorts of places that are large enough to have useful content, but small enough that there probably won't be any significant repercussions, and where the ops team is small enough (or plain nonexistent) that there's not going to be much in the way of blocks. That's why a site like SourceHut gets hit quite badly, but smaller blogs stay largely out of the way.
But that's just a working theory, without much evidence, to explain why I'm hearing so many people talk about struggling with AI bot traffic while not seeing it myself.
And then the universe blessed me with a natural 20. Never had these problems before. This shit is wild.
> Anubis uses a proof-of-work challenge to ensure that clients are using a modern browser and are able to calculate SHA-256 checksums
https://anubis.techaro.lol/docs/design/how-anubis-works
This is pretty cool, I have a project or two that might benefit from it.
But I find that when it comes to simple serving of content, human vs. bot is not usually what you’re trying to filter or block on. As long as a given client is not abusing your systems, then why do you care if the client is a human?
Well, that's the rub. The bots are abusing the systems. The bots are accessing the content at rates thousands of times faster and more often than humans. The bots also have access patterns unlike your expected human audience (downloading gigabytes or terabytes of data multiple times, over and over).
And these bots aren't some being with rights. They're tools unleashed by humans. It's humans abusing the systems. These are anti-abuse measures.
And if that doesn't happen, you go to their ISP's ISP and get their ISP booted off the Internet.
Actual ISPs and hosting providers take abuse reports extremely seriously, mostly because they're terrified of getting kicked off by their own ISP. And there's no end to that chain - just a series of ISPs between them and you - so you might end up convincing your ISP or some intermediary to block traffic from them. However, as we've seen recently, rules don't apply if enough money is involved. But I'm not sure if these shitty interim solutions come from ISPs ignoring abuse when money is involved, or from people not knowing that abuse reporting is taken seriously to begin with.
Anyone know if it's legal to return a never-ending stream of /dev/urandom based on the user-agent?
You would be surprised by how many ISPs will not respond. Sure, Hetzner will respond, but these abusers are not using Hetzner at all. If you actually study the problem, these are residential ISPs in various countries (including in the US and Europe, mind you). At best the ISP will respond one-by-one to their customers and scan their computers (by which point the abusers have already switched to another IP block), and at worst the ISP literally has no capability to control this because they cannot trace their CGNATted connections (short of blocking connections to your site, which is definitely nuclear).
> And if that doesn't happen, you go to their ISP's ISP and get their ISP booted off the Internet.
Again, the IP blocks are rotated, so by the time they respond you need to do the whole reporting rigmarole again. Additionally, these ISPs would instead suggest that you blackhole the requests or use a commercial filtering solution (aka Cloudflare or something else), because at the end of the day the residential ISPs are national entities, and disconnecting them would quite literally trigger geopolitical concerns.
Yup, I'm assuming that immibis thinks that the ones using Anubis are the ones with high legal budgets, but this is not necessarily the case here.
Assume that you are in the shoes of Anubis users. Do you have a reasonable legal budget? No? From experience, most ISPs won't really respond unless their network has become unstable as a consequence, or legal advised them to cooperate. Realistically, by the time they read your plea the activity has already died off (on their network), and the best they can do is give you the netflows to do your own investigation.
> You think they wouldn't cut off customers who DDoS?
This is not your typical DDoS where the stability of the network links is affected (that's at the ISP level, not specifically your server); this is a very asymmetrical one that seemingly blends in with normal browsing. Unless you have a reasonable legal budget, they would suggest using RTBH (https://www.cisco.com/c/dam/en_us/about/security/intelligenc...) or a commercial filtering solution if need be. And that assumes they're sympathetic to your pleas; in the worst case you're dealing with state-backed ISPs that are known not to respond at all.
They're certainly positioning themselves as providers of scraping servers for AI training. What will they do when I say that one of their customers just hit my server with 1,000 requests per second? Ban the customer?
Let's be rational. They'll laugh at that mail and delete it. Bigger players use "home proxying" services which use residential blocks for egress and make one request per host. Some people are cutting whole countries off with firewalls.
Playing by the old rules won't get you anywhere, because all these gentlemen took their computers and went to work elsewhere. Now all we have are people who think they need no permission because what they do is awesome anyway (which it is not).
At the minimum, they're very likely to have a talk with their customer: "keep this shit up and you're outta here"
And those residential proxy services cost their customer around $0.50/GB up to $20/GB. Do with that knowledge what you will.
Good luck with that. Have you ever tried? AWS and Google have abuse emails. Do you think they read them? Do you think they care? It is basically impossible to get AWS to shut down a customer's systems, regardless of how hard you try.
I believe ARIN has an abuse email registered for a Google subnet, with a comment that they believe it's correct, but no one answered the last time they tried it, three years ago.
The hierarchy is: abuse contact of the netblock. If ignored: abuse contact of the AS. If ignored: the local internet registry (LIR) managing the AS. If ignored: a regional internet registry (RIR) like ARIN.
I see a possibility of automation here.
Also, report to DNSBL providers like Spamhaus. They rely on reports to blacklist single IPs, escalate to whole blocks and then the next larger subnet, until enough customers are affected.
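On the automation point: the abuse contact for a netblock can be looked up programmatically over RDAP. Here's a hypothetical sketch (using the rdap.org redirector and the Python requests library; the vCard parsing is deliberately simplistic and ignores nested entities, so treat it as a starting point, not a finished tool):

    import requests

    def abuse_contacts(ip: str) -> list[str]:
        # rdap.org redirects to whichever registry (ARIN, RIPE, ...) owns the block.
        data = requests.get(f"https://rdap.org/ip/{ip}", timeout=10).json()
        emails = []
        for entity in data.get("entities", []):
            if "abuse" not in entity.get("roles", []):
                continue
            # vcardArray looks like ["vcard", [[name, params, type, value], ...]]
            for field in entity.get("vcardArray", ["vcard", []])[1]:
                if field[0] == "email":
                    emails.append(field[3])
        return emails

    if __name__ == "__main__":
        print(abuse_contacts("203.0.113.7"))  # substitute a real offending address

From there, escalating up the chain (AS contact, LIR, RIR) is the part that still takes a human.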
In the interest of bringing the AI bickering to HN: I think one could accurately characterize "block bots just in case they choose to request too much data" as discrimination! Robots of course don't have any rights so it's not wrong, but it certainly might be unwise.
Not when the bots are actively programmed to thwart them by using far-flung IP address carousels, request pacing, spoofed user agents and similar techniques. It's open war these days.
I’m afraid that it doesn’t change anything in and of itself, and solutions that only allow the users you’re okay with are what’s direly needed all across the web.
Though reading about the people trying to mine crypto on a CI solution, it feels like sometimes it won’t just be LLM scrapers that you need to protect against, but any number of malicious people.
At that point, you might as well run an invite only community.
I just worry about the idea of running public/free services on the web, due to the potential for misuse and bad actors, though making things paid also seems sensible, e.g. what was linked: https://man.sr.ht/ops/builds.sr.ht-migration.md
If the abuser is using request pacing to make fewer requests, then that's making the abuser less abusive. If your complaint is that the pacing is designed to stay just below the point of bringing your server down while still costing you money, then you can counteract that by tuning your rate limits even further down.
The tens of thousands of distinct IP addresses are another (and perfectly valid) issue, but that was not the point I was answering.
There's been numerous posts on HN about people getting slammed, to the tune of many, many dollars and terabytes of data from bots, especially LLM scrapers, burning bandwidth and increasing server-running costs.
I largely suspect that these are mostly other bots pretending to be LLM scrapers. Does anyone even check whether the bots' IP ranges belong to the AI companies?
When I worked at Wikimedia (so ending ~4 years ago) we had several incidents of bots getting lost in a maze of links within our source repository browser (Phabricator) which could account for > 50% of the load on some pretty powerful Phabricator servers (Something like 96 cores, 512GB RAM). This happened despite having those URLs excluded via robots.txt and implementing some rudimentary request throttling. The scrapers were using lots of different IPs simultaneously and they did not seem to respect any kind of sane rate limits. If googlebot and one or two other scrapers hit at the same time it was enough to cause an outage or at least seriously degrade performance.
Eventually we got better at rate limiting and put more URLs behind authentication but it wasn't an ideal situation and would have been quite difficult to deal with had we been much more resource-constrained or less technically capable.
Sounds like a fun project for an AbuseIPDB contributor. Could look for fake Googlebots / Bingbots, etc, too.
What better way to show the effectiveness of your solution, than to help create the problem in the first place.
Google really wants everyone to use its spyware-embedded browser.
There are tons of other "anti-bot" solutions that don't have a conflict of interest with those goals, yet the ones that become popular all seem to further them instead.
It may have some other downsides - for example, I don't think Google is possible in a world where everyone requires proof of work (some may argue that's a good thing) - but it doesn't specifically gate bots. It gates mass scraping.
Alternatively shared resources similar in spirit to common crawl but scaled up could be used. That would have the benefit of democratizing the ability to create and operate large scale search indexes.
The point of this is that there has recently been a massive explosion in the number of bots that blatantly, aggressively, and maliciously ignore anti-abuse gates and attempt to bypass them (mass IP/VPN switching, user agent swapping, etc.).
A funny line from his docs
Tangentially, I was wondering how this would impact common search engines (not AI crawlers) and how this compares to Cloudflare’s solution to stop AI crawlers, and that’s explained on the GitHub page. [1]
> Installing and using this will likely result in your website not being indexed by some search engines. This is considered a feature of Anubis, not a bug.
> This is a bit of a nuclear response, but AI scraper bots scraping so aggressively have forced my hand.
> In most cases, you should not need this and can probably get by using Cloudflare to protect a given origin. However, for circumstances where you can't or won't use Cloudflare, Anubis is there for you.
We are still making some improvements, like passing Open Graph tags through so that at least rich previews work!
Love them too, and abhor knowing that someone is bound to eventually remove them because they're found to be "problematic" in one way or another.
[1] https://discourse.gnome.org/t/anime-girl-on-gnome-gitlab/276...
I built my own solution that effectively blocks these "bad bots" at the network level. I block the entirety of several large "Big Tech / Big LLM" networks at the ASN (BGP) level, using MaxMind's database and a custom WAF and reverse proxy I put together.
Simply put, you risk blocking legitimate traffic. This solution does as well, but for most humans the actual risk is much lower.
As much as I'd love to not need JavaScript and to support users who run with it disabled, I've never once had a customer or end user complain about needing JavaScript enabled.
It is an incredibly vocal minority who disapprove of requiring JavaScript, the majority of whom, upon encountering a site for which JavaScript is required, simply enable it. I'd speculate that, even then, only a handful ever release a defeated sigh.
There is going to be a pretty big refactor soon, but once that's done we plan on crushing this out.
- Block Bad Bots. There's a simple text file called `bad_bots.txt`
- Block Bad ASNs. There's a simple text file called `bad_asns.txt`
There's also another for blocking IP(s) and IP ranges called `bad_ips.txt`, but it's often more effective to block a much larger range of IPs (at the ASN level).
To give you a concrete idea, here are some examples:
    $ cat etc/caddy/waf/bad_asns.txt
    # CHINANET-BACKBONE No.31,Jin-rong Street, CN
    # Why: DDoS
    4134

    # CHINA169-BACKBONE CHINA UNICOM China169 Backbone, CN
    # Why: DDoS
    4837

    # CHINAMOBILE-CN China Mobile Communications Group Co., Ltd., CN
    # Why: DDoS
    9808

    # FACEBOOK, US
    # Why: Bad Bots
    32934

    # Alibaba, CN
    # Why: Bad Bots
    45102

    # Why: Bad Bots
    28573
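For anyone curious how the ASN check itself can be done, here's a rough Python sketch (assuming the geoip2 library and a local GeoLite2-ASN database from MaxMind; my real setup does this inside the reverse proxy, so this is just to illustrate the idea, not the actual middleware):

    import geoip2.database
    import geoip2.errors

    def load_bad_asns(path: str = "etc/caddy/waf/bad_asns.txt") -> set[int]:
        # Lines starting with "#" are comments (the org name and the "why").
        asns = set()
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#"):
                    asns.add(int(line))
        return asns

    BAD_ASNS = load_bad_asns()
    READER = geoip2.database.Reader("GeoLite2-ASN.mmdb")

    def is_blocked(client_ip: str) -> bool:
        try:
            asn = READER.asn(client_ip).autonomous_system_number
        except geoip2.errors.AddressNotFoundError:
            return False  # unknown networks fall through; tune to taste
        return asn in BAD_ASNS

One lookup per request against a local database is cheap, which is why blocking at the ASN level scales better than maintaining huge per-IP lists.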
The "good enough" solution is the existing and widely used SHA( seed, nonce ). That could easily be integrated into a lower level of the stack if the tech giants wanted it.
Personally, I don't think the UX is that bad since I don't have to do anything. I definitely prefer it to captchas.
If you don't do this, the third-party cookie blocking that strict Enhanced Tracking Protection enables will completely destroy your ability to access websites hosted behind Cloudflare, because it is impossible for Cloudflare to know that you have solved the CAPTCHA.
This is what causes the infinite CAPTCHA loops. It doesn't matter how many of them you solve: Firefox won't let Cloudflare make a note that you have solved it, so when it reloads the page, as far as Cloudflare can tell, you just tried to load the page again without solving it.
This sounds like "we only save hashed minutiae of your biometrics"
Yes?
HTTP is stateless. It always has been and it always will be. If you want to pass state between page visits (like "I am logged in to account ..." or "My shopping cart contains ..." or "I solved a CAPTCHA at ..."), you need to be given, and return back to the server on subsequent requests, cookies that encapsulate that information, or encapsulate a reference to an identifier that the server can associate with that information.
This is nothing new. Like gruez said in a sibling comment; this is what session cookies do. Almost every website you ever visit will be giving you some form of session cookie.
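A toy sketch of that pattern, just to make the mechanics concrete (the names here are made up; any real framework does the same thing with more care around expiry, signing, and the Secure flag):

    import secrets
    from http.cookies import SimpleCookie
    from http.server import BaseHTTPRequestHandler, HTTPServer

    SESSIONS = {}  # server-side state, keyed by an opaque random identifier

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            cookies = SimpleCookie(self.headers.get("Cookie", ""))
            sid = cookies["sid"].value if "sid" in cookies else None
            if sid not in SESSIONS:
                # First visit (or unknown id): issue a fresh identifier.
                sid = secrets.token_urlsafe(32)
                SESSIONS[sid] = {"challenge_passed": False}
            self.send_response(200)
            # The browser sends this cookie back on later requests; that is the
            # only way the server can tie them to the state stored above.
            self.send_header("Set-Cookie", f"sid={sid}; HttpOnly; Path=/")
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(repr(SESSIONS[sid]).encode())

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()

Block the cookie and the server has no way to remember you, which is exactly the infinite-CAPTCHA failure mode described above.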
What is not within your rights is to require the site owner to build their own solution to your specs to solve those problems or to require the site owner to just live with those problems because you want to view the content.
It usually becomes reasonable to object to the status quo long before the legislature is compelled to move to fix things.
I do think that it's reasonable for the service to provide alternative methods of interacting with it when possible. Phone lines, Mail, Email could all be potential escape hatches. But if a site is on the internet it is going to need protecting eventually.
I don't know that "3rd party session cookies" or "JS" are reasonable objections, but I definitely have privacy concerns. And I have encountered situations where I wasn't presented with a captcha but was instead unconditionally blocked. That's frustrating but legally acceptable if it's a small time operator. But when it's a contracted tech giant I think it's deserving of scrutiny. Their practices have an outsized footprint.
> service to provide alternative methods of interacting with it when possible
One of the most obvious alternative methods is logging in with an existing account, but on many websites I've found the login portal barricaded behind a screening measure which entirely defeats that.
> if a site is on the internet it is going to need protecting eventually
Ah yes, it needs "protection" from "bots" to ensure that your page visit is "secure". Preventing DoS is understandable, but many operators simply don't want their content scraped for reasons entirely unrelated to service uptime. Yet they try to mislead the visitor regarding the reason for the inconvenience.
Or worse, the government operations that don't care but are blindly implementing a compliance checklist. They sometimes stick captchas in the most nonsensical places.
You realize this is the same as session cookies, which are used on nearly every site, even those where you're not logging in?
>This sounds like "we only save hashed minutiae of your biometrics"
A randomly generated identifier is nowhere close to "hashed minutiae of your biometrics".
Your assumption is that anyone at Cloudflare cares. But guess what: it's a self-fulfilling prophecy of a bot being blocked, because not a single step in the UX/UI allows a real user to complain about it, and therefore all blocked humans must also be bots.
Just pointing out the flaw of bot blocking in general, because you seem to be absolutely unaware of it. The reported success rate of bot blocking is always 100%, never less, because anything less would mean admitting that your tech does nothing, really.
Statistically, the ones really using bots can bypass it easily.
Tor and VPNs arguably have the same issue. I use both and haven't experienced "infinite loops" with either. The same can't be said of google, reddit, or many other sites using other security providers. Those either have outright bans, or show captchas that require far more effort to solve than clicking a checkbox.
E.g. Anubis here works fine for me, completely outclassing the CF interstitial page with its simplicity.
Genuine question: why not turn the proof-of-work challenge into mining that literally generates some revenue for the website? Not a new idea, but when I looked at the docs it didn't seem like this challenge was tied to any coin with monetary value.
This is coming from someone who is NOT a big crypto person, but it strikes me that this would be a much better way to monetize organic, high-quality content in this day and age. Basically the idea that the Brave browser started with, meeting its moment.
I'm sure Xe has already considered this. Do they have a blog post about this anywhere?