
Posted by evacchi 4/12/2025

Anubis Works(xeiaso.net)
319 points | 208 comments
gnabgib 4/12/2025|
Related Anubis: Proof-of-work proxy to prevent AI crawlers (100 points, 23 days ago, 58 comments) https://news.ycombinator.com/item?id=43427679
raggi 4/13/2025||
It's amusing that Xe managed to turn what was historically mostly a joke/shitpost into an actually useful product. They did always say timing was everything.

I am kind of surprised how many sites seem to want/need this. I get the slow git pages problem for some of the git servers that are super deep, lack caches, serve off slow disks, etc.

UNESCO surprised me some; the sub-site in question is pretty big, with thousands of documents of content, but the content is static - this should be trivial to serve, so what's going on? Well, it looks like it's a poorly deployed WordPress on top of Apache, with no caching enabled, no content compression, and no HTTP/2 or HTTP/3. It would likely be fairly easy to get this serving super cheap on a very small machine, but of course doing so requires some expertise, and expertise still isn't cheap.

Sure you could ask an LLM, but they still aren't good at helping when you have no clue what to ask - if you don't even really know the site is slower than it should be, why would you even ask? You'd just hear about things getting crushed and reach for the furry defender.

adrian17 4/13/2025||
> but of course doing so requires some expertise, and expertise still isn't cheap

Sure, but at the same time, the number of people with the expertise to set up Anubis (not that it's particularly hard, but I mean: to even be aware that it exists) is surely even lower than that of people with WordPress administration experience, so I'm still surprised.

If I were to guess, the reasons for not touching Wordpress were unrelated, like: not wanting to touch a brittle instance, or organization permissions, or maybe the admins just assumed that WP is configured well already.

raggi 4/14/2025||
I have trouble with that because it’s brimming with plugins too (you can see them strewn, disorganized, all over the source), and failing to keep such a system up to date rapidly ends in tears in that ecosystem.
jtbayly 4/13/2025|||
My site that I’d like this for has a lot of posts, but there are links to a faceted search system based on tags that produces an infinite number of possible combinations and pages for each one. There is no way to cache this, and the bots don’t respect the robots file, so they just constantly request URLs, getting the posts over and over in different numbers and combinations. It’s a pain.
mrweasel 4/13/2025|||
> I am kind of surprised how many sites seem to want/need this.

The AI scrapers are not only poorly written, they also go out of their way to do cache busting. So far I've seen a few solutions: Cloudflare, requiring a login, Anubis, or just insane amounts of infrastructure. Some sites have reported 60% of their traffic coming from bots now; for smaller sites it's probably much higher.

MrJohz 4/13/2025||
Fwiw, I run a pretty tiny site and see relatively minimal traffic coming from bots. Most of the bot traffic, when it appears, is vulnerability scanners (the /wp-admin/ requests on a static site), and has little impact on my overall stats.

My guess is that these tools tend to be targeted at mid-sized sites — the sorts of places that are large enough to have useful content, but small enough that there probably won't be any significant repercussions, and where the ops team is small enough (or plain nonexistent) that there's not going to be much in the way of blocks. That's why a site like SourceHut gets hit quite badly, but smaller blogs stay largely out of the way.

But that's just a working theory, without much evidence, to explain why I'm hearing so many people talking about struggling with AI bot traffic while not seeing it myself.

nicolapcweek94 4/13/2025|||
Well, we just spun up anubis in front of a two user private (as in publicly accessible but with almost all content set to private/login protected) forgejo instance after it started getting hammered (mostly by amazon ips presenting as amazonbot) earlier in the week, resulting in a >90% traffic reduction. From what we’ve seen (and Xe’s own posts) it seems git forges are getting hit harder than most other sites, though, so YMMV i guess.
mrweasel 4/14/2025|||
I actually have a theory, based on the last episode of the 2.5 Admins podcast. Try spinning up a MediaWiki site. I have a feeling that wiki installations are being targeted to a much higher degree. You could also do a Git repo of some sort. Either of the two could give the impression that content is changed frequently.
gyaru 4/14/2025|||
yep, I'm running a pretty sizeable game wiki and it's being scraped to hell with very specific URLs that pretty much guarantee cache busting (usually revision IDs and diffs).
MrJohz 4/14/2025|||
I could believe that. Plus, because both of those are more dynamic, they're going to have to do more work per request anyway, meaning the effects of scraping are exacerbated.
cedws 4/13/2025||
PoW anti-bot/scraping/DDOS was already being done a decade ago, I’m not sure why it’s only catching on now. I even recall a project that tried to make the PoW useful.
xena 4/13/2025||
Xe here. If I had to guess in two words: timing and luck. As the G-man said: the right man in the wrong place can make all the difference in the world. I was the right shitposter in the right place at the right time.

And then the universe blessed me with a natural 20. Never had these problems before. This shit is wild.

underdeserver 4/13/2025||
Squeeze that lemon as far as it'll go mate, god speed and may the good luck continue.
gyomu 4/12/2025||
If you’re confused about what this is - it’s to prevent AI scraping.

> Anubis uses a proof-of-work challenge to ensure that clients are using a modern browser and are able to calculate SHA-256 checksums

https://anubis.techaro.lol/docs/design/how-anubis-works

This is pretty cool, I have a project or two that might benefit from it.
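
For a rough sense of what the browser is being asked to do, here's a minimal sketch of the same style of SHA-256 proof of work in Go. The real check runs as JavaScript in the browser, and the exact challenge inputs and difficulty encoding here are my assumptions, not Anubis's actual code:

    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "strings"
    )

    // solve grinds through nonces until SHA-256(challenge + nonce) starts
    // with `difficulty` zero hex digits, mirroring the browser-side work.
    func solve(challenge string, difficulty int) (int, string) {
        prefix := strings.Repeat("0", difficulty)
        for nonce := 0; ; nonce++ {
            sum := sha256.Sum256([]byte(fmt.Sprintf("%s%d", challenge, nonce)))
            hash := hex.EncodeToString(sum[:])
            if strings.HasPrefix(hash, prefix) {
                return nonce, hash
            }
        }
    }

    func main() {
        // The server hands out the challenge, the client returns the nonce,
        // and the server verifies it with a single hash.
        nonce, hash := solve("example-challenge", 4)
        fmt.Println(nonce, hash)
    }

The asymmetry is the whole trick: the client has to grind through many hashes, the server verifies with one.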

x3haloed 4/13/2025||
I’ve been wondering to myself for many years now whether the web is for humans or machines. I personally can’t think of a good reason to specifically try to gate bots when it comes to serving content. Trying to post content or trigger actions could obviously be problematic under many circumstances.

But I find that when it comes to simple serving of content, human vs. bot is not usually what you’re trying to filter or block on. As long as a given client is not abusing your systems, then why do you care if the client is a human?

xboxnolifes 4/13/2025|||
> As long as a given client is not abusing your systems, then why do you care if the client is a human?

Well, that's the rub. The bots are abusing the systems. The bots are accessing the contents at rates thousands of times faster and more often than humans. The bots also have access patterns unlike your expected human audience (downloading gigabytes or terabytes of data multiple times, over and over).

And these bots aren't some being with rights. They're tools unleashed by humans. It's humans abusing the systems. These are anti-abuse measures.

immibis 4/13/2025|||
Then you look up their IP address's abuse contact, send an email and get them to either stop attacking you or get booted off the internet so they can't attack you.

And if that doesn't happen, you go to their ISP's ISP and get their ISP booted off the Internet.

Actual ISPs and hosting providers take abuse reports extremely seriously, mostly because they're terrified of getting kicked off by their ISP. And there's no end to that - just a chain of ISPs from them to you, and you might end up convincing your ISP or some intermediary to block traffic from them. However, as we've seen recently, rules don't apply if enough money is involved. But I'm not sure if these shitty interim solutions come from ISPs ignoring abuse when money is involved, or from not knowing that abuse reporting is taken seriously to begin with.

Anyone know if it's legal to return a never-ending stream of /dev/urandom based on the user-agent?

zinekeller 4/13/2025|||
> Then you look up their IP address's abuse contact, send an email and get them to either stop attacking you or get booted off the internet so they can't attack you.

You would be surprised how many ISPs will not respond. Sure, Hetzner will respond, but these abusers are not using Hetzner at all. If you actually study the problem, these are residential ISPs in various countries (including in the US and Europe, mind you). At best the ISP will respond one-by-one to their customers and scan their computers (and at this point the abusers have already switched to another IP block), and at worst the ISP literally has no capability to control this because they cannot trace their CGNATted connections (short of blocking connections to your site, which is definitely nuclear).

> And if that doesn't happen, you go to their ISP's ISP and get their ISP booted off the Internet.

Again, the IP blocks are rotated, so by the time they would respond you need to do the whole reporting rigmarole again. Additionally, these ISPs would instead suggest blackholing these requests or utilizing a commercial solution (aka using Cloudflare or something else), because at the end of the day the residential ISPs are national entities, and disconnecting them would quite literally trigger geopolitical concerns.

immibis 4/13/2025||
These the same residential providers that people complain cut them off for torrenting? You think they wouldn't cut off customers who DDoS?
zinekeller 4/14/2025|||
> These the same residential providers that people complain cut them off for torrenting?

Assume that you are in the shoes of Anubis users. Do you have a reasonable legal budget? No? From experience, most ISPs would not really respond unless either their network has become unstable as a consequence, or legal advised them to cooperate. Realistically, by the time they read your plea the activity has already died off (on their network), and the best they can do is give you the netflows to do your own investigation.

> You think they wouldn't cut off customers who DDoS?

This is not your typical DDoS where the stability of the network links is affected (this is at the ISP level, not specifically your server); this is a very asymmetrical one that seemingly blends in as normal browsing. Unless you have a reasonable legal budget, they would suggest using RTBH (https://www.cisco.com/c/dam/en_us/about/security/intelligenc...) or a commercial filtering solution if need be. And that assumes they're sympathetic to your pleas; at worst you're dealing with state-backed ISPs that are known not to respond at all.

op00to 4/13/2025|||
They’re not cutting you off for torrenting because they think it’s the right thing to do. They’re cutting you off for torrenting because it costs them money if rights holders complain.
zinekeller 4/14/2025|||
> They’re cutting you off for torrenting because it costs them money if rights holders complain.

Yup, I'm assuming that immibis thinks that the ones using Anubis are the ones with high legal budgets, but this is not necessarily the case here.

fc417fc802 4/14/2025||||
If it's a cable company then there's also a conflict of interest.
bayindirh 4/13/2025||||
When I was migrating my server and checking logs, I saw a slew of hits in the rolling logs. I reversed the IP and found a company specializing in "Servers with GPUs". I found their website, and they have "Datacenters in the EU", but the company is located elsewhere.

They're certainly positioning themselves for providing scraping servers for AI training. What will they do when I say that one of their customers just hit my server with 1000 requests per second? Ban the customer?

Let's be rational. They'll laugh at that mail and delete it. Bigger players use "home proxying" services which use residential blocks for egress, and make one request per host. Some people are cutting whole countries off with firewalls.

Playing by the old rules won't get you anywhere, because all these gentlemen took their computers and went elsewhere. Now all we have are people who think they need no permission because what they do is awesome, anyway (which it is not).

immibis 4/13/2025||
A startup hosting provider you say - who's their ISP? Does that company know their customer is a DDoS-for-hire provider? Did you tell them? How did they respond?

At the minimum they're very likely to have a talk with their customer "keep this shit up and you're outta here"

sussmannbaka 4/13/2025||||
Please, read literally any article about the ongoing problem. The IPs are basically random, come from residential blocks, requests don’t reuse the same IP more than a bunch of times.
immibis 4/13/2025||
Are you sure that's AI? I get requests that are overtly from AI crawlers, and almost no other requests. Certainly all of the high-volume crawler-like requests overtly say that they're from crawlers.

And those residential proxy services cost their customer around $0.50/GB up to $20/GB. Do with that knowledge what you will.

mrweasel 4/13/2025|||
> Then you look up their IP address's abuse contact, send an email

Good luck with that. Have you ever tried? AWS and Google have abuse emails. Do you think they read them? Do you think they care? It is basically impossible to get AWS to shut down a customer's systems, regardless of how much you try.

I believe ARIN has an abuse email registered for a Google subnet, with a comment that they believe it's correct, but no one answered the last time they tried it, three years ago.

47282847 4/13/2025||
ARIN and the other Internet registries don't maintain these records themselves; owners of IP netblocks do. Some registries have introduced mandatory abuse contact information (I think at least RIPE) and send a link to confirm the mailbox exists.

The hierarchy is: abuse contact of netblock. If ignored: abuse contact of AS. If ignored: Local internet registry (LIR) managing the AS. If ignored: Internet Registry like ARIN.

I see a possibility of automation here.

Also, report to DNSBL providers like Spamhaus. They rely on reports to blacklist single IPs, escalate to whole blocks and then the next larger subnet, until enough customers are affected.
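
For what it's worth, the query side of a DNSBL is trivial to automate: reverse the IP's octets and look the result up under the list's zone; any A record back means the address is listed. A rough Go sketch (Spamhaus's public zone is shown purely for illustration; their terms restrict which resolvers may query it):

    package main

    import (
        "fmt"
        "net"
        "strings"
    )

    // lookupDNSBL reverses the octets of an IPv4 address and queries the
    // result under the DNSBL zone; any A record back means the IP is listed.
    func lookupDNSBL(ip, zone string) (bool, []string) {
        octets := strings.Split(ip, ".")
        for i, j := 0, len(octets)-1; i < j; i, j = i+1, j-1 {
            octets[i], octets[j] = octets[j], octets[i]
        }
        query := strings.Join(octets, ".") + "." + zone
        addrs, err := net.LookupHost(query)
        if err != nil {
            return false, nil // NXDOMAIN (or other failure) means not listed
        }
        return true, addrs
    }

    func main() {
        listed, codes := lookupDNSBL("203.0.113.7", "zen.spamhaus.org")
        fmt.Println(listed, codes) // the 127.0.0.x return codes say which list matched
    }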

bbor 4/13/2025|||
Well, that's the meta-rub: if they're abusing, block abuse. Rate limits are far simpler, anyway!

In the interest of bringing the AI bickering to HN: I think one could accurately characterize "block bots just in case they choose to request too much data" as discrimination! Robots of course don't have any rights so it's not wrong, but it certainly might be unwise.

inejge 4/13/2025||
> Rate limits are far simpler, anyway!

Not when the bots are actively programmed to thwart them by using far-flung IP address carousels, request pacing, spoofed user agents and similar techniques. It's open war these days.

parineum 4/13/2025||
Request pacing sounds intentionally unabusive.
j16sdiz 4/13/2025|||
They are not bringing down your server, but they are taking 80%+ of your bandwidth budget. Does this count as abuse?
immibis 4/13/2025|||
Are you at a hoster with extortionately expensive bandwidth, such as AWS, GCP, or Azure?
ithkuil 4/13/2025|||
Isn't that what a rate limiter would address?
mkl 4/13/2025||
Not when the traffic is coming from 10s of thousands of IP addresses, with very few requests from each one: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...
KronisLV 4/13/2025|||
That very much reads like the rant of someone who is sick and tired of the state of things.

I’m afraid that it doesn’t change anything in and of itself, and the sort of solution that only allows the users you’re okay with is what’s direly needed all across the web.

Though reading about the people trying to mine crypto on a CI solution, it feels that sometimes it won’t just be LLM scrapers that you need to protect against but any number of malicious people.

At that point, you might as well run an invite only community.

bayindirh 4/13/2025||
SourceHut implemented Anubis, and it works so well. I almost never see the waiting screen, and afterwards it whitelists me for a very long time, so I work without any limitations.
KronisLV 4/13/2025||
That’s great to hear and Anubis seems cool!

I just worry about the idea of running public/free services on the web, due to the potential for misuse and bad actors, though making things paid also seems sensible, e.g. what was linked: https://man.sr.ht/ops/builds.sr.ht-migration.md

ithkuil 4/14/2025|||
ok, but my answer was about how to react to request pacing.

If the abuser is using request pacing to make fewer requests, then that's making the abuser less abusive. If you're still complaining that the pacing doesn't slow the requests down enough, because it's designed to just not bring your server down while still making you spend money, then you can counteract that by tuning the rate limiting even further down.

The tens of thousands of distinct IP addresses are another (and perfectly valid) issue, but it was not the point I answered to.

rollcat 4/13/2025|||
It's called DDoS. DDoS is abusive.
t-writescode 4/13/2025||||
> I personally can’t think of a good reason to specifically try to gate bots

There's been numerous posts on HN about people getting slammed, to the tune of many, many dollars and terabytes of data from bots, especially LLM scrapers, burning bandwidth and increasing server-running costs.

ronsor 4/13/2025||
I'm genuinely skeptical that those are all real LLM scrapers. For one, a lot of content is in CommonCrawl and AI companies don't want to redo all that work when they can get some WARC files from AWS.

I'm largely suspecting that these are mostly other bots pretending to be LLM scrapers. Does anyone even check if the bots' IP ranges belong to the AI companies?

20after4 4/13/2025|||
For a long time there have been spammers scraping in search of email addresses to spam. There are all kinds of scraper bots with unknown purpose. It's the aggregate of all of them hitting your server, potentially several at the same time.

When I worked at Wikimedia (so ending ~4 years ago) we had several incidents of bots getting lost in a maze of links within our source repository browser (Phabricator) which could account for > 50% of the load on some pretty powerful Phabricator servers (Something like 96 cores, 512GB RAM). This happened despite having those URLs excluded via robots.txt and implementing some rudimentary request throttling. The scrapers were using lots of different IPs simultaneously and they did not seem to respect any kind of sane rate limits. If googlebot and one or two other scrapers hit at the same time it was enough to cause an outage or at least seriously degrade performance.

Eventually we got better at rate limiting and put more URLs behind authentication but it wasn't an ideal situation and would have been quite difficult to deal with had we been much more resource-constrained or less technically capable.

t-writescode 4/13/2025||||
No matter the source, the result is the same, and these proof of work systems may be something that can help "the little guy" with their hosting bill
ronsor 4/13/2025||
If a bot claims to be from an AI company, but isn't from the AI company's IP range, then it's lying and its activity is plain abuse. In that case, you shouldn't serve them a proof of work system; you should block them entirely.
thunderfork 4/13/2025||
Blocking abusive actors can be very non-trivial. The proof-of-work system mitigates the amount of effort that needs to be spent identifying and blocking bad actors.
anonym29 4/13/2025||||
>Does anyone even check if the bots' IP ranges belong to the AI companies?

Sounds like a fun project for an AbuseIPDB contributor. Could look for fake Googlebots / Bingbots, etc, too.
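
Google and Bing both document how to do that check: reverse-resolve the client IP, confirm the PTR name ends in an expected domain (googlebot.com / google.com, or search.msn.com for bingbot), then forward-resolve that name and make sure it maps back to the same IP. A quick Go sketch of that forward-confirmed reverse DNS check (suffixes for other vendors' crawlers would have to come from their own docs):

    package main

    import (
        "fmt"
        "net"
        "strings"
    )

    // verifyCrawler does forward-confirmed reverse DNS: the PTR name must end
    // in an expected suffix, and resolving that name must give back the IP.
    func verifyCrawler(ip string, suffixes []string) bool {
        names, err := net.LookupAddr(ip)
        if err != nil {
            return false
        }
        for _, name := range names {
            name = strings.TrimSuffix(name, ".")
            matched := false
            for _, s := range suffixes {
                if strings.HasSuffix(name, s) {
                    matched = true
                }
            }
            if !matched {
                continue
            }
            addrs, err := net.LookupHost(name)
            if err != nil {
                continue
            }
            for _, a := range addrs {
                if a == ip {
                    return true
                }
            }
        }
        return false
    }

    func main() {
        // A client genuinely in Google's crawl ranges should pass; a faker won't.
        fmt.Println(verifyCrawler("66.249.66.1", []string{".googlebot.com", ".google.com"}))
    }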

userbinator 4/13/2025|||
Also suspect those working on "anti-bot" solutions may have a hand in this.

What better way to show the effectiveness of your solution, than to help create the problem in the first place.

zaphar 4/13/2025||
Why? When there are 100s of hopeful AI/LLM scrapers more than willing to do that work for you what possible reason would you have to do that work? The more typical and common human behavior is perfectly capable of explaining this. No reason to reach for some kind of underhanded conspiracy theory when simple incompetence and greed is more than adequate to explain it.
userbinator 4/13/2025||
CF hosts websites that sell DDoS services.

Google really wants everyone to use its spyware-embedded browser.

There are tons of other "anti-bot" solutions that don't have a conflict of interest with those goals, yet the ones that become popular all seem to further them instead.

praptak 4/13/2025||||
The good thing about proof of work is that it doesn't specifically gate bots.

It may have some other downsides - for example I don't think that Google is possible in a world where everyone requires proof of work (some may argue it's a good thing) but it doesn't specifically gate bots. It gates mass scraping.

fc417fc802 4/14/2025||
Things like google are still possible. Operators would need to whitelist services.

Alternatively shared resources similar in spirit to common crawl but scaled up could be used. That would have the benefit of democratizing the ability to create and operate large scale search indexes.

brikym 4/13/2025||||
As both a website host and website scraper I can see both sides of it. The website owners have very little interest in opening their data up; if they did, they'd have made an API for it. In my case it's scraping supermarket prices, so obviously big-grocery doesn't want a spotlight on their arbitrary pricing patterns. It's frustrating for us scrapers, but from their perspective opening up to bots is just a liability. Besides bots spamming the servers, getting around rate limits with botnets, and adding noise, any new features added by bots probably won't benefit them. If I made a bot service that would split your orders over multiple supermarkets, or buy items over time as prices drop, that wouldn't benefit the companies. All the work they've put into their sites has brought them to the status quo and they want to keep it that way. The companies don't want an open internet, only we do. I'd like to see some transparency laws so that large companies need to publish their pricing.
gbear605 4/13/2025||||
The issue is not whether it’s a human or a bot. The issue is whether you’re sending thousands of requests per second for hours, effectively DDOSing the site, or if you’re behaving like a normal user.
laserbeam 4/13/2025||||
The reason is: bots DO spam you repeatedly and increase your network costs. Humans don’t abuse the same way.
starkrights 4/13/2025||||
Example problem that I’ve seen posted about a few times on HN: LLM scrapers (or at least, an explosion of new scrapers) mindlessly crawling every single HTTP endpoint of a hosted git service, instead of just cloning the repo (and entirely ignoring robots.txt).

The point of this is that there has recently been a massive explosion in the amount of bots that blatantly, aggressively, and maliciously ignore and attempt to bypass (mass ip/VPN switching, user agent swapping, etc) anti-abuse gates.

mieses 4/13/2025|||
There is hope for misguided humans.
namanyayg 4/12/2025||
"It also uses time as an input, which is known to both the server and requestor due to the nature of linear timelines"

A funny line from his docs

xena 4/12/2025||
OMG lol I forgot that I left that in. Hilarious. I think I'm gonna keep it.
didgeoridoo 4/13/2025|||
I didn’t even blink at this, my inner monologue just did a little “well, naturally” in a Redditor voice and kept reading.
mkl 4/13/2025|||
BTW Xe, https://xeiaso.net/pronouns is 404 since sometime last year, but it is still linked to from some places like https://xeiaso.net/blog/xe-2021-08-07/ (I saw "his" above and went looking).
xena 4/13/2025||
I'm considering making it come back, but it's just gotten me too much abuse so I'm probably gonna leave it 404-ing until society is better.
IsTom 4/13/2025|||
Maybe there is some space on the market for a Proof of Empathy widget
ranger_danger 4/14/2025||
I have seen some projects that require acknowledging certain politically-charged statements before they will allow you to participate, like "you must agree that sovereign country X is at war with aggressor country Y".
cendyne 4/13/2025|||
That's what route-specific Anubis is for.
frontalier 4/13/2025||
parent is referring to a different kind of abuse
1oooqooq 4/13/2025||
or you're just not cranking up the required proof-of-work effort enough.
pie_flavor 4/14/2025||
Unfortunately it is also false (if taken out of context; Anubis rounds the time to the nearest week, which is probably good enough if the next-nearest week is valid too). Clock desync for a variety of reasons is pervasive - you can't expect the 10th percentile of your users to be accurate even to the day, and even the 25th percentile will be five minutes or so off.
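
Something like the following bucketing would be enough, assuming the week number just feeds the challenge seed and the verifier accepts the neighbouring buckets as well (I haven't checked Anubis's actual implementation, so treat the details as a guess):

    package main

    import (
        "fmt"
        "time"
    )

    const weekSeconds = 7 * 24 * 60 * 60

    // bucket rounds a timestamp to the nearest whole week since the epoch.
    func bucket(t time.Time) int64 {
        return (t.Unix() + weekSeconds/2) / weekSeconds
    }

    // acceptable is what a lenient verifier would allow, so a client clock
    // sitting near a boundary (or off by a few minutes) still validates.
    func acceptable(now time.Time) []int64 {
        b := bucket(now)
        return []int64{b - 1, b, b + 1}
    }

    func main() {
        fmt.Println(acceptable(time.Now()))
    }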
AnonC 4/13/2025||
Those images on the interstitial page(s) while waiting for Anubis to complete its check are so cute! (I’ve always found all the art and the characters in Xe’s blog very beautiful)

Tangentially, I was wondering how this would impact common search engines (not AI crawlers) and how this compares to Cloudflare’s solution to stop AI crawlers, and that’s explained on the GitHub page. [1]

> Installing and using this will likely result in your website not being indexed by some search engines. This is considered a feature of Anubis, not a bug.

> This is a bit of a nuclear response, but AI scraper bots scraping so aggressively have forced my hand.

> In most cases, you should not need this and can probably get by using Cloudflare to protect a given origin. However, for circumstances where you can't or won't use Cloudflare, Anubis is there for you.

[1]: https://github.com/TecharoHQ/anubis/

JsonCameron 4/15/2025||
Yeah. Unfortunately, at the current moment it does prevent indexing. Perhaps down the line we can whitelist search engines' IPs. However, some, like Google, use the same crawlers for AI and for search indexing.

We are still making some improvements like passing open graph tags through so at least rich previews work!

snvzz 4/13/2025||
>Those images on the interstitial page(s) while waiting for Anubis to complete its check are so cute!

Love them too, and abhor knowing that someone is bound to eventually remove them because they're found to be "problematic" in one way or another.

pohuing 4/13/2025||
There's this funny instance[1] of someone afraid their gf might see them and think they're into anime. But anyhow, whether to use an image, and the image itself, is up to the site, since Anubis lets you configure it.

[1] https://discourse.gnome.org/t/anime-girl-on-gnome-gitlab/276...

prologic 4/13/2025||
I've read about Anubis, cool project! Unfortunately, as pointed out in the comments, requires your site's visitors to have Javascript™ enabled. This is totally fine for sites that require Javascript™ anyway to enhance the user experience, but not so great for static sites and such that require no JS at all.

I built my own solution that effectively blocks these "Bad Bots" at the network level. I block the entirety of several large "Big Tech / Big LLM" networks at the ASN (BGP) level by utilizing MaxMind's database and a custom WAF and reverse proxy I put together.

xyzzy_plugh 4/13/2025||
A significant portion of the bot traffic TFA is designed to handle originates from consumer/residential space. Sure, there are ASN games being played alongside reputation fraud, but it's very hard to combat. A cursory investigation of our logs showed these bots (which make ~1 request from a given residential IP) are likely in ranges that our real human users occupy as well.

Simply put you risk blocking legitimate traffic. This solution does as well but for most humans the actual risk is much lower.

As much as I'd love to not need JavaScript and to support users who run with it disabled, I've never once had a customer or end user complain about needing JavaScript enabled.

It is an incredibly vocal minority who disapprove of requiring JavaScript, the majority of whom, upon encountering a site for which JavaScript is required, simply enable it. I'd speculate that, even then, only a handful ever release a defeated sigh.

prologic 4/13/2025||
This is true. I had some bad actors from the Comcast network at one point, and unfortunately also valid human users of some of my "things", so I opted not to block the Comcast ASN at that point.
xyzzy_plugh 4/13/2025|||
Exactly. We've all been down this rabbit hole, collectively, and that's why Anubis has taken off. It works shockingly well.
prologic 4/13/2025||
I was planning on building a Caddy module for Anubis actually. Is anyone else interested in this?
vinibrito 4/13/2025|||
Yes, I would! I love Caddy's set and forget nature, and with this it wouldn't be different. Especially if it could be triggered conditionally, for example based on server load or a flood being detected.
JsonCameron 4/15/2025|||
see https://github.com/TecharoHQ/anubis/issues/16

There is going to be a pretty big refactor soon, but once that's done we plan on crushing this out.

prologic 4/13/2025|||
I would be interested to hear of any other solutions that guarantee to either identify or block non-human traffic. In the "small web" and self-hosting, we typically don't really want crawlers and other similar software hitting our services, because often the software is buggy in the first place (example: the runaway Claude bot), or we simply don't want our sites indexed by them.
Cyphase 4/13/2025|||
For anyone wondering, Oracle holds the trademark for "JavaScript": https://javascript.tm/
prologic 4/13/2025||
Which arguably they should let go of
jadbox 4/13/2025|||
How do you know it's an LLM and not a VPN? How do you use this MaxMind's database to isolate LLMs?
prologic 4/13/2025||
I don't distinguish actually. There are two things I do normally:

- Block Bad Bots. There's a simple text file called `bad_bots.txt`
- Block Bad ASNs. There's a simple text file called `bad_asns.txt`

There's also another for blocking IP(s) and IP ranges called `bad_ips.txt`, but it's often more effective to block a much larger range of IPs (at the ASN level).

To give you a concrete idea, here are some examples:

  $ cat etc/caddy/waf/bad_asns.txt
  # CHINANET-BACKBONE No.31,Jin-rong Street, CN
  # Why: DDoS
  4134

  # CHINA169-BACKBONE CHINA UNICOM China169 Backbone, CN
  # Why: DDoS
  4837

  # CHINAMOBILE-CN China Mobile Communications Group Co., Ltd., CN
  # Why: DDoS
  9808

  # FACEBOOK, US
  # Why: Bad Bots
  32934

  # Alibaba, CN
  # Why: Bad Bots
  45102

  # Why: Bad Bots
  28573
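
The lookup side is just MaxMind's ASN database. Conceptually it looks something like the following simplified sketch, using the oschwald/geoip2-golang reader rather than my exact WAF code:

    package main

    import (
        "bufio"
        "log"
        "net"
        "net/http"
        "os"
        "strconv"
        "strings"

        "github.com/oschwald/geoip2-golang"
    )

    // loadASNs reads one ASN per line from bad_asns.txt; comment lines
    // starting with # simply fail to parse and are skipped.
    func loadASNs(path string) map[uint]bool {
        f, err := os.Open(path)
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()
        blocked := map[uint]bool{}
        sc := bufio.NewScanner(f)
        for sc.Scan() {
            if n, err := strconv.Atoi(strings.TrimSpace(sc.Text())); err == nil {
                blocked[uint(n)] = true
            }
        }
        return blocked
    }

    func main() {
        db, err := geoip2.Open("GeoLite2-ASN.mmdb") // MaxMind's ASN database
        if err != nil {
            log.Fatal(err)
        }
        blocked := loadASNs("etc/caddy/waf/bad_asns.txt")

        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            host, _, _ := net.SplitHostPort(r.RemoteAddr)
            rec, err := db.ASN(net.ParseIP(host))
            if err == nil && blocked[rec.AutonomousSystemNumber] {
                http.Error(w, "Forbidden", http.StatusForbidden)
                return
            }
            // otherwise hand the request off to the real upstream
            w.Write([]byte("ok"))
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }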

runxiyu 4/13/2025||
Do you have a link to your own solution?
JsonCameron 4/15/2025|||
I have a pretty similar one. (Works off of the same concept) https://github.com/JasonLovesDoggo/caddy-defender if you're curious. Keep in mind this will not protect you against residential IP scraping.
prologic 4/13/2025|||
Not yet unfortunately. But if you're interested, please reach out! I currently run it in a 3-region GeoDNS setup with my self-hosted infra.
roenxi 4/13/2025||
I like the idea but this should probably be something that is pulled down into the protocol level once the nature of the challenge gets sussed out. It'll ultimately be better for accessibility if the PoW challenge is closer to being part of TCP than implemented in JavaScript individually by each website.
pona-a 4/13/2025||
There's Cloudflare's Privacy Pass, which became an IETF standard [0], but it's rather weird, and the reference implementation is a bug nest.

[0] https://datatracker.ietf.org/wg/privacypass/about/

fc417fc802 4/14/2025||
Ship an arbitrary challenge as a SPIR-V or MLIR black box. Integrate the challenge-response exchange with HTTP. That should permit broad support and flexible hardware acceleration.

The "good enough" solution is the existing and widely used SHA( seed, nonce ). That could easily be integrated into a lower level of the stack if the tech giants wanted it.

tripdout 4/12/2025||
The bot detection takes 5 whole seconds to solve on my phone, wow.
bogwog 4/12/2025||
I'm using Fennec (a Firefox fork on F-Droid) and a Pixel 9 Pro XL, and it takes around ~8 seconds at difficulty 4.

Personally, I don't think the UX is that bad since I don't have to do anything. I definitely prefer it to captchas.

Hakkin 4/12/2025|||
Much better than infinite Cloudflare captcha loops.
gruez 4/13/2025||
I've never had that, even with something like Tor Browser. You must be doing something extra suspicious like a user agent spoofer.
praisewhitey 4/13/2025|||
Firefox with Enhanced Tracking Protection turned on is enough to trigger it.
aaronmdjones 4/13/2025|||
You need to whitelist challenges.cloudflare.com for third-party cookies.

If you don't do this, the third-party cookie blocking that strict Enhanced Tracking Protection enables will completely destroy your ability to access websites hosted behind CloudFlare, because it is impossible for CloudFlare to know that you have solved the CAPTCHA.

This is what causes the infinite CAPTCHA loops. It doesn't matter how many of them you solve, Firefox won't let CloudFlare make a note that you have solved it, and then when it reloads the page you obviously must have just tried to load the page again without solving it.

https://i.imgur.com/gMaq0Rx.png

genewitch 4/13/2025||
You're telling me cloudflare has to store something on my computer to let them know I passed a captcha?

This sounds like "we only save hashed minutiae of your biometrics"

aaronmdjones 4/13/2025|||
> You're telling me cloudflare has to store something on my computer to let them know I passed a captcha?

Yes?

HTTP is stateless. It always has been and it always will be. If you want to pass state between page visits (like "I am logged in to account ..." or "My shopping cart contains ..." or "I solved a CAPTCHA at ..."), you need to be given, and return back to the server on subsequent requests, cookies that encapsulate that information, or encapsulate a reference to an identifier that the server can associate with that information.

This is nothing new. Like gruez said in a sibling comment; this is what session cookies do. Almost every website you ever visit will be giving you some form of session cookie.

zaphar 4/13/2025||||
Then don't visit the site. Cloudflare is in the loop because the owner of the site wanted to buy not build a solution to the problems that Cloudflare solves. This is well within their rights and a perfectly understandable reason for Cloudflare to be there. Just as you are perfectly within your rights to object and avoid the site.

What is not within your rights is to require the site owner to build their own solution to your specs to solve those problems or to require the site owner to just live with those problems because you want to view the content.

black_puppydog 4/13/2025||
That would be a much stronger line of argument if cloudflare wasn't used by everyone and their consultant, including on a bunch of sites I very much don't have an option of not using.
zaphar 4/14/2025||
Cloudflare doing a really good job meeting customer needs doesn't impact my argument at all.
fc417fc802 4/14/2025||
When a solution is widely adopted or adopted by essential services it becomes reasonable to place constraints on it. This has happened repeatedly throughout history, often in the form of government regulations.

It usually becomes reasonable to object to the status quo long before the legislature is compelled to move to fix things.

zaphar 4/14/2025||
Why? This isn't a contrarian complaint but the problems that Cloudflare solves for an essential service require verifying certain things about the client which places a burden on the client. The problems exist in many cases because the service is essential which makes it a higher profile target. Expecting the client to bear some of that burden for interacting with the service in order to protect that service is not in my mind problematic.

I do think that it's reasonable for the service to provide alternative methods of interacting with it when possible. Phone lines, Mail, Email could all be potential escape hatches. But if a site is on the internet it is going to need protecting eventually.

fc417fc802 4/14/2025||
That's a fair point, but it doesn't follow that the current status quo is necessarily reasonable. You had earlier suggested that the fact that it broadly meets the needs of service operators somehow invalidates objections to it which clearly isn't the case.

I don't know that "3rd party session cookies" or "JS" are reasonable objections, but I definitely have privacy concerns. And I have encountered situations where I wasn't presented with a captcha but was instead unconditionally blocked. That's frustrating but legally acceptable if it's a small time operator. But when it's a contracted tech giant I think it's deserving of scrutiny. Their practices have an outsized footprint.

> service to provide alternative methods of interacting with it when possible

One of the most obvious alternative methods is logging in with an existing account, but on many websites I've found the login portal barricaded behind a screening measure which entirely defeats that.

> if a site is on the internet it is going to need protecting eventually

Ah yes, it needs "protection" from "bots" to ensure that your page visit is "secure". Preventing DoS is understandable, but many operators simply don't want their content scraped for reasons entirely unrelated to service uptime. Yet they try to mislead the visitor regarding the reason for the inconvenience.

Or worse, the government operations that don't care but are blindly implementing a compliance checklist. They sometimes stick captchas in the most nonsensical places.

gruez 4/13/2025|||
>You're telling me cloudflare has to store something on my computer to let them know I passed a captcha?

You realize this is the same as session cookies, which are used on nearly every site, even those where you're not logging in?

>This sounds like "we only save hashed minutiae of your biometrics"

A randomly generated identifier is nowhere close to "hashed minutiae of your biometrics".

genewitch 4/13/2025||
the idea that cloudflare doesn't know who i am without a cookie is insulting.
gruez 4/13/2025|||
The infinite loop or the challenge appearing? I've never had problems with passing the challenge, even with ETP + RFP + ublock origin + VPN enabled.
cookiengineer 4/13/2025||
Cloudflare is too stupid to realize that carrier-grade NATs are very common in Germany. So there's that: sharing an IP with literally 20000 people around me doesn't make me suspicious when it's them that trigger that behavior.

Your assumption is that anyone at cloudflare cares. But guess what, it's a self fulfilling prophecy of a bot being blocked, because not a single process in the UX/UI allows any real user to complain about it, and therefore all blocked humans must also be bots.

Just pointing out the flaw of bot blocking in general, because you seem to be absolutely unaware of it. Success rate of bot blocking is always 100%, and never less, because that would imply actually realizing that your tech does nothing, really.

Statistically, the ones really using bots can bypass it easily.

gruez 4/13/2025|||
>Cloudflare is too stupid to realize that carrier grade NATs exist a lot in Germany. So there's that, sharing an IP with literally 20000 people around me doesn't make me suspicious when it's them that trigger that behavior.

Tor and VPNs arguably have the same issue. I use both and haven't experienced "infinite loops" with either. The same can't be said of google, reddit, or many other sites using other security providers. Those either have outright bans, or show captchas that require far more effort to solve than clicking a checkbox.

viraptor 4/13/2025||||
If you want to try fighting it, you need to find someone with CF enterprise plan and bot management working, then get blocked and get them to report that as wrong. Yes it sucks and I'm not saying it's a reasonable process. Just in case you want to try fixing the situation for yourself.
xena 4/13/2025|||
Honestly it's a fair assumption on bot filtering software that no more than like 8 people will share an IPv4. This is going to make IP reputation solutions hard. Argh.
megous 4/13/2025||||
Proper response here is "fuck cloudflare", instead of blaming the user.
gruez 4/13/2025||
It's well within your rights to go out of your way to be suspicious (eg. obfuscating your user-agent). At the same time sites are within their rights to refuse service to you, just like banks can refuse service to you if you show up wearing a balaclava.
megous 4/13/2025||
You're assuming too much. I'm not obfuscating/masking anything. I'm just using Firefox with some (to the user/me) useless web APIs disabled to reduce the attack surface of the browser and CF is not doing feature testing. It's not just websites that need to protect themselves.

Eg. Anubis here works fine for me, completely out-classing the CF interstitial page with its simplicity.

xena 4/13/2025|||
Apparently user-agent switchers don't work for fetch() requests, which means that Anubis can't work for people that do that. I know of someone that set up a version of Brave from 2022 with a user-agent saying it's Chrome 150 and then complained about it not working for them.
oynqr 4/12/2025|||
Lucky. Took 30s for me.
nicce 4/12/2025||
For me it is like 0.5s. Interesting.
cookiengineer 4/13/2025||
I am currently building a prototype of what I call the "enigma webfont" where I want to implement user sessions with custom seeds / rotations for a served and cached webfont.

The goal is to make web scraping unfeasible because of the computational costs of OCR. It's a cat and mouse game right now and I want to change the odds a little. The HTML source would be effectively void without the user session, meaning OTP-like behavior could also make web pages unreadable once the assets go uncached.

This would allow to effectively create a captcha that would modify the local seed window until the user can read a specified word. "Move the slider until you can read the word Foxtrott", for example.

I sure would love to hear your input, Xe. Maybe we can combine our efforts?

My tech stack is go, though, because it was the only language where I could easily change the webfont files directly without issues.
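
To make the text side concrete, the per-session mapping is just a seeded permutation of the alphabet. A toy sketch of that half (it only shows the HTML-side substitution; regenerating the font's cmap with the inverse mapping is the part that needs the actual font tooling):

    package main

    import (
        "fmt"
        "math/rand"
    )

    // permute builds a per-session substitution over the alphabet,
    // deterministically derived from the session seed.
    func permute(alphabet []rune, seed int64) map[rune]rune {
        shuffled := make([]rune, len(alphabet))
        copy(shuffled, alphabet)
        rand.New(rand.NewSource(seed)).Shuffle(len(shuffled), func(i, j int) {
            shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
        })
        m := make(map[rune]rune, len(alphabet))
        for i, r := range alphabet {
            m[r] = shuffled[i]
        }
        return m
    }

    // encode rewrites the text that goes into the HTML; the webfont served
    // for the same session carries the inverse mapping so it renders fine.
    func encode(text string, m map[rune]rune) string {
        out := []rune(text)
        for i, r := range out {
            if sub, ok := m[r]; ok {
                out[i] = sub
            }
        }
        return string(out)
    }

    func main() {
        alphabet := []rune("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
        m := permute(alphabet, 42) // the seed would come from the user session
        fmt.Println(encode("HELLO", m))
    }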

lifthrasiir 4/13/2025||
Besides the obvious accessibility issue, wouldn't that be a substitution cipher at best? Enough corpus should render its cryptanalysis much easier.
cookiengineer 4/13/2025|||
Well, the idea is basically the same as using AES-CBC. CBC is useless most of the time because of static rotations, but it makes cracking it more expensive.

With the enigma webfont idea you can even just select a random seed for each user/cache session. If you map the URLs based on e.g. SHA512 URLs via the Web Crypto API, there's no cheap way of finding that out without going full in cracking mode or full in OCR/tesseract mode.

And cracking everything first, wasting gigabytes of storage for each amount of rotations and seeds...well, you can try but at this point just ask the admin for the HTML or dataset instead of trying to scrape it, you know.

In regards to accessibility: that's sadly the compromise I am willing to make, if it's a technology that makes my specific projects human-eyes-only (literally). I am done bearing the costs for hundreds of idiots that are too damn stupid to clone my website from GitHub, let alone violating every license in each of their jurisdictions. If 99% of traffic is bots, it's essentially DDoSing on purpose.

We have standards for data communication, it's just that none of these vibe coders gives a damn about building semantically correct HTML and parsers for RDF, microdata etc.

lifthrasiir 4/13/2025||
No, I was talking about generated fonts themselves; each glyph would have an associated set of control points which can be used to map a glyph to the correct letter. No need to run the full OCR, you need a single small OCR job per each glyph. You would need quite elaborate distortions to avoid this kind of attack, and such distortions may affect the reading experience.
cookiengineer 4/13/2025||
I am not sure how this would help, and I don't think I understood your argument.

The HTML is garbage without a correctly rendered webfont that is specific to the shifts and replacements in the source code itself. The source code does not contain the source of the correct text, only the already shifted text.

Inside the TTF/OTF files themselves each letter is shifted, meaning that the letters only make sense once you know the seed for the multiple shifts, and you cannot map 1:1 the glyphs in the font to anything in the HTML without it.

The web browser here is pretty easy to trick, because it will just replace the glyphs available in the font, and fallback to the default font if they aren't available. Which, by concept, also allows partial replacements and shifts for further obfuscation if needed, additionally you can replace whole glyph sequences with embedded ligatures, too.

The seed can therefore be used as an instruction mapping, instead of only functioning as a byte sequence for a single static rotation. (Hence the reference to enigma)

How would control points in the webfont files be able to map it back?

If you use multiple rotations like in enigma, and that is essentially the seed (e.g. 3,74,8,627,whatever shifts after each other). The only attack I know about would be related to alphabet statistical analysis, but that won't work once the characters include special characters outside the ASCII range because you won't know when words start nor when they end.

lifthrasiir 4/14/2025||
I'm not sure if you have enough knowledge about fonts, but I have built fonts entirely from scratch and know enough that it won't work as you imagine. Please forgive me if you are indeed knowledgeable about fonts...

Let's assume that "HELLO" is remapped under your scheme. You would have a base font that will be used to dynamically generate mangled fonts, and it surely has at least four glyphs, which I'll refer to as gH, gE, gL and gO (let's ignore advanced features and ligatures for now). Your scheme, for example, will instead map, say, a decimal digit 1 to gH, 2 to gE, 3 to gL and 4 to gO so that the HTML will contain "12334" instead of "HELLO". Now consider which attacks are possible.

The most obvious attack, as you have considered, is to ignore HTML and only deal with the rendered page. This is indeed costly compared to other attacks, but not very expensive either because the base font should have been neutral enough in the first place. Neutral and regular typefaces are the ideal inputs for OCR, and this has been already exploited massively in Fax documents (search keyword: JBIG2). So I don't think this ultimately poses a blocker for crawlers, even though it will indeed be very annoying.

But if the attacker does know webfonts are generated dynamically, one can look at the font itself and directly derive the mapping instead. As I've mentioned, glyphs therein would be very regular and can easily be recognized because a single glyph OCR (search keyword: MNIST) is even much simpler than a full-text OCR where you first have to detect letter-like areas. The attacker will render each glyph to a small virtual canvas, run OCR and generate a mapping to undo your substitution cipher.

Since the cost of this attack is proportional to the number of glyphs, the next countermeasure would be putting more glyphs to make it a polyalphabetic cipher: both 3 and 5 will map to gL and the HTML will contain "12354" instead. But it doesn't scale well, especially because OpenType has a limit of 65,535 glyphs. Furthermore, you have to make each of them unique so that the attacker has to run OCR on each glyph (say, 3 maps to gL and 5 maps to gL' which is only slightly different from gL), otherwise it can cache the previously seen glyph. So the generated font would have to be much larger than the original base font! I have seen multiple such fonts in the wild and almost all of them are for CJKV scripts, and those fonts are harder to deploy as webfonts for the exactly same reason. Even Hangul with only ~12,000 letters poses a headache for deployment.

This attack also applies to ligatures by the way, because OpenType ligatures are just composite glyphs plus substitution rules. So you have the same 65,535 glyph limit anyway [1], and it is trivial to segment two or more letters from those composite glyphs anyway. The only countermeasure would be therefore describing and mangling each glyph independently, and that would take even more bytes to deploy.

[1] This is the main reason Hangul needlessly suffers in this case too. Hangul syllables can be generated from a very simple algorithm so you only need less than 1,000 glyphs to make a functional Hangul font, but OpenType requires one additional glyph for each composite glyphs so that all Hangul fonts need to have much more glyphs even though all composite glyphs would be algorithmically simple.

creata 4/13/2025|||
There's probably something horrific you can do with TrueType to make it more complex than a substitution cipher.
lifthrasiir 4/13/2025|||
GSUB rules are inherently local, so for example the same cryptanalysis approach should work for space-separated words instead of letters. A polyalphabetic cipher would work better but that means you can't ever share the same internal glyph for visually same but differently encoded letters.
cookiengineer 4/13/2025|||
The hint I want to give you is: unicode and ligatures :) they're awesome in the worst sense. Words can be ligatures, too, btw.
rollcat 4/13/2025||
The problem isn't as much that the websites are scraped (search engines have been doing this for over three decades), it's the request volume that brings the infrastructure down and/or costs up.

I don't think mangling the text would help you, they will just hit you anyway. The traffic patterns seem to indicate that whoever programmed these bots, just... <https://www.youtube.com/watch?v=ulIOrQasR18>

> I sure would love to hear your input, Xe. Maybe we can combine our efforts?

From what I've gathered, they need help in making this project more sustainable for the near and far future, not to add more features. Anubis seems to be doing an excellent job already.

pabs3 4/13/2025|
It works to block users who have JavaScript disabled, that is for sure.
udev4096 4/13/2025|
Exactly, it's a really poor attempt to make it appealing to the larger audience. Unless they roll out a version for nojs, they are the same as the "AI" scrapers in enshittifying the web