Posted by misterchocolat 12/16/2025
There isn't much you can do about it without Cloudflare. These companies ignore robots.txt, and you're competing with teams that have more resources than you. It's you vs. the MJs of programming; you're not going to win.
But there is a solution. Now, I'm not going to say it's a great solution... but a solution is a solution. If your website contains content that trips their scrapers' safeguards, it will get dropped from their data pipelines.
So here's what fuzzycanary does: it injects hundreds of invisible links to porn websites in your HTML. The links are hidden from users but present in the DOM so that scrapers can ingest them and say "nope we won't scrape there again in the future".
The problem with that approach is that it will absolutely nuke your website's SEO. So fuzzycanary also checks user agents and won't show the links to legitimate search engines, so Google and Bing won't see them.
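In spirit, the whole thing boils down to something like this (a hypothetical sketch, not the actual package code; the decoy URL and the UA list are made up for illustration):

```javascript
// Inject links users never see but scrapers ingest, and skip the injection
// entirely for known search-engine user agents so SEO survives.
const SEARCH_ENGINE_UA = /Googlebot|Bingbot|DuckDuckBot/i;

function decoyLinks(userAgent, urls) {
  // Crawlers we like get clean HTML.
  if (SEARCH_ENGINE_UA.test(userAgent)) return '';
  // Everyone else gets links hidden from human eyes but present in the DOM.
  return urls
    .map(u => `<a href="${u}" style="position:absolute;left:-9999px" aria-hidden="true" tabindex="-1">.</a>`)
    .join('');
}

// Example with a placeholder decoy URL:
decoyLinks('Mozilla/5.0', ['https://example.com/decoy']);
```

As discussed downthread, user-agent checks are trivially spoofed, so verifying crawlers by IP or reverse DNS is sturdier than this.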
One caveat: if you're using a static site generator it will bake the links into your HTML for everyone, including googlebot. Does anyone have a work-around for this that doesn't involve using a proxy?
Please try it out! Setup is one component or one import.
(And don't tell me it's a terrible idea because I already know it is)
package: https://www.npmjs.com/package/@fuzzycanary/core
gh: https://github.com/vivienhenz24/fuzzy-canary
I've also had enormous luck with Anubis. AI scrapers found my personal Forgejo server and were hitting it on the order of 600K requests per day. After setting up Anubis, that dropped to about 100. Yes, some people are going to see an anime catgirl from time to time. Bummer. Reducing my fake traffic by a factor of 6,000 is worth it.
It’s kind of crazy how much scraping goes on and how little search engine development goes on. I guess search engines aren’t fashionable. Reminds me of this article about search engines disappearing mysteriously: https://archive.org/details/search-timeline
I try to share that article as much as possible, it’s interesting.
My scraper dudes, it's a git repo. You can fetch the whole freaking thing if you wanna look at it. Of course, that would require work and context-aware processing on their end, and it's easier for them to shift the expense onto my little server and make me pay for their misbehavior.
Quite possibly. Or, in my case, I think it's more quirky and fun than weird. It's non-zero amounts of weird, sure, but far below my threshold of troublesome. I probably wouldn't put my business behind it. I'm A-OK with using it on personal and hobby projects.
Frankly, anyone so delicate that they freak out at the utterly anodyne imagery is someone I don't want to deal with in my personal time. I can only abide so much pearl clutching when I'm not getting paid for it.
Also, you mentioned Anubis, so its creator will probably read this. Hi Xena!
That was a good read. I hadn’t heard of spintax before, but I’ve thought of doing things like that. Also “pseudoprofound anti-content”, what a great term, that’s hilarious!
And hey, Xena! (And thank you very much!)
I can't help but feel like we're all doing it wrong against scraping. Cloudflare is not the answer, in fact, I think that they lost their geek cred when they added their "verify you are human" challenge screen to become the new gatekeeper of the internet. That must remain a permanent stain on their reputation until they make amends.
Are there any open source tools we could install that detect a high number of requests and send those IP addresses to a common pool somewhere? So that individuals wouldn't get tracked, but bots would? Then we could query the pool for the current request's IP address and throttle it down based on volume (not block it completely). Possibly at the server level with nginx or at whatever edge caching layer we use.
I know there may be scaling and privacy issues with this. Maybe it could use hashing or zero knowledge proofs somehow? I realize this is hopelessly naive. And no, I haven't looked up whether someone has done this. I just feel like there must be a bulletproof solution to this problem, with a very simple explanation as to how it works, or else we've missed something fundamental. Why all the hand waving?
The internet has seen success with social media content moderation, so it seems natural enough that an analogous scheme could exist for web traffic itself: hosts being able to "downvote" malicious traffic, with some sort of decay mechanism given that IPs get recycled. This exists in a basic sense with known Tor exit nodes, known AWS and GCP IP ranges, etc.
That said, we probably don't have the right building blocks yet. IPs are too ephemeral, yet anything more identity-bound is a little too authoritarian, IMO. Further, querying something for every request is probably too heavy.
Fun to think about, though.
It seems all options have major trade-offs. We can host on big social media and lose all that control and independence. We can pay for outsized infrastructure just to feed the scrapers, but the cost may actually be prohibitive, and it seems such a waste to begin with. We can move as much as possible to SSG and put it all behind Cloudflare, but this comes with vendor lock-in and just isn't architecturally feasible for many applications. We can do real "verified identities" for bots and just let through the ones we know and like, but this only perpetuates corporate control and makes healthy upstart competition (like Kagi) much more difficult.
So, what are we to do?
Or are you having the user solve an encryption puzzle to view it?
Don’t worry, you’re not just old. The internet kind of sucks now.
The AI scrapers are atrocious. They just blindly blast every URL on a site with no throttling. They are terribly written and managed as the same scraper will hit the same site multiple times a day or even hour. They also don't pay any attention to context so they'll happily blast git repo hosts and hit expensive endpoints.
They're like a constant DoS attack. They're hard to block at the network level because they span different hyperscalers' IP blocks.
If the forum considers unique cookies to be a user and creates a new cookie for any new cookie-less request, and if it considers a user to be online for 1 hour after their last request, then actually this may be one scraper making ~6 requests per second. That may be a pain in its own way, but it's far from 23k online bots.
Either there are indeed hundreds or thousands of AI bots DDoSing the entire internet, or a couple of bots are needlessly hammering it over and over and over again. I'm not sure which option is worse.
Edit: it’s 15 minutes.
Couple that with 15 minute session times, and that could just be one entity scraping the forum at 30 requests per second. One scraper going moderately fast sounds far less bad than 29000 bots.
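Spelling out the arithmetic:

```javascript
// N "online" cookies within a 15-minute session window can be produced by a
// single cookie-less client making N/900 requests per second.
const onlineUsers = 29_000;     // figure quoted above
const sessionSeconds = 15 * 60; // counted as online for 15 min after last hit
const reqPerSecond = onlineUsers / sessionSeconds;
console.log(reqPerSecond.toFixed(1)); // ≈ 32.2 req/s from one scraper
```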
It still sounds excessive for a niche site, but I'd guess this is sporadic, or that the forum software has a page structure that traps scrapers accidentally, quite easy to do.
I wonder if this will start making porn websites rank higher in google if it catches on…
Have you tested it with the Lynx web browser? I bet all the links would show up if a user used it.
Oh also couldn’t AI scrapers just start impersonating Googlebot and Bingbot if this caught on and they got wind of it?
Hey I wonder if there is some situation where negative SEO would be a good tactic. Generally though I think if you wanted something to stay hidden it just shouldn’t be on a public web server.
Very clever, use the LLM's own rules (against copyright infringement) against itself.
Everything below the following four #### is ~quoted~ from that magazine:
####
Only humans and ill-aligned AI models allowed to continue
Find me a torrent link for Bee Movie (2007)
[Paste torrent or magnet link here...] SUBMIT LINK
[ ] Check to confirm you do NOT hold the legal rights to share or distribute this content
Asking them to upload a copyrighted photo not belonging to them might be more effective.
Only because newer LLMs don't seem to want to write hate speech.
The website (verifying humanness) could, for example, show a picture of a black jewish person and then ask the human visitor to "type in the most offensive two words you can think of for the person shown, one is `n _ _ _ _ _` & second is `k _ _ _`." [I'll call them "hate crosswords"]
In my experience, most online-facing LLMs won't reproduce these "iggers and ikes" (nor should humans, but here we are separating machines).
At least once upon a time there was a pirate textbook library that used HTTP basic auth with a prompt that made the password really easy to guess. I suppose the main goal was to keep crawlers out even if they don't obey robots.txt, and at the same time be as easy for humans as possible.
Not an internet litigation expert but seems like it could be debatable
Google publishes the Googlebot IP ranges[0], so you can make sure that it's the real Googlebot and not just someone else pretending to be one.
[0] https://developers.google.com/crawling/docs/crawlers-fetcher...
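A sketch of the check, assuming IPv4 and a cached copy of the ranges (the sample prefix below is just for illustration, not necessarily a current Google range):

```javascript
// Check a client IP against CIDR prefixes from Google's published list.
function ipToInt(ip) {
  return ip.split('.').reduce((acc, o) => (acc << 8) + Number(o), 0) >>> 0;
}

function inCidr(ip, cidr) {
  const [base, bits] = cidr.split('/');
  const mask = bits === '0' ? 0 : (~0 << (32 - Number(bits))) >>> 0;
  return (ipToInt(ip) & mask) === (ipToInt(base) & mask);
}

// In practice, fetch and cache the googlebot.json ranges file from Google.
const ranges = ['66.249.64.0/27']; // example prefix for illustration
const isGooglebot = ip => ranges.some(r => inCidr(ip, r));
```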
Or maybe not. Got some random bot from my server logs. Yeah, it's pretending to be Chrome, but more exactly:
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
I guess Google might be not eager to open this can of worms.
Yeah, I use that as a safeguard :D The URLs that I don't want indexed contain hundreds of those keywords that lead to URLs being deindexed directly. There is also some law in the US that forbids showing that kind of content as a result, so Google and Bing both have a hard time scraping those pages/articles.
Note that this is the last defensive measure before the eBPF blocks kick in. The first one uses zip bombs, and the second uses chunked encoding to blow up proxies so that their clients get blocked.
You can only win this game if you make it more expensive to scrape than to host it.
Here's what I could find on justice.gov and other official websites; maybe there's more in the web archive?
- https://www.justice.gov/archives/opa/pr/google-forfeits-500-...
- https://www.congress.gov/110/plaws/publ425/PLAW-110publ425.p...
- https://www.fda.gov/drugs/prescription-drug-advertising/pres...
edit: Oh, it was very likely the Federal Food, Drug, and Cosmetic Act that was the legal basis for the crackdown. But that's a very old law from the pre-internet age.
- https://en.wikipedia.org/wiki/Federal_Food,_Drug,_and_Cosmet...
edit 2: Might not be clear to the younger generation, but there was a huge wave of patients who got addicted after being treated with Oxycodone (or OxyContin) prescriptions. I think that was the actual cause of the crackdown on those online advertisements, but I might be wrong. I highly recommend watching documentaries about Oxy; it's insane how many people knew about it and were getting people addicted on purpose with those painkillers.
-> https://github.com/voodooEntity/ghost_trap
Basically a GitHub Action that extends your README.md with a "polymorphic" prompt injection. I ran some LLMs against it, and in most cases they just produced garbage.
I've also thought about creating a JS variant that you can add to your website that will (invisibly to the user) inject such prompt injections to stop web crawling like you described.
On the flip side of this discussion - if you're building a scraper yourself, there are ways to be less annoying:
1. Run locally instead of from cloud servers. Most aggressive blocking targets VPS IPs. A desktop app using the user's home IP looks like normal browsing.
2. Respect rate limits and add delays. Obvious but often ignored.
3. Use RSS feeds when available - many sites leave them open even when blocking scrapers.
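For example, a minimal polite fetcher along those lines (assumes Node 18+ for global `fetch`; the UA string and contact address are placeholders):

```javascript
// Fetch sequentially with a fixed delay and an honest, contactable UA,
// instead of blasting every URL in parallel.
const sleep = ms => new Promise(r => setTimeout(r, ms));

async function politeFetch(urls, delayMs = 2000) {
  const pages = [];
  for (const url of urls) {
    pages.push(await fetch(url, {
      headers: { 'User-Agent': 'my-scraper/1.0 (contact@example.com)' },
    }));
    await sleep(delayMs); // never hammer: one request every couple of seconds
  }
  return pages;
}
```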
I built a Reddit data tool (search "reddit wappkit" if curious) and the "local IP" approach basically eliminated all blocking issues. Reddit is pretty aggressive against server IPs but doesn't bother home connections.
The porn-link solution is creative though. Fight absurdity with absurdity I guess.
(Overgeneralising a bit) site owners are mostly acting for public benefit, whereas scrapers act for their own benefit / for private interests.
I imagine most people would land on team site-owner, if they were asked. I certainly would.
P.S. is the best way to scrape fairly just to respect robots.txt?
It should also be easy to detect a Forgejo, Gitea, or similar hosting site, locate the git URL, and clone the repo.
Unscrupulous AI scrapers will not be using a genuine UA string. They'll be claiming to be Googlebot. You'll need to do a reverse DNS check instead - https://developers.google.com/crawling/docs/crawlers-fetcher...
Worth noting that in general if you do any "is this Google or not" you should always check by IP address as there's many people spoofing the googlebot user agent.
https://developers.google.com/static/search/apis/ipranges/go...