Posted by misterchocolat 12/16/2025
There isn't much you can do about it without Cloudflare. These companies ignore robots.txt, and you're competing with teams that have more resources than you. It's you vs the MJs of programming; you're not going to win.
But there is a solution. Now I'm not going to say it's a great solution...but a solution is a solution. If your website contains content that will trigger their scraper's safeguards, it will get dropped from their data pipelines.
So here's what fuzzycanary does: it injects hundreds of invisible links to porn websites in your HTML. The links are hidden from users but present in the DOM so that scrapers can ingest them and say "nope we won't scrape there again in the future".
The problem with that approach is that it will absolutely nuke your website's SEO. So fuzzycanary also checks user agents and won't show the links to legitimate search engines, so Google and Bing won't see them.
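For anyone wondering what that roughly looks like server-side, here is a minimal sketch. This is not fuzzycanary's actual code: `isSearchEngine`, `renderDecoyLinks`, `injectDecoys`, the bot pattern list, and the `decoy.example` URL are all hypothetical names made up for illustration.

```typescript
// Hypothetical sketch of the idea, NOT the fuzzycanary implementation.
// Illustrative (incomplete) list of crawler user-agent patterns to exempt.
const SEARCH_BOT_PATTERNS: RegExp[] = [/googlebot/i, /bingbot/i, /duckduckbot/i];

// True when the User-Agent header claims to be a search engine crawler.
function isSearchEngine(userAgent: string | undefined): boolean {
  if (!userAgent) return false;
  return SEARCH_BOT_PATTERNS.some((pattern) => pattern.test(userAgent));
}

// Escape a string for use inside a double-quoted HTML attribute.
function escapeAttr(value: string): string {
  return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;").replace(/</g, "&lt;");
}

// Build a container that is hidden from users but still present in the DOM.
function renderDecoyLinks(urls: string[]): string {
  const anchors = urls
    .map((url) => `<a href="${escapeAttr(url)}" tabindex="-1" rel="nofollow">.</a>`)
    .join("");
  return `<div style="display:none" aria-hidden="true">${anchors}</div>`;
}

// Insert the decoys just before </body>, unless the visitor looks like a search engine.
function injectDecoys(html: string, userAgent: string | undefined, urls: string[]): string {
  if (isSearchEngine(userAgent)) return html;
  const block = renderDecoyLinks(urls);
  const idx = html.lastIndexOf("</body>");
  return idx === -1 ? html + block : html.slice(0, idx) + block + html.slice(idx);
}
```

Keep in mind that the User-Agent header is trivially spoofable, so a check like this only filters crawlers that identify themselves honestly.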
One caveat: if you're using a static site generator, it will bake the links into your HTML for everyone, including Googlebot. Does anyone have a workaround for this that doesn't involve using a proxy?
Please try it out! Setup is one component or one import.
(And don't tell me it's a terrible idea because I already know it is)
package: https://www.npmjs.com/package/@fuzzycanary/core
gh: https://github.com/vivienhenz24/fuzzy-canary
(No, I don't want to defend the poor AI companies. Go for it!)
Just curious. Hoping to be able to work on a website again someday, if I ever regain my health/stamina/etc.
Anyway, if it is true, and assuming a forum with minimal genuine Chinese traffic, might a simple approach that injects the porn links only for IPs connecting from China work?
If your goal is to be blocked by China's Great Firewall, including mentions of Tank Man and the Tiananmen Square massacre more generally, and certain Pooh Bear-related imagery, might help.
That was my first question also, and had been my belief. The admin in question was very clear that the IPs were simply originating from China. I'm still surprised, and welcome better general data, but I trust him on this for the site in question.
There's a good chance corporate firewalls will end up blocking your domain if you do this, but that sounds like a problem for the customers of those corporate firewalls to me.
edit: I noticed someone mentioned Google DOES publish its IPs, there ya go, problem solved.
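A rough sketch of what checking a visitor against published crawler ranges could look like, assuming you have already fetched the ranges as IPv4 CIDR strings. The function names are hypothetical, the ranges in the usage note are reserved documentation addresses rather than real crawler IPs, and IPv6 is not handled.

```typescript
// Hypothetical helper names; IPv4 only. A sketch, not a hardened implementation.

// Convert dotted-quad IPv4 text to an unsigned 32-bit number, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const octets = ip.split(".");
  if (octets.length !== 4) return null;
  let value = 0;
  for (const octet of octets) {
    if (!/^\d{1,3}$/.test(octet)) return null;
    const n = Number(octet);
    if (n > 255) return null;
    value = value * 256 + n;
  }
  return value;
}

// True when `ip` falls inside the CIDR block, e.g. "192.0.2.0/24".
function inCidr(ip: string, cidr: string): boolean {
  const pieces = cidr.split("/");
  if (pieces.length !== 2) return false;
  const prefix = pieces[1] ?? "";
  if (!/^\d{1,2}$/.test(prefix)) return false;
  const bits = Number(prefix);
  const addr = ipv4ToInt(ip);
  const base = ipv4ToInt(pieces[0] ?? "");
  if (addr === null || base === null || bits > 32) return false;
  // Build the network mask; a /0 prefix matches every address.
  const mask = bits === 0 ? 0 : (~0 << (32 - bits)) >>> 0;
  return ((addr & mask) >>> 0) === ((base & mask) >>> 0);
}

// True when `ip` is inside any of the published ranges supplied by the caller.
function isPublishedCrawlerIp(ip: string, publishedRanges: string[]): boolean {
  return publishedRanges.some((range) => inCidr(ip, range));
}
```

Usage would be something like `isPublishedCrawlerIp(requestIp, ["192.0.2.0/24"])` and skipping the link injection when it returns true. Unlike a user-agent check, the source IP is much harder to fake, but the range list has to be refreshed whenever the publisher updates it.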
If I then get hit by a rude AI scraper, what chances would I have to sue the hell out of them in EU courts for copyright violation (uhh, my articles cost 100k a pop for AI training, actually) and the de facto DDoS attack?