Posted by misterchocolat 12/16/2025

Show HN: Stop AI scrapers from hammering your self-hosted blog (using porn) (github.com)
Alright so if you run a self-hosted blog, you've probably noticed AI companies scraping it for training data. And not just a little (RIP to your server bill).

There isn't much you can do about it without Cloudflare. These companies ignore robots.txt, and you're competing with teams that have more resources than you. It's you vs. the MJs of programming; you're not going to win.

But there is a solution. Now I'm not going to say it's a great solution...but a solution is a solution. If your website contains content that will trigger their scraper's safeguards, it will get dropped from their data pipelines.

So here's what fuzzycanary does: it injects hundreds of invisible links to porn websites in your HTML. The links are hidden from users but present in the DOM so that scrapers can ingest them and say "nope we won't scrape there again in the future".
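
To make the technique concrete, here's a rough sketch of the server-side idea. This is not the package's actual code, and the decoy URLs are placeholders:

    // Sketch only, not fuzzycanary's real implementation.
    // Emit anchors that are present in the DOM but invisible to human readers.
    const DECOY_URLS: string[] = [
      'https://adult-decoy-1.example', // placeholder decoy URLs
      'https://adult-decoy-2.example',
    ];

    function canaryHtml(count = 200): string {
      const links = Array.from({ length: count }, (_, i) => {
        const url = DECOY_URLS[i % DECOY_URLS.length];
        // tabindex=-1 keeps the links out of keyboard focus
        return `<a href="${url}" tabindex="-1">.</a>`;
      }).join('');
      // aria-hidden + offscreen positioning hide the block from real readers
      return `<div aria-hidden="true" style="position:absolute;left:-9999px">${links}</div>`;
    }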

The problem with that approach is that it will absolutely nuke your website's SEO. So fuzzycanary also checks user agents and hides the links from legitimate search engines, so Google and Bing won't see them.
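
Conceptually the gate is just a user-agent allowlist, something like this (the crawler list here is abbreviated, and UAs are of course spoofable):

    // Sketch of the UA check; the real list of legitimate crawlers is longer.
    const SEARCH_ENGINE_UA = /googlebot|bingbot|duckduckbot|applebot/i;

    function shouldInjectCanary(userAgent: string | undefined): boolean {
      if (!userAgent) return true; // no UA at all: treat it as a scraper
      return !SEARCH_ENGINE_UA.test(userAgent); // skip known search crawlers
    }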

One caveat: if you're using a static site generator, it will bake the links into your HTML for everyone, including Googlebot. Does anyone have a workaround for this that doesn't involve using a proxy?

Please try it out! Setup is one component or one import.

(And don't tell me it's a terrible idea because I already know it is)

package: https://www.npmjs.com/package/@fuzzycanary/core
gh: https://github.com/vivienhenz24/fuzzy-canary

373 points | 277 comments
xg15 12/18/2025|
There is some irony in using an AI-generated banner image for this project...

(No, I don't want to defend the poor AI companies. Go for it!)

kstrauser 12/18/2025|
In the olden days, I used Google an awful lot, but I would still grouse if Google were to drive my server into the ground.
n1xis10t 12/18/2025||
Fair point
santiagobasulto 12/19/2025||
Off-topic: when did js/ts apps get so complicated? I tried to browse the repo and there are so many configuration files and directories for such simple functionality that should be one or two modules. It reminds me of the old Java days.
darepublic 12/19/2025||
Why would I need a dependency for this? I'm being serious. The idea is one thing, but why a dependency on React? I say this as someone who uses React. Why not just a paragraph-long blog post about the use of porn links, and perhaps a small snippet showing how to insert one with plain HTML?
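
Something like this would be the dependency-free version (the decoy URL is a placeholder):

    // One hidden link via the plain DOM; no React needed.
    // Equivalent plain HTML: <a href="..." style="display:none">.</a>
    const decoy = document.createElement('a');
    decoy.href = 'https://adult-decoy.example'; // placeholder
    decoy.style.display = 'none'; // or offscreen positioning, since some
                                  // parsers drop display:none subtrees
    decoy.textContent = '.';
    document.body.appendChild(decoy);
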
eek2121 12/19/2025||
Disclosure: I haven't run a website since my health issues began. However, Cloudflare has an AI firewall, and Cloudflare is super cheap (also: I'm unsure whether the AI firewall is on the free tier, but I'd be surprised if it weren't). Ignoring the recent drama about a couple of incidents they've had (because this wouldn't matter for a personal blog), why not use that instead?

Just curious. Hoping to be able to work on a website again someday, if I ever get my health/stamina/etc. back.

ddtaylor 12/19/2025||
Cloudflare has caused a bit of grief, with regular users getting spammed with "prove you're human" requests.
ProllyInfamous 12/19/2025|||
Yes, e.g.: I'll immediately close any attempt at Cloudflare's verification.
rglynn 12/20/2025||
Out of interest, why that extreme? Just out of principle or some other reason?
ProllyInfamous 12/20/2025|||
My main terminal uses a Pi-hole with 120,000+ blacklist rules (not Cloudflare specifically; I allow most CDNs). This includes an entire blackout of Google/Facebook products, as well as most tracking/analytics services.

For example, I do not allow reCAPTCHA.

As a similar commenter noted, when just casually browsing I don't really have any desire to try hard to read random content. Should I absolutely need to access some information walled off behind Cloudflare, I have another computer that uses much less restrictive blacklisting.

Rastonbury 12/20/2025|||
Not OP, but it isn't super extreme if you're just surfing. If the site is slow to load, sometimes I wasn't that invested in using your site anyway.
vaylian 12/19/2025||||
Can confirm. I have been blocked plenty of times and it's really annoying.
pjc50 12/19/2025|||
All the solutions are going to have a few false positives, sadly.
nottorp 12/19/2025||
Or a lot if you use privacy extensions.

Cloudflare's automatic checks (before you get the captcha) must be pretty close to what ad peddlers do.

brigandish 12/19/2025||
All the best with getting back on your feet.
nkurz 12/19/2025||
I was told by the admin of one forum site I use that the vast majority of the AI scraping traffic is Chinese at this point. Not hidden or proxied, but straight from China. Can anyone else confirm this?

Anyway, if it is true, and assuming a forum with minimal genuine Chinese traffic, might a simple approach that injects the porn links only for IPs coming from China work?
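
I'm picturing something like this, e.g. with Express and the geoip-lite package (the res.locals flag name is made up, and GeoIP data is only approximate):

    // Sketch: flag requests that geolocate to China so the template layer
    // can decide whether to inject the decoy links.
    import express from 'express';
    import geoip from 'geoip-lite'; // free MaxMind-derived DB

    const app = express();

    app.use((req, res, next) => {
      const geo = geoip.lookup(req.ip ?? '');
      res.locals.injectCanaryLinks = geo?.country === 'CN'; // hypothetical flag
      next();
    });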

dspillett 12/19/2025||
That would only affect those calling out directly. Many scrapers operate through a battery of proxies, so they'll be hidden from such a simple test.

If your goal is to be blocked by China's Great Firewall, then including mentions of Tank Man and the Tiananmen Square massacre more generally, and certain Pooh-bear-related imagery, might help.

nkurz 12/19/2025||
> That would only affect those calling out directly. Many scrapers operate through a battery of proxies, so they'll be hidden from such a simple test.

That was my first question also, and had been my belief. The admin in question was very clear that the IPs were simply originating from China. I'm still surprised, and welcome better general data, but I trust him on this for the site in question.

s0laster 12/19/2025|||
Mostly yes. One of my low-traffic, niche websites used to serve 3k real users per month, mainly from the US and Eastern EU. Now China alone is 500k users, where each session lasts no more than a few seconds [1].

[1]: https://ibb.co/20QD6Lnk

n1xis10t 12/19/2025||
Maybe. This comment makes me really want to set something up that builds a map of where all the requests are coming from.
wazoox 12/18/2025||
Isn't there a risk of getting your blog blocked in corporate environments, though? If it's a technical blog, that would be unfortunate.
jeroenhd 12/19/2025|
That depends on how terrible the middleboxes used in those corporate environments are. If they only block actual malicious pages, it shouldn't be a problem unless the user un-hides the links and clicks on them.

There's a good chance corporate firewalls will end up blocking your domain if you do this but that sounds like a problem for the customers of those corporate firewalls to me.

reconnecting 12/18/2025||
I wouldn't recommend showing different versions of the site to search robots, as they probably have mechanisms that track differences, which could potentially lead to a lower ranking or a ban.
prmoustache 12/19/2025|
How can they track differences if they have access to only one version?
reconnecting 12/20/2025||
Showing a specially designed page to search spiders is a usual tactic for many online businesses, so any major search engine has a way to verify whether content is being faked for it. Perhaps they use another spider that doesn't have an official UA, or they buy this service from a third party.

If you take a look at any website, even an unpopular one, you will see that there are hundreds of bots every day, and it's impossible to tell what any of them is doing and why.

temporallobe 12/19/2025||
I do know from my experience with test automation that you can absolutely view a site as human eyes would, essentially ignoring all non-visible elements; in fact, Selenium running with ChromeDriver does exactly this. Wouldn't AI scrapers use similar methods?
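
For example, with selenium-webdriver you can keep only what a human would actually see (a rough sketch):

    // Sketch: render the page, then keep only links a human could see.
    import { Builder, By } from 'selenium-webdriver';

    async function visibleLinks(url: string): Promise<string[]> {
      const driver = await new Builder().forBrowser('chrome').build();
      try {
        await driver.get(url);
        const visible: string[] = [];
        for (const a of await driver.findElements(By.css('a[href]'))) {
          // isDisplayed() catches display:none, though offscreen-positioned
          // elements can still count as displayed, so it isn't bulletproof
          if (await a.isDisplayed()) {
            visible.push(await a.getAttribute('href'));
          }
        }
        return visible;
      } finally {
        await driver.quit();
      }
    }
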
nottorp 12/19/2025|
Probably not, because it costs a lot more CPU cycles.
globalnode 12/18/2025||
One solution would be for the SEs to publish their scraper IPs and allow content providers to implement bot exclusion that way. Or even implement an API with crypto credentials that SEs can use to scrape. The solution is waiting for some leadership from SEs, unless they want to be blocked as well. If SEs don't want to play, perhaps we can implement a reverse directory: like an ad blocker, but one that lists only good/allowed bots instead. That's a free business idea right there.

edit: I noticed someone mentioned Google DOES publish its IPs, there ya go, problem solved.

n1xis10t 12/18/2025|
Apparently Google publishes their crawlers' IPs; this was mentioned somewhere in this same thread.
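
Google also documents a DNS-based check, forward-confirmed reverse DNS, which avoids maintaining an IP list. A rough sketch in Node:

    // Reverse-resolve the IP, check the domain is googlebot.com/google.com,
    // then forward-resolve the hostname and confirm it maps back to the IP.
    import { promises as dns } from 'node:dns';

    async function isGooglebot(ip: string): Promise<boolean> {
      try {
        const [host] = await dns.reverse(ip);
        if (!host || !/\.(googlebot|google)\.com$/.test(host)) return false;
        const addrs = await dns.resolve(host);
        return addrs.includes(ip);
      } catch {
        return false; // lookup failed: not verified
      }
    }
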
bytehowl 12/19/2025|
Let's imagine I have a blog and put something along these lines somewhere on every page: "This content is provided free of charge for humans to experience. It may also be automatically accessed for search indexing and archival purposes. For licensing information for other uses, contact the author."

If I then get hit by a rude AI scraper, what chances would I have to sue the hell out of them in EU courts for copyright violation (uhh, my articles cost 100k a pop for AI training, actually) and the de facto DDoS attack?

icepush 12/19/2025|
If the scraper is based (or has meaningful assets) in the EU, then your chances are good. If it does not, then the lawsuit would be meaningless.