
Posted by misterchocolat 12/16/2025

Show HN: Stop AI scrapers from hammering your self-hosted blog (using porn) (github.com)
Alright so if you run a self-hosted blog, you've probably noticed AI companies scraping it for training data. And not just a little (RIP to your server bill).

There isn't much you can do about it without Cloudflare. These companies ignore robots.txt, and you're competing with teams with more resources than you. It's you vs the MJs of programming, you're not going to win.

But there is a solution. Now I'm not going to say it's a great solution...but a solution is a solution. If your website contains content that will trigger their scraper's safeguards, it will get dropped from their data pipelines.

So here's what fuzzycanary does: it injects hundreds of invisible links to porn websites in your HTML. The links are hidden from users but present in the DOM so that scrapers can ingest them and say "nope we won't scrape there again in the future".

The problem with that approach is that it will absolutely nuke your website's SEO. So fuzzycanary also checks user agents and won't show the links to legitimate search engines, so Google and Bing won't see them.
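
A minimal sketch of the idea described above, assuming a server-rendered page where the request's User-Agent is available. This is not the fuzzycanary API: `isSearchEngine`, `hiddenLinksHtml`, `injectDecoys`, the crawler regex, and the `DECOY_URLS` placeholders are all hypothetical names chosen for illustration.

```typescript
// Hypothetical sketch; none of these names come from the fuzzycanary package.

// Assumed list of search-engine crawler markers to exempt from injection.
const SEARCH_ENGINE_UA = /googlebot|bingbot|duckduckbot/i;

// Placeholder decoy URLs (assumption); a real deployment would supply its own list.
const DECOY_URLS: string[] = [
  "https://decoy-one.example/",
  "https://decoy-two.example/",
];

// True when the User-Agent string claims to be a known search-engine crawler.
function isSearchEngine(userAgent: string | undefined): boolean {
  return userAgent !== undefined && SEARCH_ENGINE_UA.test(userAgent);
}

// Builds a block of links that stays in the DOM but is not rendered for users.
function hiddenLinksHtml(urls: string[]): string {
  const anchors = urls
    .map((u) => `<a href="${u}" tabindex="-1">${u}</a>`)
    .join("");
  return `<div style="display:none" aria-hidden="true">${anchors}</div>`;
}

// Inserts the hidden block before </body>, unless the visitor looks like a search engine.
function injectDecoys(html: string, userAgent: string | undefined): string {
  if (isSearchEngine(userAgent)) return html;
  const block = hiddenLinksHtml(DECOY_URLS);
  const idx = html.lastIndexOf("</body>");
  return idx === -1 ? html + block : html.slice(0, idx) + block + html.slice(idx);
}
```

Because the check in this sketch relies only on the User-Agent header, any client that claims to be Googlebot would also receive the clean page.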

One caveat: if you're using a static site generator it will bake the links into your HTML for everyone, including googlebot. Does anyone have a work-around for this that doesn't involve using a proxy?

Please try it out! Setup is one component or one import.

(And don't tell me it's a terrible idea because I already know it is)

package: https://www.npmjs.com/package/@fuzzycanary/core
gh: https://github.com/vivienhenz24/fuzzy-canary

373 points | 277 comments | page 5
cport1 12/17/2025|
That's a pretty hilarious idea, but in all seriousness you could use something like https://webdecoy.com/
misterchocolat 12/17/2025|
yes but here it's free, whereas this (https://webdecoy.com/) is at least $59 a month
shadowangel 12/19/2025||
So if the bots use a google useragent it avoids the links?
MisterTea 12/18/2025||
> It's you vs the MJs of programming, you're not going to win.

MJs? Michael Jacksons? Right now the whole world, including me, wants to know if that means they are bad?

kylecazar 12/18/2025||
I read it as Michael Jordan.
n1xis10t 12/18/2025||
Yes probably bad. Also smooth criminals.
inetknght 12/19/2025||
Porn? Distributed and/or managed by an NPM package?

What could go wrong?

cuku0078 12/19/2025||
Why is it so bad that AIs scrape your self-hosted blog?
FelipeCortez 12/19/2025|
because serving requires resources
cuku0078 12/19/2025|||
What specific resources are we referring to here? Are AI vendors re-crawling the whole blog repeatedly, or do they rely on caching primitives like ETag/If-Modified-Since (or hashes) to avoid fetching unchanged posts? Also: is the scraping volume high enough to cause outages for smaller sites?
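
For reference, the caching primitives mentioned above look roughly like this from the server side. This is a simplified sketch of conditional GET handling, assuming the blog already tracks an ETag and a last-modified time per post; `Resource`, `RequestValidators`, `Reply`, and `conditionalGet` are hypothetical names, and the comma split ignores some edge cases of the header grammar. Whether any given crawler actually sends these headers is not something the sketch can answer.

```typescript
// Hypothetical types for illustration; not taken from any specific framework.
interface Resource { etag: string; lastModified: Date; body: string; }
interface RequestValidators { ifNoneMatch?: string; ifModifiedSince?: string; }
interface Reply { status: 200 | 304; headers: Record<string, string>; body: string; }

// If-None-Match uses weak comparison, so drop any leading W/ marker.
function stripWeak(tag: string): string {
  return tag.trim().replace(/^W\//, "");
}

// Returns 304 with an empty body when the client's cached copy is still current.
function conditionalGet(res: Resource, req: RequestValidators): Reply {
  const headers: Record<string, string> = {
    ETag: res.etag,
    "Last-Modified": res.lastModified.toUTCString(),
  };
  let notModified = false;
  if (req.ifNoneMatch !== undefined) {
    // If-None-Match takes precedence; If-Modified-Since is ignored when it is present.
    const candidates = req.ifNoneMatch.split(",").map(stripWeak);
    notModified = candidates.includes("*") || candidates.includes(stripWeak(res.etag));
  } else if (req.ifModifiedSince !== undefined) {
    const since = Date.parse(req.ifModifiedSince);
    // HTTP dates have one-second resolution, so truncate before comparing.
    const modifiedSec = Math.floor(res.lastModified.getTime() / 1000) * 1000;
    notModified = !Number.isNaN(since) && modifiedSec <= since;
  }
  return notModified
    ? { status: 304, headers, body: "" }
    : { status: 200, headers, body: res.body };
}
```

A client that replays the `ETag` it received in `If-None-Match` costs the server only headers on unchanged posts; a client that sends neither validator gets the full body on every fetch.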

Separately, I see a bigger issue: blog content gets paraphrased and reproduced by AIs without clearly mentioning the author or linking back to the original post. It feels like you often have to explicitly ask the model for sources before it will surface the exact citations.

xena 12/19/2025||
I love this. Please let me know how well it works for you. I may adjust recommendations based on your experiences.
valenceidra 12/19/2025||
Hidden links to porn sites? Lightweights.
n1xis10t 12/19/2025||
What do you mean? Would you do even more ridiculous things?
rpigab 12/19/2025||
If that's what it takes to fight back against AI crawlers, users will have to accept a fair amount of actually visible porn in blogs, maybe also on Wikipedia.

This is not enshittification, it's progress.

kislotnik 12/19/2025||
Funny how the project aims to fight AI scraping, but seems to be using an AI-generated image of a bird?
brazukadev 12/19/2025|
I think you can think a bit more about it and conclude these two things aren't related at all?
JohnMakin 12/18/2025||
Cloudflare offers bot mitigation for free, and pretty generous WAF rules, which make mitigations like this seem a little overblown to me
nospice 12/19/2025||
I'm on the free tier, but I also watch my logs. The vast majority of the traffic I'm getting are scrapers and vulnerability scanners, a lot of them coming through residential proxies and other "laundered" egress points.

I honestly don't think that Cloudflare is on top of the problem at all. They claim to be blocking abuse, but in my experience, most of the badness gets through.

cakealert 12/19/2025||
when you combine a residential proxy with a tool like curl-impersonate (there are libraries in Go for this type of fingerprint spoofing now) they don't even show up as scrapers anymore, just users. especially when they adjust timings to mimic humans.

cloudflare only blocks the most dumb of bots, there are still a lot of them.

this is why cloudflare will issue javascript challenges to you even when you are using google chrome with a VPN, they are desperate to appear to be doing something. and every VPN is used to crawl as well. a slightly more sophisticated bot passes the cloudflare javascript challenge as well, there really is nothing they can do to win here.

i know some teams that got annoyed with residential proxies (they are usually sold as socks5 but can be buggy and low bandwidth) so they invested into defeating the cloudflare javascript challenge and now crawl using 1000's of VPN endpoints at over 100 Gbit/s.

oidar 12/19/2025||
Is "residential proxy" another name for a hacked/owned computer that the bots have access to? Or are there legitimate services that sell access to residential IPs?
nospice 12/19/2025||
People legitimately sell egress. It's "free" money. But of course, if you have a botnet, you can sell that through the same channels, no one is looking too closely.
n1xis10t 12/18/2025|||
You can’t deny that it’s fun though. Personally I generally feel like more people should be coming up with creative (if not entirely necessary) solutions to problems.
conception 12/18/2025|||
For “free”.
n1xis10t 12/18/2025||
Did you put “free” in quotes because you need to have paid for stuff from cloudflare to use the “free” thing?

If so, I suppose it’s like those magazines that say “free CD”.

efilife 12/19/2025|||
Well, you literally MITM yourself so I think it's a big price
JohnMakin 12/18/2025||||
You don't though.
n1xis10t 12/18/2025||
Good to know thanks
Terr_ 12/18/2025|||
I thought they were referring to the indirect costs of supporting monopolistic stuff that enshittifies later.

https://www.youtube.com/watch?v=U8vi6Hbp8Vc

ATechGuy 12/19/2025||
Is it really free? Genuinely asking.
gilrain 12/19/2025||
Yes. They upsell more complete solutions, but the free tier is pretty generous.
wcarss 12/19/2025|
Singing copyrighted Billy Joel to make your footage unusable for reality television; thanks 30 Rock for an early view into this dystopian strategy