Posted by misterchocolat 12/16/2025

Show HN: Stop AI scrapers from hammering your self-hosted blog (using porn) (github.com)
Alright so if you run a self-hosted blog, you've probably noticed AI companies scraping it for training data. And not just a little (RIP to your server bill).

There isn't much you can do about it without cloudflare. These companies ignore robots.txt, and you're competing with teams with more resources than you. It's you vs the MJs of programming, you're not going to win.

But there is a solution. Now I'm not going to say it's a great solution...but a solution is a solution. If your website contains content that will trigger their scraper's safeguards, it will get dropped from their data pipelines.

So here's what fuzzycanary does: it injects hundreds of invisible links to porn websites in your HTML. The links are hidden from users but present in the DOM so that scrapers can ingest them and say "nope we won't scrape there again in the future".
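
To make the idea concrete, here is a minimal TypeScript sketch of generating a hidden block of links. It is an illustration only and not fuzzycanary's actual code: the function names `buildHiddenLinks` and `escapeAttribute`, the choice of `display:none`, and the placeholder URLs are all assumptions.

```typescript
// Hypothetical helper: escape a value for use inside a double-quoted HTML attribute.
function escapeAttribute(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: wrap a list of URLs in a container that browsers do not
// render, but that is still present in the markup a scraper downloads.
function buildHiddenLinks(urls: readonly string[]): string {
  const anchors = urls
    .map((url) => `<a href="${escapeAttribute(url)}" tabindex="-1"></a>`)
    .join("");
  // aria-hidden keeps the block out of the accessibility tree for screen readers.
  return `<div style="display:none" aria-hidden="true">${anchors}</div>`;
}

// Placeholder URLs, for illustration only.
const hiddenBlock = buildHiddenLinks([
  "https://example.com/a",
  "https://example.com/b?x=1&y=2",
]);
console.log(hiddenBlock);
```

The resulting string would be appended to the page body on the server or at render time.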

The problem with that approach is that it will absolutely nuke your website's SEO. So fuzzycanary also checks user agents and won't show the links to legitimate search engines, so Google and Bing won't see them.
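
A rough sketch of what such a user-agent check could look like, in TypeScript. This is not the package's real implementation: the names `isSearchEngineUserAgent` and `shouldInjectLinks` and the two-crawler pattern are assumptions made for illustration.

```typescript
// Assumed allowlist: only the two crawlers named in the post.
const SEARCH_ENGINE_PATTERN = /googlebot|bingbot/i;

// Returns true when the User-Agent header claims to be a search engine crawler.
function isSearchEngineUserAgent(userAgent: string | undefined): boolean {
  return userAgent !== undefined && SEARCH_ENGINE_PATTERN.test(userAgent);
}

// Hidden links are injected for everyone except claimed search engines.
function shouldInjectLinks(userAgent: string | undefined): boolean {
  return !isSearchEngineUserAgent(userAgent);
}

console.log(
  shouldInjectLinks("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
); // false
console.log(shouldInjectLinks("Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0")); // true
```

Keep in mind that the User-Agent header is self-reported, so a check like this only filters clients that identify themselves honestly.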

One caveat: if you're using a static site generator it will bake the links into your HTML for everyone, including googlebot. Does anyone have a work-around for this that doesn't involve using a proxy?

Please try it out! Setup is one component or one import.

(And don't tell me it's a terrible idea because I already know it is)

package: https://www.npmjs.com/package/@fuzzycanary/core gh: https://github.com/vivienhenz24/fuzzy-canary

372 points | 276 comments | page 5
cport1 12/17/2025|
That's a pretty hilarious idea, but in all seriousness you could use something like https://webdecoy.com/
misterchocolat 12/17/2025|
yes but here it's free, whereas this (https://webdecoy.com/) is at least $59 a month
shadowangel 5 days ago||
So if the bots use a google useragent it avoids the links?
cuku0078 5 days ago||
Why is it so bad that AIs scrape your self-hosted blog?
FelipeCortez 5 days ago|
because serving requires resources
cuku0078 5 days ago|||
What specific resources are we referring to here? Are AI vendors re-crawling the whole blog repeatedly, or do they rely on caching primitives like ETag/If-Modified-Since (or hashes) to avoid fetching unchanged posts? Also: is the scraping volume high enough to cause outages for smaller sites?
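
For reference, the server side of the ETag mechanism mentioned above can be sketched in a few lines of TypeScript. This is a simplified illustration, not a full HTTP implementation: `etagFor` and `respondConditional` are hypothetical names, the hand-rolled FNV-1a hash stands in for whatever hash a real server would use, and header names are assumed to arrive already lowercased.

```typescript
interface ConditionalResponse {
  status: 200 | 304;
  headers: Record<string, string>;
  body: string | null;
}

// Hypothetical ETag generator: a 32-bit FNV-1a hash of the body, quoted as ETags must be.
function etagFor(body: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < body.length; i++) {
    hash ^= body.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return `"${hash.toString(16)}"`;
}

// Answers 304 with no body when the client's If-None-Match matches the current ETag,
// otherwise 200 with the full body. Assumes lowercased request header names.
function respondConditional(
  requestHeaders: Record<string, string | undefined>,
  body: string
): ConditionalResponse {
  const etag = etagFor(body);
  const candidates = (requestHeaders["if-none-match"] ?? "")
    .split(",")
    .map((value) => value.trim().replace(/^W\//, "")); // ignore the weak-validator prefix
  const unchanged = candidates.some((c) => c === "*" || c === etag);
  if (unchanged) {
    return { status: 304, headers: { ETag: etag }, body: null };
  }
  return { status: 200, headers: { ETag: etag }, body };
}

const post = "<h1>My blog post</h1>";
const first = respondConditional({}, post);
const second = respondConditional({ "if-none-match": first.headers["ETag"] }, post);
console.log(first.status, second.status); // 200 304
```

This only saves bandwidth when the crawler actually sends the validator back on the next request, and the server still has to handle every request.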

Separately, I see a bigger issue: blog content gets paraphrased and reproduced by AIs without clearly mentioning the author or linking back to the original post. It feels like you often have to explicitly ask the model for sources before it will surface the exact citations.

inetknght 6 days ago||
Porn? Distributed and/or managed by an NPM package?

What could go wrong?

MisterTea 6 days ago||
> It's you vs the MJs of programming, you're not going to win.

MJs? Michael Jacksons? Right now the whole world, including me, wants to know if that means they are bad?

kylecazar 6 days ago||
I read it as Michael Jordan.
n1xis10t 6 days ago||
Yes probably bad. Also smooth criminals.
xena 6 days ago||
I love this. Please let me know how well it works for you. I may adjust recommendations based on your experiences.
kislotnik 5 days ago||
Funny how the project aims to fight AI scraping, but seems to be using an AI-generated image of a bird?
brazukadev 5 days ago|
I think you can think a bit more about it and conclude these two things aren't related at all?
valenceidra 6 days ago||
Hidden links to porn sites? Lightweights.
n1xis10t 6 days ago||
What do you mean? Would you do even more ridiculous things?
rpigab 5 days ago||
If that's what it takes to fight back against AI crawlers, users will have to accept a fair amount of actually visible porn in blogs, maybe also on Wikipedia.

This is not enshittification, it's progress.

JohnMakin 6 days ago||
Cloudflare offers bot mitigation for free, and pretty generous WAF rules, which make mitigations like this seem a little overblown to me
nospice 6 days ago||
I'm on the free tier, but I also watch my logs. The vast majority of the traffic I'm getting is scrapers and vulnerability scanners, a lot of them coming through residential proxies and other "laundered" egress points.

I honestly don't think that Cloudflare is on top of the problem at all. They claim to be blocking abuse, but in my experience, most of the badness gets through.

cakealert 5 days ago||
when you combine a residential proxy with a tool like curl-impersonate (there are libraries in Go for this type of fingerprint spoofing now) they don't even show up as scrapers anymore, just users. especially when they adjust timings to mimic humans.

cloudflare only blocks the dumbest bots; there are still a lot of them.

this is why cloudflare will issue javascript challenges to you even when you are using google chrome with a VPN, they are desperate to appear to be doing something. and every VPN is used to crawl as well. a slightly more sophisticated bot passes the cloudflare javascript challenge as well, there really is nothing they can do to win here.

i know some teams that got annoyed with residential proxies (they are usually sold as socks5 but can be buggy and low bandwidth) so they invested in defeating the cloudflare javascript challenge and now crawl using thousands of VPN endpoints at over 100 Gbit/s.

oidar 5 days ago||
Is "residential proxy" another name for a hacked/owned computer that the bots have access to? Or are there legitimate services that sell access to residential IPs?
nospice 5 days ago||
People legitimately sell egress. It's "free" money. But of course, if you have a botnet, you can sell that through the same channels, no one is looking too closely.
n1xis10t 6 days ago|||
You can’t deny that it’s fun though. Personally I generally feel like more people should be coming up with creative (if not entirely necessary) solutions to problems.
conception 6 days ago|||
For “free”.
n1xis10t 6 days ago||
Did you put “free” in quotes because you need to have paid for stuff from cloudflare to use the “free” thing?

If so, I suppose it’s like those magazines that say “free CD”.

efilife 6 days ago|||
Well, you literally MITM yourself so I think it's a big price
JohnMakin 6 days ago||||
You don't though.
n1xis10t 6 days ago||
Good to know thanks
Terr_ 6 days ago|||
I thought they were referring to the indirect costs of supporting monopolistic stuff that enshittifies later.

https://www.youtube.com/watch?v=U8vi6Hbp8Vc

ATechGuy 6 days ago||
Is it really free? Genuinely asking.
gilrain 6 days ago||
Yes. They upsell more complete solutions, but the free tier is pretty generous.
wcarss 5 days ago|
Singing copyrighted Billy Joel to make your footage unusable for reality television; thanks 30 Rock for an early view into this dystopian strategy