
Posted by misterchocolat 12/16/2025

Show HN: Stop AI scrapers from hammering your self-hosted blog (using porn) (github.com)
Alright so if you run a self-hosted blog, you've probably noticed AI companies scraping it for training data. And not just a little (RIP to your server bill).

There isn't much you can do about it without Cloudflare. These companies ignore robots.txt, and you're competing with teams that have far more resources than you. It's you vs the MJs of programming, and you're not going to win.

But there is a solution. Now I'm not going to say it's a great solution...but a solution is a solution. If your website contains content that will trigger their scraper's safeguards, it will get dropped from their data pipelines.

So here's what fuzzycanary does: it injects hundreds of invisible links to porn websites in your HTML. The links are hidden from users but present in the DOM so that scrapers can ingest them and say "nope we won't scrape there again in the future".

The problem with that approach is that it will absolutely nuke your website's SEO. So fuzzycanary also checks user agents and won't show the links to legitimate search engines, so Google and Bing won't see them.
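
The two steps described above (inject hidden links, skip them for search-engine user agents) could look roughly like the sketch below. This is not fuzzycanary's actual code: the decoy URLs, the crawler tokens, the `display:none` hiding method, and the function names are all assumptions made for illustration.

```python
from html import escape

# Hypothetical placeholders: the real link list and crawler list are not given in the post.
DECOY_URLS = ["https://decoy-one.example/", "https://decoy-two.example/"]
SEARCH_ENGINE_TOKENS = ("googlebot", "bingbot")


def is_search_engine(user_agent: str) -> bool:
    """Return True when the User-Agent header names an allowlisted crawler."""
    ua = user_agent.lower()
    return any(token in ua for token in SEARCH_ENGINE_TOKENS)


def hidden_links(urls: list[str]) -> str:
    """Build links that stay in the DOM but are not rendered for human visitors."""
    anchors = "".join(
        f'<a href="{escape(url, quote=True)}" tabindex="-1">link</a>' for url in urls
    )
    # Assumption: hiding via inline CSS; the package may hide links differently.
    return f'<div style="display:none" aria-hidden="true">{anchors}</div>'


def inject(page_html: str, user_agent: str) -> str:
    """Insert the hidden block before </body>, except for allowlisted crawlers."""
    if is_search_engine(user_agent) or "</body>" not in page_html:
        return page_html
    return page_html.replace("</body>", hidden_links(DECOY_URLS) + "</body>", 1)
```

Note that the check trusts whatever the User-Agent header says, so any client can claim to be a search engine and receive the clean page.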

One caveat: if you're using a static site generator, it will bake the links into your HTML for everyone, including Googlebot. Does anyone have a workaround for this that doesn't involve using a proxy?

Please try it out! Setup is one component or one import.

(And don't tell me it's a terrible idea because I already know it is)

package: https://www.npmjs.com/package/@fuzzycanary/core
gh: https://github.com/vivienhenz24/fuzzy-canary

373 points | 277 comments | page 4
jakub_g 12/19/2025|
> checks user agents and won't show the links to legitimate search engines, so Google and Bing won't see them.

Serving different contents to search engines is called "cloaking" and can get you banned from their indexes.

misterchocolat 12/19/2025||
didn't know that thanks for pointing it out, i'll remove that feature
andersmurphy 12/19/2025||
Somehow doubt this. It would mean most react websites that serve static content without paywalls for SEO would get banned by the indexes too.

Which for better or worse is a large portion of the modern internet.

montroser 12/18/2025||
I don't know if I can get behind poisoning my own content in this way. It's clever, and might be a workable practical solution for some, but it's not a serious answer to the problem at hand (as acknowledged by OP).
n1xis10t 12/18/2025|
“as acknowledged by OP”: that’s funny, if you hadn’t added that to your comment I was about to point it out
drbscl 12/19/2025||
> So fuzzycanary also checks user agents

I wouldn't be so surprised if they often fake user agents, to be honest. Sure, it'll stop the "more honest" ones (but then, actual honest scrapers would respect robots.txt).

Cool idea though!
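
Since the User-Agent header can be faked, one common way to confirm a self-declared search-engine crawler is a reverse DNS lookup followed by a forward lookup that must return the same IP. A minimal sketch, assuming the hostname suffixes below match what the search engines currently publish (check their documentation before relying on them); the function names are made up for this example, and the forward lookup used here is IPv4-only.

```python
import socket
from typing import Callable, Sequence

# Assumption: suffixes taken from the crawlers' published verification guidance.
CRAWLER_HOST_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")


def reverse_dns(ip: str) -> str:
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_dns(host: str) -> Sequence[str]:
    """A lookup: hostname to IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_crawler(
    ip: str,
    suffixes: tuple = CRAWLER_HOST_SUFFIXES,
    reverse: Callable[[str], str] = reverse_dns,
    forward: Callable[[str], Sequence[str]] = forward_dns,
) -> bool:
    """True only if the PTR name has a trusted suffix and resolves back to the same IP."""
    try:
        host = reverse(ip).lower().rstrip(".")
    except OSError:
        return False  # no PTR record, or the lookup failed
    if not host.endswith(suffixes):
        return False
    try:
        return ip in forward(host)
    except OSError:
        return False
```

The resolvers are passed in as parameters so the logic can be exercised without network access. In production the results would need caching, since two DNS lookups per request is expensive.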

montroser 12/18/2025||
Reminds me of poisoning bot responses with zip bombs of sorts: https://idiallo.com/blog/zipbomb-protection
prmoustache 12/19/2025|
I was thinking of adding links to zip bombs that would not be shown to users unless they click in a one-pixel area in the bottom-left corner of the screen, but then I realized some people have browsers/extensions that preload links to show thumbnails, and I would totally zip bomb them.
docheinestages 12/19/2025||
Reminds me of this "Nathan for You" episode: https://www.youtube.com/watch?v=p9KeopXHcf8
megamix 12/19/2025||
Without looking at the src, how does one detect these scrapers? I assume there’s a trade-off somewhere, but do the scrapers not fake their headers in the request? Is this a cat-and-mouse game?
654wak654 12/20/2025||
Looking through all the methods people are developing and proposing in this thread, there is a story developing where the "clean" machines are pushing humans to devolve into toxic porn-crazed racists with stolen material.

Makes me wish I was a good enough writer to develop this into something. Maybe I can use an LLM to write it...

654wak654 12/20/2025|
Ah wait this is literally in the Matrix where humanity darkened the sky.
taurath 12/18/2025||
Any other threads on the prevalence and nuisance of scrapers? I didn’t have any idea it was this bad.
crote 12/18/2025||
I've been seeing "we had to take the forum/website offline to deal with scrapers" messages on quite a few niche websites now. They are an absolute pest.
n1xis10t 12/18/2025||
Really? I haven’t started to see that yet. Weird
n1xis10t 12/18/2025||
Here’s one from yesterday: https://news.ycombinator.com/item?id=46302496#46306025
xgulfie 12/19/2025||
Does anyone know if meta name=rating content=adult will also get them to buzz off?
admiralrohan 12/19/2025|
How do you know whether it is coming from AI scrapers? Do they leave any recognizable footprint?

I have been getting lots of noisy traffic since last month, and it has increased my Vercel bill 4x. It's not DDoS-like: the requests are much slower, but they are definitely not from humans.
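
One rough starting point is to tally requests by self-declared crawler name in the access logs. The sketch below assumes logs in the common "combined" format (Vercel's log export may be shaped differently) and uses an illustrative, incomplete list of crawler tokens. It only catches crawlers that identify themselves, so anything sending a browser-like User-Agent will be missed.

```python
import re
from collections import Counter
from typing import Iterable

# Illustrative and incomplete: tokens some AI crawlers are known to send.
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider", "PerplexityBot")

# Combined log format ends with: "referer" "user-agent"
USER_AGENT_AT_END = re.compile(r'"[^"]*" "([^"]*)"\s*$')


def count_ai_crawlers(lines: Iterable[str]) -> Counter:
    """Count log lines whose User-Agent contains a known crawler token."""
    counts: Counter = Counter()
    for line in lines:
        match = USER_AGENT_AT_END.search(line)
        if match is None:
            continue  # not in the expected format
        user_agent = match.group(1).lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in user_agent:
                counts[token] += 1
                break
    return counts
```

Usage would be along the lines of `count_ai_crawlers(open("access.log"))`, where the file name is a placeholder.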
