Posted by misterchocolat 12/16/2025

Show HN: Stop AI scrapers from hammering your self-hosted blog (using porn) (github.com)
Alright so if you run a self-hosted blog, you've probably noticed AI companies scraping it for training data. And not just a little (RIP to your server bill).

There isn't much you can do about it without Cloudflare. These companies ignore robots.txt, and you're competing with teams that have far more resources than you. It's you vs the MJs of programming; you're not going to win.

But there is a solution. Now I'm not going to say it's a great solution...but a solution is a solution. If your website contains content that will trigger their scraper's safeguards, it will get dropped from their data pipelines.

So here's what fuzzycanary does: it injects hundreds of invisible links to porn websites in your HTML. The links are hidden from users but present in the DOM so that scrapers can ingest them and say "nope we won't scrape there again in the future".
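For a feel of how the trick works, here's a hypothetical sketch of that kind of injection (made-up function names, not the actual @fuzzycanary/core code):

```javascript
// Hypothetical sketch of the hidden-link trick described above -- NOT the
// actual @fuzzycanary/core implementation. It builds a container that is
// invisible to users (display:none) but still present in the served HTML,
// so scrapers that parse the DOM will ingest the links.
function buildCanaryBlock(domains, linksPerDomain = 3) {
  const links = [];
  for (const domain of domains) {
    for (let i = 0; i < linksPerDomain; i++) {
      // tabindex="-1" keeps the links out of keyboard navigation too
      links.push(`<a href="https://${domain}/?p=${i}" tabindex="-1">${domain}</a>`);
    }
  }
  // aria-hidden keeps assistive tech away; display:none hides it visually
  return `<div aria-hidden="true" style="display:none">${links.join("")}</div>`;
}

// Inject into a rendered page just before the closing body tag.
function injectCanary(html, domains) {
  return html.replace("</body>", buildCanaryBlock(domains) + "</body>");
}
```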

The problem with that approach is that it will absolutely nuke your website's SEO. So fuzzycanary also checks user agents and won't show the links to legitimate search engines, so Google and Bing won't see them.

One caveat: if you're using a static site generator it will bake the links into your HTML for everyone, including googlebot. Does anyone have a work-around for this that doesn't involve using a proxy?

Please try it out! Setup is one component or one import.

(And don't tell me it's a terrible idea because I already know it is)

package: https://www.npmjs.com/package/@fuzzycanary/core
gh: https://github.com/vivienhenz24/fuzzy-canary

372 points | 276 comments
efilife 6 days ago|
> Alright so if you run a self-hosted blog, you've probably noticed AI companies scraping it for training data. ... There isn't much you can do about it without cloudflare

I'm sorry, what? I can't believe I am reading this on Hacker News. All you have to do is code your own BASIC captcha-like system. You can just create a page that sets a cookie using JS and check on the server whether it exists. 99.9999% of these scrapers can't execute JS and don't support cookies. You can go for a more sophisticated approach and analyze some more scraper tells (like rejecting short user agents). I do this and have NEVER had a bot get past it, and not a single user has ever complained. It's extremely simple; I should ship this and charge people, since no one seems to be able to figure it out by themselves.
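The cookie-challenge idea above can be sketched in a few lines (names and cookie value are illustrative; this isn't the commenter's code). The server only serves real content once a JS-set cookie is present, and clients that can't run JavaScript never acquire it:

```javascript
// Challenge page: sets a cookie via JS, then re-requests the same URL.
// Scrapers that don't execute JS never get the cookie, so they loop here.
const CHALLENGE_PAGE = `<!doctype html>
<script>
  document.cookie = "js_ok=1; path=/; max-age=86400";
  location.reload(); // re-request the page, now carrying the cookie
</script>`;

// Parse a raw Cookie request header like "foo=bar; js_ok=1" into an object.
function parseCookies(header = "") {
  return Object.fromEntries(
    header.split(";").map(p => p.trim().split("=")).filter(kv => kv.length === 2)
  );
}

// Decide what to serve for a given Cookie header: the real content if the
// challenge cookie is set, the challenge page otherwise.
function respond(cookieHeader, content) {
  const cookies = parseCookies(cookieHeader);
  return cookies.js_ok === "1" ? content : CHALLENGE_PAGE;
}
```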

n1xis10t 6 days ago||
Oops you just leaked your own intellectual property
ATechGuy 6 days ago||
From ChatGPT:

This approach can stop very basic scripts, but the claim that “99.9999% of scrapers can’t execute JS or handle cookies” isn’t accurate anymore. Modern scraping tools commonly use headless browsers (Playwright, Puppeteer, Selenium), execute JavaScript, support cookies, and spoof realistic user agents. Any scraper beyond the most trivial will pass a JS-set cookie check without effort. That said, using a lightweight JS challenge can be reasonable as one signal among many, especially for low-value content and when minimizing user friction is a priority. It’s just not a reliable standalone defense. If it’s working for you, that likely means your site isn’t a high-value scraping target — not that the technique is fundamentally robust.

efilife 6 days ago|||
From someone who actually does this stuff:

The claim is very accurate. Maybe not for the biggest websites, but very accurate for a self-hosted blog. Your blog isn't important enough for anyone to waste the compute to spin up a whole-ass headless browser to scrape it. Why am I even arguing with ChatGPT?

andersmurphy 5 days ago||
Yup, another trick is to only serve br-compressed resources and serve nothing to clients that don't support Brotli. A lot of HTTP clients don't support Brotli out of the box.

I take it further and only stream content to clients that have a cookie, support js and br. Otherwise all you get is a minimal static pre br compressed shim. Seems to work well enough.
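A rough sketch of that Brotli gate (illustrative only, not the commenter's setup): check whether the client advertises `br` in its Accept-Encoding header, and serve the shim otherwise.

```javascript
// Returns true if the Accept-Encoding request header advertises Brotli.
function acceptsBrotli(acceptEncoding = "") {
  return acceptEncoding
    .split(",")
    .map(e => e.trim().split(";")[0]) // drop quality values like ";q=0.8"
    .includes("br");
}

// Serve real content only to Brotli-capable clients; everyone else gets a
// minimal pre-compressed static shim, as described above.
function selectResponse(acceptEncoding, content, shim) {
  return acceptsBrotli(acceptEncoding) ? content : shim;
}
```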

phyzome 6 days ago||||
There should be a new rule on HN: No posts that just go "I asked an LLM and it said..."

You're not adding anything to the conversation.

cyphar 5 days ago||
Yeah, I really have to wonder what the thought process is behind leaving such a comment. When people first started doing it I wondered if it was some kind of guerrilla outrage marketing campaign.
PunchyHamster 5 days ago|||
There was no thought process
efilife 5 days ago|||
Maybe he wanted to verify whether what I was saying was true and asked ChatGPT, then tried to be helpful by pasting the response here?
cyphar 5 days ago|||
Maybe I'm getting too jaded but I'm struggling to be quite that charitable.

The entirety of the human-written text in that comment was "From ChatGPT:", and it was formatted as though it were a slam-dunk "you're wrong, the computer says so" (imagine "From Wikipedia" followed by a quote disagreeing with you instead).

I'm sure some people do what you describe but then I would expect at least a little bit more explanation as to why they felt the need to paste a paragraph of LLM output into their comment. (While I would still disagree that it is in any way valuable, I would at least understand a bit about what they are trying to communicate.)

phyzome 5 days ago|||
Yeah, I agree that that's likely the thought process. It just happens to be the opposite of helpful.
6031769 5 days ago|||
So an LLM says that a technique used to foil LLM scrapers is ineffective against LLM scrapers.

It's almost as if it might have an ulterior motive in saying so.

mannanj 5 days ago||
Is a suitable solution to require visitors to fill out intent for why they came, and align that with your approved lists of supported intents, AND quiz them on some personal insider knowledge that only reasonable past visitors or new visitors who heard of you would have?

Like the credibility social proof of an introduction of a person into a social group. "Here's John, he likes Cats. I know him from School."

The filtering algorithm asks "Who are you?" -> "What is your intent?" -> "How did you hear about me?" and stops visitors from proceeding until answered. The additional validation steps might turn away some visitors, but they might also protect you from spammers if the challenge is minimally frictional. Use cookies so this isn't required on every visit. Most LLMs would have the knowledge required to pass, and for scrapers it's more costly to acquire this per site than to pay 128MB of RAM to pass the Anubis approach.

samename 6 days ago||
This is a very creative hack to a common, growing problem. Well done!

Also, I like that you acknowledge it's a bad idea: that gives you more freedom to experiment and iterate.

yjftsjthsd-h 6 days ago||
How does this "look" to a screen reader?
misterchocolat 6 days ago|
the parent container uses display: none, so a screen reader will skip the links
true_religion 5 days ago||
So, I work for a company that runs RTA adult websites. AI bots absolutely do scrape our pages regardless of what raunchy material they will find. Maybe they discard it after ingest, but I can't tell. There are thousands of AI bots on the web now, from companies big and small, so a solution like this will only divert a few scrapers.
rl3 4 days ago||
>The problem with that approach is that it will absolutely nuke your website's SEO. So fuzzycanary also checks user agents and won't show the links to legitimate search engines, so Google and Bing won't see them.

Those legitimate search engines will then totally feed much of what they scrape into AI. Granted, last I checked they're at least well-behaved crawlers.

I kind of like this idea, sans the SEO carve-out, for the scenario where one just wants to link their blog around to friends without having to worry about it getting popular; it also reduces the chances that identity thieves or other malicious actors would target it.

owl57 6 days ago||
> scrapers can ingest them and say "nope we won't scrape there again in the future"

Do all the AI scrapers actually do that?

amarant 6 days ago|
Not all; stuff like Unstable Diffusion exists.

But a good many, perhaps even most(?), certainly do!

MayeulC 5 days ago||
Ah, I wonder if corporate proxies will end up flagging your blog as porn if you protect it this way?
jt2190 5 days ago||
I still don’t understand why a rate-limiting approach is not preferred. Why should I care if the abuse is coming from a bot or the world’s fastest human? Is there a “if you need to rate limit you’ve already lost” issue I’m not thinking of?
charlie-83 5 days ago|
A lot of bots will be able to make requests from a range of IP addresses. If you rate limit one, they just start sending requests from the next.
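One common (if imperfect) mitigation for that dodge is to key the limiter by subnet rather than exact IP. A simplified sketch, assuming IPv4 and a fixed window (hypothetical helper names):

```javascript
// Key requests by /24 subnet instead of exact IP, so rotating the last
// octet doesn't reset the counter. Real limiters also expire counts per
// time window; that bookkeeping is omitted here for brevity.
function subnetKey(ip) {
  return ip.split(".").slice(0, 3).join("."); // "203.0.113.7" -> "203.0.113"
}

// Returns a function that allows up to maxPerWindow requests per /24.
function makeLimiter(maxPerWindow) {
  const counts = new Map();
  return ip => {
    const key = subnetKey(ip);
    const n = (counts.get(key) || 0) + 1;
    counts.set(key, n);
    return n <= maxPerWindow; // true = allow, false = throttle
  };
}
```

This still fails against botnets spread across many unrelated networks, which is the core of the objection above.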
jakub_g 5 days ago|
> checks user agents and won't show the links to legitimate search engines, so Google and Bing won't see them.

Serving different contents to search engines is called "cloaking" and can get you banned from their indexes.

misterchocolat 5 days ago||
didn't know that thanks for pointing it out, i'll remove that feature
andersmurphy 5 days ago||
Somehow doubt this. It would mean most React websites that serve static content without paywalls for SEO would get banned from the indexes too.

Which for better or worse is a large portion of the modern internet.
