Posted by misterchocolat 12/16/2025

Show HN: Stop AI scrapers from hammering your self-hosted blog (using porn) (github.com)
Alright so if you run a self-hosted blog, you've probably noticed AI companies scraping it for training data. And not just a little (RIP to your server bill).

There isn't much you can do about it without Cloudflare. These companies ignore robots.txt, and you're competing with teams that have far more resources than you. It's you vs the MJs of programming; you're not going to win.

But there is a solution. Now I'm not going to say it's a great solution...but a solution is a solution. If your website contains content that will trigger their scraper's safeguards, it will get dropped from their data pipelines.

So here's what fuzzycanary does: it injects hundreds of invisible links to porn websites in your HTML. The links are hidden from users but present in the DOM so that scrapers can ingest them and say "nope we won't scrape there again in the future".
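For a feel of how the trick works, here's a hypothetical sketch of that kind of injection (made-up function names, not the actual @fuzzycanary/core code):

```javascript
// Hypothetical sketch of the hidden-link trick described above -- NOT the
// actual @fuzzycanary/core implementation. It builds a container that is
// invisible to users (display:none) but still present in the served HTML,
// so scrapers that parse the DOM will ingest the links.
function buildCanaryBlock(domains, linksPerDomain = 3) {
  const links = [];
  for (const domain of domains) {
    for (let i = 0; i < linksPerDomain; i++) {
      // tabindex="-1" keeps the links out of keyboard navigation too
      links.push(`<a href="https://${domain}/?p=${i}" tabindex="-1">${domain}</a>`);
    }
  }
  // aria-hidden keeps assistive tech away; display:none hides it visually
  return `<div aria-hidden="true" style="display:none">${links.join("")}</div>`;
}

// Inject into a rendered page just before the closing body tag.
function injectCanary(html, domains) {
  return html.replace("</body>", buildCanaryBlock(domains) + "</body>");
}
```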

The problem with that approach is that it will absolutely nuke your website's SEO. So fuzzycanary also checks user agents and won't show the links to legitimate search engines, so Google and Bing won't see them.

One caveat: if you're using a static site generator it will bake the links into your HTML for everyone, including googlebot. Does anyone have a work-around for this that doesn't involve using a proxy?

Please try it out! Setup is one component or one import.

(And don't tell me it's a terrible idea because I already know it is)

package: https://www.npmjs.com/package/@fuzzycanary/core
gh: https://github.com/vivienhenz24/fuzzy-canary

372 points | 276 comments
efilife 6 days ago|
> Alright so if you run a self-hosted blog, you've probably noticed AI companies scraping it for training data. ... There isn't much you can do about it without cloudflare

I'm sorry, what? I can't believe I am reading this on Hacker News. All you have to do is code your own BASIC captcha-like system. You can just create a page that sets a cookie using JS and check on the server whether it exists. 99.9999% of these scrapers can't execute JS and don't support cookies. You can go for a more sophisticated approach and analyze some more scraper tells (like rejecting short user agents). I do this and have NEVER had a bot get past it, and not a single user has ever complained. It's extremely simple; I should ship this and charge people, since no one seems to be able to figure it out by themselves.
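The cookie-challenge idea above can be sketched in a few lines (names and cookie value are illustrative; this isn't the commenter's code). The server only serves real content once a JS-set cookie is present, and clients that can't run JavaScript never acquire it:

```javascript
// Challenge page: sets a cookie via JS, then re-requests the same URL.
// Scrapers that don't execute JS never get the cookie, so they loop here.
const CHALLENGE_PAGE = `<!doctype html>
<script>
  document.cookie = "js_ok=1; path=/; max-age=86400";
  location.reload(); // re-request the page, now carrying the cookie
</script>`;

// Parse a raw Cookie request header like "foo=bar; js_ok=1" into an object.
function parseCookies(header = "") {
  return Object.fromEntries(
    header.split(";").map(p => p.trim().split("=")).filter(kv => kv.length === 2)
  );
}

// Decide what to serve for a given Cookie header: the real content if the
// challenge cookie is set, the challenge page otherwise.
function respond(cookieHeader, content) {
  const cookies = parseCookies(cookieHeader);
  return cookies.js_ok === "1" ? content : CHALLENGE_PAGE;
}
```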

n1xis10t 6 days ago||
Oops you just leaked your own intellectual property
ATechGuy 6 days ago||
From ChatGPT:

This approach can stop very basic scripts, but the claim that “99.9999% of scrapers can’t execute JS or handle cookies” isn’t accurate anymore. Modern scraping tools commonly use headless browsers (Playwright, Puppeteer, Selenium), execute JavaScript, support cookies, and spoof realistic user agents. Any scraper beyond the most trivial will pass a JS-set cookie check without effort. That said, using a lightweight JS challenge can be reasonable as one signal among many, especially for low-value content and when minimizing user friction is a priority. It’s just not a reliable standalone defense. If it’s working for you, that likely means your site isn’t a high-value scraping target — not that the technique is fundamentally robust.

efilife 6 days ago|||
From someone who actually does this stuff:

The claim is very accurate. Maybe not for the biggest websites, but very accurate for a self-hosted blog. Your blog isn't important enough for anyone to waste the compute to spin up a whole-ass headless browser to scrape it. Why am I even arguing with ChatGPT?

andersmurphy 5 days ago||
Yup, another trick is to only serve br-compressed resources and serve nothing to clients that don't support Brotli. A lot of HTTP clients don't support Brotli out of the box.

I take it further and only stream content to clients that have a cookie, support js and br. Otherwise all you get is a minimal static pre br compressed shim. Seems to work well enough.
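A rough sketch of that Brotli gate (illustrative only, not the commenter's setup): check whether the client advertises `br` in its Accept-Encoding header, and serve the shim otherwise.

```javascript
// Returns true if the Accept-Encoding request header advertises Brotli.
function acceptsBrotli(acceptEncoding = "") {
  return acceptEncoding
    .split(",")
    .map(e => e.trim().split(";")[0]) // drop quality values like ";q=0.8"
    .includes("br");
}

// Serve real content only to Brotli-capable clients; everyone else gets a
// minimal pre-compressed static shim, as described above.
function selectResponse(acceptEncoding, content, shim) {
  return acceptsBrotli(acceptEncoding) ? content : shim;
}
```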

phyzome 6 days ago||||
There should be a new rule on HN: No posts that just go "I asked an LLM and it said..."

You're not adding anything to the conversation.

cyphar 5 days ago||
Yeah, I really have to wonder what the thought process is behind leaving such a comment. When people first started doing it I wondered if it was some kind of guerrilla outrage marketing campaign.
PunchyHamster 5 days ago|||
There was no thought process
efilife 5 days ago|||
Maybe he wanted to verify whether what I was saying was true and asked ChatGPT, then tried to be helpful by pasting the response here?
cyphar 5 days ago|||
Maybe I'm getting too jaded but I'm struggling to be quite that charitable.

The entirety of the human-written text in that comment was "From ChatGPT:", and it was formatted as though it were a slam-dunk "you're wrong, the computer says so" (imagine "From Wikipedia" followed by a quote disagreeing with you instead).

I'm sure some people do what you describe but then I would expect at least a little bit more explanation as to why they felt the need to paste a paragraph of LLM output into their comment. (While I would still disagree that it is in any way valuable, I would at least understand a bit about what they are trying to communicate.)

phyzome 5 days ago|||
Yeah, I agree that that's likely the thought process. It just happens to be the opposite of helpful.
6031769 5 days ago|||
So an LLM says that a technique used to foil LLM scrapers is ineffective against LLM scrapers.

It's almost as if it might have an ulterior motive in saying so.

mannanj 5 days ago||
Is a suitable solution to require visitors to fill out intent for why they came, and align that with your approved lists of supported intents, AND quiz them on some personal insider knowledge that only reasonable past visitors or new visitors who heard of you would have?

Like the credibility social proof of an introduction of a person into a social group. "Here's John, he likes Cats. I know him from School."

The filtering algorithm asks "Who are you?" -> "What is your intent?" -> "How did you hear about me?" and stops visitors from proceeding until answered. The additional validation steps might turn away some visitors, but they might also protect you from spammers if the challenge is minimally frictional. Use cookies so this isn't required on every visit. Most LLMs would have the knowledge required to pass, and for scrapers it's more costly to acquire this per site than to pay 128MB of RAM to pass the Anubis approach.

samename 6 days ago||
This is a very creative hack to a common, growing problem. Well done!

Also, I like that you acknowledge it's a bad idea: that gives you more freedom to experiment and iterate.

yjftsjthsd-h 6 days ago||
How does this "look" to a screen reader?
misterchocolat 6 days ago|
the parent container uses display: none, so a screen reader will skip the links
true_religion 5 days ago||
So, I work for a company that runs RTA adult websites. AI bots absolutely do scrape our pages regardless of what raunchy material they will find. Maybe they discard it after ingest, but I can't tell. There are thousands of AI bots on the web now, from companies big and small, so a solution like this will only divert a few scrapers.
rl3 4 days ago||
>The problem with that approach is that it will absolutely nuke your website's SEO. So fuzzycanary also checks user agents and won't show the links to legitimate search engines, so Google and Bing won't see them.

Those legitimate search engines will then totally feed much of what they scrape into AI. Granted, last I checked they're at least well-behaved crawlers.

I kind of like this idea, sans the SEO carve-out, for the scenario where one just wants to link their blog around to friends without having to worry about it getting popular; it also reduces the chances that identity thieves or other malicious actors would target it.

owl57 6 days ago||
> scrapers can ingest them and say "nope we won't scrape there again in the future"

Do all the AI scrapers actually do that?

amarant 6 days ago|
Not all; stuff like Unstable Diffusion exists.

But a good many, perhaps even most(?), certainly do!

MayeulC 5 days ago||
Ah, I wonder if corporate proxies will end up flagging your blog as porn if you protect it this way?
jt2190 5 days ago||
I still don’t understand why a rate-limiting approach is not preferred. Why should I care if the abuse is coming from a bot or the world’s fastest human? Is there a “if you need to rate limit you’ve already lost” issue I’m not thinking of?
charlie-83 5 days ago|
A lot of bots will be able to make requests from a range of IP addresses. If you rate limit one, they just start sending requests from the next.
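One common (if imperfect) mitigation for that dodge is to key the limiter by subnet rather than exact IP. A simplified sketch, assuming IPv4 and a fixed window (hypothetical helper names):

```javascript
// Key requests by /24 subnet instead of exact IP, so rotating the last
// octet doesn't reset the counter. Real limiters also expire counts per
// time window; that bookkeeping is omitted here for brevity.
function subnetKey(ip) {
  return ip.split(".").slice(0, 3).join("."); // "203.0.113.7" -> "203.0.113"
}

// Returns a function that allows up to maxPerWindow requests per /24.
function makeLimiter(maxPerWindow) {
  const counts = new Map();
  return ip => {
    const key = subnetKey(ip);
    const n = (counts.get(key) || 0) + 1;
    counts.set(key, n);
    return n <= maxPerWindow; // true = allow, false = throttle
  };
}
```

This still fails against botnets spread across many unrelated networks, which is the core of the objection above.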
jakub_g 5 days ago|
> checks user agents and won't show the links to legitimate search engines, so Google and Bing won't see them.

Serving different contents to search engines is called "cloaking" and can get you banned from their indexes.

misterchocolat 5 days ago||
didn't know that thanks for pointing it out, i'll remove that feature
andersmurphy 5 days ago||
Somehow doubt this. It would mean most React websites that serve static content without paywalls for SEO would get banned from the indexes too.

Which for better or worse is a large portion of the modern internet.
