Posted by todsacerdoti 12/21/2025

How I protect my Forgejo instance from AI web crawlers (her.esy.fun)
189 points | 98 comments | page 2
userbinator 12/22/2025|
> Unfortunately this means, my website could only be seen if you enable javascript in your browser.

Or have a web-proxy that matches on the pattern and extracts the cookie automatically. ;-)

apples_oranges 12/22/2025||
HTTP 412 would be better I guess..
jsheard 12/22/2025|
You shouldn't really serve aggressive scrapers any kind of error or otherwise unusual response, because they'll just take that as a signal to try again with a different IP address or user agent, or a residential proxy, or a headless browser, or whatever else. There's no obligation to be polite to rude guests, give them a 200 OK containing the output of a Markov chain trained on the Bee Movie script instead.
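
For illustration, a minimal sketch of that idea in Python; the corpus file name and the word-level, order-1 chain are arbitrary choices, not anything from the thread:

    import random
    from collections import defaultdict

    # Toy word-level Markov chain: train it on any junk corpus and serve the
    # output as the 200 OK body instead of real content.
    def build_chain(text):
        chain = defaultdict(list)
        words = text.split()
        for a, b in zip(words, words[1:]):
            chain[a].append(b)
        return chain

    def babble(chain, length=200):
        word = random.choice(list(chain))
        out = [word]
        for _ in range(length):
            followers = chain.get(word)
            word = random.choice(followers) if followers else random.choice(list(chain))
            out.append(word)
        return " ".join(out)

    corpus = open("corpus.txt", encoding="utf-8").read()  # any long text will do
    print(babble(build_chain(corpus)))                    # serve this to the scraper

The output can also be generated ahead of time and cached, so the generator doesn't have to run per request.
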
loloquwowndueo 12/22/2025||
Unless your output is static, you'd then be paying the cost of running the Markov generator.
justsomehnguy 12/22/2025||
A similar approach: the proxy/webserver itself can set a cookie when the visitor hits some path, e.g. example.net/sesame/open.

For a single user or a small team this could be enough.
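
A rough nginx sketch of that idea, assuming nginx sits in front of the Git instance; the path, cookie name, lifetime, and backend address are placeholders:

    # Visiting /sesame/open sets the cookie and bounces back to the site root.
    location = /sesame/open {
        add_header Set-Cookie "sesame=open; Path=/; Max-Age=2592000; HttpOnly";
        return 302 /;
    }

    # Everything else requires the cookie; visitors without it get a 403.
    location / {
        if ($cookie_sesame != "open") {
            return 403;
        }
        proxy_pass http://127.0.0.1:3000;  # Forgejo/Gitea backend
    }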

frogperson 12/22/2025||
I think it would be really cool if someone built a reverse proxy just for dealing with these bad actors.

I would really like to easily serve some Markov-chain nonsense to AI bots.

jakewil 12/22/2025|
perhaps Iocaine [1] is what you're looking for. See the demo page [2] for what it serves to AI crawlers.

1. https://iocaine.madhouse-project.org/

2. https://poison.madhouse-project.org/

philipwhiuk 12/22/2025|||
For images you have stuff like https://nightshade.cs.uchicago.edu/whatis.html
opem 12/22/2025||||
This site blocked me right away; it seems quite aggressive.
gkbrk 12/22/2025|||
Seems like a good way to waste tons of your bandwidth. Almost every serious data pipeline has some quality filtering in there (even open-source ones like FineWeb and EduWeb). And the stuff Iocaine generates instantly gets filtered.

Feel free to test this with any classifier or cheapo LLM.

reconnecting 12/22/2025||
tirreno (1) guy here.

Our open-source system can block IP addresses based on rules triggered by specific behavior.

Can you elaborate on what exact type of crawlers you would like to block? Like, a leaky bucket of a certain number of requests per minute?

1. https://github.com/tirrenotechnologies/tirreno
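
As a point of reference, a leaky bucket of roughly that kind fits in a few lines of Python; the rate and capacity below are placeholder values, not tirreno defaults:

    import time

    # Generic per-IP leaky bucket, roughly "N requests per minute".
    class LeakyBucket:
        def __init__(self, rate_per_min=60, capacity=60):
            self.rate = rate_per_min / 60.0   # drain rate, requests per second
            self.capacity = capacity          # burst size tolerated before refusing
            self.state = {}                   # ip -> (current level, last timestamp)

        def allow(self, ip):
            now = time.monotonic()
            level, last = self.state.get(ip, (0.0, now))
            level = max(0.0, level - (now - last) * self.rate)  # leak since last request
            if level + 1 > self.capacity:
                self.state[ip] = (level, now)
                return False                  # over the limit: block, 429, or tarpit
            self.state[ip] = (level + 1, now)
            return True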

reconnecting 12/22/2025||
I believe there is a slight misunderstanding regarding the role of 'AI crawlers'.

Bad crawlers have been around since the very beginning. Some of them look for known vulnerabilities, some scrape content for third-party services. Most of them use spoofed UAs to pretend to be legitimate bots.

This is approximately 30–50% of traffic on any website.

notachatbot123 12/22/2025|||
The article is about AI web crawlers. How can your tool help and how would one set it up for this specific context?
reconnecting 12/22/2025||
I don't see how an AI crawler is different from any other.

The simplest approach is to count the UA as risky, or to flag multiple 404 errors or HEAD requests, and block on that. Those are rules we already have out of the box.

It's open source; there's no pain in writing specific rules for rate limiting, hence my question.

Plus, we have developed a dashboard for manually choosing UA blocks by name, but we're still not sure whether this is something that would really be helpful for website operators.
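
As a generic illustration of that kind of rule, in plain Python rather than tirreno's own rule format, with made-up thresholds:

    from collections import Counter

    NOT_FOUND_LIMIT, HEAD_LIMIT = 20, 50      # arbitrary example thresholds
    not_found, heads = Counter(), Counter()

    # Feed each request in; returns True once an IP looks like a scanner/crawler.
    def is_risky(ip, method, status):
        if status == 404:
            not_found[ip] += 1
        if method == "HEAD":
            heads[ip] += 1
        return not_found[ip] > NOT_FOUND_LIMIT or heads[ip] > HEAD_LIMIT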

Roark66 12/22/2025|||
>It's open source, there's no pain in writing specific rules for rate limiting, thus my question.

Depends on the goal.

The author wants his instance not to get killed. Request rate limiting may achieve that easily, in a way that is transparent to normal users.

mmarian 12/22/2025|||
> count the UA as risky

It's trivial to spoof UAs unfortunately.

reconnecting 12/23/2025||
It depends. If you want to stop OAI-SearchBot/1.3, matching the UA will be enough.
mmarian 12/24/2025||
Why would you need tirreno if you just want to stop OAI's bot though?
reconnecting 12/24/2025||
OAI's is just an example that's easy to explain.

I believe that if something is publicly available, it shouldn't be overprotected in most cases.

However, there are many advanced cases, such as crawlers that collect data for platform impersonation (for scams), custom phishing attacks, or account brute-force attacks. In those cases, I use tirreno to understand traffic across different dimensions.

mmarian 12/22/2025||
> block IP addresses based on rules triggered by specific behavior

Problem is, bots can easily resort to residential proxies, at which point you'll end up blocking legitimate traffic.

reconnecting 12/23/2025||
Again, it depends. Residential proxies are much more expensive, and most vulnerability scanners will never shift to them.

I believe there is a low chance that a real customer behind a given residential IP will come to your resource. If you run an EU service, there is no pain in blocking Asian IPs, and vice versa.

What really matters here is that most people block IPs on autopilot, without looking at the distribution of their actions.

stronglikedan 12/22/2025||
> Unfortunately this means, my website could only be seen if you enable javascript in your browser. I feel this is acceptable.

I wouldn't be surprised if all this AI stuff was just a global conspiracy to get everyone to turn on JS.

KronisLV 12/22/2025||
We should just have some standard for crawlable archived versions of pages, with no back end or DB interaction behind them. For example, if there's a reverse proxy, whatever it outputs gets archived, and the archive version never actually passes a call on to the back end; the same goes for translating the output of any dynamic JS into fully static HTML. Then add some proof of work that works without JS and is a web standard (e.g. the server sends a header, the client sends the correct response and gets access to the archive), mainstream a culture of low-cost hosting for such archives, make sure this feature is enabled in the most basic configuration of every web server, log it separately, and you're done.

Obviously such a thing will never happen, because the web and its culture went in a different direction. But if it were mainstream, you'd get easy-to-consume archives (also useful for regular archival and data hoarding), and the "live" versions of sites wouldn't have their logs bogged down by stupid spam.

Or, if PoW were a proper web standard with no JS, people who want to tell AI and other crawlers to fuck off could at least make it uneconomical to crawl their stuff en masse. In my view, proof of work that works through headers should, in today's world, be as ubiquitous as TLS.
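
A hypothetical sketch of such a header-based handshake; no such web standard exists, and the header shape, SHA-256, and hex-zero difficulty scheme are invented here purely for illustration:

    import hashlib, itertools, os

    # Server issues a challenge, e.g. sent as "X-PoW: <nonce>;<difficulty>".
    def make_challenge(difficulty=4):
        return os.urandom(8).hex(), difficulty

    # Client brute-forces a counter whose hash has enough leading zero hex digits.
    def solve(nonce, difficulty):
        for i in itertools.count():
            if hashlib.sha256(f"{nonce}{i}".encode()).hexdigest().startswith("0" * difficulty):
                return str(i)

    # Server checks the answer with a single hash before serving the archive.
    def verify(nonce, difficulty, answer):
        return hashlib.sha256(f"{nonce}{answer}".encode()).hexdigest().startswith("0" * difficulty)

    nonce, d = make_challenge()
    print(verify(nonce, d, solve(nonce, d)))  # True; expected client cost ~16**difficulty hashes
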

agentifysh 12/22/2025||
Never heard of Forgejo, should one switch from Gitea?
tuananh 12/23/2025|
It's a fork of Gitea.
Roark66 12/22/2025||
I'm glad the author clarified he wants to prevent his instance from crashing, not simply "block robots and allow humans".

I think the idea that you can block bots and allow humans is fallacious.

We should focus on the specific behaviour that causes problems (like making a bajillion requests, one for each commit, instead of cloning the repo). To fix this, we should block clients that work in such ways. If these bots learn to request at a reasonable pace, who cares whether they are bots, humans, bots under the control of an individual human, or bots owned by a huge company scraping for training data? Once you make your code (or anything else) public, trying to limit access to only a certain class of consumers is a waste of effort.

Also, perhaps I'm biased, because I run SearXNG and Crawl4AI (and a few ancillaries like Jina rerank, etc.) in my homelab, so I can tell my AI to perform live internet searches and it can fetch pretty much any website. For code it has a way to clone stuff, but for things like issues, discussions, and PRs it goes mostly to GitHub.

I like that my AI can browse almost like me. I think this is the future way to consume a lot of the web (except sites like this one that are an actual pleasure to use).

The models sometimes hit sites they can't fetch. For those I use Firecrawl. I use an MCP proxy that lets me rewrite the tool descriptions, so my models get access to both my local Crawl4AI and the hosted (and rather expensive) Firecrawl, but they are told to use Firecrawl only as a last resort.

The more people use these kinds of solutions, the more incentive there will be for sites not to block users who use automation. Of course they will have to rely on alternative monetisation methods, but I think eventually these stupid captchas will disappear and reasonable rate limiting will prevail.

popcornricecake 12/22/2025||
> I think this is the future way to consume a lot of the web

I think I see many prompt injections in your future. Like captchas with a special bypass solution just for AIs that leads to special content.

asfdasfsd 12/22/2025|||
And what about people who block AI crawlers on moral grounds?
mintflow 12/22/2025|
Recently I noticed GitHub trying (but failing) to charge for self-hosted runners, so I found an afternoon to set up a mini PC, install FreeBSD and Gitea on it, and then set up Tailscale so it only listens on its 100.64.x.x IP address.

Since I don't make this node publicly accessible, there's no worry about AI web crawlers :)
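
For anyone copying this setup, a minimal sketch of the relevant Gitea app.ini section, assuming the machine's Tailscale address is 100.64.0.1 (a placeholder from the CGNAT range):

    [server]
    ; Bind only to the Tailscale interface so the instance is unreachable
    ; from the public internet.
    HTTP_ADDR = 100.64.0.1
    HTTP_PORT = 3000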