The deeper issue is that git forges are pathological for naive crawlers: every commit/file combo is a unique URL, so one medium repo explodes into Wikipedia-scale surface area if you just follow links blindly. A more robust pattern for small instances is to explicitly rate limit the expensive paths (/raw, per-commit views, “download as zip”), and treat “AI” as an implementation detail. Good bots that behave like polite users will still work; the ones that try to BFS your entire history at line rate hit a wall long before they can take your box down.
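To make that concrete, here's a minimal sketch of rate limiting only the expensive paths, written as a Flask hook with a per-IP token bucket. The path prefixes, the rates, and the choice of doing it in-process rather than at a reverse proxy are my own illustrative assumptions, not anything from the comment above.

```python
# Sketch: token-bucket limit only the expensive git-forge paths, per client IP.
# Prefixes and limits below are assumptions for illustration.
import time
from collections import defaultdict

from flask import Flask, abort, request

app = Flask(__name__)

EXPENSIVE_PREFIXES = ("/raw/", "/commit/", "/snapshot/", "/archive/")  # assumed paths
RATE = 0.5    # tokens refilled per second (~30 expensive requests per minute)
BURST = 20    # bucket capacity

_buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})

@app.before_request
def limit_expensive_paths():
    if not request.path.startswith(EXPENSIVE_PREFIXES):
        return  # cheap pages are untouched; polite users never notice
    bucket = _buckets[request.remote_addr]
    now = time.monotonic()
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["ts"]) * RATE)
    bucket["ts"] = now
    if bucket["tokens"] < 1:
        abort(429)  # a crawler BFS-ing the whole history hits the wall here
    bucket["tokens"] -= 1
```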
For some people the problem is an ideological one: we don't want AI vacuuming up all of our content. For those people, "is this an AI user?" is a useful question to answer. However, it's a hard one.
For many the problem is simply "there is a class of users that is putting way too much load on the system and it's causing problems". Initially I was playing whack-a-mole with this and dealing with alerts firing on a regular basis because Meta was crawling our site very aggressively, not backing off when errors were returned, etc.
I looked at rate limiting, but the work involved in distributed rate limiting versus the small number of offenders made the effort look a little silly, so I moved towards a "nuke it from orbit" strategy:
Requests are bucketed by class C subnet (a /24: 31.13.80.36 -> 31.13.80.x) and the request rate is tracked over 30-minute windows. If the request rate over that window exceeds a very generous threshold (one I've only seen a few very obvious and poorly behaved crawlers exceed), it fires an alert.
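A rough sketch of that bucketing and windowing (my own illustration, not the actual implementation; the threshold value is a placeholder):

```python
# Sketch: collapse each client IP to its /24, count requests in a sliding
# 30-minute window, and flag subnets that cross a generous threshold.
import time
from collections import defaultdict, deque

WINDOW = 30 * 60       # seconds
THRESHOLD = 50_000     # requests per window; pick something only abusive crawlers hit

_hits = defaultdict(deque)   # "31.13.80.x" -> timestamps of recent requests

def subnet_of(ip):
    return ".".join(ip.split(".")[:3]) + ".x"   # 31.13.80.36 -> 31.13.80.x

def record_request(ip, now=None):
    """Record one request; return True if this /24 just crossed the threshold."""
    now = now if now is not None else time.time()
    q = _hits[subnet_of(ip)]
    q.append(now)
    while q and q[0] < now - WINDOW:   # drop hits outside the window
        q.popleft()
    return len(q) > THRESHOLD
```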
The alert kicks off a flow where we look up the ASN covering every IP in that range, look up every range associated with those ASNs, and throw an alert in Slack with a big red "Block" button attached. When approved, the entire ASN is blocked at the edge.
It's never triggered on anything we weren't willing to block (e.g., a local consumer ISP). We've dropped a handful of foreign providers, some "budget" VPS providers, some more reputable cloud providers, and Facebook. It didn't take long before the alerts stopped--both the high-request-rate alerts and the excessive-load alerts from our application monitoring.
If anyone's interested in trying to implement something similar, there's a regularly updated database of ASN <-> IP ranges announced here: https://github.com/ipverse/asn-ip
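As a sketch of the "look up every range associated with an ASN" step, assuming a local clone of that repo and assuming the per-ASN files are laid out as as/<ASN>/ipv4-aggregated.txt (check the repo's README for the actual layout):

```python
# Sketch: expand an ASN into every announced IPv4 prefix using a local clone
# of ipverse/asn-ip. The file layout is an assumption; verify against the repo.
from pathlib import Path

def ranges_for_asn(repo, asn):
    """Return every announced IPv4 prefix for an ASN, one CIDR per entry."""
    path = Path(repo) / "as" / str(asn) / "ipv4-aggregated.txt"
    return [
        line.strip()
        for line in path.read_text().splitlines()
        if line.strip() and not line.startswith("#")
    ]

# e.g. ranges_for_asn("asn-ip", 32934) for Facebook's AS32934, which could
# then be fed to your edge firewall's block list.
```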
What exactly is the source of these mappings? I'd never heard of ipverse before; it seems to be a semi-anonymous GitHub organization, and its website has had a failing certificate for more than a year now.
Requiring a browser to crawl your site will slow down naive crawlers at scale.
But it wouldn't do much against individuals typing "what is a kumquat" into their local LLM tool, which issues 20 requests to answer the question. They're not really going to care or notice if the tool had to use a Playwright instance instead of curl.
Yet according to Cloudflare, that use case is responsible for ~all of my AI bot traffic, and it's 30x the traffic of direct human users. In my case, being a forum, it made more sense to just block the traffic.
If they're targeting code forges specifically (because they're after coding use cases), there's lots of interesting information that you won't get by just cloning a repo.
It's not just the current state of the repo, or all the commits (and their messages). It's the initial issue (and discussion) that led to a pull request (and review comments) that eventually gets squashed into a single commit.
The way you code with an agent is a lot more similar to the issue -> comments -> change -> review -> refinement sequence that you get by slurping the website.
Unfortunately, this kind of scraping seems to inconvenience the host way more than the scraper.
Another tangent: there are probably better-behaved scrapers; we just don't notice them as much.
Since I do actually host a couple of websites/services behind port 443, I can't just block everything that tries to scan my IP address on port 443. However, I've set up Cloudflare in front of those websites, so I do log and block any non-Cloudflare traffic (identified by Cloudflare's ASN, 13335) coming into port 443.
I also log and block IP addresses attempting to connect on port 80, since that's essentially deprecated.
This, of course, does not block traffic coming via the DNS names of the sites, since that will be routed through Cloudflare - but as someone mentioned, Cloudflare has its own anti-scraping tools. And then as another person mentioned, this does require the use of Cloudflare, which is a vast centralising force on the Internet and therefore part of a different problem...
I don't currently split out a separate list for IP addresses that have connected to HTTP(S) ports, but maybe I'll do that over Christmas.
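For anyone curious what the "only Cloudflare may talk to port 443" check might look like, here's a sketch that tests a source IP against Cloudflare's published IPv4 ranges. The comment above keys off ASN 13335 instead, so using the published list is my simplification.

```python
# Sketch: is a given source IP inside Cloudflare's published IPv4 ranges?
# Anything hitting 443 that fails this check came straight to the IP address
# and is a candidate for the block list.
import ipaddress
import urllib.request

CLOUDFLARE_IPV4_LIST = "https://www.cloudflare.com/ips-v4"   # plain-text CIDR list

def load_cloudflare_networks():
    with urllib.request.urlopen(CLOUDFLARE_IPV4_LIST) as resp:
        text = resp.read().decode()
    return [ipaddress.ip_network(line) for line in text.split() if line]

def is_cloudflare(ip, networks):
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)
```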
This is my current simple project: https://github.com/UninvitedActivity/UninvitedActivity
Apologies if the README is a bit rambling. It's evolved over time, and it's mostly for me anyway.
P.S. I always thought it was Yog Sothoth (not Sototh). Either way, I'm partial to Nyarlathotep. "The Crawling Chaos" always sounded like the coolest of the elder gods.
Thanks for this tip.
"GET /w/A_Wedding_at_City_Hall HTTP/1.1" 200 9677 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36"
The actual IP is in X-Forwarded-For and I didn't keep that. Hopefully it at least costs them a little bit more.
Git forges will expose a version of every file at every commit in the project's history. If you have a medium-sized project consisting of, say, 1,000 files and 10,000 commits, the crawler will identify on the order of 10 million URLs (1,000 files x 10,000 commits), the same order of magnitude as English Wikipedia, just for that one project. This is also very expensive for the git forge, as it needs to reconstruct the historical files from a bunch of commits.
Git forges interact spectacularly poorly with naively implemented web crawlers, unless the crawlers include logic to avoid exhaustively crawling them. You honestly get a pretty long way just by excluding URLs with long base64-like path elements, which isn't hard, but it's also not obvious.
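A sketch of that exclusion rule, with an arbitrary length cutoff and a deliberately loose character class (commit SHAs are hex, but "base64-like" covers other blob identifiers too):

```python
# Sketch: skip any URL whose path contains a long hash-looking segment
# (commit SHAs, blob ids, etc.). The cutoff of 20 characters is a guess.
import re
from urllib.parse import urlparse

HASHY_SEGMENT = re.compile(r"[A-Za-z0-9+/=_-]{20,}")

def looks_like_forge_deep_link(url):
    segments = urlparse(url).path.split("/")
    return any(HASHY_SEGMENT.fullmatch(seg) for seg in segments)

# Illustrative URLs, not real hosts:
assert looks_like_forge_deep_link(
    "https://git.example.com/repo/blob/9fceb02d0ae598e95dc970b74767f19372d61af8/README.md"
)
assert not looks_like_forge_deep_link("https://git.example.com/repo/issues/42")
```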
Because big sites have decades of experience fighting against scrapers and have recently upped their game significantly (even when doing so carries some SEO costs) so that they're the only ones that can train AI on their own data.
So now, when you're starting from scratch and your goal is to gather as much data as possible, targeting smaller sites with weak or non-existent scraping protection is the path of least resistance.
Because people are reporting constant traffic, which would imply that the site is being scraped millions of times per year. How does that make any sense? Are there millions of AI companies?
AI companies need to:
- have data to train on
- update the data more or less continuously
- answer queries from users on the fly
With a lot of AI companies, that generates a lot of scraping. Also, some of them behave terribly when scraping, or are just bad at it.
This is all "legitimate" traffic in that it isn't about crawling the internet but is in service of a real human.
Put another way, search is moving from a model of crawl the internet and query on cached data to being able to query on live data.
But when it comes to git repos, an LLM agent like Claude Code can just clone them for local crawling, which is far better than crawling remotely, and it's the "Right Way" for various reasons.
Frankly I suspect AI agents will push search in the opposite direction from your comment and move us to distributed cache workflows. These tools just hit the origin because it's the easy solution of today, not because the data needs to be up to date to the millisecond.
Imagine a system where all those Fetch(url) invocations interact with a local LRU cache. That'd be really nice, and I think that's where we'd want to go, especially once more and more origin servers try to block automated traffic.
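A toy version of that idea, with made-up names and a made-up 15-minute TTL rather than any particular tool's API:

```python
# Sketch: wrap the agent's Fetch(url) in a local LRU cache with a TTL, so
# repeated lookups of the same page never leave the machine.
import time
import urllib.request
from collections import OrderedDict

class CachingFetcher:
    def __init__(self, max_entries=1024, ttl=15 * 60):
        self.max_entries, self.ttl = max_entries, ttl
        self._cache = OrderedDict()   # url -> (fetched_at, body)

    def fetch(self, url):
        hit = self._cache.get(url)
        if hit and time.time() - hit[0] < self.ttl:
            self._cache.move_to_end(url)          # refresh LRU position
            return hit[1]
        with urllib.request.urlopen(url) as resp: # only cache misses hit the origin
            body = resp.read()
        self._cache[url] = (time.time(), body)
        self._cache.move_to_end(url)
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)       # evict least recently used
        return body
```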
I call it `temu` anubis. https://github.com/rhee876527/expert-octo-robot/blob/f28e48f...
Jokes aside, the whole web seems to be trending towards some kind of wall (pay, login, app, etc.), and this ultimately sucks for the open internet.
Temubis
Or have a web-proxy that matches on the pattern and extracts the cookie automatically. ;-)