Humanely dealing with humungus crawlers

Posted by freediver 9/12/2025

Humanely dealing with humungus crawlers (flak.tedunangst.com)

83 points | 54 comments

bobbiechen 9/12/2025|

>We’ve already done the work to render the page, and we’re trying to shed load, so why would I want to increase load by generating challenges and verifying responses? It annoys me when I click a seemingly popular blog post and immediately get challenged, when I’m 99.9% certain that somebody else clicked it two seconds before me. Why isn’t it in cache? We must have different objectives in what we’re trying to accomplish. Or who we’re trying to irritate.

+1000 I feel like so much bot detection (and fraud prevention against human actors, too) is so emotionally-driven. Some people hate these things so much, they're willing to cut off their nose to spite their face.
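The quoted point about serving an already-rendered page from cache, rather than challenging every visitor, can be illustrated with a small time-based cache. This is only a sketch: `PageCache`, the `render` callback, and the 60-second TTL are hypothetical names and values, not anything taken from the article.

```python
import time
from typing import Callable, Dict, Tuple


class PageCache:
    """Keep rendered pages for a short time so repeat hits skip rendering."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the cache can be tested without sleeping
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, path: str, render: Callable[[str], str]) -> str:
        now = self.clock()
        hit = self._store.get(path)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # fresh copy: no render, no challenge
        body = render(path)  # expensive step, done once per TTL window
        self._store[path] = (now, body)
        return body


if __name__ == "__main__":
    cache = PageCache(ttl_seconds=60)
    page = cache.get("/post", lambda p: "<html>" + p + "</html>")
    print(page)
```

With something like this in front of the renderer, a popular post clicked "two seconds before" is answered from memory, whether the second visitor is a human or a crawler.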

bayindirh 9/12/2025||

My view on this is simple:

If you're a bot which will ignore all the licenses I put on that content, then I don't want you to be able to reach that content.

No, any amount of monetary compensation is not welcome either. I use these licenses as a matter of principle, and my principles are not for sale.

That's all, thanks.

beeflet 9/12/2025|||

I think the problem is that despite the effort, you will still end up in the dataset. So it's futile

warkdarrior 9/12/2025|||

How can you tell a bot will ignore all your content licenses?

bayindirh 9/12/2025||

Currently all AI companies argue that the content they use falls under fair use, and disregard all licenses. This means any future ones respecting these licenses need to be whitelisted.

diggan 9/12/2025||

How do you know that that bot is part of those AI companies? Maybe it's my personal bot you're blocking, should I also not have (indirectly) access to the content?

simianparrot 9/12/2025|||

No. Access to my content is a privilege I grant you. I decide how you get to access it, and via a bot that my setup confuses for an AI crawler belonging to an anti-human AI corporation is not a valid way to access it. Get off my virtual lawn.

diggan 9/12/2025||

> No. Access to my content is a privilege I grant you.

Right, I thought the conversation was about public websites on the public internet, but I think you're talking about this in the context of a private website now? I understand keeping tighter controls if you're dealing with private content you want accessible via the internet for others but not the public.

privatelypublic 9/12/2025|||

All websites are private (excepting maybe government sites). In most places the internet infrastructure itself is private.

You're conflating a legal concept that applies to areas that are shared, government owned, paid for by taxes, and the government feels like people should be able to access them.

The web is closer to a shopping mall. You're on one person's property to access other people's stuff who pay to be there. They set their own rules. If you don't follow those rules you get kicked out, charged with trespassing, and possibly banned from the mall entirely.

AI bots have been asked to leave. But, since they own the mall too, the store owners are more than a little screwed.

diggan 9/13/2025||

> You're on one person's property to access other people's stuff who pay to be there.

I see it more like I'm knocking on people's doors (issuing GET requests with my web browser) and people open their door for me (the server responds with something) or not. If you don't wanna open the door, fine you do you, but if you do open the door, I'm gonna assume it was on purpose as I'm not trying to be malicious, I'm just a user with a browser.

> AI bots have been asked to leave. But, since they own the mall too, the store owners are more than a little screwed.

I don't understand what you mean with this, what is the mall here, are you saying that people have websites hosted at OpenAI et al? I'm not sure how the "mall owner" and the people running the AI bots are the same owners.

privatelypublic 9/14/2025||

First, the mall is the internet as a whole- you're going to have to pay to be there (entrance is free, getting there is not), then you use their property to get to private businesses that have leased space at the mall.

And finally: https://www.techspot.com/news/105769-meta-reportedly-plannin...

The internet runs on backhaul. A LOT of backhaul is now owned by FAANG. Along with that, most of those companies can financially ruin any business simply by banning them from the platform. So, the companies use their backhaul fiber and peering agreements to crawl everybody else. And nobody says anything because of "The Implication" that if you sue under the Computer Fraud and Abuse Act (among others) they'll just wholesale ban you.

A "door to door" analogy doesn't work because sidewalks are generally considered "Public." The best I can tweak that analogy is a gated neighborhood and everybody has "no soliciting" signs. (NB: at least in my area, soliciting when there's a no-soliciting sign is an actual crime, on top of being trespassing)

kiitos 9/14/2025||

making an HTTP GET request to an IP and port over the public internet, and getting a response back, is an interaction defined in a technical context, which has its own definitions for concepts like public/private.

stuff like licenses.txt or robots.txt exist in totally separate context, which has a totally separate set of definitions for concepts like public/private.

can't really conflate context-specific concepts like public/private, over multiple and incompatible contexts like technical/legal

the claim that "a lot of backhaul is now owned by FAANG" is obviously untrue at a basic technical level. the broader argument is cynical, unfalsifiable, and uninteresting.

simianparrot 9/12/2025||||

You’re literally visiting a service paid for by me. It’s open to the public, but it’s my domain and my server and I get to say “no thank you” to your visit if you don’t behave. You have no innate right to access the content I share.

Blocking misbehaving IP addresses isn’t new, and is another version of the same principle.

diggan 9/13/2025|||

> but it’s my domain and my server and I get to say “no thank you” to your visit if you don’t behave [...] Blocking misbehaving IP addresses isn’t new

Absolutely, I agree that of course people are free to block whatever they want, misbehaving or not. Guess I'm just trying to figure out what sort of "collateral damage" people are OK with when putting up content on the public internet but want it to be selectively available.

> You have no innate right to access the content I share.

No, I guess that's true, I don't have any "rights" to do so. But I am gonna assume that if whatever you host is available without any authentication, protection or similar, you're fine with me viewing that. I'm not saying you should be fine with 1000s of requests per second, but since you made it public in the first place by sharing it, you kind of implicitly agreed for others to view it.

kiitos 9/14/2025|||

doing an HTTP GET to your server is my request to access some content your server serves. that's my right as a client. and it is your server's responsibility to determine whether or not to respond to my request. that's your server's right. said another way, "access" is the responsibility of the server, not the client.

simianparrot 9/15/2025||

Technical pedantry aside, that's what I mean. And I choose to not respond to your request with my content if I don't think your client is acting in good faith -- ie. is a bot or crawler that disrespects robots.txt, for example.

kiitos 9/18/2025||

sorry, yes, i think we are in agreement

bayindirh 9/12/2025|||

This interpretation won't take you that far.

Crawling-prevention is not new. Many news outlets or biggish websites were already preventing access by non-human agents in various ways for a very long time.

Now, non-human agents are improved and started to leech everything they can find, so the methods are evolving, too.

News outlets are also public sites on the public internet.

Source-available code repositories are also on the public internet, but said agents crawl and use that code, too, backed by fair-use claims.

bayindirh 9/12/2025|||

You can use an honest user-agent string denoting that it's your bot. Some AI companies label their bots transparently, they show up on the logs I keep.

While I understand that you may need a personal bot to crawl or mirror a site, I can't guarantee that I'll grant you access.

I don't like to be that heavy-handed in the first place, but capitalism is making it harder to trust entities which you can't see and talk face to face.
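As a rough illustration of spotting transparently labeled bots in logs, the sketch below matches a `User-Agent` header against a few crawler tokens. `GPTBot`, `ClaudeBot`, and `CCBot` are tokens those operators publish, but the list here is illustrative and incomplete, and `declared_bot` is a hypothetical helper, not part of anyone's actual setup. A crawler that lies about its user agent passes straight through, which is the limitation being discussed.

```python
from typing import Optional

# Illustrative, incomplete list: token as it appears in the header -> operator.
KNOWN_BOT_TOKENS = {
    "GPTBot": "OpenAI",
    "ClaudeBot": "Anthropic",
    "CCBot": "Common Crawl",
}


def declared_bot(user_agent: str) -> Optional[str]:
    """Return the operator if the User-Agent openly names a known crawler."""
    ua = user_agent.lower()
    for token, operator in KNOWN_BOT_TOKENS.items():
        if token.lower() in ua:
            return operator
    return None  # unknown or undeclared client; says nothing about intent


if __name__ == "__main__":
    sample = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
    print(declared_bot(sample))
```

A personal bot with its own honest string would return `None` here, so whether it gets through depends on what the site owner does with unknown clients.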

Vegenoid 9/12/2025|||

I think it’s better viewed through a lens of effort. Implementing systems that try harder to not challenge humans takes more work than just throwing up a catch-all challenge wall.

The author’s goal is admirable: “My primary principle is that I’d rather not annoy real humans more than strictly intended”. However, the primary goal for many people hosting content will be “block bots and allow humans with minimal effort and tuning”.

jitl 9/12/2025||

Really? If I’m an unsophisticated blog not using a CDN, and I get a $1000 bill for bandwidth overage or something, I’m gonna google a solve and slap it on there because I don’t want to pay another $1000 for Big Basilisk. I don’t think that’s an emotional response, it’s common sense.

marginalia_nu 9/12/2025|||

Seems like you've made profoundly questionable hosting or design choices for that to happen. Flat rate web hosting exists, and blogs (especially unsophisticated ones) do not require much bandwidth or processing power.

Misbehaving crawlers are a huge problem but bloggers are among the least affected by them. Something like a wiki or a forum is a better example, as they're in a category of websites where each page visit is almost unavoidably rendered on the fly using multiple expensive SQL queries due to the rapidly mutating nature of their datasets.

Git forges, like the one TFA is discussing, are also fairly expensive, especially as crawlers traverse historical states. When the crawler is poorly implemented they'll get stuck doing this basically forever. Detecting and dealing with git hosts is an absolute must for any web crawler due to this.

mtlynch 9/12/2025||

>Flat rate web hosting exists, and blogs (especially unsophisticated ones) do not require much bandwidth or processing power.

I actually find this surprisingly difficult to find.

I just want static hosting (like Netlify or Firebase Hosting), but there aren't many hosts that offer that.

There are lots of providers where I can buy a VPS somewhere and be in charge of configuring and patching it, but if I just want to hand someone a set of HTML files and some money in exchange for hosting, not many hosts fit the bill.

diggan 9/12/2025|||

> There are lots of providers where I can buy a VPS somewhere and be in charge of configuring and patching it, but if I just want to hand someone a set of HTML files and some money in exchange for hosting, not many hosts fit the bill.

Yeah, that's true, there aren't a lot of "I give you money and HTML, you host it" services out there, surprisingly. Probably the most mature, cheapest and most reliable one today would be good ol' neocities.org (run by HN user kyledrake) which basically gives you 3TB/month for $5, pretty good deal :)

Sometimes when I miss StumbleUpon I go to https://neocities.org/browse?sort_by=random which gives a fun little glimpse of the hobby/curiosity/creative web.

marginalia_nu 9/12/2025||||

If you just want to host HTML for personal use github pages is free (and works with a custom domain). There are bandwidth limitations, but they definitely won't pull an AWS on you and send a bill that would cover a new car because a crawler acted up.

mtlynch 9/15/2025||

Github Pages' "soft" bandwidth limit is 100 GB per month. I typically use 300-500 GB per month at Netlify, so I'm over Github's limit.

I actually don't even want a free option. I want to pay a vendor that cares about keeping my website online. I'm fine paying $20-50/mo as long as it's bounded and they don't just take my site offline if I see a spike from HN.

ghssds 9/12/2025||||

You already had a couple of suggestions but I've been happy in the past with OVH.

https://www.ovhcloud.com/en/web-hosting/compare/

thaumaturgy 9/12/2025||||

Interesting, I was under the impression this was more common than maybe it is. I know the hosting market has gotten pretty bad.

So, I'm currently building pretty much this. After doing it on the side for clients for years, it's now my full-time effort. I have a solid and stable infrastructure, but not yet an API or web frontend. If somebody wants basically ssh, git, and static (or even not static!) hosting that comes with a sysadmin's contact information for a small number of dollars per month, I can be reached at sysop@biphrost.net.

Environment is currently Debian-in-LXC-on-Debian-on-DigitalOcean.

ctoth 9/12/2025||||

Dreamhost! They're still around and still lovely after how many years? I even find their custom control panel charming.

hobs 9/12/2025||

I really like DH (though I am still mad about the cloudatcost shenanigans) and use them but if you use 200x the resources the other shared sites consume you're getting the boot just like anyone.

tekne 9/14/2025|||

I host my personal static site with Firebase, haven’t paid a cent yet (and don’t even think I set up billing!) Just compile and firebase deploy.

phantompeace 9/12/2025|||

Wouldn't it be easier to put the unsophisticated blog behind cloudflare

mhuffman 9/12/2025||

As much as I like to shit on cloudflare at every opportunity, it would obviously be easier to put it behind CF than install bot detection plugins.

nektro 9/12/2025||

it's sad we've gotten to the point where mitigations against this have to be such a consideration when hosting a site

arjie 9/12/2025|

They don't really have to be. I don't have many mitigations and the AI bots crawl my site and it's fine. The robots.txt is pretty simple too and is really just set up to help the robot not get stuck in loops (I use Mediawiki as the CMS and it has a lot of GET paths that a normal person wouldn't choose). In my case, a machine near my desk hosts everything and it's fine.
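A minimal sketch of that kind of robots.txt, checked with Python's standard `urllib.robotparser`. It assumes a common MediaWiki layout where articles live under `/wiki/` and the script entry point (history, diffs, edit forms) lives under `/w/`; those paths and the `ExampleBot` name are assumptions, not the commenter's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep crawlers on article pages and away from
# the endless history/diff/edit URLs served by the script path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /w/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

if __name__ == "__main__":
    for url in ("/wiki/Main_Page",
                "/w/index.php?title=Main_Page&action=history"):
        verdict = "allowed" if rp.can_fetch("ExampleBot", url) else "disallowed"
        print(url, "->", verdict)
```

This only helps with crawlers that read and honor robots.txt; it is guidance for well-behaved bots, not an access control.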

jrochkind1 9/13/2025||

I used to say that, but last year it stopped being true for me.

hyperman1 9/12/2025||

I've been wondering about how to make a challenge that AI won't do. Some possibilities:

* Type this sentence, taken from a famous copyrighted work.

* Type Tiananmen protests.

* Type this list of swear words or sexual organs.

dweinus 9/12/2025||

> Type this list of swear words

1998: I swear at the computer until the page loads

2025: I swear at the computer until the page loads

seabass-labrax 9/13/2025|||

Unfortunately for your proposal, the crawlers for training LLMs don't have the same censorship as the AI chatbots do when communicating with the end user. The censorship of chatbots is either done by means of fine-tuning (a technique which is part of the broader category of 'alignment' processes), or having a separate model (which may or may not be an LLM) filter its output. Both of these are applied only after the LLM has already gone through its initial training - and most of the crawling comes during that initial training.

All that's to say that you can stop some of your website contents being quoted by the chatbots verbatim, but you can't prevent the crawlers using up all your bandwidth in the way you describe. You also can't stop your website contents being rehashed in a conceptual way by the chatbot later. So if I just write something copyrighted or taboo here in this comment, that won't stop an LLM being trained on the comment as a whole, but it might stop the chatbot based on that LLM from quoting it directly.

Everything is moving so quickly with AI that my comment is probably out of date the moment I type it... take it with a grain of salt :)

userbinator 9/13/2025||

Ask it how many letters are in certain words.

michaeljx 9/12/2025||

For some reason I thought this would be about dealing with very large insects

felipeerias 9/13/2025||

There was another headline about “top models” getting in trouble for “history leaks” that would have been very confusing to 15 year old me.

rapsacnz 9/13/2025||

Me too

nickpsecurity 9/12/2025||

I made my pages static HTML with no images, used a fast server, and BunnyCDN (see profile domain). Ten thousand hits a day from bots costs a penny or something. When I'm using images, I link to image hosting sites. It might get more challenging if I try to squeeze meme images in between every other paragraph to make my sites more beautiful.

Far as Ted's article, the first thing that popped in my head is that most AI crawlers hitting my sites are in big, datacenter cities: Dallas, Dublin, etc. I wonder if I could easily geo-block those cities or redirect them to pages with more checks built-in. I just haven't looked into that on my CDN's or in general in a long time.

They also usually request files from popular PHP frameworks and other things like that. If you don't use PHP, you could autoban on the first request for a PHP page. Likewise for anything else you don't need.

Of the two, looking for .php is probably lightning quick with low CPU/RAM utilization in comparison.
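The autoban idea could look roughly like the sketch below, assuming a site that serves no PHP at all. `PhpProbeBan` is a hypothetical name, and the in-memory set stands in for whatever ban list a real server or CDN rule would use.

```python
from urllib.parse import urlsplit


class PhpProbeBan:
    """Ban a client on its first request for a .php path."""

    def __init__(self) -> None:
        self.banned = set()  # in-memory only; a real setup would persist or expire bans

    def allow(self, ip: str, request_target: str) -> bool:
        if ip in self.banned:
            return False
        # Ignore the query string so "/x.php?a=1" is still recognized.
        path = urlsplit(request_target).path.lower()
        if path.endswith(".php"):
            self.banned.add(ip)
            return False
        return True


if __name__ == "__main__":
    ban = PhpProbeBan()
    print(ban.allow("203.0.113.7", "/index.html"))
    print(ban.allow("203.0.113.7", "/wp-login.php"))
    print(ban.allow("203.0.113.7", "/index.html"))
```

The check is a string suffix test plus a set lookup, which fits the point that it should be cheap compared with a geo lookup. Banning by IP can also catch other users behind the same address, so an expiry time would be a sensible addition.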

kragen 9/13/2025||

This is exciting!

zkmon 9/12/2025||

[flagged]

politelemon 9/12/2025||

The point of a blog is whatever the author would like it to be. It doesn't have to follow a structure or expectations. We just happen to be consuming it.

bayindirh 9/12/2025||

> Sorry, what's the point of this blog?

Being a blog the way the author dreamed of it.

> I hope people would write a quick abstract/summary in the first few lines and then go on elaborating.

I hope people continue doing what makes them happy. It's their site, they owe nothing to anyone (maybe except hosting / network fees, but that's not my business, either).

> Or at least put that summary at the end, in old-fashioned way.

Or maybe people can spend a couple of minutes to read and understand it, with the MSI (MeatSpaceIntelligence) which comes bundled with all human beings.

It's free, too!

zkmon 9/12/2025||

Maybe, maybe .. you get some pleasure in forcing people to read every bit of what you write, just to get what the heck it is. But unfortunately it is the age of AI summaries and short attention spans. Not the times when you read half-foot thick novels end-to-end multiple times. TL;DR!

bayindirh 9/12/2025||

I write my digital garden and blog for myself. They just happen to be public. The pleasurable part is putting it out there, not forcing people to read it.

If people prefer to have short attention spans and leave what I put out after 30 seconds, it's their own choice. My blog has minimal analytics (provided by the platform), and digital garden has no analytics whatsoever, so I don't care or get bothered about what humans do with my site.

I personally don't use any AI tools whatsoever, and still prefer to read half-foot thick novels end-to-end. Hyperion Cantos (4 x 700 pages) was great. My next target is Foundation by Asimov (7 volumes incl. expansions).

zkmon 9/13/2025||

My gripe was actually only about the post, not about the blog. Sorry for not being specific. But if you read my comment closely you would have realized that I was commenting only on the post. I mention this because it goes to show what kind of animals we have become - we don't read fully, we don't try to understand what's behind some text. We simply don't have time.

Also, your blog didn't just "happen" to be public. You posted about it on HN and you are curious about the HN comments on your post. This means you want people to read and comment on your stuff. There is no point in pretending that it's just for your own consumption. If you don't want people to say what they feel about your content, don't post it here. It is like looking into mirror and asking it not to show things you don't like. You need to come to accept the feedback, since you asked for it.

You can down vote me to ground. But you need to realize that this is HN, the feedback garden. And it did not happen to be that way. It is intentional.

bayindirh 9/13/2025||

> My gripe was actually only about the post, not about the blog.

Hey, no worries. I replied to you in the context of the post, not the blog, actually. The blog has some other interesting features, too, but your comment was very clear that you wanted a TL;DR: on the post itself. :)

> You can down vote me to ground.

As a matter of principle, I don't downvote people. On the other hand, you can't downvote direct replies anyway, but that's irrelevant to my stance.

> But you need to realize that this is HN, the feedback garden.

We've both been here for ~8 years, so I believe we're both pretty proficient about how this place works.

> And it did not happen to be that way. It is intentional.

Yes.

> Also, your blog didn't just "happen" to be public. You posted about it on HN and you are curious about the HN comments on your post... [snipped for brevity].

Yes and no. I post my new blog posts to two places. Here, and a Discord server which I frequent. I get almost no feedback from either. If people read and like what they read, it's great. If not, I don't care. Both the blog and the digital garden are written like nobody's gonna read it. There's no rush, no anxiety, no optimization. I cook an idea, and refine it until I'm happy, and post it when I feel it's ready to be there. That blog is a public chronicle of my thoughts and what I go through.

> It is like looking into mirror and asking it not to show things you don't like. You need to come to accept the feedback, since you asked for it.

I'm pretty content with any comment incl. yours. I'm not someone who morphs to please people. Comments which challenge my words are invaluable as long as they don't become ad-hominem attacks. It both helps me to see my dark corners and improve my discussion and language skills (this is not my native language, tbh).

kiitos 9/12/2025|

what a just totally bizarre perspective

all of the stuff that's being complained-about is absolute 100% table-stakes stuff that every http server on the public internet has needed to deal with since, man, i dunno, minimum 15 years now?

as a result literally nobody self-hosts their own HTTP content any more, unless they enjoy the challenge in like a problem-solving sense

if you are even self-hosting some kind of captcha system you've already made a mistake, but apparently this guy is not just hosting but building a bespoke one? which is like, my dude, respect, but this is light years off of the beaten path

the author whinges about google not doing their own internal rate limiting in some presumed distributed system, before any node in that system makes any http request over the open internet. that's fair and not doing so is maybe bad user behavior but on the open internet it's the responsibility of the server to protect itself as it needs to, it's not the other way around
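One common way for a server to protect itself, whatever the client does, is a per-client token bucket. The sketch below is a generic illustration with hypothetical names and limits (`RateLimiter`, 1 request per second, burst of 2); it is not taken from the article or from any particular server.

```python
import time
from typing import Callable, Dict, Tuple


class RateLimiter:
    """Per-client token bucket: `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable for testing
        self._buckets: Dict[str, Tuple[float, float]] = {}  # client -> (tokens, last seen)

    def allow(self, client: str) -> bool:
        now = self.clock()
        tokens, updated = self._buckets.get(client, (self.capacity, now))
        # Refill for the time elapsed since this client's last request.
        tokens = min(self.capacity, tokens + (now - updated) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[client] = (tokens, now)
        return allowed


if __name__ == "__main__":
    limiter = RateLimiter(rate=1.0, capacity=2.0)
    print([limiter.allow("198.51.100.2") for _ in range(3)])
```

Keying on IP address is an assumption; a crawler spread across many addresses would need a coarser key, such as a network prefix.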

everything this dude is yelling about is immediately solved by hosting thru a hosting provider, like everyone else does, and has done, since like 2005

jrochkind1 9/13/2025|

Google isn't mentioned in OP, did we read the same article?

kiitos 9/18/2025||

i was responding to something else, 100% my bad