Do they do any harm? They do provide sources for material if users ask for it. (I frequently do, because I don't trust them, so I check the sources.)
You still need to pay for the traffic, and serving static content (like text on that website) is way less CPU/disk expensive than generating anything.
Not to me, but I've known people who have had their sites DDoSed out of existence by the scrapers. On the internet, it's often the smallest sites with the smallest budgets that have the best content, and those are hit the worst.
> They do provide sources for material if users ask for it
Not for material they trained on. Those sources are just Google results for the question you asked. By nature, they cannot cite the information gathered by their crawlers.
> You still need to pay for the traffic
It's so little traffic my hosting provider doesn't bother billing me for it.
> and serving static content (like text on that website) is way less CPU/disk expensive than generating anything.
Sure, but it's the principle of the thing: I don't like when billion dollar companies steal my work, and then use it to make the internet a worse place by filling it with AI slop/spam. If I can make their lives harder and their product worse for virtually no cost, I will.
Most of the real use seems to be surveillance, spam, ads, tracking, slop, crawlers, hype, dubious financial deals and sucking energy.
Oh yeah, and your kid can cheat on their book report or whatever. Great.
It has to be said, though, that all three of the things above are feared/considered taboo/cause for mocking, while making a quick buck at the cost of poisoning the commons gives universal bragging rights. Go figure.
because an infinite site that has appeared out of nowhere will quickly be noticed and blocked
start it off small, and grow it by a few pages every day
and the existing pages should stay 99% the same between crawls to gain reputation
One way to keep things mostly the same without having to store any of it yourself:
1. Use an RNG seeded from the request URL itself to generate each page. This is already enough for an unchanging static site of finite or infinite size.
2. With each word the generator outputs, generate a random number between, say, 0 and 1000. On day i, replace the about-to-be-output word with a link if this random number is between 0 and i. This way, roughly another 0.1% of words will turn into links each day, with the rest of the text remaining stable over time.
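A minimal Python sketch of that scheme (the vocabulary, the /babble/ path, and all names here are illustrative, not taken from any real generator):

import hashlib
import random

VOCAB = ["lorem", "ipsum", "dolor", "sit", "amet"]  # placeholder word list

def generate_page(url: str, day: int, length: int = 500) -> str:
    # Seed the RNG from the request URL, so the same URL always produces the same page.
    seed = int.from_bytes(hashlib.sha256(url.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    words = []
    for _ in range(length):
        word = rng.choice(VOCAB)
        roll = rng.randint(0, 1000)   # per-word roll, identical on every day
        target = rng.getrandbits(32)  # always drawn, so the RNG stream stays aligned across days
        if roll <= day:
            # On day i, rolls in [0, i] become links: roughly another 0.1% of words per day.
            words.append('<a href="/babble/%08x">%s</a>' % (target, word))
        else:
            words.append(word)
    return " ".join(words)

Drawing the link target even when it isn't used is what keeps the existing text byte-for-byte stable: the only thing that changes between days is the roll-versus-day comparison, never the random stream itself.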
User-agent: Googlebot PetalBot Bingbot YandexBot Kagibot
Disallow: /bomb/*
Disallow: /bomb
Disallow: /babble/*
Sitemap: https://maurycyz.com/sitemap.xml
I think this is telling the bot named "Googlebot PetalBot Bingbot YandexBot Kagibot" - which doesn't exist - not to visit those URLs. All other bots are allowed to visit them. User-agent is supposed to be one per line, and there's no "User-agent: *" specified here.

So a much simpler solution than setting up a Markov generator might be for the site owner to just specify a valid robots.txt. It's not evident to me that the bots which do crawl this site are in fact breaking any rules. I also suspect that Googlebot, being served the Markov slop, will view this as spam. Meanwhile, this incentivizes AI companies to build heuristics to detect this kind of thing rather than building rules-respecting crawlers.
User-agent: Googlebot
User-agent: PetalBot
User-agent: Bingbot
User-agent: YandexBot
User-agent: Kagibot
Disallow: /bomb/*
Disallow: /bomb
Disallow: /babble/*
Sitemap: https://maurycyz.com/sitemap.xml

I've thought about tying a hidden link, excluded in robots.txt, to fail2ban. Seems quick and easy with no side effects, but I've never actually gotten around to it.
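For what it's worth, a rough sketch of that idea (the /trap/ path, the filter name, and the nginx log path are all hypothetical; adjust to your own setup): disallow a hidden path in robots.txt, link to it somewhere humans won't click, and let fail2ban ban anything that requests it.

# robots.txt: tell rule-following bots to stay away from the trap
User-agent: *
Disallow: /trap/

# /etc/fail2ban/filter.d/robots-trap.conf: match any request for the trap path in the access log
[Definition]
failregex = ^<HOST> .*"(GET|POST|HEAD) /trap/
ignoreregex =

# /etc/fail2ban/jail.local: ban on the first hit for a day
[robots-trap]
enabled  = true
port     = http,https
filter   = robots-trap
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 86400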
Eh? That's the speed of an old-school spinning hard disk.
Or is the scraping happening in real time due to the web search features in AI apps? (Cheaper to load the same page again than to cache it?)
If you're in a hurry to race to market, it's very likely you'll run into these issues and find yourself tempted to cut corners, and unfortunately, with nearly unbounded cloud spend, cutting corners in a large-scale crawler operation can very believably cause major disruption all over the web.
OTOH, I doubt most scrapers are trying to scrape this kind of content anyway, since in general it's (a) JSON, not the natural language they crave, and (b) to even discover those links, which are usually generated dynamically by client-side JS rather than appearing as plain <a>...</a> HTML links, they would probably need to run a full JS engine, which is considerably harder to get working and considerably more expensive per request.
I want to redirect all LLM-crawlers to that site.
> You don’t really need any bot detection: just linking to the garbage from your main website will do. Because each page links to five more garbage pages, the crawler’s queue will quickly fill up with an exponential amount of garbage until it has no time left to crawl your real site.
> If a link is posted somewhere, the bots will know it exists,
> Unfortunately, based on what I'm seeing in my logs, I do need the bot detection. The crawlers that visit me have a list of URLs to crawl; they do not immediately visit newly discovered URLs, so it would take a very, very long time to fill their queue. I don't want to give them that much time.
A single site doing this does nothing. But many sites doing this have a severe negative impact on the utility of AI scrapers - at least, until a countermeasure is developed.