Posted by chmaynard 10/26/2025

Feed the bots(maurycyz.com)
https://maurycyz.com/projects/trap_bots/
305 points | 203 comments
blackhaj7 10/26/2025|
Can someone explain how this works?

Surely the bots are still hitting the pages they were hitting before but now they also hit the garbage pages too?

wodenokoto 10/26/2025||
In the author's setup, sending Markov-generated garbage is much lighter on resources than serving static pages. Only bots will continue to follow links to the next piece of garbage, and thus he traps bots in garbage. No need to detect bots: they reveal themselves.

But yes, all bots start out on an actual page.
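For a rough idea of why it's so cheap: a word-level Markov babbler is little more than a dictionary of word tuples, so each emitted word costs a couple of lookups and a random choice. A minimal sketch in Python (purely illustrative, not the author's actual generator):

    # Minimal word-level Markov babbler (illustrative sketch only).
    # Build the chain once from any corpus, then emit garbage forever.
    import random
    from collections import defaultdict

    def build_chain(text, order=2):
        words = text.split()
        chain = defaultdict(list)
        for i in range(len(words) - order):
            chain[tuple(words[i:i + order])].append(words[i + order])
        return chain

    def babble(chain, order=2, length=200):
        key = random.choice(list(chain))
        out = list(key)
        for _ in range(length):
            choices = chain.get(key)
            if not choices:                      # dead end: jump to a random state
                key = random.choice(list(chain))
                choices = chain[key]
                out.extend(key)
            out.append(random.choice(choices))
            key = tuple(out[-order:])
        return " ".join(out)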

liqilin1567 10/27/2025|||
Seems like these garbage pages can't trap bots. People discussed it in this thread: https://news.ycombinator.com/item?id=45711987
blackhaj7 10/26/2025|||
Thanks for the explanation!
blackhaj7 10/26/2025||
Ah, it is explained in another post - https://maurycyz.com/projects/trap_bots/

Clever

chrsw 10/27/2025||
Remember when AI was supposed to give us all this great stuff?

Most of the real use seems to be surveillance, spam, ads, tracking, slop, crawlers, hype, dubious financial deals and sucking energy.

Oh yeah, and your kid can cheat on their book report or whatever. Great.

dsign 10/27/2025|
I was thinking the same yesterday. We should all be busy curing cancer, becoming young forever and building space habitats. Instead...

It has to be said, though, that all three things above are feared/considered taboo/cause for mocking, while making a quick buck at the cost of poisoning the commons gives universal bragging rights. Go figure.

krzyk 10/26/2025||
But why?

Do they do any harm? They do provide sources for material if the user asks for it. (I frequently do, because I don't trust them, so I check sources.)

You still need to pay for the traffic, and serving static content (like text on that website) is way less CPU/disk expensive than generating anything.

kaoD 10/26/2025||
What you're referring to are LLMs visiting your page via tool use. That's a drop in the ocean of crawlers that are racing to slurp up as much of the internet as possible before it dries up.
AstroBen 10/26/2025|||
They certainly affect some services: https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/
maurycyz 10/27/2025|||
> Do they do any harm

Not to me, but I've known people who have had their sites DDoSed out of existence by the scrapers. On the internet, it's often the smallest sites with the smallest budgets that have the best content, and those are hit the worst.

> They do provide source for material if users asks for it

Not for material they trained on. Those sources are just google results for the question you asked. By nature, they cannot cite the information gathered by their crawlers.

> You still need to pay for the traffic

It's so little traffic my hosting provider doesn't bother billing me for it.

> and serving static content (like text on that website) is way less CPU/disk expensive than generating anything.

Sure, but it's the principle of the thing: I don't like when billion dollar companies steal my work, and then use it to make the internet a worse place by filling it with AI slop/spam. If I can make their lives harder and their product worse for virtually no cost, I will.

blibble 10/26/2025||
if you want to be really sneaky, make it so the fake web doesn't start off infinite

because an infinite site that has appeared out of nowhere will quickly be noticed and blocked

start it off small, and grow it by a few pages every day

and the existing pages should stay 99% the same between crawls to gain reputation

andrewflnr 10/26/2025||
They don't especially want to be sneaky, they mostly want the crawlers to stop hammering their site. Getting blocked would be a win.
akoboldfrying 10/26/2025||
Good thinking.

One way to keep things mostly the same without having to store any of it yourself:

1. Use an RNG seeded from the request URL itself to generate each page. This is already enough for an unchanging static site of finite or infinite size.

2. With each word the generator outputs, generate a random number between, say, 0 and 1000. On day i, replace the about-to-be-output word with a link if this random number is between 0 and i. This way, roughly 0.1% more of the words turn into links each day, with the rest of the text remaining stable over time (sketch below).
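A rough sketch of both ideas in Python (the EPOCH date, vocabulary argument and /babble/ link targets are just made-up placeholders):

    # Every page is a pure function of its URL; link density grows by
    # roughly 0.1% per day while the surrounding words stay fixed.
    import hashlib
    import random
    from datetime import date

    EPOCH = date(2025, 1, 1)             # hypothetical "day 0" of the fake site

    def page_for(url, vocabulary, n_words=300):
        rng = random.Random(hashlib.sha256(url.encode()).digest())
        day = (date.today() - EPOCH).days
        words = []
        for _ in range(n_words):
            word = rng.choice(vocabulary)
            roll = rng.randint(0, 1000)  # same roll for this slot every day
            if roll <= day:              # slot turns into a link once day >= roll
                target = "/babble/" + hashlib.md5(word.encode()).hexdigest()[:12]
                word = '<a href="%s">%s</a>' % (target, word)
            words.append(word)
        return " ".join(words)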

delifue 10/29/2025||
Why not just use github pages for static blogs? It's free. No need to worry about extra bandwidth and other costs caused by crawlers.
kalkin 10/27/2025||
I don't think this robots.txt is valid:

  User-agent: Googlebot PetalBot Bingbot YandexBot Kagibot
  Disallow: /bomb/*
  Disallow: /bomb
  Disallow: /babble/*

  Sitemap: https://maurycyz.com/sitemap.xml
I think this is telling the bot named "Googlebot PetalBot Bingbot YandexBot Kagibot" - which doesn't exist - to not visit those URLs. All other bots are allowed to visit those URLs. User-Agent is supposed to be one per line, and there's no User-Agent * specified here.

So a much simpler solution than setting up a Markov generator might be for the site owner to just specify a valid robots.txt. It's not evident to me that the bots which do crawl this site are in fact breaking any rules. I also suspect that Googlebot, being served the Markov slop, will view this as spam. Meanwhile, this incentivizes AI companies to build heuristics to detect this kind of thing rather than building rules-respecting crawlers.

moebrowne 10/27/2025|
You're correct, it should read

    User-agent: Googlebot
    User-agent: PetalBot
    User-agent: Bingbot
    User-agent: YandexBot
    User-agent: Kagibot
    Disallow: /bomb/*
    Disallow: /bomb
    Disallow: /babble/*
    
    Sitemap: https://maurycyz.com/sitemap.xml
masfuerte 10/26/2025||
> SSD access times are in the tens of milliseconds

Eh? That's the speed of an old-school spinning hard disk.

vivzkestrel 10/26/2025||
Stupid question: why not encrypt your API responses so that only your frontend can decrypt them? I understand very well that no client-side encryption is secure, and eventually, once they get down to it, they'll figure out how the encryption scheme works, but it'll keep 99% out, won't it?
maurycyz 10/27/2025||
That would work, but I'd really prefer not to force users to run JavaScript, break RSS readers and slow down page loads (round trips are expensive). Adding a link maze to a random corner of the site doesn't impact users at all.
akoboldfrying 10/26/2025||
Yes, this would be fine if you have an SPA or are otherwise already committed to having client-side JS turned on. Probably rot13 "encryption" would be enough.

OTOH, I doubt most scrapers are trying to scrape this kind of content anyway, since in general it's (a) JSON, not the natural language they crave, and (b) to even discover those links, which are usually generated dynamically by client-side JS rather than appearing as plain <a>...</a> HTML links, they would probably need to run a full JS engine, which is considerably harder to get working and considerably more expensive per request.
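If someone did want the rot13 route, the server side is a few lines; here's a toy Python sketch (the frontend would undo it with a couple of lines of JS, and of course this is obfuscation, not security):

    # rot13 the JSON body before sending it. Only letters change; braces,
    # quotes and digits pass through, so it round-trips cleanly.
    import codecs
    import json

    def obfuscate(payload):
        return codecs.encode(json.dumps(payload), "rot13")

    def deobfuscate(body):
        return json.loads(codecs.decode(body, "rot13"))

    print(obfuscate({"title": "hello world"}))   # {"gvgyr": "uryyb jbeyq"}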

NoiseBert69 10/26/2025||
Is there a Markov babbler based on PHP or something else easily hostable?

I want to redirect all LLM-crawlers to that site.

andai 10/27/2025|
I am confused about where this traffic is coming from. OP says it's from well-funded AI companies. But there aren't that many of those, are there? Why would they need to scrape the same pages over and over?

Or is the scraping happening in real time due to the web search features in AI apps? (Cheaper to load the same page again than to cache it?)

marginalia_nu 10/27/2025||
Crawlers are pretty hard to build: they have an insane number of corner cases to deal with if you want them to perform well AND be perceived as respectful, and distributed crawlers (if you go that route) are among the harder problems in distributed computing, with a huge amount of shared mutable state and some very complex shared timers.

If you're in a hurry to race to the market, it's very likely you'll run into these issues and find yourself tempted to cut corners, and unfortunately, with nearly unbounded cloud spend, cutting corners in a large scale crawler operation can very believably cause major disruption all over the web.

hekkle 10/27/2025||
On his website: https://maurycyz.com/projects/ai-tarpit/

He mentions that a "Chrome" browser sent him 20 requests per second from the address 43.134.189.59. If you look this address up on shodan.io, you will see that it belongs to Tencent, a public company that builds AI, with an annual revenue of $92 billion USD.
