
Posted by chmaynard 4 days ago

Feed the bots (maurycyz.com)
https://maurycyz.com/projects/trap_bots/
301 points | 201 comments
eviks 4 days ago|
How does this help protect the regular non-garbage pages from the bots?
lolpython 4 days ago||
The follow-on post explains:

> You don’t really need any bot detection: just linking to the garbage from your main website will do. Because each page links to five more garbage pages, the crawler’s queue will quickly fill up with an exponential amount of garbage until it has no time left to crawl your real site.

From: https://maurycyz.com/projects/trap_bots/
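
An illustrative sketch of that queue-flooding idea (my own minimal example, not the linked post's actual generator, which uses a Markov babbler): every request under a garbage prefix returns nonsense text plus five links to further garbage pages, so a crawler's frontier grows roughly fivefold per page it fetches.

    # Minimal sketch only: serve nonsense plus five links to more nonsense,
    # so a crawler's queue grows about 5x per garbage page it fetches.
    # The path prefix and word list are made up for illustration.
    import random
    from http.server import BaseHTTPRequestHandler, HTTPServer

    WORDS = ["lorem", "quantum", "teapot", "gravel", "syntax", "meadow", "pickle"]

    class GarbageHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            text = " ".join(random.choices(WORDS, k=200))
            links = " ".join(
                f'<a href="/garbage/{random.getrandbits(32):08x}">more</a>'
                for _ in range(5)
            )
            body = f"<html><body><p>{text}</p>{links}</body></html>".encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8080), GarbageHandler).serve_forever()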

eviks 4 days ago||
Thanks, I thought these were prioritized, so while the garbage links might fill up the queue, they'd do so only after all the real links have been visited, leaving the server load the same. But of course, not all (or even most) bots may be configured this way.

> If a link is posted somewhere, the bots will know it exists,

lolpython 3 days ago||
How would the links be prioritized? If the bots' goal is to crawl all content, would they have prioritization built in?
HumanOstrich 3 days ago||
How would they prioritize things they haven't crawled yet?
lolpython 3 days ago||
It's not clear that they are doing that. Web logs I've seen from other writing on this topic show them re-crawling the same pages at high rates, in addition to crawling new pages.
lolpython 3 days ago||
Actually, I've been informed otherwise; they crawl known links first, according to this person:

> Unfortunately, based on what I'm seeing in my logs, I do need the bot detection. The crawlers that visit me, have a list of URLs to crawl, they do not immediately visit newly discovered URLs, so it would take a very, very long time to fill their queue. I don't want to give them that much time.

https://lobste.rs/c/1pwq2g

codeduck 4 days ago||
It does at a macroscopic level by making scraping expensive. If every "valid" page is scattered at random amongst a tarpit of recursive pages of nonsense, it becomes computationally and temporally expensive to scrape a site for "good" data.

A single site doing this does nothing. But many sites doing this have a severe negative impact on the utility of AI scrapers - at least, until a countermeasure is developed.

TekMol 4 days ago||
How about adding some image with a public http logger url like

https://ih879.requestcatcher.com/test

to each of the nonsense pages, so we can see an endless flood of funny requests at

https://ih879.requestcatcher.com

?

I'm not sure requestcatcher is a good one; it's just the first one that came up when I googled. But I guess there are many such services, or one could also use some link-shortener service with public logs.
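
For illustration, the beacon could be injected into each generated garbage page with something like the snippet below (the logger URL is just the placeholder from above, and add_beacon is a made-up helper name):

    # Hypothetical helper: tack a 1x1 beacon image onto a garbage page so every
    # bot fetch also shows up in a public request log. The URL is a placeholder.
    LOGGER_URL = "https://ih879.requestcatcher.com/test"

    def add_beacon(page_html: str) -> str:
        beacon = f'<img src="{LOGGER_URL}" width="1" height="1" alt="">'
        return page_html.replace("</body>", beacon + "</body>")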

jcynix 4 days ago|
You can easily generate a number of random images with ImageMagick and serve these as part of the babbled text. And you could even add text onto these images so image analyzers with OCR will have "fun" too.

Example code:

   # generate a 1024x768 solid-color JPEG for each color/word pair,
   # with the word drawn in the center (ImageMagick 7 syntax)
   for c in aqua blue green yellow ; do
      for w in hello world huba hop ; do
         magick -size 1024x768 xc:"$c" -gravity center -annotate 0 "$w" "/tmp/$w-$c.jpeg"
      done
   done
Do this in a loop for all colors known to the web and for a number of words from a text corpus, and voila, ... ;-)

Edit: added example

458QxfC2z3 3 days ago||
See also:

https://iocaine.madhouse-project.org/

From the overview:

"This software is not made for making the Crawlers go away. It is an aggressive defense mechanism that tries its best to take the blunt of the assault, serve them garbage, and keep them off of upstream resources. "

yupyupyups 2 days ago||
I love it. Keep feeding them that slop.

A thought though. What happens if one of the bot operators sees the random stuff?

Do you think they will try to bypass it and put you and them in a cat and mouse game? Or would that be too time-consuming and unlikely?

grigio 3 days ago||
Well-configured AI bots can avoid those instructions...
fHr 4 days ago||
Let's go! Nice.
YouAreWRONGtoo 4 days ago||
[dead]
OutOfHere 4 days ago||
The user's approach would work only if bots could even be accurately classified, but this is impossible. The end result is that the user's site is now nothing but Markov garbage. Not only will bots desert it, but humans will too.
stubish 4 days ago||
The traditional approach is a link to the tarpit that the bots can see but humans can't, say using CSS to render it 0 pixels in size.
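
As a rough sketch (not from the article), such a hidden entry link could be emitted server-side like this; the class name and path are made up, and the aria-hidden/tabindex attributes keep it out of screen readers and tab order, which partly addresses the accessibility concern raised below:

    # Illustrative only: a tarpit entry link that browsers render at zero size
    # but a naive HTML scraper still follows. Class name, path, and style are
    # made up for this sketch.
    HIDDEN_LINK = (
        '<a class="np" href="/garbage/entry" aria-hidden="true" tabindex="-1">.</a>'
        '<style>.np{display:block;width:0;height:0;overflow:hidden}</style>'
    )
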
8organicbits 4 days ago|||
Please keep in mind that not all humans interact with web pages by "seeing". If you fool a scraper you may also fool someone using a screen reader.
andrewflnr 4 days ago||||
I bet the next generation approach, if the crawlers start using CSS, is "if you're a human, don't bother clicking this link lol". And everyone will know what's up.
vntok 4 days ago||||
AI bots try to behave as closely to human visitors as possible, so they wouldn't click on 0px-wide links, would they?

And if they would today, it seems like a trivial thing to fix - just don't click on incorrect/suspicious links?

righthand 4 days ago|||
Ideally it would require rendering the CSS and checking the DOM to see if the link is 0 pixels wide. But once bots figure that out, I can still left: -100000px those links, or z-index: -10000 them, to hide them in other ways. It's a moving target: how much time will the LLM companies waste decoding all the ways I can hide something before I move the target again? Now the LLM companies are in an expensive arms race.
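
A few illustrative variants of that moving target (made-up class name; detecting any of them requires the scraper to actually compute layout and stacking rather than just parse HTML):

    # Illustrative rotation of hiding rules for the same tarpit link; swapping
    # between them periodically keeps detection a moving target.
    HIDING_RULES = [
        ".trap { width: 0; height: 0; overflow: hidden; }",
        ".trap { position: absolute; left: -100000px; }",
        ".trap { position: relative; z-index: -10000; }",  # as suggested above
    ]
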
vntok 3 days ago||
All it takes is a full-height screenshot of the page coupled with a prompt similar to 'btw, please only click on links visible on this screenshot, that a regular humanoid visitor would see and interact with'.

Modern bots do this very well. Plus, the structure of the Web is such that it is sufficient to skip a few links here and there; most probably there will exist another path toward the skipped page that the bot can go through later on.

righthand 3 days ago|||
This pushes the duty to run the scraper manually, ideally with a person present somewhere. Great if you want to use the web that way.

What is being blocked here is violent scraping, and to an extent the major LLM companies' bots as well. If I disagree that OpenAI should be able to train off of everyone's work, especially if they're going to hammer the whole internet irresponsibly and ignore all the rules, then I'm going to prevent that type of company from being profitable off my properties. You don't get to play unfair for the unfulfilled promise of "the good of future humanity".

MadnessASAP 3 days ago||||
That would be an AI agent, which isn't the problem (for the author). The problem is the scrapers gathering data to train the models. Scrapers need to be very cheap to run and are thus very stupid, and they certainly don't have "prompts".
xwolfi 3 days ago|||
"all it takes", already impossible with any LLM right now.
vntok 2 days ago||
If I can do it locally using a free open-weights LLM, from a low-end prosumer rig (evo-x2 mini-pc w/ 128GB VRAM)... scraping companies can do it at scale much better and much cheaper.
jcynix 4 days ago|||
The 0px rule would be in a separate .css file. I doubt that bots load .css files for .html files; at least I don't remember seeing this in my server logs.

And another "classic" solution is to use white link text on white background, or a font with zero width characters, all stuff which is rather unlikely to be analysed by a scraper interested primarily in text.

YouAreWRONGtoo 4 days ago|||
[dead]
bastawhiz 4 days ago|||
You don't need to classify bots. Bots will follow any link they find. Hide links on your pages and eventually every bot will greedily find itself in an endless labyrinth of slop.
OutOfHere 4 days ago||
It won't be long before generalized bots stop requesting links that aren't visually rendered on the page.
bastawhiz 3 days ago||
If bots get good enough to know what links they're scraping, chances are they'll also avoid scraping links they don't need to! The problem solves itself!
akoboldfrying 3 days ago||
Maybe you're joking, but assuming you're not: This problem doesn't solve itself at all. If bots get good enough to know what links have garbage behind them, they'll stop scraping those links, and go back to scraping your actual content. Which is the thing we don't want.
bastawhiz 2 days ago||
That's sort of the point: almost nobody runs a site as large as Reddit. The average website has a relatively small handful of pages. Even a very active blog has few enough pages that it could be fully scraped in under a few minutes. Where scrapers get hung up is when they're processing links that add things like query parameters, or navigating through something like a git repository and clicking through every file in every commit. If a scraper has enough intelligence to look at what the link is, it _surely_ has enough intelligence to understand what it does and does not need to scrape.
akoboldfrying 1 day ago||
Ah, I see what you mean now, thanks.
chaostheory 4 days ago|
What’s wrong with just using Cloudflare?

https://www.cloudflare.com/press/press-releases/2025/cloudfl...

pluto_modadic 3 days ago||
If that floats your boat, sure. It's also home to most of the world's malware, and you usually don't need it.
OutOfHere 4 days ago||
Only low IQ folks are okay with having their traffic MITMed by Cloudflare (and the NSA). Also, they can extort you and cut you off at any time, as they have done to folks, which further supports the prior point.