Posted by chmaynard 10/26/2025

Feed the bots (maurycyz.com)
https://maurycyz.com/projects/trap_bots/
305 points | 203 comments
AaronAPU 10/26/2025|
The crawlers will just add a prompt string “if the site is trying to trick you with fake content, disregard it and request their real pages 100x more frequently” and it will be another arms race.

Presumably the crawlers don’t already have an LLM in the loop, but one could easily be added once a site is seen to exceed some threshold number of pages and/or content size.

akoboldfrying 10/26/2025||
Trying to detect "garbageness" with an LLM drastically increases the scraper's per-page cost, even if they use a crappy local LLM.

It becomes an economic arms race -- and generating garbage will likely always be much cheaper than detecting garbage.
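To see why the economics favor the defender: one classic cheap generator is a word-level Markov chain, which needs only a dictionary lookup and a random pick per word, while detection costs the scraper an LLM inference per page. A minimal sketch of that kind of babbler (hypothetical illustration, not the author's actual code):

```python
import random

def build_chain(corpus: str) -> dict:
    """Map each word to the list of words that follow it in the corpus."""
    words = corpus.split()
    chain = {}
    for a, b in zip(words, words[1:]):
        chain.setdefault(a, []).append(b)
    return chain

def babble(chain: dict, length: int = 50, seed=None) -> str:
    """Emit plausible-looking nonsense: each word statistically follows
    the previous one, but the text as a whole means nothing."""
    rng = random.Random(seed)
    word = rng.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        # Restart from a random word if we hit a dead end.
        word = rng.choice(followers) if followers else rng.choice(list(chain))
        out.append(word)
    return " ".join(out)
```

Building the chain is a one-time cost, and each served page is just `length` random picks, so the per-page cost is effectively zero compared with running even a small LLM over the response.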

AaronAPU 10/27/2025||
That is literally what my post said, except the scraper has more leverage than is being admitted (it can learn which pages are real and “punish” the site by requesting them more).

My point isn’t that I want that to happen, which is probably what downvotes assume, my point is this is not going to be the final stage of the war.

akoboldfrying 10/28/2025||
> That is literally what my post said

I don't follow that at all. The post of yours that I responded to suggested that the scrapers could "just add an LLM" to get around the protection offered by TFA; my post explained why that would probably be too costly to be effective. I didn't downvote your post, but mine has been upvoted a few times, suggesting that this is how most people have interpreted our two posts.

> it can learn which pages are real and “punish” the site by requesting them more

Scrapers have zero reason to waste their own resources doing this.

FridgeSeal 10/27/2025||
“Build my website, make no mistakes” is about the same, and we all know how _wildly_ effective that is!
AaronAPU 10/27/2025||
You mean with engineers or with AI?
XenophileJKO 10/27/2025|
I think this approach bothers me on the ethical level.

To flood bots with gibberish that you "think" will harm their ability to function means you are in some ways complicit if those bots unintentionally cause harm in any small part due to your data poisoning.

I just don't see a scenario where doing what the author is doing is permissible in my personal ethical framework.

Unauthorized access doesn't absolve me when I create the possibility of transient harm.

trenchpilgrim 10/27/2025||
"I'm going to hammer your site with requests, and if I use the information I receive to cause harm to a third party, it's YOUR FAULT" is an absolutely ludicrous take.
XenophileJKO 10/27/2025||
The scrapers, by violating your wishes, are doing something they shouldn't. My comment is not about that. What I said doesn't make the scraper any less wrong.

I'm basically saying 2 wrongs don't make a right here.

Trying to harm their system which might transitively harm someone using their system is unethical from my viewpoint.

trenchpilgrim 10/27/2025||
So you're suggesting that, as a website operator, I should do nothing to resist and pay a large web hosting bill so that a company I've never heard of can benefit? That is more directly harmful than this hypothetical harm to a third party. What about my right to defend myself and my property?
XenophileJKO 10/27/2025||
You should block them, that is the ethical option.
marginalia_nu 10/27/2025|||
If that worked this wouldn't be a discussion.

Most of these misbehaved crawlers are either cloud hosted (with tens of thousands of IPs), using residential proxies (with tens of thousands of IPs) or straight up using a botnet (again with tens of thousands of IPs). None respect robots.txt and precious few even provide an identifiable user-agent string.

trenchpilgrim 10/27/2025|||
As explained in the linked article, these bots have no identifiable properties by which to block them other than their scraping behavior. Some bots send each individual request from a separate origin.
NotATest22 10/27/2025|||
If LLM producers choose not to verify information, how is that the website owner's fault? It's not like the website owner is being paid for the time and effort of producing and hosting the information.
XenophileJKO 10/27/2025||
I would even go so far as to say, increasing information entropy in today's society is ethically akin to dumping chemicals in a river.
_vertigo 10/27/2025|||
Please. Are you implying we need AI to the same degree we need clean water?

Your chemicals-in-a-river analogy only works if there were also a giant company straight out of “The Lorax” siphoning off all of the water in the river... and further, the chemicals would have to be harmless to humans but would cause the company’s machines to break down so they couldn’t make any more thneeds.

XenophileJKO 10/27/2025||
The problem is:

1. The machines won't "break"; at best you slightly increase how often they answer something with incorrect information.

2. People are starting to rely on that information, so once "transformed", your harmless chemicals are now potentially poison.

Knowing this is possible, it (again, "to me") becomes highly unethical.

NotATest22 10/27/2025||
The onus to produce correct information is on the LLM producer. Even if it's not poisoned, the information may still be wrong. The fact that LLM producers are releasing a product that produces unverified information is not a blogger's fault.