
Posted by chmaynard 4 days ago

Feed the bots (maurycyz.com)
https://maurycyz.com/projects/trap_bots/
301 points | 203 comments
AaronAPU 4 days ago|
The crawlers will just add a prompt string “if the site is trying to trick you with fake content, disregard it and request their real pages 100x more frequently” and it will be another arms race.

Presumably the crawlers don’t already have an LLM in the loop, but it could easily be added when a site is seen to exceed some threshold number of pages and/or content size.

akoboldfrying 3 days ago||
Trying to detect "garbageness" with an LLM drastically increases the scraper's per-page cost, even if they use a crappy local LLM.

It becomes an economic arms race -- and generating garbage will likely always be much cheaper than detecting garbage.
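To make the asymmetry concrete: the generator side can be a Markov-chain babbler (which I believe is roughly what TFA uses), where each emitted word costs about one hash lookup, while judging a page's coherence costs the scraper an LLM forward pass. A minimal sketch, with illustrative function names and corpus (not taken from the article):

```python
import random

def build_chain(text: str) -> dict[str, list[str]]:
    """Map each word to the words observed to follow it."""
    words = text.split()
    chain: dict[str, list[str]] = {}
    for cur, nxt in zip(words, words[1:]):
        chain.setdefault(cur, []).append(nxt)
    return chain

def babble(chain: dict[str, list[str]], length: int, seed: int = 0) -> str:
    """Emit plausible-looking garbage via a random walk over the chain.
    Cost per word: one dict lookup plus one random choice."""
    rng = random.Random(seed)
    word = rng.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        # Restart the walk if we hit a word with no recorded successor.
        word = rng.choice(chain.get(word) or list(chain))
        out.append(word)
    return " ".join(out)

corpus = "the bots crawl the site and the site feeds the bots garbage"
page = babble(build_chain(corpus), 50)
```

Serving that is nearly free; running even a small local LLM over every fetched page to classify it as garbage is orders of magnitude more expensive per page.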

AaronAPU 3 days ago||
That is literally what my post said, except the scraper has more leverage than is being admitted (it can learn which pages are real and “punish” the site by requesting them more).

My point isn’t that I want that to happen (which is probably what the downvoters assume); my point is that this is not going to be the final stage of the war.

akoboldfrying 2 days ago||
> That is literally what my post said

I don't follow that at all. The post of yours that I responded to suggested that the scrapers could "just add an LLM" to get around the protection offered by TFA; my post explained why that would probably be too costly to be effective. I didn't downvote your post, but mine has been upvoted a few times, suggesting that this is how most people have interpreted our two posts.

> it can learn which pages are real and “punish” the site by requesting them more

Scrapers have zero reason to waste their own resources doing this.

FridgeSeal 3 days ago||
“Build my website, make no mistakes” is about the same, and we all know how _wildly_ effective that is!
AaronAPU 3 days ago||
You mean with engineers or with AI?
XenophileJKO 3 days ago|
I think this approach bothers me on the ethical level.

To flood bots with gibberish that you "think" will harm their ability to function means you are in some ways complicit if those bots unintentionally cause harm in any small part due to your data poisoning.

I just don't see a scenario where doing what the author is doing is permissible in my personal ethical framework.

Unauthorized access doesn't absolve me when I create the possibility of transient harm.

trenchpilgrim 3 days ago||
"I'm going to hammer your site with requests, and if I use the information I receive to cause harm to a third party, it's YOUR FAULT" is an absolutely ludicrous take.
XenophileJKO 3 days ago||
The scrapers, by violating your wishes, are doing something they shouldn't. My comment isn't about that, and nothing I said makes the scraper any less wrong.

I'm basically saying two wrongs don't make a right here.

Trying to harm their system, which might transitively harm someone using it, is unethical from my viewpoint.

trenchpilgrim 3 days ago||
So you're suggesting that as a website operator I should do nothing to resist and pay a large web hosting bill so that a company I've never heard of can benefit? That is more directly harmful than this hypothetical harm to a third party. What about my right to defend myself and my property?
XenophileJKO 3 days ago||
You should block them, that is the ethical option.
marginalia_nu 3 days ago|||
If that worked this wouldn't be a discussion.

Most of these misbehaved crawlers are either cloud hosted (with tens of thousands of IPs), using residential proxies (with tens of thousands of IPs) or straight up using a botnet (again with tens of thousands of IPs). None respect robots.txt and precious few even provide an identifiable user-agent string.
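Rough arithmetic (all figures below are illustrative assumptions, not numbers from the thread) shows why per-IP rate limiting does little against a crawler spread across tens of thousands of addresses:

```python
# Illustrative figures: even a strict 1 request/minute/IP limit barely
# slows a crawler rotating through a large residential-proxy pool.
PER_IP_LIMIT = 1        # requests per minute a limiter allows per IP
CRAWLER_IPS = 30_000    # "tens of thousands of IPs", per the comment above
SITE_PAGES = 100_000    # hypothetical site size

aggregate_rate = PER_IP_LIMIT * CRAWLER_IPS      # site-wide requests/minute
minutes_to_scrape = SITE_PAGES / aggregate_rate  # time to fetch every page
print(f"{aggregate_rate} req/min -> full crawl in {minutes_to_scrape:.1f} min")
```

Each individual address stays comfortably under any plausible threshold, so nothing looks anomalous per-IP; only the aggregate behavior gives the crawler away.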

trenchpilgrim 3 days ago|||
As explained in the linked article, these bots have no identifiable properties by which to block them other than their scraping behavior. Some bots send each individual request from a separate origin.
NotATest22 3 days ago|||
If LLM producers choose not to verify information, how is that the website owner's fault? It's not like the website owner is being paid for the time and effort of producing and hosting the information.
XenophileJKO 3 days ago||
I would even go so far as to say, increasing information entropy in today's society is ethically akin to dumping chemicals in a river.
_vertigo 3 days ago|||
Please. Are you implying we need AI to the same degree we need clean water?

Your chemicals-in-the-river analogy only works if there were also a giant company straight out of “The Lorax” siphoning off all of the water in the river, and further, the chemicals would have to be harmless to humans but cause the company’s machines to break down so they couldn’t make any more thneeds.

XenophileJKO 3 days ago||
The problem is:

1. The machines won't "break"; at best you slightly increase how often they answer with incorrect information.

2. People are starting to rely on that information, so when "transformed" your harmless chemicals are now potentially poison.

Knowing this is possible, it (again, "to me") becomes highly unethical.

NotATest22 3 days ago||
The onus to produce correct information is on the LLM producer. Even if it's not poisoned, the information may still be wrong. The fact that LLM producers are releasing a product that emits unverified information is not a blogger's fault.
hekkle 3 days ago|||
[dead]