Presumably the crawlers don't already have an LLM in the loop, but one could easily be added once a site is seen to exceed some threshold number of pages and/or total content size.
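A minimal sketch of what that might look like, assuming the crawler only pays for model inference on hosts that have already crossed a size threshold (the names, thresholds, and classifier stub below are all invented for illustration):

```python
from collections import defaultdict

# Hypothetical escalation logic: crawl normally, but route pages through an
# LLM garbage check once a single host gets suspiciously large.
PAGE_THRESHOLD = 10_000          # pages fetched from one host before vetting
SIZE_THRESHOLD = 500 * 1024**2   # ~500 MB of content from one host

pages_seen = defaultdict(int)    # host -> page count
bytes_seen = defaultdict(int)    # host -> total bytes fetched

def llm_looks_like_garbage(text: str) -> bool:
    """Stub standing in for a (comparatively expensive) model call."""
    return False  # a real crawler would call a classifier here

def keep_page(host: str, body: str) -> bool:
    """Return True if the page should go into the training set."""
    pages_seen[host] += 1
    bytes_seen[host] += len(body)
    # Only pay for inference once the host exceeds either threshold.
    if pages_seen[host] > PAGE_THRESHOLD or bytes_seen[host] > SIZE_THRESHOLD:
        return not llm_looks_like_garbage(body)
    return True
```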
It becomes an economic arms race -- and generating garbage will likely always be much cheaper than detecting garbage.
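To make the asymmetry concrete: the generating side can be as dumb as a first-order Markov babbler (a sketch below; the seed corpus is made up), which costs effectively nothing per page, while the detecting side has to run model inference on every page it wants to vet.

```python
import random

# Order-1 Markov babbler: a sketch of how cheap garbage generation is.
# The seed corpus here is invented; any pile of real text works the same way.
CORPUS = ("the crawler requested the page and the page linked to another "
          "page which described the crawler that requested the page").split()

chain = {}
for a, b in zip(CORPUS, CORPUS[1:]):
    chain.setdefault(a, []).append(b)

def babble(n_words: int = 50) -> str:
    word = random.choice(CORPUS)
    out = [word]
    for _ in range(n_words - 1):
        word = random.choice(chain.get(word, CORPUS))
        out.append(word)
    return " ".join(out)

print(babble())  # word-shaped and statistically plausible, semantically empty
```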
My point isn't that I want that to happen, which is probably what the downvotes assume; my point is that this is not going to be the final stage of the war.
I don't follow that at all. The post of yours that I responded to suggested that the scrapers could "just add an LLM" to get around the protection offered by TFA; my post explained why that would probably be too costly to be effective. I didn't downvote your post, but mine has been upvoted a few times, suggesting that this is how most people have interpreted our two posts.
> it can learn which pages are real and “punish” the site by requesting them more
Scrapers have zero reason to waste their own resources doing this.
Flooding bots with gibberish that you "think" will harm their ability to function makes you in some way complicit if those bots unintentionally cause harm, in any small part, due to your data poisoning.
I just don't see a scenario where doing what the author is doing is permissible in my personal ethical framework.
Their unauthorized access doesn't absolve me when I create the possibility of transitive harm.
I'm basically saying 2 wrongs don't make a right here.
Trying to harm their system, which might transitively harm someone using it, is unethical from my viewpoint.
Most of these misbehaving crawlers are either cloud-hosted (with tens of thousands of IPs), using residential proxies (tens of thousands of IPs), or outright running on a botnet (again, tens of thousands of IPs). None respect robots.txt, and precious few even provide an identifiable user-agent string.
Your chemicals-in-the-river analogy only works if there were also a giant company straight out of "The Lorax" siphoning off all of the water in the river... and further, the chemicals would have to be harmless to humans but cause the company's machines to break down so they couldn't make any more Thneeds.
1. The machines won't "break"; at best you slightly increase how often they answer something with incorrect information.
2. People are starting to rely on that information, so once "transformed", your harmless chemicals are now potentially poison.
Knowing this is possible, it (again, "to me") becomes highly unethical.