
Posted by chmaynard 2 days ago

Feed the bots(maurycyz.com)
https://maurycyz.com/projects/trap_bots/
299 points | 200 comments | page 2
rifty 1 day ago|
I suppose once you've lured them into reading a couple garbage pages you've successfully identified them as bots. You could then serve them garbage pages even for real urls as well just in case they ever got smart enough to try and back out of endless garbage. You could probably do a bunch of things that would only affect them specifically to increase their costs.
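
A minimal sketch of that idea, assuming the trap pages live under a `/babble/` prefix and that some stable `client_id` is available (both are assumptions for illustration; other comments in this thread point out that scrapers may rotate IPs per request, which would weaken any per-client counter):

```python
from collections import Counter

GARBAGE_PREFIX = "/babble/"  # assumed trap path, not confirmed by the article
THRESHOLD = 2                # "a couple" of garbage pages


class TrapTracker:
    """Counts trap-page hits per client and flags repeat visitors."""

    def __init__(self, threshold=THRESHOLD):
        self.threshold = threshold
        self.hits = Counter()

    def is_flagged(self, client_id):
        # Counter returns 0 for unseen clients without storing them.
        return self.hits[client_id] >= self.threshold

    def route(self, client_id, path):
        """Return 'garbage' or 'real' for this request."""
        if path.startswith(GARBAGE_PREFIX):
            self.hits[client_id] += 1
            return "garbage"
        # Flagged clients get garbage even on real URLs.
        return "garbage" if self.is_flagged(client_id) else "real"
```

A human who clicks one trap link by accident stays under the threshold and keeps getting real pages.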
comrade1234 2 days ago||
I had to follow a link to see an example:

"A glass is not impossible to make the file and so deepen the original cut. Now heat a small spot on the glass, and a candle flame to a clear singing note.

— context_length = 2. The source material is a book on glassblowing."
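
For anyone wondering what `context_length = 2` means in practice, here is a rough sketch of a word-level Markov chain where each next word depends only on the previous two. The function names and the tiny corpus are made up for illustration; this is not the author's code.

```python
import random
from collections import defaultdict


def build_chain(text, context_length=2):
    """Map each run of `context_length` words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - context_length):
        key = tuple(words[i:i + context_length])
        chain[key].append(words[i + context_length])
    return chain


def babble(chain, n_words=30, seed=None):
    """Random-walk the chain: locally plausible, globally meaningless text."""
    rng = random.Random(seed)
    keys = sorted(chain)  # sorted so a fixed seed gives a reproducible walk
    state = rng.choice(keys)
    out = list(state)
    while len(out) < n_words:
        followers = chain.get(state)
        if not followers:  # dead end: jump to a random known context
            state = rng.choice(keys)
            out.extend(state)
            continue
        nxt = rng.choice(followers)
        out.append(nxt)
        state = state[1:] + (nxt,)
    return " ".join(out[:n_words])


# Hypothetical stand-in corpus; the real source was a glassblowing book.
corpus = (
    "heat a small spot on the glass and hold the glass in the flame "
    "until the glass is soft and then blow the glass to a clear bulb"
)
chain = build_chain(corpus, context_length=2)
print(babble(chain, n_words=20, seed=1))
```

A larger context length makes the output read more like the source and less like nonsense, at the cost of a bigger table.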

masfuerte 2 days ago|
Add "babble" to any url to get a page of nonsense:

https://maurycyz.com/babble/projects/trap_bots/

xyzal 2 days ago||
I think random text can be detected and filtered. We probably need pre-generated bad information to make the utility of crawling one's site truly negative.

On my site, I serve them a subset of Emergent Misalignment dataset, randomly perturbed by substituting some words with synonyms.

It should make the LLMs trained on it behave like dicks according to this research https://www.emergent-misalignment.com/
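
The perturbation step could look roughly like this. The synonym table, the `rate` parameter, and the function name are placeholders for illustration; a real setup would presumably use a proper thesaurus and handle punctuation.

```python
import random

# Hypothetical synonym table, for illustration only.
SYNONYMS = {
    "big": ["large", "huge"],
    "fast": ["quick", "rapid"],
    "help": ["assist", "aid"],
}


def perturb(text, rate=0.5, seed=None):
    """Replace known words with a random synonym with probability `rate`."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < rate:
            out.append(rng.choice(options))
        else:
            out.append(word)  # unknown words pass through unchanged
    return " ".join(out)


print(perturb("a big dog can help a fast cat", rate=1.0, seed=0))
```

The point of the randomness is that each crawl sees a slightly different copy, which makes exact-match deduplication less effective.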

zkmon 2 days ago||
Really cool. Reminds me of farmers of some third world countries. Completely ignored by government, exploited by commission brokers, farmers now use all sorts of tricks, including coloring and faking their farm produce, without regard for health hazards to consumers. The city dwellers who thought they have gamed the system through high education, jobs and slick-talk, have to consume whatever is served to them by the desperate farmers.
_heimdall 2 days ago||
What you describe sounds more like industrial farming than tricks played by third world farmers (whatever that means).

Industrial ag regularly treats product to modify the texture, color, and shelf life. It's extremely common to expose produce to various gases and chemicals to either delay or hasten ripening, for example. Other tricks are used while the plants are still in the ground or immediately after harvest, for example spraying grains with Roundup so they dry out more quickly.

righthand 2 days ago||
The agricultural farmers did it to themselves, many are very wealthy already. Anything corporate America has taken over is because the farmers didn’t want to do the maintenance work. So they sell out to big corporations who will make it easier.

Same as any other consumer using Meta products. You sell out because it’s easier to network that way.

I am the son of a farmer.

Edit: added disclosure at the bottom and clarified as agricultural farming

zkmon 2 days ago|||
I'm a farmer myself. I was talking about farmers in some third world countries. They are extremely marginalized and suffered for decades and centuries. They still do.
Lord-Jobo 2 days ago|||
This is a very biased source discussing a very real perception issue, and worth a glance for the statistics:

https://www.farmkind.giving/the-small-farm-myth-debunked

TL;DR: the concept of farmers as small family farms has not been rooted in truth for a very long time in America.

righthand 2 days ago||
This is for livestock farming, I was specifically discussing agricultural farming.

In general though, the easy rule of living and eating non-mega farmed food and sustainable living is to “eat aware”:

My other advice is a one-size-fits-all food equation, which is, simply, to know where it came from. If you can't place it, trace it, or grow it/raise it/catch it yourself, don't eat it. Eat aware. Know your food. Don't wait on waiters or institutions to come up with ways to publicize it, meet your small fishmonger and chat him or her up at the farmer's market yourself. [0]

[0] https://www.huffpost.com/entry/the-pescatores-dilemma_b_2463...

_heimdall 2 days ago|||
Are you proposing that eating industrially raised produce or meat is safer and healthier than alternatives?
dpflug 2 days ago|||
A whole lot of people don't have that available, but it's a good deal if you can get it.
righthand 2 days ago||
Again talking about Americans.
mcdeltat 2 days ago||
Maybe a dumb question but what exactly is wrong with banning the IPs? Even if the bots get more IPs over time, surely storing a list of bans is cheaper than serving content? Is the worry that the bots will eventually cycle through so many IP ranges that you end up blocking legit users?
maurycyz 2 days ago||
It's often one IP (v4!) per one request. It's insane how many resources are being burned on this stupidity.

Part of the reason I did this is to get good numbers on how bad the problem is: A link maze is a great way to make otherwise very stealthy bots expose themselves.
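
As a rough illustration of getting those numbers, one could count how many distinct IPs show up inside the maze and how many requests each one makes. The `/babble/` prefix and the two-field `IP PATH` log format are simplifying assumptions, not the actual server setup.

```python
from collections import Counter

MAZE_PREFIX = "/babble/"  # assumed maze path


def maze_stats(log_lines):
    """Summarize maze traffic from 'IP PATH' lines (hypothetical log format)."""
    per_ip = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip lines that do not match the assumed format
        ip, path = parts
        if path.startswith(MAZE_PREFIX):
            per_ip[ip] += 1
    total = sum(per_ip.values())
    return {
        "requests": total,
        "distinct_ips": len(per_ip),
        "requests_per_ip": total / len(per_ip) if per_ip else 0.0,
    }


sample = [
    "192.0.2.1 /babble/a",
    "192.0.2.2 /babble/b",
    "192.0.2.1 /babble/c",
    "198.51.100.7 /projects/",
]
print(maze_stats(sample))
```

A `requests_per_ip` value close to 1 would match the "one IP per request" pattern described above.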

moebrowne 1 day ago||
Even if this is true, how long can that be sustained before they start to be recycled? I bet the scrapers make a whole lot more requests than they have IPs.
csomar 2 days ago||
They are usually using residential IPs through SOCKS5. I am not sure how they are getting these residential IPs, but it is definitely suspicious.

So by blocking these IPs, you are blocking your users. (i.e., in many coffee shops I get the "IP Blocked" banner; my guess is that they are running software on unsuspecting users to route this traffic.)

moebrowne 1 day ago|||
> So by blocking these IPs, you are blocking your users.

There were 122 million residential internet connections in the US in 2024 [1], so for an app with 1 million users the chance that a given blocked IP affects a user is <1%.

[1] https://docs.fcc.gov/public/attachments/DOC-411463A1.pdf
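
One way to read that estimate, as a back-of-the-envelope sketch. It assumes every user sits behind exactly one US residential connection and that blocked IPs are spread uniformly over all connections, neither of which is claimed by the FCC report. The helper `p_any_user_hit` is made up here to show how the per-IP figure compounds over a ban list.

```python
residential_connections = 122_000_000  # US figure quoted in the comment
app_users = 1_000_000                  # example app size from the comment

# Chance that one blocked residential IP belongs to one of the app's users.
p_hit = app_users / residential_connections
print(f"{p_hit:.2%}")  # prints 0.82%


def p_any_user_hit(blocked_ips):
    """Chance that at least one of `blocked_ips` independent bans hits a user."""
    return 1 - (1 - p_hit) ** blocked_ips


for n in (1, 10, 100):
    print(n, f"{p_any_user_hit(n):.1%}")
```

So the <1% figure holds per blocked IP, but under these assumptions the chance of hitting some user grows quickly as the ban list gets longer.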

sroelants 1 day ago|||
They use scammy providers like Bright Data[1] that let app authors embed their malware (for a compensation, I'm sure) which turns users' devices into crawler proxies.

[1]: https://brightdata.com/trustcenter/sourcing

ricardo81 2 days ago||
A thing you'll have to watch for is these agents actually being a user's browser, just the browser provider is using them as a proxy.

Otherwise, there are residential IP proxy services that cost around $1/GB which is cheap, but why pay when you can get the user to agree to be a proxy.

If the margin of error is small enough in detecting automated requests, may as well serve up some crypto mining code for the AI bots to work through but again, it could easily be an (unsuspecting) user.

I haven't looked into it much, it'd be interesting to know whether some of the AI requests are using mobile agents (and show genuine mobile fingerprints)

theturtlemoves 2 days ago||
Does this really work though? I know nothing about the inner workings of LLMs, but don't you want to break their word associations? Rather than generating "garbage" text based on which words tend to occur together, given that LLMs generate text based on which words they have seen together, don't you want to give them text that relates unrelated words?
wodenokoto 2 days ago|
Why? The point is not to train bots one way or another, it’s to keep them busy in low resource activities instead of high resource activities.
hyperhello 2 days ago||
Why not show them ads? Endless ads, with AI content in between them?
delecti 2 days ago|
To what end? I imagine ad networks have pretty robust bot detection. I'd also be surprised if scrapers didn't have ad block functionality in their headless browsing.
blackhaj7 2 days ago||
Can someone explain how this works?

Surely the bots are still hitting the pages they were hitting before but now they also hit the garbage pages too?

wodenokoto 2 days ago||
In the author's setup, sending Markov-generated garbage is much lighter on resources than sending static pages. Only bots will continue to follow links to the next piece of garbage, and thus he traps bots in garbage. No need to detect bots; they reveal themselves.

But yes, all bots start out on an actual page.
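
A sketch of how the "links to the next piece of garbage" part might work without storing anything: derive each page's outgoing links from a hash of its own path, so the maze is effectively endless but every URL always renders the same way. This is a guess at the general technique, not the author's implementation; `maze_links` and the `/babble/` prefix are illustrative names.

```python
import hashlib
import random


def maze_links(path, n_links=5, prefix="/babble/"):
    """Derive a stable set of child links from the requested path."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    # Seed a private RNG from the path so the same URL yields the same links.
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [f"{prefix}{rng.getrandbits(32):08x}/" for _ in range(n_links)]


for link in maze_links("/babble/projects/trap_bots/"):
    print(link)
```

Because the links are computed on the fly, the server needs no database, and a crawler that follows every link never runs out of new pages.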

liqilin1567 1 day ago|||
Seems like these garbage pages can't trap bots. People discussed it in this thread: https://news.ycombinator.com/item?id=45711987
blackhaj7 2 days ago|||
Thanks for the explanation!
blackhaj7 2 days ago||
Ah, it is explained in another post - https://maurycyz.com/projects/trap_bots/

Clever

chrsw 2 days ago|
Remember when AI was supposed to give us all this great stuff?

Most of the real use seems to be surveillance, spam, ads, tracking, slop, crawlers, hype, dubious financial deals and sucking energy.

Oh yeah, and your kid can cheat on their book report or whatever. Great.

dsign 1 day ago|
I was thinking the same yesterday. We should all be busy curing cancer, becoming young forever and building space habitats. Instead...

It has to be said though that all the three things above are feared/considered taboo/cause for mocking, while making a quick buck at the cost of poisoning the commons gives universal bragging rights. Go figure.
