Feed the bots - Hacker News

Posted by chmaynard 10/26/2025

https://maurycyz.com/projects/trap_bots/

305 points | 203 commentspage 2

reaperducer 10/27/2025|

All of these solutions seem expensive, if you're paying for outbound bandwidth.

I've thought about tying a hidden link, excluded in robots.txt, to fail2ban. Seems quick and easy with no side-effects, but I've ever actually gotten around to it.

comrade1234 10/26/2025||

I had to follow a link to see an example:

"A glass is not impossible to make the file and so deepen the original cut. Now heat a small spot on the glass, and a candle flame to a clear singing note.

— context_length = 2. The source material is a book on glassblowing."

masfuerte 10/26/2025|

Add "babble" to any url to get a page of nonsense:

https://maurycyz.com/babble/projects/trap_bots/

xyzal 10/26/2025||

I think random text can be detected and filtered. We need probably pre-generated bad information to make utility of crawling one's site truly negative.

On my site, I serve them a subset of Emergent Misalignment dataset, randomly perturbed by substituting some words with synonyms.

It should make the LLMs trained on it behave like dicks according to this research https://www.emergent-misalignment.com/

zkmon 10/26/2025||

Really cool. Reminds me of farmers of some third world countries. Completely ignored by government, exploited by commission brokers, farmers now use all sorts of tricks, including coloring and faking their farm produce, without regard for health hazards to consumers. The city dwellers who thought they have gamed the system through high education, jobs and slick-talk, have to consume whatever is served to them by the desperate farmers.

_heimdall 10/27/2025||

What you describe sounds more like industrial farming than tricks played by third world farmers (whatever that means).

Industrial ag regularly treats product to modify the texture, color, and shelf life. Its extremely common to expose produce to various gases and chemicals to either delay or hasten ripening, for example. Other tricks are used while the plants are still in the ground or immediately after harvest, for example spraying grains with roundup to dry out more quickly.

righthand 10/26/2025||

The agricultural farmers did it to themselves, many are very wealthy already. Anything corporate America has taken over is because the farmers didn’t want to do the maintenance work. So they sell out to big corporations who will make it easier.

Same as any other consumer using Meta products. You sell out because it’s easier to network that way.

I am the son of a farmer.

Edit: added disclosure at the bottom and clarified as agricultural farming

zkmon 10/26/2025|||

I'm a farmer myself. I was talking about farmers in some third world countries. They are extremely marginalized and suffered for decades and centuries. They still do.

Lord-Jobo 10/26/2025|||

This is a very biased source discussing a very real prescription issue, and worth a glance for the statistics:

https://www.farmkind.giving/the-small-farm-myth-debunked

Tldr; the concept of farmers as small family farms has not been rooted in truth for a very long time in America

righthand 10/26/2025||

This is for livestock farming, I was specifically discussing agricultural farming.

In general though, the easy rule of living and eating non-mega farmed food and sustainable living is to “eat aware”:

My other advice is a one-size-fits-all food equation, which is, simply, to know where it came from. If you can't place it, trace it, or grow it/raise it/catch it yourself, don't eat it. Eat aware. Know your food. Don't wait on waiters or institutions to come up with ways to publicize it, meet your small fishmonger and chat him or her up at the farmer's market yourself. [0]

[0] https://www.huffpost.com/entry/the-pescatores-dilemma_b_2463...

_heimdall 10/27/2025|||

Are you proposing that eating industrially raised produce or meat is safer and healthier than alternatives?

dpflug 10/26/2025|||

A whole lot of people don't have that available, but it's a good deal if you can get it.

righthand 10/27/2025||

Again talking about Americans.

akoboldfrying 10/27/2025||

Hope you don't mind if I point out a couple of small bugs in babble.c:

1. When read_word() reads the last word in a string, at line 146 it will read past the end (and into uninitialised memory, or the leftovers of previous longer strings), because you have already added 1 to len on line 140 to skip past the character that delimited the word. Undefined behaviour.

2. grow_chain() doesn't assign to (*chain)->capacity, so it winds up calling realloc() every time, unnecessarily. This probably isn't a big deal, because probably realloc() allocates in larger chunks and takes a fast no-op path when it determines it doesn't need to reallocate and copy.

3. Not a bug, but your index precomputation on lines 184-200 could be much more efficient. Currently it takes O(n^2 * MAX_LEAF) time, but it could be improved to linear time if you (a) did most of this computation once in the original Python extractor and (b) stored things better. Specifically, you could store and work with just the numeric indices, "translating" them to strings only at the last possible moment, before writing the word out. Translating index i to word i can be done very efficiently with 2 data structures:

    char word_data[MAX_WORDS * MAX_WORD_LEN];
    unsigned start_pos[MAX_WORDS + 1];

(Of course you could dynamically allocate them instead -- the static sizes just give the flavour.)

word_data stores all words concatenated together without delimiters; start_pos stores offsets into this buffer. To extract word i to dest:

    memcpy(dest, word_data + start_pos[i], start_pos[i + 1] - start_pos[i]);

You can store the variable-length list of possible next words for each word in a similar way, with a large buffer of integers and an array of offsets into it:

    unsigned next_words[MAX_WORDS * MAX_LEAF];     // Each element is a word index
    unsigned next_words_start_pos[MAX_WORDS + 1];  // Each element is an offset into next_words

Now the indices of all words that could follow word i are enumerated by:

    for (j = next_words_start_pos[i]; j < next_words_start_pos[i + 1]; ++j) {
        // Do something with next_words[j]
    }

(Note that you don't actually store the "current word" in this data structure at all -- it's the index i into next_words_start_pos, which you already know!)

theturtlemoves 10/26/2025||

Does this really work though? I know nothing about the inner workings of LLMs, but don't you want to break their word associations? Rather than generating "garbage" text based on which words tend to occur together and LLMs generating text based on which words it has seen together, don't you want to give them text that relates unrelated words?

wodenokoto 10/26/2025|

Why? The point is not to train bots one way or another, it’s to keep them busy in low resource activities instead of high resource activities.

hyperhello 10/26/2025||

Why not show them ads? Endless ads, with AI content in between them?

delecti 10/26/2025|

To what end? I imagine ad networks have pretty robust bot detection. I'd also be surprised if scrapers didn't have ad block functionality in their headless browsing.

mcdeltat 10/27/2025||

Maybe a dumb question but what exactly is wrong with banning the IPs? Even if the bots get more IPs over time, surely storing a list of bans is cheaper than serving content? Is the worry that the bots will eventually cycle through so many IP ranges that you end up blocking legit users?

maurycyz 10/27/2025||

It's often one IP (v4!) per one request. It's insane how many resources are being burned on this stupidity.

Part of the reason I did this is to get good numbers on how bad the problem is: A link maze is a great way to make otherwise very stealthy bots expose themselves.

moebrowne 10/27/2025||

Even if this is true how long can that be sustained before they start to be recycled? I bet the scrappers make a whole lot more requests than they have IPs

csomar 10/27/2025||

They are usually using residential IPs through SOCK5. I am not sure how they are getting these residential IPs but it is definitively suspicious.

So by blocking these IPs, you are blocking your users. (ie: in many coffeshops, I get the "IP Blocked" banner, my guess is that they are running software on unsuspecting users to route this traffic).

moebrowne 10/27/2025|||

> So by blocking these IPs, you are blocking your users.

There were 122 million residential internet connections in the US in 2024 so for an app with 1 million users the chance of affecting a single user is <1%.

[1] https://docs.fcc.gov/public/attachments/DOC-411463A1.pdf

sroelants 10/27/2025|||

They use scammy providers like Bright Data[1] that let app authors embed their malware (for a compensation, I'm sure) which turns users' devices into crawler proxies.

[1]: https://brightdata.com/trustcenter/sourcing

ricardo81 10/27/2025||

A thing you'll have to watch for is these agents actually being a user's browser, just the browser provider is using them as a proxy.

Otherwise, there are residential IP proxy services that cost around $1/GB which is cheap, but why pay when you can get the user to agree to be a proxy.

If the margin of error is small enough in detecting automated requests, may as well serve up some crypto mining code for the AI bots to work through but again, it could easily be an (unsuspecting) user.

I haven't looked into it much, it'd be interesting to know whether some of the AI requests are using mobile agents (and show genuine mobile fingerprints)

rifty 10/28/2025|

I suppose once you've lured them into reading a couple garbage pages you've successfully identified them as bots. You could then serve them garbage pages even for real urls as well just in case they ever got smart enough to try and back out of endless garbage. You could probably do a bunch of things that would only affect them specifically to increase their costs.

More comments...