Posted by LucidLynx 12 hours ago

Miasma: A tool to trap AI web scrapers in an endless poison pit (github.com)
255 points | 192 comments
nosmokewhereiam 8 hours ago|
My asthmar

I'm assuming this is a reference to Lord of the Flies.

cwnyth 7 hours ago|
Miasma is bad or poisonous air. It's a Greek word.
iFire 2 hours ago||
I for one welcome everyone to the tarpit, where a normal person is seen as a robot in an endless poison pit. It sounds like a Black Mirror episode.
jackdoe 3 hours ago||
rage against the dying of the light
ninjagoo 8 hours ago||
This is essentially machine-generated spam.

The irony of machine-generated slop to fight machine-generated slop would be funny if it weren't for the implications. How long before people start sharing AI-spam lists, both pro-AI and anti-AI?

Just like with email, at some point these share-lists will be adopted by the big corporates, and just like with email will make life hard for the small players.

Once a website appears on one of these lists, legitimately or otherwise, what will the reputational damage be, and how badly will it hurt the site's appearance in search indexes? There have already been examples of Google delisting or dropping websites from search results.

Will there be a process to appeal these blacklists? Based on how things work with email, I doubt this will be a meaningful process. It's essentially an arms race, with the little folks getting crushed by juggernauts on all sides.

This project's selective protection of the major players reinforces that effect; from the README:

" Be sure to protect friendly bots and search engines from Miasma in your robots.txt!

  User-agent: Googlebot
  User-agent: Bingbot
  User-agent: DuckDuckBot
  User-agent: Slurp
  User-agent: SomeOtherNiceBot
  Disallow: /bots
  Allow: / "

snehesht 10 hours ago||
Why not simply blacklist or rate-limit those bot IPs?
xprnio 9 hours ago||
If you have both real traffic and bot traffic, you still need to identify which is which. On top of that, bots very likely don't reuse the same IPs over and over again. If we knew ahead of time all the IPs used only by bots, then yes, blacklisting them would be simple. But while it's simple in theory, identifying what to blacklist in the first place is the part that isn't.
snehesht 9 hours ago||
You wouldn’t permanently block them, it’s more like a rolling window.

You can use security challenges as a mechanism to identify false positives.

Sure, bots can get tons of proxies for cheap, but that doesn't mean you can't block them temporarily, similar to how SSH honeypots or the Spamhaus SBL work.

Bender 4 hours ago|||
> Why not simply blacklist or rate limit those bot IPs?

Many bots cycle through short DHCP leases on LTE Wi-Fi devices. One would have to accept blocking all cell phones, which I have done for my personal hobby crap, but most businesses will not. Another big swath of bots comes from Amazon EC2 and Google Cloud, which I will also happily block on my hobby crap, but most businesses will not.

Some bots are easier to block: they do not use real web clients and are missing some TCP/IP headers, making them ultra easy to block. Some do not spoof the user-agent and are easy to block. Some will attempt to access URLs not visible to real humans, thus blocking themselves. Many bots cannot do HTTP/2.0, so they are also trivial to block. Pretty much anything not using headless Chrome is easy to block.
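A minimal sketch of those cheap-bot checks (the request dict shape, field names, and the honeypot path are all illustrative, not from any real server framework):

```python
# Toy heuristics for the "easy" bots described above.
# `req` is a stand-in for whatever request object your server exposes.
HONEYPOT_PREFIXES = ("/bots",)  # hypothetical paths no human-facing link points to

def looks_like_cheap_bot(req):
    if not req.get("user_agent"):            # no user-agent header at all
        return True
    if req.get("protocol") != "HTTP/2.0":    # raw HTTP libraries often stop at 1.1
        return True
    if req.get("path", "").startswith(HONEYPOT_PREFIXES):
        return True                          # touched a URL hidden from real humans
    return False
```

As the comment notes, none of this catches bots driving headless Chrome, which speaks HTTP/2 with a plausible user-agent.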

phyzome 9 hours ago|||
Because punishment for breaking the robots.txt rules is a social good.
arbol 8 hours ago|||
The AI companies are using virtually unlimited "clean" residential IPs so this is not a valid strategy.
DaiPlusPlus 8 hours ago||
How? They run their scraping and training infrastructure - and the models themselves - from within those "AI datacenters"[1] we hear about in the news, not by proxying through end users' own pipes.

[1]: in quotes, because I dislike the term: it's immaterial whether an ugly block of concrete out in the sticks is housing LLM hardware or good ol' fashioned colo racks.

AyyEye 7 hours ago||
Residential proxy networks.
nextlevelwizard 4 hours ago|||
The point is to kill, or at least hinder, AI progress.
aduwah 9 hours ago|||
There are way too many to do that.
snehesht 8 hours ago||
True, most blacklist systems today aren't realtime, unlike AWS WAF or Cloudflare.

We need a crawler blacklist that can stream list deltas in real time to a centralized list, with local DBs pulling the changes.

Verified domains could push suspected bot IPs, and the engine would run heuristics to see if there is a pattern across data sources and issue a temporary block with an exponential TTL.

There are many problems to solve here, but as with any OSS project it will evolve over time if there is enough interest.

The costs of running this system would be huge, though. Corporate sponsors may not work out, but individual sponsors may be incentivized, since it helps them reduce the bandwidth and compute costs of bot traffic.
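The "temporary block with exponential TTL" part could look something like this (class and parameter names are made up for illustration; a real system would persist this state in those local DBs):

```python
import time

class TempBlocklist:
    """Temporary IP blocklist whose block duration doubles on each repeat offense."""

    def __init__(self, base_ttl=60.0, factor=2.0, max_ttl=86400.0):
        self.base_ttl = base_ttl    # first block: 60 seconds
        self.factor = factor        # each repeat offense multiplies the TTL
        self.max_ttl = max_ttl      # cap at one day
        self.strikes = {}           # ip -> times flagged so far
        self.expires = {}           # ip -> unix time when the block lapses

    def flag(self, ip, now=None):
        """Record an offense and return the TTL applied to this block."""
        now = time.time() if now is None else now
        n = self.strikes.get(ip, 0)
        ttl = min(self.base_ttl * self.factor ** n, self.max_ttl)
        self.strikes[ip] = n + 1
        self.expires[ip] = now + ttl
        return ttl

    def is_blocked(self, ip, now=None):
        now = time.time() if now is None else now
        return self.expires.get(ip, 0) > now
```

The strike count persists after a block expires, so an IP that keeps coming back earns ever-longer timeouts without any single block being permanent.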

pixl97 8 hours ago||
In the real-time spam-list market, the lists worked well under honest groups for a bit, but things started falling apart once good lists got taken over by actors who realized they could use their position to make more money. It's a really difficult trap to avoid.
xyzal 7 hours ago||
For the lulz
superkuh 7 hours ago||
Of course Googlebot, Bingbot, Applebot, Amazonbot, YandexBot, etc. from the major corps are HTTP user-agent spiders whose downloaded public content will be used by those corporations for AI training too. Might as well just drop the "AI" and say "corporate scrapers".
rob 8 hours ago||
"/brainstorming git checkout this miasma repo source code and implement a fix to prevent the scraper from not working on sites that use this tool"
foxes 8 hours ago||
Wonder if you can avoid hiding it, to make it more believable.

Why not have a Library of Babel-esque labyrinth visible to normal users on your website?

Like anti-surveillance clothing: something they have to sift through.

imdsm 10 hours ago|
Applied model collapse