Posted by LucidLynx 15 hours ago

Miasma: A tool to trap AI web scrapers in an endless poison pit (github.com)
275 points | 204 comments | page 5
firekey_browser 12 hours ago|
[dead]
pugchat 7 hours ago||
[dead]
obsidianbases1 9 hours ago||
I know there are real world problems to deal with, but at least I got one over on that evil open claw instance /s
GaggiX 13 hours ago||
These projects are the new "To-Do List" app.
obsidianbases1 12 hours ago||
Why do this though?

It's like if someone was trying to "trap" search crawlers back in the early 2000s.

Seems counterproductive

bilekas 12 hours ago||
Because of bots that don't respect robots.txt.

If you want an AI bot to crawl your website while you pay for that bandwidth, then you won't use the tool.

obsidianbases1 9 hours ago||
If bandwidth cost is a concern, then maybe you should reconsider how you publish your site.

Like, what if you actually post something that gains traction, is it going to bankrupt you or something?

bilekas 8 hours ago||
It's not just financial, you're taking up a lot of bandwidth, resources etc.

It's not just some light bump in traffic. It's a headache that shouldn't need to be dealt with if they would respect robots.txt. Quite simple really.

integralid 11 hours ago|||
Search crawlers used to bring people TO your site; LLM bots are used to keep people OUT of your site, because knowledge is indexed and distributed by corporations.
obsidianbases1 9 hours ago||
So if your site is dependent on ads, and since the only way for people to see those ads is coming to your site, then yes, you lose.

If your site exists to share information, then the information gets disseminated. Whether via LLM or some browser, it doesn't make a difference to me.

lelanthran 9 hours ago|||
Those are not the only two options.

Why are you presenting the latter option as if it were mainstream? It's such a small percentage of use cases that it probably isn't even a rounding error.

People who want to disseminate information also want the credit.

I'd still like to know why you are presenting this false dichotomy. What reason do you have for presenting a use case that has fractions of a percentage as if it were a standard use case? What is your motivation behind this?

obsidianbases1 9 hours ago|||
My only motivation is that it pains me to see smart capable people working on insignificant problems.

Maybe I don't understand the problem as well as I should, and I'm open to hearing what it is you think that I'm missing.

But from my perspective, this is a solution for a non-problem, which in my eyes is a problem itself.

lelanthran 8 hours ago||
You misunderstand: I am asking what is your motivation for presenting a 0.0001% use case as a 50% use case.

The use case you present is so small it can be ignored as an option, yet you present it as the only other option.

joquarky 6 hours ago|||
> People who want to disseminate information also want the credit.

This is psychological projection.

lelanthran 6 hours ago||
> This is psychological projection.

You don't know what that means.

In any case, people who want to disseminate information with credit can do so without standing up a blog (any place that allows posting of comments, such as Reddit, HN, etc).

In the context of this discussion, we're talking about site owners; people who put up a blog.

aarjaneiro 9 hours ago|||
You don't get attribution for your work if it merely feeds into an LLM's training data.
obsidianbases1 9 hours ago||
That assumes the AI bots are scraping for training data and not simple retrieval/RAG (which would likely provide attribution).
Forgeties79 12 hours ago||
Web crawlers didn’t routinely take down public resources or use the scraped info to generate facsimiles that people are still having ethical debates over. Their presence didn’t even register, and their indexing helped the sites they crawled. It isn’t remotely the same thing.

https://www.libraryjournal.com/story/ai-bots-swarm-library-c...

obsidianbases1 9 hours ago||
AI bots must've taken down that link you shared, it won't load :/

And search crawlers/results have been producing snippets that prevent users from clicking to the source for well over a decade.

Edit: it loaded. I don't see how the problem isn't simply solved by an off-the-shelf solution like Cloudflare. In the real world, you wouldn't open up a space/location if you couldn't handle the throughput. Why should online spaces/locations get special treatment?

Forgeties79 7 hours ago||
Why should everyone else pay the price for VC-funded, private companies? They should incur the cost.

This is no different than saying “robbers aren’t causing any problems, you just need to lock your doors, buy and set up sensors on every point of potential ingress, and pay a monthly cost for an alarm system. That’s on you.”

splitbrainhack 13 hours ago||
-1 for the name
QuantumNomad_ 13 hours ago|
https://en.wikipedia.org/wiki/Miasma_theory

Seems a clever and fitting name to me. A poison pit would probably smell bad. And at the same time, the theory that this tool would actually cause “illness” (bad training data) in AI is not proven.

jstanley 10 hours ago|
If you want to ruin someone's web experience based on what kind of thing they are, rather than the content of their character, consider that you might be the baddies.
mrweasel 10 hours ago||
If you're constantly being harassed by someone and despite your best efforts, nothing is being done to help you, quite the opposite in fact, tons of people cheer your assailant on in the name of profit and progress, it's only natural that you lash out.

It's not all that productive, it's an act of desperation. If you can't stop the enemy, at least you can make their action more costly.

One positive outcome I could see is AI companies becoming more critical of their training data.

Apocryphon 6 hours ago|||
You’re gonna have to try harder to sneak in the a priori assumption that LLMs have any character beyond which corporation deployed them.
lifeformed 9 hours ago||
What "content of character" do you ascribe to a web scraper?
jstanley 9 hours ago||
You don't, that's why it's unethical to block them.

If you keep getting harassed by people wearing black hoodies, would it be ethical to start taking countermeasures against all people who wear black hoodies?

lelanthran 9 hours ago||
If they are coming to my door to harass me, then yes, it makes sense to take countermeasures against all black-hoodie wearers when I see them at the door.