Do they do any harm? They do provide sources for material if users ask for it. (I frequently do, because I don't trust them, so I check the sources.)
You still need to pay for the traffic, and serving static content (like text on that website) is way less CPU/disk expensive than generating anything.
Not to me, but I've known people who have had their sites DDoSed out of existence by the scrapers. On the internet, it's often the smallest sites with the smallest budgets that have the best content, and those are hit the worst.
> They do provide sources for material if users ask for it
Not for material they trained on. Those sources are just Google results for the question you asked. By nature, they cannot cite the information gathered by their crawlers.
> You still need to pay for the traffic
It's so little traffic my hosting provider doesn't bother billing me for it.
> and serving static content (like text on that website) is way less CPU/disk expensive than generating anything.
Sure, but it's the principle of the thing: I don't like when billion dollar companies steal my work, and then use it to make the internet a worse place by filling it with AI slop/spam. If I can make their lives harder and their product worse for virtually no cost, I will.
Most of the real use seems to be surveillance, spam, ads, tracking, slop, crawlers, hype, dubious financial deals and sucking energy.
Oh yeah, and your kid can cheat on their book report or whatever. Great.
It has to be said, though, that all three of the things above are feared/considered taboo/cause for mocking, while making a quick buck at the cost of poisoning the commons gives universal bragging rights. Go figure.
because an infinite site that has appeared out of nowhere will quickly be noticed and blocked
start it off small, and grow it by a few pages every day
and the existing pages should stay 99% the same between crawls to gain reputation
One way to keep things mostly the same without having to store any of it yourself:
1. Use an RNG seeded from the request URL itself to generate each page. This is already enough for an unchanging static site of finite or infinite size.
2. With each word the generator outputs, generate a random number between, say, 0 and 1000. On day i, replace the about-to-be-output word with a link if this random number is between 0 and i. This way, roughly another 0.1% of words will turn into links each day, with the rest of the text remaining stable over time.
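A minimal Python sketch of that scheme (the vocabulary, the /babble/ path, and all names here are illustrative, not taken from any real generator):

import hashlib
import random

VOCAB = ["lorem", "ipsum", "dolor", "sit", "amet"]  # placeholder word list

def generate_page(url: str, day: int, length: int = 500) -> str:
    # Seed the RNG from the request URL, so the same URL always produces the same page.
    seed = int.from_bytes(hashlib.sha256(url.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    words = []
    for _ in range(length):
        word = rng.choice(VOCAB)
        roll = rng.randint(0, 1000)   # per-word roll, identical on every day
        target = rng.getrandbits(32)  # always drawn, so the RNG stream stays aligned across days
        if roll <= day:
            # On day i, rolls in [0, i] become links: roughly another 0.1% of words per day.
            words.append('<a href="/babble/%08x">%s</a>' % (target, word))
        else:
            words.append(word)
    return " ".join(words)

Drawing the link target even when it isn't used is what keeps the existing text byte-for-byte stable: the only thing that changes between days is the roll-versus-day comparison, never the random stream itself.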
User-agent: Googlebot PetalBot Bingbot YandexBot Kagibot
Disallow: /bomb/*
Disallow: /bomb
Disallow: /babble/*
Sitemap: https://maurycyz.com/sitemap.xml
I think this is telling the bot named "Googlebot PetalBot Bingbot YandexBot Kagibot" - which doesn't exist - not to visit those URLs. All other bots are allowed to visit them. User-agent is supposed to be one per line, and there's no "User-agent: *" specified here.

So a much simpler solution than setting up a Markov generator might be for the site owner to just specify a valid robots.txt. It's not evident to me that the bots which do crawl this site are in fact breaking any rules. I also suspect that Googlebot, being served the Markov slop, will view this as spam. Meanwhile, this incentivizes AI companies to build heuristics to detect this kind of thing rather than building rules-respecting crawlers.
User-agent: Googlebot
User-agent: PetalBot
User-agent: Bingbot
User-agent: YandexBot
User-agent: Kagibot
Disallow: /bomb/*
Disallow: /bomb
Disallow: /babble/*
Sitemap: https://maurycyz.com/sitemap.xml

I've thought about tying a hidden link, excluded in robots.txt, to fail2ban. Seems quick and easy with no side effects, but I've never actually gotten around to it.
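For what it's worth, a rough sketch of that idea (the /trap/ path, the filter name, and the nginx log path are all hypothetical; adjust to your own setup): disallow a hidden path in robots.txt, link to it somewhere humans won't click, and let fail2ban ban anything that requests it.

# robots.txt: tell rule-following bots to stay away from the trap
User-agent: *
Disallow: /trap/

# /etc/fail2ban/filter.d/robots-trap.conf: match any request for the trap path in the access log
[Definition]
failregex = ^<HOST> .*"(GET|POST|HEAD) /trap/
ignoreregex =

# /etc/fail2ban/jail.local: ban on the first hit for a day
[robots-trap]
enabled  = true
port     = http,https
filter   = robots-trap
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 86400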
Eh? That's the speed of an old-school spinning hard disk.
Or is the scraping happening in real time due to the web search features in AI apps? (Cheaper to load the same page again than to cache it?)
If you're in a hurry to race to market, it's very likely you'll run into these issues and find yourself tempted to cut corners, and unfortunately, with nearly unbounded cloud spend, cutting corners in a large-scale crawler operation can very believably cause major disruption all over the web.
OTOH, I doubt most scrapers are trying to scrape this kind of content anyway, since in general it's (a) JSON, not the natural language they crave, and (b) to even discover those links, which are usually generated dynamically by client-side JS rather than appearing as plain <a>...</a> HTML links, they would probably need to run a full JS engine, which is considerably harder to get working and considerably more expensive per request.
I want to redirect all LLM-crawlers to that site.
> You don’t really need any bot detection: just linking to the garbage from your main website will do. Because each page links to five more garbage pages, the crawler’s queue will quickly fill up with an exponential amount of garbage until it has no time left to crawl your real site.
> If a link is posted somewhere, the bots will know it exists,
> Unfortunately, based on what I'm seeing in my logs, I do need the bot detection. The crawlers that visit me have a list of URLs to crawl; they do not immediately visit newly discovered URLs, so it would take a very, very long time to fill their queue. I don't want to give them that much time.
A single site doing this does nothing. But many sites doing this have a severe negative impact on the utility of AI scrapers - at least, until a countermeasure is developed.