Posted by ColinWright 3 days ago
I know a thing or two about web scraping.
Some sites return 404 status codes as protection, hoping you'll skip them, so my crawler hammers away with several faster crawling methods as fallbacks (curl_cffi).
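A rough sketch of that fallback, assuming curl_cffi's requests-style API (the fetch() helper and the retry-on-403/404 policy here are my own illustration):

    import requests
    from curl_cffi import requests as curl_requests

    def fetch(url):
        # First try a plain request.
        resp = requests.get(url, timeout=10)
        if resp.status_code in (403, 404):
            # Some sites serve fake 404s to non-browser clients; retry
            # with curl_cffi impersonating a real browser's TLS/HTTP2
            # fingerprint ("chrome" is a built-in impersonation target).
            resp = curl_requests.get(url, impersonate="chrome", timeout=10)
        return resp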
Zip bombs also don't work on me. Reading the Content-Length header is enough to decide not to read the page/file, and I apply a byte limit to check whether a response is too big for me. For the remaining cases a read timeout is enough.
Oh, and did you know that the requests timeout is not really a timeout for reading the page? A server can spoonfeed you bytes, one after another, and the timeout will never fire.
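A sketch of those guards combined (the 5 MB cap and 30-second deadline are placeholders): stream the body, so a drip-feeding server runs into a wall-clock deadline rather than just requests' per-chunk read timeout.

    import time
    import requests

    MAX_BYTES = 5 * 1024 * 1024  # placeholder byte limit
    MAX_SECONDS = 30             # placeholder total deadline

    def bounded_get(url):
        # timeout=(connect, read): the read value only bounds the gap
        # between chunks, not the total transfer time.
        resp = requests.get(url, stream=True, timeout=(5, 10))

        # Cheap first check: trust Content-Length when the server sends it.
        length = resp.headers.get("Content-Length")
        if length and int(length) > MAX_BYTES:
            resp.close()
            raise ValueError("response advertises too many bytes")

        # Enforce a byte cap and a wall-clock deadline while streaming,
        # so byte-by-byte spoonfeeding cannot stall the crawler forever.
        body = bytearray()
        start = time.monotonic()
        for chunk in resp.iter_content(chunk_size=8192):
            body.extend(chunk)
            if len(body) > MAX_BYTES:
                resp.close()
                raise ValueError("response exceeded byte limit")
            if time.monotonic() - start > MAX_SECONDS:
                resp.close()
                raise TimeoutError("response exceeded total deadline")
        return bytes(body)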
That is why I created my own crawling system to mitigate these problems, and to have one consistent means of running Selenium.
https://github.com/rumca-js/crawler-buddy
Based on library
If scrapers were as well-behaved as humans, website operators wouldn't bother to block them[1]. It's the abuse that motivates the animus and action. As the fine article spelled out, scrapers are greedy in many ways, one of which is trying to slurp down as many URLs as possible without wasting bytes. Not enough people know about Common Crawl, or know how to write multithreaded scrapers with high utilization across domains without suffocating any single one. If your scraper is a URL FIFO or stack in a loop, you're just DoSing one domain at a time (see the sketch after the footnote).
1. The most successful scrapers avoid standing out in any way
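A minimal sketch of the alternative, assuming one queue per domain plus a politeness delay (the Frontier class and the 2-second default are my own illustration, not any particular scraper's code):

    import time
    import threading
    from collections import defaultdict, deque
    from urllib.parse import urlparse

    class Frontier:
        # Round-robin frontier: one queue per domain plus a politeness
        # delay, instead of a single FIFO that drains one host at a time.
        def __init__(self, min_delay=2.0):
            self.min_delay = min_delay        # seconds between hits per domain
            self.queues = defaultdict(deque)  # domain -> pending URLs
            self.next_ok = defaultdict(float) # domain -> earliest next fetch
            self.lock = threading.Lock()

        def add(self, url):
            with self.lock:
                self.queues[urlparse(url).netloc].append(url)

        def pop(self):
            # Hand out any URL whose domain has cooled down; worker
            # threads calling this spread load across domains instead
            # of hammering one.
            with self.lock:
                now = time.monotonic()
                for domain, queue in self.queues.items():
                    if queue and now >= self.next_ok[domain]:
                        self.next_ok[domain] = now + self.min_delay
                        return queue.popleft()
            return None  # nothing eligible yet; caller should back off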
But this is why lots of sites implement captchas and other mechanisms to detect, frustrate, or trap automated activity - because plenty of bots run in browsers too.
But I think it's moot: parsing HTML is not very expensive if you don't have to actually render it.
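To illustrate: extracting links with Python's stdlib parser is a single pass over the markup, with no layout, no JavaScript, and no rendering (the LinkParser class here is just an illustration):

    from html.parser import HTMLParser

    class LinkParser(HTMLParser):
        # Collects href attributes from anchor tags in one pass.
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    p = LinkParser()
    p.feed('<html><body><a href="/about">About</a></body></html>')
    print(p.links)  # ['/about']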
EDIT: I was chastised; here's the original text of my comment: Did you read the article or just the title? They aren't claiming it's abusive. They're saying it's a viable signal to detect and ban bots.
> A few of these came from user-agents that were obviously malicious:
(I love the idea that they consider any Python or Go request to be a malicious scraper...)
> Last Sunday I discovered some abusive bot behaviour [...]
The real money is in monetizing ad responses to AI scrapers so that LLMs are biased toward recommending certain products. The stealth startup I've founded does exactly this. Ad-poisoning-as-a-service is a huge untapped market.
That's what zip bombs are for.
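For anyone unfamiliar, a minimal sketch of the idea (the sizes here are illustrative): highly repetitive data compresses to almost nothing, so the server pays kilobytes of bandwidth while a scraper that naively decompresses the response pays megabytes of memory.

    import gzip
    import io

    def make_gzip_bomb(inflated_mb=10):
        # Megabytes of zeros compress to a few kilobytes; served with
        # Content-Encoding: gzip, a careless client inflates the whole
        # thing in memory while the sender barely pays any bandwidth.
        buf = io.BytesIO()
        chunk = b"\0" * (1024 * 1024)
        with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
            for _ in range(inflated_mb):
                gz.write(chunk)
        return buf.getvalue()

    bomb = make_gzip_bomb()
    print(len(bomb))  # a few KB standing in for 10 MB of payload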
50:1 compression ratio, but it's legitimately an implementation of a Rubik's cube. I wasn't actually making it as any sort of trap; I just wasn't thinking about file size. So any rule that filters it out is going to have a nasty false-positive rate.
1) A coordinated effort among different sites will have a much greater chance of poisoning a model's data, so long as they can avoid any post-scraping deduplication or filtering.
2) I wonder if copyright law can be used to amplify the cost of poisoning here. Perhaps if the poisoned content is something that has already been aggressively litigated over, the copyright owner will go after them when the model can be shown to contain that banned data. This may open site owners up to the legal risk of distributing this content though… not sure. A cooperative effort with a copyright holder may sidestep this risk, but they would have to have the means and the will to litigate.
No, it's just background internet scanning noise.
If you were writing a script to mass-scan the web for vulnerabilities, you would want to collect as many HTTP endpoints as possible. JS files, whether or not they're commented out, are a great way to find endpoints in modern web applications.
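A rough sketch of why: even a trivial regex over fetched JS surfaces path-like string literals, commented out or not (the pattern below is illustrative, not exhaustive):

    import re

    # Matches quoted path-like strings such as "/api/v1/users".
    ENDPOINT_RE = re.compile(r'["\'](/[\w./-]+(?:\?[\w=&.-]*)?)["\']')

    def extract_endpoints(js_source):
        return set(ENDPOINT_RE.findall(js_source))

    js = '''
    // old code, disabled:
    // fetch("/api/v1/admin/export?full=1")
    fetch("/api/v1/users");
    '''
    print(extract_endpoints(js))
    # Both endpoints are found, including the commented-out one --
    # which is exactly why a scanner wants the JS file anyway.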
If you were writing a scraper to collect source code to train LLMs on, I doubt you would care as much about a commented-out JS file. I'm not sure you'd even want to train on random low-quality JS served by websites. Can anyone familiar with LLM training data collection comment on this?