Posted by ColinWright 3 days ago
I know a thing or two about web scraping.
Some sites return 404 status codes as protection, hoping you'll skip them, so my crawler hammers away with several faster crawling methods as fallbacks (curl_cffi).
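A rough sketch of that fallback, assuming curl_cffi's requests-style API (the fetch() helper and the retry-on-403/404 policy here are my own illustration):

    import requests
    from curl_cffi import requests as curl_requests

    def fetch(url):
        # First try a plain request.
        resp = requests.get(url, timeout=10)
        if resp.status_code in (403, 404):
            # Some sites serve fake 404s to non-browser clients; retry
            # with curl_cffi impersonating a real browser's TLS/HTTP2
            # fingerprint ("chrome" is a built-in impersonation target).
            resp = curl_requests.get(url, impersonate="chrome", timeout=10)
        return resp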
Zip bombs also don't work on me. Reading the Content-Length header is enough to decide not to read the page/file, and I apply a byte limit to check whether a response is too big for me. For the remaining cases a read timeout is enough.
Oh, and did you know that the requests timeout is not really a timeout for reading the page? A server can spoonfeed you bytes, one after another, and the timeout will never fire.
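A sketch of those guards combined (the 5 MB cap and 30-second deadline are placeholders): stream the body, so a drip-feeding server runs into a wall-clock deadline rather than just requests' per-chunk read timeout.

    import time
    import requests

    MAX_BYTES = 5 * 1024 * 1024  # placeholder byte limit
    MAX_SECONDS = 30             # placeholder total deadline

    def bounded_get(url):
        # timeout=(connect, read): the read value only bounds the gap
        # between chunks, not the total transfer time.
        resp = requests.get(url, stream=True, timeout=(5, 10))

        # Cheap first check: trust Content-Length when the server sends it.
        length = resp.headers.get("Content-Length")
        if length and int(length) > MAX_BYTES:
            resp.close()
            raise ValueError("response advertises too many bytes")

        # Enforce a byte cap and a wall-clock deadline while streaming,
        # so byte-by-byte spoonfeeding cannot stall the crawler forever.
        body = bytearray()
        start = time.monotonic()
        for chunk in resp.iter_content(chunk_size=8192):
            body.extend(chunk)
            if len(body) > MAX_BYTES:
                resp.close()
                raise ValueError("response exceeded byte limit")
            if time.monotonic() - start > MAX_SECONDS:
                resp.close()
                raise TimeoutError("response exceeded total deadline")
        return bytes(body)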
That is why I created my own crawling system to mitigate these problems, and to have one consistent means of running Selenium.
https://github.com/rumca-js/crawler-buddy
Based on library
If scrapers were as well-behaved as humans, website operators wouldn't bother to block them[1]. It's the abuse that motivates the animus and action. As the fine article spelled out, scrapers are greedy in many ways, one of which is trying to slurp down as many URLs as possible without wasting bytes. Not enough people know about Common Crawl, or know how to write multithreaded scrapers with high utilization across domains without suffocating any single one. If your scraper is a URL FIFO or stack in a loop, you're just DoSing one domain at a time (see the sketch after the footnote).
1. The most successful scrapers avoid standing out in any way
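A minimal sketch of the alternative, assuming one queue per domain plus a politeness delay (the Frontier class and the 2-second default are my own illustration, not any particular scraper's code):

    import time
    import threading
    from collections import defaultdict, deque
    from urllib.parse import urlparse

    class Frontier:
        # Round-robin frontier: one queue per domain plus a politeness
        # delay, instead of a single FIFO that drains one host at a time.
        def __init__(self, min_delay=2.0):
            self.min_delay = min_delay        # seconds between hits per domain
            self.queues = defaultdict(deque)  # domain -> pending URLs
            self.next_ok = defaultdict(float) # domain -> earliest next fetch
            self.lock = threading.Lock()

        def add(self, url):
            with self.lock:
                self.queues[urlparse(url).netloc].append(url)

        def pop(self):
            # Hand out any URL whose domain has cooled down; worker
            # threads calling this spread load across domains instead
            # of hammering one.
            with self.lock:
                now = time.monotonic()
                for domain, queue in self.queues.items():
                    if queue and now >= self.next_ok[domain]:
                        self.next_ok[domain] = now + self.min_delay
                        return queue.popleft()
            return None  # nothing eligible yet; caller should back off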
But this is why lots of sites implement captchas and other mechanisms to detect, frustrate, or trap automated activity - because plenty of bots run in browsers too.
But I think it's moot: parsing HTML is not very expensive if you don't have to actually render it.
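To illustrate: extracting links with Python's stdlib parser is a single pass over the markup, with no layout, no JavaScript, and no rendering (the LinkParser class here is just an illustration):

    from html.parser import HTMLParser

    class LinkParser(HTMLParser):
        # Collects href attributes from anchor tags in one pass.
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    p = LinkParser()
    p.feed('<html><body><a href="/about">About</a></body></html>')
    print(p.links)  # ['/about']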
EDIT: I was chastised; here's the original text of my comment: Did you read the article or just the title? They aren't claiming it's abusive. They're saying it's a viable signal to detect and ban bots.
> A few of these came from user-agents that were obviously malicious:
(I love the idea that they consider any Python or Go request to be a malicious scraper...)
> Last Sunday I discovered some abusive bot behaviour [...]
The real money is in monetizing ad responses to AI scrapers so that LLMs are biased toward recommending certain products. The stealth startup I've founded does exactly this. Ad-poisoning-as-a-service is a huge untapped market.
That's what zip bombs are for.
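For anyone unfamiliar, a minimal sketch of the idea (the sizes here are illustrative): highly repetitive data compresses to almost nothing, so the server pays kilobytes of bandwidth while a scraper that naively decompresses the response pays megabytes of memory.

    import gzip
    import io

    def make_gzip_bomb(inflated_mb=10):
        # Megabytes of zeros compress to a few kilobytes; served with
        # Content-Encoding: gzip, a careless client inflates the whole
        # thing in memory while the sender barely pays any bandwidth.
        buf = io.BytesIO()
        chunk = b"\0" * (1024 * 1024)
        with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
            for _ in range(inflated_mb):
                gz.write(chunk)
        return buf.getvalue()

    bomb = make_gzip_bomb()
    print(len(bomb))  # a few KB standing in for 10 MB of payload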
50:1 compression ratio, but it's legitimately an implementation of a Rubik's cube. I wasn't actually making it as any sort of trap; I just wasn't thinking about file size. So any rule that filters it out is going to have a nasty false-positive rate.
1) A coordinated effort among different sites will have a much greater chance of poisoning a model's data, so long as they can avoid any post-scraping deduplication or filtering.
2) I wonder if copyright law can be used to amplify the cost of poisoning here. Perhaps if the poisoned content is something that has already been aggressively litigated over, the copyright owner will go after them when the model can be shown to contain that banned data. This may open site owners up to the legal risk of distributing this content though… not sure. A cooperative effort with a copyright holder may sidestep this risk, but they would have to have the means and the will to litigate.
No, it's just background internet scanning noise.
If you were writing a script to mass-scan the web for vulnerabilities, you would want to collect as many HTTP endpoints as possible. JS files, whether or not they're commented out, are a great way to find endpoints in modern web applications.
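A rough sketch of why: even a trivial regex over fetched JS surfaces path-like string literals, commented out or not (the pattern below is illustrative, not exhaustive):

    import re

    # Matches quoted path-like strings such as "/api/v1/users".
    ENDPOINT_RE = re.compile(r'["\'](/[\w./-]+(?:\?[\w=&.-]*)?)["\']')

    def extract_endpoints(js_source):
        return set(ENDPOINT_RE.findall(js_source))

    js = '''
    // old code, disabled:
    // fetch("/api/v1/admin/export?full=1")
    fetch("/api/v1/users");
    '''
    print(extract_endpoints(js))
    # Both endpoints are found, including the commented-out one --
    # which is exactly why a scanner wants the JS file anyway.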
If you were writing a scraper to collect source code to train LLMs on, I doubt you would care as much about a commented-out JS file. I'm not sure you'd even want to train on random low-quality JS served by websites. Can anyone familiar with LLM training data collection comment on this?