Posted by ColinWright 10/31/2025

AI scrapers request commented scripts (cryptography.dog)
266 points | 220 comments
renegat0x0 10/31/2025|
Most web scrapers, even the illegal ones, are for... business. So they scrape Amazon, or shops. So yeah, most unwanted traffic is from big tech, or from bad actors trying to sniff out vulnerabilities.

I know a thing or two about web scraping.

Some sites return 404 status codes as protection, so that you skip them; my crawler, like a hammer, responds by trying several faster crawling methods (curl_cffi among them).

Zip bombs also don't work on me. Reading the Content-Length header is enough to decide not to read the page/file, and I set a byte limit to check that the response is not too big for me. For other cases a read timeout is enough.

Oh, and did you know that the requests timeout is not really a timeout for reading the whole page? A server can spoon-feed you bytes, one after another, and the timeout will never trigger.
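
A minimal sketch of the mitigation (simplified, not the actual crawler-buddy code): stream the response and enforce both a byte cap and a wall-clock deadline yourself, since timeout= in requests only bounds individual socket reads.

    # Hypothetical limits; tune to taste.
    import time
    import requests

    MAX_BYTES = 1_000_000
    MAX_SECONDS = 20

    def bounded_get(url):
        resp = requests.get(url, stream=True, timeout=(5, 10))  # connect, per-read
        # Content-Length is optional (and computed after Content-Encoding),
        # so treat it only as a cheap early-exit hint.
        declared = resp.headers.get("Content-Length")
        if declared and int(declared) > MAX_BYTES:
            resp.close()
            return None
        body = b""
        deadline = time.monotonic() + MAX_SECONDS
        for chunk in resp.iter_content(chunk_size=8192):
            body += chunk
            if len(body) > MAX_BYTES or time.monotonic() > deadline:
                resp.close()  # give up on oversized or slow-drip responses
                return None
        return body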

That is why I created my own crawling system to mitigate these problems and to have one consistent means of running Selenium.

https://github.com/rumca-js/crawler-buddy

Based on the library

https://github.com/rumca-js/webtoolkit

hnav 10/31/2025||
Content-Length is computed after Content-Encoding.
ahoka 11/1/2025||
If it’s present at all.
1vuio0pswjnm7 11/1/2025|||
Is there a difference between "scraping" and "crawling"?
Mars008 11/1/2025||
Looks like it's time for in-browser scrapers. They will be indistinguishable from the server's side. With an AI driver they can pass even human tests.
overfeed 11/1/2025|||
> Looks like it's time for in-browser scrapers.

If scrapers were as well-behaved as humans, website operators wouldn't bother to block them[1]. It's the abuse that motivates the animus and action. As the fine article spelt out, scrapers are greedy in many ways, one of which is trying to slurp down as many URLs as possible without wasting bytes. Not enough people know about Common Crawl, or know how to write multithreaded scrapers with high utilization across domains without suffocating any single one (a rough sketch of that follows below). If your scraper is a URL FIFO or stack in a loop, you're just DoSing one domain at a time.

1. The most successful scrapers avoid standing out in any way
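
A rough sketch of per-domain politeness (illustrative only; the delay value is made up): keep one queue per host plus a "next allowed time", so workers spread load across many domains instead of draining a single FIFO against one of them.

    import collections
    import threading
    import time
    from urllib.parse import urlparse

    PER_DOMAIN_DELAY = 5.0  # seconds between hits to the same host (hypothetical)

    queues = collections.defaultdict(collections.deque)  # host -> pending URLs
    next_ok = collections.defaultdict(float)             # host -> earliest next fetch
    lock = threading.Lock()

    def enqueue(url):
        with lock:
            queues[urlparse(url).netloc].append(url)

    def take_url():
        """Return a URL from any host whose politeness delay has elapsed."""
        with lock:
            now = time.monotonic()
            for host, q in queues.items():
                if q and now >= next_ok[host]:
                    next_ok[host] = now + PER_DOMAIN_DELAY
                    return q.popleft()
        return None  # nothing ready; caller should sleep briefly

    def worker(fetch):
        while True:
            url = take_url()
            if url is None:
                time.sleep(0.1)
                continue
            fetch(url)  # fetch() is supplied by the caller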

Mars008 11/1/2025||
The question is: who runs them? There are only a few big companies like MS, Google, OpenAI, and Anthropic, but from the posts here it looks like there are hordes of buggy scrapers run by enthusiasts.
luckylion 11/1/2025|||
Ad companies (even the small ones), "Brand Protection" companies, IP lawyers looking for images that were used without a license, brand marketing companies, and, where it matters, also your competitors, etc.
iamacyborg 11/1/2025|||
Lots of “data” companies out there that want to sell you scraped data sets.
bartread 11/1/2025||||
Not a new idea. For years now, on the occasions I’ve needed to scrape, I’ve used a set of ViolentMonkey scripts. I’ve even considered creating an extension, but have never really needed it enough to do the extra work.

But this is why lots of sites implement captchas and other mechanisms to detect, frustrate, or trap automated activity - because plenty of bots run in browsers too.

eur0pa 11/1/2025|||
you mean OpenAI Atlas?
rokkamokka 10/31/2025||
I'm not overly surprised; it's probably faster to search the text for http/https than to parse the DOM.
embedding-shape 10/31/2025||
Not just probably: searching through plaintext (which they seem to be doing) vs. iterating over the DOM involve vastly different amounts of work in terms of resources used and performance, so "probably" is way underselling the difference :)
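
A toy illustration of the difference (not taken from any real scraper): a plaintext regex happily matches a URL inside an HTML comment, which a DOM walk would never surface as a script element.

    import re

    html = """
    <html><body>
    <!-- <script src="https://example.com/old-analytics.js"></script> -->
    <a href="https://example.com/page">real link</a>
    </body></html>
    """

    # Naive plaintext scan: returns both URLs, including the commented-out one.
    print(re.findall(r'https?://[^\s"\'<>]+', html))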
franktankbank 10/31/2025||
Reminds me of the shortcut that works for the happy path but is utterly fucked by real data. This is an interesting trap; can it easily be avoided without walking the DOM?
embedding-shape 10/31/2025||
Yes: parse out HTML comments, which is also kind of trivial if you've ever done any sort of parsing. Listen for "<!--" and, whenever you come across it, ignore everything until the next "-->". But then again, these people are using AI to build scrapers, so I wouldn't put too much pressure on them to produce high-quality software.
jcheng 10/31/2025|||
It's not quite as trivial as that; one could start the page with a <script> tag that contains "<!--" without matching "-->", and that would hide all the content from your scraper but not from real browsers.

But I think it's moot; parsing HTML is not very expensive if you don't have to actually render it.
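
For instance, a small sketch with Python's standard-library parser (an assumption, not any particular scraper's approach): comment contents go to handle_comment and are never re-parsed, so commented-out script URLs get skipped without any hand-rolled "<!--" tracking.

    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.urls = []

        def handle_starttag(self, tag, attrs):
            # Collect href/src from real elements only; comments never get here.
            for name, value in attrs:
                if name in ("href", "src") and value:
                    self.urls.append(value)

    p = LinkCollector()
    p.feed('<!-- <script src="/ghost.js"></script> --> <a href="/real">x</a>')
    print(p.urls)  # ['/real']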

stevage 10/31/2025|||
Lots of other ways to include URLs in an HTML document that wouldn't be visible to a real user, though.
marginalia_nu 11/1/2025||
The regex approach is certainly easier to implement, but honestly static DOM parsing is pretty cheap, though quite fiddly to get right. You're probably going to be limited by network congestion (or ephemeral ports) before you run out of CPU time doing this type of crawling.
sharkjacobs 10/31/2025||
Fun to see practical applications of interesting research[1]

[1] https://news.ycombinator.com/item?id=45529587

Noumenon72 10/31/2025||
It doesn't seem that abusive. I don't comment things out thinking "this will keep robots from reading this".
mostlysimilar 10/31/2025||
The article mentions using this as a means of detecting bots, not as a complaint that it's abusive.

EDIT: I was chastised, here's the original text of my comment: Did you read the article or just the title? They aren't claiming it's abusive. They're saying it's a viable signal to detect and ban bots.
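
A sketch of that signal used as a honeypot (my own illustration, not the article's actual setup; the trap path is invented): embed a commented-out script URL that no real browser will ever request, then flag any client that asks for it.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    TRAP_PATH = "/static/legacy-tracker.js"  # hypothetical honeypot URL
    flagged = set()

    PAGE = f"""<html><body>
    <!-- <script src="{TRAP_PATH}"></script> -->
    <p>Ordinary content here.</p>
    </body></html>"""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == TRAP_PATH:
                flagged.add(self.client_address[0])  # only bots end up here
                self.send_response(403)
                self.end_headers()
                return
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(PAGE.encode())

    HTTPServer(("", 8000), Handler).serve_forever()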

ang_cire 10/31/2025|||
They call the scrapers "malicious", so they are definitely complaining about them.

> A few of these came from user-agents that were obviously malicious:

(I love the idea that they consider any python or go request to be a malicious scraper...)

pseudalopex 10/31/2025||||
Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".[1]

[1] https://news.ycombinator.com/newsguidelines.html

woodrowbarlow 10/31/2025|||
the first few words of the article are:

> Last Sunday I discovered some abusive bot behaviour [...]

foobarbecue 10/31/2025|||
Yeah but the abusive behavior is ignoring robots.txt and scraping to train AI. Following commented URLs was not the crime, just evidence inadvertently left behind.
mostlysimilar 10/31/2025|||
> The robots.txt for the site in question forbids all crawlers, so they were either failing to check the policies expressed in that file, or ignoring them if they had.
michael1999 10/31/2025||
Crawlers ignoring robots.txt is abusive. That they then start scanning all docs for commented URLs just adds to the pile of scummy behaviour.
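
For reference, the check being skipped takes only a few lines with Python's standard library (user agent and URLs here are placeholders):

    from urllib import robotparser

    rp = robotparser.RobotFileParser("https://example.com/robots.txt")
    rp.read()

    if rp.can_fetch("MyCrawler/1.0", "https://example.com/some/page"):
        print("allowed: go ahead and fetch")
    else:
        print("disallowed: skip this URL")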
tveyben 10/31/2025||
Human behavior is interesting - me, me, me…
stevage 10/31/2025||
The title is confusing; it should be "commented-out".
pimlottc 10/31/2025|
Agree, I thought maybe this was going to be a script to block AI scrapers or something like that.
zahlman 10/31/2025||
I thought it was going to be AI scraper operators getting annoyed that they have to run reasoning models on the scraped data to make use of it.
latenightcoding 10/31/2025||
When I used to crawl the web, battle-tested Perl regexes were more reliable than anything else; commented URLs would have been added to my queue.
rightbyte 10/31/2025|
DOM navigation for fetching some data is for tryhards. Using a regex to grab the correct paragraph or div or whatever is fine and is more robust versus things moving around on the page.
chaps 10/31/2025|||
Doing both is fine! Just, once you've figured out your regex and such, hardening/generalizing demands DOM iteration. It sucks but it is what it is.
horseradish7k 10/31/2025|||
but not when crawling. you don't know the page format in advance - you don't even know what the page contains!
bigbuppo 10/31/2025||
Sounds like you should give the bots exactly what they want... a 512MB file of random data.
kelseyfrog 11/1/2025||
That's leaving a lot of opportunity on the table.

The real money is in monetizing ad responses to AI scrapers so that LLMs are biased toward recommending certain products. The stealth startup I've founded does exactly this. Ad-poisoning-as-a-service is a huge untapped market.

bigbuppo 11/1/2025||
Now that's a paid subscription I can get behind, especially if it suggests that Meta should cut Rob Schneider a check for $200,000,000,000 to make more movies.
kelseyfrog 11/1/2025||
Contact info in bio. Always looking to make more happy customers.
aDyslecticCrow 10/31/2025|||
A scraper sinkhole of randomly generated, inter-linked files filled with AI poison could work. No human would click that link, so it leads to the "exclusive club".
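
A sketch of how such a sinkhole could generate its pages (hypothetical; the word list and path scheme are made up): every path deterministically yields gibberish plus links to more trap paths, so a crawler that follows them never runs out of URLs.

    import hashlib
    import random

    def trap_page(path, n_links=10):
        # Seed from the path so the same URL always produces the same page.
        rng = random.Random(hashlib.sha256(path.encode()).digest())
        words = " ".join(rng.choice(["lorem", "ipsum", "dolor", "sit", "amet"])
                         for _ in range(200))
        links = "".join(f'<a href="/maze/{rng.getrandbits(64):x}">more</a>\n'
                        for _ in range(n_links))
        return f"<html><body><p>{words}</p>{links}</body></html>"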
oytis 10/31/2025||
Outbound traffic normally costs more than inbound, so the asymmetry is set up wrong here. Data poisoning is probably the way.
zahlman 10/31/2025||
> Outbound traffic normally costs more than inbound, so the asymmetry is set up wrong here.

That's what zip bombs are for.

kelnos 10/31/2025|||
Most people have to pay for their bandwidth, though. That's a lot of data to send out over and over.
jcheng 10/31/2025||
512MB file of incredibly compressible data, then?
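
A quick sketch of that (sizes are arbitrary): 512 MB of zeros gzips down to well under a megabyte, so serving it pre-compressed with Content-Encoding: gzip costs almost no outbound bandwidth.

    import gzip

    CHUNK = b"\0" * (1024 * 1024)  # 1 MB of zeros
    with gzip.open("bomb.js.gz", "wb", compresslevel=9) as f:
        for _ in range(512):       # 512 MB uncompressed
            f.write(CHUNK)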
QuadmasterXLII 11/1/2025||
Could I recommend https://cubes.hgreer.com/ssg/output.html ?

50:1 compression ratio, but it's legitimately an implementation of a Rubik's cube. I wasn't making it as any sort of trap, I just wasn't thinking about file size, so any rule that filters it out is going to have a nasty false positive rate.

AlienRobot 11/1/2025||
512 MB of saying your service is the best service.
mikeiz404 10/31/2025||
Two thoughts here when it comes to poisoning unwanted LLM training-data traffic:

1) A coordinated effort among different sites will have a much greater chance of poisoning the data of a model so long as they can avoid any post scraping deduplication or filtering.

2) I wonder if copyright law can be used to amplify the cost of poisoning here. Perhaps if the poisoned content is something which has already been shown to be aggressively litigated against then the copyright owner will go after them when the model can be shown to contain that banned data. This may open up site owners to the legal risk of distributing this content though… not sure. A cooperative effort with a copyright holder may sidestep this risk but they would have to have the means and want to litigate.

Anamon 11/1/2025|
As for 1, it would be great to have this as a plugin for WordPress etc. that anyone could simply install and enable. Pre-processing images to dynamically poison them on each request should be fun, and also protect against a deduplication defense. I'd certainly install that.
throw_me_uwu 11/1/2025||
> most likely trying to non-consensually collect content for training LLMs

No, it's just background internet scanning noise

lucasluitjes 11/1/2025|
This.

If you were writing a script to mass-scan the web for vulnerabilities, you would want to collect as many http endpoints as possible. JS files, regardless of whether they're commented out or not, are a great way to find endpoints in modern web applications.

If you were writing a scraper to collect source code to train LLMs on, I doubt you would care as much about a commented-out JS file. I'm not sure you'd even want to train on random low-quality JS served by websites. Anyone familiar with LLM training data collection who can comment on this?

sokoloff 10/31/2025|
Well, if they’re going to request commented out scripts, serve them up some very large scripts…