
Posted by jeffpalmer 17 hours ago

Cloudflare crawl endpoint (developers.cloudflare.com)
384 points | 148 comments
devnotes77 16 hours ago|
To clarify the two questions raised:

First, the Cloudflare Crawl endpoint does not require the target site to use Cloudflare. It spins up a headless Chrome instance (via the Browser Rendering API) that fetches and renders any publicly accessible URL. You could crawl a site hosted on Hetzner or a bare VPS with the same call.
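For concreteness, here is a rough sketch of what submitting such a crawl might look like against the Browser Rendering REST API. The "/crawl" path and the body fields (url, limit) are assumptions for illustration, not a confirmed schema (check the Cloudflare docs), and the point stands that the target URL can be any publicly reachable site:

```python
import json
from urllib import request

# ACCOUNT_ID and API_TOKEN are placeholders; the endpoint path and body
# fields below are assumed for illustration, not a confirmed schema.
ACCOUNT_ID = "YOUR_ACCOUNT_ID"
API_TOKEN = "YOUR_API_TOKEN"

endpoint = (
    "https://api.cloudflare.com/client/v4/accounts/"
    f"{ACCOUNT_ID}/browser-rendering/crawl"
)
# The target can be any publicly accessible site; it does not have to be
# served through Cloudflare.
body = json.dumps({"url": "https://forum-on-hetzner.example/", "limit": 10})

req = request.Request(
    endpoint,
    data=body.encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_TOKEN}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# request.urlopen(req)  # uncomment with real credentials to submit the job
print(req.full_url)
```

The actual send is left commented out so the sketch runs without credentials.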

Second, on pricing: Browser Rendering is only available on the Workers Paid plan ($5/month). It is not part of the free tier. Usage is billed per invocation beyond the included quota; the exact limits are in the Cloudflare docs under Browser Rendering pricing, but for archival use cases with moderate crawl rates you are very unlikely to run into meaningful costs.

The practical gotcha for forum archival is pagination and authentication-gated content. If the forum requires a login to see older posts, a headless browser session with saved cookies would help, but that is more complex to orchestrate than a single-shot fetch.
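As a toy illustration of the pagination half of that gotcha, here is a sketch that enumerates hypothetical page URLs and shows where a saved session cookie would be attached. The URL pattern, `page` parameter, and cookie name are all made up; real forums vary (query params, /page/2/ paths, offsets):

```python
from urllib.parse import urlencode

# Hypothetical forum index; many forums expose pagination as a query param.
BASE = "https://forum.example/index.php"

def page_urls(last_page: int) -> list[str]:
    # Enumerate every page of the thread index so each one can be handed
    # to the crawler (or fetched directly).
    return [f"{BASE}?{urlencode({'page': n})}" for n in range(1, last_page + 1)]

# A saved session cookie for login-gated content would ride along as a
# plain header on each fetch; the cookie name/value here are placeholders.
session_headers = {"Cookie": "forum_session=PLACEHOLDER_VALUE"}

urls = page_urls(3)
print(urls[0])
```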

zyz 10 hours ago||
> Browser Rendering is only available on the Workers Paid plan ($5/month). It is not part of the free tier.

The post says it's available on both free and paid plans. According to the Browser Rendering pricing page, the free plan gets 10 minutes/day of browsing time.

gingerlime 8 hours ago||
[0] seems to suggest even paid plans are effectively limited to 500 web pages per day, right?

    Crawl jobs per day: 5
    Maximum pages per crawl: 100
[0] https://developers.cloudflare.com/browser-rendering/limits/#...
1vuio0pswjnm7 12 hours ago||
Can a CDN be a "walled garden"?
charcircuit 12 hours ago||
>Honors robots.txt

Is it possible to ignore robots.txt in cases where the crawl was triggered by a human?

memothon 17 hours ago||
I've used browser rendering at work and it's quite nice. Most solutions in the crawling space are kind of scummy and designed for side-stepping robots.txt and not being a good citizen. A crawl endpoint is a very necessary addition!
greatgib 16 hours ago||
All as expected: first they run a huge campaign calling out evil scrapers. We should use their service to make sure LLMs and bots can't come scrape our websites. Look how bad it is.

And once that is all well set up and they have their walled garden, they can present their own API to scrape websites, all nicely packaged for your LLM. But as you know, they are the gatekeeper, so the Mafia boss decides what "intermediary" fee is proper for itself, to let you do what you were doing without an intermediary before.

shadowfiend 16 hours ago|
No: https://developers.cloudflare.com/browser-rendering/rest-api...
greatgib 14 hours ago|||
That is funny because on this page there is a warning block with the following text:

   Refer to Will Browser Rendering bypass Cloudflare's Bot Protection? for instructions on creating a WAF skip rule.
And "Will Browser Rendering bypass Cloudflare's Bot Protection?" is a hash link to the FAQ page that, surprisingly, has nothing available for this link entry.

Is it because it was removed (or hidden), or because it is not yet available until everyone forgets the "we are not evil, we are here to protect the internet" line?

x0x0 15 hours ago|||
most websites, particularly those behind cloudflare, are very restrictive even to crawlers that obey robots. Proof: a ton of my time over the last year, and my crawlers very carefully obey robots.

It's hard to see how this isn't extorting folks by offering a working solution that, oh, cloudflare doesn't block. As long as you pay Cloudflare.

Perhaps I'm overly cynical, but I'd be quite surprised if cloudflare subjected their own headless browsing to the same rules the rest of the internet gets.

gruez 15 hours ago||
>most websites, particularly those behind cloudflare, are very restrictive even to crawlers that obey robots. Proof: a ton of my time over the last year, and my crawlers very carefully obey robots.

The docs are pretty unequivocal though:

>If you use Cloudflare products that control or restrict bot traffic such as Bot Management, Web Application Firewall (WAF), or Turnstile, the same rules will apply to the Browser Rendering crawler.

It's not just robots.txt. Most (all?) restrictions that apply to outside bots apply to cloudflare's bot as well, at least that's what they're claiming. If they're being this explicit about it, I'm willing to give them the benefit of the doubt until there's evidence to the contrary, rather than being a cynic and assuming the worst.

tjpnz 14 hours ago||
Do I have the option to fill it with junk for LLMs?
rvz 16 hours ago||
Selling the cure (DDoS protection) and creating the poison (Authorized AI crawling) against their customers.
Imustaskforhelp 16 hours ago||
This might be really great!

I had the idea, after recently buying https://mirror.forum (which I talked about on Discord and in the ArchiveTeam IRC), that I wanted to preserve/mirror forums, especially tech-related ones [think TinyCoreLinux], since Archive.org is really great but I would prefer some other efforts within this space as well.

I didn't want to scrape/crawl it myself because it would feel like yet another scraping effort for AI and would strain developers' resources.

And even when you do want to crawl, the issue is that you can't crawl sites behind Cloudflare, and sometimes for good reason.

So, to check my understanding: can I use Cloudflare Crawl to crawl the whole website of a forum, and does this only work for forums that use Cloudflare?

Also, what is the pricing for this? Is it just a standard Cloudflare Worker, so I would get the free 100k requests and then 1 million requests for a few cents (IIRC) for crawling? Considering how scalable Cloudflare is, it might even make more sense than buying a group of cheap VPSes.

Another point: I had previously been thinking that the best way would be for maintainers of these forums to give me a backup archive periodically, as my heart believes that to be the cleanest way. But after discussing it on Linux Discord servers and with archivers in that community, I couldn't find any maintainers of such tech forums who would subscribe to the idea of sharing a forum's public data as a quick backup for preservation purposes. So if anyone knows of, or maintains, any such forums, feel free to message me here in this thread about that too.

ipaddr 16 hours ago|
"I didn't want to scrape/crawl it myself because it would feel like yet another scraping effort for AI and would strain developers' resources"

You feel better paying someone to do the same thing?

Imustaskforhelp 15 hours ago||
I actually don't, but it seems that Cloudflare caches responses, so if anything, instead of straining the developers' resources, it would strain Cloudflare's resources, and Cloudflare can handle that more efficiently with their own crawl product.

Also, I am genuinely open to feedback (a lot of it), so just let me know if you know of any other alternative for the particular thing I wish to create, and I would love to have a discussion about that too! I genuinely wish there could be other ways, and part of the reason I wrote that comment was the hope that someone who manages forums, or knows people who do, could comment back so we could have a meaningful discussion!

I am also happy for you to suggest any good use cases for the domain in general, if anything useful can be made with it. In fact, I am happy to transfer this domain to you if it is something useful to you or anyone here (just donate some money, preferably $50-100, to any great charity after the date this comment is made and mail me the details, and I am absolutely willing to transfer the domain; the same goes if you currently work at a charity and it could help the charity in any meaningful manner!).

I had actually asked archive team if I could donate the domain to them if it would help archive.org in any meaningful way and they essentially politely declined.

I just bought this domain because someone on HN mentioned mirror.org when they wanted to show someone else a mirror, and I saw the price of the .org domain being so high ($150k or similar). I have a habit of finding random nice TLDs, and I found mirror.forum, so I bought it.

And I was just thinking about what a decent idea for it might be, now that I have bought it, and that was the thought. Obviously I have my flaws (many, actually), but I genuinely don't wish any harm on anybody, especially people who are passionate about running independent forums on this centralized web. I'd rather let this domain expire than have its use mean harm to anybody.

looking forward to discussion with ya.

weird-eye-issue 12 hours ago||
This is used to scrape third-party sites that aren't necessarily behind Cloudflare, so it has nothing to do with whether Cloudflare caches them or not. Plus, when using their browser rendering, it doesn't even fetch cached responses anyway.
Imustaskforhelp 8 hours ago||
I didn't know that it doesn't fetch cached responses; my apologies. I had only read through it at a glance, and it felt like something Cloudflare might have done. Is there any particular reason they don't use cached responses? It feels like a missed opportunity, but maybe I am missing something.
weird-eye-issue 5 hours ago||
It's a browser rendering API, which means people are paying a premium specifically to have a browser render a live website. If you want the cached response of a page (and to still possibly get blocked by Cloudflare), you could just write a Node script with a simple fetch and save your money.
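A minimal version of that "simple fetch" script, sketched here in Python rather than Node so it is self-contained; the data: URL is just a stand-in target so the example runs without a network (swap in a real page URL in practice):

```python
from urllib import request

def fetch(url: str, timeout: float = 10.0) -> str:
    # Plain HTTP GET with no JavaScript execution, so this only captures
    # content present in the initial HTML (often enough for forum pages).
    with request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)

# Self-contained demo target; replace with a real page URL in practice.
print(fetch("data:text/plain;charset=utf-8,hello"))  # prints: hello
```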