Posted by jeffpalmer 16 hours ago
``` write a custom crawler that will crawl every page on a site (internal links to the original domain only), scroll down to mimic a human, and save the output as a WebP screenshot, HTML, Markdown, and structured JSON. Make it designed to run locally in a terminal on a Linux machine using headless Google Chrome, and take advantage of multiple cores to run multiple pages simultaneously, while keeping in mind that it might have to throttle if the server gets hit too fast from the same IP. ```
Might use available open source software such as python, playwright, beautifulsoup4, pillow, aiofiles, trafilatura
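The "internal links to the original domain only" constraint in the prompt boils down to a small URL-normalisation check you can do with the stdlib before handing pages to Playwright. A minimal sketch (function name is my own):

```python
from urllib.parse import urljoin, urlparse

def is_internal(base_url: str, link: str) -> bool:
    """True if `link` resolves to the same host as `base_url`.
    Note: treats subdomains as external; loosen the comparison
    if you want www.example.com and example.com to match."""
    base_host = urlparse(base_url).netloc.lower()
    target = urlparse(urljoin(base_url, link))  # relative links inherit the base host
    return target.scheme in ("http", "https") and target.netloc.lower() == base_host

assert is_internal("https://example.com/a", "/b/c")
assert not is_internal("https://example.com/a", "https://other.org/x")
```

Resolving via `urljoin` first means relative links, fragments, and protocol-relative URLs all get normalised before the host comparison, which is where hand-rolled crawlers usually leak off-domain.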
You'll still be hand-rolling it if you want to disrespect crawling requirements though.
From the behaviour of our peers, this seems to be the real headline news.
That said, I'm not a fan of letting users forge whatever user agents they please. Instead, AIUI, to opt out of getting crawled I have to look for the existence of certain request headers[1].
[1]: https://developers.cloudflare.com/browser-rendering/referenc...
I'll need to test it out, especially with the labyrinth.
Further down they also mention that the requests come from CF's ASN and are branded with identifying headers, so third-party filters could easily block them too if they're so inclined. Seems reasonable enough.
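If you do want to filter on those identifying headers at your own origin, the check is just a substring match over the request headers. A sketch, where the marker values are placeholders you'd replace with whatever headers CF actually documents:

```python
def should_block(headers: dict, markers: dict) -> bool:
    """Return True if any (header-name, substring) marker matches.
    `markers` would hold the identifying headers from CF's docs,
    e.g. {"user-agent": "SomeBotToken"} -- values here are illustrative,
    not the real ones."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return any(
        sub.lower() in lowered.get(name.lower(), "").lower()
        for name, sub in markers.items()
    )

assert should_block({"User-Agent": "FooBot/1.0"}, {"user-agent": "foobot"})
assert not should_block({"User-Agent": "Mozilla/5.0"}, {"user-agent": "foobot"})
```

Blocking by ASN instead would happen at the firewall or reverse proxy, outside application code, but the header check is the part a small site can drop into existing middleware.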
We're creating an internet that is becoming self-reinforcing for those who already have power and harder for anyone else. As crawling becomes difficult and expensive, only those with previously collected datasets get to play. I certainly understand individual sites wanting to limit access, but it seems unlikely that they're limiting access to the big players - and maybe even helping them since others won't be able to compete as well.
I'm split between: Yes! At last something to get CF protected sites! And: Uh! Now the internet is successfully centralized.
First, the Cloudflare Crawl endpoint does not require the target site to use Cloudflare. It spins up a headless Chrome instance (via the Browser Rendering API) that fetches and renders any publicly accessible URL. You could crawl a site hosted on Hetzner or a bare VPS with the same call.
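For concreteness, the Browser Rendering REST API is a plain authenticated POST per account; a sketch that builds (but does not send) a screenshot call, with placeholder account ID and token, assuming the endpoint path shown in Cloudflare's current docs:

```python
import json
from urllib.request import Request

API_BASE = "https://api.cloudflare.com/client/v4/accounts"

def screenshot_request(account_id: str, token: str, target_url: str) -> Request:
    """Build a Browser Rendering screenshot request.
    Endpoint path per Cloudflare's REST docs at time of writing;
    verify against the current API reference before relying on it."""
    endpoint = f"{API_BASE}/{account_id}/browser-rendering/screenshot"
    body = json.dumps({"url": target_url}).encode()
    return Request(endpoint, data=body, method="POST", headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    })

req = screenshot_request("acct123", "TOKEN", "https://example.com")
```

Note the target URL is just a field in the request body, which is the point: nothing about the call requires the target to sit behind Cloudflare.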
Second on pricing: Browser Rendering is only available on the Workers Paid plan ($5/month). It is not part of the free tier. Usage is billed per invocation beyond the included quota - the exact limits are in the Cloudflare docs under Browser Rendering pricing, but for archival use cases with moderate crawl rates you are very unlikely to run into meaningful costs.
The practical gotcha for forum archival is pagination and authentication-gated content. If the forum requires a login to see older posts, a headless browser session with saved cookies would help, but that is more complex to orchestrate than a single-shot fetch.
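One lightweight way to handle the login-gated case is to export cookies once from a real logged-in browser session and replay them on subsequent fetches. A sketch assuming Playwright-style cookie records (`name`/`value` keys, as in its `storage_state` output):

```python
def cookie_header(cookies: list) -> str:
    """Fold saved cookie records into a Cookie request header,
    so plain HTTP fetches can reuse an authenticated session."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)

# e.g. cookies exported from a logged-in session
saved = [{"name": "session", "value": "abc123"}, {"name": "lang", "value": "en"}]
assert cookie_header(saved) == "session=abc123; lang=en"
```

That covers single-shot fetches; for a full headless-browser crawl you'd load the same saved state into the browser context instead of building the header by hand.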
The post says it's available on both free and paid plans. According to the Browser Rendering pricing page, the free plan gets 10 minutes/day of browsing time.
- Crawl jobs per day: 5
- Maximum pages per crawl: 100
[0] https://developers.cloudflare.com/browser-rendering/limits/#...