Posted by jeffpalmer 16 hours ago
``` write a custom crawler that will crawl every page on a site (internal links to the original domain only), scroll down to mimic a human, and save the output as a WebP screenshot, HTML, Markdown, and structured JSON. Make it designed to run locally in a terminal on a Linux machine using headless Google Chrome, and take advantage of multiple cores to run multiple pages simultaneously, while keeping in mind that it might have to throttle if the server gets hit too fast from the same IP. ```
Might use available open source software such as python, playwright, beautifulsoup4, pillow, aiofiles, trafilatura
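The "internal links to the original domain only" constraint in the prompt boils down to a small URL-normalisation check you can do with the stdlib before handing pages to Playwright. A minimal sketch (function name is my own):

```python
from urllib.parse import urljoin, urlparse

def is_internal(base_url: str, link: str) -> bool:
    """True if `link` resolves to the same host as `base_url`.
    Note: treats subdomains as external; loosen the comparison
    if you want www.example.com and example.com to match."""
    base_host = urlparse(base_url).netloc.lower()
    target = urlparse(urljoin(base_url, link))  # relative links inherit the base host
    return target.scheme in ("http", "https") and target.netloc.lower() == base_host

assert is_internal("https://example.com/a", "/b/c")
assert not is_internal("https://example.com/a", "https://other.org/x")
```

Resolving via `urljoin` first means relative links, fragments, and protocol-relative URLs all get normalised before the host comparison, which is where hand-rolled crawlers usually leak off-domain.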
You'll still be hand-rolling it if you want to disrespect crawling requirements though.
From the behaviour of our peers, this seems to be the real headline news.
That said, I'm not a fan of letting users forge whatever user agents they please. Instead, AIUI, to opt out of getting crawled I have to look for the existence of certain request headers[1].
[1]: https://developers.cloudflare.com/browser-rendering/referenc...
I'll need to test it out, especially with the labyrinth.
Further down they also mention that the requests come from CF's ASN and are branded with identifying headers, so third-party filters could easily block them too if they're so inclined. Seems reasonable enough.
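If you do want to filter on those identifying headers at your own origin, the check is just a substring match over the request headers. A sketch, where the marker values are placeholders you'd replace with whatever headers CF actually documents:

```python
def should_block(headers: dict, markers: dict) -> bool:
    """Return True if any (header-name, substring) marker matches.
    `markers` would hold the identifying headers from CF's docs,
    e.g. {"user-agent": "SomeBotToken"} -- values here are illustrative,
    not the real ones."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return any(
        sub.lower() in lowered.get(name.lower(), "").lower()
        for name, sub in markers.items()
    )

assert should_block({"User-Agent": "FooBot/1.0"}, {"user-agent": "foobot"})
assert not should_block({"User-Agent": "Mozilla/5.0"}, {"user-agent": "foobot"})
```

Blocking by ASN instead would happen at the firewall or reverse proxy, outside application code, but the header check is the part a small site can drop into existing middleware.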
We're creating an internet that is becoming self-reinforcing for those who already have power and harder for anyone else. As crawling becomes difficult and expensive, only those with previously collected datasets get to play. I certainly understand individual sites wanting to limit access, but it seems unlikely that they're limiting access to the big players - and maybe even helping them since others won't be able to compete as well.
I'm split between: Yes! At last something to get CF protected sites! And: Uh! Now the internet is successfully centralized.
First, the Cloudflare Crawl endpoint does not require the target site to use Cloudflare. It spins up a headless Chrome instance (via the Browser Rendering API) that fetches and renders any publicly accessible URL. You could crawl a site hosted on Hetzner or a bare VPS with the same call.
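For concreteness, the Browser Rendering REST API is a plain authenticated POST per account; a sketch that builds (but does not send) a screenshot call, with placeholder account ID and token, assuming the endpoint path shown in Cloudflare's current docs:

```python
import json
from urllib.request import Request

API_BASE = "https://api.cloudflare.com/client/v4/accounts"

def screenshot_request(account_id: str, token: str, target_url: str) -> Request:
    """Build a Browser Rendering screenshot request.
    Endpoint path per Cloudflare's REST docs at time of writing;
    verify against the current API reference before relying on it."""
    endpoint = f"{API_BASE}/{account_id}/browser-rendering/screenshot"
    body = json.dumps({"url": target_url}).encode()
    return Request(endpoint, data=body, method="POST", headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    })

req = screenshot_request("acct123", "TOKEN", "https://example.com")
```

Note the target URL is just a field in the request body, which is the point: nothing about the call requires the target to sit behind Cloudflare.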
Second on pricing: Browser Rendering is only available on the Workers Paid plan ($5/month). It is not part of the free tier. Usage is billed per invocation beyond the included quota - the exact limits are in the Cloudflare docs under Browser Rendering pricing, but for archival use cases with moderate crawl rates you are very unlikely to run into meaningful costs.
The practical gotcha for forum archival is pagination and authentication-gated content. If the forum requires a login to see older posts, a headless browser session with saved cookies would help, but that is more complex to orchestrate than a single-shot fetch.
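One lightweight way to handle the login-gated case is to export cookies once from a real logged-in browser session and replay them on subsequent fetches. A sketch assuming Playwright-style cookie records (`name`/`value` keys, as in its `storage_state` output):

```python
def cookie_header(cookies: list) -> str:
    """Fold saved cookie records into a Cookie request header,
    so plain HTTP fetches can reuse an authenticated session."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)

# e.g. cookies exported from a logged-in session
saved = [{"name": "session", "value": "abc123"}, {"name": "lang", "value": "en"}]
assert cookie_header(saved) == "session=abc123; lang=en"
```

That covers single-shot fetches; for a full headless-browser crawl you'd load the same saved state into the browser context instead of building the header by hand.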
The post says it's available on both free and paid plans. According to the Browser Rendering pricing page, the free plan gets 10 minutes/day of browsing time.
- Crawl jobs per day: 5
- Maximum pages per crawl: 100
[0] https://developers.cloudflare.com/browser-rendering/limits/#...