
Posted by jeffpalmer 14 hours ago

Cloudflare crawl endpoint (developers.cloudflare.com)
331 points | 129 comments
iranu 2 hours ago|
Honestly, it feels like Cloudflare is bullying other sites into using their anti-bot services. Great business model: charging site owners and devs at the same time. And using AI per page to parse content is reckless.
pupppet 13 hours ago||
Cloudflare getting all the cool toys. AWS, anyone awake over there?
jppope 13 hours ago||
This is actually really amazing. Cloudflare is just skating to where the puck is going to be on this one.
patchnull 12 hours ago||
The main win here is abstracting away browser context lifecycle management. Anyone who has run Puppeteer on Workers knows the pain of handling cold starts, context reuse, and timeout cascading across navigation steps. Having crawl() bundle render-then-extract into one call covers maybe 80% of scraping use cases. The remaining 20% where you need request interception or pre-render script injection still needs the full Browser Rendering API, but for pulling structured data from public pages this is a big simplification over managing session state yourself.
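A rough sketch of what "bundle render-then-extract into one call" buys the caller. Every name here (`crawl`, `render`, `RenderResult`) is hypothetical, not Cloudflare's actual API; the point is only that the session lifecycle stays hidden behind the one call:

```python
# Hypothetical sketch: the "bundled" crawl shape vs. manual session management.
# None of these names are Cloudflare's real API.

from dataclasses import dataclass

@dataclass
class RenderResult:
    url: str
    html: str

def render(url: str) -> RenderResult:
    # Stand-in for a headless-browser render. A real version is where the
    # pain lives: cold starts, context reuse, per-navigation timeouts.
    return RenderResult(url=url, html=f"<h1>Example page at {url}</h1>")

def extract_title(html: str) -> str:
    # Trivial extractor stand-in for "pull structured data from the page".
    start = html.index("<h1>") + 4
    return html[start:html.index("</h1>")]

def crawl(url: str) -> dict:
    # The one-call shape: render, then extract, with no browser/session
    # state ever exposed to the caller.
    page = render(url)
    return {"url": page.url, "title": extract_title(page.html)}

print(crawl("https://example.com"))
```

The remaining 20% of cases (request interception, pre-render script injection) need hooks into the render step itself, which is exactly what a bundled call like this hides.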
binarymax 13 hours ago||
Really hard to understand costs here. What is a reasonable pages per second? Should I assume with politeness that I'm basically at 1 page per second == 3600 pages/hour? Seems painfully slow.
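For back-of-envelope throughput math (the site size below is made up for illustration):

```python
# At a polite 1 request/second, throughput is fixed by the delay alone.
pages_per_hour = 1 * 3600          # 3600 pages/hour
site_size = 100_000                # hypothetical site
hours = site_size / pages_per_hour

print(pages_per_hour)              # 3600
print(round(hours, 1))             # 27.8 hours for a 100k-page site
```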
skybrian 12 hours ago||
If two customers crawl the same website and it uses crawl-delay, how does it handle that? Are they independent, or does each one run half as fast?
PeterStuer 5 hours ago|
You put a governor on the domain, and you return from the cache instead.
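That idea, one governor and one cache per domain shared across customers so the origin sees a single polite crawler, can be sketched roughly like this (all names are made up, not any real service's internals):

```python
import time

class DomainGovernor:
    """One rate limit and one cache per origin domain, shared by all callers."""

    def __init__(self, crawl_delay: float):
        self.crawl_delay = crawl_delay
        self.last_fetch = 0.0
        self.cache: dict[str, str] = {}

    def fetch(self, url: str, origin_fetch) -> str:
        # A second customer asking for the same URL gets the cached copy,
        # so the origin still sees at most one request per delay window.
        if url in self.cache:
            return self.cache[url]
        wait = self.crawl_delay - (time.monotonic() - self.last_fetch)
        if wait > 0:
            time.sleep(wait)
        self.last_fetch = time.monotonic()
        body = origin_fetch(url)
        self.cache[url] = body
        return body

calls = []
def origin(url):
    calls.append(url)
    return f"body of {url}"

gov = DomainGovernor(crawl_delay=0.1)
gov.fetch("https://example.com/a", origin)   # hits the origin
gov.fetch("https://example.com/a", origin)   # served from cache
print(len(calls))                            # 1 origin hit for 2 requests
```

So neither customer runs "half as fast": the second one is effectively free as long as the cache is warm.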
triwats 14 hours ago||
This could be cool: using Cloudflare's edge to monitor the actual content of endpoints, for synthetic monitoring.
ed_mercer 11 hours ago||
> Honors robots.txt directives, including crawl-delay

Sounds pretty useless for any serious AI company

PeterStuer 5 hours ago|
What % of sites have a content update volume that exceeds what you can get respecting crawl delay?

If your delay is 1s and you publish less than 60 updates a minute on average I can still get 100%. Most crawls are not that latency sensitive, certainly not the ai ones.

HFT bots, now that is an entirely different ballgame.
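The arithmetic above, sketched out (the publish rate is a hypothetical number under the commenter's 1s-delay example):

```python
crawl_delay = 1.0                        # seconds, from robots.txt
fetches_per_minute = 60 / crawl_delay    # ceiling: 60 pages/minute
updates_per_minute = 45                  # hypothetical publish rate

# As long as the site publishes fewer pages per minute than the
# crawl-delay ceiling allows fetching, a polite crawler keeps up.
coverage = min(1.0, fetches_per_minute / updates_per_minute)
print(f"{coverage:.0%}")                 # 100%
```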

mrweasel 3 hours ago||
> Most crawls are not that latency sensitive, certainly not the ai ones.

They certainly behave like they are. We constantly see crawlers trying to do cache busting for pages that haven't changed in days, if not weeks. It's hard to tell where the bots are coming from these days, as most have taken to just lying and saying that they are Chrome.

I'd agree that respecting robots.txt makes this a non-starter for the problematic scrapers. These are bots that will hammer a site into the ground; they don't respect robots.txt, especially if it tells them to go away.

All of this would be much less of a problem if the authors of the scrapers actually knew how to code, understood how the Internet works, and had just the slightest bit of respect for others. But they don't, so now all scrapers are labeled as hostile, meaning that only the very largest companies, like Google, get special access.

moebrowne 1 hour ago||
> We constantly see crawlers trying to do cache busting

Do you have a source for this? Not saying you're wrong, I'd just like to know more

fbrncci 11 hours ago||
Awesome, so I no longer have to use Firecrawl or my own crawler to scrape entire websites for an agent? Especially when needing residential proxies to do so on Cloudflare protected sites? Why though?
freakynit 11 hours ago|
I have tried theirs... they are NOT proxies... that means the majority of popular sites actually block scraping, even if they are protected by Cloudflare itself.
arjunchint 11 hours ago|
RIP @FireCrawl. Or at the very least, they were the inspiration for this?