
Posted by jeffpalmer 13 hours ago

Cloudflare crawl endpoint (developers.cloudflare.com)
331 points | 129 comments
RamblingCTO 3 hours ago|
Doesn't work for pages protected by Cloudflare, in my experience. What a shame; they could've produced the problem and sold the solution.
chvid 2 hours ago|
As long as it gets past Azure's bot protection ...
jasongill 12 hours ago||
I'm surprised that Cloudflare hasn't started hosting a pre-scraped version of websites that use Cloudflare's proxy - something like https://www.example.com/cdn-cgi/cached-contents.json They already have the website content in their cache, so why not just cut out the middle man of scraping services and API's like this and publish it?

Obviously there's good reasons NOT to, but I am surprised they haven't started offering it (as an "on-by-default" option, naturally) yet.

brookst 6 minutes ago||
That was my first thought when I read the headline. It would make perfect sense, and would allow some websites to have best of both worlds: broadcasting content without being crushed by bots. (Not all sites want to broadcast, but many do).
cortesoft 7 hours ago|||
Well, the conversion process into the JSON representation is going to take CPU, and then you have to store the result, in essence doubling your cache footprint.

Doing it on demand still utilizes their cached version, so it saves a trip to the origin, but doesn’t require doubling the cache size. They can still cache the results if the same site is scraped multiple times, but this saves having to cache things that are never going to be requested.

Cache footprint management is a huge factor in the cost and performance for a CDN, you want to get the most out of your storage and you want to serve as many pages from cache as possible.

I know from my experience working for a CDN that we were doing all sorts of things to try to maximize the hit rate for our cache. In fact, one of the easiest and most effective techniques for increasing cache hit rate is to do the OPPOSITE of what you are suggesting: instead of pre-caching content, you do ‘second hit caching’, where you only store a copy in the cache if a piece of content is requested a second time.

The idea is that a lot of content is requested only once by one user, and then never again, so it is a waste to store it in the cache. If you wait until it is requested a second time before you cache it, you avoid those single-use pages going into your cache, and you don’t hurt overall performance much, because the content that is most useful to cache is requested a lot, and you only have to make one extra origin request.
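The second-hit caching heuristic described above can be sketched in a few lines. This is a toy in-memory model (class and function names are made up for illustration; real CDNs track first hits in a compact shared structure rather than a Python set):

```python
class SecondHitCache:
    """Only admit an object to the cache on its second request.

    One-hit-wonder objects never enter the cache; popular objects
    cost exactly one extra origin fetch before they are admitted.
    """

    def __init__(self):
        self.seen_once = set()  # keys requested exactly once so far
        self.cache = {}         # admitted key -> cached body

    def get(self, key, fetch_origin):
        if key in self.cache:
            return self.cache[key]      # cache hit
        body = fetch_origin(key)        # miss: go to the origin
        if key in self.seen_once:
            self.seen_once.discard(key)
            self.cache[key] = body      # second request: admit to cache
        else:
            self.seen_once.add(key)     # first request: remember, don't store
        return body


cdn = SecondHitCache()
origin_calls = []

def fetch(key):
    origin_calls.append(key)
    return "body:" + key

cdn.get("/a", fetch)  # first request: origin fetch, not cached yet
cdn.get("/a", fetch)  # second request: origin fetch, now admitted
cdn.get("/a", fetch)  # third request: served from cache
```

The trade-off is exactly as described: popular content pays one extra origin round trip, while single-use pages never consume cache storage at all.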

selcuka 11 hours ago|||
Not the same thing, but they have something close (it's not on-by-default, yet) [1]:

> Cloudflare's network now supports real-time content conversion at the source, for enabled zones using content negotiation headers. Now when AI systems request pages from any website that uses Cloudflare and has Markdown for Agents enabled, they can express the preference for text/markdown in the request. Our network will automatically and efficiently convert the HTML to markdown, when possible, on the fly.

[1] https://blog.cloudflare.com/markdown-for-agents/
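The content negotiation the quoted blog post describes is ordinary HTTP: the client states its preference in the Accept header, with quality values to express fallbacks. A minimal sketch (the URL is a placeholder, and whether a given zone actually returns Markdown depends on the feature being enabled):

```python
import urllib.request

def build_markdown_request(url):
    """Build a request that prefers Markdown but accepts HTML as a
    fallback; q-values rank the client's preferences."""
    return urllib.request.Request(
        url,
        headers={"Accept": "text/markdown;q=1.0, text/html;q=0.8"},
    )

req = build_markdown_request("https://example.com/post")
```

A server that honors the negotiation would answer with `Content-Type: text/markdown`; one that doesn't simply serves HTML as usual.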

michaelmior 11 hours ago|||
> I'm surprised that Cloudflare hasn't started hosting a pre-scraped version of websites that use Cloudflare's proxy

It's entirely possible that they're doing this under the hood for cases where they can clearly identify the content they have cached is public.

janalsncm 11 hours ago|||
How would they know the content hasn’t changed without hitting the website?
coreq 9 hours ago|||
They wouldn't. There's ETag and the like, but that's still a layer-7 round trip to the origin. The usual pattern, though, is for the origin to declare how long the content is good for in the response headers, and for the CDN to cache it for that duration. For example, a bitcoin pricing aggregator might say a page is good for 60 seconds (with disclaimers on the page that this isn't market data), while My Little Town News might say an article is good for an hour (to allow updates) and the homepage is good for 5 minutes so a breaking news article doesn't appear too far behind.
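The freshness pattern above is just the origin emitting a per-route `Cache-Control: max-age` header. A sketch with the example TTLs from the comment (the route names and policy table are illustrative):

```python
# Per-route freshness lifetimes in seconds, mirroring the examples above:
# price ticker 60s, article 1 hour, homepage 5 minutes.
CACHE_POLICY = {
    "/price": 60,
    "/article": 3600,
    "/": 300,
}

def cache_control_header(path):
    """Return the Cache-Control value the origin should emit for a path.

    The CDN serves from cache for max-age seconds without contacting the
    origin; after that it revalidates (e.g. with ETag / If-None-Match).
    """
    ttl = CACHE_POLICY.get(path, 0)
    if ttl == 0:
        return "no-cache"  # unknown route: always revalidate with the origin
    return f"public, max-age={ttl}"
```

The origin stays authoritative about freshness; the CDN never has to guess whether content changed within the declared window.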
OptionOfT 8 hours ago||||
Caching headers?

(Which, on Akamai, are by default ignored!)

cortesoft 7 hours ago|||
Keeping track of when content changes is literally the primary function of a CDN.
binarymax 11 hours ago|||
Based on the post, it seems likely that they'd just delay per the robots.txt policy no matter what, and do a full browser render of the cached page to get the content. Probably overkill for lots and lots of sites. An HTML fetch + readability is really cheap.
hrmtst93837 2 hours ago|||
Offering wholesale cache dumps blows up every assumption about origin privacy and copyright. Suddenly you are one toggle away from someone else automatically harvesting and reselling your work with Cloudflare as the unwitting middle tier.

You could try to gate this behind access controls but at that point you have reinvented a clunky bespoke CDN API that no site owner asked for, plus a fresh legal mess. Static file caches work because they only ever respond to the original request, not because they claim to own or index your content.

It is a short path from "helpful pre-scraped JSON" to handing an entire site to an AI scraper-for-hire with zero friction. The incentives do not line up unless you think every domain on Cloudflare wants their content wholesale exported by default.

Symbiote 1 hour ago||
I think Common Crawl already offers this, although it's free: https://commoncrawl.org/
cmsparks 12 hours ago|||
That would prolly work for simple sites, but you still need a dedicated scraping service with a browser to render sites that are more complex (e.g. SPAs)
csomar 11 hours ago||
It’s a bit more complicated than that. This is their product Browser Rendering, which runs a real browser that loads the page and executes JavaScript. It’s a bit more involved than a simple curl scraping.
randomtools 3 hours ago||
So does that mean it can replace serpapi or similar?
ljm 12 hours ago||
Is cloudflare becoming a mob outfit? Because they are selling scraping countermeasures but are now selling scraping too.

And they can pull it off because of their reach over the internet with the free DNS.

rendaw 5 hours ago||
I think the simple explanation is that they weren't selling scraping countermeasures, they were selling web-based denial of service protection (which may be caused by scrapers).
PeterStuer 4 hours ago||
Ask yourself, why would a scraper ddos? Why would a ddos-protection vendor ddos?
c0balt 1 hour ago|||
The number of git forges behind Anubis et al and the numerous public announcements should be enough.

Scrapers seem to be exceedingly careless in using public resources. The problem is often not even DDoS (as in overwhelming bandwidth usage) but rather DoS through excessive hits on expensive routes.

wongarsu 1 hour ago||||
Because the scraper is either impatient, careless or indifferent; and if they scrape for training data they don't plan to come back. If they don't plan to come back, they don't care if you tighten up crawling protections after they have moved on. In fact, they are probably happy that they got their data and their competition won't get theirs.
wiether 9 minutes ago||
> they don't plan to come back

To me the current behavior of those scrapers tells me that "they don't plan", period.

Looks like they hired a bunch of excavators and are digging 2 meters deep across whole fields, looking for nuggets of gold, and piling the dirt into a huge mountain.

Once they realize the field was bereft of any gold but full of silver? Or that the gold was actually 2.5 meters deep?

They have to go through everything again.

junaru 1 hour ago|||
> Ask yourself, why would a scraper ddos?

No need to ask, I can tell you exactly: because they have no regard for anything but their own profit.

Let me give you an example of this mom-and-pop shop known as Anthropic.

You see, they have this thing called ClaudeBot, and at least initially it scraped by iterating through IPs.

Now you have these things called shared hosting servers, typically running 1,000-10,000 domains of actual low-volume websites on 1-50 or so IPs.

Guess what happens when it's your network's turn to bend over? The whole hosting company's infrastructure goes down as each server has hundreds of ClaudeBots crawling hundreds of vhosts at the same time.

This happened for months. It's the reason they are banned in WAFs by half the hosting industry.

iso-logi 11 hours ago|||
Their free DNS is only a small piece of the pie.

The fact that 30%+ of the web relies on their caching, routing, and DDoS protection services is the main pull.

Their DNS is only really for data collection and to front as "good will"

jen729w 7 hours ago||
> The fact that 30%+ of the web relies on their caching services

30% of the web might use their caching services. 'Relies on' implies that it wouldn't work without them, which I doubt is the case.

It might be the case for the biggest 1% of that 30%. But not the whole lot.

reddalo 3 hours ago||
>'Relies on' implies that it wouldn't work without them

Last time Cloudflare went down, their dashboard was also unavailable, so you couldn't turn off their proxy service anyway.

shadowfiend 11 hours ago|||
No: https://developers.cloudflare.com/browser-rendering/rest-api...
oefrha 8 hours ago||
That's not the perfect defense you think it is. Plenty of robots.txts[1] technically allow scraping their main content pages as long as your user-agent isn't explicitly disallowed, but in practice they're behind Cloudflare so they still throw up Cloudflare bot check if you actually attempt to crawl.

And forget about crawling. If you have a less reputable IP (basically every IP in third-world countries is less reputable, for instance), you can be CAPTCHA'ed to no end by Cloudflare even as a human user, on the default setting, so plenty of site owners with more reputable home/office IPs don't even know what they subject a subset of their users to.

[1] E.g. https://www.wired.com/robots.txt to pick an example high up on HN front page.

its-kostya 11 hours ago|||
Cloudflare has been trying to mediate between publishers and AI companies. If publishers are behind Cloudflare and Cloudflare's bot detection stops scrapers at the publishers' request, the publishers can allow their data to be scraped (via this endpoint) for a price. It creates market scarcity. I don't believe the target audience is you and me, unless you own a very popular blog that AI companies would pay you for.
PeterStuer 4 hours ago||
Next step will be their default "free" anti-bot denying all but their own bot. They know full well nearly nobody changes the default.
theamk 11 hours ago|||
no? it takes 10 seconds to check:

> The /crawl endpoint respects the directives of robots.txt files, including crawl-delay. All URLs that /crawl is directed not to crawl are listed in the response with "status": "disallowed".

You don't need any scraping countermeasures for crawlers like those.
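A crawler that honors robots.txt, including crawl-delay, is easy to sketch with the standard library's `urllib.robotparser`; the robots.txt body below is made up for illustration:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())
rp.modified()  # mark the rules as loaded so queries are answered

def crawl_decision(agent, url):
    """Return (allowed, delay_seconds) for one URL per the parsed rules.

    A compliant crawler checks `allowed` before every fetch and sleeps
    `delay_seconds` between requests to the same host.
    """
    return rp.can_fetch(agent, url), rp.crawl_delay(agent)
```

Disallowed URLs would be skipped and reported, which matches the `"status": "disallowed"` behavior quoted above.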

Macha 10 hours ago|||
So what’s the user agent for their bot? They don’t seem to specify the default in the docs, and it looks like it’s user-configurable. So it's yet another opt-out bot that your web server needs special matching rules to block
flanksteak20 5 hours ago|||
Isn't this covered here? https://developers.cloudflare.com/browser-rendering/referenc...
Macha 2 hours ago||
No, hence all their examples using User-Agent: *
gruez 10 hours ago|||
>So yet another opt out bot which you need your web server to match on special behaviour to block

Given that malicious bots are allegedly spoofing real user agents, "another user agent you have to add to your list" seems like the least of your problems.

AdamN 2 hours ago||
Not 'allegedly', it's just a fact. Even if you're not malicious, however, it's sometimes necessary, because the server may serve different sites to different browsers, checking user agents to decide which experience to deliver. So even for legitimate purposes you need to use at least the prefix of a user agent the server expects.
PeterStuer 3 hours ago|||
Like they explain in the docs, their crawler will respect the robots.txt disallowed user-agents, right after the section that explains how to change your user-agent.
isodev 7 hours ago|||
They always have been.

They also use their dominant position to apply political pressure when they don’t like how a country chooses to run things.

So yeah, we’ve created another mega corp monster that will hurt for years to come.

subscribed 10 hours ago|||
I think there's some space between being absolutely snuffed by the countless bots of everyone, ignoring everything, pulling from residential proxies, and this supposedly slower, well-behaved, smarter bot.

Like there's a difference between dozens of drunk teenagers thrashing the city streets in an illegal street race vs. a taxi driver.

pocksuppet 8 hours ago|||
Was it ever not one? They protect a lot of DDoS-for-hire sites from DDoS by their competitors. In return they increase the quantity of DDoS on the internet. They offer you a service for $150, then months later suddenly demand $150k in 24 hours or they shut down your business. If you use them as a domain registrar they will hold your domain hostage.
azinman2 7 hours ago|||
Where can I learn more about the 150k in 24h?
caffeinewriter 5 hours ago||
I imagine it's referencing this story:

https://robindev.substack.com/p/cloudflare-took-down-our-web...

HN Discussion:

https://news.ycombinator.com/item?id=40481808

Sebguer 7 hours ago|||
yeah, GP completely fails to realize that Cloudflare has always played both sides. that is their entire business model, and it was transparent from the beginning that they would absolutely do the same here.
andrepd 1 hour ago|||
Well this scraper honours robots.txt so I'm sure most AI crawlers will find it useless.
rrr_oh_man 12 hours ago|||
[flagged]
stri8ted 11 hours ago|||
Do you have any evidence to support this view?
pocksuppet 8 hours ago|||
Who else would MITM 30% of the internet?
rolymath 10 hours ago|||
Read who and how it was founded. It's not a secret at all.
mtmail 11 hours ago|||
Any kind of source for the claim?
Retr0id 12 hours ago|||
For a long time Cloudflare has proudly protected DDoS-as-a-service sites (but of course, they claim they don't "host" them)
Dylan16807 8 hours ago||
Are you using the word "claim" to call them wrong or for a more confusing reason?

Because I'm pretty sure they are not in fact wrong.

Retr0id 8 hours ago||
The distinction between a caching proxy and an origin server is pretty meaningless when you're serving static content, if you ask me.
Dylan16807 7 hours ago||
There's a blurry line there, true.

On the other hand when a page is small and static enough that it's basically just a flyer, I also care a lot less about who hosts it.

giancarlostoro 11 hours ago||
If they ever sell or the CEO shifts, yes. For the meantime, they have not given any strong indication that they're trying to bully anybody. I could see things changing drastically if the people in charge are swapped out.
radicalriddler 23 minutes ago||
Interesting... I built an MCP server for their initial browser-render-as-markdown feature, and I just tell the LLM to follow reasonable links to relevant content and recurse with the tool.
Lasang 10 hours ago||
The idea of exposing a structured crawl endpoint feels like a natural evolution of robots.txt and sitemaps.

If more sites provided explicit machine-readable entry points for crawlers, indexing could become a lot less wasteful. Right now crawlers spend a lot of effort rediscovering the same structure over and over.

It also raises interesting questions about whether sites will eventually provide different views for humans vs. automated agents in a more formalized way.
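Sitemaps already give crawlers one such machine-readable entry point: a URL list with last-modified timestamps, so a crawler can skip pages that haven't changed instead of rediscovering the site structure. A minimal parse with the standard library (inline sample document instead of a fetched file):

```python
import xml.etree.ElementTree as ET

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-05-01</lastmod></url>
  <url><loc>https://example.com/post</loc><lastmod>2024-06-15</lastmod></url>
</urlset>"""

# Sitemap elements live in the sitemaps.org namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text):
    """Return (url, lastmod) pairs from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        (u.findtext("sm:loc", namespaces=NS),
         u.findtext("sm:lastmod", namespaces=NS))
        for u in root.findall("sm:url", NS)
    ]
```

A crawler that diffs `lastmod` against its last visit only refetches what changed, which is exactly the wasted effort the comment is pointing at.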

_heimdall 9 hours ago||
I expect that if we still used REST indexing would be even less wasteful.

I've found myself falling pretty hard on the side of making APIs work for humans and expecting LLM providers to optimize around that. I don't need an MCP for a CLI tool, for example, I just need a good man page or `--help` documentation.

berkes 3 hours ago|||
I know in practice it no longer is the case, if it ever was.

But semantic HTML is exactly that explicit machine-readable entry point. I am firmly entrenched in the opinion that HTML and the DOM are only for machines to read; it just happens to be also somewhat understandable to some humans. Take an average webpage and have a look at all the characters (bytes) in there: often two thirds won't ever be shown to humans.

Point being: we don't need to invent something new. We just need to realize we already have it and use it correctly. Other than this requiring better understanding of web tech, it has no downsides. The low hanging fruit being the frameworks out there that should really do a better job of leveraging semantics in their output.
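As an illustration of how semantic markup is already machine-readable: a parser can pull just the content inside `<article>` with no site-specific rules, skipping navigation and other chrome. A sketch using only the standard library (the sample page is made up):

```python
from html.parser import HTMLParser

class ArticleText(HTMLParser):
    """Collect only the text that appears inside <article> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting level of currently open <article> tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article":
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

html = ("<body><nav>Menu</nav>"
        "<article><h1>Title</h1><p>Body text.</p></article></body>")
p = ArticleText()
p.feed(html)
# The <nav> boilerplate is skipped; only the article content is kept.
```

With well-structured markup, that is the whole "content API"; no new format needed.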

PeterStuer 4 hours ago|||
The only ones benefiting from 'wasteful' crawling are the anti-bot solution vendors. Everyone else is incentivized to crawl as efficiently as possible.

Makes you think, right?

catlifeonmars 10 hours ago|||
> It also raises interesting questions about whether sites will eventually provide different views for humans vs. automated agents in a more formalized way.

That raises an interesting question about whether this would exacerbate supply chain injection attacks: show the innocuous page to the human, another to the bot.

pocksuppet 8 hours ago|||
Apart from the obvious problem: presenting something different to crawlers and humans.
rglover 9 hours ago|||
I just do a query param to toggle to markdown/text if ?llm=true on a route. Easy pattern that's opt-in.
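That toggle can be sketched as a tiny dispatch function; the render callables are placeholders standing in for whatever templating the route already uses:

```python
from urllib.parse import urlsplit, parse_qs

def render_page(url, html_renderer, markdown_renderer):
    """Serve markdown when the request opts in with ?llm=true,
    plain HTML otherwise."""
    query = parse_qs(urlsplit(url).query)
    if query.get("llm", ["false"])[0].lower() == "true":
        return markdown_renderer()
    return html_renderer()
```

Since it keys off an explicit query param rather than user-agent sniffing, it sidesteps cloaking concerns: any client can ask for either view.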
pdntspa 9 hours ago||
They already do...

A lot of known crawlers will get a crawler-optimized version of the page

rafram 8 hours ago||
Do they? AFAIK Google forbids that, and they’ll occasionally test that you aren’t doing it.
pdntspa 8 hours ago|||
I haven't checked in a while but I know for a fact that Amazon does or did it
6510 8 hours ago|||
With Google covering only 3% I wonder how much people still care, and whether they should. Funny: I own and know sites that are by far the best resource on their topic but supposedly have too many links, Google says. It's like I ask you for a page about Cuban chains and you say you don't have one because it had too many links. Or your greengrocer suddenly doesn't have apples because his supplier now offers more than 5 different kinds, so he will never buy there again.
allixsenos 1 hour ago||
"Selling the wall and the ladder."

"Biggest betrayal in tech."

"Protection racket."

These hot takes sound smart but they're not.

The web was built to be open and available to everyone. Serving static HTML from disk back in the day, nobody could hurt you because there was nothing to hurt.

We need bot protection now because everything is dynamic, straight from the database with some light caching for hot content. When Facebook decides to recrawl your one million pages in the same instant, you're very much up shit creek without a paddle. A bot that crawls the full site doesn't steal anything, but it does take down the origin server. My clients never call me upset that a bot read their blog posts. They call because the bot knocked the site offline for paying customers.

Bot protection protects availability, not secrecy.

And the real bot problem isn't even crawling. It's automated signups. Fake accounts messaging your users. Bots buying out limited drops before a human can load the page. Like-farming. Credential stuffing. That's what bot protection is actually for: preventing fraud, not preventing someone from reading your public website.

Cloudflare's `/crawl` respects robots.txt. Don't want your content crawled, opt out. But if you want it indexed and can't handle the traffic spike, this gets your content out without hammering production.

As for the folks saying Cloudflare should keep blocking all crawlers forever: AI agents already drive real browsers. They click, scroll, render JavaScript. Go look at what browser automation frameworks can do today and then explain to me how you tell a bot from a person. That distinction is already gone. The hot takes are about a version of the internet that doesn't exist anymore.

ramblurr 4 hours ago||
It seems like there's a missed use case: web archiving. I don't see any mention of WARC as an output format. This could be useful to journalists and academically if they had it.
arjie 10 hours ago||
Oh man, I was hoping I could offer a nicely-crawled version of my site. It would be cool if they offered that for site admins. Then everyone who wanted to crawl would just get a thing they could get for pure transfer cost. I suppose I could build one by submitting a crawl job against myself and then offering a `static.` subdomain on each thing that people could access. Then it's pure HTML instant-load.
echoangle 10 hours ago|
I don’t really get the usecase. Is your site static? Then you should just render it to html files and host the static files. And if it’s not static, how would a snapshot of the pages help if they change later? And also why not just add some caching to the site then?
arjie 8 hours ago||
Ah the use-case is archive.org but fast. But it's okay. Before I die I will make the static copy of my site myself.
everfrustrated 11 hours ago||
Will this crawler be run behind or in front of their bot blocker logic?
shadowfiend 11 hours ago|
In front: https://developers.cloudflare.com/browser-rendering/rest-api...
devnotes77 11 hours ago|
Worth noting: origin owners can still detect and block CF Browser Rendering requests if needed.

Workers-originated requests include a CF-Worker header identifying the workers subdomain, which distinguishes them from regular CDN proxying. You can match on this in a WAF rule or origin middleware.
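The origin-middleware side of that check can be sketched by treating the request as a plain header mapping; the blocked-subdomain list is illustrative, and since HTTP header names are case-insensitive, the lookup normalizes first:

```python
def worker_subdomain(headers):
    """Return the CF-Worker header value if the request came via a
    Worker, else None. Lookup is case-insensitive."""
    normalized = {k.lower(): v for k, v in headers.items()}
    return normalized.get("cf-worker")

def should_block(headers, blocked_subdomains):
    """Block when the originating workers subdomain is on our list."""
    worker = worker_subdomain(headers)
    return worker is not None and worker in blocked_subdomains
```

The same predicate could be expressed as a WAF rule matching on the header instead of origin code.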

The trickier issue: rendered requests originate from Cloudflare ASN 13335 with a low bot score, so if you rely on CF bot scores for content protection, requests through their own crawl product will bypass that check. The practical defense is application-layer rate limiting and behavioral analysis rather than network-level scores -- which is better practice regardless.

The structural conflict is real but similar to search engines offering webmaster tools while running the index. The incentives are misaligned, but the individual products have independent utility. The harder question is whether the combination makes it meaningfully harder to build effective bot protection on top of their platform.

efilife 3 hours ago||
LLM generated comment
azinman2 7 hours ago||
They say they obey robots.txt - isn’t that the easier way?