Posted by rbanffy 18 hours ago

Beyond Downtime: Architectural Resilience on Hyperscalers (cacm.acm.org)
5 points | 4 comments
jiggawatts 12 hours ago
This is a low-value article that reads like it is AI generated even if it’s not.

Almost every instance of downtime I’ve experienced in the cloud was due to a global outage of some sort that no amount of regional redundancy could fix.

Regional redundancy is typically twice as expensive at small scales and decidedly non-trivial to implement because… where do you put your data? At most one region can have low-latency access; all the others have to deal with either eventual consistency OR very high latencies! What happens during a network partition? That’s… “fun” to deal with!
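
To make that trade-off concrete, here's a toy sketch (made-up round-trip numbers, not any provider's real API): a synchronously replicated write pays the WAN round trip on every single write, while an async one stays fast but leaves the other region behind.

    # Toy numbers only: rough round-trip times, not any cloud's real figures.
    RTT_MS = {"same_region": 1, "cross_region": 70}

    def write_latency_ms(sync_cross_region_replicas: int) -> int:
        """A write can only ack once every synchronous replica has acked."""
        if sync_cross_region_replicas == 0:
            # Async replication: fast local write, but the other region lags,
            # and you eat that lag (as staleness or data loss) on failover.
            return RTT_MS["same_region"]
        # Sync replication: every single write pays the WAN round trip.
        return max(RTT_MS["same_region"], RTT_MS["cross_region"])

    print(write_latency_ms(0))  # ~1 ms, eventually consistent elsewhere
    print(write_latency_ms(1))  # ~70 ms, strongly consistent second region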

Most groups would benefit far more from simply having seamless DevOps deploys and fast rollback.

Neither is available by default in most cloud platforms; you have to build them from fiddly little pieces like off-brand LEGO.

Proprietary pieces with no local dev experience: no syntax validation, no emulators.
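
Concretely, the "fast rollback" above usually boils down to keeping the previous release deployed and making rollback a traffic-pointer flip rather than a redeploy. A rough sketch (the version names and the set_backend call are stand-ins for whatever your load balancer or ingress actually exposes):

    # Hypothetical illustration: both releases stay running, rollback is a pointer flip.
    deployments = {"v41": "10.0.0.10", "v42": "10.0.0.11"}  # previous and current release
    live = {"backend": "v42"}

    def set_backend(version: str) -> None:
        """Stand-in for the load balancer / ingress API you actually have."""
        live["backend"] = version
        print(f"traffic -> {version} ({deployments[version]})")

    def rollback(previous: str = "v41") -> None:
        # No rebuild, no image pull, no data migration: just move traffic back.
        set_backend(previous)

    rollback()  # seconds to recover, instead of waiting on a full redeploy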

toast0 12 hours ago
Certainly the big cloud outages involve global failures, and some regional outages cascade into global ones.

But it's pretty common for a major event to happen in a single region. Datacenter fires and/or flooding happen from time to time. Extreme weather can happen. Automatic transfer switches fail from time to time. Fiber cuts happen.

Not everyone needs regional redundancy, and it does add costs, but I don't think it should be dismissed easily. If you're all in on cloudiness, you could have as little as an alternate region replica of your data and your vm images, and be ready to go manually in another region if you need to. Run some tests once or twice a year to confirm your plan works, and to make an estimate for how long it takes to restore service in the event of a regional outage. A few minutes to put up an outage page and an hour or three to restore service is probably fine... Automatic regional failover gets tricky with data consistency and split brain as you mentioned; and hopefully you don't need to do it often.
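
To make "run some tests once or twice a year" slightly less hand-wavy: the useful drill measures replica lag and actually times a restore instead of signing off on paper. A sketch with hypothetical stand-in functions:

    import time

    def replica_lag_seconds() -> float:
        """Hypothetical: how far behind the DR-region copy of the data is."""
        return 42.0  # stand-in value; really you'd compare replication timestamps

    def restore_service_in_dr_region() -> None:
        """Hypothetical: boot from replicated images, repoint DNS, smoke-test."""
        time.sleep(1)  # stand-in for the hour-or-three this actually takes

    def dr_drill(max_lag_s: float = 300, target_restore_min: float = 180) -> None:
        lag = replica_lag_seconds()
        assert lag < max_lag_s, f"replica is {lag:.0f}s behind; drill failed"
        start = time.monotonic()
        restore_service_in_dr_region()
        minutes = (time.monotonic() - start) / 60
        print(f"restored in {minutes:.1f} min (target {target_restore_min} min)")

    dr_drill()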

jiggawatts 12 hours ago
> But it's pretty common for a major event to happen in a single region.

It's actually pretty rare these days because all major clouds use zone-redundancy and hence their core services are robust to the loss of any single building. Even during the recent Iberian power outages the local cloud sites mostly (entirely?) stayed up.

The outages I've experienced over the last decade(!) were: Global certificate expiry (Azure), Crowdstrike (Windows everywhere), IAM services down globally (AWS), core inter-region router misconfiguration (customer-wide).

None would have been avoided by having more replicas in more places. All of our production systems are already zone-redundant, which is either the default or "just a checkbox" in most clouds.

This article adds no value to the discussion because it states a problem that isn't that big a deal for most people, and then doesn't provide any useful solutions for the few for whom it is a big deal.

The problem is either easy to solve -- tick the checkbox for zone-redundancy -- or very difficult to solve -- make your app's data globally replicated -- and the article just says "you should do it" without further elaboration.

That's of no value to anyone.

> Not everyone needs regional redundancy, and it does add costs, but I don't think it should be dismissed easily.

IMHO, it should be dismissed easily for almost everyone. I have far too many customers that think they need regional redundancy and end up paying 2-3x as much for something that they'll never use and wouldn't work anyway when they do need it.

> If you're all in on cloudiness, you could have as little as an alternate region replica of your data and your vm images, and be ready to go manually in another region if you need to.

This won't work for 90% of the customers that can afford it (big enterprise). Everyone, and I mean everyone, forgets about internal DNS, Active Directory, PKI, and other core services. Some web servers won't start if they're missing half their dependencies, but that's "another team"... and that other team didn't have regional redundancy as one of their requirements. "Oops".

Not to mention that most clouds would immediately "run out" of capacity during such a DR. You'd be fighting against every other customer trying to do the same thing at the same time. I've been there, done that, and I've gotten "Resource unavailable, try again" errors.

The only way to guarantee that failover actually works is to pre-reserve 100% of the required VM capacity. This requires about 2x the spend at all times, whether that capacity is used or not.
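
The arithmetic is blunt (illustrative prices only, plug in your own numbers):

    # Illustrative figures, not any provider's pricing.
    vm_hourly_usd = 0.40
    vms_for_production = 50
    hours_per_month = 730

    primary = vms_for_production * vm_hourly_usd * hours_per_month
    reserved_standby = primary  # same fleet, pre-reserved in the DR region, mostly idle
    print(f"primary only:             ${primary:,.0f}/month")
    print(f"with guaranteed failover: ${primary + reserved_standby:,.0f}/month (2x)")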

> Run some tests once or twice a year to confirm your plan works, and to make an estimate for how long it takes to restore service in the event of a regional outage.

This ends up being a completely faked paperwork exercise. Over the last few years, I've seen this little game played out in various hilarious ways, including:

1) The tests were marked as "successful", but the 1 TB of data wasn't being replicated to the DR site. The tests only ever submitted new data, which did work. "Ooops"

2) The tests involved failing over the "workload" while the file shares, domain controllers, DNS, etc... remained at the original primary location and had no replicas. "Ooops"

> A few minutes to put up an outage page and an hour or three to restore service is probably fine... Automatic regional failover gets tricky with data consistency and split brain as you mentioned; and hopefully you don't need to do it often.

Failover is the easy part. Now fail back without losing the data changes that occurred during the DR!

This is decidedly non-trivial unless you have bidirectional replication set up or a globally-available database like CosmosDB.

Inevitably the original site will come up and start accepting writes while the DR site is still up, and now you've got writes or transactions going to two places.

Reconciling that after-the-fact is awesome fun.

PS: No public cloud provides a convenient "global mutex" primitive on top of which such things can be easily built. You have to engineer this on a per-application basis, yourself. Good luck!
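
What that per-application engineering usually looks like is a lease plus a fencing token kept in whatever strongly consistent store you trust. A minimal sketch (the kv dict is an in-memory stand-in, not a real cross-region store, and the names are made up):

    import time

    kv = {"writer": None, "expires": 0.0, "token": 0}  # stand-in for a consistent store

    def acquire_write_lease(region: str, ttl_s: float = 30.0):
        """Return a fencing token if this region may accept writes, else None."""
        now = time.time()
        if kv["writer"] in (None, region) or kv["expires"] < now:
            kv["token"] += 1
            kv["writer"], kv["expires"] = region, now + ttl_s
            return kv["token"]
        return None  # someone else (maybe the revived old primary) holds the lease

    def storage_accepts(token: int) -> bool:
        # Writes carry the token, so a stale primary with an expired lease gets fenced out.
        return token == kv["token"]

    t = acquire_write_lease("dr-region")
    print(storage_accepts(t))                   # True: current lease holder
    print(acquire_write_lease("old-primary"))   # None: lease still held, writes rejected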

toast0 11 hours ago
> It's actually pretty rare these days because all major clouds use zone-redundancy and hence their core services are robust to the loss of any single building. Even during the recent Iberian power outages the local cloud sites mostly (entirely?) stayed up.

Here's one from 2023 https://www.datacenterdynamics.com/en/news/water-leak-at-par...

I've been working with GCP-hosted (cross-region) services for a few years now, and the outages I remember are that one and their recent global partial outage. I've seen some things that seem to indicate major fiber cuts (or other routing woes) centered around certain locations too, but I don't remember the details.

> The only way to guarantee that failover actually works is to pre-reserve 100% of the required VM capacity. This requires about 2x the spend at all times, whether that capacity is used or not.

Incidentally, the lesson from the global partial outage is that if I wanted to survive those, I should always run all regions at 2x indicated traffic, because in a similar future outage competing services are likely to fail and we won't be able to scale up; instances were available, but the VM images were not, so scaling wasn't actually possible. If you can't get instances when your main region is down, it is what it is... but I suspect there's enough capacity unless everyone else has picked the same two regions as you for hot and standby.

> This is decidedly non-trivial unless you have bidirectional replication set up or a globally-available database like CosmosDB.

> Inevitably the original site will come up and start accepting writes while the DR site is still up, and now you've got writes or transactions going to two places.

Depends what's going on at the original site. If the servers are flooded, chances are they're not coming up again. Assessing the likelihood of automatic return to service is part of the manual process; it's also part of what makes automatic failover hard. Bidirectional replication might help, or might just fail when the connection comes back. I'm a big fan of having traditional database servers start in read-only mode and require manual intervention to accept writes, so the human in the loop can be the mutex... but yeah, it's tricky. And cloud solutions for global consistency are expensive.
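
The "human in the loop as the mutex" bit can be as simple as a startup guard that refuses writes until an operator explicitly promotes the node. A toy sketch, not any particular database's real mechanism:

    class Database:
        def __init__(self) -> None:
            self.read_only = True  # every node boots read-only, including a revived old primary

        def promote(self, operator: str) -> None:
            """Manual step: a human decides this node is the one that accepts writes."""
            print(f"{operator} promoted this node to read-write")
            self.read_only = False

        def write(self, row: str) -> None:
            if self.read_only:
                raise PermissionError("node is read-only; promotion requires an operator")
            print(f"wrote {row}")

    db = Database()
    # db.write("x") would raise here; writes only flow after someone consciously runs:
    db.promote(operator="on-call engineer")
    db.write("x")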

If you honestly assess the costs and benefits and say whoa, that's too expensive, that's fine with me. But you should probably have a look every once in a while. And if your deployment is big enough anyway, the costs start looking relatively lower, because maybe you want some servers here and there to reduce latency, and then you need to figure out how to have the data in multiple places anyway, etc.