Posted by earthboundkid 2 hours ago
Mar 01 9:41 AM PST
We want to provide some additional information on the power issue in a single Availability Zone in the ME-CENTRAL-1 Region. At around 4:30 AM PST, one of our Availability Zones (mec1-az2) was impacted by objects that struck the data center, creating sparks and fire. The fire department shut off power to the facility and generators as they worked to put out the fire. We are still awaiting permission to turn the power back on, and once we have, we will ensure we restore power and connectivity safely. It will take several hours to restore connectivity to the impacted AZ. The other AZs in the region are functioning normally.
Essentially, he was trying to assure us that no-no-no, we don’t need multiple zones like the public clouds, they can instead guarantee 100% uninterrupted power under all circumstances.
A bit bored and annoyed, I pointed to the giant red button conspicuously placed in the middle of a pillar and asked what it is for.
“Oh, that’s in case there’s a fire!”
“What does it do?”
“It cuts… the power… uhh… for the safety of the fire department.”
“So… if there’s a wisp of smoke in a corner somewhere, the fireys turn up, the first thing they do is… cut the power?”
“… yes.”
“Not 100% then, is it?”
This would require human intervention, and I am a bit worried: what if a strike happens again and human lives are lost?
IIRC there have been cases in history where the same location is targeted across multiple days. Obviously, AWS might have local employees working in the region, but would this threat itself be evaluated by the relevant team within AWS? What if they try to bring the service back, missiles strike again, and human lives are lost? Let's just hope this is part of the evaluation as well.
that's the difference between heroes and ordinary employees who bitch about having to go into the office twice a month.
same as the stories you hear of guys taking snow-cats up a mountain in a blizzard to restore phone circuits or radio transmitters gone offline.
Seems like it should be somewhat easier to bomb 50 datacenters than it would be to hack and disrupt 1000s of different services.
Again, this is just me thinking out loud on a tangent and this doesn't have much to do with this story, but I felt it was an interesting thought to share nonetheless.
For infrastructure reasons, we plonk datacenters down next to airports big enough to fly major hardware into, and near where the big oceanic cables come ashore… and for strategic reasons those are also the perfect places to put military bases.
We seem to be really bad at separating those two. For example, Starlink is basically military infrastructure now, used to guide bombs.
Previous outage news makes it sound like the cloud providers still have quite a few logical single points of failure.
I do think, though, that at least from the prior Anthropic decision, we know that Anthropic, which was used by the DoD, should be running on normal AWS datacenters.
I am saying this because the DoD threatened to forcibly take Anthropic's source code if they didn't agree to egregious demands, which means the DoD doesn't have the source code.
Perhaps the DoD used Anthropic within AWS military modular DCs, but I find it extremely unlikely.
I am almost certain that even with OpenAI, which bent the knee to the DoD, it's still hosted on regular infrastructure, and the DoD is using these AI models for pretty sensitive tasks (during the capture of Venezuela's Maduro, Anthropic/Claude was IIRC used to handle some data analysis).
Though IMO, any employee from Anthropic/OpenAI would know better how these models are actually deployed.
The bigger worry for me is that if someone nukes 50 datacenters all at once, or say all of Amazon's datacenters at once, the data stored in them would simply be gone, especially given that so many datacenters are located in Virginia, USA (IIRC) and so many companies are reliant on a few datacenter providers.
The larger threat to me with the loss of data is firstly the panic around public-facing services, but also hedge funds, pension funds, and banking datacenters; if they lose their data, it's going to cause even more public mayhem.
Some might say that off-site backups exist, but there has been at least one instance where a single Google accident led to massive issues for a $135 billion pension fund.
Relevant Kevin Fang video about it: https://www.youtube.com/watch?v=3GOAUyipnM4 [Google Accidentally Deletes $135 Billion Pension Fund, Chaos Ensues]
> The other AZs in the region are functioning normally. Customers who were running their applications redundantly across the AZs are not impacted by this event.
I bet that was an interesting sev2 ticket!
The other ones are not impacted. They always like to tell you to pay for more than one instance in different AZs so if this happens you don't get impacted.
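For what it's worth, here's roughly what that multi-AZ advice looks like in practice. A minimal sketch using boto3, assuming an existing launch template (the template name here is made up):

```python
import boto3

# Sketch: spread identical instances across every available AZ in the
# region, so losing one AZ doesn't take the whole service down.
ec2 = boto3.client("ec2", region_name="me-central-1")

# Discover the AZs currently available in the region.
azs = [
    zone["ZoneName"]
    for zone in ec2.describe_availability_zones()["AvailabilityZones"]
    if zone["State"] == "available"
]

for az in azs:
    # "my-service-template" is a hypothetical launch template; substitute
    # your own AMI/instance configuration.
    ec2.run_instances(
        MinCount=1,
        MaxCount=1,
        LaunchTemplate={"LaunchTemplateName": "my-service-template"},
        Placement={"AvailabilityZone": az},
    )
```

In practice you'd more likely let an Auto Scaling group do this by giving it one subnet per AZ, but the idea is the same.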
God forbid we'd ever say that it was struck by a missile or a munition in an act of war.
Doesn’t really matter, we know Trump’s latest war is the cause.
Conrad: I got three fuel cell lights, an AC bus light, a fuel cell disconnect, AC bus overload 1 and 2, Main Bus A and B out.
Aaron: Flight, EECOM. Try SCE to Aux.
Modern culture in the movies and whatnot is that someone should be yelling "Everything's failing. Give me something, Houston. All lights are on! MAYDAY MAYDAY!" and some sort of flavour commentary like that. But reading engineering updates that go like this feels like watching maximal professionalism under fire:

> At around 4:30 AM PST, one of our Availability Zones (mec1-az2) was impacted by objects that struck the data center, creating sparks and fire. The fire department shut off power to the facility and generators as they worked to put out the fire. We are still awaiting permission to turn the power back on, and once we have, we will ensure we restore power and connectivity safely. It will take several hours to restore connectivity to the impacted AZ. The other AZs in the region are functioning normally. Customers who were running their applications redundantly across the AZs are not impacted by this event. EC2 Instance launches will continue to be impaired in the impacted AZ. We recommend that customers continue to retry any failed API requests. If immediate recovery of an affected resource (EC2 Instance, EBS Volume, RDS DB Instance, etc.) is required, we recommend restoring from your most recent backup, by launching replacement resources in one of the unaffected zones, or an alternate AWS Region. We will provide an update by 12:30 PM PST, or sooner if we have additional information to share.
This has that same mechanical tone of an ice-cold captain dealing with a proximate situation providing exactly the information they know. No flavour commentary. Amazing. I fucking love it.
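As an aside, the "continue to retry any failed API requests" advice is normally implemented as capped exponential backoff with jitter. A rough sketch of the pattern (not AWS's actual SDK internals):

```python
import random
import time

def retry_with_backoff(call, max_attempts=8, base_delay=0.5, cap=30.0):
    """Retry a flaky zero-argument call with capped exponential backoff
    plus full jitter, re-raising the last error if we run out of tries."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last failure
            # Sleep somewhere in [0, min(cap, base_delay * 2^attempt)].
            time.sleep(random.uniform(0.0, min(cap, base_delay * 2**attempt)))
```

The AWS SDKs will also do this for you; in boto3 it's botocore's `Config(retries={"mode": "adaptive", "max_attempts": 10})`.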
(Lightning at 1:03)
Its products are sequences of electrons, instead of atoms. But so are power plants. And in the context of what happens when they're hit by missiles, a factory, data center, and power plant all behave the same.
They mention that the datacenter had sparks and fires, and they are talking about hours of downtime, but given the situation, how does that prevent this from happening again? It's best for people to use safer regions than the Middle East for the moment, as missiles might target the same datacenter again, seeing that some damage was already caused.
Moving forward, will there be demand (albeit small) for nuclear-bunker-esque datacenters which can withstand missiles? I know absolutely nothing about underground construction, but can explosives not be used to excavate underground datacenters comparatively cheaply? One could also use revamped nuclear bunkers (although the scale of AWS datacenters might be too huge, who knows).
Found some work suggesting this idea might be interesting: https://www.nature.com/articles/s44284-026-00406-2
I am curious what safety measures are taken by internet exchange providers or (had to look it up) submarine cable landing stations; it feels to me like blowing these up would cause internet downtime across a whole country / between providers.
Competition, deregulation, and a lack of attacks lead toward less robust installations to reduce costs. Geographically redundant installations help as long as they aren't all targeted, and they're valuable for operational concerns other than just attacks.
Those already exist. See for example Bahnhof's "Pionen - White Mountain" data center in Stockholm, or Cyberfort's "The Bunker" a bit west of London.
An out-of-control wildfire levels the entire city? The Big One hits the Bay Area? The entire city is flooded for a few months because the levees break during a Cat5 hurricane? Yeah, your DC will be completely ruined. And even if it isn't, you're probably not getting any outside power, generator fuel, or repair technicians for a while.
No matter how much money you pump into hardening your own super-bunker DC, there will always be disasters you aren't prepared for. At a certain point it just makes more financial sense to abandon the idea of invulnerability and build a redundant site a few states over. Accept that you will occasionally lose one, and only protect against incidents where mitigation is cheaper than occasionally rebuilding.
You don't. Instead, you make sure your failover or DR setup is regularly tested and works.
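To make that concrete, "regularly tested" can be as simple as a scheduled game-day script that restores the latest snapshot into another region and fails loudly if it can't. A hypothetical sketch with boto3 and RDS (all identifiers made up, and it assumes snapshots are already copied cross-region):

```python
import boto3

SOURCE_DB = "prod-db"          # hypothetical production instance
DRILL_DB = "dr-drill-prod-db"  # throwaway instance for the drill
DR_REGION = "eu-west-1"        # any region other than the primary

rds = boto3.client("rds", region_name=DR_REGION)

# Pick the newest completed snapshot of the production database.
snapshots = rds.describe_db_snapshots(DBInstanceIdentifier=SOURCE_DB)["DBSnapshots"]
latest = max(
    (s for s in snapshots if "SnapshotCreateTime" in s),
    key=lambda s: s["SnapshotCreateTime"],
)

# Restore it under a drill-specific name...
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier=DRILL_DB,
    DBSnapshotIdentifier=latest["DBSnapshotIdentifier"],
)

# ...and block until it's actually available. If this times out, your DR
# story is broken and you want to find out now, not during an outage.
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=DRILL_DB)
print("DR drill restore succeeded; run smoke tests, then tear it down.")
```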