Posted by earthboundkid 2 hours ago
Mar 01 9:41 AM PST
We want to provide some additional information on the power issue in a single Availability Zone in the ME-CENTRAL-1 Region. At around 4:30 AM PST, one of our Availability Zones (mec1-az2) was impacted by objects that struck the data center, creating sparks and fire. The fire department shut off power to the facility and generators as they worked to put out the fire. We are still awaiting permission to turn the power back on, and once we have, we will ensure we restore power and connectivity safely. It will take several hours to restore connectivity to the impacted AZ. The other AZs in the region are functioning normally.
Essentially, he was trying to assure us that no-no-no, we don’t need multiple zones like the public clouds, they can instead guarantee 100% uninterrupted power under all circumstances.
A bit bored and annoyed, I pointed to the giant red button conspicuously placed in the middle of a pillar and asked what it is for.
“Oh, that’s in case there’s a fire!”
“What does it do?”
“It cuts… the power… uhh… for the safety of the fire department.”
“So… if there’s a wisp of smoke in a corner somewhere, the fireys turn up, the first thing they do is… cut the power?”
“… yes.”
“Not 100% then, is it?”
This would require human intervention, and I am a bit worried: what if a strike happens again and human lives are lost?
IIRC there have been cases in history where the same location is targeted across multiple days. Obviously, AWS might have local employees working in the region, but would this threat itself be evaluated by the relevant team within AWS? What if they try to bring the service back, missiles strike again, and human lives are lost? Let's just hope this is part of the evaluation as well.
that's the difference between heroes and ordinary employees who bitch about having to go into the office twice a month.
same as the stories you hear of guys taking snow-cats up a mountain in a blizzard to restore phone circuits or radio transmitters gone offline.
Seems like it should be somewhat easier to bomb 50 datacenters than it would be to hack and disrupt 1000s of different services.
Again, this is just me thinking out loud on a tangent and this doesn't have much to do with this story, but I felt it was an interesting thought to share nonetheless.
For infrastructure reasons, we plonk datacenters down next to airports big enough to fly major hardware into, and near where the big oceanic cables come ashore… and for strategic reasons those are also the perfect places to put military bases.
We seem to be really bad at separating those two. For example, Starlink is basically military infrastructure now, used to guide bombs.
Previous outage news makes it sound like the cloud providers still have quite a few logical single points of failure.
I do think, though, that at least from the prior Anthropic decision, we know that Anthropic, which was used by the DoD, should be running on normal AWS datacenters.
I am saying this because the DoD threatened to forcibly take Anthropic's source code if they didn't agree to egregious demands, which means the DoD doesn't have the source code.
Perhaps the DoD used Anthropic within AWS military modular DCs, but I find it extremely unlikely.
I am almost certain that even with OpenAI, which bent the knee to the DoD, it's still hosted on regular infrastructure, and the DoD is using these AI models for pretty sensitive tasks (during the capture of Venezuela's Maduro, Anthropic/Claude was IIRC used to handle some data analysis).
Though IMO, any employee from Anthropic/OpenAI would know better how these models are actually deployed.
The bigger worry for me is that if someone nukes 50 datacenters all at once, or say all of Amazon's datacenters at once, the data stored in them would simply be gone, especially given that so many datacenters are located in Virginia, USA (IIRC) and so many companies are reliant on a few datacenter providers.
The larger threat to me with the loss of data is firstly the panic around public-facing services, but also hedge funds, pension funds, and banking datacenters; if they lose their data, it's going to cause even more public mayhem.
Some might say that off-site backups exist, but there has been at least one instance where a single Google accident led to massive issues for a $135 billion pension fund.
Relevant Kevin Fang video about it: https://www.youtube.com/watch?v=3GOAUyipnM4 [Google Accidentally Deletes $135 Billion Pension Fund, Chaos Ensues]
> The other AZs in the region are functioning normally. Customers who were running their applications redundantly across the AZs are not impacted by this event.
I bet that was an interesting sev2 ticket!
The other ones are not impacted. They always like to tell you to pay for more than one instance in different AZs so if this happens you don't get impacted.
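For what it's worth, here's roughly what that multi-AZ advice looks like in practice. A minimal sketch using boto3, assuming an existing launch template (the template name here is made up):

```python
import boto3

# Sketch: spread identical instances across every available AZ in the
# region, so losing one AZ doesn't take the whole service down.
ec2 = boto3.client("ec2", region_name="me-central-1")

# Discover the AZs currently available in the region.
azs = [
    zone["ZoneName"]
    for zone in ec2.describe_availability_zones()["AvailabilityZones"]
    if zone["State"] == "available"
]

for az in azs:
    # "my-service-template" is a hypothetical launch template; substitute
    # your own AMI/instance configuration.
    ec2.run_instances(
        MinCount=1,
        MaxCount=1,
        LaunchTemplate={"LaunchTemplateName": "my-service-template"},
        Placement={"AvailabilityZone": az},
    )
```

In practice you'd more likely let an Auto Scaling group do this by giving it one subnet per AZ, but the idea is the same.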
God forbid we'd ever say that it was struck by a missile or a munition in an act of war.
Doesn’t really matter, we know Trump’s latest war is the cause.
Conrad: I got three fuel cell lights, an AC bus light, a fuel cell disconnect, AC bus overload 1 and 2, Main Bus A and B out.
Aaron: Flight, EECOM. Try SCE to Aux.
Modern culture in the movies and whatnot is that someone should be yelling "Everything's failing. Give me something, Houston. All lights are on! MAYDAY MAYDAY!" and some sort of flavour commentary like that. But reading engineering updates that go like this feels like watching maximal professionalism under fire:

> At around 4:30 AM PST, one of our Availability Zones (mec1-az2) was impacted by objects that struck the data center, creating sparks and fire. The fire department shut off power to the facility and generators as they worked to put out the fire. We are still awaiting permission to turn the power back on, and once we have, we will ensure we restore power and connectivity safely. It will take several hours to restore connectivity to the impacted AZ. The other AZs in the region are functioning normally. Customers who were running their applications redundantly across the AZs are not impacted by this event. EC2 Instance launches will continue to be impaired in the impacted AZ. We recommend that customers continue to retry any failed API requests. If immediate recovery of an affected resource (EC2 Instance, EBS Volume, RDS DB Instance, etc.) is required, we recommend restoring from your most recent backup, by launching replacement resources in one of the unaffected zones, or an alternate AWS Region. We will provide an update by 12:30 PM PST, or sooner if we have additional information to share.
This has that same mechanical tone of an ice-cold captain dealing with a proximate situation providing exactly the information they know. No flavour commentary. Amazing. I fucking love it.
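As an aside, the "continue to retry any failed API requests" advice is normally implemented as capped exponential backoff with jitter. A rough sketch of the pattern (not AWS's actual SDK internals):

```python
import random
import time

def retry_with_backoff(call, max_attempts=8, base_delay=0.5, cap=30.0):
    """Retry a flaky zero-argument call with capped exponential backoff
    plus full jitter, re-raising the last error if we run out of tries."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last failure
            # Sleep somewhere in [0, min(cap, base_delay * 2^attempt)].
            time.sleep(random.uniform(0.0, min(cap, base_delay * 2**attempt)))
```

The AWS SDKs will also do this for you; in boto3 it's botocore's `Config(retries={"mode": "adaptive", "max_attempts": 10})`.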
(Lightning at 1:03)
Its products are sequences of electrons, instead of atoms. But so are power plants. And in the context of what happens when they're hit by missiles, a factory, data center, and power plant all behave the same.
They mention that the datacenter had sparks and fires, and they are talking about hours of downtime, but given the situation, how does that prevent this from happening again? It's best for people to use safer regions than the Middle East for the moment, as missiles might target the same datacenter again, seeing that some damage was already caused.
Moving forward, will there be demand (albeit small) for nuclear-bunker-esque datacenters which can withstand missiles? I know absolutely nothing about underground construction, but can explosives not be used to excavate underground datacenters comparatively cheaply? One could also use revamped nuclear bunkers (although the scale of AWS datacenters might be too huge, who knows).
Found some work suggesting this idea might be interesting: https://www.nature.com/articles/s44284-026-00406-2
I am curious what safety measures are taken by internet exchange providers or (had to look it up) submarine cable landing stations; it feels to me like blowing these up would cause internet downtime across a whole country / between providers.
Competition, deregulation, and a lack of attacks lead toward less robust installations to reduce costs. Geographically redundant installations help as long as they aren't all targeted, and they're valuable for operational concerns other than just attacks.
Those already exist. See for example Bahnhof's "Pionen - White Mountain" data center in Stockholm, or Cyberfort's "The Bunker" a bit west of London.
An out-of-control wildfire levels the entire city? The Big One hits the Bay Area? The entire city is flooded for a few months because the levees break during a Cat5 hurricane? Yeah, your DC will be completely ruined. And even if it isn't, you're probably not getting any outside power, generator fuel, or repair technicians for a while.
No matter how much money you pump into hardening your own super-bunker DC, there will always be disasters you aren't prepared for. At a certain point it just makes more financial sense to abandon the idea of invulnerability and build a redundant site a few states over. Accept that you will occasionally lose one, and only protect against incidents where mitigation is cheaper than occasionally rebuilding.
You don't. Instead, you make sure your failover or DR setup is regularly tested and works.
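To make that concrete, "regularly tested" can be as simple as a scheduled game-day script that restores the latest snapshot into another region and fails loudly if it can't. A hypothetical sketch with boto3 and RDS (all identifiers made up, and it assumes snapshots are already copied cross-region):

```python
import boto3

SOURCE_DB = "prod-db"          # hypothetical production instance
DRILL_DB = "dr-drill-prod-db"  # throwaway instance for the drill
DR_REGION = "eu-west-1"        # any region other than the primary

rds = boto3.client("rds", region_name=DR_REGION)

# Pick the newest completed snapshot of the production database.
snapshots = rds.describe_db_snapshots(DBInstanceIdentifier=SOURCE_DB)["DBSnapshots"]
latest = max(
    (s for s in snapshots if "SnapshotCreateTime" in s),
    key=lambda s: s["SnapshotCreateTime"],
)

# Restore it under a drill-specific name...
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier=DRILL_DB,
    DBSnapshotIdentifier=latest["DBSnapshotIdentifier"],
)

# ...and block until it's actually available. If this times out, your DR
# story is broken and you want to find out now, not during an outage.
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=DRILL_DB)
print("DR drill restore succeeded; run smoke tests, then tear it down.")
```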