Posted by DamnInteresting 6 hours ago
Yes, the loose wire was the immediate cause, but there was far more going wrong here. For example:
- The transformer switchover was set to manual rather than automatic, so it didn't automatically fail over to the backup transformer.
- The crew did not routinely train on transformer switchover procedures.
- The two generators were both using a single non-redundant fuel pump (which was never intended to supply fuel to the generators!), which did not automatically restart after power was restored.
- The main engine automatically shut down when the primary coolant pump lost power, rather than using an emergency water supply or letting it overheat.
- The backup generator did not come online in time.
It's a classic Swiss Cheese model. A lot of things had to go wrong for this accident to happen. Focusing on that one wire isn't going to solve all the other issues. Wires, just like all other parts, will occasionally fail. One wire failure should never have caused an incident of this magnitude. Sure, there should probably be slightly better procedures for checking the wiring, but next time it'll be a failed sensor, actuator, or controller board.
If we don't focus on providing and ensuring a defense-in-depth, we will sooner or later see another incident like this.
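To make the arithmetic behind defense-in-depth concrete, here's a toy sketch. The layer list and the per-layer failure probabilities are invented for illustration (they are not from the report); the only assumption is the Swiss Cheese model's premise that layers fail independently.

```python
# Illustrative only: made-up per-layer failure probabilities, assuming
# the defensive layers fail independently of one another.
layers = {
    "wire/terminal": 1e-3,
    "automatic transformer switchover": 1e-2,
    "redundant fuel pump": 1e-2,
    "emergency cooling supply": 1e-2,
    "backup generator": 1e-1,
}

# The accident requires every layer to fail at once, so the
# probabilities multiply.
p_all_fail = 1.0
for p in layers.values():
    p_all_fail *= p
print(f"P(all layers fail at once) = {p_all_fail:.0e}")

# Remove any single layer and the combined probability jumps by that
# layer's factor -- which is why each disabled fallback matters so much.
for skipped in layers:
    p_rest = 1.0
    for name, p in layers.items():
        if name != skipped:
            p_rest *= p
    print(f"  without {skipped}: {p_rest:.0e}")
```

With these made-up numbers the full stack fails roughly one time in ten billion, but each layer you disable (manual switchover, non-restarting pump) strips one factor off that product.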
There are so many layers of failures that it makes you wonder how many other operations on those ships are only working because those fallbacks, automatic switchovers, emergency supplies, and backup systems save the day. We only see the results when all of them fail and the failure happens to result in some external problem that means we all notice.
As Sidney Dekker (of The Field Guide to Understanding Human Error fame) says: Murphy's Law is wrong - everything that can go wrong will usually go right. The problem arises when the operators all assume it will keep going right.
I remember reading somewhere that part of Qantas's safety record came from the fact that at one time they had the highest number of minor issues. In some sense, you want your error detection curve to be smooth: as you get closer to catastrophe, your warnings should get more severe. On this ship, it appeared everything was A-OK till it bonked a bridge.
Your car engaging auto-brake to prevent a collision shouldn't be a "whew, glad that didn't happen" moment; it should be an "oh shit, I need to work on paying attention more" moment.
> Our investigators routinely accomplish the impossible, and this investigation is no different...Finding this single wire was like hunting for a loose rivet on the Eiffel Tower.
In the software world, if I had an application that failed when a single DNS query failed, I wouldn't be pointing the blame at DNS and conducting a deep dive into why this particular query timed out. I'd be asking why a single failure was capable of taking down the app for hundreds or thousands of other users.
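That "why can one lookup take down the app?" question has a standard shape of answer: try redundant resolvers and keep a last-known-good cache. A minimal sketch of the idea, where the resolver callables and the cache are stand-ins rather than a real DNS API:

```python
# Sketch: tolerate any single resolver failure by trying fallbacks and,
# as a last resort, serving a cached (possibly stale) answer.
from typing import Callable, Optional


def resolve_with_fallback(
    name: str,
    resolvers: list[Callable[[str], str]],
    cache: dict[str, str],
) -> Optional[str]:
    for resolver in resolvers:
        try:
            addr = resolver(name)
            cache[name] = addr  # refresh the last-known-good answer
            return addr
        except Exception:
            continue  # one resolver failing is expected; try the next
    return cache.get(name)  # stale-but-usable beats a total outage


# Usage: the first resolver times out, the second succeeds.
def broken(_name: str) -> str:
    raise TimeoutError("query timed out")


def working(name: str) -> str:
    return "192.0.2.10"  # RFC 5737 documentation address


cache: dict[str, str] = {}
print(resolve_with_fallback("example.com", [broken, working], cache))
```

Same principle as the ship: the wire (or the query) is allowed to fail; the system around it shouldn't be.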
The YouTube animation they published notes that this also wasn't just one wire - they found many wires on the ship terminated and labeled in the same (incorrect) way. That points to an error at the shipbuilder, and potentially a lack of adequate documentation or training materials from the equipment manufacturer, which is why WAGO received mention and notice.
The flushing pump failing to restart when power resumed also caused a blackout in port the day before the incident. But looking into why one blackout always became two is something anybody could have done: open the main system breaker, let the crew restore power, and that flushing pump would likely fail the same way every time. Figuring out why and how the breaker opened in the first place is the interesting part, when the cause isn't obvious.
The NTSB also had some comments on the ship's equivalent of a black box. It turned out to be impossible to download the data while the recorder was still inside the ship; the manufacturer's software was awful, to the point that the various agencies had a group chat for sharing third-party software(!); the export produced thousands of separate files; the audio tracks were mixed together to the point of being nearly unusable; and the recorder stopped logging some metrics after the power loss "because it wasn't required to" - despite the data still being available.
At least they didn't have anything negative to say about the crew: they reacted promptly and appropriately - they just didn't stand a chance.
The regular fuel pumps were set up to automatically restart, which is why a set of them came online to feed generator 3 (which automatically spun up after generators 1 and 2 failed, and wasn't tied to the fuel-line-flushing pump) after the second blackout.
I remember that the IT guys at my old company used to immediately throw out every Ethernet cable and replace it, first thing, with one straight out of the bag.
But these ships tend to be houses of cards. They are not taken care of properly, and run on a shoestring budget. Many of them look like floating wrecks.
If I come across a CATx (solid core) cable being used as a really long patch lead then I lose my shit or perhaps get a backbox and face plate and modules out along with a POST tool.
I don't look after floating fires.
And the physical layer issues I do see are related to ham fisted people doing unrelated work in the cage.
Actual failures are pretty damn rare.
Like you said (and as the book illustrates well), it's never just one thing; these incidents happen when multiple systems interact, and they often reflect disinvestment in comprehensive safety schemes.
> The settlement does not include any damages for the reconstruction of the Francis Scott Key Bridge. The State of Maryland built, owned, maintained, and operated the bridge, and attorneys on the state’s behalf filed their own claim for those damages. Pursuant to the governing regulation, funds recovered by the State of Maryland for reconstruction of the bridge will be used to reduce the project costs paid for in the first instance by federal tax dollars.
If everyone saved $100M by doing this and it only cost one shipper $100M, then of course everyone else would do it and just hope they aren’t the one who has bad enough luck to hit the bridge.
And statistically, almost all of them will be okay!
Basically, the line of causation of the mishap has to pass through a metaphorical block of Swiss cheese, and a mishap only occurs if all the holes in the cheese line up. Otherwise, something happens (planned or otherwise) that allows you to dodge the bullet this time.
Meaning a) it's important to identify places where firebreaks and redundancies can be put in place to guard against failures further upstream, and b) it's important to recognize times when you had a near-miss, and still fix those root causes as well.
Which is why the "retrospectives are useless" crowd spins me up so badly.
I mentioned this principle to the traffic engineer when someone almost crashed into me because of a large sign that blocked their view. The engineer looked into it and said the sight lines were within spec, but just barely, so they weren't going to do anything about it. Technically the person who almost hit me could have pulled up to where they had a good view, and looked both ways as they were supposed to, but that is relying on one layer of the cheese to fix a hole in another, to use your analogy.
The fact that the situation on the ground isn't safe in practice is irrelevant to the law. Legally the obstruction is within the rules, so the blame falls on the driver. At best a "tragic accident" will result in a "recommendation" to whatever board is responsible for the rules to review them.
Which is why if you want to be a bastard, you send it to the owners, the city, and both their insurance agencies.
If your goal is to get the intersection fixed, this is a reasonable thing to do.
That we allow terrible drivers to drive is another matter...
When I see complaints about retrospectives from software devs they're usually about agile or scrum retrospective meetings, which have evolved to be performative routines. They're done every sprint (or week, if you're unlucky) and even if nothing happens the whole team might have to sit for an hour and come up with things to say to fill the air.
In software, the analysis following a mishap is usually called a post-mortem. I haven't seen many complaints that those have no value; they are usually highly appreciated. Though sometimes the "blameless post-mortem" people take the term a little too literally and avoid exploring useful failures if they might cause uncomfortable conversations about individuals making mistakes or even dropping the ball.
You mean to tell me that this comment section, where we spew buzzwords and reference the same tropes we do for every "disaster", isn't performative?
Regarding blamelessness, I think it was W. Edwards Deming who emphasized blaming process over people. That is always preferable, but it's critical for individuals to at least be aware of their role in the problem.
It is nice though (as long as there isn't anyone in there that the team is afraid to be honest in front of) when people can vent about something that has been pissing them off, so that I as their manager know how they feel. But that happens only about 15-20% of the time. The rest is meaningless tripe like "Glad Project X is done", "$TECHNOLOGY sucks", and "Good job to Bob and Susan for resolving the issue with the Acme account".
I absolutely heard that in Hoover's voice.
Is there an equivalent to YouTube's Pilot Debrief or other similar channels but for ships?
As an Ops person, I've said that before when talking about software, mainly because most companies refuse to listen to the lessons inside them - so why am I wasting time doing this?
To put it in aviation terms, I'll write up something like (numbers made up): "Hey, V1 for a Hornet loaded at 49,000 pounds needs to be 160 knots, so it needs 10,000 feet for takeoff." Well, the sales team comes back and says NAS Norfolk is only 8,700 ft and the customer demands 49,000+ loads, we are not losing revenue, so quiet, Ops nerd!
Then the 49,000+ Hornet loses an engine, overruns the runway, the fireball I said would happen happens, and everyone is SHOCKED, SHOCKED I TELL YOU, that this is happening.
Except it's software and not aircraft and loss was just some money, maybe, so no one really cares.
The metaphor relies on you mixing and matching some different batches of presliced Swiss cheese. In a single block, the holes in the cheese are guaranteed to line up, because they are two-dimensional cross sections of three-dimensional gas bubbles. The odds of a hole in one slice of Swiss cheese lining up with another hole in the following slice are very similar to the odds of one step in a staircase being followed by another step.
You cannot build a Swiss cheese safety model out of correlated errors - just as the metaphor fails if all the slices come from the same block of cheese!
You have to ensure the holes come from different processes and systems - that your slices of Swiss cheese come from different blocks!
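The difference is easy to show numerically. A toy Monte Carlo, assuming two defensive layers that each "have a hole" 10% of the time: independent holes (separate random draws) versus perfectly correlated holes (one shared cause decides both, like slices cut from the same block). The 10% figure is invented for illustration.

```python
# Toy simulation: independent vs perfectly correlated layer failures.
import random

random.seed(0)
trials = 100_000
independent = correlated = 0
for _ in range(trials):
    # Independent layers: two separate draws must both fail.
    if random.random() < 0.1 and random.random() < 0.1:
        independent += 1
    # Correlated layers: a single shared-cause draw fails both at once.
    if random.random() < 0.1:
        correlated += 1

print(f"independent holes: ~{independent / trials:.3f} (expected 0.010)")
print(f"correlated holes:  ~{correlated / trials:.3f} (expected 0.100)")
```

With independence the second layer buys you a factor of ten; with a shared cause it buys you nothing - the "second layer" was never really a second layer.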
Edit wars aside, it's a nice philosophical question.
https://en.wikipedia.org/wiki/Francis_Scott_Key_Bridge_(Balt...
A lot of people wildly under-crimp things, and marine vessels not only have nuanced wire requirements but also more stringent crimping requirements, which the field at large frustratingly refuses to adhere to despite ABYC and other codes insisting on them.
The good tools will crimp to the proper pressure and make it obvious when it has happened.
Unfortunately the good tools aren't cheap. Even when they are used, some techs will substitute their own ideas of how a crimp should be made when nobody is watching them.
So, waiting time aside, I can go from an EPLAN drawing to "send me pre-crimped, labeled wires that were cut, crimped, and labeled by machine and automatically tested to spec", because this now exists as a service accessible even to random folks.
It is not even expensive.
The bad contact with the wire was just the trigger; it should have been recoverable had the regular fuel pumps been running.
Was a FMECA (Failure Mode, Effects, and Criticality Analysis) performed on the design prior to implementation in order to find the single points of failure, and identify and mitigate their system level effects?
Evidence at hand suggests "No."
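For anyone unfamiliar with FMECA, the core of it is mechanical: enumerate failure modes, score each for severity and occurrence, rank by criticality, and flag anything without redundancy as a single point of failure. A much-simplified sketch - the modes, scores, and redundancy flags below are illustrative guesses, not the ship's actual analysis:

```python
# Simplified FMECA-style pass: criticality = severity x occurrence,
# with non-redundant items flagged as single points of failure (SPOF).
# All entries and scores are made up for illustration.
failure_modes = [
    # (component, failure mode, severity 1-10, occurrence 1-10, redundant?)
    ("HV wire terminal", "loose connection", 9, 4, False),
    ("flushing fuel pump", "no auto-restart after blackout", 9, 3, False),
    ("transformer switchover", "left in manual mode", 8, 3, False),
    ("backup generator", "slow to come online", 7, 2, True),
]

ranked = sorted(failure_modes, key=lambda m: m[2] * m[3], reverse=True)
for comp, mode, sev, occ, redundant in ranked:
    crit = sev * occ
    spof = "" if redundant else "  <-- single point of failure"
    print(f"{crit:>3}  {comp}: {mode}{spof}")
```

Even this crude version would have surfaced the non-redundant flushing pump and the manual switchover before a wire ever got the chance to trigger them.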
That's true in this case, as well. There was a long cascade of failures including an automatic switchover that had been disabled and set to manual mode.
The headlines about a loose wire are the media's way of reducing it to an understandable headline.