Posted by simplegeek 1 day ago
> In this case, we can set up an invariant stating that the DNS should never be deleted once a newer plan has been applied
If that invariant had been expressed in the original code — as I'm sure it now is — it wouldn't have broken in the first place. The invariant is obvious in hindsight, but it's hardly axiomatic.
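To make that concrete, here is a minimal sketch in Python (hypothetical names, not AWS's actual code or spec) of what writing that invariant down as an executable check against a toy model might look like:

    from dataclasses import dataclass

    @dataclass
    class DnsState:
        latest_applied: int = 0     # generation of the newest plan applied so far
        record_exists: bool = True  # does the endpoint's DNS record still exist?

    def apply_plan(s: DnsState, gen: int) -> None:
        # The enactor applies a plan: the record is (re)written and the generation advances.
        if gen > s.latest_applied:
            s.latest_applied = gen
        s.record_exists = True

    def cleanup_plan(s: DnsState, gen: int) -> None:
        # Garbage-collects plan `gen`. Nothing here re-checks the current generation,
        # which is exactly the gap the invariant below is meant to catch.
        s.record_exists = False

    def invariant(s: DnsState) -> bool:
        # "The DNS record should never be deleted once a newer plan has been applied."
        return s.latest_applied == 0 or s.record_exists

Exploring the handful of interleavings of those two actions, by hand or with a checker, immediately surfaces the violating trace apply_plan(2) followed by cleanup_plan(1), which is roughly the shape of failure under discussion.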
While one can and will add invariants as they are discovered, they cannot all be found up front.
The Entscheidungsproblem and Trakhtenbrot's theorem apply here: counterintuitively, validity over finite models is co-r.e. but not r.e.
Validity in this case does not depend on the truth of the premises or the truth of the conclusion.
Basically, we have to use tools like systems thinking to construct robust systems; we cannot universally apply formal methods across frames.
This is one way in which race conditions are complex.
Hindsight bias makes it seem easy, but that is because hindsight lives on the co-r.e. side: with a concrete counterexample in hand you can always refute, you just cannot enumerate all the invariants in advance.
Well-intended actions taken with hindsight can often result in brittle systems, as their composition tends to set the system in stone, with the belief that the axioms are the final answer.
The fact that Gödel's completeness theorem may not apply to finite models, when it works so well for infinite ones, is hard for me to remember.
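For reference, the contrast being drawn here is just the standard statements of the two theorems (nothing specific to this incident), put side by side:

    \[
    \begin{aligned}
    \mathrm{Val}    &= \{\varphi \mid \varphi \text{ holds in all models}\}
        &&\in \Sigma^0_1 \quad \text{(r.e.; Gödel completeness)}\\
    \mathrm{FinVal} &= \{\varphi \mid \varphi \text{ holds in all finite models}\}
        &&\in \Pi^0_1 \setminus \Sigma^0_1 \quad \text{(co-r.e. but not r.e.; Trakhtenbrot)}
    \end{aligned}
    \]

So when a sentence fails in some finite model, brute-force search over finite structures will eventually hand you the counterexample (which is why hindsight is cheap), but there is no proof procedure that enumerates everything that holds in all finite models up front.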
Remembering that axiomatization is a powerful tool but not a silver bullet has helped me more times than I can count.
[0] http://jmc.stanford.edu/articles/circumscription/circumscrip...
I agree this article is very hindsight-biased, though. We do need a way to model the failure modes we can think of, but we also need a method that helps us discover what the failure modes are, in a systematic manner that doesn't suffer from "oops, we forgot the one way it was actually going to fail".
But I see this post less as an incident analysis and more as an experiment in learning from hindsight. The goal, it seems, isn’t to replay what happened, but to show how formal methods let us model a complex system at a conceptual level, without access to every internal detail, and still reason about where races or inconsistencies could emerge.
Agree, but Time-of-Check to Time-of-Use (TOCTOU) is a pretty well-established failure mode.
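For anyone who hasn't run into the term, TOCTOU is easy to show; here is a deliberately simplified Python sketch (hypothetical, not taken from the incident) where the check and the action are separated by a window in which another actor can change the state:

    import time

    # endpoint -> generation of the plan whose DNS record is currently live
    live_plan = {"db.example.internal": 1}

    def cleanup_stale(endpoint: str, latest_known: int) -> None:
        # Time of check: the live record looks older than the latest plan we know about.
        if live_plan.get(endpoint, latest_known) < latest_known:
            time.sleep(5)  # stand-in for a GC pause, a retry storm, or a dead switch
            # Time of use: a newer plan may have been applied in the meantime,
            # but we delete based on the stale check above.
            live_plan.pop(endpoint, None)

    def apply_plan(endpoint: str, generation: int) -> None:
        live_plan[endpoint] = generation

If apply_plan("db.example.internal", 3) runs from another worker inside that window, the delete removes a record a newer plan just wrote. The fix is the usual one: make the check and the delete a single atomic step (e.g. a conditional delete keyed on the generation) rather than trusting the earlier read.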
As mentioned previously, I think the only tools that are really valuable are those able to produce code, naturally with multiple possible languages as backends, which is then used as a library from the application code.
Something like Lean, F*, Dafny, even if that isn't exactly the same as TLA+.
Scenario proof modeling that is subject to human translation errors is about as useful in practice as doing PowerPoint-driven software architecture.
Every step of the flow requires an expert in TLA+ and in the actual programming language and frameworks being used to manually validate that everything still means exactly the same thing across all layers.
The obvious difference with PowerPoint design is that non-trivial failure modes can be surfaced automatically if they're reflected in the toy model - PowerPoint slides don't do this.
You don't even have to use TLA itself for this purpose; a Lean development could also do this - but then you would have to put together much of the basic formalism (which ultimately works by combining the "modalities" of time, state and non-determinism) on your own.
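On the "surfaced automatically" point: the core of what TLC does is a reachability search over a small state space, which you can sketch in a few lines of Python (illustrative only, nowhere near a real checker):

    from collections import deque

    def check(initial_states, next_states, invariant):
        """Breadth-first search over reachable states; returns a counterexample
        trace if the invariant ever fails, else None."""
        seen = set()
        queue = deque((s, (s,)) for s in initial_states)
        while queue:
            state, trace = queue.popleft()
            if state in seen:
                continue
            seen.add(state)
            if not invariant(state):
                return list(trace)
            for nxt in next_states(state):
                if nxt not in seen:
                    queue.append((nxt, trace + (nxt,)))
        return None

    # Tiny example: state = (latest_applied_plan, record_exists), two racing actions.
    def next_states(state):
        latest, exists = state
        successors = [(latest, False)]              # cleanup deletes the record
        if latest < 2:
            successors.append((latest + 1, True))   # apply a newer plan
        return successors

    trace = check([(0, True)], next_states,
                  invariant=lambda s: s[0] == 0 or s[1])  # "plan applied => record exists"
    print(trace)  # a shortest bad interleaving, e.g. [(0, True), (1, True), (1, False)]

Roughly speaking, TLC is this plus a real spec language, symmetry reduction, liveness checking and so on, but the "toy model surfaces the non-trivial interleaving automatically" part really is this small.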
[1] https://www.amazon.science/publications/using-lightweight-fo...
Meanwhile, the people who took the extra effort to tackle particularly painful parallel systems removed some race conditions early in the design process and are seen as having achieved nothing.
Let's imagine that the process usually takes 1 minute and the tombstones are kept for 1 day. It would take something ridiculous to make the thing that usually takes 1 minute take longer than a day - not worth even considering. But sometimes there is a confluence of events that makes such a thing possible... For example, maybe the top-of-rack switch died. The server stays running, it just can't complete any upstream calls. Maybe it is continuously retrying while the network is down (or slowly timing out on individual requests and skipping to the next one). When the network comes back up, those calls start succeeding, but now the process is far staler than you ever thought was possible or planned for. That's just one scenario, and probably not exactly what happened to AWS.
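A back-of-the-envelope version of that scenario (all numbers invented) shows how quickly "it could never take a day" stops being true:

    # Hypothetical numbers: a stuck worker behind a dead top-of-rack switch,
    # slowly timing out on each queued item, then finally reconnecting.
    ITEMS_QUEUED     = 5_000        # items the stuck process keeps retrying
    TIMEOUT_PER_CALL = 20           # seconds before each upstream call gives up
    PARTITION        = 6 * 3600     # seconds the network was unreachable
    TOMBSTONE_TTL    = 24 * 3600    # seconds tombstones are kept around

    staleness = PARTITION + ITEMS_QUEUED * TIMEOUT_PER_CALL
    print(staleness / 3600, "hours stale vs a", TOMBSTONE_TTL / 3600, "hour TTL")
    # -> roughly 33.8 hours of staleness against a 24 hour TTL

None of the individual numbers is outrageous, but together they blow straight past the "ridiculous" threshold.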
QA people deal with problems and edge cases most devs will never deal with. They're your subject-matter experts in "what can go wrong".
Anyway, the point is: you can't trust anything "will resolve in time period X" or "if it takes longer than X, timeout". There are so many cases where this is simply not true; it should be added to a "myths programmers believe" article if it isn't already there.
As is, this statement just means you can't trust anything. You still need to choose a time period at some point.
My (pedantic) argument is that timestamps/dates/counters have a range based on the number of bits of storage they consume and the tick resolution. These ranges can be exceeded, and it's not reasonable for every piece of software in the chain to invent a new way to store time, or counters, etc.
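Those ranges are easy to compute and smaller than intuition suggests; a few of the classic ones (plain arithmetic, not tied to any particular system):

    def rollover_seconds(bits: int, tick_seconds: float, signed: bool = False) -> float:
        """Seconds until a free-running counter of this width and tick resolution wraps."""
        usable = 2 ** (bits - 1) if signed else 2 ** bits
        return usable * tick_seconds

    print(rollover_seconds(32, 1e-3) / 86400)                  # 32-bit millisecond tick: ~49.7 days
    print(rollover_seconds(32, 1, signed=True) / 86400 / 365)  # signed 32-bit seconds: ~68 years (2038)
    print(rollover_seconds(64, 1e-9) / 86400 / 365)            # 64-bit nanoseconds: ~585 years

The first two are the classic "works fine until it has been up for a while" failures.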
I've seen my fair share of issues resulting from processes with uptimes of over a year, and some with uptimes of five years. Of course, the wisdom there is just "don't do that, you should restart for maintenance at some point anyway", which is true, but it still means we are living with a system that will theoretically break after a certain period of time, and we are sidestepping that by restarting the process for other reasons.
One day, an operator is updating some cabling and switches you over to a 10 Mbps link for a few hours. During this time, every single one of your transfers is going to fail, even though, if you were to inspect the socket, it is still making progress on the transfer.
This is why we put timeouts on the socket, not the application. The socket knows whether or not it is still alive but your application may not.
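In Python terms (a simplified sketch, not a drop-in client), the difference is between a single deadline on the whole transfer and a timeout on each read, i.e. on progress:

    import socket

    def fetch(host: str, port: int, request: bytes, idle_timeout: float = 30.0) -> bytes:
        """Fail only if the connection stops making progress for `idle_timeout` seconds.
        A slow 10 Mbps link that keeps delivering bytes never trips this; a dead peer does."""
        chunks = []
        with socket.create_connection((host, port), timeout=idle_timeout) as sock:
            sock.settimeout(idle_timeout)   # applies per send/recv call, not to the whole transfer
            sock.sendall(request)
            while True:
                chunk = sock.recv(65536)    # raises a timeout after idle_timeout of silence
                if not chunk:
                    return b"".join(chunks)
                chunks.append(chunk)

Wrap the same loop in a single 60-second wall-clock deadline instead, and on the degraded link every large transfer gets killed even though it is steadily making progress.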
- Formal methods. Some of this started a long time ago, so I'm not sure if it was TLA, TLA+, or something else. (I am a useless manager type.)
- Fake clients/servers to make testing possible.
- Strict invariants.
- A simulator to fuzz/fault the entire system (see the sketch below). We didn't get this until later in the life of the service, but it flushed out race-condition bugs that would otherwise have taken years to find.
We never got to replaying customer traffic patterns, which was a pet idea of mine, but the juice probably wasn't worth the squeeze.
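For anyone curious what the simulator item above looks like in the small, here is a hypothetical Python sketch (not the actual service's harness): a seeded scheduler that picks among enabled events, occasionally injects a fault, and checks every invariant after each step:

    import random

    def simulate(seed: int, system, invariants, steps: int = 10_000) -> None:
        """One deterministic fuzzed run: seeded random scheduling plus fault injection,
        with every invariant checked after every step."""
        rng = random.Random(seed)
        for step in range(steps):
            events = system.enabled_events()           # e.g. deliver a message, fire a timer
            if not events:
                break
            if rng.random() < 0.01:                    # 1% chance of injecting a fault
                system.inject_fault(rng.choice(["drop", "delay", "crash_node"]))
            system.apply(rng.choice(events))
            for inv in invariants:
                assert inv(system), f"seed={seed} step={step}: {inv.__name__} violated"

The seed is the whole point: when an invariant trips, rerunning with the same seed replays the exact interleaving, so the race becomes a reproducible test case instead of a once-a-year page.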
I've changed my behavior to reflect the incentives but remain sad about it.
If you added model checking to it, though, you could have prevented it, because people who know how to write a model-checking spec will see the error right away.
I have yet to learn about this and will now be throwing some time into researching the topic.
Here we see the basic steps of modeling a complex system, and how that can be useful for understanding behavior even without knowing the details of every component.