Posted by simplegeek 1 day ago
> In this case, we can set up an invariant stating that the DNS should never be deleted once a newer plan has been applied
If that invariant had been expressed in the original code — as I'm sure it now is — it wouldn't have broken in the first place. The invariant is obvious in hindsight, but it's hardly axiomatic.
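To make that concrete, here is a minimal sketch in Python (hypothetical names, not AWS's actual code or spec) of what writing that invariant down as an executable check against a toy model might look like:

    from dataclasses import dataclass

    @dataclass
    class DnsState:
        latest_applied: int = 0     # generation of the newest plan applied so far
        record_exists: bool = True  # does the endpoint's DNS record still exist?

    def apply_plan(s: DnsState, gen: int) -> None:
        # The enactor applies a plan: the record is (re)written and the generation advances.
        if gen > s.latest_applied:
            s.latest_applied = gen
        s.record_exists = True

    def cleanup_plan(s: DnsState, gen: int) -> None:
        # Garbage-collects plan `gen`. Nothing here re-checks the current generation,
        # which is exactly the gap the invariant below is meant to catch.
        s.record_exists = False

    def invariant(s: DnsState) -> bool:
        # "The DNS record should never be deleted once a newer plan has been applied."
        return s.latest_applied == 0 or s.record_exists

Exploring the handful of interleavings of those two actions, by hand or with a checker, immediately surfaces the violating trace apply_plan(2) followed by cleanup_plan(1), which is roughly the shape of failure under discussion.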
While one can and will add invariants as they are discovered, they cannot all be found up front.
The Entscheidungsproblem and Trakhtenbrot's theorem apply here: counterintuitively, validity over finite models is co-r.e. but not r.e.
Validity in this case does not depend on the truth of the premises or the truth of the conclusion.
Basically, we have to use tools like systems thinking to construct robust systems; we cannot universally apply formal methods across frames.
This is one way in which race conditions are complex.
Hindsight bias makes it seem easy, but that is because hindsight lives on the co-r.e. side: with a concrete counterexample in hand you can always refute, you just cannot enumerate all the invariants in advance.
Well-intended actions taken with hindsight can often result in brittle systems, as their composition tends to set the system in stone, with the belief that the axioms are the final answer.
The fact that Gödel's completeness theorem may not apply to finite models, when it works so well for infinite ones, is hard for me to remember.
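For reference, the contrast being drawn here is just the standard statements of the two theorems (nothing specific to this incident), put side by side:

    \[
    \begin{aligned}
    \mathrm{Val}    &= \{\varphi \mid \varphi \text{ holds in all models}\}
        &&\in \Sigma^0_1 \quad \text{(r.e.; Gödel completeness)}\\
    \mathrm{FinVal} &= \{\varphi \mid \varphi \text{ holds in all finite models}\}
        &&\in \Pi^0_1 \setminus \Sigma^0_1 \quad \text{(co-r.e. but not r.e.; Trakhtenbrot)}
    \end{aligned}
    \]

So when a sentence fails in some finite model, brute-force search over finite structures will eventually hand you the counterexample (which is why hindsight is cheap), but there is no proof procedure that enumerates everything that holds in all finite models up front.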
Remembering that axiomatization is a powerful tool but not a silver bullet has helped me more times than I can count.
[0] http://jmc.stanford.edu/articles/circumscription/circumscrip...
I agree this article is very hindsight-biased, though. We do need a way to model the failure modes we can think of, but we also need a method that helps us discover what the failure modes are, in a systematic manner that doesn't suffer from "oops, we forgot the one way it was actually going to fail".
But I see this post less as an incident analysis and more as an experiment in learning from hindsight. The goal, it seems, isn’t to replay what happened, but to show how formal methods let us model a complex system at a conceptual level, without access to every internal detail, and still reason about where races or inconsistencies could emerge.
Agree, but Time-of-Check to Time-of-Use (TOCTOU) is a pretty well-established failure mode.
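For anyone who hasn't run into the term, TOCTOU is easy to show; here is a deliberately simplified Python sketch (hypothetical, not taken from the incident) where the check and the action are separated by a window in which another actor can change the state:

    import time

    # endpoint -> generation of the plan whose DNS record is currently live
    live_plan = {"db.example.internal": 1}

    def cleanup_stale(endpoint: str, latest_known: int) -> None:
        # Time of check: the live record looks older than the latest plan we know about.
        if live_plan.get(endpoint, latest_known) < latest_known:
            time.sleep(5)  # stand-in for a GC pause, a retry storm, or a dead switch
            # Time of use: a newer plan may have been applied in the meantime,
            # but we delete based on the stale check above.
            live_plan.pop(endpoint, None)

    def apply_plan(endpoint: str, generation: int) -> None:
        live_plan[endpoint] = generation

If apply_plan("db.example.internal", 3) runs from another worker inside that window, the delete removes a record a newer plan just wrote. The fix is the usual one: make the check and the delete a single atomic step (e.g. a conditional delete keyed on the generation) rather than trusting the earlier read.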
As mentioned previously, I think the only tools that are really valuable are those able to produce code, naturally with multiple possible languages as backends, which is then used as a library from the application code.
Something like Lean, F*, Dafny, even if that isn't exactly the same as TLA+.
Scenario proof modeling that is subject to human translation errors is about as useful in practice as doing PowerPoint-driven software architecture.
Every step of the flow requires an expert in TLA+ and in the actual programming language and frameworks being used to manually validate that everything still means exactly the same thing across all layers.
The obvious difference with PowerPoint design is that non-trivial failure modes can be surfaced automatically if they're reflected in the toy model - PowerPoint slides don't do this.
You don't even have to use TLA itself for this purpose; a Lean development could also do this - but then you would have to put together much of the basic formalism (which ultimately works by combining the "modalities" of time, state and non-determinism) on your own.
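On the "surfaced automatically" point: the core of what TLC does is a reachability search over a small state space, which you can sketch in a few lines of Python (illustrative only, nowhere near a real checker):

    from collections import deque

    def check(initial_states, next_states, invariant):
        """Breadth-first search over reachable states; returns a counterexample
        trace if the invariant ever fails, else None."""
        seen = set()
        queue = deque((s, (s,)) for s in initial_states)
        while queue:
            state, trace = queue.popleft()
            if state in seen:
                continue
            seen.add(state)
            if not invariant(state):
                return list(trace)
            for nxt in next_states(state):
                if nxt not in seen:
                    queue.append((nxt, trace + (nxt,)))
        return None

    # Tiny example: state = (latest_applied_plan, record_exists), two racing actions.
    def next_states(state):
        latest, exists = state
        successors = [(latest, False)]              # cleanup deletes the record
        if latest < 2:
            successors.append((latest + 1, True))   # apply a newer plan
        return successors

    trace = check([(0, True)], next_states,
                  invariant=lambda s: s[0] == 0 or s[1])  # "plan applied => record exists"
    print(trace)  # a shortest bad interleaving, e.g. [(0, True), (1, True), (1, False)]

Roughly speaking, TLC is this plus a real spec language, symmetry reduction, liveness checking and so on, but the "toy model surfaces the non-trivial interleaving automatically" part really is this small.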
[1] https://www.amazon.science/publications/using-lightweight-fo...
Meanwhile, the people who took the extra effort to tackle particularly painful parallel systems removed some race conditions early in the design process and are seen as having achieved nothing.
Let's imagine that the process usually takes 1 minute and the tombstones are kept for 1 day. It would take something ridiculous to make the thing that usually takes 1 minute take longer than a day - not worth even considering. But sometimes there is a confluence of events that makes such a thing possible... For example, maybe the top-of-rack switch died. The server stays running, it just can't complete any upstream calls. Maybe it is continuously retrying while the network is down (or slowly timing out on individual requests and skipping to the next one). When the network comes back up, those calls start succeeding, but now the process is far staler than you ever thought was possible or planned for. That's just one scenario, and probably not exactly what happened to AWS.
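A back-of-the-envelope version of that scenario (all numbers invented) shows how quickly "it could never take a day" stops being true:

    # Hypothetical numbers: a stuck worker behind a dead top-of-rack switch,
    # slowly timing out on each queued item, then finally reconnecting.
    ITEMS_QUEUED     = 5_000        # items the stuck process keeps retrying
    TIMEOUT_PER_CALL = 20           # seconds before each upstream call gives up
    PARTITION        = 6 * 3600     # seconds the network was unreachable
    TOMBSTONE_TTL    = 24 * 3600    # seconds tombstones are kept around

    staleness = PARTITION + ITEMS_QUEUED * TIMEOUT_PER_CALL
    print(staleness / 3600, "hours stale vs a", TOMBSTONE_TTL / 3600, "hour TTL")
    # -> roughly 33.8 hours of staleness against a 24 hour TTL

None of the individual numbers is outrageous, but together they blow straight past the "ridiculous" threshold.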
QA people deal with problems and edge cases most devs will never deal with. They're your subject-matter experts in "what can go wrong".
Anyway, the point is: you can't trust anything "will resolve in time period X" or "if it takes longer than X, timeout". There are so many cases where this is simply not true; it should be added to a "myths programmers believe" article if it isn't already there.
As is, this statement just means you can't trust anything. You still need to choose a time period at some point.
My (pedantic) argument is that timestamps/dates/counters have a range based on the number of bits of storage they consume and the tick resolution. These ranges can be exceeded, and it's not reasonable for every piece of software in the chain to invent a new way to store time, or counters, etc.
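Those ranges are easy to compute and smaller than intuition suggests; a few of the classic ones (plain arithmetic, not tied to any particular system):

    def rollover_seconds(bits: int, tick_seconds: float, signed: bool = False) -> float:
        """Seconds until a free-running counter of this width and tick resolution wraps."""
        usable = 2 ** (bits - 1) if signed else 2 ** bits
        return usable * tick_seconds

    print(rollover_seconds(32, 1e-3) / 86400)                  # 32-bit millisecond tick: ~49.7 days
    print(rollover_seconds(32, 1, signed=True) / 86400 / 365)  # signed 32-bit seconds: ~68 years (2038)
    print(rollover_seconds(64, 1e-9) / 86400 / 365)            # 64-bit nanoseconds: ~585 years

The first two are the classic "works fine until it has been up for a while" failures.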
I've seen my fair share of issues resulting from processes with uptimes of over a year, and some with uptimes of five years. Of course, the wisdom there is just "don't do that, you should restart for maintenance at some point anyway", which is true, but it still means we are living with a system that will theoretically break after a certain period of time, and we are sidestepping that by restarting the process for other reasons.
One day, an operator is updating some cabling and switches you over to a 10 Mbps link for a few hours. During this time, every single one of your transfers is going to fail, even though, if you were to inspect the socket, it is still making progress on the transfer.
This is why we put timeouts on the socket, not the application. The socket knows whether or not it is still alive but your application may not.
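In Python terms (a simplified sketch, not a drop-in client), the difference is between a single deadline on the whole transfer and a timeout on each read, i.e. on progress:

    import socket

    def fetch(host: str, port: int, request: bytes, idle_timeout: float = 30.0) -> bytes:
        """Fail only if the connection stops making progress for `idle_timeout` seconds.
        A slow 10 Mbps link that keeps delivering bytes never trips this; a dead peer does."""
        chunks = []
        with socket.create_connection((host, port), timeout=idle_timeout) as sock:
            sock.settimeout(idle_timeout)   # applies per send/recv call, not to the whole transfer
            sock.sendall(request)
            while True:
                chunk = sock.recv(65536)    # raises a timeout after idle_timeout of silence
                if not chunk:
                    return b"".join(chunks)
                chunks.append(chunk)

Wrap the same loop in a single 60-second wall-clock deadline instead, and on the degraded link every large transfer gets killed even though it is steadily making progress.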
- Formal methods. Some of this started a long time ago, so I'm not sure if it was TLA, TLA+, or something else. (I am a useless manager type.)
- Fake clients/servers to make testing possible.
- Strict invariants.
- A simulator to fuzz/fault the entire system (see the sketch below). We didn't get this until later in the life of the service, but it flushed out race-condition bugs that would otherwise have taken years to find.
We never got to replaying customer traffic patterns, which was a pet idea of mine, but the juice probably wasn't worth the squeeze.
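For anyone curious what the simulator item above looks like in the small, here is a hypothetical Python sketch (not the actual service's harness): a seeded scheduler that picks among enabled events, occasionally injects a fault, and checks every invariant after each step:

    import random

    def simulate(seed: int, system, invariants, steps: int = 10_000) -> None:
        """One deterministic fuzzed run: seeded random scheduling plus fault injection,
        with every invariant checked after every step."""
        rng = random.Random(seed)
        for step in range(steps):
            events = system.enabled_events()           # e.g. deliver a message, fire a timer
            if not events:
                break
            if rng.random() < 0.01:                    # 1% chance of injecting a fault
                system.inject_fault(rng.choice(["drop", "delay", "crash_node"]))
            system.apply(rng.choice(events))
            for inv in invariants:
                assert inv(system), f"seed={seed} step={step}: {inv.__name__} violated"

The seed is the whole point: when an invariant trips, rerunning with the same seed replays the exact interleaving, so the race becomes a reproducible test case instead of a once-a-year page.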
I've changed my behavior to reflect the incentives but remain sad about it.
If you added model checking to it, though, you could have prevented it, because people who know how to write a model-checking spec will see the error right away.
I have yet to learn about this and will now be throwing some time into researching the topic.
Here we see the basic steps of modeling a complex system, and how that can be useful for understanding behavior even without knowing the details of every component.