Posted by bcantrill 3 days ago

Futurelock: A subtle risk in async Rust (rfd.shared.oxide.computer)
This RFD describes our distillation of a really gnarly issue that we hit in the Oxide control plane.[0] Not unlike our discovery of the async cancellation issue[1][2][3], this is larger than the issue itself -- and worse, the program that hits futurelock is correct from the programmer's point of view. Fortunately, the surface area here is smaller than that of async cancellation and the conditions required to hit it can be relatively easily mitigated. Still, this is a pretty deep issue -- and something that took some very seasoned Rust hands quite a while to find.

[0] https://github.com/oxidecomputer/omicron/issues/9259

[1] https://rfd.shared.oxide.computer/rfd/397

[2] https://rfd.shared.oxide.computer/rfd/400

[3] https://www.youtube.com/watch?v=zrv5Cy1R7r4
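
To give a flavor of the shape of the problem, here's a deliberately simplified sketch (assuming tokio; names and timings are illustrative, not the RFD's exact reproducer). A background task holds a fair tokio::sync::Mutex long enough for two other futures to queue behind it, and the task that owns both futures stops polling the first one after a `tokio::select!`:

    use std::sync::Arc;
    use std::time::Duration;
    use tokio::sync::Mutex;

    // Illustrative helper: take the lock, then release it immediately.
    async fn do_stuff(lock: Arc<Mutex<()>>, label: &str) {
        let _guard = lock.lock().await;
        println!("{label}: acquired and released the lock");
    }

    #[tokio::main]
    async fn main() {
        let lock = Arc::new(Mutex::new(()));

        // Background task grabs the lock first and sits on it for a while,
        // purely to force the two futures below to queue up behind it.
        let background = {
            let lock = Arc::clone(&lock);
            tokio::spawn(async move {
                let _guard = lock.lock().await;
                tokio::time::sleep(Duration::from_secs(5)).await;
            })
        };
        tokio::time::sleep(Duration::from_millis(50)).await;

        // future1 is created and polled by *this* task, not spawned.
        let mut future1 = Box::pin(do_stuff(Arc::clone(&lock), "op1"));

        tokio::select! {
            _ = &mut future1 => {}
            _ = tokio::time::sleep(Duration::from_secs(1)) => {
                // This branch wins. future1 is still first in line for the lock,
                // but this task never polls it again, so when the background task
                // releases the lock it is handed to future1 and never freed.
                do_stuff(Arc::clone(&lock), "op2").await; // futurelock: hangs forever
            }
        }

        let _ = background.await;
        println!("all done"); // never reached in the futurelocked case
    }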

437 points | 242 comments
jhhh 2 days ago|
I read this once over, and the part that doesn't make sense to me is why the runtime, when both future1 and future3 have execution contexts waiting at the lock(), chose to wake up the main thread instead. I get why a fair lock would pick future1, but I don't get how that causes a different thread than the one holding the lock to execute.
pshirshov 2 days ago||
In my experience, almost every asynchronous runtime has faced a similar issue at some point (e.g. we helped find and fix such an issue in ZIO).

It's hard to verify these protocols and very easy to write something fragile.

lowbloodsugar 2 days ago||
I know this is going to sound trite, but “don’t do that”. It’s no different than deciding to poll the Win32 event queue inside a method you executed in response to polling the event queue. Nested shit is always going to cause a bug. I guess each new generation just has to learn.
dap 2 days ago||
Don't do ... what, exactly? The RFD answers this more precisely and provides suggestions for alternatives. But it's not very simple because the things that can cause this are all common patterns individually and it's only the confluence (which can be spread across layers of the program) that introduces this problem. In our case, it wasn't a Mutex, but an mpsc channel (that was working correctly! it just got very briefly saturated) and it was 3-4 modules lower in the stack than the code with the `tokio::select!` that induced this.
wngr 2 days ago||
It’s not nested, that’s the thing.
imtringued 2 days ago||
Based on the description:

>This RFD describes futurelock: a type of deadlock where a resource owned by Future A is required for another Future B to proceed, while the Task responsible for both Futures is no longer polling A. Futurelock is a particularly subtle risk in writing asynchronous Rust.

I was honestly wondering how you could possibly cause this in any sane code base. How can an async task hold a lock and keep it open? It sounds illogical, because critical sections are meant to be short and never interrupted by anything. You're also never allowed to panic, which means you have to write no-panic Rust code inside a critical section. Critical sections are very similar to unsafe blocks, but with the caveat that they cannot cause a complete takeover of your application.

So how exactly did they bring about the impossible? They put an await call inside the critical section. The part of the code base that is not allowed to be subject to arbitrary delays. Massive facepalm.

When you invoke await inside a critical section, you're essentially saying "I hereby accept that this critical section will last an indeterminate amount of time, I am fully aware of what the code I'm calling is doing and I am willing to accept the possibility that the release of the lock may never come, even if my own code is one hundred percent correct, since the await call may contain an explicit or implicit deadlock"
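
Concretely, the anti-pattern being described, a guard kept alive across an .await, looks something like this (a hypothetical sketch, not code from the RFD):

    use std::sync::Arc;
    use std::time::Duration;
    use tokio::sync::Mutex;

    // The guard stays alive across the .await, so the "critical section"
    // now lasts as long as whatever is being awaited.
    async fn risky(counter: Arc<Mutex<u64>>) {
        let mut guard = counter.lock().await;
        *guard += 1;
        tokio::time::sleep(Duration::from_secs(5)).await; // lock held this whole time
        println!("count = {}", *guard);
    } // guard dropped only here

    #[tokio::main]
    async fn main() {
        let counter = Arc::new(Mutex::new(0));
        risky(counter).await;
    }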

dap 2 days ago|
> So how exactly did they bring about the impossible? They put an await call inside the critical section. The part of the code base that is not allowed to be subject to arbitrary delays. Massive facepalm.

I'm not sure where you got the impression that the example code was where we found the problem. That's a minimal reproducer trying to explain the problem from first principles because most people look at that code and think "that shouldn't deadlock". It uses a Mutex because people are familiar with Mutexes and `sleep` just to control the interleaving of execution. The RFD shows the problem in other examples without Mutexes. Here's a reproducer that futurelocks even though nobody uses `await` with the lock held: https://play.rust-lang.org/?version=stable&mode=debug&editio...

> I was honestly wondering how you could possibly cause this in any sane code base.

The actual issue is linked at the very top of the RFD. In our case, we had a bounded mpsc channel used to send messages to an actor running in a separate task. That actor was working fine. But the channel did become briefly saturated (i.e., at capacity) at a point where someone tried to send on it via a `tokio::select!` similar to the one in the example.
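
A stripped-down sketch of that shape (hypothetical names, not our actual code; just the same ingredients: a briefly saturated bounded channel plus a `tokio::select!` whose losing branch never gets polled again):

    use std::time::Duration;
    use tokio::sync::mpsc;

    // Hypothetical helper a few layers down the stack: forward a message
    // to an actor over a *bounded* channel.
    async fn notify_actor(tx: &mpsc::Sender<&'static str>, msg: &'static str) {
        let _ = tx.send(msg).await; // parks here if the channel is at capacity
    }

    #[tokio::main]
    async fn main() {
        let (tx, mut rx) = mpsc::channel::<&'static str>(1);

        // The actor works correctly; it is just slow enough to let the
        // channel stay saturated for a couple of seconds.
        tokio::spawn(async move {
            loop {
                tokio::time::sleep(Duration::from_secs(2)).await;
                match rx.recv().await {
                    Some(msg) => println!("actor handled {msg}"),
                    None => break,
                }
            }
        });

        tx.send("fill").await.unwrap(); // channel is now at capacity

        let mut first = Box::pin(notify_actor(&tx, "first"));

        tokio::select! {
            _ = &mut first => {}
            _ = tokio::time::sleep(Duration::from_millis(10)) => {
                // The timeout branch wins while the channel is still full.
                // `first` keeps its place in the send queue but is never polled
                // again, so this send queues behind it and never completes.
                notify_actor(&tx, "second").await; // futurelock
            }
        }

        println!("all done"); // never reached
    }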

mdasen 2 days ago||
I rewrote this in Go and it also deadlocks. It doesn't seem to be something that's Rust specific.

I'm going to write down the order of events.

1. Background task takes the lock and holds it for 5 seconds.

2. Async Thing 1 tries to take the lock, but must wait for the background task to release it. It is next in line to get the lock.

3. We fire off a goroutine that's just sleeping for a second.

4. Select wants to find a channel that is finished. The sleepChan finishes first (since it's sleeping for 1 second) while Async Thing 1 is still waiting 4 more seconds for the lock. So select will execute the sleepChan case.

5. That case fires off Async Thing 2. Async Thing 2 is waiting for the lock, but it is second in line to get the lock after Async Thing 1.

6. Async Thing 1 gets the lock and is ready to write to its channel - but the main is paused trying to read from c2, not c1. Main is "awaiting" on c2 via "<-c2". Async Thing 1 can't give up its lock until it writes to c1. It can't write to c1 until c1 is "awaited" via "<-c1". But the program has already gone into the other case and until the sleepChan case finishes, it won't try to await c1. But it will never finish its case because its case depends on c1 finishing first.

You can use buffered channels in Go so that Async Thing 1 can write to c1 without main reading from it, but as the article notes you could use join_all in Rust (a sketch of that follows the Go code below).

But the issue is that you're saying with "select" in either Go or Rust "get me the first one that finishes" and then in the branch that finishes first, you are awaiting a lock that will get resolved when you read the other branch. It just doesn't feel like something that is Rust specific.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    func main() {
        lock := sync.Mutex{}
        c1 := make(chan string)
        c2 := make(chan string)
        sleepChan := make(chan bool)

        go start_background_task(&lock)
        time.Sleep(1 * time.Millisecond) // make sure start_background_task is scheduled first

        go do_async_thing(c1, "op1", &lock)

        go func() {
            time.Sleep(1 * time.Second)
            sleepChan <- true
        }()

        for range 2 {
            select {
            case msg1 := <-c1:
                fmt.Println("In the c1 case")
                fmt.Printf("received %s\n", msg1)
            case <-sleepChan:
                fmt.Println("In the sleepChan case")
                go do_async_thing(c2, "op2", &lock)
                fmt.Printf("received %s\n", <-c2) // "awaiting" on c2 here, but c1's lock won't be given up until we read c1
            }
        }
        fmt.Println("all done")
    }

    func start_background_task(lock *sync.Mutex) {
        fmt.Println("starting background task")
        lock.Lock()
        fmt.Println("acquired background task lock")
        defer lock.Unlock()
        time.Sleep(5 * time.Second)
        fmt.Println("dropping background task lock")
    }

    func do_async_thing(c chan string, label string, lock *sync.Mutex) {
        fmt.Printf("%s: started\n", label)
        lock.Lock()
        fmt.Printf("%s: acquired lock\n", label)
        defer lock.Unlock()
        fmt.Printf("%s: done\n", label)
        c <- label // the deferred Unlock only runs after this unbuffered send is received
    }
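
As mentioned above, the Rust-side alternative is to poll both futures to completion rather than select-ing one and then sequentially awaiting the other. A minimal sketch using the futures crate's join_all (illustrative; it assumes the same kind of tokio Mutex setup as the article's example, not its exact code):

    use std::sync::Arc;
    use futures::future::join_all;
    use tokio::sync::Mutex;

    async fn do_async_thing(lock: Arc<Mutex<()>>, label: &str) {
        let _guard = lock.lock().await;
        println!("{label}: acquired and released the lock");
    }

    #[tokio::main]
    async fn main() {
        let lock = Arc::new(Mutex::new(()));
        // Both futures are polled concurrently by the same task, so neither
        // is ever left queued for the lock but unpolled.
        join_all(vec![
            do_async_thing(Arc::clone(&lock), "op1"),
            do_async_thing(Arc::clone(&lock), "op2"),
        ])
        .await;
        println!("all done");
    }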
clarkmcc 2 days ago||
I think the thing that rubs me the wrong way is that Rust was supposed to offer "fearless" concurrency. Go doesn't claim that title so I'm not offended when it doesn't live up to it.
kibwen 2 days ago||
Despite "fearless concurrency", Rust has been careful to never claim to prevent deadlocks/race conditions in general, in either async code or non-async code. It's certainly easier to get deadlocks in async Rust than in non-async Rust, but this isn't some sort of novel failure mode.
hu3 2 days ago|||
Yeah but Go makes it obvious why it is deadlocking because the async primitives are more explicit. Even a dumb LLM could have told us where the problem is (I tested).

Meanwhile in Rust it looks like it took thousands of dollars in engineering time to find the issue.

jhhh 2 days ago|||
I wrote a version of the article's code in Java and couldn't figure out why it was working until reading your example. I see now that the channel operations in Go must rendezvous, which I assume matches Rust's Future behavior, whereas the Java CompletableFuture operations I was using to mimic the select aren't required to meet. Thanks for writing this.
mjevans 2 days ago||
The difference in Go is that you've _expressly_ constructed a dependency ring. Should Go or any runtime go out of its way to detect a dependency ring?

This is the programming equivalent of using welding (locks) to make a chain loop; you've just done it with the two-link case that's impossible in 3D space.

As with the sin of .await(no deadline), the sin here is not adding a deadline.
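
In tokio terms, "adding a deadline" looks something like the sketch below (illustrative; a deadline doesn't prevent the futurelock itself, it just turns a silent hang into an observable error):

    use std::time::Duration;
    use tokio::time::timeout;

    // Stand-in for any await that might get stuck behind a futurelock.
    async fn some_operation() -> u64 {
        42
    }

    #[tokio::main]
    async fn main() {
        match timeout(Duration::from_secs(30), some_operation()).await {
            Ok(value) => println!("finished: {value}"),
            Err(_elapsed) => eprintln!("timed out: possible futurelock or stalled dependency"),
        }
    }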

bilbo-b-baggins 2 days ago||
I’m just gonna make a new language that has future borrowing semantics and future lifetimes to solve this.
wbl 3 days ago||
Sadly I'm away from my bookshelf but I think Concurrent ML solved this issue.
moralestapia 3 days ago||
Hmm, curious to see if this could happen in JS. I'll reproduce the code.
comex 3 days ago||
JS shouldn't have a direct equivalent because JS async functions are eager. Once you call an async function, it will keep running even if the caller doesn't await it, or stops awaiting it. So in the scenario described, the function next in line for the lock would always have a chance to acquire and release it. The problem in Rust is that async functions are lazy and only run while they're being polled/awaited (unless wrapped in tasks). A function that's next in line for the lock might never acquire it if it's not being polled, blocking progress for other functions that are being polled.
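
A minimal illustration of that laziness (assuming tokio; not code from the RFD):

    use std::time::Duration;

    async fn side_effect() {
        println!("side_effect is running");
    }

    #[tokio::main]
    async fn main() {
        // In Rust, this creates an inert value; nothing has run yet.
        // A JS async function would already have started executing here.
        let fut = side_effect();

        tokio::time::sleep(Duration::from_millis(100)).await;

        // Only now, when the future is polled, does side_effect actually run.
        // If it were never awaited (or spawned onto the runtime), it would
        // never run at all.
        fut.await;
    }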
raggi 3 days ago||
Yes, you can produce similar issues with promise-guarded state and so on as well; it's a fairly common issue in async programming, but it can be surprising when it's hidden by layers of abstraction, far up or down a call chain.
ideaformlabs 2 days ago||
It’s kind of wild how even the most careful Rust code can run into issues like this; it really shows how deep async programming goes.
24f0bacc7c72d0a 2 days ago|
This is why I use a threadpool instead. Can't deal with the complexity of async code.
e-dant 2 days ago|
A masterclass in debugging