Posted by bcantrill 3 days ago
[0] https://github.com/oxidecomputer/omicron/issues/9259
[1] https://rfd.shared.oxide.computer/rfd/397
It's hard to verify these protocols and very easy to write something fragile.
>This RFD describes futurelock: a type of deadlock where a resource owned by Future A is required for another Future B to proceed, while the Task responsible for both Futures is no longer polling A. Futurelock is a particularly subtle risk in writing asynchronous Rust.
I was honestly wondering how you could possibly cause this in any sane code base. How can an async task take a lock and just keep holding it? It sounds illogical, because critical sections are meant to be short and never interrupted by anything. You're also never allowed to panic, which means you have to write panic-free Rust code inside a critical section. Critical sections are very similar to unsafe blocks, except that they cannot cause a complete takeover of your application.
So how exactly did they bring about the impossible? They put an `await` inside the critical section: the one part of the code base that must never be subject to arbitrary delays. Massive facepalm.
When you invoke `await` inside a critical section, you're essentially saying: "I hereby accept that this critical section will last an indeterminate amount of time. I am fully aware of what the code I'm calling is doing, and I am willing to accept that the release of the lock may never come, even if my own code is one hundred percent correct, since the awaited code may contain an explicit or implicit deadlock."
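A minimal sketch of the criticized pattern, using tokio's `Mutex` (the `sleep` stands in for whatever arbitrary code gets awaited):

    use std::sync::Arc;
    use std::time::Duration;
    use tokio::sync::Mutex;
    use tokio::time::sleep;

    #[tokio::main]
    async fn main() {
        let lock = Arc::new(Mutex::new(0u64));

        let holder = {
            let lock = Arc::clone(&lock);
            tokio::spawn(async move {
                // The guard lives across an await, so this critical section
                // now lasts as long as the awaited code takes -- possibly forever.
                let mut guard = lock.lock().await;
                *guard += 1;
                sleep(Duration::from_secs(5)).await; // arbitrary awaited code
            })
        };

        sleep(Duration::from_millis(10)).await;
        // Everyone else who wants the lock is now subject to that delay.
        let _ = lock.lock().await;
        holder.await.unwrap();
    }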
I'm not sure where you got the impression that the example code was where we found the problem. That's a minimal reproducer trying to explain the problem from first principles because most people look at that code and think "that shouldn't deadlock". It uses a Mutex because people are familiar with Mutexes and `sleep` just to control the interleaving of execution. The RFD shows the problem in other examples without Mutexes. Here's a reproducer that futurelocks even though nobody uses `await` with the lock held: https://play.rust-lang.org/?version=stable&mode=debug&editio...
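For reference, the RFD's minimal Mutex example has roughly this shape (my reconstruction from the description above, not a verbatim copy):

    use std::sync::Arc;
    use std::time::Duration;
    use tokio::sync::Mutex;
    use tokio::time::sleep;

    async fn do_stuff(lock: Arc<Mutex<()>>, label: &'static str) {
        let _guard = lock.lock().await;
        println!("{label}: acquired and released the lock");
    }

    #[tokio::main]
    async fn main() {
        let lock = Arc::new(Mutex::new(()));

        // Background task grabs the lock and holds it for five seconds.
        let background = {
            let lock = Arc::clone(&lock);
            tokio::spawn(async move {
                let _guard = lock.lock().await;
                sleep(Duration::from_secs(5)).await;
            })
        };
        sleep(Duration::from_millis(10)).await; // let the background task win the lock race

        // future1 gets polled inside select! and queues up for the lock...
        let future1 = do_stuff(Arc::clone(&lock), "op1");
        tokio::pin!(future1);
        tokio::select! {
            _ = &mut future1 => {}
            _ = sleep(Duration::from_secs(1)) => {
                // ...but this branch wins, and future1 is never polled again.
                // tokio's fair Mutex hands the freed lock to future1 (first in
                // line), so the acquisition below waits behind it forever.
                do_stuff(Arc::clone(&lock), "op2").await;
            }
        }
        background.await.unwrap();
        println!("all done"); // never reached: futurelock
    }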
> I was honestly wondering how you could possibly cause this in any sane code base.
The actual issue is linked at the very top of the RFD. In our case, we had a bounded mpsc channel used to send messages to an actor running in a separate task. That actor was working fine. But the channel did become briefly saturated (i.e., at capacity) at a point where someone tried to send on it via a `tokio::select!` similar to the one in the example.
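A rough sketch of that shape, with made-up names (assuming tokio's bounded mpsc, which hands freed capacity to waiting senders in FIFO order):

    use std::time::Duration;
    use tokio::sync::mpsc;
    use tokio::time::sleep;

    #[tokio::main]
    async fn main() {
        // Bounded channel feeding an actor that runs in its own task.
        let (tx, mut rx) = mpsc::channel::<u32>(1);

        // The actor is perfectly healthy, just briefly busy, so the
        // channel briefly saturates.
        tokio::spawn(async move {
            sleep(Duration::from_secs(1)).await;
            while let Some(msg) = rx.recv().await {
                println!("actor got {msg}");
            }
        });

        tx.send(0).await.unwrap(); // channel is now at capacity

        // send_fut gets polled inside select! and queues up for a free slot...
        let send_fut = tx.send(1);
        tokio::pin!(send_fut);
        tokio::select! {
            _ = &mut send_fut => {}
            _ = sleep(Duration::from_millis(10)) => {
                // ...but this branch wins while the channel is still full, and
                // send_fut is never polled again. When the actor frees a slot,
                // the channel hands it to send_fut (first in line), so this
                // send waits behind it forever.
                tx.send(2).await.unwrap();
            }
        }
        println!("all done"); // never reached: futurelock
    }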
I'm going to write down the order of events for the Go translation below:
1. Background task takes the lock and holds it for 5 seconds.
2. Async Thing 1 tries to take the lock, but must wait for background task to release it. It is next in line to get the lock.
3. We fire off a goroutine that's just sleeping for a second.
4. Select looks for a case that's ready. sleepChan becomes ready first (its goroutine only sleeps for 1 second) while Async Thing 1 is still waiting 4 more seconds for the lock, so select executes the sleepChan case.
5. That case fires off Async Thing 2. Async Thing 2 is waiting for the lock, but it is second in line to get the lock after Async Thing 1.
6. Async Thing 1 gets the lock and is ready to write to its channel, but main is paused reading from c2, not c1: it is "awaiting" on c2 via `<-c2`. Async Thing 1 can't give up its lock until it writes to c1 (the unlock is deferred until the function returns), and it can't write to the unbuffered c1 until main reads `<-c1`. But main has already entered the sleepChan case and won't select on c1 again until that case finishes, and that case can't finish: it's waiting on c2, which Async Thing 2 only writes after taking the lock, which Async Thing 1 only releases after writing to c1. A cycle.
You can use buffered channels in Go so that Async Thing 1 can write to c1 without main reading from it; as the article notes, the analogous fix in Rust is join_all (sketched after the Go code below).
But the issue is that with `select`, in either Go or Rust, you're saying "give me the first one that finishes", and then in the winning branch you block on something that can only resolve once you read the other branch. It just doesn't feel like something that is Rust-specific.
    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    func main() {
        lock := sync.Mutex{}
        c1 := make(chan string)
        c2 := make(chan string)
        sleepChan := make(chan bool)    
        
        go start_background_task(&lock)
        time.Sleep(1 * time.Millisecond) // make sure start_background_task is scheduled first and takes the lock
        go do_async_thing(c1, "op1", &lock)
 
        go func() {
                time.Sleep(1 * time.Second)
                sleepChan <- true
        }()
        for range 2 {
                select {
                case msg1 := <-c1:
                        fmt.Println("In the c1 case")
                        fmt.Printf("received %s\n", msg1)
                case _ = <-sleepChan:
                        fmt.Println("In the sleepChan case")
                        go do_async_thing(c2, "op2", &lock)
                        fmt.Printf("received %s\n", <-c2) // "awaiting" on c2 here, but c1's lock won't be given up until we read it
                }
        }
        fmt.Println("all done")
    }

    func start_background_task(lock *sync.Mutex) {
        fmt.Println("starting background task")
        lock.Lock()
        fmt.Println("acquired background task lock")
        defer lock.Unlock()
        time.Sleep(5 * time.Second)
        fmt.Println("dropping background task lock")
    }

    func do_async_thing(c chan string, label string, lock *sync.Mutex) {
        fmt.Printf("%s: started\n", label)
        lock.Lock()
        fmt.Printf("%s: acuired lock\n", label)
        defer lock.Unlock()
        fmt.Printf("%s: done\n", label)
        c <- label // the deferred unlock runs only after this send completes, so the lock is held while we block here
    }

Meanwhile in Rust, it looks like it took thousands of dollars in engineering time to find the issue.
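For reference, here's roughly what the join_all fix mentioned above could look like on the Rust side (a sketch assuming the `futures` crate, not the RFD's code). Driving both futures to completion together means neither can starve the other by sitting at the front of the lock queue while unpolled:

    use std::sync::Arc;
    use std::time::Duration;
    use futures::future::join_all;
    use tokio::sync::Mutex;
    use tokio::time::sleep;

    async fn do_stuff(lock: Arc<Mutex<()>>, label: &'static str) {
        let _guard = lock.lock().await;
        println!("{label}: acquired and released the lock");
    }

    #[tokio::main]
    async fn main() {
        let lock = Arc::new(Mutex::new(()));
        let background = {
            let lock = Arc::clone(&lock);
            tokio::spawn(async move {
                let _guard = lock.lock().await;
                sleep(Duration::from_secs(5)).await;
            })
        };

        // Both futures keep getting polled until both are done.
        join_all(vec![
            do_stuff(Arc::clone(&lock), "op1"),
            do_stuff(Arc::clone(&lock), "op2"),
        ])
        .await;

        background.await.unwrap();
        println!("all done"); // reached: no futurelock
    }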
This is the programming equivalent of using welding (locks) to make a chain loop: you've just pulled off the two-interlocked-links case that should be impossible in 3D space.
As with the sin of `.await` with no deadline, the sin here is not adding a deadline.
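For instance, something like this (a sketch assuming tokio; `send_with_deadline` is a made-up helper, not a real API):

    use std::time::Duration;
    use tokio::sync::mpsc;
    use tokio::time::timeout;

    // A deadline turns a futurelocked await into a diagnosable error
    // instead of a silent hang.
    async fn send_with_deadline(tx: &mpsc::Sender<u32>, msg: u32) -> Result<(), String> {
        match timeout(Duration::from_secs(5), tx.send(msg)).await {
            Ok(Ok(())) => Ok(()),
            Ok(Err(e)) => Err(format!("channel closed: {e}")),
            Err(_) => Err("send timed out after 5s (futurelock?)".into()),
        }
    }

As a bonus, when the timeout fires it drops the stuck send future, so the task can make progress again instead of hanging forever.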