Posted by bcantrill 2 days ago

Futurelock: A subtle risk in async Rust (rfd.shared.oxide.computer)
This RFD describes our distillation of a really gnarly issue that we hit in the Oxide control plane.[0] Not unlike our discovery of the async cancellation issue[1][2][3], this is larger than the issue itself -- and worse, the program that hits futurelock is correct from the programmer's point of view. Fortunately, the surface area here is smaller than that of async cancellation and the conditions required to hit it can be relatively easily mitigated. Still, this is a pretty deep issue -- and something that took some very seasoned Rust hands quite a while to find.

[0] https://github.com/oxidecomputer/omicron/issues/9259

[1] https://rfd.shared.oxide.computer/rfd/397

[2] https://rfd.shared.oxide.computer/rfd/400

[3] https://www.youtube.com/watch?v=zrv5Cy1R7r4

434 points | 242 comments
jcalvinowens 2 days ago|
I have very little Rust experience... but I'm hung up on this:

> The lock is given to future1

> future1 cannot run (and therefore cannot drop the Mutex) until the task starts running it.

This seems like a contradiction to me. How can future1 acquire the Mutex in the first place, if it cannot run? The word "given" is really odd to me.

Why would do_async_thing() not immediately run the prints, return, and drop the lock after acquiring it? Why does future1 need to be "polled" for that to happen? I get that due to the select! behavior, the result of future1 is not consumed, but I don't understand how that prevents it from releasing the mutex.

It's more typical in my experience that the act of granting the lock to a thread is what makes it runnable, and it runs right then. Having to take some explicit second action to make that happen seems fundamentally broken to me...

EDIT: Rephrased for clarity.

oconnor663 2 days ago||
> This seems like a contradiction to me. How can future1 acquire the Mutex in the first place, if it cannot run? The word "given" is really odd to me.

`future1` did run for a bit, and it got far enough to acquire the mutex. (As the article mentioned, technically it took a position in a queue that means it will get the mutex, but that's morally the same thing here.) Then it was "paused". I put "paused" in scare quotes because it kind of makes futures sound like processes or threads, which have a "life of their own" until/unless something "interrupts" them, but an important part of this story is that Rust futures aren't really like that. When you get down to the details, they're more like a struct or a class that just sits there being data unless you call certain methods on it (repeatedly). That's what the `.await` keyword does for you, but when you use more interesting constructs like `select!`, you start to get more of the details in your face.

It's hard to be more concrete than that without getting into an overwhelming amount of detail. I wrote a set of blog posts that try to cover it without hand-waving the details away, but they're not short, and they do require some Rust background: https://jacko.io/async_intro.html

jcalvinowens 1 day ago||
So my understanding was correct, it requires the programmer to deal with scheduling explicitly in userspace.

If I'm writing bare metal code for e.g. a little cortex M0, I can very much see the utility of this abstraction.

But it seems like an absolutely absurd exercise for code running in userspace on a "real" OS like Linux. There should be some simpler intermediate abstraction... this seems like a case of forcing a too-complex interface on users who don't really require it.

wrs 1 day ago|||
There is one: tasks. But having the lower level (futures) available makes it very tempting to use it, both for performance and because the code is simpler (at least, it looks simpler). Some things that are easy with select! would be clunky with tasks.

On the other hand, some direct uses of futures are reminiscent of the tendency to obsess over ownership and borrowing to maximize sharing, when you could just use .clone() and it wouldn’t make any practical difference. Because Rust is so explicit, you can see the overhead so you want to minimize it.

oconnor663 1 day ago|||
To be clear, if you restrict yourself to `async`/`.await` syntax, you never see any of this. To await something means to poll it to completion, which is usually what you want. "Joining" two futures lets you poll both of them concurrently until they're both done, which is kind of the point of async as a concept, and this also doesn't really require you to think about scheduling. One place where things get hairy (like in this article) is "selecting" on futures, which polls them all until one of them is done, and then stops polling the rest. (Normally I'd loosely say it "drops the rest on the floor", but the deadlock in this article actually hinges on exactly what gets "dropped" when, in the Rust sense of the `Drop` trait.) This is where scheduling as you put it, or "cancellation" as Rust folks often put it, starts to become important. And that's why the article concludes "In the end, you should always be extremely careful with tokio::select!" However, `select!` is not the only construct that raises these issues. Speaking of which...

> But it seems like an absolutely absurd exercise for code running in userspace on a "real" OS like Linux

Clearly you have a point here, which is why these blog posts are making an impact. That said, one counterpoint is, have you ever wished you could kill a thread? The reason there are so many old Raymond Chen "How many times does it have to be said: Never call TerminateThread" blog posts, is that lots of real world applications really desperately want to call TerminateThread, and it's hard to persuade them to stop! The ability to e.g. put a timeout on any async function call is basically this same superpower, without corrupting your whole process (yay), but still with the unavoidable(?) difficulty of thinking about what happens when random functions give up halfway through.

dap 2 days ago||
Your confusion is very natural:

> It's more typical in my experience that the act of granting the lock to a thread is what makes it runnable, and it runs right then.

This gets at why this felt like a big deal when we ran into this. This is how it would work with threads. Tasks and futures hook into our intuitive understanding of how things work with threads. (And for tasks, that's probably still a fair mental model, as far as I know.) But futures within a task are different because of the inversion of control: tasks must poll them for them to keep running. The problem here is that the task that's responsible for polling this future has essentially forgotten about it. The analogous thing with threads would seem to be something like if the kernel forgot to enqueue some runnable thread on a run queue.

jcalvinowens 1 day ago||
> tasks must poll them for them to keep running.

So async Rust introduces an entire novel class of subtle concurrent programming errors? Ugh, that's awful.

> The analogous thing with threads would seem to be something like if the kernel forgot to enqueue some runnable thread on a run queue.

Yes. But I've never written code in a preemptible protected mode environment like Linux userspace where it is possible to make that mistake. That's nuts to me.

From my POV this seems like a fundamental design flaw in async rust. Like, on a bare metal thing I expect to deal with stuff like this... but code running on a real OS shouldn't have to.

dap 1 day ago|||
I definitely hear that!

To keep it in perspective, though: we've been operating a pretty good size system that's heavy on async Rust for a few years now and this is the first we've seen this problem. Hitting it requires a bunch of things (programming patterns and runtime behavior) to come together. It's really unfortunate that there aren't guard rails here, but it's not like people are hitting this all over the place.

The thing is that the alternatives all have tradeoffs, too. With threaded systems, there's no distinction in code between stuff that's quick vs. stuff that can block, and that makes it easy to accidentally do time-consuming (blocking) work in contexts that don't expect it (e.g., a lock held). With channels / message passing / actors, having the receiver/actor go off and do something expensive is just as bad as doing something expensive with a lock held. There are environments that take this to the extreme where you can't even really block or do expensive things as an actor, but there the hidden problem is often queueing and backpressure (or lack thereof). There's just no free lunch.

I'd certainly think carefully in choosing between sync vs. async Rust. But we've had a lot fewer issues with both of these than I've had in my past experience working on threaded systems in C and Java and event-oriented systems in C and Node.js.

jamincan 1 day ago|||
Rust can't assume you're running on a real OS though.
mpeklar 1 day ago||
When considering this issue alongside RFD 397, it seems to me that the problem is actually using future drops as an implicit (!) cancellation signal. This makes drop handlers responsible for handling every cancellation-related task, which they are not very good at. If a future is not immediately dropped after selecting on it, you get futurelock, and if it is, you get an async cancellation correctness problem, where the only way to interact with the cancellation execution flow is to use drop handlers (maybe in the form of scope guards).

Sadly, the only solution I know of is to use an explicit cancellation signal, and to modify ~everything to work with it. In that world, almost all async functions would need to accept a cancellation parameter of some sort, like a Go Context or like tokio-util's CancellationToken, and explicitly check it every time they await a function. The new select!-equivalent would need to signal cancellations and then keep polling all unfinished cancellation-aware futures in a loop until they finished, and maybe immediately drop all non-aware futures to prevent futurelock. The entire Tokio API would need to be wrapped to take into account cancellation tokens, as well as any other async library you would want to use.

A lot of work, and you would need to do something if cancel-aware futures get dropped anyway. What a mess.

quietbritishjim 2 days ago||
Wow, it is simply outrageous that Rust doesn't just allow all active tasks to make progress. It creates a whole class of incomprehensible bugs, like this one, for no reason. Can any Rust experts explain why it's done this way? It seems like an unforced error.

In Python, I often use the Trio library, which offers "structured concurrency": tasks are (only) spawned into lexical scopes, and they are all completed (waited for) before that scope is left. That includes waiting for any cancelled tasks (which are allowed to do useful async work, including waiting for any of their own task scopes to complete).

Could Rust do something like that? It's far easier to reason about than traditional async programs, which seems up Rust's street. As a bonus it seems to solve this problem, since a Rust equivalent would presumably have all tasks implicitly polled by their owning scope.

duped 2 days ago||
So there's a distinction between a task and a future. A future doesn't do anything until it's polled, and since there's nothing special about async runtimes (it's just user level code), it's always possible to create futures and never poll them, or stop polling them.

A task is a different construct and usually tied to the runtime. If you look at the suggestions in the RFD they call out using a task explicitly instead of polling a future in place.

There's some debate to be had over what constitutes "cancellation." The article and most colloquial definitions I've heard define it as a future being dropped before being polled to completion. Which is very clean - if you want to cancel a future, just drop it. Since Rust strongly encourages RAII, cleanup can go in drop implementations.

A much tougher definition of cancellation is "the future is never polled again" which is what the article hits on. The future isn't dropped but its poll is also unreachable, hence the deadlock.

wrs 2 days ago|||
I wish "cancellation" wasn't used for both of those. It seems to obfuscate understanding quite a bit. We should call them "dropped" and "abandoned" or something.
duped 2 days ago||
I don't think anyone really calls the latter "cancellation" in practice. I'm just pointing out that "is never polled again" is the tricky state with futures.
quietbritishjim 2 days ago|||
Interesting, thanks. So is it fair to say that if tokio::select!() only accepted tasks (or implicitly turned any futures it receives into tasks, like Python's asyncio.gather() does) then it wouldn't have this problem? Or, even if the async runtime is careful, is it still possible to create and fail to poll a raw Future by accident?
NobodyNada 2 days ago|||
> if tokio::select!() only accepted tasks (or implicitly turned any futures it receives into tasks, like Python's asyncio.gather() does) then it wouldn't have this problem?

Yes, this is correct. However, many of the use cases for select rely on the fact that it doesn't run all the tasks to completion. I've written many a select! statement to implement timeouts or other forms of intentionally preempting a task. Sometimes I want to cancel the task and sometimes I want to resume it after dealing with the condition that caused the preemption -- so the behavior in the article is very much an intentional feature.

> even if the async runtime is careful, is it still possible to create and fail to poll a raw Future by accident?

This is also the case. There's nothing magic about a future; it's just an ordinary object with a poll function. Any code can create a future and do whatever it likes with it; including polling it a few times and then stopping.

Despite being included as part of Tokio, select! does not interact with the runtime or need any kind of runtime support at all. It's an ordinary function that creates a future which waits for the first of several "child" futures to complete; similar functions are also provided in other prominent ecosystem crates besides Tokio and can be implemented in user code as well.

quietbritishjim 1 day ago||
> However, many of the use cases for select rely on the fact that it doesn't run all the tasks to completion.

That seems like a different requirement than "all arguments are tasks". If I understand it right (and quite possibly I don't), making them all tasks means that they are all polled and therefore continue progressing until they are dropped. It doesn't mean that select! would have to run them all the way to completion.

NobodyNada 1 day ago|||
I was sloppy with my wording, I should have said "it doesn't run all the futures to completion".

> making them all tasks means that they are all polled and therefore continue progressing until they are dropped. It doesn't mean that select! would have to run them all the way to completion.

This is exactly correct, but oftentimes the reason you're using select is because you don't want to run the futures all the way to completion. In my experience, the most common use cases for select are:

- An event handler loop that receives input from multiple channels. You could replace this with multiple tasks, one reading from each channel; but this could potentially mess with your design for queueing / backpressure -- often it's important for the loop to pause reading from the channels while processing the event.

- An operation that's run with a timeout, or a shutdown event from a controlling task. In this case I want the future to be dropped when the task is cancelled.

The example in the original post was the second case: an operation with a timeout. They wanted the operation to be cancelled when the timeout expired, but because the select statement borrowed the future, it only suspended the future instead of cancelling it. This is a very common code pattern when calling select! in a loop, when you want a future to be resumed instead of restarted on the next loop iteration -- it's very intentional that select! allows you to use either way, because you often want either behavior.

wrs 1 day ago|||
Doing select on tasks doesn’t really make sense semantically in the first place. Tasks are already getting polled by the executor. The purpose of select is to run some set of futures the executor doesn’t know about, until the first one of them completes. If you wanted to wait for one of a set of tasks to do something, you don’t need any additional polling, you’d just use something like a signal or a channel to communicate with them.
duped 2 days ago|||
It's always possible to create a future that is never polled, and this is a feature of Rust's zero-cost abstraction for async/await. If tokio::select required tasks it would be a lot less useful.

This problem would have been avoided by taking the future by value instead of by reference.

wrs 2 days ago||
It's hard to answer the question because of unclear terminology (tasks vs. futures). There are ways to do structured concurrency in Rust, but they are for tasks, not futures. There's not really a concept of "active futures" (other than calling an "active future" one that returned Pending the last time you polled it).

A task is the thing that drives progress by polling some futures. But one of those futures may want to handle polling for other futures that it made, which is where this arises.

As the article says, one option is to spawn everything as a task, but that doesn't solve all problems, and precludes some useful ways of using futures.

dvt 2 days ago||
> &mut future1 is dropped, but this is just a reference and so has no effect. Importantly, the future itself (future1) is not dropped.

There's a lot of talk about Rust's await implementation, but I don't really think that's the issue here. After all, Rust doesn't guarantee convergence. Tokio, on the other hand (being a library that handles multi-threading), should (at least when using its own constructs, e.g. the `select!` macro).

So, since the crux of the problem is the `tokio::select!` macro, it seems like a pretty clear tokio bug. Side note, I never looked at it before, but the macro[1] is absolutely hideous.

[1] https://docs.rs/tokio/1.34.0/src/tokio/macros/select.rs.html

oconnor663 2 days ago||
There's nothing `select!` could do here to force `future1` to drop, because it doesn't receive ownership of `future1`. If we wanted to force this, we'd have to forbid `select!` from polling futures by reference, but that's a pretty fundamental capability that we often rely on to `select!` in a loop for example. The blanket `impl<F> Future for &mut F where F: Future ...` isn't a Tokio thing either; that's in the standard library.
kibwen 2 days ago||
Surely not every use of `select!` needs this ability. If you can design a more restrictive interface that makes correctness easier to determine, then you should use that interface where you can, and reserve `select!` for only those cases where you can't.
mycoliza 2 days ago|||
What could `tokio::select!` do differently here to prevent bugs like this?

In the case of `select!`, it is a direct consequence of the ability to poll a `&mut` reference to a future in a `select!` arm, where the future is not dropped should another future win the "race" of the select. This is not really a choice Tokio made when designing `select!`, but is instead due to the existence of implementations of `Future` for `&mut T: Future + Unpin`[1] and `Pin<T: Future>`[2] in the standard library.

Tokio's `select!` macro cannot easily stop the user from doing this, and, furthermore, the fact that you can do this is useful --- there are many legitimate reasons you might want to continue polling a future if another branch of the select completes first. It's desirable to be able to express the idea that we want to continually poll drive one asynchronous operation to completion while periodically checking if some other thing has happened and taking action based on that, and then continue driving forward the ongoing operation. That was precisely what the code in which we found the bug was doing, and it is a pretty reasonable thing to want to do; a version of the `select!` macro which disallows that would limit its usefulness. The issue arises specifically from the fact that the `&mut future` has been polled to a state in which it has acquired, but not released, a shared lock or lock-like resource, and then another arm of the `select!` completes first and the body of that branch runs async code that also awaits that shared resource.

If you can think of an API change which Tokio could make that would solve this problem, I'd love to hear it. But, having spent some time trying to think of one myself, I'm not sure how it would be done without limiting the ability to express code that one might reasonably want to be able to write, and without making fundamental changes to the design of Rust async as a whole.

[1] https://doc.rust-lang.org/stable/std/future/trait.Future.htm... [2]: https://doc.rust-lang.org/stable/std/future/trait.Future.htm...

quadhome 1 day ago|||
> It's desirable to be able to express the idea that we want to continually poll drive one asynchronous operation to completion while periodically checking if some other thing has happened and taking action based on that, and then continue driving forward the ongoing operation.

This idea may be desirable; but, a deadlock is possible if there's a dependency between the two operations. The crux is the "and then continue," which I'm taking to mean that the first operation is meant to pause whilst the second operation occurs. The use of `&mut` in the code specifically enables that too.

If it's OK for the first operation to run concurrently with the other thing, then wrt. Tokio's APIs, have you seen LocalSet[1]? Specifically:

    let local = LocalSet::new();
    local.spawn_local(async move {
        sleep(Duration::from_millis(500)).await;
        do_async_thing("op2", lock.clone()).await;
    });
    local.run_until(&mut future1).await;
This code expresses your idea under a concurrent environment that resolves the deadlock. However, `op2` will still never acquire the lock because `op1` is first in the queue. I strongly suspect that isn't the intended behaviour; but, it's also what would have happened if the `select!` code had worked as imagined.

[1] https://docs.rs/tokio/latest/tokio/task/struct.LocalSet.html

rtpg 2 days ago||||
A meta-idea I have: look at all usages of `select!` with `&mut future`s in the code, and see if there are maybe 4 or 5 patterns that emerge. With that it might be possible to say "instead of `select!` use `poll_continuing!` or `poll_first_up!` or `poll_some_other_common_pattern!`".

It feels like a lot of the way Rust untangles these tricky problems is by identifying slightly more contextful abstractions, though at the cost of needing more scratch space in the mind for various methods

amluto 2 days ago||||
I can imagine an alternate universe in which you cannot do:

1. Create future A.

2. Poll future A at least once but not provably poll it to completion and also not drop it. This includes selecting it.

3. Pause yourself by awaiting anything that does not involve continuing to poll A.

I’m struggling a bit to imagine the scenario in which it makes sense to pause a coroutine that you depend on in the middle like this. But I also don’t immediately see a way to change a language like Rust to reliably prevent doing this without massively breaking changes. See my other comment :)

grogers 2 days ago|||
I'm not familiar with tokio, but I am familiar with folly coro in C++ which is similar-ish. You cannot co_await a folly::coro::Task by reference, you must move it. It seems like that prevents this bug. So maybe select! is the low level API and a higher level (i.e. safer) abstraction can be built on top?
dap 2 days ago|||
(author here)

Although the design of the `tokio::select!` macro creates ways to run into this behavior, I don't believe the problem is specific to `tokio`. Why wouldn't the example from the post using Streams happen with any other executor?

dvt 2 days ago||
First of all, great write-up! Had a blast reading it :) I think there's a difference between a language giving you a footgun and a library giving you a footgun. Libraries, by definition, are supposed to be as user-friendly as possible.

For example, I can just do `loop { }` which the language is perfectly okay with letting me do anywhere in my code (and essentially hanging execution). But if I'm using a library and I'm calling `innocuous()` and there's a `loop { }` buried somewhere in there, that is (in my opinion) the library's responsibility.

N.B. I don't know enough about tokio's internals to suggest any changes and don't want to pretend like I'm an expert, but I do think this caveat should be clearly documented and a "safe" version of `select!` (which wouldn't work with references) should be provided.

raggi 2 days ago||
i forget if this part unwinds to the exact same place, but some of this kind of design constraint in tokio stems from the much earlier language capabilities and is prohibitive to adjust without breaking the user ecosystem.

one of the key advertised selling points in some of the other runtimes was specifically around behavior of tasks on drop of their join handles for example, for reasons closely related to this post.

Sytten 2 days ago||
I am wondering if there is a larger RFC for Rust to force users to not hold a variable across await points.

In my mind futurelock is similar to keeping a sync lock across an await point. We have nothing right now to force a drop and I think the solution to that problem would help here.

ameliaquining 2 days ago||
There's an existing lint that lets you prohibit instances of specific types from being held across await points: https://rust-lang.github.io/rust-clippy/stable/index.html#aw...
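For reference, the truncated link points at Clippy's await-holding lint family: `await_holding_lock` catches `std::sync` guards out of the box, and the configurable `await_holding_invalid_type` variant lets you ban arbitrary types via clippy.toml. A sketch (the `await-holding-invalid-types` key name is from memory of the Clippy docs; double-check it before relying on this):

```toml
# clippy.toml -- deny holding these types across an .await point.
await-holding-invalid-types = [
    "std::sync::MutexGuard",
    { path = "std::sync::RwLockReadGuard", reason = "holding a sync guard across .await risks deadlock" },
]
```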
sunshowers 2 days ago|||
Note that forcing a drop of a lock guard has its own issues, particularly around leaving the guarded data in an invalid state. I cover this a bit in my talk that Bryan linked to in the OP [1].

[1] timestamped: https://youtu.be/zrv5Cy1R7r4?t=1067

amluto 2 days ago|||
I’m not convinced that this can help in a meaningful way.

Fundamentally, if you have two coroutines (or cooperatively scheduled threads or whatever), and one of them holds a lock, and the other one is awaiting the lock, and you don’t schedule the first one, you’re stuck.

I wonder if there’s a form of structured concurrency that would help. If I create two futures and start both of them (in Rust this means polling each one once) but do not continue to poll both, then I’m sort of making a mistake.

So imagine a world where, to poll a future at all, I need to have a nursery, and the nursery is passed in from my task and down the call stack. When I create a future, I can pass in my nursery, but that future then gets an exclusive reference to my nursery until it’s complete or cancelled. If I want to create more than one future that are live concurrently, I need to create a FutureGroup (that gets an exclusive reference to my nursery) and that allows me to create multiple sub-nurseries that can be used to make futures but cannot be used to poll them — instead I poll the FutureGroup.

(I have yet to try using an async/await system or a reactor or anything of the sort that is not very easy to screw up. My current pet peeve is this pattern:

    data = await thingy.read()
What if thingy.read() succeeds but I am cancelled? This gets nasty in most programming languages. Python: the docs on when I can get cancelled are almost nonexistent, and it’s not obviously possible to catch the CancelledError such that I still have data and can therefore save it somewhere so it’s not lost. Rust: what if thingy thinks it has returned the data but I’m never polled again? Maybe this can’t happen if I’m careful, but that requires more thought than I’m really happy with.)
mechanical_berk 1 day ago|||
I agree. It seems like this bug arises because one Future is awaited while another is ignored. I have seen this sort of bug a lot.

So maybe all that is needed is a lint that warns if you keep a Future (or a reference to one) across an await point? The Future you are awaiting wouldn't count of course. Is there some case where this doesn't work?

cogman10 2 days ago||
The idea that has been batted around is called "async drop" [1]

And it looks like it's still just an unaddressed well known problem [2].

Honestly, once the Mozilla sackening of rust devs happened it seems like the language has been practically rudderless. The RFC system seems almost dead as a lot of the main contributors are no longer working on rust.

This initiative hasn't had motion since 2021. [3]

[1] https://rust-lang.github.io/async-fundamentals-initiative/ro...

[2] https://rust-lang.github.io/async-fundamentals-initiative/

[3] https://github.com/rust-lang/async-fundamentals-initiative

raggi 2 days ago|||
Those pages are out of date, and AsyncDrop is in progress: https://github.com/rust-lang/rust/issues/126482

I think "practically rudderless" here is fairly misinformed and a little harmful/rude to all the folks doing tons of great work still.

It's a shame there are some stale pages around and so on, but they're not good measures of the state of the project or ecosystem.

The problem of holding objects across async points is also partially implemented in this unstable lint marker which is used by some projects: https://dev-doc.rust-lang.org/unstable-book/language-feature...

You also get a similar effect in multi-threaded runtimes by not arbitrarily making everything in your object model Send and instead designing your architecture so that most things between wake-ups don't become arbitrarily movable references.

These aren't perfect mitigations, but some tools.

bigstrat2003 2 days ago|||
In fairness, if you're a layman to the rust development process (as I am, so I'm speaking from personal experience here) it's damn near impossible to figure out the status of things. There are tracking issues, RFCs, etc., which is very confusing as an outsider and gives no obvious place to look to find out the current status of a proposal. I'm sure there is a logic to it and that if I spent the time to learn it would make sense. But it is really hard to approach for someone like me.
kibwen 2 days ago||
If you want to find out the status of something, the best bet is to go to the Rust Zulip and ask around: https://rust-lang.zulipchat.com/ . Most Rust initiatives are pushed forward by volunteers who are happy to talk about what they're working on, but who only periodically write status reports on tracking issues (usually in response to someone asking them what the status is). Rust isn't a company where documentation is anyone's job, it's just a bunch of people working on stuff, for better or worse.
surajrmal 1 day ago||
To be fair, this information is hard to come by in companies as well, unless it's being tracked and reported by program managers. I wonder if it would be sensible for the rust foundation to use some funds to pay folks to help better track and organize the efforts the wider community is deeply interested in. I care about a lot of the things mentioned in this thread, but the cost of paying attention or needing to bother folks in chatrooms to get status prevents me from really staying abreast of the latest.
cogman10 2 days ago|||
> I think "practically rudderless" here is fairly misinformed and a little harmful/rude to all the folks doing tons of great work still.

That great work is mostly opaque on the outside.

What's been noticeable as an observer is that a lot of the well known names associated with rust no longer work on it and there's been a large amount of turnover around it.

That manifests in things like this case where work was in progress up until ~2021 and then was ultimately backburnered while the entire org was reshuffled. (I'd note the dates on the MCP as Feb 2024).

I can't tell exactly how much work or what direction it went in from 2021 to 2024 but it does look apparent that the work ultimately got shifted between multiple individuals.

I hope rust is in a better spot. But I also don't think I was being unfair in pointing out how much momentum got wrecked when Mozilla pulled support.

raggi 2 days ago||
The language team tends to look at these kinds of challenges and drive them to a root cause, which spins off a tree of work to adjust the core language to support what's required by the higher level pieces, once that work is done then the higher level projects are unblocked (example: RPIT for async drop).

That's not always super visible if you're not following the working groups or in contact with folks working on the stuff. It's entirely fair that they're prioritizing getting work done over explaining low-level language challenges to everyone everywhere.

I think you're seeing a lack of data and using it to fit a story you like, rather than seeing data that actually supports that story. Of course some people were horribly disrupted by the changes, but language usage also expanded substantially during and since that time, and there are many team members employed by many other organizations, and many independents too.

And there are more docs, anyway:

https://rust-lang.github.io/rust-project-goals/2024h2/async.... https://rust-lang.github.io/rust-project-goals/2025h1/async.... https://rust-lang.github.io/rust-project-goals/2025h2/field-... https://rust-lang.github.io/rust-project-goals/2025h2/evolvi... https://rust-lang.github.io/rust-project-goals/2025h2/goals....

kibwen 2 days ago|||
While the Mozilla layoffs were a stressful time with a lot of uncertainty involved, in the end it hasn't appeared to have had a deleterious effect on Rust development. Today the activity in the Rust repo is as high as it's ever been (https://github.com/rust-lang/rust/graphs/contributors) and the governance of the project is more organized and healthy than it's ever been (https://blog.rust-lang.org/2025/10/15/announcing-the-new-rus...). The language certainly isn't rudderless, it's just branched out beyond the RFC system (https://blog.rust-lang.org/2025/10/28/project-goals-2025h2/). RFCs are still used for major things as a form of documentation, validation, and community alignment, but doing design up-front in RFCs has turned out to be an extremely difficult process. Instead, it's evolving toward a system where major things get implemented first as experiments, whose design later guides the eventual RFC.
surajrmal 1 day ago||
All major projects need to be done this way. Upfront design without any concrete feedback from a real implementation of that design will always be flawed unless incredibly simple. If anything, the next evolution of this is a process to determine which experiments are worth running to get to a point where you can come up with a sensible design. In organizations I've worked in, the first phase is known as problem framing, which quickly evolved into requirements gathering. From there you come up with several possible loosely defined designs which you pursue in parallel and then at some point, with data in hand you come back and decide on a path forward, possibly needing a few more iterations of experiments based on what you've learned. Reaching agreement early on the problem and requirements really does help the rest of the process move forward, and makes deciding on a final choice much simpler and less political.
arjie 2 days ago||
Wow, that makes sense afterwards but I would not have guessed at it immediately looking at the code. Very insidious. Great blogpost.
qouteall 2 days ago||
Simplified: tokio::select! discards the other futures as soon as one of them completes.

The discarded futures will never be run again.

Normally a discarded future is dropped, and when a future holding a lock is dropped, the lock is released. But here the future is passed to select! by mutable borrow, so the discarded future is not dropped -- it just stops being polled, while still holding the lock.

So it leaves behind a future that holds a lock and will never run again.

octoberfranklin 2 days ago||
For anybody who wants to cut to the chase, it's this:

> The behavior of tokio::select! is to poll all branches' futures only until one of them returns `Ready`. At that point, it drops the other branches' futures and only runs the body of the branch that’s ready.

This is, unfortunately, doing what it's supposed to do: acting as a footgun.

The design of tokio::select!() implicitly assumes it can cancel tasks cleanly by simply dropping them. We learned the hard way back in the Java days that you cannot kill threads cleanly all the time. Unsurprisingly, the same thing is true for async tasks. But I guess every generation of programmers has to re-learn this lesson. Because, you know, actually learning from history would be too easy.

Unfortunately there are a bunch of footguns in tokio (and async-std too). The state-machine transformation inside rustc is a thing of beauty, but the libraries and APIs layered on top of that should have been iterated many more times before being rolled out into widespread use.

kmeisthax 2 days ago||
No, dropping a Rust future is an inherently safe operation. Futures don't live on their own, they only ever do work inside of .poll(), so you can't "catch them with their pants down" and corrupt state by dropping them. Yield points are specifically designed to be cancel-safe.

Crucially, however, because Futures have no independent existence, they can be indefinitely paused if you don't actively and repeatedly .poll() them, which is the moral equivalent of cancelling a Java Thread. And this is represented in language state as a leaked object, which is explicitly allowed in safe Rust, although the language still takes pains to avoid accidental leakage. The only correct way to use a future is to poll it to completion or drop it.

The problem is that in this situation, tokio::select! only borrows the future and thus can't drop it: dropping the &mut borrow does nothing to the underlying future. select! can't tell the difference, because a mutable borrow of a future is itself a future, so all the traits still match up. It's a combination of slightly unintuitive core language design and a major infrastructure library not thinking things through.

vacuity 1 day ago|||
> The state-machine transformation inside rustc is a thing of beauty, but the libraries and APIs layered on top of that should have been iterated many more times before being rolled out into widespread use.

Quite. The mistake was advertising the feature as production-ready (albeit not complete), which led the ecosystem to adopt async faster than it could work out the flaws and footguns. The nice thing about the language/library split is that there is a path to good async Rust without major backwards-compatibility-breaking language changes (probably...), because the core concepts are sound, but the library ecosystem will bear the necessary brunt of evolution.

littlestymaar 2 days ago||
I genuinely don't understand why people use select! at all given how much of a footgun it is.
octoberfranklin 2 days ago||
Well the less-footgun-ish alternative would look something like a Stream API, but the last time I checked tokio-stream wasn't stable yet.

Then you could merge a `Stream<A>` and `Stream<B>` into a `Stream<Either<A,B>>` and pull from that. Since you're dealing with owned streams, dropping the stream forces some degree of cleanup. There are still ways to make a mess, but they take more effort.

   ....................................
Ratelimit so I have to reply to mycoliza with an edit here:

That example calls `do_thing()`, whose body does not appear anywhere in the webpage. Use better identifiers.

If you meant `do_stuff()`, you haven't replaced select!() with streams, since `do_stuff()` calls `select!()`.

The problem is `select!()`; if you keep using `select!()` but just slather on a bunch of streams that isn't going to fix anything. You have to get rid of select!() by replacing it with streams.

mycoliza 2 days ago|||
In reply to your edit, that section in the RFD includes a link to the full example in the Rust playground. You'll note that it does not make any use of `select!`: https://play.rust-lang.org/?version=stable&mode=debug&editio...

Perhaps the full example should have been reproduced in the RFD for clarity…

mycoliza 2 days ago|||
An analogous problem is equally possible with streams: https://rfd.shared.oxide.computer/rfd/0609#_how_you_can_hit_...
tick_tock_tick 2 days ago||
It seems more and more clear every day that async was rushed out the door way too quickly in Rust.
kibwen 2 days ago||
There's a lot of improvements I could think of for async Rust, but there's basically nothing I would change about the fundamentals that underlie it (other than some tweaks to Pin, maybe, and I could quibble over some syntax). There's nothing rushed about it; it's a great foundation that demonstrably just needs someone to finish building the house on top of it (and, to continue the analogy, needs someone to finish building the sub-basement (cough, generalized coroutines)).
tick_tock_tick 2 days ago||
A foundation full of warts belongs in experimental. I don't see how your own admission that the house and the sub-basement aren't finished doesn't instantly mean it should have stayed experimental.
kibwen 2 days ago||
Your assertion is that it was "rushed". And yet here we are today, talking about how much we wish were implemented. That's not rushed--that's the polar opposite of rushed. Almost nothing about what we currently have on stable would have been better if it was still percolating on nightly, and would have the downside of having almost no feedback from real-world use. I remember the pre-async days, nesting callbacks by hand. What we have now is a great improvement, and just needs more niceties stacked on top of it, not any sort of fundamental overhaul.
clarkmcc 2 days ago||
I can’t say whether it was rushed out, but it’s clearly not everything it was advertised to be. Early on, the big talking point was that the async implementation was so modular you could swap runtimes like Lego bricks. In reality, that’s nowhere near true. Changing runtimes means changing every I/O dependency (mutexes, networking, fs), because everything is tightly coupled to the runtime. I raised this in a Reddit thread some time ago, and the feedback there reinforced that I'm not the only one with a sour Rust async taste in my mouth. https://www.reddit.com/r/rust/comments/1f4z84r/is_it_fair_to...
pshirshov 1 day ago|
In my experience, almost every asynchronous runtime has faced a similar issue at some point (e.g. we helped find and fix such an issue in ZIO).

It's hard to verify these protocols and very easy to write something fragile.
