
Posted by ingve 3 days ago

Looking at Unity made me understand the point of C++ coroutines(mropert.github.io)
122 points | 113 comments
Joker_vD 8 hours ago|
Simon Tatham, author of PuTTY, has quite a detailed blog post [0] on using C++20's coroutine system. And yep, it's a lot to do on your own; C++26 really ought to give us some pre-built templates/patterns/scaffolds.

[0] https://web.archive.org/web/20260105235513/https://www.chiar...

zozbot234 2 hours ago||
People love to complain about Rust async-await being too complicated, but somehow C++ manages to be even worse. C++ never disappoints!
jandrewrogers 18 seconds ago|||
I find C++ coroutines to be well-designed. Most of the complexity is intrinsic because it tries to be un-opinionated. It allows precise control and customization of almost every conceivable coroutine behavior while still adhering to the principle of zero-cost abstractions.

Most people would prefer opinionated libraries that allow them to not think about the design tradeoffs. The core implementation is targeted at efficient creation of opinionated abstractions rather than providing one. This is the right choice. Every opinionated abstraction is going to be poor for some applications.

01HNNWZ0MV43FF 40 minutes ago|||
async is simply a difficult problem, and I think we'll find irreducible complexity there. Sometimes you are just doing 2 or 3 things at once and you need a hand-written state machine with good unit tests around it. Sometimes you can't just glue 3 happy paths together into CSP and call it a day.
rafram 3 minutes ago||
Languages like Swift do manage to make it much simpler. The culture guiding Rust design pretty clearly treats complexity as a goal.
matt_d 1 hour ago||
See also C++ coroutines resources (posts, research, software, talks): https://gist.github.com/MattPD/9b55db49537a90545a90447392ad3...
ZoomZoomZoom 1 hour ago||
For a layperson it's clear that it's either "Writings" and "Talks", or "Readings" and "Listenings", but C++ proficiency is in an inverse relation with being apt at taxonomy, it looks like.

Thanks for the list.

nananana9 8 hours ago||
You can roll stackful coroutines in C++ (or C) with 50-ish lines of Assembly. It's a matter of saving a few registers and switching the stack pointer, minicoro [1] is a pretty good C library that does it. I like this model a lot more than C++20 coroutines:

1. C++20 coros are stackless, in the general case every async "function call" heap allocates.

2. If you do your own stackful coroutines, every function can suspend/resume, you don't have to deal with colored functions.

3. (opinion) C++20 coros are very tasteless and "C++-design-committee pilled". They're very hard to understand and implement, they require the STL, they're very heavy in debug builds, and you'll end up with template hell to do something as simple as Promise.all

[1] https://github.com/edubart/minicoro

pjc50 8 hours ago||
> You can roll stackful coroutines in C++ (or C) with 50-ish lines of Assembly

I'm not normally keen to "well actually" people with the C standard, but .. if you're writing in assembly, you're not writing in C. And the obvious consequence is that it stops being portable. Minicoro only supports three architectures. Granted, those are the three most popular ones, but other architectures exist.

(just double checked and it doesn't do Windows/ARM, for example. Not that I'm expecting Microsoft to ship full conformance for C++23 any time soon, but they have at least some of it)

audidude 3 hours ago|||
> I'm not normally keen to "well actually" people with the C standard, but .. if you're writing in assembly, you're not writing in C.

These days on Linux/BSD/Solaris/macOS you can use makecontext()/swapcontext() from ucontext.h, and on the important architectures it turns out to have roughly the same performance as what everyone used to do with custom assembly. And you already have fiber functions as part of the Windows API to trampoline.

I had to support a number of architectures in libdex for Debian. This is GNOME code of course, which isn't everyone's cup of C. (It also supports BSDs/Linux/macOS/Solaris/Windows).

* https://packages.debian.org/sid/libdex-1-1

* https://gitlab.gnome.org/GNOME/libdex

gpderetta 1 hour ago||
Unfortunately swapcontext() requires saving and restoring the signal mask, which, at least on Linux, requires a syscall, so it is going to be at least a hundred times slower than a hand-rolled implementation.

Also, although not likely to be removed anytime soon from existing systems, POSIX has declared the context API obsolescent a while ago (it might actually no longer be part of the standard).

giancarlostoro 3 hours ago||||
> Not that I'm expecting Microsoft to ship full conformance for C++23 any time soon,

They are actively working on it for their VS2026 C++ compiler. I think since 2017 or so they've kept up with C++ standards reasonably? I'm not a heavy C++ guy, so maybe I'm wrong, but my understanding is they match the standards.

manwe150 6 hours ago||||
Boost has stackful coroutines. They also used to be in posix (makecontext).
blacklion 4 hours ago||||
There is no "Linux/ARM[64]". But there are "Raspberry Pi" and "RISC-V". I don't know such OSes, to be honest :-)

This support table is a complete mess. And saying "most platforms are supported" is too optimistic, or even cocky.

ndiddy 3 hours ago||||
Looking at the repo, it falls back to Windows fibers on Windows/ARM. If you'd like a coroutine with more backends, I'm a fan of libco: https://github.com/higan-emu/libco/ which has assembly backends for x86, amd64, ppc, ppc-64, arm, and arm64 (and falls back to setjmp on POSIX platforms and fibers on Windows). Obviously the real solution would be for the C or C++ committees to add stackful coroutines to the standard, but unless that happens I would rather give up support for hppa or alpha or 8-bit AVR or whatever than not be able to use stackful coroutines.
gpderetta 1 hour ago||
A proposal to add stackful coroutines has been around forever and gets updated at every single mailing. Unfortunately the authors don't really have backing from any major company.
fluoridation 6 hours ago|||
I think what they meant is that that's what it takes to add coroutine support to a C/C++ program. Adding it to, say, Java or C# is much more involved.
TuxSH 1 hour ago|||
> every async "function call" heap allocates.

> require the STL

That it has to heap-allocate if non-inlined is a misconception. This is only the default behavior.

One can define:

void *operator new(size_t sz, Foo &foo)

in the coro's promise type, and this:

- removes the implicitly-defined operator new

- forces the coro's signature to be CoroType f(Foo &foo), and forwards arguments to the "operator new" one defined

Therefore, it's pretty trivial to support coroutines even when heap cannot be used, especially in the non-recursive case.

Yes, green threads ("stackful coroutines") are more straightforward to use, however:

- they can't be arbitrarily destroyed when suspended (this would require stack unwinding support and/or active support from the green thread runtime)

- they are very ABI-dependent. Among the "few registers" one has to save are the FPU registers, which may not even be present on older Arm architectures or with codegen options similar to -mgeneral-regs-only (for code that runs "below" userspace). Said FPU registers also take up a lot of space in the stack frame

Really, stackless coros are just FSM generators (which is obvious if one looks at disasm)

gpderetta 1 hour ago||
A stackful coroutine implementation has to save exactly the same registers that a stackless one has to: the live ones at the suspension point.

A pure library implementation that relies on normal function call semantics obviously needs to conservatively save at least all callee-saved registers, but that's not the only possible implementation. An implementation with compiler help should be able to do significantly better.

Ideally the compiler would provide a built-in, but even, for example, an implementation using GCC inline ASM with proper clobbers can do significantly better.

Joker_vD 7 hours ago|||
Hmm. I'm fairly certain that most of that assembly code for saving/restoring registers can be replaced with setjmp/longjmp, and only control transfer itself would require actual assembly. But maybe not.

That's the problem with register machines, I guess. Interestingly enough, BCPL, its main implementation being a p-code interpreter of sorts, has pretty trivially supported coroutines in its "standard" library since the late seventies — as you say, all you need to save is the current stack pointer and the code pointer.

lelanthran 7 hours ago|||
> Hmm. I'm fairly certain that most of that assembly code for saving/restoring registers can be replaced with setjmp/longjmp, and only control transfer itself would require actual assembly.

Actually you don't even need setjmp/longjmp. I've used a library (embedded environment) called protothreads (plain C) that abused the preprocessor to implement stackful coroutines.

(Defined a macro that used the __LINE__ macro coupled with another macro that used a switch statement to ensure that calling the function again made it resume from where the last YIELD macro was encountered)

Cloudef 6 hours ago||
Wouldn't that be stackless (shared stack)?
lelanthran 6 hours ago||
Correct; stackless. I misspoke.
zabzonk 7 hours ago||||
You can do a lot of horrible things with setjmp and friends. I actually implemented some exception throw/catch macros using them (which did work) for a compiler that didn't support real C++ exceptions. Thank god we never used them in production code.

This would be about 32 years ago - I don't like thinking about that ...

gpderetta 58 minutes ago||
GCC still uses sj/lj by default on some targets to implement exceptions.
gpderetta 7 hours ago|||
setjmp + longjmp + sigaltstack is indeed the old trick.
Sharlin 6 hours ago|||
C++ destructors and exception safety will likely wreak havoc with any "simple" assembly/longjmp-based solution, unless severely constraining what types you can use within the coroutines.
fluoridation 6 hours ago||
Not really. I've done it years ago. The one restriction for code inside the coroutine is that it mustn't catch (...). You solve destruction by distinguishing whether a coroutine is paused in the middle of execution or if it finished running. When the coroutine is about to be destructed you run it one last time and throw a special exception, triggering destruction of all RAII objects, which you catch at the coroutine entry point.

Passing uncaught exceptions from the coroutine up to the caller is also pretty easy, because it's all synchronous. You just need to wrap it so it can safely travel across the gap. You can restrict the exception types however you want. I chose to support only subclasses of std::exception and handle anything else as an unknown exception.

gpderetta 56 minutes ago|||
> mustn't catch (...)

You could use the same trick used by glibc to implement unstoppable exceptions for POSIX cancellation: the exception rethrows itself from its destructor.

pjc50 5 hours ago||||
> Passing uncaught exceptions from the coroutine up to the caller is also pretty easy, because it's all synchronous. You just need to wrap it so it can safely travel across the gap

This is also how dotnet handles it, and you can choose whether to rethrow at the caller site, inspect the exception manually, or run a continuation on exception.

Sharlin 6 hours ago|||
Thanks, that's interesting.
socalgal2 2 hours ago|||
As an ex-gamedev: suspend/resume/stackful coroutines were too heavy to have several thousand of them running during a game loop for our game. At the time we used GameMonkey Script: https://github.com/publicrepo/gmscript

That was over 20 years ago. No idea what the current hotness is.

MisterTea 3 hours ago||
A much nicer code base to study is: https://swtch.com/libtask/

The stack save/restore happens in: https://swtch.com/libtask/asm.S

cherryteastain 8 hours ago||
Not an expert in game development, but I'd say the issue with C++ coroutines (and 'colored' async functions in general) is that the whole call stack must be written to support that. From a practical perspective, that must in turn be backed by a multithreaded event loop to be useful, which is very difficult to write performantly and correctly. Hence, most people end up using coroutines with something like boost::asio, but you can do that only if your repo allows a 'kitchen sink' library like Boost in the first place.
spacechild1 6 hours ago||
> that must in turn be backed by a multithreaded event loop to be useful

Why? You can just as well execute all your coroutines on a single thread. Many networking applications are doing fine with just a single ASIO thread.

Another example: you could write game behavior in C++ coroutines and schedule them on the thread that handles the game logic. If you want to wait for N seconds inside the coroutine, just yield it as a number. When the scheduler resumes a coroutine, it receives the delta time and then reschedules the coroutine accordingly. This is also a common technique in music programming languages to implement musical sequencing (e.g. SuperCollider)

pjc50 8 hours ago|||
Much of the original motivation for async was for single threaded event loops. Node and Python, for example. In C# it was partly motivated by the way Windows handles a "UI thread": if you're using the native Windows controls, you can only do so from one thread. There's quite a bit of machinery in there (ConfigureAwait) to control whether your async routine is run on the UI thread or on a different worker pool thread.

In a Unity context, the engine provides the main loop and the developer is writing behaviors for game entities.

hrmtst93837 53 minutes ago|||
They don't need a multithreaded event loop. Single-threaded schedulers cover plenty of game-style work without hauling in Boost, and the uglier part is that async colors the API surface and control flow in ways that make refactors annoying and big legacy codebases harder to reason about.
nitwit005 1 hour ago|||
> but I'd say the issue with C++ coroutines (and 'colored' async functions in general) is that the whole call stack must be written to support that.

You can call a function that makes use of coroutines without worrying about it. That's the core intent of the design.

That is, if you currently use some blocking socket library, we could replace the implementation of that with coroutine based sockets, and everything should still work without other code changes.

spacechild1 8 hours ago|||
ASIO is also available outside of boost! https://github.com/chriskohlhoff/asio
lionkor 8 hours ago||
For anyone wondering: this isn't a hack, it's the same library, just as good, just without the Boost dependencies.
spacechild1 6 hours ago||
Thanks for pointing this out! This may not be obvious to everybody.

Also, this is not some random GitHub repo: Chris Kohlhoff is the developer of ASIO :)

inetknght 5 hours ago||
> From a practical perspective, that must in turn be backed by a multithreaded event loop to be useful

Multithreaded? Nope. You can do C++ coroutines just fine in a single-threaded context.

Event loop? Only if you're wanting to do IO in your coroutines and not block other coroutines while waiting for that IO to finish.

> most people end up using coroutines with something like boost::asio

Sure. But you don't have to. Asio is available without the kitchen sink: https://think-async.com/Asio/

Coroutines are actually really approachable. You don't need boost::asio, but it certainly makes it a lot easier.

I recommend watching Daniela Engert's 2022 presentation, Contemporary C++ in Action: https://www.youtube.com/watch?v=yUIFdL3D0Vk

Davidbrcz 3 hours ago||
I use ASIO at work for coroutines. It's one of the most opaque libraries I've ever used. The doc is awful and impenetrable.

The most helpful resource about it is a guy on Stack Overflow (sehe). No idea how to get help once SO has closed.

astrange 16 minutes ago||
Ask Claude Code to write a manual for it.
abcde666777 8 hours ago||
More broadly the dimension of time is always a problem in gamedev, where you're partially inching everything forward each frame and having to keep it all coherent across them.

It can easily and often does lead to messy rube goldberg machines.

There was a game AI talk a while back, I forget the name unfortunately, but as I recall the guy was pointing out this friction and suggesting additions we could make at the programming language level to better support that kind of time spanning logic.

manoDev 6 hours ago||
This is more evident in games/simulations but the same problem arises more or less in any software: batch jobs and DAGs, distributed systems and transactions, etc.

This what Rich Hickey (Clojure author) has termed “place oriented programming”, when the focus is mutating memory addresses and having to synchronize everything, but failing to model time as a first class concept.

I’m not aware of any general purpose programming language that successfully models time explicitly, Verilog might be the closest to that.

gopher_space 15 minutes ago||
> I’m not aware of any general purpose programming language that successfully models time explicitly

Step 1, solve "time" for general computing.

The difficulty here is that our periods are local out of both necessity and desire; we don't fail to model time as a first class concept, we bring time-as-first-class with us and then attempt to merge our perspectives with varying degrees of success.

We're trying to rectify the observations of Zeno, a professional turtle hunter, and a track coach with a stopwatch when each one has their own functional definition of time driven by intent.

syncurrent 6 hours ago|||
These timing additions to a language are also at the core of imperative synchronous programming languages like Esterel, Céu, or Blech.
repelsteeltje 8 hours ago|||
> There was a game AI talk a while back, I forget the name unfortunately, but as I recall the guy was pointing out this friction and suggesting additions we could make at the programming language level to better support that kind of time spanning logic.

Sounds interesting. If it's not too much of an effort, could you dig up a reference?

abcde666777 5 hours ago||
You're in luck - it's the first talk at this link, "The Polling Problem": https://www.gdcvault.com/play/1018040/Architecture-Tricks-Ma...

Mind you my memory may have distorted it a little beyond what it was, but it's loosely on the topic!

twoodfin 7 hours ago||
As the author lays out, the thing that made coroutines click for me was the isomorphism with state machine-driven control flow.

That’s similar to most of what makes C++ tick: There’s no deep magic, it’s “just” type-checked syntactic sugar for code patterns you could already implement in C.

(Occurs to me that the exceptions to this … like exceptions, overloads, and context-dependent lookup … are where C++ has struggled to manage its own complexity.)

HarHarVeryFunny 7 hours ago|
If you need to implement an async state machine, couldn't that just as easily be done with std::future? How do coroutines make this cleaner/better?
LatencyKills 7 hours ago||
std::future doesn't give you a state machine. You get the building blocks you have to assemble into one manually. Coroutines give you the same building blocks but let the compiler do the assembly, making the suspension points visible in the source while hiding the mechanical boilerplate.

This is why coroutine-based frameworks (e.g., C++20 coroutines with cppcoro) have largely superseded future-chaining for async state machine work — the generated code is often equivalent, but the source code is dramatically cleaner and closer to the synchronous equivalent.

(me: ex-Visual Studio dev who worked extensively on our C++ coroutine implementation)

HarHarVeryFunny 3 hours ago|||
It doesn't seem like a clear win to me. The only "assembly" required with std::future is creating the associated promise and using it to signal when that async step is done, and the upside is a nice readable linear flow, as well as ease of integration (just create a thread to run the state machine function if you want multiple in parallel).

With the coroutine approach using yield, doesn't that mean the caller needs to decide when to call it again? With the std::future approach where it's event driven by the promise being set when that state/step has completed.

LatencyKills 2 hours ago||
You are describing a single async step, not a state machine. "Create a promise, set it when done", that's one state. A real async state machine has N states with transitions, branching, error handling, and cleanup between them.

> "The only 'assembly' required is creating the associated promise"

Again, that is only true for one step. For a state machine with N states you need explicit state enums or a long chain of .then() continuations. You also need to manage the shared state across continuations (normally on the heap). You need to manage manual error propagation across each boundary and handle the cancellation tokens.

You only get a "nice readable linear flow" using std::future when 1) using a blocking .get() on a thread, or 2) .then() chaining, which isn't "nice" by any means.

Lastly, you seem to be conflating a co_yield (generator, pull-based) with co_await (event-driven, push-based). With co_await, the coroutine is resumed by whoever completes the awaitable.

But what do I know... I only worked on implementing coroutines in cl.exe for 4 years. ;-)

physPop 7 hours ago|||
I feel like that's really overselling coro -- there's still a TON of boilerplate
LatencyKills 6 hours ago||
My response specifically addressed the question of why you might choose one option over the other.

Do you believe that std::future is the better option?

wiseowise 5 hours ago||
Looking at C++ made me understand the point of Rust.
bullen 6 hours ago||
Coroutines generally imply some sort of magic to me.

I would just go straight to tbb and concurrent_unordered_map!

The challenge of parallelism does not come from how to make things parallel, but how you share memory:

How you avoid cache misses, make sure threads don't trample each other and design the higher level abstraction so that all layers can benefit from the performance without suffering turnaround problems.

My challenge right now is how do I make the JVM fast on native memory:

1) Rewrite my own JVM. 2) Use the buffer and offset structure Oracle still has but has deprecated and is encouraging people to not use.

We need Java/C# (already has it but is terrible to write native/VM code for?) with bottlenecks at native performance and one way or the other somebody is going to have to write it?

pjc50 5 hours ago||
> C# (already has it but is terrible to write native/VM code for?)

What do you mean here? Do you mean hand-writing MSIL or native interop (pinvoke) or something else?

bullen 2 hours ago||
No, I meant this, but for C# it's a whole lot more complex:

http://move.rupy.se/file/jvm.txt

themafia 1 hour ago||
> some sort of magic to me.

Your stack is on the heap and it contains an instruction pointer to jump to for resume.

sagebird 2 hours ago||
>> To misquote Kennedy, “we chose to focus coroutines on generator in C++23, not because it is hard, but because it is easy”.

Appreciate this humor -- absurd, tasteful.

pjmlp 6 hours ago||
As I mentioned on the Reddit thread,

This is quite understandable when you know the history behind how C++ coroutines came to be.

They were initially proposed by Microsoft, based on a C++/CX extension, that was inspired by .NET async/await implementation, as the WinRT runtime was designed to only support asynchronous code.

Thus if one knows how the .NET compiler and runtime magic works, including custom awaitable types, there will be some common bridges to what C++ coroutines ended up looking like.

mgaunard 7 hours ago|
Coroutines are just a way to write continuations in an imperative style, with more overhead.

I never understood the value. Just use lambdas/callbacks.

usrnm 6 hours ago||
> Just use lambdas/callbacks

"Just" is doing a lot of work there. I've use callback-based async frameworks in C++ in the past, and it turns into pure hell very fast. Async programming is, basically, state machines all the way down, and doing it explicitly is not nice. And trying to debug the damn thing is a miserable experience

mgaunard 6 hours ago||
You can embed the state in your lambda context, it really isn't as difficult as what people claim.

The author just chose to write it as a state machine, but you don't have to. Write it in whatever style helps you reach correctness.

Sharlin 6 hours ago||
You still need the state and the dispatcher, even if the former is a little more hidden in the implicit closure type.
affenape 6 hours ago|||
Not necessarily. A coroutine encapsulates the entire state machine, which might be a PITA to implement otherwise. Say, if I have a stateful network connection, that requires initialization and periodic encryption secret renewal, a coroutine implementation would be much slimmer than that of a state machine with explicit states.
spacechild1 5 hours ago|||
> Just use lambdas/callbacks.

Lol, no thanks. People are using coroutines exactly to avoid callback hell. I have rewritten my own C++ ASIO networking code from callback to coroutines (asio::awaitable) and the difference is night and day!

socalgal2 2 hours ago|||
I'll take the bait. Here's a coroutine

    waitFrames(5); // wait 5 frames
    fireProjectile();
    waitFrames(15);
    turnLeft(-30/*deg*/, 120); // turn left over 120 frames
    waitFrames(10);
    fireProjectile();
    // spin and shoot
    for (i of range(0, 360, 60)) {
      turnRight(60, 90);  // turn 60 degrees over 90 frames
      fireProjectile();
    }
10 lines and I get behavior over time. What would your non-coroutine solution look like?
mgaunard 1 hour ago||
Given a coroutine body

    int f() { a; co_yield r; b; co_return r2; }

this transforms into

    auto f(auto then) { a; return then(r, [&]() { b; return then(r2); }); };

You can easily extend this to arbitrarily complex statements. The main thing is that obviously, you have to worry about the capture lifetime yourself (coroutines allocate a frame separate from the stack), and the syntax causes nesting for every statement (but you can avoid that using operator overloading, like C++26/29 does for executors)

jayd16 5 hours ago|||
You can structure coroutines with a context so the runtime has an idea when it can drop them or cancel them. Really nice if you have things like game objects with their own lifecycles.

For simple callback hell, not so much.

Sharlin 6 hours ago|||
Did you read the article? As the author says, it becomes a state machine hell very quickly beyond very simple examples.
kccqzy 6 hours ago||
I just don’t agree that it always becomes a state machine hell. I even did this in C++03 code before lambdas. And honestly, because it was easy to write careless spaghetti code, it required a lot more upfront thought into code organization than just creating lambdas willy-nilly. The resulting code is verbose, but then again C++ itself is a fairly verbose language.
duped 3 hours ago|||
The value is fewer indirect function calls and heap allocations (so less overhead than callbacks), and well-defined tasks that you can select/join/cancel.
DonHopkins 6 hours ago||
The Unity editor does not let you examine the state hidden in your closures or coroutines. (And the Mono debugger is a steaming pile of shit.)

Just put your state in visible instance variables of your objects, and then you will actually be able to see and even edit what state your program is in. Stop doing things that make debugging difficult and frustratingly opaque.

jayd16 4 hours ago||
Use Rider or Visual Studio. Debugging coroutines should be easy. You just can't step over any yield points so you need to break after execution is resumed. It's mildly tedious but far from impossible.