If you’re counting that low, then you need to count carefully.
A coroutine switch, however well implemented, inevitably breaks the branch predictor’s idea of your return stack, but the effect of mispredicted returns will be smeared over the target coroutine’s execution rather than concentrated at the point of the switch. (Similar issues exist with e.g. measuring the effect of blowing the cache on a CPU migration.) I’m actually not sure if Zig’s async design even uses hardware call/return pairs when a (monomorphized-as-)async function calls another one, or if every return just gets translated to an indirect jump. (This option affords what I think is a cleaner design for coroutines with compact frames, but it is much less friendly to the CPU.)
So a foolproof benchmark would require one to compare the total execution time of a (compute-bound) program that constantly switches between (say) two tasks to that of an equivalent program that not only does not switch but (given what little I know about Zig’s “colorless” async) does not run under an async executor(?) at all. Those tasks would also need to yield on a non-trivial call stack each time. Seems quite tricky all in all.
Also, if you control the compiler, an option is to compile all call/rets into and out of "io" code in terms of explicit jumps. A ret implemented as pop+indirect jump will be less predictable than a paired ret, but it has a better chance of being predicted than an unpaired one.
My hope is that, if stackful coroutines become more mainstream, CPU microarchitectures will start using a meta-predictor to choose between the return stack predictor and the indirect predictor.
Zig no longer has async in the language (and hasn't for quite some time). The OP implemented task switching in user-space.
See this for more details on how stackful coroutines can be made much faster:
https://photonlibos.github.io/blog/stackful-coroutine-made-f...
Yep, the frame pointer as well if you're using it. This is exactly how it's implemented in user-space in Zig's WIP std.Io branch green-threading implementation: https://github.com/ziglang/zig/blob/ce704963037fed60a30fd9d4...
On ARM64, only fp, sp and pc are explicitly restored; and on x86_64 only rbp, rsp, and rip. For everything else, the compiler is just informed that the registers will be clobbered by the call, so it can optimize allocation to avoid having to save/restore them from the stack when it can.
I see, so you're saying that GCC can be coaxed into stacking and unstacking only the relevant registers, rather than blindly doing all of them?
> buttering the cost of switches [over the whole execution time]
The switches get cheaper but the rest of the code gets slower (because it has less flexibility in register allocation) so the cost of the switches is "buttered" (i.e. smeared) over the rest of the execution time.
But I don't think this argument holds water. The surrounding code can use whatever registers it wants. In the worst case it saves and restores all of them, which is what a standard context switch does anyway. In other words, this can be better and is never worse.
https://easyperf.net/blog/2018/03/09/Store-forwarding
and, section 15.10 of https://www.agner.org/optimize/microarchitecture.pdf
Either you're context switching often enough that store forwarding helps, or you're not spending a lot of time context switching. Either way, I would expect that you aren't waiting on L1: you put the write into a queue and move on.
I've been using Zig for embedded (ARM Cortex-M4, 256KB RAM) mainly for memory safety with C interop. The explicitness around calling conventions catches ABI mismatches at compile-time instead of runtime crashes.
I actually prefer colored async (like Rust) over this approach. The "illusion of synchronous code" feels magical, but magic becomes a gotcha in larger codebases when you can't tell what's blocking and what isn't.
The only problem is that the OS implements that illusion in a way that's rather costly, allowing only a relatively small number of threads (typically, you have no more than a few thousand frequently-active OS threads), while languages, which know more about how they use the stack, can offer the same illusion in a way that scales to a higher number of concurrent operations. But there's really no more magic in how a language implements this than in how the OS implements it, and no more illusion. They are both a mostly similar implementation of the same illusion. "Blocking" is always a software abstraction over machine operations that don't actually block.
The only question is how important is it for software to distinguish the use of the same software abstraction between the OS and the language's implementation.
There are two different usecases for coroutines that may tempt implementors to address with a single implementation, but the usecases are sufficiently different to separate into two different implementations. One is the generator use case. What makes it special is that there are exactly two communicating parties, and both of their state may fit in the CPU cache. The other use case is general concurrency, primarily for IO. In that situation, a scheduler juggles a large number of user-mode threads, and because of that, there is likely a cache miss on every context switch, no matter how efficient it is. However, in the second case, almost all of the performance is due to Little's law rather than context switch time (see my explanation here: https://inside.java/2020/08/07/loom-performance/).
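To put rough numbers on the Little's law point (my own back-of-the-envelope illustration, not from the linked post): concurrency L = throughput λ × latency W. To sustain λ = 10,000 requests/s when each request spends W = 100 ms mostly waiting on I/O, you need L = 10,000 × 0.1 = 1,000 tasks in flight. Even a few context switches per request, at 1 µs each, change W by well under 0.01%, so shaving the switch from 1 µs to 100 ns barely moves the throughput you can sustain; what matters is whether you can afford to have those 1,000 (or 1,000,000) tasks at all.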
That means that a "stackful" implementation of user-mode threads can have no significant performance penalty for the second use case (which, BTW, I think has much more value than the first), even though a more performant implementation is possible for the first use case. In Java we decided to tackle the second use case with virtual threads, and so far we've not offered something for the first (for which the demand is significantly lower).
What happens in languages that choose to tackle both use cases with the same construct is that they gain negligible performance in the second use case (at best), but they're paying for that negligible benefit with a substantial degradation in user experience. That's just a bad tradeoff, but some languages (especially low-level ones) may have little choice, because their stackful solution does carry a significant performance cost compared to Java because of Java's very efficient heap memory management.
You don't have to color your function based on whether you're supposed to use it in an async or sync manner. But it will essentially be colored based on whether it does I/O or not (the function takes the IO interface as an argument). Which is actually important information to "color" a function with.
Whether you're doing async or sync I/O will be colored at the place where you call an IO function. Which IMO is the correct way to do it. If you call with "async" it's nonblocking, if you call without it, it's blocking. Very explicit, but not in a way that forces you to write a blocking and async version of all IO functions.
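For illustration, here's a rough sketch of what that call-site distinction looks like. The fetch function is invented, and the io.async / future.await spelling follows the in-progress std.Io proposal, so details may well change before 0.16:

    // Sketch only: `fetch` is a made-up function that does its I/O through `io`.
    fn fetch(io: std.Io, url: []const u8) ![]u8 {
        // ...reads and writes happen through `io`...
    }

    // Blocking at this call site: just call it.
    const a = try fetch(io, "https://example.com/a");

    // Non-blocking at this call site: start it, keep working, await later.
    var b_future = io.async(fetch, .{ io, "https://example.com/b" });
    const b = try b_future.await(io);

Same function, same signature; whether it blocks is decided where it's called.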
The Zio readme says it will be an implementation of the Zig IO interface when it's released.
I guess you can then choose if you want explicit async (use Zig stdlib IO functions) or implicit async (Zio), and I suppose you can mix them.
> Stackful coroutines make sense when you have the RAM for it.
So I've been thinking a bit about this. Why should stackful coroutines require more RAM? Partly because when you set up the coroutine you don't know how big the stack needs to be, right? So you need to use a safe upper bound. While stackless will only set up the memory you need to yield the coroutine. But Zig has a goal of having a built-in to calculate the required stack size for calling a function. Something it should be able to do (when you don't have recursion and don't call external C code), since Zig compiles everything in one compilation unit.
Zig devs are working on stackless coroutines as well. But I wonder if some of the benefits go away if you can allocate exactly the amount of stack a stackful coroutine needs to run and nothing more.
const n = try reader.interface.readVec(&data);
Can you guess whether it's going to do a blocking or non-blocking I/O read? The io parameter is not really "coloring", as defined by the async/await debate, because you can have code that is completely unaware of any async I/O, pass it std.Io.Reader, and it will just work, blocking or non-blocking, it makes no difference. Heck, you can even wrap this into C callbacks and use something like hiredis with async I/O.
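For what it's worth, a minimal sketch (readChunk is my own invented helper; the readVec call mirrors the snippet above) of a function that neither knows nor cares what is driving the reads:

    const std = @import("std");

    // This helper is "colorless": it behaves the same whether the Reader is
    // backed by a blocking file descriptor or by an async event loop.
    fn readChunk(reader: *std.Io.Reader, buf: []u8) !usize {
        var vecs = [_][]u8{buf};
        return try reader.readVec(&vecs);
    }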
Stackful coroutines need more memory, because you need to pre-allocate a stack large enough for the entire lifetime. With stackless coroutines, you only need the current state, but with the disadvantage that you need frequent allocations.
This is not quite correct -- a stackful coroutine can start with a small stack and grow it dynamically, whereas stackless coroutines allocate the entire state machine up front.
The reason why stackful coroutines typically use more memory is that the task's stack must be large enough to hold both persistent state (like local variables that are needed across await points) and ephemeral state (like local variables that don't live across await points, and stack frames of leaf functions that never suspend). With a stackless implementation, the per-task storage only holds persistent state, and the OS thread's stack is available as scratch space for the current task's ephemeral state.
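A contrived sketch of that distinction (invented function, assuming the 0.15-era std.Io.Writer API; the "may yield" comment marks where a green-threaded runtime could suspend):

    const std = @import("std");

    fn sendAll(w: *std.Io.Writer, chunks: []const []const u8) !usize {
        var total: usize = 0; // persistent: still needed after every potential suspension
        for (chunks) |chunk| {
            // ephemeral: fully consumed before the write below can suspend
            var digest: u32 = 0;
            for (chunk) |byte| digest +%= byte;
            std.log.debug("chunk digest {x}", .{digest});

            try w.writeAll(chunk); // a green thread may yield inside here
            total += chunk.len;
        }
        return total;
    }

A stackless transform only has to keep `total` and the loop position in the per-task frame; a stackful coroutine keeps the whole stack, scratch locals included, alive for the task's lifetime.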
The Rust folks are working on a general effect system, including potentially an 'IO' effect. Being able to abstract out the difference between 'sync' and 'async' code is a key motivation of this.
Isn't that exactly why they're making IO explicit in functions? So you can trace it up the call chain.
I had that very impression in early 2020 after some months of Zigging (and being burned by constant breaking changes), and left, deciding "I'll check it out again in a few years."
I had some intuition it might be one of these forever-refactoring eternal-tinker-and-rewrite fests and here I am 5 years later, still lurking for that 1.0 from the sidelines, while staying in Go or C depending on the nature of the thing at hand.
That's not to say it'll never get there; it's a vibrant project that prioritizes making the best design decisions over merely Shipping Asap. For a C replacement, that's the right spirit, in principle. But whether there's inbuilt immunity against engineers falling prey to that forever-refine-and-resculpt urge, I can't tell. I find it a great project to wait for leisurely (=
it's completely feasible to stick to something that works for you, and only update/port/rewrite when it makes sense.
what matters is the overall cost.
I would expect fixing an application to an older version would be just fine, so long as you don't need newer language features. If newer language features are a requirement, I would expect that would drive refactoring or selecting a different implementation language entirely if refactoring would prove to be too onerous.
All of these projects are great, but we can't ignore that Zig hasn't yet entered a phase where stable API compatibility can be guaranteed.
I think it speaks volumes that these projects chose to use it, and speak very highly of it, despite the fact that it's still pre 1.0.
Seems like you're missing a lot of the comments in this tree.
How about you go to the Zig GitHub and check on the language's progress?
It's literally there: it's still in beta and not fit for production, let alone backed by a mature ecosystem.
> Additionally, when Zig 0.16 is released with the std.Io interface, I will implement that as well, allowing you to use the entire standard library with this runtime.
Unrelated to this library, I plan to do lots of IO with Zig and will wait for 0.16. Your intuition may decide otherwise and that’s ok.
In other words, the only reason not to use Zig is if you detest upgrading or improving your code. Code you write today will still work tomorrow. Code you write tomorrow will likely have a new Io interface, because you want to use that standard abstraction. But if you don't want to use it, all your existing code will still work.
Just like today, if you want to alloc, but don't want to pass an `Allocator` you can call std.heap.page_allocator.alloc from anywhere. But because that abstraction is so useful, and zig supports it so ergonomically, everyone writes code that provides that improved API
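For what it's worth, a small compilable illustration of that allocator point (standard library only; names are my own):

    const std = @import("std");

    // The idiomatic way: the caller decides which allocator to use.
    fn makeGreeting(allocator: std.mem.Allocator, name: []const u8) ![]u8 {
        return std.fmt.allocPrint(allocator, "hello, {s}!", .{name});
    }

    pub fn main() !void {
        const gpa = std.heap.page_allocator;

        const msg = try makeGreeting(gpa, "zig");
        defer gpa.free(msg);

        // But nothing stops you from grabbing a global allocator anywhere,
        // just like you'll still be able to do blocking I/O without threading
        // an Io parameter through everything.
        const raw = try std.heap.page_allocator.alloc(u8, 32);
        defer std.heap.page_allocator.free(raw);
    }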
Side note: I was worried about upgrading all my code to interface with the new Reader/Writer API that's already mostly stable in 0.15.2, but even though I had to add a few lines in many existing projects to upgrade, I find myself optionally choosing to refactor a lot of functions because the new API results in code that is SO much better. Both in readability and in performance. Do I have to refactor? No, the old API works flawlessly, but the new API is simply more ergonomic, more performant, and easier to read and reason about. I'm doing it because I want to, not because I have to.
Everyone knows a red diff is the best diff, and the new std.Io API exposes an easier way to do things. Still, like everything in Zig, it allows you to write the code that you want to write. If you want to do it yourself, that's fully supported too!
Haha, no! Zig makes breaking changes in the stdlib in every release. I can guarantee you won't be able to update a non-trivial project between any of the latest 10 versions and beyond without changing your code, often substantially, and the next release is changing pretty much all code doing any kind of IO. I know because I keep track of that in a project and can see diffs between each of the latest versions. This allows me to modify other code much more easily.
But TBH, in 0.15 only zig build broke, IIRC. Then again, I believe I just didn't happen to use some of the things that changed.
I idle on IRC a lot, and try to help out with questions. From that view, this is the experience of over 90% of users: minor namespace changes, or calling a function with a differently named option. `.root_source_file` became `.root_source_module` (and required an additional function call).
Complex changes are almost never required, and IMO shouldn't be expected by most people using Zig. Those who might need to make them already know they're coming, because they're already paying attention to the language as a prerequisite for writing such complex code. (Complex here meaning code that depends on the more esoteric Zig stdlib internals.)
[1]: https://github.com/ziglang/zig/blob/init-std.Io/lib/std/fs.z...
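From memory, this is roughly the shape of the build.zig change being described; field names are approximate and worth checking against your actual release:

    // before (older releases):
    const exe = b.addExecutable(.{
        .name = "app",
        .root_source_file = b.path("src/main.zig"),
        .target = target,
        .optimize = optimize,
    });

    // after: the same information goes through an explicit module,
    // i.e. one extra function call.
    const exe = b.addExecutable(.{
        .name = "app",
        .root_module = b.createModule(.{
            .root_source_file = b.path("src/main.zig"),
            .target = target,
            .optimize = optimize,
        }),
    });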
> Do I have to refactor? No, the old API works flawlessly
The old API was deleted though? If you're saying it's possible to copy/paste the old stdlib into your project and maintain the old abstractions forward through the ongoing language changes, sure that's possible, but I don't think many people will want to fork std. I copy/pasted some stuff temporarily to make the 0.15 migration easier, but maintaining it forever would be swimming upstream for no reason.
uhhh.... huh? you and I must be using very different definitions for the word most.
> The old API was deleted though?
To be completely fair, you're correct: the old deprecated writer that was still available in 0.15 (https://ziglang.org/documentation/0.15.2/std/#std.Io.Depreca...) has been removed; the master branch doesn't provide it anymore.
edit: lmao, your profile's about text is hilarious, I appreciate the laugh!
You're of course correct here, but I thought it was reasonable to omit changes that I would describe as namespace changes. Considering the audience, I now regret doing so. (It now does require the Io object as well, so "namespace change" isn't really the right description here.)
Because you explicitly said that existing code would continue to work without `std.Io`.
> Code you write tomorrow, will likely have a new Io interface, because you want to use that standard abstraction. But, if you don't want to use it, all your existing code will still work.
I like Zig, but it does not have a stable API. That's just how it is.
Because I'm not conflating passing an Io object (which everyone expects to be mandatory) with existing APIs moving into the Io namespace (an API change that can only be considered significant if you're trying to win an argument on reddit).
These are drastically different changes, and only one can be considered a meaningful change.
> I like Zig, but it does not have a stable API. That's just how it is.
The last 3 minor version upgrades required a diff in all of my projects. All of them could have been fixed exclusively with sed -i to update namespaces. None of them required real attention or logic changes.
In one repo I made the namespace changes in isolation. After some time I then went back and rewrote a few blocks to take advantage of the features, runtime speed improvements, and generally improved code quality granted from the new API.
I don't expect Zig's API to be stable, and I regret it if my comment gave you a different impression. But I stand by my comments, because I respectfully refuse to ignore the pragmatics of using Zig. Calling moving a function between namespaces a breaking API change can be argued as technically correct, but it borders on being intentionally misleading.
The Rust folks adopted async with callbacks, and they were essentially starting from scratch so had no need to do it that way, and they are smarter than I (both individually and collectively) so I'm sure they have a reason; I just don't know what it is.
Also, async at low level is literally always callbacks (even processor interrupts are callbacks)
By mucking about with the stack, you break stuff like stack unwinding for exceptions and GC, debuggers, and you probably make a bunch of assumptions you shouldn't
If you start using the compiler backend in unexpected ways, you either expose bugs or find missing functionality and find that the compiler writers made some assumptions about the code (either rightfully or not), that break when you start wildly overwriting parts of the stack.
Writing a compiler frontend is hard enough as it is, and becoming an LLVM expert is generally too much for most people.
But even if you manage to get it working, should you have your code break in either the compiler or any number of widely used external tooling, you literally can't fast track your fix, and thus you can't release your language (since it depends on a broken external dependency, fix pending whenever they feel like it).
I guess even if you are some sort of superhero who can do all this correctly, the LLVM people won't be happy merging some low-level codegen change that has the potential to break all compiled software of trillion-dollar corporations for the benefit of some small internet project.
> DO NOT USE FIBERS!
There are downsides to stackful coroutines (peak stack usage, for example), but I feel that P1364 was attacking a strawman: first of all, it compares a solution with built-in compiler support against a pure library implementation; second, it doesn't even compare against the reference implementation of the competing proposal.
As an aside, I know Rust would be unlikely to implement segmented stacks for fibers, given that they were burned by the performance implications thereof previously.
For C++.
If your language has RAII or exceptions, it raises crazy questions: if thread A is hosting fiber 1, which throws an exception that propagates outside of the fiber invocation scope and destroys a bunch of objects, and we then switch to fiber 2, fiber 2 sees the world in an inconsistent state (outside resources have been cleaned up, inside ones are still alive).
This was literally impossible in pre-fiber code, so most existing code would probably not handle it well.
On an OS with overcommit, you might also only pay for what you use (at a page granularity), but this may be defeated if the stack gets cleared (or initialized to a canary value) by the runtime.
This is a gotcha of using stack allocation in general, but exacerbated in this case by the fact that you have an incentive to keep the stacks as small as possible when you want many concurrent tasks. So you either end up solving the puzzle of how big exactly the stack needs to be, you undershoot and overflow with possibly disastrous effects (especially if your stack happens to overflow into memory that doesn't cause an access violation) or you overshoot and waste memory. Better yet, you may have calculated and optimized your stack size for your platform and then the code ends up doing UB on a different platform with fewer registers, bigger `c_long`s or different alignment constraints.
If something like https://github.com/ziglang/zig/issues/157 actually gets implemented I will be happier about this approach.
AFAIK Go solves this by keeping track of these pointer locations and adjusting them when reallocating the stack. Aside from the run-time cost this incurs, it's unsuitable for Zig because Zig can't strictly know whether values represent pointers.
Go technically has this problem as well, if you for example convert a pointer to a uintptr, but it makes no guarantee that a former pointer will still be valid when converted back. Such conversions are also rarely warranted and are made explicit via the `unsafe` package.
Zig is more like C in that it gives the programmer rather than a memory management runtime exclusive control and free rein over the memory. If there are some bits in memory that happen to have the same size as a pointer, Zig sees no reason to stop you from interpreting them as such. This is very powerful, but precludes abstractions like Go's run-time stack reallocation.
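For example (a deliberately unsafe-looking but perfectly legal round trip; nothing here is specific to coroutines):

    const std = @import("std");

    test "integer bits can be reinterpreted as a pointer" {
        var x: u32 = 42;
        // Stash the pointer as plain integer bits...
        const bits: usize = @intFromPtr(&x);
        // ...and later reinterpret those bits as a pointer again.
        const p: *u32 = @ptrFromInt(bits);
        p.* += 1;
        try std.testing.expectEqual(@as(u32, 43), x);
    }

A relocating runtime can't tell that `bits` is "really" a pointer, which is exactly why Go only tolerates this through unsafe and why Zig can't do precise stack relocation at all.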
Previous versions of Go used segmented stacks, which are theoretically possible, if Zig really wanted (would need compiler support), but they have nasty performance side-effects, see https://www.youtube.com/watch?v=-K11rY57K7k
Or are you making a point about virtual memory? If so, that assumption seems highly platform dependent.
Rust's async is not based on callbacks, it's based on polling. So really there are three ways to implement async:
- The callback approach used by e.g. Node.js and Swift, where a function that may suspend accepts a callback as an argument, and invokes the callback once it is ready to make progress. The compiler transforms async/await code into continuation-passing style.
- The stackful approach used by e.g. Go, libtask, and this; where a runtime switches between green threads when a task is ready to make progress. Simple and easy to implement, but introduces complexity around stack size.
- Rust's polling approach: an async task is statically transformed into a state machine object that is polled by a runtime when it's ready to make progress.
Each approach has its advantages and disadvantages. Continuation-passing style doesn't require a runtime to manage tasks, but each call site must capture local variables into a closure, which tends to require a lot of heap allocation and copying (you could also use Rust's generic closures, but that would massively bloat code size and compile times because every suspending function must be specialized for each call site). So it's not really acceptable for applications looking for maximum performance and control over allocations.
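To make the closure-capture point concrete, here is roughly what continuation-passing looks like if you spell it out by hand in Zig terms (invented names; real callback runtimes generate or hide this plumbing):

    const std = @import("std");

    // Everything the caller still needs after the "await" has to be packed
    // into a context object that outlives the original stack frame.
    const Ctx = struct {
        attempts: u32,
        on_done: *const fn (*Ctx, []const u8) void,
    };

    // Instead of returning the line, readLine hands it to a continuation,
    // whenever the event loop decides the data is ready (immediately, here).
    fn readLine(ctx: *Ctx) void {
        ctx.on_done(ctx, "hello");
    }

    fn handleLine(ctx: *Ctx, line: []const u8) void {
        _ = line;
        ctx.attempts += 1;
    }

    test "manual continuation-passing" {
        var ctx: Ctx = .{ .attempts = 0, .on_done = &handleLine };
        readLine(&ctx);
        try std.testing.expectEqual(@as(u32, 1), ctx.attempts);
    }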
Stackful coroutines require managing stacks. Allocating large stacks is very expensive in terms of performance and memory usage; it won't scale to thousands or millions of tasks and largely negates the benefits of green threading. Allocating small stacks means you need the ability to dynamically resize stacks at runtime, which requires dynamic allocation and adds significant performance and complexity overhead if you want to make an FFI call from an asynchronous task (in Go, every function begins with a prologue to check if there is enough stack space and allocate more if needed; since foreign functions do not have this prologue, an FFI call requires switching to a sufficiently large stack). This project uses fixed-size task stacks, customizable per-task but defaulting to 256K [1]. This default is several orders of magnitude larger than a typical task size in other green-threading runtimes, so to achieve large scale the programmer must manually manage the stack size on a per-task basis, and face stack overflows if they guess wrong (potentially only in rare/edge cases).
Rust's "stackless" polling-based approach means the compiler knows statically exactly how much persistent storage a suspended task needs, so the application or runtime can allocate this storage up-front and never need to resize it; while a running task has a full OS thread stack available as scratch space and for FFI. It doesn't require dynamic memory allocation, but it imposes limits on things like recursion. Rust initially had stackful coroutines, but this was dropped in order to not require dynamic allocation and remove the FFI overhead.
The async support in Zig's standard library, once it's complete, is supposed to let the application developer choose between stackful and stackless coroutines depending on the needs of the application.
[1]: https://github.com/lalinsky/zio/blob/9e2153eed99a772225de9b2...
The history of this concurrency model is here: https://seh.dev/go-legacy/
At some level it's always callbacks. Then people build frameworks on top of these so programmers can pretend they're not dealing with callbacks.
https://linux.die.net/man/3/setsockopt
Zig has a posix API layer.
What I envision is something like `asyncio.timeout` in Python, where you start a timeout and let the code run as usual. If it's in I/O sleep when the timeout fires, it will get woken up and the operation gets canceled.
I see something like this:
var timeout: zio.Timeout = .init;
defer timeout.cancel(rt);
timeout.set(rt, 10);
const n = try reader.interface.readVec(&data);

https://www.withsecure.com/en/solutions/innovative-security-...
https://www.ptc.com/en/products/developer-tools/perc
Note the
> This video illustrates the use case of Perc within the Aegis Combat System, a digital command and control system capable of identifying and tracking incoming threats and providing the war fighter with a solution to address threats. Aegis, developed by Lockheed Martin, is critical to the operation of the DDG-51, and Lockheed Martin has selected Perc as the operating platform for Aegis to address real-time requirements and response times.
Not all GCs are born alike.
The thing that actually convinced me to learn Rust was a program I wanted to use less memory: my initial Clojure version, compiled with GraalVM, hovered around 100 megs. When I rewrote it in Rust, it hovered around 500kb.
It’s not completely apples to apples, and the laptop running this code has a ton of RAM anyway, but it’s still kind of awesome to see a 200x reduction in memory without significantly more complicated code.
A lot of the stuff I have to do in Rust for GC-less memory safety ends up being stuff I would have to do anyway in a GC’d language, e.g. making sure that one thread owns the memory after it has been transferred over a channel.
True. However in the bounded-time GC space few projects share the same definitions of low-latency or real-time. So you have to find a language that meets all of your other desiderata and provides a GC that meets your timing requirements. Perc looks interesting, Metronome made similar promises about sub-ms latency. But I'd have to get over my JVM runtime phobia.
DoD uses languages like Java in applications where raw throughput and low-latency is not critical to success. A lot of what AEGIS does is not particularly performance sensitive.
1: Yes, pre-1.0 Rust had a garbage collector.
It is an ABI change though, so you need to recompile the whole stack (there might be the ability for segmented code to call non segmented code, but I don't remember the extent of the support) and it is probably half deprecated now. But it works and it doesn't need GC.
The Fortran, Modula-2, and ALGOL 68 frontends are getting much more development work than gccgo, which is stuck on pre-generics Go (version 1.18, from 2022); no one is working on it other than minor bug fixes.
There are new Qt bindings for these. Go has https://github.com/mappu/miqt and Zig has https://github.com/rcalixte/libqt6zig. I wonder if the author knew about them. I don't know enough about either language to speak on the async parts.
For me, I want these for Rust, especially what Zig has because I use KDE. I know about https://github.com/KDAB/cxx-qt and it is the only maintained effort for Rust that is left standing after all these years. But I don't want QML. I definitely don't want C++ or CMake. I just want Rust and Cargo.