If you’re counting that low, then you need to count carefully.
A coroutine switch, however well implemented, inevitably breaks the branch predictor’s idea of your return stack, but the effect of mispredicted returns will be smeared over the target coroutine’s execution rather than concentrated at the point of the switch. (Similar issues exist with e.g. measuring the effect of blowing the cache on a CPU migration.) I’m actually not sure if Zig’s async design even uses hardware call/return pairs when a (monomorphized-as-)async function calls another one, or if every return just gets translated to an indirect jump. (This option affords what I think is a cleaner design for coroutines with compact frames, but it is much less friendly to the CPU.)
So a foolproof benchmark would require one to compare the total execution time of a (compute-bound) program that constantly switches between (say) two tasks to that of an equivalent program that not only does not switch but (given what little I know about Zig’s “colorless” async) does not run under an async executor(?) at all. Those tasks would also need to yield on a non-trivial call stack each time. Seems quite tricky all in all.
Also, if you control the compiler, an option is to compile all call/rets into and out of "io" code in terms of explicit jumps. A ret implemented as pop+indirect jump will be less predictable than a paired ret, but it has a better chance of being predicted than an unpaired one.
My hope is that, if stackful coroutines become more mainstream, CPU microarchitectures will start using a meta-predictor to choose between the return stack predictor and the indirect predictor.
Zig no longer has async in the language (and hasn't for quite some time). The OP implemented task switching in user-space.
See this for more details on how stackful coroutines can be made much faster:
https://photonlibos.github.io/blog/stackful-coroutine-made-f...
Yep, the frame pointer as well if you're using it. This is exactly how it's implemented in user-space in Zig's WIP std.Io branch green-threading implementation: https://github.com/ziglang/zig/blob/ce704963037fed60a30fd9d4...
On ARM64, only fp, sp and pc are explicitly restored; and on x86_64 only rbp, rsp, and rip. For everything else, the compiler is just informed that the registers will be clobbered by the call, so it can optimize allocation to avoid having to save/restore them from the stack when it can.
I see, so you're saying that GCC can be coaxed into stacking and unstacking only the relevant registers, rather than blindly doing all of them?
> buttering the cost of switches [over the whole execution time]
The switches get cheaper but the rest of the code gets slower (because it has less flexibility in register allocation) so the cost of the switches is "buttered" (i.e. smeared) over the rest of the execution time.
But I don't think this argument holds water. The surrounding code can use whatever registers it wants. In the worst case it saves and restores all of them, which is what a standard context switch does anyway. In other words, this can be better and is never worse.
https://easyperf.net/blog/2018/03/09/Store-forwarding
and, section 15.10 of https://www.agner.org/optimize/microarchitecture.pdf
Either you're context switching often enough that store forwarding helps, or you're not spending a lot of time context switching. Either way, I would expect that you aren't waiting on L1: you put the write into a queue and move on.
I've been using Zig for embedded (ARM Cortex-M4, 256KB RAM) mainly for memory safety with C interop. The explicitness around calling conventions catches ABI mismatches at compile-time instead of runtime crashes.
I actually prefer colored async (like Rust) over this approach. The "illusion of synchronous code" feels magical, but magic becomes a gotcha in larger codebases when you can't tell what's blocking and what isn't.
The only problem is that the OS implements that illusion in a way that's rather costly, allowing only a relatively small number of threads (typically, you have no more than a few thousand frequently-active OS threads), while languages, which know more about how they use the stack, can offer the same illusion in a way that scales to a higher number of concurrent operations. But there's really no more magic in how a language implements this than in how the OS implements it, and no more illusion. They are both a mostly similar implementation of the same illusion. "Blocking" is always a software abstraction over machine operations that don't actually block.
The only question is how important is it for software to distinguish the use of the same software abstraction between the OS and the language's implementation.
There are two different usecases for coroutines that may tempt implementors to address with a single implementation, but the usecases are sufficiently different to separate into two different implementations. One is the generator use case. What makes it special is that there are exactly two communicating parties, and both of their state may fit in the CPU cache. The other use case is general concurrency, primarily for IO. In that situation, a scheduler juggles a large number of user-mode threads, and because of that, there is likely a cache miss on every context switch, no matter how efficient it is. However, in the second case, almost all of the performance is due to Little's law rather than context switch time (see my explanation here: https://inside.java/2020/08/07/loom-performance/).
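To put rough numbers on the Little's law point (my own back-of-the-envelope illustration, not from the linked post): concurrency L = throughput λ × latency W. To sustain λ = 10,000 requests/s when each request spends W = 100 ms mostly waiting on I/O, you need L = 10,000 × 0.1 = 1,000 tasks in flight. Even a few context switches per request, at 1 µs each, change W by well under 0.01%, so shaving the switch from 1 µs to 100 ns barely moves the throughput you can sustain; what matters is whether you can afford to have those 1,000 (or 1,000,000) tasks at all.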
That means that a "stackful" implementation of user-mode threads can have no significant performance penalty for the second use case (which, BTW, I think has much more value than the first), even though a more performant implementation is possible for the first use case. In Java we decided to tackle the second use case with virtual threads, and so far we've not offered something for the first (for which the demand is significantly lower).
What happens in languages that choose to tackle both use cases with the same construct is that they gain negligible performance in the second use case (at best), but they're paying for that negligible benefit with a substantial degradation in user experience. That's just a bad tradeoff, but some languages (especially low-level ones) may have little choice, because their stackful solution does carry a significant performance cost compared to Java because of Java's very efficient heap memory management.
You don't have to color your function based on whether you're supposed to use it in an async or sync manner. But it will essentially be colored based on whether it does I/O or not (the function takes the IO interface as an argument). Which is actually important information to "color" a function with.
Whether you're doing async or sync I/O will be colored at the place where you call an IO function. Which IMO is the correct way to do it. If you call with "async" it's nonblocking, if you call without it, it's blocking. Very explicit, but not in a way that forces you to write a blocking and async version of all IO functions.
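For illustration, here's a rough sketch of what that call-site distinction looks like. The fetch function is invented, and the io.async / future.await spelling follows the in-progress std.Io proposal, so details may well change before 0.16:

    // Sketch only: `fetch` is a made-up function that does its I/O through `io`.
    fn fetch(io: std.Io, url: []const u8) ![]u8 {
        // ...reads and writes happen through `io`...
    }

    // Blocking at this call site: just call it.
    const a = try fetch(io, "https://example.com/a");

    // Non-blocking at this call site: start it, keep working, await later.
    var b_future = io.async(fetch, .{ io, "https://example.com/b" });
    const b = try b_future.await(io);

Same function, same signature; whether it blocks is decided where it's called.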
The Zio readme says it will be an implementation of the Zig IO interface when it's released.
I guess you can then choose if you want explicit async (use Zig stdlib IO functions) or implicit async (Zio), and I suppose you can mix them.
> Stackful coroutines make sense when you have the RAM for it.
So I've been thinking a bit about this. Why should stackful coroutines require more RAM? Partly because when you set up the coroutine you don't know how big the stack needs to be, right? So you need to use a safe upper bound. While stackless will only set up the memory you need to yield the coroutine. But Zig has a goal of having a built-in to calculate the required stack size for calling a function. Something it should be able to do (when you don't have recursion and don't call external C code), since Zig compiles everything in one compilation unit.
Zig devs are working on stackless coroutines as well. But I wonder if some of the benefits go away if you can allocate exactly the amount of stack a stackful coroutine needs to run and nothing more.
const n = try reader.interface.readVec(&data);
Can you guess whether it's going to do a blocking or non-blocking I/O read? The io parameter is not really "coloring", as defined by the async/await debate, because you can have code that is completely unaware of any async I/O, pass it std.Io.Reader, and it will just work, blocking or non-blocking, it makes no difference. Heck, you can even wrap this into C callbacks and use something like hiredis with async I/O.
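For what it's worth, a minimal sketch (readChunk is my own invented helper; the readVec call mirrors the snippet above) of a function that neither knows nor cares what is driving the reads:

    const std = @import("std");

    // This helper is "colorless": it behaves the same whether the Reader is
    // backed by a blocking file descriptor or by an async event loop.
    fn readChunk(reader: *std.Io.Reader, buf: []u8) !usize {
        var vecs = [_][]u8{buf};
        return try reader.readVec(&vecs);
    }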
Stackful coroutines need more memory, because you need to pre-allocate a stack large enough for the entire lifetime. With stackless coroutines, you only need the current state, but with the disadvantage that you need frequent allocations.
This is not quite correct -- a stackful coroutine can start with a small stack and grow it dynamically, whereas stackless coroutines allocate the entire state machine up front.
The reason why stackful coroutines typically use more memory is that the task's stack must be large enough to hold both persistent state (like local variables that are needed across await points) and ephemeral state (like local variables that don't live across await points, and stack frames of leaf functions that never suspend). With a stackless implementation, the per-task storage only holds persistent state, and the OS thread's stack is available as scratch space for the current task's ephemeral state.
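A contrived sketch of that distinction (invented function, assuming the 0.15-era std.Io.Writer API; the "may yield" comment marks where a green-threaded runtime could suspend):

    const std = @import("std");

    fn sendAll(w: *std.Io.Writer, chunks: []const []const u8) !usize {
        var total: usize = 0; // persistent: still needed after every potential suspension
        for (chunks) |chunk| {
            // ephemeral: fully consumed before the write below can suspend
            var digest: u32 = 0;
            for (chunk) |byte| digest +%= byte;
            std.log.debug("chunk digest {x}", .{digest});

            try w.writeAll(chunk); // a green thread may yield inside here
            total += chunk.len;
        }
        return total;
    }

A stackless transform only has to keep `total` and the loop position in the per-task frame; a stackful coroutine keeps the whole stack, scratch locals included, alive for the task's lifetime.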
The Rust folks are working on a general effect system, including potentially an 'IO' effect. Being able to abstract out the difference between 'sync' and 'async' code is a key motivation of this.
Isn't that exactly why they're making IO explicit in functions? So you can trace it up the call chain.
I had that very impression in early 2020 after some months of Zigging (and being burned by constant breaking changes), and left, deciding "I'll check it out again in a few years."
I had some intuition it might be one of these forever-refactoring eternal-tinker-and-rewrite fests and here I am 5 years later, still lurking for that 1.0 from the sidelines, while staying in Go or C depending on the nature of the thing at hand.
That's not to say it'll never get there; it's a vibrant project that prioritizes making the best design decisions over merely Shipping Asap. For a C replacement, that's the right spirit, in principle. But whether there's inbuilt immunity against engineers falling prey to that forever-refine-and-resculpt urge, I can't tell. I find it a great project to wait for leisurely (=
it's completely feasible to stick to something that works for you, and only update/port/rewrite when it makes sense.
what matters is the overall cost.
I would expect fixing an application to an older version would be just fine, so long as you don't need newer language features. If newer language features are a requirement, I would expect that would drive refactoring or selecting a different implementation language entirely if refactoring would prove to be too onerous.
All of these projects are great, but we can't ignore that Zig hasn't yet entered a phase where stable API compatibility can be guaranteed.
I think it speaks volumes that these projects chose to use it, and speak very highly of it, despite the fact that it's still pre 1.0.
Seems like you're missing a lot of the comments in this tree.
How about you go to the Zig GitHub and check on the language's progress?
It's literally there: it's still in beta and not fit for production, let alone backed by a mature ecosystem.
> Additionally, when Zig 0.16 is released with the std.Io interface, I will implement that as well, allowing you to use the entire standard library with this runtime.
Unrelated to this library, I plan to do lots of IO with Zig and will wait for 0.16. Your intuition may decide otherwise and that’s ok.
In other words, the only reason not to use Zig is if you detest upgrading or improving your code. Code you write today will still work tomorrow. Code you write tomorrow will likely have a new Io interface, because you want to use that standard abstraction. But if you don't want to use it, all your existing code will still work.
Just like today, if you want to alloc, but don't want to pass an `Allocator` you can call std.heap.page_allocator.alloc from anywhere. But because that abstraction is so useful, and zig supports it so ergonomically, everyone writes code that provides that improved API
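For what it's worth, a small compilable illustration of that allocator point (standard library only; names are my own):

    const std = @import("std");

    // The idiomatic way: the caller decides which allocator to use.
    fn makeGreeting(allocator: std.mem.Allocator, name: []const u8) ![]u8 {
        return std.fmt.allocPrint(allocator, "hello, {s}!", .{name});
    }

    pub fn main() !void {
        const gpa = std.heap.page_allocator;

        const msg = try makeGreeting(gpa, "zig");
        defer gpa.free(msg);

        // But nothing stops you from grabbing a global allocator anywhere,
        // just like you'll still be able to do blocking I/O without threading
        // an Io parameter through everything.
        const raw = try std.heap.page_allocator.alloc(u8, 32);
        defer std.heap.page_allocator.free(raw);
    }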
Side note: I was worried about upgrading all my code to interface with the new Reader/Writer API that's already mostly stable in 0.15.2, but even though I had to add a few lines in many existing projects to upgrade, I find myself optionally choosing to refactor a lot of functions because the new API results in code that is SO much better. Both in readability and in performance. Do I have to refactor? No, the old API works flawlessly, but the new API is simply more ergonomic, more performant, and easier to read and reason about. I'm doing it because I want to, not because I have to.
Everyone knows a red diff is the best diff, and the new std.Io API exposes an easier way to do things. Still, like everything in Zig, it allows you to write the code that you want to write. If you want to do it yourself, that's fully supported too!
Haha, no! Zig makes breaking changes in the stdlib in every release. I can guarantee you won't be able to update a non-trivial project between any of the latest 10 versions and beyond without changing your code, often substantially, and the next release is changing pretty much all code doing any kind of IO. I know because I keep track of that in a project and can see diffs between each of the latest versions. This allows me to modify other code much more easily.
But TBH, in 0.15 only zig build broke, IIRC. Then again, I believe I just didn't happen to use some of the things that changed.
I idle on IRC a lot, and try to help out with questions. From that view, this is the experience of over 90% of users: minor namespace changes, or calling a function with a differently named option. `.root_source_file` became `.root_source_module` (and required an additional function call).
Complex changes are almost never required, and IMO shouldn't be expected by most people using Zig. Those who might need to make them already know they're coming, because they're already paying attention to the language as a prerequisite for writing such complex code. (Complex here meaning code that depends on the more esoteric Zig stdlib internals.)
[1]: https://github.com/ziglang/zig/blob/init-std.Io/lib/std/fs.z...
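From memory, this is roughly the shape of the build.zig change being described; field names are approximate and worth checking against your actual release:

    // before (older releases):
    const exe = b.addExecutable(.{
        .name = "app",
        .root_source_file = b.path("src/main.zig"),
        .target = target,
        .optimize = optimize,
    });

    // after: the same information goes through an explicit module,
    // i.e. one extra function call.
    const exe = b.addExecutable(.{
        .name = "app",
        .root_module = b.createModule(.{
            .root_source_file = b.path("src/main.zig"),
            .target = target,
            .optimize = optimize,
        }),
    });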
> Do I have to refactor? No, the old API works flawlessly
The old API was deleted though? If you're saying it's possible to copy/paste the old stdlib into your project and maintain the old abstractions forward through the ongoing language changes, sure that's possible, but I don't think many people will want to fork std. I copy/pasted some stuff temporarily to make the 0.15 migration easier, but maintaining it forever would be swimming upstream for no reason.
uhhh.... huh? you and I must be using very different definitions for the word most.
> The old API was deleted though?
To be completely fair, you're correct: the old deprecated writer that was still available in 0.15 (https://ziglang.org/documentation/0.15.2/std/#std.Io.Depreca...) has been removed; the master branch doesn't provide it anymore.
edit: lmao, your profile's about text is hilarious, I appreciate the laugh!
You're of course correct here, but I thought it was reasonable to omit changes that I would describe as namespace changes. Considering the audience, I now regret doing so. (It now does require the Io object as well, so "namespace change" isn't really the right description here.)
Because you explicitly said that existing code would continue to work without `std.Io`.
> Code you write tomorrow, will likely have a new Io interface, because you want to use that standard abstraction. But, if you don't want to use it, all your existing code will still work.
I like Zig, but it does not have a stable API. That's just how it is.
Because I'm not conflating passing an Io object (which everyone expects to be mandatory) with existing APIs moving into the Io namespace (an API change that can only be considered significant if you're trying to win an argument on reddit).
These are drastically different changes, and only one can be considered a meaningful change.
> I like Zig, but it does not have a stable API. That's just how it is.
The last 3 minor version upgrades required a diff in all of my projects. All of them could have been fixed exclusively with sed -i to update namespaces. None of them required real attention or logic changes.
In one repo I made the namespace changes in isolation. After some time I then went back and rewrote a few blocks to take advantage of the features, runtime speed improvements, and generally improved code quality granted from the new API.
I don't expect Zig's API to be stable, and I regret it if my comment gave you a different impression. But I stand by my comments, because I respectfully refuse to ignore the pragmatics of using Zig. Calling moving a function between namespaces a breaking API change can be argued as technically correct, but it borders on being intentionally misleading.
The Rust folks adopted async with callbacks, and they were essentially starting from scratch so had no need to do it that way, and they are smarter than I (both individually and collectively) so I'm sure they have a reason; I just don't know what it is.
Also, async at low level is literally always callbacks (even processor interrupts are callbacks)
By mucking about with the stack, you break stuff like stack unwinding for exceptions and GC, debuggers, and you probably make a bunch of assumptions you shouldn't
If you start using the compiler backend in unexpected ways, you either expose bugs or find missing functionality and find that the compiler writers made some assumptions about the code (either rightfully or not), that break when you start wildly overwriting parts of the stack.
Writing a compiler frontend is hard enough as it is, and becoming an LLVM expert is generally too much for most people.
But even if you manage to get it working, should you have your code break in either the compiler or any number of widely used external tooling, you literally can't fast track your fix, and thus you can't release your language (since it depends on a broken external dependency, fix pending whenever they feel like it).
I guess even if you are some sort of superhero who can do all this correctly, the LLVM people won't be happy merging some low-level codegen change that has the potential to break all compiled software of trillion-dollar corporations for the benefit of some small internet project.
> DO NOT USE FIBERS!
There are downsides to stackful coroutines (peak stack usage, for example), but I feel that P1364 was attacking a strawman: first of all, it compares a solution with built-in compiler support against a pure library implementation; second, it doesn't even compare against the reference implementation of the competing proposal.
As an aside, I know Rust would be unlikely to implement segmented stacks for fibers, given that they were burned by the performance implications thereof previously.
For C++.
If your language has RAII or exceptions, it raises crazy questions: if thread A is hosting fiber 1, which throws an exception that propagates outside of the fiber invocation scope and destroys a bunch of objects, and we then switch to fiber 2, fiber 2 sees the world in an inconsistent state (outside resources have been cleaned up, inside ones are still alive).
This was literally impossible in pre-fiber code, so most existing code would probably not handle it well.
On an OS with overcommit, you might also only pay for what you use (at a page granularity), but this may be defeated if the stack gets cleared (or initialized to a canary value) by the runtime.
This is a gotcha of using stack allocation in general, but exacerbated in this case by the fact that you have an incentive to keep the stacks as small as possible when you want many concurrent tasks. So you either end up solving the puzzle of how big exactly the stack needs to be, you undershoot and overflow with possibly disastrous effects (especially if your stack happens to overflow into memory that doesn't cause an access violation) or you overshoot and waste memory. Better yet, you may have calculated and optimized your stack size for your platform and then the code ends up doing UB on a different platform with fewer registers, bigger `c_long`s or different alignment constraints.
If something like https://github.com/ziglang/zig/issues/157 actually gets implemented I will be happier about this approach.
AFAIK Go solves this by keeping track of these pointer locations and adjusting them when reallocating the stack. Aside from the run-time cost this incurs, it's unsuitable for Zig because Zig can't strictly know whether values represent pointers.
Go technically has this problem as well, if you for example convert a pointer to a uintptr, but it makes no guarantee that a former pointer will still be valid when converted back. Such conversions are also rarely warranted and are made explicit via the `unsafe` package.
Zig is more like C in that it gives the programmer rather than a memory management runtime exclusive control and free rein over the memory. If there are some bits in memory that happen to have the same size as a pointer, Zig sees no reason to stop you from interpreting them as such. This is very powerful, but precludes abstractions like Go's run-time stack reallocation.
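For example (a deliberately unsafe-looking but perfectly legal round trip; nothing here is specific to coroutines):

    const std = @import("std");

    test "integer bits can be reinterpreted as a pointer" {
        var x: u32 = 42;
        // Stash the pointer as plain integer bits...
        const bits: usize = @intFromPtr(&x);
        // ...and later reinterpret those bits as a pointer again.
        const p: *u32 = @ptrFromInt(bits);
        p.* += 1;
        try std.testing.expectEqual(@as(u32, 43), x);
    }

A relocating runtime can't tell that `bits` is "really" a pointer, which is exactly why Go only tolerates this through unsafe and why Zig can't do precise stack relocation at all.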
Previous versions of Go used segmented stacks, which are theoretically possible, if Zig really wanted (would need compiler support), but they have nasty performance side-effects, see https://www.youtube.com/watch?v=-K11rY57K7k
Or are you making a point about virtual memory? If so, that assumption seems highly platform dependent.
Rust's async is not based on callbacks, it's based on polling. So really there are three ways to implement async:
- The callback approach used by e.g. Node.js and Swift, where a function that may suspend accepts a callback as an argument, and invokes the callback once it is ready to make progress. The compiler transforms async/await code into continuation-passing style.
- The stackful approach used by e.g. Go, libtask, and this; where a runtime switches between green threads when a task is ready to make progress. Simple and easy to implement, but introduces complexity around stack size.
- Rust's polling approach: an async task is statically transformed into a state machine object that is polled by a runtime when it's ready to make progress.
Each approach has its advantages and disadvantages. Continuation-passing style doesn't require a runtime to manage tasks, but each call site must capture local variables into a closure, which tends to require a lot of heap allocation and copying (you could also use Rust's generic closures, but that would massively bloat code size and compile times because every suspending function must be specialized for each call site). So it's not really acceptable for applications looking for maximum performance and control over allocations.
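To make the closure-capture point concrete, here is roughly what continuation-passing looks like if you spell it out by hand in Zig terms (invented names; real callback runtimes generate or hide this plumbing):

    const std = @import("std");

    // Everything the caller still needs after the "await" has to be packed
    // into a context object that outlives the original stack frame.
    const Ctx = struct {
        attempts: u32,
        on_done: *const fn (*Ctx, []const u8) void,
    };

    // Instead of returning the line, readLine hands it to a continuation,
    // whenever the event loop decides the data is ready (immediately, here).
    fn readLine(ctx: *Ctx) void {
        ctx.on_done(ctx, "hello");
    }

    fn handleLine(ctx: *Ctx, line: []const u8) void {
        _ = line;
        ctx.attempts += 1;
    }

    test "manual continuation-passing" {
        var ctx: Ctx = .{ .attempts = 0, .on_done = &handleLine };
        readLine(&ctx);
        try std.testing.expectEqual(@as(u32, 1), ctx.attempts);
    }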
Stackful coroutines require managing stacks. Allocating large stacks is very expensive in terms of performance and memory usage; it won't scale to thousands or millions of tasks and largely negates the benefits of green threading. Allocating small stacks means you need the ability to dynamically resize stacks at runtime, which requires dynamic allocation and adds significant performance and complexity overhead if you want to make an FFI call from an asynchronous task (in Go, every function begins with a prologue to check if there is enough stack space and allocate more if needed; since foreign functions do not have this prologue, an FFI call requires switching to a sufficiently large stack). This project uses fixed-size task stacks, customizable per-task but defaulting to 256K [1]. This default is several orders of magnitude larger than a typical task size in other green-threading runtimes, so to achieve large scale the programmer must manually manage the stack size on a per-task basis, and face stack overflows if they guess wrong (potentially only in rare/edge cases).
Rust's "stackless" polling-based approach means the compiler knows statically exactly how much persistent storage a suspended task needs, so the application or runtime can allocate this storage up-front and never need to resize it; while a running task has a full OS thread stack available as scratch space and for FFI. It doesn't require dynamic memory allocation, but it imposes limits on things like recursion. Rust initially had stackful coroutines, but this was dropped in order to not require dynamic allocation and remove the FFI overhead.
The async support in Zig's standard library, once it's complete, is supposed to let the application developer choose between stackful and stackless coroutines depending on the needs of the application.
[1]: https://github.com/lalinsky/zio/blob/9e2153eed99a772225de9b2...
The history of this concurrency model is here: https://seh.dev/go-legacy/
At some level it's always callbacks. Then people build frameworks on top of these so programmers can pretend they're not dealing with callbacks.
https://linux.die.net/man/3/setsockopt
Zig has a posix API layer.
What I envision is something like `asyncio.timeout` in Python, where you start a timeout and let the code run as usual. If it's in I/O sleep when the timeout fires, it will get woken up and the operation gets canceled.
I see something like this:
var timeout: zio.Timeout = .init;
defer timeout.cancel(rt);
timeout.set(rt, 10);
const n = try reader.interface.readVec(&data);

https://www.withsecure.com/en/solutions/innovative-security-...
https://www.ptc.com/en/products/developer-tools/perc
Note the
> This video illustrates the use case of Perc within the Aegis Combat System, a digital command and control system capable of identifying and tracking incoming threats and providing the war fighter with a solution to address threats. Aegis, developed by Lockheed Martin, is critical to the operation of the DDG-51, and Lockheed Martin has selected Perc as the operating platform for Aegis to address real-time requirements and response times.
Not all GCs are born alike.
The thing that actually convinced me to learn Rust was a program I wanted to use less memory: my initial Clojure version, compiled with GraalVM, hovered around 100 megs. When I rewrote it in Rust, it hovered around 500kb.
It’s not completely apples to apples, and the laptop running this code has a ton of RAM anyway, but it’s still kind of awesome to see a 200x reduction in memory without significantly more complicated code.
A lot of the stuff I have to do in Rust for GC-less memory safety ends up being stuff I would have to do anyway in a GC’d language, e.g. making sure that one thread owns the memory after it has been transferred over a channel.
True. However in the bounded-time GC space few projects share the same definitions of low-latency or real-time. So you have to find a language that meets all of your other desiderata and provides a GC that meets your timing requirements. Perc looks interesting, Metronome made similar promises about sub-ms latency. But I'd have to get over my JVM runtime phobia.
DoD uses languages like Java in applications where raw throughput and low-latency is not critical to success. A lot of what AEGIS does is not particularly performance sensitive.
1: Yes, pre-1.0 Rust had a garbage collector.
It is an ABI change though, so you need to recompile the whole stack (there might be the ability for segmented code to call non segmented code, but I don't remember the extent of the support) and it is probably half deprecated now. But it works and it doesn't need GC.
The Fortran, Modula-2, and ALGOL 68 frontends are getting much more development work than gccgo, which is stuck on pre-generics Go (version 1.18, from 2022); no one is working on it other than minor bug fixes.
There are new Qt bindings for these. Go has https://github.com/mappu/miqt and Zig has https://github.com/rcalixte/libqt6zig. I wonder if the author knew about them. I don't know enough about either language to speak on the async parts.
For me, I want these for Rust, especially what Zig has because I use KDE. I know about https://github.com/KDAB/cxx-qt and it is the only maintained effort for Rust that is left standing after all these years. But I don't want QML. I definitely don't want C++ or CMake. I just want Rust and Cargo.