Posted by ingve 4/14/2025

Concurrency in Haskell: Fast, Simple, Correct (bitbashing.io)
193 points | 125 comments
_jackdk_ 4/17/2025|
From the footnotes:

> It gets weirder: in Haskell, exceptions can be thrown to other threads!

What's really interesting is that because of purity, you have to have asynchronous exceptions, otherwise you give up a lot of modularity. At least that's what Simon Marlow and Simon Peyton Jones argue in Asynchronous Exceptions in Haskell (2006): https://www.microsoft.com/en-us/research/wp-content/uploads/...

> While the semi-asynchronous approach avoids breaking synchronization abstractions, it is non-modular in that the target code must be written to use the signalling mechanism. Worse still (for us), the semi-asynchronous approach is simply incompatible with a purely-functional language, such as Concurrent Haskell. The problem is that polling a global flag is not a functional operation, yet in a Concurrent Haskell program, most of the time is spent in purely-functional code. On the other hand, since there is absolutely no problem with abandoning a purely-functional computation at any point, asynchronous exceptions are safe in a functional setting. In short, in a functional setting, fully-asynchronous exceptions are both necessary and safe — whereas in an imperative context fully-asynchronous exceptions are not the only solution and are unsafe.

If you can read PLTese, it's really quite a nice paper.

haskell17373 4/17/2025||
It's maybe interesting to note that the `async` library in use here is very simple and easy to understand. Nearly every function is one or two lines. Likewise `TQueue` is extremely simple (and easy to prove correct) thanks to STM, and also generally has good performance.
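
For a sense of scale, here's a minimal sketch (illustrative, not taken from the library docs) combining the two pieces: `concurrently_` from `async` runs a producer and a consumer, with a `TQueue` between them:

  import Control.Concurrent.Async (concurrently_)
  import Control.Concurrent.STM (atomically)
  import Control.Concurrent.STM.TQueue (newTQueueIO, readTQueue, writeTQueue)
  import Control.Monad (forM_, replicateM_)

  main :: IO ()
  main = do
    q <- newTQueueIO
    -- If either side throws, the other is cancelled: the guarantee async adds.
    concurrently_
      (forM_ [1 .. 10 :: Int] (atomically . writeTQueue q))
      (replicateM_ 10 (atomically (readTQueue q) >>= print))
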
zozbot234 4/17/2025||
A lot of the complexity here is just hidden in Haskell's runtime, which implements async processing based on green threads, besides other features such as GC. Though to be fair, the software transactional memory (STM) featureset is quite unique to Haskell since it relies on the availability of pure functions to ensure correctness. It's kind of hard to imagine a full equivalent to it in other well-known languages.
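
For readers who haven't seen STM, a minimal sketch of the style in question (illustrative account-transfer code): transactions compose, and `retry` asks the runtime to re-run the transaction once a `TVar` it read has changed:

  import Control.Concurrent.STM

  -- Blocks (retries) until the balance covers the amount.
  withdraw :: TVar Int -> Int -> STM ()
  withdraw account amount = do
    balance <- readTVar account
    if balance < amount then retry else writeTVar account (balance - amount)

  -- Composing two STM actions inside one `atomically` keeps the whole
  -- transfer atomic.
  transfer :: TVar Int -> TVar Int -> Int -> IO ()
  transfer from to amount = atomically $ do
    withdraw from amount
    modifyTVar' to (+ amount)
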
juliangamble 4/17/2025||
Quibble: Both Clojure and Scala have a Software Transactional Memory implementation, and the original Clojure Ant demo showed this.
kreetx 4/17/2025||
While the async library is great, everything that came before (forkIO, MVars, etc.) was already simple enough; it's only the exception handling that was missing.
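
That pre-async vocabulary in a tiny sketch (illustrative): `forkIO` spawns a green thread and an `MVar` carries the result back. Note that an exception in the child would be silently lost here, which is exactly the gap async fills:

  import Control.Concurrent (forkIO)
  import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

  main :: IO ()
  main = do
    result <- newEmptyMVar
    _ <- forkIO (putMVar result (sum [1 .. 1000000 :: Int]))
    takeMVar result >>= print  -- blocks until the child fills the MVar
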
internet_points 4/17/2025||
https://www.oreilly.com/library/view/parallel-and-concurrent... is a great resource for those who want to go deeper into this
cosmic_quanta 4/17/2025||
The author is thinking of updating the book to a second edition as well. Looking forward to it.
ahel 4/17/2025||
noice
alatriste 4/17/2025||
I read that book many years ago, but I haven't looked into Haskell for a long time. Is it still relevant today? I imagine many things have changed in 12 years!
valcron1000 4/17/2025||
The fundamentals are the same, and `async` is as relevant as it was back then. The ecosystem is extremely stable in that regard.
ackfoobar 4/17/2025||
"Fast" in title. But I don't see any benchmarks, or discussion on how to reason about the performance of STM code.
thuanao 4/17/2025||
Correct? Anyone who has worked with concurrency in Haskell is probably laughing... :)

Haskell's IO type system doesn't model concurrency at all. `IO a` could be a fork and join, an infinite event loop, really anything; it's a black box in terms of "correctness". Better than using JavaScript maybe, but hardly "correct" in any sort of formal, tractable sense.

wodenokoto 4/17/2025||
I don't know how async is in other languages, but I find Python's async incredibly difficult to use, and I kinda feel validated about how poor ChatGPT is at it as well.

Is it because it is just a very hard thing, or is it because it's a synchronous language with async bolted on? (I'm talking from a purely language point of view, not from a Python VM / GIL point of view.)

aeonik 4/17/2025||
The easiest language I’ve used for async is Clojure—mostly because the language is immutable by default and ~99% of the code is referentially transparent. That doesn’t magically solve async, but it removes an entire class of headaches by nudging you away from shared state and side effects. You don’t need locks if there’s nothing to lock.

Async is hard, no doubt—but some languages are designed to reduce the surface area of what can go wrong. I’ve heard great things about Erlang, Elixir, and BEAM-based languages in general. They treat async not as an add-on, but as a core architectural principle.

Starlevel004 4/17/2025||
It's because ``asyncio`` is a dogwater library that's barely functional and full of footguns. The ecosystem is about the same quality too.
gpderetta 4/17/2025||
Indeed, as much as I dislike async in general, asyncio is its own special hell.
throwaway17_17 4/17/2025||
From the footnotes:

> In bare-metal embedded systems or inside the operating system, it’s not unusual to manually break computation into state machines, driven by interrupts.

Although not the topic of TFA, the footnotes note that this is "a whole different ball game." Does anyone have any good source for this aspect of 'low-level'/OS development? I'm more than capable of chasing down sources from a more high-level introduction or overview, so anything would be helpful. This concept seems like it may just be a more pragmatic description of embedded/OS development than what I've read previously.

thih9 4/17/2025||
After reading the title I was expecting a "pick two". From my anecdotal experience, Haskell is usually far from simple, but other configurations are possible too.
michalsustr 4/17/2025||
I'm not familiar with Haskell concurrency. The combination of green threads and large memory allocations due to immutable data structures sounds like it would make it hard to implement a web server handling 10k+ concurrent requests on commodity hardware?

Btw, too bad the author mentions the microsecond-guarantees use case but does not provide a link; that would be interesting reading.

cosmic_quanta 4/17/2025||
> sounds like it would be hard to implement a web server handling 10k+ concurrent requests on commodity hardware?

In practice, it is not. The canonical Haskell compiler, GHC, is excellent at transforming operations on immutable data, as Haskell programs are written, into efficient mutations, at the runtime level. Also, since web development is quite popular in the Haskell community, lots of people have spent many hours optimizing this precise use-case.

In my experience, the real downside is that compilation times are a bit long -- the compiler is doing a LOT of work after all.

eru 4/17/2025||
> The canonical Haskell compiler, GHC, is excellent at transforming operations on immutable data, as Haskell programs are written, into efficient mutations, at the runtime level.

Yes, at the level of native machine code and memory cells, there's not that much of a difference between immutability + garbage collection, and higher level source code that mutates. Thanks to GC you are going to overwrite the same memory locations over and over again, too.

whateveracct 4/18/2025||
Programmers for some reason really don't understand that generational garbage collection provides locality. I am really surprised how often I see C/C++/Rust types not understand this.
eru 4/18/2025||
I think that only applies to a moving GC. A conservative GC (like the Boehm GC for C) doesn't move any items around, and thus doesn't do anything for locality.

Of course, even a moving GC has limits; it won't turn a hashtable into something that has local accesses.

stevan 4/17/2025|||
> Warp is a high-performance HTTP server library written in Haskell, a purely functional programming language. Both Yesod, a web application framework, and mighty, an HTTP server, are implemented over Warp. According to our throughput benchmark, mighty provides performance on a par with nginx.

Source: https://aosabook.org/en/posa/warp.html

_jackdk_ 4/17/2025|||
The interaction of laziness and purity means that the memory costs are not always what you think. Purity means that it's a lot safer to share structure between old and new versions of a data structure where an imperative language would have to do defensive copying, and laziness means that you can incrementally amortise the cost of expensive rebalancing operations (Okasaki is the standard reference for this).
eru 4/17/2025|||
> [...] large memory allocations due to immutable data structures sounds [...]

Why would there be large memory allocations because of immutable data structures? Btw, you can also use immutable data structures in e.g. Rust fairly easily. And Haskell also supports mutation and mutable data structures.

However, Haskell can use a lot of memory, but that's more to do with pervasive 'boxing' by default, and perhaps laziness.

nesarkvechnep 4/17/2025||
No reason. OC probably thinks that immutable data structures are always copied when being operated on.
michalsustr 4/17/2025||
Yes indeed, unless you use ropes or other specialised structures
xmcqdpt2 4/17/2025|||
Lists aren’t copied on prepend.

Tries (like Scala's Vector) or trie maps (the core map types of Scala, Clojure, and probably Haskell?) aren't copied on updates.

In fact, whether a data structure is an immutable or persistent data structure or merely an unmodifiable data structure (like Kotlin uses) is based on whether it requires full copies on most updates or not. In FP languages, immutable data structures aren’t “specialized” at all.

Y_Y 4/17/2025||
> whether a data structure is an immutable or persistent data structure or merely an unmodifiable data structure...

This hurt my brain. It seems that in some places (e.g. Java land) unmodifiable refers to something that you can't modify but could just be a wrapper around a structure that can be modified. In that case they use immutable to mean something that is nowhere modifiable.

I may be misrepresenting this idea, but I think the terminology is so poor that it deserves to be misunderstood.

mrkeen 4/17/2025||
Think about mutability in Java land this way:

  // Using mutability.
  // `increment` is void, and makes 2 bigger for everyone.
  increment(2); 

  // Typical Java "safety".
  // It's still void, but now it throws a RuntimeException
  // because the developers are saving you from making everyone's 2 bigger.
  increment(2);

  // Immutable
  // Returns 3
  increment(2);
tasuki 4/17/2025||||
Doesn't it depend on the data structure? Eg prepending to a list is actually cheaper with immutable data structures: you keep the original list and add a new head pointing to its head. Now you have two lists available in your program, but only one stored in memory. Yay!
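
In code (a trivial illustrative sketch):

  xs, ys :: [Int]
  xs = [2, 3, 4]
  ys = 1 : xs  -- one new cons cell; ys shares all of xs, nothing is copied
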
whateveracct 4/18/2025||||
Well luckily, Haskell is full of said "specialized structures."

containers and unordered-containers handle most of your needs, and they only copy their trees' spines (O(log n)) on update.
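
A small example with `Data.Map` from containers (my sketch, using the standard API):

  import qualified Data.Map.Strict as Map

  m0, m1 :: Map.Map Int String
  m0 = Map.fromList [(1, "a"), (2, "b"), (3, "c")]
  m1 = Map.insert 4 "d" m0
  -- Only the O(log n) root-to-leaf path is rebuilt for m1; the remaining
  -- subtrees are shared, and m0 stays fully usable.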

eru 4/18/2025||
Haskell also supports e.g. arrays with O(1) in-place updates just fine.
nesarkvechnep 4/17/2025|||
Not really. You might want to look into "Purely Functional Data Structures" by Chris Okasaki.
whateveracct 4/18/2025|||
It doesn't actually have "large memory allocations" due to immutable data structures. This is a meme that isn't true. Immutable data structures, especially at small scale, do not have huge performance penalties. You don't copy the entire structure over and over...you copy the O(log n) spine.

Haskell's GC is also fast when you are mostly generating garbage, which is inherently true for web server handlers.

butterisgood 4/18/2025||
Deforestation helps with that.

A composition of catamorphic and anamorphic functions can eliminate a lot of the in-between allocations (a hylomorphism).

Basically it looks like you’re building a ton of intermediate structure then consuming it - meaning much of the in-between stuff can be eliminated.

Interesting optimizations and a little mind blowing when you see it.
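
For the curious, the standard recursion-schemes encoding (a textbook sketch, not code from TFA): `factorial` below is an unfold fused with a fold, and the intermediate list-shaped structure is never actually built:

  {-# LANGUAGE DeriveFunctor #-}

  -- Base functor of lists: one "layer" of structure at a time.
  data ListF a r = Nil | Cons a r deriving Functor

  -- A hylomorphism: a fold (catamorphism) fused with an unfold (anamorphism).
  hylo :: Functor f => (f b -> b) -> (a -> f a) -> a -> b
  hylo alg coalg = alg . fmap (hylo alg coalg) . coalg

  factorial :: Integer -> Integer
  factorial = hylo alg coalg
    where
      coalg 0 = Nil              -- "produce" n, n-1, ..., 1
      coalg n = Cons n (n - 1)
      alg Nil        = 1         -- "consume" by multiplying
      alg (Cons n r) = n * r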

nesarkvechnep 4/17/2025|||
You obviously haven't run anything on the BEAM (Erlang's VM).
michalsustr 4/17/2025||
Correct. Erlang also uses green threads?
jlouis 4/17/2025||
Yes. And immutable data structures.

When data is immutable, it can be freely shared. Changes to the data essentially use copy-on-write, and only the delta gets written, since immutability means you never need a deep copy. Add that the garbage collectors of Haskell and Erlang are designed to work with a high allocation rate and have zero cost for dead data, and this is much faster than people think.

The way you implement a webserver in either Haskell or Erlang is rather trivial. Whenever there's an incoming request, you make a thread to handle it. So you don't have 1 webserver serving 10k requests. You have 10k webservers serving 1 request each. And since they are started from the same core data, they'll share that due to immutability. See also old-style Apache or PHP and fork().
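
The shape being described, as a skeletal sketch using the network package (port, backlog, and the no-op handler are placeholders):

  import Control.Concurrent (forkIO)
  import Control.Monad (forever)
  import Network.Socket

  main :: IO ()
  main = do
    sock <- socket AF_INET Stream defaultProtocol
    setSocketOption sock ReuseAddr 1
    bind sock (SockAddrInet 8080 (tupleToHostAddress (0, 0, 0, 0)))
    listen sock 1024
    forever $ do
      (conn, _peer) <- accept sock
      -- One cheap green thread per connection; every thread shares the
      -- same immutable core data.
      forkIO (handleConn conn)

  handleConn :: Socket -> IO ()
  handleConn conn = close conn  -- placeholder: a real handler goes here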

eru 4/17/2025||
Web servers handling lots of small requests are actually pretty easy to garbage collect: you just delete all the data at the end of the request.

Either you have a specialised GC that works like this, or a good general generational GC can probably pick up on this pattern on its own.

jlouis 4/17/2025||
Or you do as Erlang's BEAM VM does: each thread has its own memory area which is GC'ed individually. This means that upon request termination, you just terminate the thread and the memory is reclaimed with no need for a GC pass.
eru 4/18/2025||
In the abstract, this is very similar to spawning a Unix process for every request, never freeing any memory, and letting the memory allocation die with the process.
lemper 4/17/2025||
nah bro, Warp is quite performant. Think there were some consultancies that wrote Haskell web apps for their clients.
kookamamie 4/17/2025|
> 1. Compose the program into several threads of execution, traditionally scheduled and ran by the operating system

Step 0 is missing:

Compose the program into several lanes of execution, traditionally executed via SIMD.

This is a massive piece of performance left on the table on modern computer architectures, by assuming threading is the first manifestation of concurrency.

jayd16 4/17/2025|
SIMD has been somewhat of a massive failure in this regard. Unlike threads, most languages seem to ignore its existence and abdicate its usage to the sufficiently complex compiler.

I wish there was better author-time feedback to the developer on where they're getting such a perf boost. As far as I'm aware, there's no popular linting or blue squiggle to guide you in the right direction.

In games it seems like the popular pattern is to rewrite everything entirely in an entity component system framework.

kookamamie 4/17/2025||
Agreed completely. Most auto-vectorization approaches are hit-or-miss, and you still cannot trivially build fat binaries where the instruction set is selected dynamically.

ISPC comes close, but does come with a learning curve.

SleepyMyroslav 4/18/2025||
I would say that Highway [1] comes close. Can't say anything about ISPC because in gamedev work it never even came into consideration for multiple platforms.

1. https://google.github.io/highway/en/master/
