Posted by klelatti 4 days ago
Some of the most fun I've had programming assembly has been writing HDMI video scanout kernels for the RP2040 chip[1]. It was a delightful puzzle to make every single cycle count. There is a great sense of satisfaction in using every one of the 8 "low" registers (the other 8 "high" registers generally take one more cycle to move into a low register, but there are exceptions such as add and compare where they are free; thus you almost always keep the loop termination bound in a high register). Most satisfying, you can cycle-count and predict the performance very accurately, which is not at all true on modern 64-bit processors. These video kernels could not be written in Rust or C with anywhere near the same performance. Also, in general, Rust compiles to pretty verbose code, which matters a lot when memory is limited.
Ironically, the reasons for this project being on hold also point to the downside of assembler: since then, the RP2350 chip has come out, and huge parts of the project would need to be rewritten (though it would be much, much more capable than the first version).
[1]: https://github.com/DusterTheFirst/pico-dvi-rs/blob/main/src/...
That said, I have several times wanted to reach for LLVM intrinsics. In Rust, these are mostly available through a nightly-only feature (in std::intrinsics). One thing that potentially unlocks is "unordered" memory semantics, which are intermediate between nonatomic accesses and relaxed atomics: they allow much of the optimization of the former, while not being UB if there's a data race. In a similar vein is the LLVM "freeze" operation, which turns a read from uninitialized memory into a well-defined (if arbitrary) bit pattern. There's some discussion ([1] [2], for example) of adding those to Rust proper, but it's tricky.
[1]: https://internals.rust-lang.org/t/using-llvms-unordered-read...
[2]: https://internals.rust-lang.org/t/what-if-reading-uninit-ram...
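For context, here's a minimal stable-Rust sketch of the closest thing available today: relaxed atomic loads. The function and buffer are my own illustration, not from any project linked above, and since LLVM's unordered ordering has no stable Rust spelling, the comments only describe how it would differ.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Relaxed atomic loads are race-free, but the compiler must treat each
// load as an observable atomic access, which limits how freely it can
// coalesce or reorder them compared with plain (nonatomic) reads.
// LLVM's "unordered" ordering sits in between: still not UB under a
// data race, yet most plain-load optimizations remain legal. Rust has
// no stable way to request it; this sketch uses Relaxed instead.
fn sum_relaxed(buf: &[AtomicU32]) -> u32 {
    buf.iter()
        .map(|a| a.load(Ordering::Relaxed))
        .fold(0u32, u32::wrapping_add)
}

fn main() {
    let data: Vec<AtomicU32> = (1u32..=4).map(AtomicU32::new).collect();
    println!("{}", sum_relaxed(&data)); // prints 10
}
```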
But as another data point, for something I really want to do that's not yet expressible in Rust (fp16 SIMD operations), I would rather write NEON assembly language than LLVM IR. And I am quite certain I don't want to write any of the GPU variants by hand either.
It's more "portable" than assembly in the sense that the same optimizer passes can work on multiple architectures. The static-single-assignment (SSA) restriction makes compiler passes easier to write: every value has exactly one defining assignment, so a pass never has to track which of several writes to a variable reaches a given use.
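A toy illustration of the renaming SSA performs, sketched in Rust with fresh bindings standing in for SSA values (function names are invented for the example):

```rust
// Source-level code reassigns the same variable:
fn with_mutation() -> i32 {
    let mut x = 1;
    x = x + 2; // second assignment to `x`
    x * 3
}

// SSA construction renames each definition to a fresh value, so every
// value is assigned exactly once and use-def chains become trivial:
fn ssa_form() -> i32 {
    let x0 = 1;
    let x1 = x0 + 2; // one definition per value
    x1 * 3
}

// At control-flow joins (e.g. after an if/else), SSA would merge the
// incoming definitions with a phi node choosing by predecessor block.
fn main() {
    assert_eq!(with_mutation(), ssa_form());
    println!("{}", ssa_form()); // prints 9
}
```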
Dunno why you'd say that - real registers are neither static nor single assignment
* Intel 8051, dating from 1980. You can still buy e.g. the CH559, based on the same architecture with a USB interface retrofitted somehow.
* AVR 8-bit architecture. Readily available in (older) Arduinos or easy to handle chip packages.
For modern architectures, reading skills can be very valuable, as they allow you to chase bugs beyond the high level language border. Astonishes your friends & confounds your enemies. Writing is decidedly more niche.
Each card model is a snowflake in what it supports, which is why dynamic compilers are used.
Many years ago: x86 for reverse engineering. Nowadays: bootable games (https://github.com/nanochess is an excellent treasure trove) and classic game consoles (GBA, SNES) etc.