Posted by klelatti 4 days ago
Some of the most fun I've had programming assembly has been writing HDMI video scanout kernels for the RP2040 chip[1]. It was a delightful puzzle to make every single cycle count. There is a great sense of satisfaction in using every one of the 8 "low" registers (the other 8 "high" registers generally take one more cycle to move into a low register, but there are exceptions such as add and compare where they are free; thus you almost always keep the loop termination bound in a high register). Most satisfying, you can cycle-count and predict the performance very accurately, which is not at all true on modern 64-bit processors. These video kernels could not be written in Rust or C with anywhere near the same performance. Also, in general, Rust compiles to pretty verbose code, which matters a lot when memory is limited.
Ironically, the reasons for this project being on hold also point to the downside of assembler: since then, the RP2350 chip has come out, and huge parts of the project would need to be rewritten (though it would be much, much more capable than the first version).
[1]: https://github.com/DusterTheFirst/pico-dvi-rs/blob/main/src/...
That said, I have several times wanted to reach for LLVM intrinsics. In Rust, these are mostly available through a nightly-only feature (in std::intrinsics). One thing that potentially unlocks is "unordered" memory semantics, which are intermediate between nonatomic accesses and relaxed atomics: they allow much of the optimization of the former, while not being UB if there's a data race. In a similar vein is the LLVM "freeze" operation, which turns a read from uninitialized memory into a well-defined (if arbitrary) bit pattern. There's some discussion ([1] [2], for example) of adding those to Rust proper, but it's tricky.
[1]: https://internals.rust-lang.org/t/using-llvms-unordered-read...
[2]: https://internals.rust-lang.org/t/what-if-reading-uninit-ram...
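For context, here's a minimal stable-Rust sketch of the closest thing available today: relaxed atomic loads. The function and buffer are my own illustration, not from any project linked above, and since LLVM's unordered ordering has no stable Rust spelling, the comments only describe how it would differ.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Relaxed atomic loads are race-free, but the compiler must treat each
// load as an observable atomic access, which limits how freely it can
// coalesce or reorder them compared with plain (nonatomic) reads.
// LLVM's "unordered" ordering sits in between: still not UB under a
// data race, yet most plain-load optimizations remain legal. Rust has
// no stable way to request it; this sketch uses Relaxed instead.
fn sum_relaxed(buf: &[AtomicU32]) -> u32 {
    buf.iter()
        .map(|a| a.load(Ordering::Relaxed))
        .fold(0u32, u32::wrapping_add)
}

fn main() {
    let data: Vec<AtomicU32> = (1u32..=4).map(AtomicU32::new).collect();
    println!("{}", sum_relaxed(&data)); // prints 10
}
```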
But as another data point, for something I really want to do that's not yet expressible in Rust (fp16 SIMD operations), I would rather write NEON assembly language than LLVM IR. And I am quite certain I don't want to write any of the GPU variants by hand either.
It's more "portable" than assembly in the sense that the same optimizer passes can work on multiple architectures. The static-single-assignment (SSA) restriction makes compiler passes easier to write: every value has exactly one defining assignment, so a pass never has to track which of several writes to a variable reaches a given use.
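A toy illustration of the renaming SSA performs, sketched in Rust with fresh bindings standing in for SSA values (function names are invented for the example):

```rust
// Source-level code reassigns the same variable:
fn with_mutation() -> i32 {
    let mut x = 1;
    x = x + 2; // second assignment to `x`
    x * 3
}

// SSA construction renames each definition to a fresh value, so every
// value is assigned exactly once and use-def chains become trivial:
fn ssa_form() -> i32 {
    let x0 = 1;
    let x1 = x0 + 2; // one definition per value
    x1 * 3
}

// At control-flow joins (e.g. after an if/else), SSA would merge the
// incoming definitions with a phi node choosing by predecessor block.
fn main() {
    assert_eq!(with_mutation(), ssa_form());
    println!("{}", ssa_form()); // prints 9
}
```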
Dunno why you'd say that - real registers are neither static nor single assignment
* Intel 8051, dating from 1980. You can still buy e.g. the CH559, based on the same architecture with a USB interface retrofitted somehow.
* AVR 8-bit architecture. Readily available in (older) Arduinos or easy to handle chip packages.
For modern architectures, reading skills can be very valuable, as they allow you to chase bugs beyond the high level language border. Astonishes your friends & confounds your enemies. Writing is decidedly more niche.
Each card model is a snowflake in what it supports, which is why dynamic compilers are used.
Many years ago: x86 for reverse engineering. Nowadays: bootable games (https://github.com/nanochess is an excellent treasure trove) and classic game consoles (GBA, SNES) etc.