Posted by jnsgruk 3 days ago
If you’re doing that sort of work, you also shouldn’t use pre-compiled PyPi packages for the same reason - you leave a ton of performance on the table by not targeting the micro-architecture you’re running on.
That said, most of those packages will just read the hardware capability from the OS and dispatch an appropriate codepath anyway. You maybe save some code footprint by restricting the number of codepaths it needs to compile.
There’s also the difference between being able to run and being able to run optimised. At least 5 years ago, the Ubuntu/Debian builds of FFTW didn’t include the parallelised OpenMP library.
In a past life I did HPC support and I recommend the Spack package manager a lot to people working in this area because you can get optimised builds with whatever compiler tool chain and options you need quite easily that way.
If you do a typical: "cmake . && make install" then you will often miss compiler optimisations. There's no standard across different packages so you often have to dig into internals of the build system and look at the options provided and experiment.
Typically if you compile a C/C++/Fortran .cpp/.c/.fXX file by hand, you have to supply arguments to instruct the use of specific instruction sets. -march=native typically means "compile this binary to run with the maximum set of SIMD instrucitons that my current machine supports" but you can get quite granular doing things like "-march=sse4,avx,avx2" for either compatibility reasons or to try out subsets.
> As a result, we’re very excited to share that in Ubuntu 25.10, some packages are available, on an opt-in basis, in their optimized form for the more modern x86-64-v3 architecture level
> Previous benchmarks we have run (where we rebuilt the entire archive for x86-64-v3 57) show that most packages show a slight (around 1%) performance improvement and some packages, mostly those that are somewhat numerical in nature, improve more than that.
1. https://riscv.atlassian.net/wiki/spaces/HOME/pages/16154732/... 2. https://developer.arm.com/documentation/dui0801/h/A64-Floati...
I couldn't run something from NPM on a older NAS machine (HP Microserver Gen 7) recently because of this.
To me hwcaps feels like a very unfortunate feature creep of glibc now. I don't see why it was ever added, given that it's hard to compile only shared libraries for a specific microarch, and it does not benefit executables. Distros seem to avoid it. All it does is causing unnecessary stat calls when running an executable.