Posted by mpreda 1 day ago
Tell HN: GpuOwl/PRPLL, GPU software used to find the largest prime number
Feel free to ask questions about technical aspects of the GpuOwl implementation, about optimizations, tricks, efficient FFT implementation on GPUs etc. Or anything else.
[1] GpuOwl: https://github.com/preda/gpuowl [2] GIMPS: https://www.mersenne.org/
1. I guess the most time consuming part is multiplication, right? What kind of FFT do you use? Schönhage-Strassen, multi-prime NTT, ..? Is it implemented via floating-point numbers or integers?
2. Not sure if you encountered this, but do you have any advice for small mulmod (multiplication reduced by prime modulus)? By small I mean machine-word size (i.e. preferably 64-bits).
3. For larger modulus, what do you use? Is it worth precomputing the inverse by, say, Newton iteration or is it faster to use asymptotically slower algorithms? Do you use Montgomery representation?
4. Does the code use any kind of GCD? What algorithm did you choose?
5. Now this is a bit broad question, but could you perhaps compare the traditional algorithms implemented sequentially (e.g. GMP) and algorithm suitable to run on GPUs? I mean, does it make sense to use, say, a quadratic algorithm amenable to parallel execution, rather than a asymptotically faster (and sequential) algorithm?
4. GCD is not used in PRP, but it is used in P-1 (Pollard's P-1 algo). We use GMP GCD on the CPU (as it's a very rare operation, and GMP/CPU is fast enough). I understand the complexity of the GCD as implemented in GMP is logarithmic which is good.
5. For our dimension it does not make sense to use a quadratic algo instead of a NlogN one; We absolutely need the NlogN provided by convolution/FFT.
- involved in TF (Trial Factoring), including on GPUs
- involved in NTT transforms ("integer FFTs")
[1] http://mersenneforum.org/What we do is modular squaring iterations:
x := x^2 mod M,
where M== 2^p - 1, i.e. M is the Mersenne number that we test.
Realize that working modulo 2^p - 1 means that 2^p == 1, which corresponds to a circular convolution of size p bits. We use the "Irrational Base Discrete Weighted Transform", IBDWT [1] introduced by Crandall/Fagin to turn this into a convolution of a convenient size N "words", where each word contains about 18bits, so Words ~= p/18. For example our prime of interest M52 was tested with a FFT of size 7.5M == 1024 * 15 * 512.
The FFT is a double precision (FP64) floating point FFT. Depending on the FFT size we can make use of about 18bits per FFT "word", where a "word" corresponds to one FP64 value.
Some tricks involved up to this point are: one FFT size halving and the modular reduction for free because of IBDWT. Another FFT size halving because turning the real input/output values into complex numbers in the FFT.
The FFT implementation that we found appropriate for GPUs is the "matrix FFT", which splits the FFT of size N=A*B into sub-FFTs of size A, one matrix multiplication with about A*B twiddle factors, and sub-FFTs of size B. In practice we split the FFT into three dimensions, e.g. for M52 we used: 7.5M == 1024 * 15 * 512.
We implement in a workgroup one FFT of size 1024 or 512. These are usually base-4 FFTs, with transpositions using LDS (Local Data Share, local per-workgroup memory in OpenCL).
The convolution is formed of:
- forward FFT
- element-wise multiplication
- inverse FFT
After the inverse FFT, we also need to do Carry propagation which properly turns the convolution into a multi-word multiplication.For performance we merge a few logical kernels that are invoked in succession into a single big kernel, where possible. The main advantage of doing so is that the data does not need to transit through "global memory" (VRAM) anymore but stays local to the workgroup, which is a large gain.
So, to recap:
- multiplication via convolution
- convolution via FP64 FFT, achieving about 18bits per FP64 word
- modular reduction for free through IBDWT
[1] https://en.wikipedia.org/wiki/Irrational_base_discrete_weigh...*I came across this new paper, INTEGER PARTITIONS DETECT THE PRIMES [1] from July, 2024, but I don't have enough knowledge to even read it. I wonder if an implementation of this method would provide any speed benefits compared to PRP.
Great work!
1). What profiling tools do you use for GPU code?
2). Where one would start, in terms of learning resources, about coding using inline GPU assembler?
3). Do you verify GPU assembler generated by a compiler from C/C++ code, in terms of effectiveness? If so, which tools do you use for that?
4). Is SIMD on GPUs a thing?
5). What are the primary factors being taken into account by you (cache sizes, microoptimizations, etc.) when you write code for a tool like gpuowl/prpll? Which factor is the most important? Thanks!
2. I'm not aware of good learning resources. Explore existing such code, e.g. opencl miners tend to use asm. Read in amdgpu/ in LLVM. Disassemble code from OpenCL and read the ISA. Explore and experiment, but it's tedious. I would not recommend to jump into ISA initially. BTW AMD does have good GCN ISA docs available online, that is useful!
3. Yes I often read the compiled ISA, and over time I discover bugs and also better understand the ISA.
4. OpenCL is SIMD, and yes it matches the GPU HW.
5. most important is to reduce the number of registers used (#VGPRs), as that influences heavilly the occupancy of the kernel. Use fewer costly instructions such as FP64 mul/FMA. Sequential memory access, and in general reduce global memory access as it's very slow. Merge small kernels into one (keep the data in the kernel). Never spill VGPRs.
A number of 136M bits (136 Mega bits), using a 7'500'000-points FFT, can be squared and mod-reduced (modular reduction) in less than 1ms (one milli-second) on consumer-priced (less than $500) GPUs.
Second: I'm confused by something in your readme. It says:
> For Mersenne primes search, the PRP test is by far preferred over LL, such that LL is not used anymore for search.
But later notes that PRP is computationally nearly identical to LL. Was that sentence supposed to say TF and P-1 instead of PRP or am I misunderstanding something about the actual computational cost of PRP?
We used to use the LL test because the LL result is a bit stronger than the PRP result, LL stating that the number is prime, while PRP saying only that it is likely prime. This is the reason LL is still used as an after-test following any successful PRP discovery, as it happened for the most recent M52 as well.
The first transition from LL to PRP happened because a very strong and cheap error-checking algorithm, that we call "the Gerbicz error check", was discovered by Robert Gerbicz. This error-check in its most efficient form only works for PRP not for LL. This error-check allows to verify the correctitude of the computation, as it progresses on the GPU, with high confidence and low overhead. It does protect against a lot of HW errors originating from e.g. the GPU VRAM overheating, the GPU having been under-volted too aggressively, bad VRAM; but also from SW bugs and from FFT precision issues.
As the test of a single exponent takes a long time (let's say 24h on a fast GPU), having confidence that this long computation is proceeding along correctly instead of wasting cycles is a great benefit from the error-check.
The second step of the transition from LL to PRP happened when the PRP proof was introduced, following on the ideas from the VDF (Verifiable Delay Function) article, which allowed to verify cheaply that a PRP test was indeed executed correcty. This eliminated the need for the Double Check (DC) which was standard procedure with the LL test; practically speeding the process up with 100%.
- Why use OpenCL when implementing GPU software
- Does it run on AMD or on Nvidia GPUs?
- How does the primality test implemented in GpuOwl work?
- How fast is it to test a Mersenne candidate?
- Why use FFTs? how large are the FFTs?
- What do you use for sin/cos?
https://x.com/HotAisle/status/1848780396609106359
If someone can come up with a way to perf test this against an H100, hit me up! It seems like something that could make a fun competition given the use of OpenCL. =)
There is a lot of documentation and HowTos on the Mersenne Forums [1] where experienced users help newcomers, and that relieves effort from myself.
Indeed, I’m curious why you’ve used OpenCL. And what was the hardware/general setup used for finding the prime?
What was your motivation behind building this software?
My personal setup is 8x Radeon Pro VII which also provide heating during the cold season. During summer the effort is in removing the excess heat, and the GPUs run in a reduced-power mode (slower & more efficient).
Motivation: a long time ago I had an AMD GPU and no way to run an LL test on it, so I decided to write my own. And I was hooked by the power of the GPU and the quest for ever more efficient, faster implem.
I am contributing to GIMPS with 2 Radeon Pro VII cards. I'm wondering what will happen when ROCm stops supporting these GPUs.
Do you have any plans to keep them working with GPUOwl/Prpll when they are no longer supported by ROCm?
Second, "does not support anymore" does not necessarily mean that it stops working on the old HW, but it could mean that new features/extension aren't implemented for the old HW anymore, and we may not care about those.
Third, AMD does contribute and integrates changes with upstream LLVM. This open-source work could be used by third parties (with significant effort I assume) to continue support.
OTOH CUDA only works on Nvidia, and that's a major limitation.
GpuOwl uses heavily FP64 ("double" floating point), and FP64 is more readily available at consumer prices on AMD GPUs. We (the GIMPS project) use a lot of Radeon VII and Radeon Pro VII GPUs, which have great FP64 at a cheap price (I am personally running 8x Radeon Pro VII that I bought new for about $300 a piece).
So you see, for us AMD GPUs are the first citizen. Of course I want to support Nvidia GPUs as well, and OpenCL allows that. Luke Durant did run GpuOwl on a lot of Nvidia GPUs in the cloud, and I'm happy GpuOwl did work well for him on Nvidia.
Nvidia A100 GPU which was used to find a new Mersenne prime has specialized dedicated hardware like tensor cores, which on A100 can work not only for FP16 and FP32 but also for FP64. Are there any benefits of utilizing this capabilities?
I thought though that prospective HPC users have more Nvidia A100 and H100 in mind when buying hardware.
But just to set it straight, GpuOwl received exactly $0 contributions or sponsoring from exactly nobody. It's a pleasure work from my side, and it's open sourced for the easy access of curious minds to the algorithms and techniques implemented. I did receive great help, in the form of source-code contributions, most importantly from George Woltman.