Tell HN: GpuOwl/PRPLL, GPU software used to find the largest prime number

Posted by mpreda 10/26/2024

Hi, I'm Mihai Preda the author of GpuOwl/PRPLL [1], an OpenCL software used by Luke Durant for his recent discovery of the largest prime number know, the 52nd Mersenne prime 2^136279841 - 1 [2].

Feel free to ask questions about technical aspects of the GpuOwl implementation, about optimizations, tricks, efficient FFT implementation on GPUs etc. Or anything else.

[1] GpuOwl: https://github.com/preda/gpuowl [2] GIMPS: https://www.mersenne.org/

72 points | 43 comments

oxxoxoxooo 10/27/2024|

Hi! Please, I also have a few questions:

1. I guess the most time consuming part is multiplication, right? What kind of FFT do you use? Schönhage-Strassen, multi-prime NTT, ..? Is it implemented via floating-point numbers or integers?

2. Not sure if you encountered this, but do you have any advice for small mulmod (multiplication reduced by prime modulus)? By small I mean machine-word size (i.e. preferably 64-bits).

3. For larger modulus, what do you use? Is it worth precomputing the inverse by, say, Newton iteration or is it faster to use asymptotically slower algorithms? Do you use Montgomery representation?

4. Does the code use any kind of GCD? What algorithm did you choose?

5. Now this is a bit broad question, but could you perhaps compare the traditional algorithms implemented sequentially (e.g. GMP) and algorithm suitable to run on GPUs? I mean, does it make sense to use, say, a quadratic algorithm amenable to parallel execution, rather than a asymptotically faster (and sequential) algorithm?

mpreda 10/27/2024||

3. Answered in point 1., we use IBDWT and get modular reduction for free through the circular convolution. This works nicely for Mersenne modulus.

4. GCD is not used in PRP, but it is used in P-1 (Pollard's P-1 algo). We use GMP GCD on the CPU (as it's a very rare operation, and GMP/CPU is fast enough). I understand the complexity of the GCD as implemented in GMP is logarithmic which is good.

5. For our dimension it does not make sense to use a quadratic algo instead of a NlogN one; We absolutely need the NlogN provided by convolution/FFT.

mpreda 10/27/2024|||

2. This was discussed to some length over the years on mersenneforum.org [1]. There is a lot of wisdom stored there but hard to find, and many smart & helpful guys, so feel free to ask there. This is an operation of interest because it's:

   - involved in TF (Trial Factoring), including on GPUs
   - involved in NTT transforms ("integer FFTs")

[1] http://mersenneforum.org/

mpreda 10/27/2024||

1. Yes, the core of the algorithm is the modular squaring. The squaring is similar to a multiplication, of course. In general, the fast multiplication is implemented via convolution, via FFTs which results in a N x log(N) time complexity of the multiplication.

What we do is modular squaring iterations:

x := x^2 mod M,

where M== 2^p - 1, i.e. M is the Mersenne number that we test.

Realize that working modulo 2^p - 1 means that 2^p == 1, which corresponds to a circular convolution of size p bits. We use the "Irrational Base Discrete Weighted Transform", IBDWT [1] introduced by Crandall/Fagin to turn this into a convolution of a convenient size N "words", where each word contains about 18bits, so Words ~= p/18. For example our prime of interest M52 was tested with a FFT of size 7.5M == 1024 * 15 * 512.

The FFT is a double precision (FP64) floating point FFT. Depending on the FFT size we can make use of about 18bits per FFT "word", where a "word" corresponds to one FP64 value.

Some tricks involved up to this point are: one FFT size halving and the modular reduction for free because of IBDWT. Another FFT size halving because turning the real input/output values into complex numbers in the FFT.

The FFT implementation that we found appropriate for GPUs is the "matrix FFT", which splits the FFT of size N=A*B into sub-FFTs of size A, one matrix multiplication with about A*B twiddle factors, and sub-FFTs of size B. In practice we split the FFT into three dimensions, e.g. for M52 we used: 7.5M == 1024 * 15 * 512.

We implement in a workgroup one FFT of size 1024 or 512. These are usually base-4 FFTs, with transpositions using LDS (Local Data Share, local per-workgroup memory in OpenCL).

The convolution is formed of:

   - forward FFT
   - element-wise multiplication
   - inverse FFT

After the inverse FFT, we also need to do Carry propagation which properly turns the convolution into a multi-word multiplication.

For performance we merge a few logical kernels that are invoked in succession into a single big kernel, where possible. The main advantage of doing so is that the data does not need to transit through "global memory" (VRAM) anymore but stays local to the workgroup, which is a large gain.

So, to recap:

   - multiplication via convolution
   - convolution via FP64 FFT, achieving about 18bits per FP64 word
   - modular reduction for free through IBDWT

[1] https://en.wikipedia.org/wiki/Irrational_base_discrete_weigh...*

oxxoxoxooo 10/27/2024||

Thank you very much for the answers, very informative!

And congratulations on the discovery!

motorolnik 10/26/2024||

Hi, I've got few questions:

1). What profiling tools do you use for GPU code?

2). Where one would start, in terms of learning resources, about coding using inline GPU assembler?

3). Do you verify GPU assembler generated by a compiler from C/C++ code, in terms of effectiveness? If so, which tools do you use for that?

4). Is SIMD on GPUs a thing?

5). What are the primary factors being taken into account by you (cache sizes, microoptimizations, etc.) when you write code for a tool like gpuowl/prpll? Which factor is the most important? Thanks!

mpreda 10/26/2024||

1. My profiling is rudimentary but effective. I measure per-kernel execution time with OpenCL events (which register with high accuracy start/end times w. practically no overhead), and also I continously measure per-iteration time by dividing wall-time for blocks of 20'000 iterations by that nb. These measuremens are consistent and sensitive.

2. I'm not aware of good learning resources. Explore existing such code, e.g. opencl miners tend to use asm. Read in amdgpu/ in LLVM. Disassemble code from OpenCL and read the ISA. Explore and experiment, but it's tedious. I would not recommend to jump into ISA initially. BTW AMD does have good GCN ISA docs available online, that is useful!

3. Yes I often read the compiled ISA, and over time I discover bugs and also better understand the ISA.

4. OpenCL is SIMD, and yes it matches the GPU HW.

5. most important is to reduce the number of registers used (#VGPRs), as that influences heavilly the occupancy of the kernel. Use fewer costly instructions such as FP64 mul/FMA. Sequential memory access, and in general reduce global memory access as it's very slow. Merge small kernels into one (keep the data in the kernel). Never spill VGPRs.

mpreda 10/27/2024||

My above answer was typed on a mobile phone while travelling, so it was maybe exceedingly brief. But now, on a real keyboard, I can go into more detail on any point if there's interest.

motorolnik 10/26/2024||

And another more general question: (6) gcc, clang, and nvcc have some OpenMP offloading capabilities which allow to compile code into binaries which can then run on GPUs. Is the code they produce through OpenMP anywhere close to what one gets directly with i.e. opencl?

kopadudl 10/27/2024|||

Using OpenMP with the GPU may be fine depending on the problem, but you cant explore the full GPU potential. Parallelizing the loops on the GPU may be sufficient, but when it is not you have to dig deeper.

mpreda 10/26/2024|||

I don't know, I haven't eplored OpenMP myself.. maybe some day.

mpreda 10/28/2024||

Thank you all for the questions! This was basically my first submission on HN, I'm still learning how to do things around here, but the overall tone was gentle and encouraging. And my main take-away was that I need to make the software more user-friendly in order to help potential new users try it out -- I'll work on that.

dgacmu 10/27/2024||

First, congrats! Awesome work and appreciate you sharing more.

Second: I'm confused by something in your readme. It says:

> For Mersenne primes search, the PRP test is by far preferred over LL, such that LL is not used anymore for search.

But later notes that PRP is computationally nearly identical to LL. Was that sentence supposed to say TF and P-1 instead of PRP or am I misunderstanding something about the actual computational cost of PRP?

cbright 10/27/2024|

The PRP test has the same computational cost as an LL test. The reason why GIMPS now prefers to do PRP tests instead of LL tests is because an efficiently verifiable proof-of-work certificate was developed for PRP tests [1].

[1] https://doi.org/10.4230/LIPIcs.ITCS.2019.60

mpreda 10/27/2024|||

Yes. In fact the transition from LL to PRP took place in two steps, at different moments in time.

We used to use the LL test because the LL result is a bit stronger than the PRP result, LL stating that the number is prime, while PRP saying only that it is likely prime. This is the reason LL is still used as an after-test following any successful PRP discovery, as it happened for the most recent M52 as well.

The first transition from LL to PRP happened because a very strong and cheap error-checking algorithm, that we call "the Gerbicz error check", was discovered by Robert Gerbicz. This error-check in its most efficient form only works for PRP not for LL. This error-check allows to verify the correctitude of the computation, as it progresses on the GPU, with high confidence and low overhead. It does protect against a lot of HW errors originating from e.g. the GPU VRAM overheating, the GPU having been under-volted too aggressively, bad VRAM; but also from SW bugs and from FFT precision issues.

As the test of a single exponent takes a long time (let's say 24h on a fast GPU), having confidence that this long computation is proceeding along correctly instead of wasting cycles is a great benefit from the error-check.

The second step of the transition from LL to PRP happened when the PRP proof was introduced, following on the ideas from the VDF (Verifiable Delay Function) article, which allowed to verify cheaply that a PRP test was indeed executed correcty. This eliminated the need for the Double Check (DC) which was standard procedure with the LL test; practically speeding the process up with 100%.

dgacmu 10/27/2024|||

Ah, that's interesting and makes sense. Thank you!

luizfzs 10/27/2024||

Hello!

I came across this new paper, INTEGER PARTITIONS DETECT THE PRIMES [1] from July, 2024, but I don't have enough knowledge to even read it. I wonder if an implementation of this method would provide any speed benefits compared to PRP.

Great work!

[1] https://arxiv.org/pdf/2405.06451

mpreda 10/28/2024|

Sorry, I don't know, and I don't even have an oppinion on the paper yet.

PRP is a pretty efficient test though, I would consider it a breakthrough for anything to improve on the efficiency of PRP for mersenne candidates.

kristopolous 10/27/2024||

The binary of this number is over 16MB of 1s. that's nuts.

mpreda 10/27/2024|

What's nuts is how fast you can square such a number on a GPU!

A number of 136M bits (136 Mega bits), using a 7'500'000-points FFT, can be squared and mod-reduced (modular reduction) in less than 1ms (one milli-second) on consumer-priced (less than $500) GPUs.

kristopolous 10/27/2024||

Really? What on earth... I was trying to guess the number as I was reading your post and I was thinking "a few seconds". I'll go try it later today.

rigmonger 10/26/2024||

First of all, thank you for your work and congratulations on your achievements, both in the search for Mersenne primes and software development.

I am contributing to GIMPS with 2 Radeon Pro VII cards. I'm wondering what will happen when ROCm stops supporting these GPUs.

Do you have any plans to keep them working with GPUOwl/Prpll when they are no longer supported by ROCm?

mpreda 10/27/2024|

IF ROCm stops supporting Radeon Pro VII, the first solution is to stay on the most recent ROCm that still supports them.

Second, "does not support anymore" does not necessarily mean that it stops working on the old HW, but it could mean that new features/extension aren't implemented for the old HW anymore, and we may not care about those.

Third, AMD does contribute and integrates changes with upstream LLVM. This open-source work could be used by third parties (with significant effort I assume) to continue support.

amuresan 10/27/2024||

Hi Mihai! This is impressive work!

Are you aware of any other computational maths problems where a sufficiently motivated amateur could make an improvement on the state of the art?

mpreda 10/28/2024|

> Are you aware of any other computational maths problems where a sufficiently motivated amateur could make an improvement on the state of the art?

Unfortunatelly no, there's nothing I can think of, but that's clearly because I don't know what's out there.

If there's something you'd like to do, focus on something small/simple at first, get it done, and iterate from there.

mpreda 10/26/2024||

Some topic ideas:

  - Why use OpenCL when implementing GPU software
  - Does it run on AMD or on Nvidia GPUs?
  - How does the primality test implemented in GpuOwl work?
  - How fast is it to test a Mersenne candidate?
  - Why use FFTs? how large are the FFTs?
  - What do you use for sin/cos?

latchkey 10/27/2024|

It definitely runs on our AMD MI300x. But, the documentation is pretty fragmented and requires a bunch of math knowledge that I don't have, so I'm not really sure how to run it. Just some proof of working...

https://x.com/HotAisle/status/1848780396609106359

If someone can come up with a way to perf test this against an H100, hit me up! It seems like something that could make a fun competition given the use of OpenCL. =)

mpreda 10/27/2024||

Point taken. I need to improve the documentation and make it easier to start with.

There is a lot of documentation and HowTos on the Mersenne Forums [1] where experienced users help newcomers, and that relieves effort from myself.

[1] http://mersenneforum.org/

latchkey 10/27/2024||

Thanks. Just some feedback. Ideally, I'd download something, compile it and then be able to just run the binary and have it start up and do something. Right now, it does nothing. Ideally, the app could even connect to some central server to get work, and just start chugging along doing whatever it needs to do.

Heading to the forums, it is a mess of what looks like decades of information. Tons of it outdated. Much of it heavy in math terms, I don't know anything about. I wish I did! I wish I was smarter! Following the "follow this first" is just a whole bunch of random information.

If you're looking for people to throw compute at the problem, you need to cater to the LCD of people like me who have tons of compute, but don't have the time to get a math degree or pile through forums for information.

Imagine if in order to use a web browser, you needed to understand every single underlying protocol first.

mpreda 10/28/2024||

No, you don't have to understand the algorithms in order to use the software.

But I understand your feedback. I put it on my list to make it really easy to see the software running once you have the executable.

There is one little complication though -- you do need a working OpenCL install in order to use GpuOwl. For example, for AMD GPUs, you'd need to install ROCm. On Nvidia GPUs you'd need a working install of CUDA.

latchkey 10/28/2024||

By default, we provision our machines for our customers, with Ubuntu and the latest version of ROCm.

The software compiled cleanly and easily, that part is very well done.

The flags on the software though are greek to me.

iyn 10/26/2024|

Wow, congrats!

Indeed, I’m curious why you’ve used OpenCL. And what was the hardware/general setup used for finding the prime?

What was your motivation behind building this software?

mpreda 10/26/2024||

OpenCL works on both AMD and Nvidia GPUs with mostly the same source code. By supporting at-runtime compilation it allows a lot of code particularization/instantiation before compilation, which reduces the power (cost) of the generated code. In general OpenCL is close enough to the HW and the generated code is improving over time (LLVM).

Motivation: a long time ago I had an AMD GPU and no way to run an LL test on it, so I decided to write my own. And I was hooked by the power of the GPU and the quest for ever more efficient, faster implem.

mpreda 10/27/2024||

The HW setup for finding the prime was Nvidia and AMD GPUs with good FP64 in the cloud, using "spot" instances for better price. This allowed scaling up quickly to many GPUs, and it did have a significant cost.

My personal setup is 8x Radeon Pro VII which also provide heating during the cold season. During summer the effort is in removing the excess heat, and the GPUs run in a reduced-power mode (slower & more efficient).

More comments...