Use your Nvidia GPU's VRAM as swap space on Linux

Posted by tanelpoder 18 hours ago

Use your Nvidia GPU's VRAM as swap space on Linux(github.com)

418 points | 107 comments

yjftsjthsd-h 17 hours ago|

> Built for laptops with soldered memory and no upgrade path. If you have an RTX card sitting there with 8GB of VRAM and you're getting swapped to SSD, this puts that VRAM to work.

Well, that does at least answer my immediate question about why I would ever swap from expensive RAM to really expensive RAM:) Feels niche, but when you want it it's a good idea.

Wowfunhappy 16 hours ago||

Another possible reason that occurred to me: what if you have VRAM but you're not using it all the time? For example, let's say you bought a GPU because you like to play video games. When you're not actively gaming, you probably don't need 16 GB of VRAM just to render the desktop. Might as well use it for something else, right?

Edit: Although, this is predicated on the system being able to release VRAM that is acting as swap when it's time to start a game. Can it do that?

c0dejedi 12 hours ago|||

I am catching up on comments

The reason I wrote this is I run this laptop in hybrid (AMD display + NVIDIA as swap). So all that VRAM was going to waste.

On your question re: switchable swap. It's on my to-do list ;)

kllrnohj 3 hours ago||

Wouldn't this prevent the nvidia GPU from being power gated since it's never "idle"? So like your battery life regresses?

Saris 15 hours ago||||

It's easy enough to 'offline' swap space on Linux normally so I suspect that would work fine, as long as you didn't instantly run out of RAM when doing so.

eru 12 hours ago||

If you have enough swap on disk available, it should be fine.

nuccy 10 hours ago||||

The best case is when gaming and memory-heavy productivity work are not concurrent: if the productivity applications are stopped before gaming starts, `swapoff` can easily release the swap device without a restart.
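A rough way to check the condition mentioned in this subthread (that the pages held on the VRAM device fit into free RAM plus other swap before you run `swapoff`) is sketched below. It parses text in the format of `/proc/swaps` and `/proc/meminfo`. The device name `/dev/nbd0`, the sample numbers, and the helper names are assumptions for illustration, not taken from the project.

```python
def parse_swaps(text):
    """Parse /proc/swaps-style text into {device: (size_kib, used_kib)}."""
    swaps = {}
    for line in text.strip().splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) >= 4:
            swaps[fields[0]] = (int(fields[2]), int(fields[3]))
    return swaps


def mem_available_kib(meminfo_text):
    """Return the MemAvailable value (in KiB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    raise ValueError("MemAvailable not found")


def can_swapoff(device, swaps_text, meminfo_text):
    """True if the pages on `device` fit into available RAM plus free space
    on the remaining swap devices. This is a heuristic, not a guarantee."""
    swaps = parse_swaps(swaps_text)
    _, used = swaps[device]
    other_free = sum(size - u for dev, (size, u) in swaps.items() if dev != device)
    return used <= mem_available_kib(meminfo_text) + other_free


# Hypothetical sample data: a VRAM-backed nbd device plus an NVMe swap partition.
SAMPLE_SWAPS = """Filename Type Size Used Priority
/dev/nbd0 partition 8388604 2097152 100
/dev/nvme0n1p3 partition 16777212 0 -2
"""
SAMPLE_SWAPS_VRAM_ONLY = """Filename Type Size Used Priority
/dev/nbd0 partition 8388604 2097152 100
"""
SAMPLE_MEMINFO = "MemTotal: 16000000 kB\nMemAvailable: 1048576 kB\n"

print(can_swapoff("/dev/nbd0", SAMPLE_SWAPS, SAMPLE_MEMINFO))
print(can_swapoff("/dev/nbd0", SAMPLE_SWAPS_VRAM_ONLY, SAMPLE_MEMINFO))
```

On a real system you would pass in the contents of `/proc/swaps` and `/proc/meminfo` instead of the sample strings. With disk swap present the check passes; with only the VRAM device and 1 GiB of available RAM it does not.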

ornornor 4 hours ago|||

> you probably don't need 16 GB of VRAM just to render the desktop

Microsoft: hold my beer

Phelinofist 9 hours ago|||

So can VRAM actually be used like regular RAM? E.g. if I have a 16GB module and my GPU has 16GB VRAM, could it be made so that my system reports 32GB RAM? What would be the implications of that?

tobyhinloopen 9 hours ago|||

It behaves like slower ram I assume, due to the increased distance from the CPU and overhead. Still, it’s much faster than normal SWAP which uses a disk or SSD.

How is it reported? As SWAP space, not as RAM.

Tuna-Fish 9 hours ago|||

Typical desktop GPU ram does not support being write-back cached by the CPU. With PCIe resizable BAR, you could map the area into ram, so you could technically fit 32GB to memory, but it would have to be uncached (or write-combine cached), which would make it really, really slow.

There are a bunch of datacenter GPUs that support full cache coherency, but if you used them like that the VRAM would be very high latency from the CPU. So it would only be really slow.

ChocolateGod 5 hours ago|||

I assume on Linux you could use something like daxctl to tell the kernel to treat the vRAM as normal RAM, but I think this would be Intel/AMD only.

Tuna-Fish 4 hours ago||

I don't think it would help. It's not just a software issue that can be fixed in the kernel, the hardware fundamentally isn't part of the cache coherency system of the CPU.

ErroneousBosh 10 hours ago||

In the olden days we called that a "RAM Disk" and it made our Atari STs go really fast!

On the old Amstrad PCWs that were everywhere at least in the UK in the mid 80s to mid 90s you could have up to 512kB of RAM, a fair chunk of which could be a RAM disk. This made compiling stuff in Turbo Pascal really fast too :-)

3form 7 hours ago|||

Except swap is, like, opposite of RAM disk.

That said, still a nice and fun concept. Though caching has gotten better since, I assume :)

lloeki 5 hours ago||

RAM disk is, like, the brd module on Linux, which allocates and exposes a /dev/ram0 block device.

From the project description this looks like it, exposing a raw block device backed by VRAM (with some trip through the nbd protocol, but that's an implementation detail to have it in userland, it could just as well have been implemented kernel side).

It's just that the usage of this mem-backed block device is different than the thing of yore (copy HD or floppy into RAM)

The more frequent alternative to brd, tmpfs, skips the block device part and does a filesystem directly. I wonder if it could be made so that it's swap directly and skip the block device part entirely like tmpfs.

RachelF 16 hours ago||

Nice idea, but something has gone very wrong here:

>Sequential throughput: ~1.3 GB/s

[on a RTX 3070 Laptop]

This RTX 3070 chip is on PCIe 4.0 x16 which should give 64GB/s. The 8GB of GDDR6 is 448GB/s.

Swapping to an NVMe drive would be twice as fast, but with higher latency.

Teknoman117 14 hours ago||

Gen 4.0 x16 is 32 GB/s in each direction, but the way this is implemented is not the way you'd go about this if you wanted high performance.
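For reference, the 32 GB/s figure can be reproduced from the link parameters. This is a back-of-the-envelope sketch that assumes 16 GT/s per lane and 128b/130b line encoding for PCIe 4.0, and ignores packet and protocol overhead, so real transfers land lower. The function name is made up for illustration.

```python
def pcie_gbytes_per_s(gt_per_s, lanes, payload_bits=128, total_bits=130):
    """Raw one-direction bandwidth in GB/s for a PCIe link.

    gt_per_s: transfer rate per lane in GT/s (one bit per transfer).
    payload_bits/total_bits: line-encoding efficiency (128b/130b for Gen 3+).
    """
    return gt_per_s * lanes * payload_bits / total_bits / 8


gen4_x16 = pcie_gbytes_per_s(16, 16)
print(f"PCIe 4.0 x16: {gen4_x16:.1f} GB/s per direction")
print(f"Both directions combined: {2 * gen4_x16:.1f} GB/s")
```

The 64 GB/s number quoted upthread corresponds to both directions added together.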

Edit: Their benchmarks are also run using ZRAM, which compresses pages before writing to swap. Not sure what the performance overhead of that is, but it's probably quite a bit.

First of all, it's a userspace program hooking the nbd driver, which is known for being slow. It also uses a bounce buffer in userspace before transferring to the GPU. So when the kernel needs to swap a page, it has to first copy it into a userspace-facing buffer. The userspace program then has to wake back up and issue the CUDA operation to copy the page into device memory.

nbd also doesn't really do a good job of supporting high queue depth or merging adjacent accesses. So if the kernel is issuing a bunch of 4K page swaps without any coalescing, you're going to end up with at least a million kernel/userspace context switches per second just to handle 4 GB/s (4 GB / 4K page), let alone 64 GB/s. And that's just the NBD portion, forget the mess that is the NVIDIA driver. PCIe can move a lot of data, but in order to get anything even resembling the full bandwidth, you have to use DMA engines with long page lists. Having to set up a transfer for every 4K page over PCIe will not reach full saturation of the bus.
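The request-rate arithmetic above, spelled out as a small sketch. It assumes one request per 4 KiB page with no coalescing and counts only one kernel/userspace round trip per request, so it is a lower bound on context switches. The 1.3 GB/s figure is the project's reported sequential throughput quoted upthread.

```python
PAGE_BYTES = 4096  # one 4 KiB page per swap request, no merging


def requests_per_second(bytes_per_second, page_bytes=PAGE_BYTES):
    """Number of single-page requests needed to sustain a given throughput."""
    return bytes_per_second / page_bytes


r_4gb = requests_per_second(4e9)        # target discussed in the comment
r_64gb = requests_per_second(64e9)      # the "let alone" case
r_measured = requests_per_second(1.3e9)  # reported sequential throughput

for label, rate in [("4 GB/s", r_4gb), ("64 GB/s", r_64gb), ("1.3 GB/s", r_measured)]:
    print(f"{label}: about {rate:,.0f} page requests per second")
```

So even the measured 1.3 GB/s already implies a few hundred thousand round trips per second if every page is handled individually.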

Swapping to NVMe is a very optimized path -> the swapper can submit lists of pages directly to the NVMe driver and the controller can DMA them directly out of RAM, no copies or context switches CPU side at all.

This could probably be improved by migrating to the ublk driver as it might let you avoid the userspace bounce buffer. It'd also be able to have multiple write queues to at least set up CUDA copies in parallel.

lstodd 13 hours ago||

yup. it's nbd and userspace making it slow. zram on the other hand adds little.

one can get rid of zram and just reimplement some compression in shaders but I think that would be a pointless optimization.

dannyw 12 hours ago||

Swapping to an NVMe will also consume PE cycles on your NAND, i.e. wearing it out over time.

RAM/VRAM don’t degrade from use.

markhahn 12 hours ago|||

flash is a consumable, yes.

but flash endurance isn't a strong argument here. you probably have O(TB) of flash, and aren't going to produce PB of swap writes any time soon. if you do a lot of swapping to a small flash device, it'll happen sooner.

I'm typing from a quite old 4GB laptop, which swaps heavily to a 250G SATA ssd. sure, it's not great, but it also costs zero. currently 9GB of swap is used, and it's not really noticeable. if I open 20 more tabs, it can introduce pauses.

google says this drive was released in 2014, and SMART says POH is about 10 years.

SMART also says wear leveling count is 665 and total written is 165327189538 LBAs (78834 GiB, or 338 drive-writes). I'm not expecting it to die soon, though using a 4G laptop is a bit of a stunt these days...
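The conversion behind those figures, as a quick sketch. It assumes the SMART counter is in 512-byte LBAs and that "250G" means 250 decimal gigabytes. Both are assumptions, since the drive model isn't stated.

```python
LBAS_WRITTEN = 165_327_189_538  # total LBAs written, from the SMART output above
SECTOR_BYTES = 512              # assumption: 512-byte logical sectors
DRIVE_BYTES = 250e9             # assumption: 250 GB (decimal) capacity

bytes_written = LBAS_WRITTEN * SECTOR_BYTES
gib_written = bytes_written / 2**30
drive_writes = bytes_written / DRIVE_BYTES

print(f"{gib_written:,.0f} GiB written")
print(f"{drive_writes:.1f} full drive writes")
```

Under those assumptions the numbers match the comment: about 78834 GiB, or roughly 338 drive writes.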

the point is that a system that has sustained heavy swapping for years has not generated enough writes to worry about. a modern system with 10x speed and 10x capacity (and probably less RAM deficit) would have even less effect. even for QLC with its few-hundred cycle endurance spec...

LtdJorge 7 hours ago||

I guess you haven’t tried AMD’s composable kernel on Gentoo, or qtwebkit. I have a special env for the former called half-the-threads because it eats 2.5GB per thread. I removed the latter as soon as I was able to. I even add 32GB (half my RAM) of ZRAM for CK, and the Gentoo ebuild has a check for enough RAM per thread that stops the build if unmet, it wasn’t there before and I’ve had my system lock up because of OOM which OOMD wasn’t quick enough to catch.

All of this is to say that, it does have a potential impact on flash, if you rebuild often, which tends to happen on Gentoo.

c0dejedi 12 hours ago|||

This was a consideration when I wrote this

xfalcox 17 hours ago||

Given my dev machine has 32GB of RAM and 32GB of VRAM that sits mostly idle when I'm not running AI models, this is not that bad of an idea.

mathisfun123 14 hours ago|

this is the pcmasterrace equivalent of being all upper body and with scrawny legs lol

zamadatix 9 hours ago|||

Actually not that crazy of a spread. E.g. I have 48 GB + 32 GB in my gaming PC because if you go beyond 48 GB you start having to trade off more and more performance to keep the memory controller from falling over, so you really have to have a good reason to want to load more. Server platforms, like Epyc, it tends not to matter as much because you have so many channels for bandwidth and a beefier memory controller to handle them. Then on the VRAM side it's more about what makes sense for the GPU and how you plan on using it there (games or AI or modeling or whatever), and for a lot of cases the 5090 is just a good card to get for one reason or another (it just has a ton of compute + bandwidth for a consumer part).

LtdJorge 7 hours ago|||

I’ve got 64GB with a 3950x working great, although the speeds are not high. Just 3200MHz, IIRC.

ownagefool 9 hours ago|||

What's this trade off about?

I thought it was simple: 2 DIMMs are probably better than 4, but I'm unsure how you'd ever land on 48?

wtallis 8 hours ago|||

DRAM chips aren't always manufactured in power of two sizes. It's been common for years to have non power of two capacities for LPDDR used in phones, and has started to show up in other DRAM types with the current generation standards: DDR5 for desktops/servers and GDDR7 for GPUs. That's how there have been 24GB single-rank DIMMs and 48GB dual-rank DIMMs for desktops and 96GB RDIMMs for servers for a few years, and how a mobile RTX 5090 has 24GB VRAM vs mobile RTX 5080 having only 16GB VRAM despite both GPUs being different bins of the same silicon and both configurations using a 256-bit memory bus.
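To make the 16GB vs 24GB example concrete, here is a small sketch of how capacity follows from bus width and chip density. It assumes each memory chip has a 32-bit interface with one chip per channel, and that the two configurations use 2GB (16Gbit) and 3GB (24Gbit) chips respectively. The function name is made up for illustration.

```python
def vram_gb(bus_width_bits, chip_gb, chip_interface_bits=32):
    """Total VRAM for a bus fully populated with one chip per channel."""
    chips = bus_width_bits // chip_interface_bits
    return chips * chip_gb


# Same 256-bit bus, different chip densities.
with_2gb_chips = vram_gb(256, 2)  # power-of-two density
with_3gb_chips = vram_gb(256, 3)  # non-power-of-two density

print(with_2gb_chips, with_3gb_chips)
```

The same bus with eight chips gives 16GB or 24GB depending only on the density of each chip.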

scns 8 hours ago|||

Not that simple. 4 dimms were getting higher clocks on 2 CCD Ryzen models (12 & 16 cores) compared to those with one CCD. Motherboard topology is a factor too.

tempoponet 13 hours ago|||

It's fine for dense models where you need them in VRAM, less so for MoE where you're offloading layers to ram. But 32/32 is pretty good for both in the popular ~30b range right now.

xxs 5 hours ago||

running 5090 on 32GB RAM is just weird, still

kimixa 13 hours ago||

I remember this being a thing done a while back using linux's MTD/phram drivers - https://wiki.archlinux.org/title/Swap_on_video_RAM - not sure if that's still relevant though as I don't know how it'll interact with DRM and how it handles reserving some of the vram - the suggested limit using xorg.conf is probably pretty obsolete now.

That page also has a fuse filesystem implementation on top of opencl - https://github.com/Overv/vramfs - which may be more compatible.

aa-jv 4 hours ago|

Yeah, I used to map my 8 megabytes of video memory through the mtd back in the day, it helped build those .. you know .. X11 drivers .. ;)

Man, that brings back memories.

drdaeman 16 hours ago||

What about backpressure, how does it handle requirements for VRAM allocation when VRAM is used for swap space?

With X11 it's not that bad (buffers are pre-allocated), but with Wayland allocations are a lot more dynamic, so running low on VRAM can easily crash the whole desktop. I just had a few such crashes with Hyprland+llama-server+KVM switching between computers without freeing VRAM.

molticrystal 13 hours ago||

For windows I saw something similar to this years ago. An experimental proof of concept driver that allows the creation of a ram drive from vram for NVIDIA cards. Sequential is fast as you'd expect, random has lots of room for improvement.

>GpuRamDrive

>Create a virtual drive backed by GPU RAM.

https://github.com/prsyahmi/GpuRamDrive

Fork with AMD support:

https://github.com/brzz/GpuRamDrive/

c0dejedi 12 hours ago|

Thanks for sharing this, good read :-)

tgtweak 1 hour ago||

I think you can definitely improve the throughput/iops by using BAR vs treating it like a file store/mount through cuda which adds a lot of overhead.

rwmj 13 hours ago||

Similar but using OpenCL APIs, so it works on AMD (for some definition of "works" since their drivers are quite buggy): https://libguestfs.org/nbdkit-vram-plugin.1.html

c0dejedi 12 hours ago|

Thank you for pointing me to this

dragontamer 17 hours ago|

Remember how 16GBs used to be an enterprise level database mainframe?

Well, GPUs also have stupid amounts of compute on them. I have to imagine that there is some kind of database format that's useful with GPU compute attached.

Since the data is already in VRAM, the GPU can sort, join, or otherwise manipulate data as needed.

tmostak 16 hours ago||

GPU-accelerated databases have a long history. I founded HeavyAI (previously MapD/OmniSci) in 2013, but there are or have been many other startups in this space, such as Voltron Data, Kinetica, Sqream, etc. And now you have major players like IBM, Starburst, and Microsoft (which just announced Fabric SQL on GPU today) working on their own GPU-accelerated systems. GPUs have a huge advantage in terms of compute, memory, and interconnect bandwidth over CPU, as long as you can keep them fed with data.

I believe within 2-3 years databases and data warehouses on GPU will be common. The widespread use of agents to query data will be a part of this, as there will be a need to run far more queries at lower latency than needed for the ETL and BI workloads of the past.

c0dejedi 12 hours ago||

Insightful take, looking into these

einichi 16 hours ago|||

oh god please don't create more demand for GPUs

giancarlostoro 16 hours ago|||

Can we somehow make them work with 1 TB PCIes so we can churn through way more data?

dragontamer 15 hours ago|||

Have you heard of the "Radeon Pro SSG" ??

It must have failed because I never heard of an update to this GPU. But AMD definitely made a GPU with 4x NVMe SSDs attached to the GPU.

strictnein 15 hours ago||||

You are able to use GPUDirect Storage to communicate between the GPU and PCIe storage devices. It's nice, but it's not typically as performant as one would like, in comparison to the onboard memory.

https://docs.nvidia.com/gpudirect-storage/

https://github.com/microsoft/DirectStorage/tree/main

the8472 8 hours ago|||

linux has P2P-DMA for this. The drivers, devices and bus topology need to support it though.

https://docs.kernel.org/driver-api/pci/p2pdma.html

LtdJorge 7 hours ago||

I think GP means 1TB of PCIe bandwidth, instead of 1TB of PCIe NVMe drives.

Nate75Sanders 16 hours ago||

Possibly LSM compaction.
