Posted by mmastrac 3 days ago
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdttm.pages_limit=5242880 ttm.pages_limit=5242880"
One downside is that the kernel won't reserve that memory away from userland. The system will still report all of it as "free". As the GPU driver starts using it, other apps and the OS will try to use that "free" memory, not knowing how much is actually in use (it may show up as "cache", or not at all). Then the OOM killer starts firing or programs start crashing, and at some point the OS tips over or the GPU driver crashes. You can add loads of swap as a compromise and it works okay, if a bit slow.

In any case, loading a gigantic model just to run it from system RAM is absurdly slow (due to memory bandwidth), on the order of 1-5 t/s, so it's not practical. It would take a whole day to process one 86k token request. Just pay a cloud provider $0.01 to do it in 10 seconds.
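The 1-5 t/s figure checks out as a bandwidth bound: decoding is roughly memory-bandwidth-limited, since each generated token has to stream the active weights through once. A quick sketch, using illustrative numbers (dual-channel DDR4 at ~50 GB/s, a 32 GB model) rather than anything measured:

```python
# Back-of-envelope check of the 1-5 t/s claim. Decoding is roughly
# memory-bandwidth-bound: every generated token streams the active
# weights once, so tokens/s ~= bandwidth / bytes_read_per_token.
# Both numbers below are assumptions, not measurements.

ddr4_bandwidth_gb_s = 50.0   # dual-channel DDR4-3200, theoretical peak
model_size_gb = 32.0         # e.g. a 32 GB q8_0 model held in system RAM

tokens_per_s = ddr4_bandwidth_gb_s / model_size_gb
print(f"~{tokens_per_s:.1f} t/s best case")   # ~1.6 t/s

# An 86k-token request at that rate:
hours = 86_000 / tokens_per_s / 3600
print(f"~{hours:.0f} hours for 86k tokens")
```

Real throughput lands below this ceiling once CPU compute and non-sequential access patterns are factored in, which is consistent with the low end of the 1-5 t/s range.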
This is a great project. I love the possibilities it hints at. Thanks for building it!
The benchmarks at the bottom mention memory tiering and manually controlling where things go, but if your application already does that, then you probably don’t also need a CUDA shim. The application should control the VRAM to system memory transfers with boring normal code.
You’re basically stating that swapping is also a bad idea. And to take it further, any memory or storage is a bad idea, because there’s L1 cache/SRAM, which is faster than the rest.
Is that a crazy thing to say? I can't recall the last time I was grateful for swap; it might've been before 2010.
It's pretty weird to insist on a counterargument that has no implications or consequences to the presented argument.
Yes, swapping is a bad idea.
Your second argument also falls flat, because the standard CUDA hardware setup doesn't use CXL so cache coherence isn't available. You're left with manual memory synchronization. Pretending that GPUs have cache for system RAM when they don't is pretty suspect.
So don't use it for large requests. Ideal for when you just want to categorise things, for example, "does this task need a shell" or "bucket this email into one of help request, bill due or personal comms".
The ExLlamaV3 EXL3 2bpw (8 GB, full VRAM) row is an order of magnitude faster than the baseline - but the baseline seems to be the 32GB model running with the KV cache shared to system memory only (I think?)
But if an 8 GB model gives sufficient quality, then it seems like that would have worked without the shared memory thing?
I think the useful apples-to-apples benchmark is currently the Ollama + GreenBoost shim (baseline) (2-5 tps) vs ExLlamaV3 + GreenBoost cache (8–20 tps) comparison.
It would be really useful to see this compared with the existing llama CPU/memory offload. There is a note at the start ("Offload layers to CPU — works, but drops token/s by 5–10× because CPU RAM has no CUDA coherence") - but it is unclear whether that 5-10× token-speed drop is relative to running the model entirely on GPU or relative to the GreenBoost approach.
I think it is vs GPU, in which case it seems likely the performance is similar to what greenboost is giving but probably much more stable.
The reported size of the ModelOpt FP8 model, 16 GB, sounds wrong to me. If it's 8 bits per parameter, it should be a similar size to glm-4.7-flash:q8_0. They repeat this figure a few times in the README.
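The arithmetic behind that suspicion: q8_0 and FP8 both store roughly one byte per weight (q8_0 adds a small per-block scale overhead, ~8.5 bits/weight in llama.cpp), so an FP8 export of the same model should land near the q8_0 file size. A sketch, taking only the 31.8 GB q8_0 figure from the thread as given:

```python
# Sanity check on the 16 GB FP8 figure. q8_0 stores 8-bit weights plus
# a per-block scale (~8.5 bits/weight in llama.cpp); FP8 is exactly
# 8 bits/weight. So the two file sizes should be within a few percent.

q8_0_size_gb = 31.8             # figure quoted in the thread
bytes_per_param_q8_0 = 1.0625   # 8.5 bits/weight = 1.0625 bytes
params_billion = q8_0_size_gb / bytes_per_param_q8_0  # ~30B parameters

fp8_size_gb = params_billion * 1.0  # exactly 1 byte per parameter
print(f"expected FP8 size: ~{fp8_size_gb:.0f} GB")  # ~30 GB, not 16 GB
```

A 16 GB file would be consistent with ~4 bits per weight, which is a different quantization level entirely.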
"I turned a $95 AMD APU into a 16GB VRAM GPU and it can run stable diffusion!"
I have the 4650G APU, and the best way to describe it is: lacking support. This was even more true three years ago than now. ROCm was absolutely dogshit then; I know this because I tried to do the same thing when that post was made. You have to compile everything from scratch and hunt down the relevant patches, and even then xformers, a library that accelerates diffusion model inference, wasn't supported for Renoir or ROCm back then. Yes, you could generate an image, but it was much slower and riddled with bugs. You couldn't update ROCm because it broke compatibility, and that was partly the reason I got into NixOS. That being said, those APUs are a powerhouse. Nowadays I can run decent agentic workflows on them (I have 64 GB of DDR4 RAM, i.e. the APU can take as much as it needs with the latest Linux kernels).
Just note, diffusion models are still second-class citizens on AMD APUs, and even on AMD GPUs. But then again, there's nothing close on the market right now except for what Apple offers.
But I'm always interested in first-hand experiences of how good it really is - I'm pretty cynical about the idea that AMD actually knows what it takes to build good software end-to-end.
But yes I agree with you about their lack of prioritization for software!
> The code is really bad, with completely unneeded parts. The LLM (Qwen 2.5 7B) has hardcoded the i9 14700KF topology, and has variables related to it that are never used... It's even funnier that the show hardware function always prints the same string. There are even random pip log files. Why did this slop get coverage here?
https://www.phoronix.com/forums/forum/linux-graphics-x-org-d...

Does this make sense? I'd have thought the KV cache is guaranteed to be used 100% of the time, while in a MoE the same can't be said of the weights.
Though I suppose if you're shooting for a huge context, then having that allocation go into RAM makes sense, especially when it's allocated but not used yet.
MoE is kinda related in terms of lower usage requirements vs a dense model of same total param size, but I think your mental model is a bit off.
I would prefer to use system memory to cache different models, focusing on things like embedding, rerankers, and TTS. This is sufficient to run a more complex RAG locally, for example, via Mem0, and then use a larger LLM via the cloud.
> I have an RTX 5070 with 12 GB VRAM and I wanted to run glm-4.7-flash:q8_0, which is a 31.8 GB model. The standard options are:
> Offload layers to CPU — works, but drops token/s by 5–10× because CPU RAM has no CUDA coherence. You end up waiting.
> Use a smaller quantization — you lose quality. At q4_0 the model is noticeably worse on reasoning tasks.
> Buy a bigger GPU — not realistic for consumer hardware. A 48 GB card costs more than a complete workstation.
> None of those felt right, so I built an alternative: route the overflow memory to DDR4 via DMA-BUF, which gives the GPU direct access to system RAM over PCIe 4.0 without a CPU copy involved.
And then limps home with this caveat on the closest thing to a benchmark:
> The PCIe 4.0 link (~32 GB/s) is the bottleneck when the model overflows VRAM. The best strategy is to shrink the model until it fits — either with EXL3 quantization or ModelOpt PTQ — and use GreenBoost's DDR4 pool for KV cache only.
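That caveat is just the bandwidth bound moving from DRAM to the PCIe link: the overflowed portion of the weights has to cross the link for every decoded token. A sketch with illustrative numbers (the 12 GB card and 31.8 GB model from the thread, ~32 GB/s for PCIe 4.0 x16):

```python
# Why PCIe 4.0 becomes the ceiling once weights overflow VRAM: the
# overflowed weights must cross the link for every decoded token, so
#     tokens/s <= pcie_bandwidth / overflow_bytes_per_token
# Card and model sizes are from the thread; the link speed is the
# nominal PCIe 4.0 x16 figure.

pcie_gb_s = 32.0
vram_gb = 12.0
model_gb = 31.8
overflow_gb = model_gb - vram_gb   # ~19.8 GB crossing PCIe per token

max_tokens_per_s = pcie_gb_s / overflow_gb
print(f"PCIe-bound ceiling: ~{max_tokens_per_s:.1f} t/s")
```

That ceiling of roughly 1.6 t/s lines up with the 2-5 tps baseline numbers, and explains why the README's advice is to shrink the weights until they fit and spend the DDR4 pool on KV cache instead.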
I think the reason it says DDR4 specifically is that that's how the user explained it to their coding agent. LLMs are great at perpetuating unnecessary specificity.
For actual training, explicit sharding and RAM mapping are ugly, but at least you can see where the pressure is and reason about it. 'Transparent' often just means performance falls off a cliff and now debugging it sucks.
Do not put swap on an SSD you care about at all.
This.
Many people are rediscovering what swap files are for, but will still find a way to abuse them without realizing they are wearing out their SSD.
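The wear concern is easy to quantify: consumer drives are rated in TBW (terabytes written), and sustained swap thrashing can burn through that rating quickly. A sketch with assumed numbers (a typical 600 TBW rating for a 1 TB consumer NVMe drive, and an assumed thrashing write rate), not specs for any particular drive:

```python
# Rough sense of how sustained swap thrashing eats SSD endurance.
# Both numbers are illustrative assumptions: a typical TBW rating for
# a 1 TB consumer NVMe drive, and a plausible sustained write rate
# while a too-big model churns through swap.

tbw_rating_tb = 600.0         # assumed endurance rating, 1 TB drive
thrash_write_mb_s = 500.0     # assumed sustained swap write rate

tb_per_day = thrash_write_mb_s * 86_400 / 1_000_000
days_to_rating = tbw_rating_tb / tb_per_day
print(f"~{tb_per_day:.0f} TB/day -> rating exhausted in ~{days_to_rating:.0f} days")
```

Occasional light swapping is nowhere near this bad; it's the steady-state thrashing from oversubscribed memory that turns weeks of use into a drive's whole rated lifetime.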