Posted by tosh 1 day ago
Nothing new here, apart from being able to use CUDA on a less power hungry system.
> Nothing new here, apart from being able to use CUDA on a less power hungry system.
CUDA has been running on ARM SoCs since the Tegra K1, 12 years ago. Nvidia is not new to ARM, nor is CUDA.
Tech companies have strangled their own market.
Up to $5000 because why not?
For that money you could build a real PC with an RTX 5090!
We aren't so naive as to move from a locked IP ISA like x86 to another locked IP ISA such as ARM.
Right?
> The memory is not as fast as dedicated GPU memory, but it is cheap enough while delivering enough bandwidth to run AI models locally.
So, the reason "dedicated GPU memory" is fast isn't that it's "dedicated"; it's that the types of memory built into GPU cards (GDDR and HBM) are designed for throughput over latency.
Which is to say, GDDR and HBM memory could be shared with the CPU in a UMA design while still being "fast" (for GPU use cases). In fact, the PS4/5 and Xbox 360 / One X / Series consoles have unified memory architectures that use GDDR as their main memory, with no regular DDR memory to be found.
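To put rough numbers on the "throughput" point: theoretical peak bandwidth is just bus width times per-pin data rate. A minimal sketch follows; the three configurations are illustrative assumptions (typical-looking bus widths and data rates), not figures taken from this thread or from any particular product's spec sheet.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps_per_pin: float) -> float:
    """Theoretical peak bandwidth in GB/s: bus width in bytes times per-pin rate in Gb/s."""
    return bus_width_bits / 8 * data_rate_gbps_per_pin


# Illustrative configurations (assumed for the sake of the arithmetic).
configs = {
    "LPDDR5X, 256-bit @ 8.533 Gb/s": (256, 8.533),
    "GDDR6,   256-bit @ 14 Gb/s": (256, 14.0),
    "GDDR7,   512-bit @ 28 Gb/s": (512, 28.0),
}

for name, (width, rate) in configs.items():
    print(f"{name}: {peak_bandwidth_gb_s(width, rate):.0f} GB/s")
```

The gap comes from the memory type and bus width, not from whether the memory is shared with the CPU.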
What I don't understand: why don't we see UMA designs where there's both regular DDR and GDDR/HBM memory mapped into the address space of the CPU+GPU? That seems like the best of both worlds: you'd have some memory that's "tuned" for random-access CPU usage (regular DDR), and some memory that's "tuned" for streaming GPU usage (GDDR/HBM), but either type of memory can still be put to the use it wasn't "tuned" for, just with slightly worse performance.
I guess you'd need to do a bit of software work:
1. a bit of work in the OS kernel / malloc library to get CPU workloads to "prefer" allocating DDR memory over the GDDR/HBM memory until they've exhausted DDR memory (or maybe not, if you just tell the kernel the GDDR/HBM memory is something like a zswap thinpool);
2. and a bit of work in supported ML frameworks, to teach them a hybrid strategy between UMA "allocate anywhere, it's all the same" and NUMA "keep assets in VRAM if possible; if you spill assets to RAM, then they must stream into VRAM on access" (i.e. "at allocation time, allocate as if the system were NUMA, VRAM first then spilling to RAM; but at execution time, use the UMA codepaths, no need to copy RAM into VRAM").
...but once that's done, it's done.
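A toy sketch of the "prefer one pool, spill to the other" policy from points 1 and 2, assuming both pools sit in one shared address space. `Pool` and `TieredAllocator` are hypothetical names made up for illustration; this is not how any real kernel or ML framework allocator is implemented.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """A fixed-capacity memory pool (hypothetical; sizes in arbitrary units)."""
    name: str
    capacity: int
    used: int = 0

    def try_alloc(self, size: int) -> bool:
        # Succeeds only if the request fits in the remaining capacity.
        if self.used + size <= self.capacity:
            self.used += size
            return True
        return False


class TieredAllocator:
    """Tries pools in preference order, spilling to the next one when full."""

    def __init__(self, pools):
        self.pools = list(pools)

    def alloc(self, size: int) -> str:
        for pool in self.pools:
            if pool.try_alloc(size):
                return pool.name  # either pool is directly usable: no copy step
        raise MemoryError(f"no pool can satisfy a request of size {size}")


# Both allocators share the same two pools; only the preference order differs.
ddr = Pool("DDR", capacity=8)
gddr = Pool("GDDR", capacity=8)
cpu = TieredAllocator([ddr, gddr])   # CPU work prefers DDR
gpu = TieredAllocator([gddr, ddr])   # GPU work prefers GDDR/HBM

placements = [
    gpu.alloc(6),  # fits in GDDR
    gpu.alloc(6),  # GDDR is nearly full, so this spills to DDR
    cpu.alloc(2),  # takes what is left of DDR
    cpu.alloc(1),  # DDR is full, so this spills to GDDR
]
print(placements)
```

The preference order only matters at allocation time; after that, anything that spilled is used in place rather than streamed across a bus.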
Nvidia going from GPU to CPU now?