Nvidia is proposing a beast of a CPU system for Windows PCs

Posted by tosh 6 hours ago

Nvidia is proposing a beast of a CPU system for Windows PCs (twitter.com)

88 points | 218 comments | page 2

SwtCyber 5 hours ago|

The interesting part to me isn't really the Cortex-X925 vs AVX-512 comparison, but Nvidia trying to make the GPU the center of a Windows PC rather than an add-in card

amacbride 2 hours ago||

It's effectively the same as the GB10 in the DGX Spark (Blackwell architecture, 6,144 CUDA cores, perf-wise comparable to an RTX 5070).

I've found it very useful for running big models, but it's not a screaming powerhouse in terms of raw compute.

adamnemecek 2 hours ago|

They are early versions, wait 4 years.

tosh 5 hours ago||

nb: poster is Daniel Lemire (https://lemire.me), who is very skilled at getting performance out of compute hardware (e.g. via SIMD, cache usage, etc.)

tempodox 4 hours ago||

Still, Microslop has repeatedly proven their ability to slow everything down to a crawl no matter how powerful the hardware. If you want it to be fast, don’t use Windows.

infecto 5 hours ago||

As he likes to share often, "He ranks among the top 2% of scientists globally (Stanford/Elsevier 2025) and is one of GitHub's top 1000 most followed developers."

tosh 5 hours ago||

based on citations and github stars? or what's the context there?

infecto 4 hours ago||

I was adding further citation based on his own claims. Not sure what context is missing.

embedding-shape 3 hours ago||

> up to 6,144 state-of-the-art CUDA cores

An RTX Pro 6000 has ~24K CUDA cores (with 5th generation tensor cores), I'm guessing this would then be 1/4 of the count but 6th generation? Wasn't clear from the images.

gravypod 2 hours ago|

What is more important than core count is how the caching architecture is laid out. They could lay out those 6k CUDA cores in a way that gives much larger blocks of cache to a smaller number of cores. That would increase the effective memory bandwidth, which would be better for inference.

embedding-shape 2 hours ago||

Sounds like the memory bandwidth is worse though:

> The memory is not as fast as dedicated GPU memory, but it is cheap enough while delivering enough bandwidth to run AI models locally.

Also "cheap while delivering enough" certainly sounds like someone is trying to temper expectations. It sounds like something sitting in between GPU+VRAM inference and CPU+RAM inference, not a step above or beside GPU+VRAM.
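A back-of-envelope sketch of why the bandwidth tier matters for local inference, assuming single-stream decoding is memory-bound (every generated token reads all the weights once). The bandwidth figures and the 70B / 4-bit model are illustrative assumptions, not numbers from the post:

```python
def decode_tokens_per_second(bandwidth_gb_s, params_billion, bytes_per_param):
    """Upper bound on decode speed when each token must stream all weights."""
    model_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / model_gb


# Hypothetical bandwidth tiers in GB/s (assumed round numbers for illustration).
tiers = {
    "CPU + dual-channel DDR5": 90,
    "unified LPDDR5X pool": 270,
    "discrete GPU + GDDR7": 1800,
}

# Assumed model: 70B parameters quantized to 4 bits (0.5 bytes per parameter).
estimates = {
    name: decode_tokens_per_second(bw, params_billion=70, bytes_per_param=0.5)
    for name, bw in tiers.items()
}

for name, tps in estimates.items():
    print(f"{name}: at most ~{tps:.1f} tokens/s")
```

Under these assumptions the unified pool lands between the other two, but a 35 GB model would not fit in most discrete cards' VRAM in the first place, which is the trade-off being discussed.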

fg137 2 hours ago||

I don't think this is going to get any traction in the general consumer world; it will be even less relevant than Apple Vision Pro.

(HN reaction to Vision Pro back in 2024 is almost hilarious if not ridiculous, looking at it today. I knew it would be a flop and I was so right.)

seanalltogether 5 hours ago||

Is it really unified memory? AMD Strix Halo is "unified" but you still have to allocate memory separately for cpu vs gpu. Apple Silicon is true unified memory.

flakiness 5 hours ago||

My understanding is that this is a limitation of Windows, not of the AMD SoC. There are several internet resources on how to "enable unified memory support" on Linux, e.g. [1].

As a side note, Qualcomm chipsets on Android have been doing this for years (like Apple), so it's not a super unique thing. It's more that there was no need before.

[1] https://www.jeffgeerling.com/blog/2025/increasing-vram-alloc...

kimixa 4 hours ago||

Even then the "reserved" section is a carved-out, guaranteed chunk for stuff that might need contiguous physical memory (display scan-out buffers and page tables, for example).

The GPU can still happily use all the rest of the memory for other use cases - which tend to be the bulk of allocations anyway. Though there might be performance implications - for example "moving" buffer ownership to the GPU would need to evict CPU caches, and often 4k pages and TLB lookups can be a pretty inefficient situation for GPU-style accesses.

That's been pretty standard for any SoC for decades. And "differences" to Apple's SoC are more implementation details.
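To put rough numbers on the 4k page point, a small sketch counting how many page mappings a GPU buffer needs at different page sizes. The 1 GiB buffer and the 2 MiB large-page size are assumptions for illustration; actual GPU MMU page sizes vary by vendor:

```python
def pages_needed(buffer_bytes, page_bytes):
    """Number of pages (and so translations) needed to map a buffer."""
    return -(-buffer_bytes // page_bytes)  # ceiling division


GIB = 1 << 30
buffer_bytes = 1 * GIB  # assumed buffer size

small_pages = pages_needed(buffer_bytes, 4 * 1024)         # 4 KiB CPU-style pages
large_pages = pages_needed(buffer_bytes, 2 * 1024 * 1024)  # assumed 2 MiB pages

print(f"4 KiB pages: {small_pages} mappings")
print(f"2 MiB pages: {large_pages} mappings")
```

The fewer mappings a streaming access has to walk, the less pressure on the TLB, which is one reason a carve-out with large contiguous pages can still be attractive.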

Keyframe 5 hours ago|||

Yes, but more due to OS limitations than hardware. You can use their GTT, which is then _true_ UMA where the GPU can grab whatever it wants from the memory pool.

This isn't the first time we've had UMA on the PC, btw. When SGI did their PC workstations, the 320 and 540 had what they called the Cobalt graphics chipset and a crossbar with their IVC architecture. They bypassed AGP completely at the time. It was quite unique to see strict UMA on a PC. Haven't seen it since, until these new systems we're seeing now on PCs and Macs.

eigenspace 5 hours ago|||

That's a software question, not a hardware question.

Some software assumes pre-defined set-aside pools of memory reserved for video purposes, but the chip does actually have access to the whole pool.

SwtCyber 5 hours ago|||

For local models, the useful part is not just having 128GB attached to the package. It is whether the GPU can practically use that memory without the usual VRAM-style constraints

glitchc 5 hours ago|||

Memory bandwidth is what matters, unified or otherwise. Discrete GPUs don't have unified memory either.

ApatheticCosmos 5 hours ago|||

Strix Halo is unified memory. The memory allocation set in the BIOS is overridden by the operating system if it has the capability.

fc417fc802 5 hours ago|||

> you still have to allocate memory separately for cpu vs gpu

That's an API issue not a hardware issue. Regardless, I believe the major APIs permit seamlessly sharing pointers at this point? (I have no experience doing that though.)

joe_mamba 5 hours ago|||

>AMD Strix Halo is "unified" but you still have to allocate memory separately for cpu vs gpu.

IIRC that's to maintain backwards compatibility with BIOS and Windows (+ games & apps), but memory access speeds are the same.

ankurdhama 5 hours ago||

It is unified in the sense that the OS can dynamically assign memory to the CPU and GPU. Apple silicon is not alien tech that other silicon vendors cannot implement.

ozgrakkurt 3 hours ago||

Says running local LLMs isn’t relevant. Then says it is decent for games, which is just correct if you compare any GPU remotely similarly priced. I don’t understand what point he is making.

Waterluvian 5 hours ago||

It’s an opportunity for them to start doing away with the whole ATX thing where owners had freedom to mix and match at their own pleasure.

burnt-resistor 2 hours ago|

They'll ship a welded-shut box that requires an activation key to power on. Users will get to pick which color sleeve it uses, though.

derefr 2 hours ago||

> The game changer is the unified 128 GB memory. That is the path Apple took years ago. Instead of separate memory for the CPU and GPU, everything shares a single pool. It is increasingly popular.

> The memory is not as fast as dedicated GPU memory, but it is cheap enough while delivering enough bandwidth to run AI models locally.

So, the reason "dedicated GPU memory" is fast, isn't because it's "dedicated"; it's because the types of memory built into GPU cards — GDDR and HBM — are designed for throughput over latency.

Which is to say, GDDR and HBM memory could be shared with the CPU in UMA while still being "fast" (for GPU use-cases.) In fact, the PS4/5 and Xbox 360 / One X / Series consoles have UMA architectures that use GDDR memory as their main memory, with no regular DDR memory to be found.

What I don't understand: why don't we see UMA architectures where there's both regular DDR and GDDR/HBM memory mapped into the address space of the CPU+GPU? That seems like the best of both worlds: you'd have some memory that's "tuned" for random-access CPU usage (regular DDR), and some memory that's "tuned" for streaming GPU usage (GDDR/HBM), but either type of memory can still be put to the use it wasn't "tuned" for, just with slightly-worse performance.

I guess you'd need to do a bit of software work:

1. a bit of work in the OS kernel / malloc library to get CPU workloads to "prefer" allocating DDR memory over the GDDR/HBM memory until they've exhausted DDR memory (or maybe not, if you just tell the kernel the GDDR/HBM memory is something like a zswap thinpool);

2. and a bit of work in supported ML frameworks, to teach them about a hybrid strategy between UMA "allocate anywhere, it's all the same" and NUMA "keep assets in VRAM if possible; if you spill assets to RAM, then they must stream into VRAM on access" (i.e. "at allocation time, allocate as if the system were NUMA, VRAM first then spilling to RAM; but at execution time, use the UMA codepaths, no need to copy RAM into VRAM.")

...but once that's done, it's done.
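A toy sketch of the policy in (1) and (2), assuming two pools mapped into one address space where either kind of workload can spill into the other pool. The `Pool` class, the `allocate` function, and the pool sizes are hypothetical, not any real kernel or ML framework API:

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    capacity_gb: int
    used_gb: int = 0

    @property
    def free_gb(self) -> int:
        return self.capacity_gb - self.used_gb


def allocate(size_gb, preference):
    """Place an allocation across pools in preference order, spilling as needed.

    Returns a list of (pool name, GB placed there). Nothing is copied later:
    both pools are assumed to be directly addressable by CPU and GPU.
    """
    if size_gb > sum(pool.free_gb for pool in preference):
        raise MemoryError("not enough memory across all pools")
    placement = []
    remaining = size_gb
    for pool in preference:
        take = min(remaining, pool.free_gb)
        if take > 0:
            pool.used_gb += take
            placement.append((pool.name, take))
            remaining -= take
        if remaining == 0:
            break
    return placement


ddr = Pool("DDR", capacity_gb=64)        # assumed size, latency-tuned
fast = Pool("GDDR/HBM", capacity_gb=32)  # assumed size, throughput-tuned

# CPU workloads prefer DDR first; ML assets prefer the fast pool first.
cpu_heap = allocate(16, preference=[ddr, fast])
model_weights = allocate(40, preference=[fast, ddr])

print(cpu_heap)
print(model_weights)
```

Here the 40 GB of weights fill the fast pool and the last 8 GB spill into DDR, where the GPU would read them in place at lower bandwidth instead of streaming them into VRAM first.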

alberth 5 hours ago|

Is this essentially an Apple M-Series chip in concept?
