Posted by albelfio 20 hours ago
Can they/someone else give more details on which workloads PyTorch is more than 2x slower than what the hardware provides? Most of the papers use standard components, and I assume PyTorch already implements them at 50+% of the extractable performance of typical GPUs.
If they mean more esoteric stuff that requires writing custom kernels to get good performance out of the chips, then that's a different issue.
* RAM - $1500 - Crucial Pro 128GB Kit (2x64GB) DDR5 RAM, 5600MHz CP2K64G56C46U5, 2 sticks for 128GB or 4 for 256GB, Amazon
* GPU - $4700 - RTX Pro 5000 48GB, Microcenter
* CPU/Mobo bundle - $1100 - AMD Ryzen 7 9800X3D, MSI X870E-P Pro, ditch the 32GB RAM, Microcenter
* Case - $220, Hyte Y70, Microcenter
* Cooler - $155, Arctic Cooling Liquid Freezer III Pro, top-mount it, Microcenter
* PSU - $180, RM1000x, Microcenter
* SSD - $400 - Samsung 990 Pro 2TB Gen 4 NVMe M.2
* Fans - $100 - 6x 120mm fans, 1x 140mm fan, of your choice
Look into models like Qwen 3.5
This is certainly not the most effective use of $7k for running local LLMs.
The answer is a 16" M5 Max 128GB for $5k. You can run much bigger models than your setup while being an awesome portable machine for everything else.
https://marketplace.nvidia.com/en-us/enterprise/personal-ai-...
A small joke at this week's GTC was that the "BOGOD" discount was to sell them at $4K each...
Machines with the 4xx chips are coming next month so maybe wait a week or two.
It's soldered LPDDR5X with AMD Strix Halo ... sglang and llama.cpp can handle that pretty well these days. And it's, you know, half the price, and you're not locked into the Nvidia ecosystem.
You can check what each model does on AMD Strix halo here:
Mac Studio or Mac Mini, depending on which gives you the highest amount of unified memory for ~$5k.
I’m pretty curious to see any benchmarks on inference on VRAM vs UM.
Raptor Lake + 5080: 380.63 GB/s
Raptor Lake (CPU for reference): 20.41 GB/s
GB10 (DGX Spark): 116.14 GB/s
GH200: 1697.39 GB/s
This is an "eh, it works" benchmark, but it should give you a feel for the relative performance of the different systems. In practice, this means I can get something like 55 tokens a sec running a larger model like gpt-oss-120b-Q8_0 on the DGX Spark.
55 t/s is much better than I could expect.
So LLM inference is relatively slow because of that bandwidth, but you can load much bigger, smarter models than you could on any consumer GPU.
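The bandwidth-bound intuition above can be sketched numerically: during single-stream decoding, each generated token streams the model's active weights from memory once, so bandwidth divided by active weight size gives a rough ceiling on tokens/s. This is an illustrative back-of-envelope helper, not a measured benchmark; the example numbers are assumptions.

```python
def est_tokens_per_sec(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Upper bound on decode speed for a memory-bandwidth-bound LLM:
    each generated token streams the active weights from memory once."""
    return bandwidth_gb_s / active_weights_gb

# Illustrative only: a box with ~116 GB/s of bandwidth decoding a model
# that touches ~2 GB of weights per token tops out around 58 tokens/s;
# real throughput is lower once KV-cache traffic is included.
print(est_tokens_per_sec(116.14, 2.0))
```

Real models complicate this (MoE models only touch a subset of weights per token), but the bandwidth-over-bytes ratio is why the GH200's ~1700 GB/s and the Spark's ~116 GB/s feel so different at the same model size.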
Nowadays I find most things work fine on Arm. Sometimes something needs to be built from source which is genuinely annoying. But moving from CUDA to ROCm is often more like a rewrite than a recompile.
Isn't everyone* in this segment just using PyTorch for training, or wrappers like Ollama/vLLM/llama.cpp for inference? None have a strict dependency on CUDA. PyTorch's AMD backend is solid (for supported platforms, and Strix Halo is supported).
* enthusiasts whose budget is in the $5k range. If you're vendor-locked to CUDA, Mac Mini and Strix Halo are immediately ruled out.
For $5k one can get a desktop PC with an RTX 5090, which has 3x more compute but 4x less VRAM - so depending on the workload it may be a better option.
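Whether the 5090's extra compute actually helps depends on which limit the workload hits. A minimal roofline-style sketch (hypothetical helper, made-up numbers) makes the "depending on the workload" point concrete:

```python
def decode_step_time(active_bytes: float, flops: float,
                     mem_bw_bytes_s: float, peak_flops_s: float) -> float:
    """Roofline-style lower bound: a step takes at least as long as the
    slower of (a) streaming the weights and (b) doing the arithmetic."""
    return max(active_bytes / mem_bw_bytes_s, flops / peak_flops_s)

# Single-stream decoding is usually limited by (a), so 3x more compute
# buys little; large-batch prefill or training can be limited by (b),
# where the 5090's compute advantage pays off.
```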
Obviously any Turing machine can run any size of model, so the "120B" claim doesn't mean much - what actually matters is speed, and I just don't believe this can be speedy enough on models that my $5,000 5090-based PC is too slow for and lacks enough VRAM for.
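The VRAM side of that argument is easy to ballpark: weights alone need roughly parameter count times bytes per weight, before any KV cache or activations. A rough sketch (hypothetical helper, weight-only estimate):

```python
def weights_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Rough weight-only memory footprint; ignores KV cache,
    activations, and framework overhead, all of which add more."""
    return n_params_billion * bits_per_weight / 8

# A dense 120B model at 8 bits needs ~120 GB just for weights, far
# beyond a single 5090's 32 GB; even at 4 bits it's ~60 GB.
```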
120B could run, but I wouldn't want to be the person who had to use it for anything.
To be fair, the 120B claim doesn't appear on the webpage. I don't know where it came from, other than the person who submitted this to HN
Also, nobody is comparing this box to a $10M Nvidia rack-scale deployment. They're comparing it to putting all of the same parts into their Newegg basket and putting it together themselves.
A single box with those specs, without having to build/configure it yourself (the red and green) - I could see that being useful if you had money but not time to build/configure/etc. yourself.
I could swear I filed a GitHub issue asking about the plans for that but I don't see it. Anyway I think he mentioned it when explaining tinygrad at one point and I have wondered why that hasn't got more attention.
As far as boxes, I wish that there were more MI355X available for normal hourly rental. Or any.
the boxes look cool but how good are they really? the cheapest box seems pricey at 12 for what is essentially a few gaming gpus. i dont see why you couldnt make that like half the price. you could do a PC/server build thats much much faster for way less. size doesnt matter if its more than twice the price i think...
the more expensive box at least has real processing gpus, but afaik also not very popular ones; this one seems maybe more fairly priced (there seems to be a big difference in bang for buck between these???).
the third one suggested looks like a joke.
dont get me wrong, this seems like a really cool idea. But i dont see it taking off as the prices are corporate but the product seems more home use.
maybe in time they will find a better balance. i do respect the fact that the component market now is sour as hell and making good products with stable prices is pretty much impossible.
id love one of these machines someday, maybe when i am less poor, or when they are xD.
(love the styling of everything, this is the most critical i could be from a dumb consumer perspective, which i totally am btw.)
The point is that they care now.
Not surprising. True, the ecosystem is like early OSX vs. Windows. Eventually it'll get ported over if there is demand.
But even in the AMD stack, for things like CK and AITER, consumer cards are not even second-class citizens. They are a distant third at best. If you just want to run vLLM with the latest model, if you can get it running at all there are going to be paper cuts all along the way, and even then the performance won't be close to what you could be getting out of the hardware.