Posted by albelfio 20 hours ago
Can they/someone else give more details on which workloads PyTorch is more than 2x slower than what the hardware provides? Most of the papers use standard components, and I assume PyTorch already implements them at 50+% of the extractable performance of typical GPUs.
If they mean more esoteric stuff that requires writing custom kernels to get good performance out of the chips, then that's a different issue.
* RAM - $1500 - Crucial Pro 128GB Kit (2x64GB) DDR5 RAM, 5600MHz CP2K64G56C46U5, 2 sticks for 128GB or 4 for 256GB, Amazon
* GPU - $4700 - RTX Pro 5000 48GB, Microcenter
* CPU/Mobo bundle - $1100 - AMD Ryzen 7 9800X3D, MSI X870E-P Pro, ditch the 32GB RAM, Microcenter
* Case - $220, Hyte Y70, Microcenter
* Cooler - $155, Arctic Cooling Liquid Freezer III Pro, top-mount it, Microcenter
* PSU - $180, RM1000x, Microcenter
* SSD - $400 - Samsung 990 Pro 2TB Gen 4 NVMe M.2
* Fans - $100 - 6x 120mm fans, 1x 140mm fan, of your choice
Look into models like Qwen 3.5
This is certainly not the most effective use of $7k for running local LLMs.
The answer is a 16" M5 Max 128GB for $5k. You can run much bigger models than your setup while being an awesome portable machine for everything else.
https://marketplace.nvidia.com/en-us/enterprise/personal-ai-...
A small joke at this week's GTC was that the "BOGOD" discount was to sell them at $4K each...
Machines with the 4xx chips are coming next month so maybe wait a week or two.
It's soldered LPDDR5X with AMD Strix Halo ... sglang and llama.cpp can handle that pretty well these days. And it's, you know, half the price, and you're not locked into the Nvidia ecosystem.
You can check what each model does on AMD Strix halo here:
Mac Studio or Mac Mini, depending on which gives you the highest amount of unified memory for ~$5k.
I’m pretty curious to see any benchmarks on inference on VRAM vs UM.
Raptor Lake + 5080: 380.63 GB/s
Raptor Lake (CPU for reference): 20.41 GB/s
GB10 (DGX Spark): 116.14 GB/s
GH200: 1697.39 GB/s
This is an "eh, it works" benchmark, but it should give you a feel for the relative performance of the different systems. In practice, this means I can get something like 55 tokens a sec running a larger model like gpt-oss-120b-Q8_0 on the DGX Spark.
55 t/s is much better than I could expect.
So LLM inference is relatively slow because of that bandwidth, but you can load much bigger, smarter models than you could on any consumer GPU.
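The bandwidth-bound intuition above can be sketched numerically: during single-stream decoding, each generated token streams the model's active weights from memory once, so bandwidth divided by active weight size gives a rough ceiling on tokens/s. This is an illustrative back-of-envelope helper, not a measured benchmark; the example numbers are assumptions.

```python
def est_tokens_per_sec(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Upper bound on decode speed for a memory-bandwidth-bound LLM:
    each generated token streams the active weights from memory once."""
    return bandwidth_gb_s / active_weights_gb

# Illustrative only: a box with ~116 GB/s of bandwidth decoding a model
# that touches ~2 GB of weights per token tops out around 58 tokens/s;
# real throughput is lower once KV-cache traffic is included.
print(est_tokens_per_sec(116.14, 2.0))
```

Real models complicate this (MoE models only touch a subset of weights per token), but the bandwidth-over-bytes ratio is why the GH200's ~1700 GB/s and the Spark's ~116 GB/s feel so different at the same model size.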
Nowadays I find most things work fine on Arm. Sometimes something needs to be built from source which is genuinely annoying. But moving from CUDA to ROCm is often more like a rewrite than a recompile.
Isn't everyone* in this segment just using PyTorch for training, or wrappers like Ollama/vLLM/llama.cpp for inference? None have a strict dependency on CUDA. PyTorch's AMD backend is solid (for supported platforms, and Strix Halo is supported).
* enthusiasts whose budget is in the $5k range. If you're vendor-locked to CUDA, Mac Mini and Strix Halo are immediately ruled out.
For $5k one can get a desktop PC with an RTX 5090, which has 3x more compute but 4x less VRAM - so depending on the workload it may be a better option.
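Whether the 5090's extra compute actually helps depends on which limit the workload hits. A minimal roofline-style sketch (hypothetical helper, made-up numbers) makes the "depending on the workload" point concrete:

```python
def decode_step_time(active_bytes: float, flops: float,
                     mem_bw_bytes_s: float, peak_flops_s: float) -> float:
    """Roofline-style lower bound: a step takes at least as long as the
    slower of (a) streaming the weights and (b) doing the arithmetic."""
    return max(active_bytes / mem_bw_bytes_s, flops / peak_flops_s)

# Single-stream decoding is usually limited by (a), so 3x more compute
# buys little; large-batch prefill or training can be limited by (b),
# where the 5090's compute advantage pays off.
```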
Obviously any Turing machine can run any size of model, so the "120B" claim doesn't mean much - what actually matters is speed, and I just don't believe this can be speedy enough on models that my $5,000 5090-based PC is too slow for and lacks enough VRAM for.
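The VRAM side of that argument is easy to ballpark: weights alone need roughly parameter count times bytes per weight, before any KV cache or activations. A rough sketch (hypothetical helper, weight-only estimate):

```python
def weights_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Rough weight-only memory footprint; ignores KV cache,
    activations, and framework overhead, all of which add more."""
    return n_params_billion * bits_per_weight / 8

# A dense 120B model at 8 bits needs ~120 GB just for weights, far
# beyond a single 5090's 32 GB; even at 4 bits it's ~60 GB.
```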
120B could run, but I wouldn't want to be the person who had to use it for anything.
To be fair, the 120B claim doesn't appear on the webpage. I don't know where it came from, other than the person who submitted this to HN
Also, nobody is comparing this box to a $10M Nvidia rack-scale deployment. They're comparing it to putting all of the same parts into their Newegg basket and putting it together themselves.
A single box with those specs, without having to build/configure it yourself (the red and green) - I could see that being useful if you had money but not time to build/configure/etc. yourself.
I could swear I filed a GitHub issue asking about the plans for that but I don't see it. Anyway I think he mentioned it when explaining tinygrad at one point and I have wondered why that hasn't got more attention.
As far as boxes, I wish that there were more MI355X available for normal hourly rental. Or any.
the boxes look cool but how good are they really? the cheapest box seems pricey at 12 for what is essentially a few gaming gpus. i dont see why you couldnt make that like half the price. you could do a PC/server build thats much much faster for way less. size doesnt matter if its more than twice the price i think...
the more expensive box at least has real processing gpus, but afaik also not very popular ones; this one seems maybe more fairly priced (there seems to be a big difference in bang for buck between these???).
the third one suggested looks like a joke.
dont get me wrong, this seems like a really cool idea. But i dont see it taking off as the prices are corporate but the product seems more home use.
maybe in time they will find a better balance. i do respect the fact that the component market now is sour as hell and making good products with stable prices is pretty much impossible.
id love one of these machines someday, maybe when i am less poor, or when they are xD.
(love the styling of everything, this is the most critical i could be from a dumb consumer perspective, which i totally am btw.)
The point is that they care now.
Not surprising. True, the ecosystem is like early OSX vs. Windows. Eventually it'll get ported over if there is demand.
But even in the AMD stack, for things like CK and AITER, consumer cards are not even second-class citizens. They are a distant third at best. If you just want to run vLLM with the latest model, if you can get it running at all there are going to be paper cuts all along the way, and even then the performance won't be close to what you could be getting out of the hardware.