Posted by rbanffy 5 days ago
I like doing development work on a Mac, but this has to be my biggest bugbear with the system.
But would it be possible to use RoCE with these boxes rather than RDMA over Thunderbolt? And what would the expected performance be? As I understand it, RDMA should be 7-10 times faster than going via TCP. And if I understand correctly, RoCE is RDMA over Converged Ethernet, so it uses Ethernet frames at a lower layer rather than TCP.
10G Thunderbolt adapters are fairly common, and you can find 40G and 80G Thunderbolt Ethernet adapters from Atto. Probably not cheap, but it would be fun to test! Even if the bandwidth is there, though, we might get killed by latency.
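The latency worry can be made concrete with a back-of-envelope sketch. All of the figures below are illustrative assumptions (the RTTs and link speeds are guesses for the sake of the comparison, not measurements of any real adapter):

```python
# Back-of-envelope: when does latency dominate over bandwidth?
# All link figures are illustrative assumptions, not measurements.

def transfer_time_us(payload_bytes, bandwidth_gbps, rtt_us):
    """One request/response over the link: serialization time plus round trip."""
    serialize_us = payload_bytes * 8 / (bandwidth_gbps * 1e3)  # Gb/s -> bits per microsecond
    return serialize_us + rtt_us

# Hypothetical links: 40G Ethernet over Thunderbolt (TCP) vs. 100G InfiniBand (RDMA).
tb_eth = dict(bandwidth_gbps=40, rtt_us=30)   # assumed TCP-stack round trip
ib     = dict(bandwidth_gbps=100, rtt_us=2)   # assumed RDMA round trip

for payload in (4 * 1024, 1024 * 1024):      # small message vs. a 1 MiB chunk
    t_eth = transfer_time_us(payload, **tb_eth)
    t_ib = transfer_time_us(payload, **ib)
    print(f"{payload:>8} B: TB-Ethernet {t_eth:7.1f} us, InfiniBand {t_ib:7.1f} us")
```

Under these assumed numbers, a 4 KiB message spends almost all its time in the round trip (serializing 4 KiB at 40 Gb/s takes under a microsecond), so a faster adapter barely helps; only at megabyte-sized transfers does raw bandwidth start to matter.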
Imagine this hardware with a PCIe slot. The InfiniBand hardware is there - then we "just" need the driver.
Then you _just_ need the driver. Fascinatingly, Apple already ships MLX5 drivers, which is crazy imo. I understand that's something they might need internally, but shipping it on iPadOS is wild. https://kittenlabs.de/blog/2024/05/17/25gbit/s-on-macos-ios/
InfiniBand is way faster and lower latency than a NIC (and these days, NIC == Ethernet).
Instead we get gimmicks over Thunderbolt.
It seems like every time someone does an AI hardware "review" we end up with figures for just a single instance, which simply isn't how the target demographic for a 40k cluster is going to be using it.
Jeff, I love reading your reviews, but can’t help but feel this was a wasted opportunity for some serious benchmarking of LLM performance.
Does anyone remember a guy here posting about linking Mac Studios with Thunderbolt for HPC/clustering? I wasn't able to find it with a quick search.
Edit: I think it was this?
I definitely would not be buying an M3 Ultra right now on my own dime.
I have an M4 Max I can use to bridge any gap...
Which I guess is the point of this for Apple, but still.
https://buildai.substack.com/p/kv-cache-sharding-and-distrib...
Texas Memory Systems was in the business of making large 'RAM drives'. Their "RamSan" product line made many gigabytes/terabytes of DDR available via a block storage interface over InfiniBand and Fibre Channel, with the control layer implemented in FPGA.
I recall a press release from 2004 which publicized the US govt purchase of a 2.5TB RamSan. They later expanded into SSDs and were acquired by IBM in 2012.
https://en.wikipedia.org/wiki/Texas_Memory_Systems
https://www.lhcomp.com/vendors/tms/TMS-RamSan300-DataSheet.p...
https://gizmodo.com/u-s-government-purchases-worlds-largest-...
https://www.lhcomp.com/vendors/tms/TMS-RamSan20-DataSheet.pd...
https://www.ibm.com/support/pages/ibm-plans-acquire-texas-me...
But the industry knows this, and there's a technology that is electrically compatible with PCIe and intended for use as RAM, among other things: CXL. I wonder if anyone will ever build CXL over USB-C.