Posted by tatef 1 hour ago
Joke aside (I do have them tho!), I don't think Optane is that much use (not to mention it is only 256GiB for my unit). It is a useful legacy crutch if you have legacy software that is not designed to issue multiple reads/writes in parallel. If your software does issue parallel I/O, Optane is really no faster than NVMe, especially modern drives.
The core insight: most of a model's weights aren't needed every token. For MoE models like Mixtral, only 2/8 experts fire per token. Hypura keeps the non-expert tensors (~1 GB) on Metal and streams expert data from NVMe through a small pool buffer. A neuron cache hits 99.5% after warmup, so steady-state NVMe I/O is near-zero. Vanilla llama.cpp OOMs on the same model — Metal counts the full mmap'd file against recommendedMaxWorkingSetSize even when only a fraction is GPU-offloaded.
For dense models (Llama 70B, 40 GB), it keeps attention+norms on GPU (~8 GB) and streams FFN tensors from NVMe with prefetch lookahead. Slower (0.3 tok/s), but the alternative is a crash.
Numbers on M1 Max 32 GB (~5.1 GB/s NVMe):
- Mixtral 8x7B Q5_K_M (31 GB): 2.2 tok/s. llama.cpp: OOM at any ngl setting.
- Llama 3.3 70B Q4_K_M (40 GB): 0.3 tok/s. llama.cpp: OOM.
- Qwen 2.5 14B Q4_K_M (8.4 GB): 12.3 tok/s. No overhead when the model fits in memory.
It also exposes an Ollama-compatible HTTP API (/api/chat, /api/generate), so it's a drop-in for anything that talks to Ollama.
Written in Rust, wraps llama.cpp via FFI, MIT licensed.
Honest disclosure: I directed the architecture and design decisions, but the code was largely written by LLMs (Claude). I used the Socratic method — asking questions, proposing approaches, evaluating tradeoffs — while the models did the implementation. I think this is worth being transparent about. The hunch that motivated it: NVMe-backed inference is underutilized despite being a slow but perfectly valid memory tier, especially on Apple Silicon where unified memory + fast SSDs are the norm.
Limitations I won't bury: dense FFN-streaming is I/O-bound (~50 ms per-layer stalls on each of 80 layers). Co-activation predictions need ~100 tokens to warm up. The optimize command rewrites the full model file. This is early and rough.
Happy to answer questions about the placement LP, the custom GGML buffer type, or what I learned about Metal's mmap behavior on Apple Silicon (it's weird).
I do wonder how the 'smarts' pan out in practice, because putting a ton of stress on your NVMe during generation is probably not the best choice for its longevity.
"overloading NVMe"? What is that about? First time I've heard anything about it.
> because putting a ton of stress on your NVMe during generation
This really shouldn't "stress your NVMe"; something is severely wrong if that's happening. I've been hammering my SSDs forever, and while write operations hurt the longevity of the flash cells themselves, the controller interface really shouldn't be affected by this at all, unless I'm missing something here.