Posted by tatef 1 hour ago
Joke aside (I do have them tho!), I don't think Optane is that much use (not to mention it is only 256GiB for my unit). It is a useful legacy crutch if you have legacy software that is not designed to issue multiple reads/writes in parallel. If your software does issue parallel I/O, Optane is really no faster than NVMe, especially modern drives.
The core insight: most of a model's weights aren't needed every token. For MoE models like Mixtral, only 2/8 experts fire per token. Hypura keeps the non-expert tensors (~1 GB) on Metal and streams expert data from NVMe through a small pool buffer. A neuron cache hits 99.5% after warmup, so steady-state NVMe I/O is near-zero. Vanilla llama.cpp OOMs on the same model — Metal counts the full mmap'd file against recommendedMaxWorkingSetSize even when only a fraction is GPU-offloaded.
For dense models (Llama 70B, 40 GB), it keeps attention+norms on GPU (~8 GB) and streams FFN tensors from NVMe with prefetch lookahead. Slower (0.3 tok/s), but the alternative is a crash.
Numbers on M1 Max 32 GB (~5.1 GB/s NVMe):
- Mixtral 8x7B Q5_K_M (31 GB): 2.2 tok/s. llama.cpp: OOM at any ngl setting.
- Llama 3.3 70B Q4_K_M (40 GB): 0.3 tok/s. llama.cpp: OOM.
- Qwen 2.5 14B Q4_K_M (8.4 GB): 12.3 tok/s. No overhead when the model fits in memory.
It also exposes an Ollama-compatible HTTP API (/api/chat, /api/generate), so it's a drop-in for anything that talks to Ollama.
Written in Rust, wraps llama.cpp via FFI, MIT licensed.
Honest disclosure: I directed the architecture and design decisions, but the code was largely written by LLMs (Claude). I used the Socratic method — asking questions, proposing approaches, evaluating tradeoffs — while the models did the implementation. I think this is worth being transparent about. The hunch that motivated it: NVMe-backed inference is underutilized despite being a slow but perfectly valid memory tier, especially on Apple Silicon where unified memory + fast SSDs are the norm.
Limitations I won't bury: dense FFN-streaming is I/O-bound (~50 ms per-layer stalls on each of 80 layers). Co-activation predictions need ~100 tokens to warm up. The optimize command rewrites the full model file. This is early and rough.
Happy to answer questions about the placement LP, the custom GGML buffer type, or what I learned about Metal's mmap behavior on Apple Silicon (it's weird).
I do wonder how the 'smarts' pan out in practice, because putting a ton of stress on your NVMe during generation is probably not the best choice for its longevity.
"overloading NVMe"? What is that about? First time I've heard anything about it.
> because putting a ton of stress on your NVMe during generation
This really shouldn't "stress your NVMe"; something is severely wrong if that's happening. I've been hammering my SSDs forever, and while write operations hurt the longevity of the flash cells themselves, the controller interface really shouldn't be affected by this at all, unless I'm missing something here.