
Posted by mft_ 9 hours ago

Flash-MoE: Running a 397B Parameter Model on a Laptop (github.com)
234 points | 88 comments | page 2
mkw 5 hours ago|
TLDR I took a stab at leveraging Dan's work and making it more practical:

https://github.com/matt-k-wong/mlx-flash

2-bit quantization lobotomizes the model, but it's impressive nonetheless! Maybe one day we'll have intelligent 2-bit quants... I wonder.

My version supports 4-bit quantization, hybrid streaming (disk + RAM), and arbitrary model compatibility (tested on Mamba2), and it sets up the framework for LM Studio integration.

I leveraged this work (credit to Danveloper) and am in the middle of making it work on more practical models and quants. It still uses flash streaming, but with a control knob so you can choose how much or how little RAM to use. In the most extreme case it uses as little RAM as possible but is very slow; in the balanced case you use some RAM and it's much faster.

I designed it around the intelligence-dense Nemotron 3 Nano 30B and Nemotron Cascade 2 30B models (which are smaller, with higher intelligence density) so it can run on low-end 16GB machines, though you can run arbitrarily large models on bigger machines (designed for the very low end, but capable of the high end).
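A RAM budget knob like the one described could plausibly be an LRU cache over experts: keep as many as the budget allows resident, stream the rest from disk. This is a minimal sketch of that idea, not code from the repo; `ExpertCache` and `load_fn` are hypothetical names.

```python
from collections import OrderedDict

class ExpertCache:
    """Hypothetical sketch: cache experts in RAM up to a byte budget,
    stream everything else from disk (the slow path)."""

    def __init__(self, ram_budget_bytes, expert_nbytes):
        self.capacity = max(1, ram_budget_bytes // expert_nbytes)
        self.cache = OrderedDict()  # expert_id -> weights, in LRU order

    def get(self, expert_id, load_fn):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # mark as recently used
            return self.cache[expert_id]
        weights = load_fn(expert_id)           # stream from disk
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least-recently-used
        return weights
```

Turning the budget up means more experts stay resident and fewer token steps hit the disk path, which matches the "balanced case is much faster" behavior described above.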

maxloh 7 hours ago||
Can you add a license to the repo? Legally, we can't run any code that doesn't have a license attached to it.
Wowfunhappy 5 hours ago|
...you can't redistribute code without a license, but surely you can legally run it, can't you?

Like, if I write a blog post and put it on my blog, you're allowed to read it, right?

Heck, if my blog contains some Javascript code I wrote, I would imagine your web browser is allowed to run that code without opening you up to copyright infringement, even if I didn't provide an explicit license.

haomingkoo 6 hours ago||
Really interesting approach. Curious how the 2-bit quantization affects the model's reasoning ability on longer chains of thought vs. shorter prompts. The benchmarks look solid, but real-world usage seems like a different story based on the comments here.
m-hodges 6 hours ago||
As frontier models get closer and closer to consumer hardware, what's the moat for the API-driven $trillion labs?
stri8ted 6 hours ago||
48 GB is not consumer hardware. But fundamentally, there are economies of scale due to batching, power distribution, better utilization, etc., which mean data center tokens will be cheaper. Also, as the cost of training (frontier) models increases, it's not clear the Chinese companies will continue open-sourcing them. Notice, for example, that Qwen-Max is not open source.
zozbot234 6 hours ago|||
Nothing obviously prevents using this approach, e.g. for 3B-active or 10B-active models, which do run on consumer hardware. I'd love to see how the 3B performs with this on the MacBook Neo, for example. More relevantly, data-center scale tokens are only cheaper for the specific type of tokens data centers sell. If you're willing to wait long enough for your inferences (and your overall volume is low enough that you can afford this) you can use approaches like OP's (offloading read-only data to storage) to handle inference on low-performing, slow "edge" devices.
WesolyKubeczek 42 minutes ago||||
It is consumer hardware in the sense that MacBook Pros come with this RAM size, and you can buy them as a consumer without having to sign a special B2B contract, show that your company is big and reputable enough, or order a minimum of 10 or 100 units.
m-hodges 3 hours ago|||
> 48 GB is not consumer hardware.

It’s a MacBook.

OJFord 6 hours ago|||
Assuming 'moat' – they'll push the frontier forward; they don't really have to worry until progress levels off.

At that point, I suppose there's still paid harnesses (people have always paid for IDEs despite FOSS options) partly for mindshare, and they could use expertise & compute capacity to provide application-specific training for enterprises that need it.

BoredomIsFun 6 hours ago||
> the API-driven $trillion labs?

here we go: https://huggingface.co/collections/trillionlabs/tri-series

mannyv 4 hours ago||
Everyone is focused on the bad 2-bit result, but who cares? He says don't use it because it's bad.
Aurornis 2 hours ago|
If you don’t care about the output, why not reduce to 1-bit and only 1 active expert? It will be completely useless but it will be faster!
383toast 6 hours ago||
Yeah, 4 tok/s is kinda unusable though.
matchbox 3 hours ago||
this is awesome Dan!
spwa4 7 hours ago||
Does this mean it should be possible to load up a system with ~10 SSDs (which seems to be at least the number of active experts) to get 40 tok/s even on truly gigantic models?
zozbot234 7 hours ago|
SSD bandwidth will ultimately be limited by the amount of PCIe lanes you have available (for something other than the Apple Silicon internal storage). So the approach has inherent limitations. You can of course scale out to multiple systems to get more throughput.

You can use this approach with Intel Optane, which is wearout-resistant unlike NAND and can thus substitute for RAM. Last I checked, it was available quite cheap on the secondary market, ~$1/GB as opposed to ~$15/GB or more for DRAM. (Of course that's nowhere near as cheap as NAND, which is around ~$0.1/GB but quite wearout-prone with heavy writes.)
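The bandwidth limit zozbot234 describes can be made concrete with back-of-envelope arithmetic: with streaming, decode speed is bounded by link bandwidth divided by the weight bytes that must be read per token. The numbers below (20B active parameters, ~7 GB/s per NVMe drive) are illustrative assumptions, not figures from the post.

```python
def max_tokens_per_sec(active_params_billion, bits_per_weight, link_gb_per_s):
    """Upper bound on decode speed when every token must stream its
    active weights over a link of the given bandwidth."""
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return link_gb_per_s * 1e9 / bytes_per_token

# Hypothetical 20B-active MoE at 4-bit: 10 GB of weights read per token.
one_ssd  = max_tokens_per_sec(20, 4, 7)    # one ~7 GB/s NVMe drive
ten_ssds = max_tokens_per_sec(20, 4, 70)   # ten drives with ideal scaling
```

Under these assumptions, one drive caps out at ~0.7 tok/s and ten drives at ~7 tok/s, so hitting 40 tok/s would need either far more aggregate PCIe bandwidth than consumer boards expose or far fewer bytes read per token (e.g. hot experts cached in RAM).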

spwa4 6 hours ago||
Yeah, PCIe is the bottleneck. The point being that since the data crosses PCIe either way, whether it originates from RAM or from NVMe or Optane, you cannot get data to the GPU faster from RAM than from SSDs.

Meanwhile PCIe switches exist. So why not build:

1 CPU + memory + ...

N PCIe switches, each with 1 low-memory GPU + 6 NVMe drives (in theory 5 can saturate the GPU)

Each of those should only bother the CPU when they have some tokens produced and have plenty of PCIe lanes to get at their data.

Such a setup should be able to get a 6-to-8x speedup over the solution detailed here, and an increase in model compute should make relatively little difference in performance.
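The "5 drives can saturate the GPU" claim checks out with rough PCIe 4.0 lane arithmetic (the per-lane and efficiency figures below are approximations of usable throughput, not spec maxima):

```python
LANE_GB_S = 1.9                   # ~usable per-lane throughput, PCIe 4.0
gpu_link  = 16 * LANE_GB_S        # x16 GPU link behind the switch, ~30 GB/s
nvme      = 4 * LANE_GB_S * 0.9   # x4 drive at ~90% efficiency, ~6.8 GB/s
drives_needed = gpu_link / nvme   # drives required to saturate the GPU link
```

This comes out to ~4.4, i.e. five drives saturate one x16 link, and the sixth drive is headroom. With N such switch nodes, aggregate streaming bandwidth scales roughly N-fold, which is where the 6-to-8x estimate comes from, assuming the CPU-side uplink only carries occasional token traffic and never becomes the bottleneck.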

lostmsu 7 hours ago|
How large is the KV cache?
xbar 7 hours ago|
0.1 GB per full-attention layer and "The model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention." So, 1.5 GB.
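Spelling out that arithmetic (the 0.1 GB/layer figure is taken from the comment above; in practice it would depend on context length):

```python
# Only the full-attention layers carry a growing KV cache; the
# GatedDeltaNet (linear attention) layers keep fixed-size recurrent state.
total_layers = 60
gated_deltanet_layers = 45
full_attention_layers = total_layers - gated_deltanet_layers  # 15
kv_gb_per_layer = 0.1
kv_cache_gb = full_attention_layers * kv_gb_per_layer         # 1.5 GB
```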