
Posted by mft_ 12 hours ago

Flash-MoE: Running a 397B Parameter Model on a Laptop (github.com)
278 points | 97 comments
vilequeef 11 hours ago||
Why so much RAM?
vilequeef 11 hours ago|
Oh Mac, unified. Sometimes it takes a downvote
harshhhhhhhhh 11 hours ago||
Seems promising, this is the way. Can someone benchmark this?
frwickst 11 hours ago|
I'm getting 6.55t/s using the Qwen3.5-397B-A17B-4bit model with the command: ./infer --prompt "Explain quantum computing" --tokens 100

MacBook Pro M5 Pro (64GB RAM)

j45 10 hours ago|||
Appreciate the data point. M5 Max would also be interesting to see once available in desktop form.
logicallee 11 hours ago|||
can you post the final result (or as far as you got before you killed it) to show us how cohesive and good it is? I'd like to see an example of the output of this.
frwickst 11 hours ago||
Since the output is quite long, here is a link: https://pastebin.com/k76wiVGP
hrimfaxi 11 hours ago||
Why does this G character appear to prefix most of the output? ("Ġlike")
frwickst 10 hours ago|||
Most likely a tokenizer artifact (https://github.com/huggingface/transformers/issues/4786). The output is not being properly decoded in this case; it should just be a space.
kgeist 10 hours ago|||
The original tokens have Ġ instead of space. I had this issue too when writing an inference engine for Qwen. You have to "normalize" those special characters.
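For anyone curious where the Ġ comes from: GPT-2-style byte-level BPE maps every byte to a printable character, and the space byte (0x20) lands on 'Ġ' (U+0120). A minimal decoding sketch (my own illustration, not code from the project):

```python
# Byte-level BPE (GPT-2 style) maps each byte to a printable
# stand-in character; space (0x20) becomes 'Ġ' (U+0120).
# Decoding must invert that mapping to recover readable text.

def bytes_to_unicode():
    # Printable bytes map to themselves; everything else is
    # shifted above U+0100 so every byte has a visible stand-in.
    bs = list(range(ord("!"), ord("~") + 1)) + \
         list(range(ord("¡"), ord("¬") + 1)) + \
         list(range(ord("®"), ord("ÿ") + 1))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

def decode_tokens(tokens):
    # Invert the byte->char table, then decode the bytes as UTF-8.
    char_to_byte = {c: b for b, c in bytes_to_unicode().items()}
    raw = bytes(char_to_byte[ch] for tok in tokens for ch in tok)
    return raw.decode("utf-8")

print(decode_tokens(["Ġlike", "Ġthis"]))  # " like this"
```

If the inference engine skips this inversion step, the raw stand-in characters (Ġ and friends) leak straight into the output, which is exactly what the pastebin shows.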
rvz 11 hours ago|
The technical write up is great, but Mac users should not get too excited just yet on running 300B+ parameter models locally as the TPS isn't that good.

>...at 4.4+ tokens/second

And that is even with 4-bit quantization.

> The entire 209GB model streams from SSD through a custom Metal compute pipeline.

This is my main problem.

If I were to run this on a Mac SSD, 24/7 for heavy usage such as Openclaw, that is going to significantly reduce the lifetime of the SSD.

Can't imagine using this in the long term right now, but improvements will follow. Still a great write up anyways.

Roxxik 11 hours ago||
Does an SSD meaningfully degrade by read only workloads?
JSR_FDED 11 hours ago||
Nope, reads don’t cause wear
zozbot234 10 hours ago||
No appreciable wear of course, but read disturb (requiring occasional rewrites) becomes more of an issue as NAND fabrication advances.
etiam 11 hours ago|||
> If I were to run this on a Mac SSD, 24/7 for heavy usage such as Openclaw, that is going to significantly reduce the lifetime of the SSD.

How sure are you about that? I've never looked closely at how a large mixture-of-experts LLM switches between expert modules, but if the usage stays on roughly the same topic (as it often would when editing the same codebase), I wouldn't be surprised if the switches in composition are fairly rare and fairly small, and to the extent switching happens, it causes repeated reads from the flash disk rather than writes.

frotaur 11 hours ago||
Afaik the experts are not usually very interpretable, and I'd generally be surprised if at least one didn't change every token. I don't know what happens in practice, but I do know that during training, nothing is done to minimize the number of expert switches between tokens.
etiam 2 hours ago||
I'd have thought at least a tiny explicit penalty term for switching, to discourage messing around with the composition without any expected gains from it.

If one is to use these on hardware that can't keep everything loaded, I guess someone should examine how it works out in practice. Interpretability may be too much to ask, but I can't spontaneously see any reason why the experts can't at least be pushed to incorporate what's needed to remain the good choice for a longer segment.

zozbot234 2 hours ago||
The switching is done by layer, not just per token. Every layer loads completely different parameters, so you don't really benefit from continuity. You're generally better off shifting this work to the CPU, since CPU RAM is more abundant than the GPU's VRAM, hence it matters less that so much of it is "wasted" on inactive expert layers. Disk storage is even more abundant, so offloading experts to disk if you can't keep them in RAM (as OP does) is the next step.
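To make the per-layer point concrete, here's a toy sketch (my own, not the project's code; all names and shapes are hypothetical) of top-k MoE routing: each layer has its own router, so the set of "hot" experts can change at every layer for every token.

```python
# Toy per-layer top-k MoE routing: each layer routes
# independently, so there is no cross-layer continuity
# in which expert parameters must be resident.
import numpy as np

rng = np.random.default_rng(0)

N_LAYERS, N_EXPERTS, TOP_K, D = 4, 8, 2, 16

# One router weight matrix per layer (hypothetical shapes).
routers = [rng.standard_normal((D, N_EXPERTS)) for _ in range(N_LAYERS)]

def experts_per_layer(token_vec):
    """Return the top-k expert ids chosen at each layer for one token."""
    chosen = []
    for W in routers:
        logits = token_vec @ W                    # router scores
        top = np.argsort(logits)[-TOP_K:][::-1]   # top-k experts
        chosen.append(sorted(top.tolist()))
        # (a real model would now run those experts' FFNs;
        #  here we only care about which ones get picked)
    return chosen

tok = rng.standard_normal(D)
print(experts_per_layer(tok))  # e.g. a different expert pair per layer
```

With independent routers like this, "staying on topic" gives no guarantee that the same expert weights stay hot from one layer to the next, which is why streaming from disk per layer ends up being the workable fallback.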
Wowfunhappy 11 hours ago|||
Eh. I mean, 4 tokens a second works fine if you're patient. Go do something else while you wait.

I feel like whenever I'm trying to find information on which local models will work on my hardware, I have to overestimate because people don't know how to wait for things.

Also, reading data doesn't cause SSD wear.

hrmtst93837 11 hours ago||
If you want decent throughput and don't care about burning SSD write cycles on a box that was never meant to act as a tiny inference server, a used server with actual RAM is still the cheaper and less silly option. I wouldn't expect Apple's warranty team to be much help.
K0balt 11 hours ago||
Is it doing a bunch of ssd writes?
mkw 8 hours ago||
stream from the SSD, perform the calculation, discard, repeat
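Right, and that loop never issues a write. A tiny mmap sketch of the same read-only streaming pattern (hypothetical file and chunk size, nothing from the actual project):

```python
# Read-only streaming: mmap the weight file, touch a chunk,
# compute, discard, repeat. No write traffic ever hits the SSD.
import mmap
import os
import tempfile

import numpy as np

# Stand-in "weight file" (the real model file is ~209 GB).
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
np.arange(1024, dtype=np.float32).tofile(path)

CHUNK = 256 * 4  # bytes per streamed chunk (hypothetical size)

with open(path, "rb") as f, \
     mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    total = 0.0
    for off in range(0, len(mm), CHUNK):
        chunk = np.frombuffer(mm[off:off + CHUNK], dtype=np.float32)
        total += float(chunk.sum())   # "perform the calculation"
        # chunk goes out of scope here -> "discard, repeat"

print(total)  # sum over all streamed weights
```

With ACCESS_READ the mapping can't dirty any pages, so the kernel only ever reads from disk; the earlier point about read disturb is the main wear-adjacent effect left.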