Posted by mft_ 9 hours ago
https://github.com/matt-k-wong/mlx-flash
2-bit quantization lobotomizes the model, but it's impressive nonetheless! Maybe one day we'll have intelligent 2-bit quants... I wonder.
My version supports 4-bit quantization, hybrid streaming (disk + RAM), and arbitrary model compatibility (tested on Mamba2), and it sets up the framework for LM Studio integration.
I leveraged this work (credit to Danveloper) and am in the middle of making it work on more practical models and quants. It still uses flash streaming, but with a control knob that lets you choose how much RAM to use. In the most extreme case it uses as little RAM as possible but is very slow; in the balanced case it uses some RAM and is much faster.
I designed it around the intelligence-dense Nemotron 3 Nano 30B and Nemotron Cascade 2 30B models (which are smaller, with higher intelligence density), and it can run on low-end 16GB machines, though you can run arbitrarily large models on larger machines (designed for the very low end, but capable at the high end).
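To make the RAM knob concrete, here is a minimal sketch of the general idea, not the actual mlx-flash implementation. It assumes a model is a sequence of layers that are either pinned in RAM or re-read from disk on every pass. `HybridLayerStore`, `count_disk_reads`, the fake loader, and the 100-byte layer sizes are all hypothetical names and numbers made up for illustration.

```python
class HybridLayerStore:
    """Pin as many layers in RAM as the budget allows; stream the rest on every use."""

    def __init__(self, layer_sizes, load_layer, ram_budget_bytes):
        self.load_layer = load_layer   # hypothetical loader: layer name -> weight bytes
        self.resident = {}             # layers pinned in RAM
        self.disk_reads = 0            # counts every trip to disk
        used = 0
        for name, size in layer_sizes.items():
            # Pin a layer only if it still fits in the remaining budget.
            if used + size <= ram_budget_bytes:
                self.resident[name] = self._read(name)
                used += size

    def _read(self, name):
        self.disk_reads += 1
        return self.load_layer(name)

    def get(self, name):
        if name in self.resident:
            return self.resident[name]
        return self._read(name)        # streamed, used once, not kept


def count_disk_reads(ram_budget_bytes, passes=3):
    """Simulate `passes` forward passes over four made-up 100-byte layers."""
    layer_sizes = {f"layer{i}": 100 for i in range(4)}
    store = HybridLayerStore(
        layer_sizes,
        lambda name: bytes(layer_sizes[name]),  # stand-in for a real disk read
        ram_budget_bytes,
    )
    for _ in range(passes):
        for name in layer_sizes:
            store.get(name)
    return store.disk_reads
```

With these toy numbers, a zero budget does 12 disk reads across three passes, a 200-byte budget does 8, and a 400-byte budget does 4. Pinning is used here rather than an LRU cache because layers are visited in the same order on every pass, which is the access pattern where LRU evicts each layer just before it is needed again.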
Like, if I write a blog post and put it on my blog, you're allowed to read it, right?
Heck, if my blog contains some Javascript code I wrote, I would imagine your web browser is allowed to run that code without opening you up to copyright infringement, even if I didn't provide an explicit license.
It’s a MacBook.
At that point, I suppose there are still paid harnesses (people have always paid for IDEs despite FOSS options), partly for mindshare, and they could use expertise & compute capacity to provide application-specific training for enterprises that need it.
here we go: https://huggingface.co/collections/trillionlabs/tri-series
You can use this approach with Intel Optane, which is wearout-resistant unlike NAND and can thus substitute for RAM. Last I checked, it was available quite cheaply on the secondary market, ~$1/GB as opposed to ~$15/GB or more for DRAM. (Of course that's nowhere near as cheap as NAND, which is around $0.1/GB but quite wearout-prone with heavy writes.)
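As a rough illustration of those price points, the sketch below multiplies the per-GB figures quoted above by a capacity. The 64 GB capacity in the usage note is an arbitrary example, and `PRICE_PER_GB` and `cost_to_hold` are hypothetical names.

```python
# Approximate prices per GB as quoted in the comment above (USD).
PRICE_PER_GB = {"DRAM": 15.0, "Optane": 1.0, "NAND": 0.1}


def cost_to_hold(capacity_gb):
    """Return the rough cost in USD of `capacity_gb` of each medium."""
    return {
        medium: round(price * capacity_gb, 2)
        for medium, price in PRICE_PER_GB.items()
    }
```

For example, `cost_to_hold(64)` gives roughly $960 for DRAM, $64 for Optane, and $6.40 for NAND.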
Meanwhile PCIe switches exist. So why not build:
1 CPU + memory + ...
N PCIe switches, each with 1 low-memory GPU + 6 NVMe drives (in theory, 5 can saturate the GPU)
Each of those should only bother the CPU when it has produced some tokens, and each has plenty of PCIe lanes to get at its data.
Such a setup should be able to get a 6 to 8 times speedup over the solution detailed here, and an increase in model compute should make relatively little difference in performance.
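A back-of-the-envelope check on the "5 drives can saturate the GPU" figure, under assumptions the comment does not state: a PCIe 4.0 x16 GPU link at roughly 31.5 GB/s and NVMe drives sustaining about 7 GB/s of sequential reads each. The function names and the 10 GB streamed per token are hypothetical, and real throughput would be lower than these ceilings.

```python
import math


def drives_to_saturate(gpu_link_gb_per_s, drive_gb_per_s):
    """Smallest number of drives whose combined reads fill the GPU link."""
    return math.ceil(gpu_link_gb_per_s / drive_gb_per_s)


def tokens_per_second_upper_bound(streamed_gb_per_token, n_drives,
                                  drive_gb_per_s, gpu_link_gb_per_s):
    """Bandwidth-only ceiling: ignores compute, latency, and switch overhead."""
    # The slower of the drive array and the GPU link is the bottleneck.
    bandwidth = min(n_drives * drive_gb_per_s, gpu_link_gb_per_s)
    return bandwidth / streamed_gb_per_token


# Assumed figures, not measurements.
GPU_LINK = 31.5   # GB/s, PCIe 4.0 x16
DRIVE = 7.0       # GB/s, one fast PCIe 4.0 x4 NVMe drive

n_needed = drives_to_saturate(GPU_LINK, DRIVE)
ceiling = tokens_per_second_upper_bound(10.0, 6, DRIVE, GPU_LINK)
```

Under these assumptions `n_needed` comes out to 5, matching the comment, and with 6 drives the GPU link rather than the drives becomes the bottleneck.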