https://x.com/natebrake/status/2013978241573204246
Thus far, the 6-bit quant MLX weights were too much and crashed LM Studio with an OOM error.
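If the 6-bit quant doesn't fit, a lower-bit one might; here's a minimal, untested sketch of loading one directly with mlx_lm instead of going through LM Studio. The repo id is a placeholder, not a confirmed upload.

    # Untested sketch: load a lower-bit MLX quant directly with mlx_lm.
    # "mlx-community/<4bit-quant-repo>" is a placeholder, not a confirmed upload.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/<4bit-quant-repo>")
    prompt = "Summarize what MLA does for KV cache size."
    print(generate(model, tokenizer, prompt=prompt, max_tokens=128))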
In my experience, small-tier models are good for simple tasks like translation and trivia answering, but are useless for anything more complex. The 70B class and above is where models really start to shine.
Not for code. The quality is so low that it's roughly on par with Sonnet 3.5.
My recommendation would be to use other tools built with better support for pluggable model backends. If you're looking for a Claude Code alternative, I've been liking OpenCode lately, and if you're looking for a Cursor alternative, I've heard great things about Roo/Cline/KiloCode, although I personally still just use Continue out of habit.
https://huggingface.co/inference/models?model=zai-org%2FGLM-...
Slow inference is also present on z.ai; eyeballing it, the 4.7 Flash model was running about twice as slow as regular 4.7 right now.
I am interested in whether I can run it on a 24GB RTX 4090.
Also, would vLLM be a good option?
You should be able to run this in 22GB of VRAM, so your 4090 (and a 3090) would be safe. This model also uses MLA, so you can run pretty large context windows without eating up a ton of extra VRAM.
edit: 19GB of VRAM for a Q4_K_M; the 4-bit MLX quant is around 21GB, so you should be clear to run a lower quant version on the 4090. Full BF16 is close to 60GB, so probably not viable.
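If you want to try vLLM, something like this should be in the right ballpark. The quantized repo id is a placeholder (I don't know what quant formats are actually published), so treat it as a sketch rather than a known-good config.

    # Sketch only: the model id below is a placeholder, and whether an AWQ/GPTQ
    # quant exists for this model is an assumption worth checking on HF first.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="<quantized GLM-4.7-Flash repo>",  # placeholder, swap in a real quantized repo
        max_model_len=32768,            # MLA keeps the KV cache small, so long context is cheap
        gpu_memory_utilization=0.90,    # leave a little headroom on a 24GB card
    )

    params = SamplingParams(temperature=0.7, max_tokens=512)
    out = llm.generate(["Explain MLA in two sentences."], params)
    print(out[0].outputs[0].text)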
I'm thinking of giving it a go with aider, but using something like gemma3:27b as the architect (rough sketch below). I don't think you can have different models for different skills in OpenCode, but with smaller local models I suspect it's unavoidable for now.
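The architect/editor split in aider is just CLI flags; here's a rough sketch of wiring it up from Python. The model names are only examples of what I'd try, not a recommendation.

    # Rough sketch: drive aider's architect mode with local models via Ollama.
    # The model names are examples/assumptions; swap in whatever you actually run.
    import subprocess

    subprocess.run([
        "aider",
        "--architect",                         # architect model plans, editor model applies edits
        "--model", "ollama_chat/gemma3:27b",   # architect/planning model
        "--editor-model", "ollama_chat/qwen2.5-coder:7b",  # editor model (example)
        "--no-auto-commits",
    ])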
I suppose Flash is merely a distillation of that. Filed under mildly interesting for now.
My Mac Mini probably isn't up to the task, but in the future I might be interested in a Mac Studio just to churn through long-running data enrichment projects.