
Posted by tamnd 20 hours ago

DeepSeek 4 Flash local inference engine for Metal (github.com)
408 points | 117 comments | page 2
visarga 17 hours ago|
Large LLMs on a MacBook produce tokens at an acceptable speed, but the problem is reading context. Not incremental reading, as in a chat session where the KV cache helps, but reading a lot at once, like when you paste a big file. That can take minutes.
antirez 16 hours ago||
DS4 can process 460 prompt tokens per second on an M3 Max. Not stellar, but not that slow either. See the benchmarks in the readme.
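
As a rough sanity check on the "minutes" complaint above, assuming that ~460 tok/s figure holds (the 50k-token prompt size below is just an illustration, not a benchmark):

    # Back-of-envelope prefill time at ~460 prompt tokens/s on an M3 Max.
    # The 50k-token prompt is an illustrative size, not a measurement.
    prompt_tokens = 50_000      # e.g. a large pasted file
    prefill_tok_s = 460
    print(f"prefill: ~{prompt_tokens / prefill_tok_s:.0f} s")  # ~109 s, close to two minutes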
habosa 8 hours ago|||
Can you ELI5 why this is so slow for local inference but so fast when using hosted models?
bel8 17 hours ago|||
And unless I'm mistaken, the repo is about running it with 2-bit quantization.

This is probably far from the raw intelligence provided by cloud providers.

Still, this shines more light on local LLMs for agentic workflows.

antirez 15 hours ago||
It runs both the q2 and the original (4-bit routed experts), at more or less the same speed. The q2 quants are not what you might expect: they work extremely well, for a few reasons. For the full model you need a Mac with 256GB.
someone13 9 hours ago||
Out of curiosity, do you have any theories as to why it works so well at such aggressive quantization levels?
antirez 6 hours ago||
It's a mix of things: extreme sparsity, but with the routed expert doing a non-trivial amount of work (and it is q8), and the projections and routing left at higher precision. The fact that it's a QAT model must also play a role, I guess, and I quantized the routed experts' output layers with Q2 instead of IQ2_XXS to retain quality.
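
A minimal sketch of that kind of per-tensor rule, assuming llama.cpp/GGUF-style tensor names and quant labels; the actual mapping used in the repo may differ:

    # Illustrative per-tensor quantization rule for a sparse MoE model.
    # Tensor-name patterns and quant labels are assumptions, not the repo's real mapping.
    def pick_quant(name: str) -> str:
        if "ffn_down_exps" in name:   # routed experts' output projections
            return "Q2_K"             # plain Q2 instead of IQ2_XXS, to retain quality
        if "_exps" in name:           # remaining routed-expert weights, the bulk of the parameters
            return "IQ2_XXS"
        return "Q8_0"                 # attention projections, router, norms: higher precision

    for name in ("blk.0.ffn_up_exps", "blk.0.ffn_down_exps",
                 "blk.0.attn_q", "blk.0.ffn_gate_inp"):
        print(f"{name:22s} -> {pick_quant(name)}")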
brcmthrowaway 16 hours ago||
Why is this the case?

Are there any architectures that don't rely on feeding the entire history back into the chat?

Recurrent LLMs?

dejli 6 hours ago||
The beauty of it is that you can clone it, run make, and it just works. No Python shenanigans. What a blessing for this ecosystem.
Havoc 14 hours ago||
Was excited until I realized DS flash is still enormous. Oh well...glad it exists anyway & happy to see antirez still doing fun stuff
zozbot234 12 hours ago|
It could run viably with SSD offload on Macs with very little memory. You could even exploit batching to make the model almost compute-limited in that challenging setting, seeing as the KV cache is so small (for non-humongous contexts). In fact, if that approach can be made to work, I'd like to see a comparison between DS4 Flash and Pro on the same (Mac) hardware.
Havoc 12 hours ago||
>It could run viably with SSD offload on Macs with very little memory

Not really. That's going to land you somewhere in the 0.2-0.5 tokens a second range

Lovely as modern NVMe drives are, they're not memory.
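
Back-of-envelope, with purely illustrative numbers: if every decoded token has to stream the active expert weights from the SSD, throughput is roughly bandwidth divided by bytes touched per token.

    # Rough ceiling on decode speed when the active weights stream from SSD
    # on every token. Both numbers are illustrative assumptions.
    nvme_bandwidth_gb_s = 6.0    # fast internal Apple-silicon SSD, sequential reads
    active_weights_gb = 15.0     # active weights touched per decoded token
    print(f"~{nvme_bandwidth_gb_s / active_weights_gb:.1f} tok/s")  # ~0.4 tok/s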

zozbot234 12 hours ago||
You can run multiple inferences in parallel on the same set of weights; that's what batching is. Given enough parallelism it can be almost entirely compute-limited, at least for small contexts (apparently at most ~10GB per request, and that's for 1M tokens!).
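
A sketch of the batching argument, reusing the same illustrative numbers as above: each decode step streams the active weights once, and that single pass serves every request in the batch, so aggregate throughput scales with the batch size until the GPU's arithmetic becomes the bottleneck.

    # Batching with SSD offload: one streamed pass over the active weights per
    # decode step, shared by the whole batch. Numbers are illustrative.
    nvme_bandwidth_gb_s = 6.0
    active_weights_gb = 15.0
    step_seconds = active_weights_gb / nvme_bandwidth_gb_s  # ~2.5 s per decode step

    for batch in (1, 8, 64):
        print(f"batch {batch:2d}: ~{batch / step_seconds:5.1f} tok/s aggregate, "
              f"~{1 / step_seconds:.1f} tok/s per request")

Per-request latency stays SSD-bound, but the tiny KV cache is what lets that many requests stay resident in limited memory.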
octocop 3 hours ago||
Finally, someone who pays proper respect to the GGML ecosystem.
amunozo 18 hours ago||
I am curious about it producing fewer tokens in every mode except max. I love DeepSeek V4 Flash and use it extensively; it's so cheap I can use it all day and still not exhaust my $10 OpenCode Go subscription. Because of this I always use it in max mode, but now I wonder whether I should use high instead.
unshavedyak 18 hours ago||
What do you use it for? I tend to just stick to SOTA (Claude 4.7 Max thinking) and put up with the slow request/response. I'm not sure what type of work I'd trust to a lesser thinking model, as my intuition is built around what SOTA Claude Max can handle.

Nonetheless, eventually I want to build an at-home system. I imagine some smaller local model could handle metadata assignment quite well.

edit: Though TIL Mac Studio doesn't offer 512GB anymore... DRAM shortage lol. Rough.

amunozo 16 hours ago|||
I am experimenting with some game development and my thesis' Beamer slides. I have a $20 Codex account, and I use GPT-5.5 for planning and DeepSeek for executing in OpenCode. This makes my Codex 5-hour tokens last more than 10 minutes.
actsasbuffoon 16 hours ago|||
Apple just dropped the 128GB option as well.
fgfarben 9 hours ago||
It is still available for the M5 Max MacBook Pro, but yes, the Mac Studio is now only offered with up to 96 GB.
PhilippGille 17 hours ago|||
On max it uses more than twice as many tokens as on high when running the ArtificialAnalysis benchmark suite, which indeed makes it the model with the highest token usage among the current top-tier models. See the "Intelligence vs. Token Use" chart here:

https://artificialanalysis.ai/models?models=gpt-5-5%2Cgpt-5-...

amunozo 16 hours ago||
Wow, the difference is quite considerable and the gain in intelligence is not that large. I might try using high and just iterating more often. I am working on hobby stuff, so I don't have to worry about whether it breaks things.
syntaxing 17 hours ago||
How has OpenCode Go been for you? Worth switching over from Claude Pro?
DefineOutside 17 hours ago|||
I've found that OpenCode and Codex are the two subscriptions that still seem to subsidize usage. DeepSeek V4 has been the most powerful model in OpenCode IMO. I trust it with problems where I can validate the solution, such as debugging an issue, but I only trust the proprietary GPT-5.5 and Claude Opus 4.7 models for writing code that matters.
amunozo 16 hours ago|||
Given the price, I'm extremely satisfied, especially thanks to DeepSeek V4 Flash, which makes it last forever. I use it on top of my $20 Codex plan, which is great, but its tokens don't last at all.
sourcecodeplz 17 hours ago||
Great project!

This is also a fine example of a vibe-coded project with purpose, as you acknowledged.

brcmthrowaway 16 hours ago||
How does this compare with oMLX?