Posted by caust1c 14 hours ago

A few words on DS4 (antirez.com)
336 points | 139 comments | page 2
ilaksh 6 hours ago|
I want something like this, but not only for my own computer: also for client projects or stuff I might run on cloud GPUs. The core idea of having a strong model that is efficient and doesn't require a cluster still applies to a lot of business cases. I am hoping something like this can work in batch mode.

Right now I feel like a 4-bit Qwen 3.6 27B with MTP is one of the best options for agentic tool calling for some smart voice agents on an H200. I wonder if DS4 Flash, being 80b at 2-bit with 13b active and MTP, could be even faster and smarter and allow more concurrent sequences?

This special 2-bit quantization seems like a big deal.

whazor 2 hours ago||
Some of my colleagues believe that current frontier AIs are too heavily subsidized and that it will come to an end. They think frontier coding AIs might become unavailable for one reason or another. But these kinds of projects show that with a $6000 MacBook we are getting closer to a local frontier model. More importantly, it shows the genie will not go back into the bottle.
somewhatrandom9 11 hours ago||
With "intelligence" (or whatever you want to call it) and speed both seeming to ramp up quickly with local models, I wonder what the growth rate and ceiling might be in this space. Will this kind of IQ and performance work with just, e.g., 16GB RAM in a couple of years? Is there a new kind of Moore's law to be defined here?
hadlock 9 hours ago||
640gb ought to be enough for anybody
famouswaffles 7 hours ago|||
Squeezing a model like this, complete with 'big model smell', into 16GB... honestly, it's not possible today, nor even close to feasible.

It'll require some kind of:

- a breakthrough in architecture, or

- a breakthrough in hardware, or

- some breakthrough quantization technique

The problem is that all the parameters need to be in memory, even the ones that aren't active (say, for mixture-of-experts models), because swapping parameters in and out of RAM is far too slow.
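
Rough back-of-the-envelope (illustrative numbers, loosely matching the 80b-total / 13b-active, 2-bit figures mentioned upthread):

  total_params  = 80e9   # all experts must stay resident (assumed figure from upthread)
  active_params = 13e9   # parameters actually used per token
  bits = 2
  print(total_params * bits / 8 / 1e9)    # ~20 GB just for the weights
  print(active_params * bits / 8 / 1e9)   # ~3.25 GB actually touched per generated token

Even at 2 bits the whole ~20 GB has to sit in RAM, because which experts fire changes token by token, so paging them from disk stalls generation.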

marci 5 hours ago||
"That’s where EMO comes in.

We show that EMO – a 1B-active, 14B-total-parameter (8-expert active, 128-expert total) MoE trained on 1 trillion tokens – supports selective expert use: for a given task or domain, we can use only a small subset of experts (just 12.5% of total experts) while retaining near full-model performance."

https://allenai.org/blog/emo
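
A minimal sketch of what "selective expert use" means at routing time (not EMO's actual code, just the idea: mask the router to a domain-specific subset of experts, then route as usual):

  import numpy as np
  def route_restricted(router_logits, allowed_experts, k=8):
      # router_logits: float scores over all experts for one token
      # allowed_experts: indices of the subset kept for this domain (e.g. 16 of 128)
      masked = np.full_like(router_logits, -np.inf)
      masked[allowed_experts] = router_logits[allowed_experts]
      topk = np.argsort(masked)[-k:]                 # k experts, all from the allowed subset
      w = np.exp(masked[topk] - masked[topk].max())  # softmax over the selected experts
      return topk, w / w.sum()

The interesting claim in the post is that you can pick that subset per task and keep near full-model quality, which is exactly what would let the resident memory shrink.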

lwansbrough 10 hours ago||
The people working at the leading edge of this stuff seem to believe that there is a need for parallel models that solve different problems.

A crow exhibits some degree of intelligence in what is a very small brain compared to humans. There is overlap in the problem solving skills of the dumbest humans and the smartest crows.

So the question is: what is that? Yann LeCun seems to think it’s what we now call world models. World models predict behaviour as opposed to predicting structured data (like language.)

If your model can predict how some world works (how you define world largely depends on the size of your training data), then in theory it is able to reason about cause and effect.

If you can combine cause and effect reasoning with language, you might get something truly intelligent.

That’s where things seem to be going. Once we have a prototype of that system, there will be many questions about how much data you really need. We’ve seen how even shrinking LLMs with 1-bit quantization can lead to models that exhibit a fairly strong understanding of language.

I don’t think it’s unreasonable to expect to see some very intelligent low (relatively) memory AI systems in the next couple years.

NitpickLawyer 2 hours ago||
> This project supports steering with single-vector activation directions; [...] This is also useful for cybersecurity researchers who want to reduce a model's willingness to provide dual-use or offensive security guidance.

Wink wink, nudge nudge.

I have a feeling most cybersec researchers would only be interested in negative values of "reduce" :D

simonw 13 hours ago||
I got this running on a 128GB M5 the other day - pretty painless. The model runs in about 80GB of RAM and it seemed very capable at writing code and executing tools.
perfmode 13 hours ago||
How’s the token throughput / response time?
simonw 13 hours ago||
Healthy!

  prefill: 30.91 t/s, generation: 29.58 t/s
From https://gist.github.com/simonw/31127f9025845c4c9b10c3e0d8612...
embedding-shape 12 hours ago|||
Comparison with a RTX Pro 6000, with DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf:

prefill: 121.76 t/s, generation: 47.85 t/s

The main target seems to be Apple's Metal, so that makes sense. Might be fun to see how fast one could make it go though :) The model seems really good too, even though it's in IQ2.

xienze 13 hours ago||||
I don't want to be a jerk, but 31 t/s prefill is basically unusable in an agentic situation. A mere 10k of context and you're sitting there for 5+ minutes before the first token is generated.
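
Quick sanity check on that (assuming prefill time is roughly linear in prompt length):

  context_tokens = 10_000
  prefill_tps = 31
  print(context_tokens / prefill_tps / 60)   # ~5.4 minutes before the first output token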
fgfarben 11 hours ago|||
That prefill number isn't right. M4 Max hits 200-300: https://github.com/antirez/ds4/blob/main/speed-bench/m4_max_...
hadlock 9 hours ago||
M5 studio is gonna sell like hot cakes
throwdbaaway 5 hours ago||||
Hah, that's because the prompt itself was only about 30 tokens. We need a much bigger prompt to properly test PP.
aiscoming 13 hours ago|||
if it's just the coding agent system prompt and tools, you can cache that
xienze 13 hours ago||
Yeah the problem is that's just the start of the context. There's, you know, all the tool call results and file reads and stuff.
rtpg 10 hours ago|||
what are token speeds like for frontier models, if that gives a rough idea of how much "slower" slow is?
chatmasta 11 hours ago||
So you’re saying I should buy the M5? :) I’ve been resisting, thinking I’ll never use it… it’ll be better in a year… I’ll wait for the Studio (do we still think that’s coming in June?)… etc.
simonw 10 hours ago||
I expect this to be my main machine for the next 3-4 years (which is how I justified the 128GB one). It's a beast of a machine - I love that I can run an 80GB model and still have 48GB left for everything else.

Can't say that it wouldn't be a better idea to spend that cash on tokens from the frontier hosted models though.

I'm an LLM nerd so running local models is worth it from a research perspective.

simpaticoder 7 hours ago||
An M5 Max MBP with 128GB of RAM costs ~$5k. An Nvidia RTX 5090 with 32GB is $4-5k, and an RTX PRO 6000 with 96GB is $10k. Do you have any data on which is the best price/performance for local inference? Do you know what the big OpenAI/Anthropic/Google datacenters are running?
driese 5 hours ago|||
As always: it depends on your needs. Here's a very basic heuristic rundown:

- More RAM: bigger models, more intelligence.

- More FLOPs: higher pre-fill (reading large files and long prompts before answering, the so-called "time to first token").

- More RAM bandwidth: higher token generation (speed of output).

So basically Macs (high RAM, okay bandwidth, lowish FLOPs) can run pretty intelligent models at an okay output speed but will take a long time to reply if you give them a lot of context (like code bases). Consumer GPUs have great speed and pre-fill time, but low RAM, so you need multiple if you want to run large intelligent models. Big boy GPUs like the RTX 6000 have everything (which is why they are so expensive).

There are some more nuances like the difference of Metal vs. CUDA, caching, parallelization etc., but the things above should hold true generally.
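
One rough rule of thumb that falls out of the bandwidth point (decode is memory-bound, since every generated token has to re-read the active weights):

  bandwidth_gbs = 546        # assumed figure, roughly M4 Max class; plug in your own hardware
  active_params = 13e9       # active parameters per token for an MoE like DS4 Flash
  bits = 2
  bytes_per_token = active_params * bits / 8
  print(bandwidth_gbs * 1e9 / bytes_per_token)   # ~168 t/s as a loose upper bound

Real numbers land well below that because of KV-cache traffic, dequantization and scheduling overhead, but it explains why bandwidth, not FLOPs, caps generation speed.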

theturtletalks 1 hour ago||
Do you think Apple will fix prefill speed with the M6 Max MacBook Ultra 128GB?
minimaxir 12 hours ago||
A relevant recent tweet from antirez: https://x.com/antirez/status/2054854124848415211

> Gentle reminder on how, in the recent DS4 fiesta, not just me but every other contributor found GPT 5.5 able to help immensely and Opus completely useless.

I've noticed the same for lower-level, squeeze-as-much-performance-as-possible code work.

throwaway041207 12 hours ago||
Assuming we are talking about Code/Codex, are you on API billing or a subscription? I have essentially unlimited API billing at my disposal and I haven't noticed any degradation of quality across Opus versions.
chatmasta 11 hours ago||
Same here, the enterprise version of Claude has been great. Luckily I’m not the one paying for it. We also have Copilot, and when GPT-5.4 came out at 1x request cost I was very impressed, but I haven't had much time to compare the two.

I also don’t have time to do much personal coding outside of work, so I haven’t subscribed to a personal one yet. But I intend to go for Codex just to balance the Claude at work and also because of the hostile moves from Anthropic toward their consumer business.

rjh29 4 hours ago|||
There's so much subjectivity with models. As soon as a new model comes out people act like the last model they used for 6 months was completely useless.
sanxiyn 11 hours ago||
There is a benchmark for performance work, and I think it is not being optimized for by model vendors. The latest result from GSO is that both Opus 4.6 and 4.7 slightly outperform GPT 5.5. This also matches my experience.

https://gso-bench.github.io/

vitorsr 10 hours ago||
Tasks are taken from commit histories in public Git repositories, which defeats the purpose.
easythrees 10 hours ago||
I thought for a moment there was a Dark Souls 4
NDlurker 10 hours ago||
I was thinking dual shock 4
blitzar 4 hours ago|||
The prequel to the prequel of Deep Space 9
JavierFlores09 10 hours ago|||
Glad I wasn't the only one, my second thought was Dual Shock controller but that wasn't it either lol
txhwind 5 hours ago||
Fucking abbreviations. Who knows whether it's DeepSeek, Dark Souls or DualShock? All are possible on HN.
Riany 6 hours ago||
I think local models need to be good enough that privacy, latency, and control become worth the tradeoff, rather than needing to beat the best cloud models.
kamranjon 13 hours ago||
Just want to mention that I've been pulling down and using DwarfStar locally and it's incredible. I actually have it running on my personal MacBook M4 Max with 128GB of RAM, and I'm running the server to share it through Tailscale with my work laptop, which just has pi running there.

The long context reasoning is something I haven't even seen in frontier models - I was running at 124k tokens earlier and it was still just buzzing along with no issues or fatigue.

I am amazed at how well it works. I'm using it right now for some pretty complex frontend work, and it is much, much faster for me than running a dense 27b or 31b model (like Qwen or Gemma), one of the benefits of MoE. But the long-context capabilities are what have absolutely floored me.

Super excited about this project, and I hope antirez can keep himself from burning out. I've been following the repo pretty closely; there are a ton of PRs flooding in and it seems like he's had to do a lot of filtering out of slop code.

le-mark 13 hours ago|
Is DS4 dwarf star 4 or deep seek 4?
kamranjon 13 hours ago|||
Just updated! Sorry, I meant DwarfStar - it's the only way I've actually managed to run DeepSeek Flash on my local hardware.
zackify 11 hours ago||
Are you on q2?
kamranjon 9 hours ago||
Yea I'm on the imatrix q2 version now
wolttam 13 hours ago|||
DwarfStar 4 is DeepSeek 4 (check the repo)
kgeist 8 hours ago|
Has anyone compared DeepSeek 4 Flash to Qwen3.6-27B on real tasks (quality + speed)? According to the benchmarks at artificialanalysis.ai, Qwen3.6-27B is better at agentic tasks, and DS4 is only 2 points better at coding (both with max reasoning effort, full weights). At the same time, DS4 requires 5 times more VRAM even at 2 bits. Last time I explored this topic, large MoE models at 2-3 bits usually performed worse (quality-wise) than dense ~30B models at 4-8 bits, despite being much heavier to run.

Sure, MoE models have more knowledge, but extreme quantization may negate the benefits. And generally for coding tasks, you don't need a model that has memorized all the irrelevant trivia like, I don't know, the list of all villages in country X. DS4 also seems to run much slower on Mac Studio Ultra, which appears to be more or less in the same price range as RTX 5090. RTX 5090 gives me 50-60 tok/sec and 260k context with Unsloth's 5-bit quantization (only some layers are 5-bit too) and an 8-bit KV cache; prefill is instant too. It works flawlessly in OpenCode.
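
For a sense of where the KV cache fits into that budget, a rough sizing sketch (the dimensions here are made up for illustration, not Qwen3.6's actual config):

  # bytes = 2 (K and V) * layers * kv_heads * head_dim * context_len * bytes_per_elem
  layers, kv_heads, head_dim = 48, 4, 128   # hypothetical GQA-style dims
  context_len = 260_000
  bytes_per_elem = 1                        # 8-bit KV cache
  print(2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9)  # ~12.8 GB

Which is the kind of budget that makes a 260k context plausible next to the weights on a 32GB card.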

If you already have a spare high-end Mac, I can see the benefit, but I'm not sure it's a good configuration overall. Unless Qwen3.6 is more benchmaxxed than DS4 :)
