Posted by caust1c 13 hours ago

A few words on DS4 (antirez.com)
336 points | 139 comments
wg0 5 minutes ago|
DeepSeek V4 Pro is a really, really competent model, and what makes it extremely good is the price point it is offered at.

I have been toying with a 2.5D engine in C on top of raylib, using DeepSeek as a companion in between.

Its thinking transcripts in OpenCode are transparent, and it's mind-boggling to look at the things it considers in its thought process. Very long to read, but none of it useless or meaningless.

It has often happened that I discover an assumption I didn't think about, or one that was plain wrong: DeepSeek flags it in its thought process, but then in the final output it "aligns" to my flawed request. So I tell it: wait, I saw you thought so-and-so too, and that's correct, I made a mistake, let's consider that aspect too.

gcr 9 hours ago||
DwarfStar4 is a small LLM inference runtime that can run DeepSeek 4. The blog post implies that it currently requires 96GB of VRAM.

For others who are lacking context :-)

foresto 9 hours ago||
Thanks. Outside of LLM circles, DS4 is usually a video game controller.
artyom 9 hours ago|||
Well, I was sitting here expecting the Redis creator to have an opinion on the still-unannounced Dark Souls 4.
low_tech_love 3 hours ago||
Haha the same here!!
oezi 7 hours ago||||
Or a car from Citroen
pavlov 4 hours ago||
Technically DS is an independent sibling of Citroën within Stellantis, a sprawling car conglomerate that owns a dog’s dinner of car brands in Europe and the USA.
orthoxerox 2 hours ago|||
It's still the Lexus to Citroen's Toyota.
Hamuko 1 hour ago||||
If we want to get really technical, “DS4” is a model from Citroën and they later spun out the DS lineup into its own brand, with the “Citroën DS4” becoming “DS 4”, “DS” being the make and “4” being the model.
pavlov 1 hour ago||
And even more pedantically, DS has recently adopted a new naming scheme where the former DS 4 is now written as DS N°4, pronounced "number 4"...

Their stated inspiration for this SEO bomb is Chanel perfumes.

drcongo 2 hours ago|||
Pavlov's dog's dinner?
insensible 7 hours ago||||
Trekkies are experiencing a major regression from Deep Space Nine.
RALaBarge 1 hour ago||
They never should have trusted Quark
jofzar 9 hours ago|||
I am actually kind of disappointed it wasn't a deep dive on the DualShock 4
Wowfunhappy 55 minutes ago|||
Thanks. How is DwarfStar4 different from llama.cpp?
smcleod 1 hour ago|||
That's the Flash version, not the full model, and only at ~Q2-3, so while impressive it's still quite different from the full model.
rurban 1 hour ago||
Not really. I'm now building another fast C compiler with DeepSeek 4 Flash, and rarely have to step outside to use Pro or Sonnet, GPT, or Kimi-2.6. Flash is very capable at almost everything.
zozbot234 4 hours ago|||
> The blog post implies that it currently requires 96GB of VRAM.

Has anyone tested what happens if you try and run this on lower-RAM Macs? It might work and just be a bit slower as it falls back on fetching model layers from storage.

conradkay 4 hours ago||
It'd be way slower since you'd be doing that work for every token
zozbot234 4 hours ago||
True (with 64GB RAM it'd have to fetch 20% of its active experts from disk already, about 650MB/tok at 2-bit quant - and that percentage rises quickly as you lower RAM further); my question is just a more practical one about whether it runs at all, how bad the slowdown is, and to what extent you might be able to get some of that decode throughput back by running multiple (slower) agent sessions in parallel under a single Dwarf Star 4 server.
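
As a rough sketch of where that 650MB/tok figure comes from, and of why the slowdown is so brutal; the ~13B active-parameter count below is an assumption picked to reproduce that number, not a published spec:

    # Per-token disk traffic when a slice of the active MoE experts misses RAM.
    # All numbers below are illustrative assumptions, not DeepSeek specs.
    active_params = 13e9        # hypothetical active parameters per token
    bits_per_weight = 2         # 2-bit quantization
    fraction_on_disk = 0.20     # share of active experts not resident in RAM

    bytes_per_token = active_params * bits_per_weight / 8 * fraction_on_disk
    print(f"{bytes_per_token / 1e6:.0f} MB read from disk per token")  # ~650 MB

    # Even a ~6 GB/s NVMe SSD then spends ~0.1 s per token on pure I/O,
    # i.e. a ceiling of roughly 10 tok/s before any compute is counted.
    nvme_bandwidth = 6e9
    print(f"I/O-bound ceiling: ~{nvme_bandwidth / bytes_per_token:.0f} tok/s")
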
rpigab 3 hours ago|||
I knew Death Stranding 3 wasn't out yet!
DeathArrow 7 hours ago||
>The blog post implies that it currently requires 96GB of VRAM.

From the GitHub page it seems it only supports Apple and DGX Spark. I have 128 GB of RAM and a 3090, but it probably won't work.

thomasm6m6 6 hours ago|||
FYI, llama.cpp (which antirez/ds4 is inspired by) supports system RAM. E.g. [1] is a good guide for running a similar-sized model with 128GB RAM and a 3090-sized GPU.

[1] https://unsloth.ai/docs/models/tutorials/minimax-m27

(Unsloth's deepseek-v4 support is still WIP)
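
As a very rough sketch of the kind of split such a guide lands on with a 24GB GPU plus 128GB of system RAM (all sizes below are illustrative assumptions; only the ~96GB total footprint comes from upthread):

    # Rough accounting for a MoE-offload split: keep attention / dense layers
    # and the KV cache on the GPU, stream the experts from system RAM.
    # Sizes are illustrative assumptions, not measured numbers.
    weights_total_gb = 90       # assumed total weight footprint at this quant
    non_expert_gb = 18          # assumed attention + dense + embeddings on GPU
    kv_cache_gb = 4             # assumed KV cache for a modest context
    experts_gb = weights_total_gb - non_expert_gb

    gpu_gb = non_expert_gb + kv_cache_gb
    ram_gb = experts_gb
    print(f"GPU: ~{gpu_gb} of 24 GB, system RAM: ~{ram_gb} of 128 GB")
    # Decode speed then mostly tracks how fast the CPU side can stream the
    # active experts each token: it works, just far slower than all-VRAM.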

DeathArrow 5 hours ago||
Thanks, I can run Qwen 3.6 27B with vLLM, but I was curious about antirez's tool.
manmal 5 hours ago|||
It wouldn’t be useful with your setup; probably 3-4 tokens per second.
DeathArrow 4 hours ago||
Yep, maybe I can open a feature request if it makes sense technically.
zozbot234 3 hours ago||
Arguably it makes more sense technically to get the model support into llama.cpp, which provides many options for GPU+CPU split inference already.
ttoinou 19 minutes ago||
When I ran DS4 Q2 the other day (without the newly updated Q2 imatrix) it behaved quite poorly after a few agentic turns with opencode: it couldn't modify the files, kept telling me the work was ready, and didn't use any tool to update the files
petercooper 1 hour ago||
I've been using the Q4 version on my Mac Studio over my local network and it's been good. Indeed, I had the first ever experience where I was playing with it alongside my various other agents and forgot it was a local model as it was doing such a good job.

I do wonder, though, if another agent is really needed. I've been driving it with Pi (Claude Code's system prompt is far too heavy given the prefill speeds) and it's been great. OpenCode is another good option. Is there anything else to gain from another similar tool specific to Deepseek 4?

antirez 1 hour ago||
There is no need for another agent, functionally. But if you follow the idea of DS4 itself: the API that agents use forces them to do odd things, like translating the DSML stanzas to JSON, with all the canonicalization / KV cache checkpointing problems resulting from that. Does it really have to be that way? What about also providing a sane alternative? Also, I'm not sure why people don't try to write more stuff in that area in C/Go/Rust to have more control / speed / fewer dependencies.

Also, there is a lot more to imagine on the TUI side. The problem is that most projects just copy what they already saw. For instance, I just did this in 20 minutes: https://x.com/antirez/status/2055190821373116619 Now that code is cheap, ideas have more value. Are we sure it still makes sense today to think in terms of "is another XYZ needed"? It could be worth it just to explore new ideas. I don't like the JavaScript / Node ecosystem for my code, so if I have to explore a new TUI or agent workflow and I do it with the tools I'm happier to use, the result, the iterations, are different.

zozbot234 1 hour ago||
> ...I'm not sure why people don't try to write more stuff in C to have more control / speed / fewer dependencies.

Codex CLI is written in Rust, which should give raw performance comparable to C/C++. Of course you can still care about the "fewer dependencies" point, but this is somewhat less of a concern on a properly maintained project like Codex. Those are not so much "wild, out of control" third-party dependencies; they're closer to the old ideal of proper software componentry.

> Also, there is a lot more to imagine on the TUI side. The problem is that most projects just copy what they already saw. For instance, I just did this in 20 minutes.

This mockup is really nice and the sidebar display gives you a natural way to expose running multiple thinking flows in parallel, at least if you keep them from stepping on each other's toes with code edits (keep them all in read-only "plan" mode or working on completely separate directories/files). That's not so helpful on a 128GB MacBook where a single agentic flow brings you to thermal/power limits already, but it suddenly becomes useful on other hardware (DGX Spark, Strix Halo, lower-RAM machines with SSD offload, multiple nodes with pipeline parallelism) where you have more compute than you could use for single-stream decode.

zozbot234 1 hour ago||
DS4 is an inference engine, not a harness. It provides an inference API server and you point your coding harness to it.
antirez 1 hour ago||
You misunderstood the OP. I hinted, in my blog, at my interest in also putting an agent harness inside.
ljosifov 2 hours ago||
Love this, even if I can't use it atm (haven't got the h/w - only 96GB on an M2 Max). I get that the general computing public will find it unusable or worse. Reminds me of how home computers were - mere toys - before they became personal computers (PCs).

On my h/w the only passable combo for me atm is the pi agent + llama.cpp + the nemotron cascade-2 model: up to 1M context, and the hybrid arch doesn't crash & burn quadratically at the 10K-50K-100K context depths used by code agents. Was on a plane without Internet the other day, and it brought a smile to my face that I could run the pi agent (with llama.cpp serving) and it was just about usable at 30-40 tok/s. Afaik the usual API speeds are double that, 60-80 tok/s. Sensors showed about 60W while running inference, so the battery probably would not last much more than ~3h.

The model being only 30B in size leaves plenty of space for KV caches and other programs, even at a generous 8-bit quant. Only 3B active params at a time (MoE A3B) is about the most that ageing M2 Max can carry, it seems.
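
Back-of-envelope, that lines up with decode being memory-bandwidth bound; the ~400 GB/s figure for the M2 Max and the sizes below are my assumptions, not measurements:

    # Decode is roughly bandwidth-bound: each token touches all active weights.
    # Assumed sizes for a ~30B total / ~3B active MoE at roughly 8-bit.
    total_params = 30e9
    active_params = 3e9
    bytes_per_param = 1.0          # ~8-bit quantization
    mem_bandwidth = 400e9          # approx. M2 Max unified memory, bytes/s

    print(f"weights: ~{total_params * bytes_per_param / 1e9:.0f} GB of 96 GB")
    ceiling = mem_bandwidth / (active_params * bytes_per_param)
    print(f"decode ceiling: ~{ceiling:.0f} tok/s")   # ~133 tok/s in theory
    # Observed 30-40 tok/s sits well under that ceiling once attention over a
    # deep context, KV-cache reads, and thermal limits are counted.
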
zozbot234 1 hour ago|
It should work with 96GB, especially on a limited context. But the M2 Max is a bit slower, yes.
zmmmmm 9 hours ago||
I'm very curious where we will saturate the curve on "enough" intelligence for coding. At some point, you can let a less smart model hammer at a problem for longer and get to the same result, and as long as you are not involved it comes to the same thing. I feel like DeepSeek V4 Pro is nearly there. Maybe Flash is too.

Once we hit that point, I am curious how much of Anthropic's current business model falls apart? So far it's always been clear that you just pay for the most intelligent model you can get because it is worth it. It now seems clear to me that there is limited runway on that concept. It is just a question of how long that runway is. I honestly wonder how much of their frantic push to broaden out into enterprise / productivity is because they see this writing on the wall already.

loeg 8 hours ago||
> At some point, you can let a less smart model hammer at a problem for longer and get to the same result, and as long as you are not involved it comes to the same thing.

Is that true? I find the smarter models can just be effective when smaller models can't. It isn't a matter of just waiting longer.

davnicwil 7 hours ago||
It's almost certainly not true yet, but at some point there might be an equilibrium reached of speed vs quality (and let's not forget cost) where it's true for most of what you do.

Perhaps you'd still turn to hosted models for the hardest tasks, but most tasks go local. It does seem like that would make demand go down significantly.

Of course that's all predicated on model advances plateauing, or at least getting increasingly more expensive for incremental improvements, such that local open source models can catch up on that speed/quality/cost curve. But there is a fair amount of evidence that's happening. The models are still getting noticeably better, but relative improvement does seem to be slowing, and cost is seemingly only going up.

vlovich123 5 hours ago||
Why is this presumed to be de facto inevitable:

* local compute isn’t scaling like it used to, so algorithmic improvements are the only way models get meaningfully faster and smarter

* all those same algorithmic improvements would also be true for larger models

* hardware manufacturers have an incentive against local LLMs because cloud LLMs are so much more lucrative (+ corps would buy desktop variants if they were good enough)

So no, it’s not clear quality will ever be comparable. It may be good enough for what you want but there will always be a harder problem that you need to throw more compute and more memory at.

kennywinker 2 hours ago||
> It may be good enough for what you want but there will always be a harder problem that you need to throw more compute and more memory at.

Sure, but if the “good enough for what you want” covers the vast majority of cases, data-center AI becomes just for the extreme edge cases. Like how I can render a 4K video game at 60fps on my home PC, but if Pixar wants to render their next movie they use data-center compute.

> all those same algorithmic improvements would also be true for larger models

Smaller models run faster. If ten runs of a small model gets me the same quality result as one run of the big model, and the small model runs 10x faster, then they are functionally the same.

zozbot234 2 hours ago||
> Like how I can render a 4K video game at 60fps on my home PC, but if Pixar wants to render their next movie they use data-center compute.

This is a very nice analogy actually and it impacts the whole story about US vs. Chinese leadership in "frontier AI".

jofzar 8 hours ago|||
> I'm very curious where we will saturate the curve on "enough" intelligence for coding. At some point, you can let a less smart model hammer at a problem for longer and get to the same result, and as long as you are not involved it comes to the same thing. I feel like DeepSeek V4 Pro is nearly there. Maybe Flash is too.

It's always going to be about cost:

developer time vs developer cost vs AI cost vs developer productivity.

With 4.6 it's looking like we are at the upper limit of appetite for cost (for "regular" business), so the other levers will probably need to change.

nl 5 hours ago|||
Kilo (the open source coding agent) tested Deepseek v4 Pro and Flash vs Opus 4.7 and Kimi K2[1].

It did OK, but scored substantially lower than Opus. It also cost nearly as much, even with the current launch promo pricing for DeepSeek.

That cost is interesting - I've seen similar things with Sonnet vs Opus, and in my own benchmarking there are some models that benchmark well and seem to have a good price, but use so many tokens that they cost just as much as "more expensive" models.

[1] https://blog.kilo.ai/p/we-tested-deepseek-v4-pro-and-flash

skybrian 6 hours ago||
I imagine we'll get to "good enough" for hobbyist programmers fairly quickly, but businesses will still be willing to pay more for faster and smarter. Why make your programmers wait?
zmmmmm 5 hours ago||
> Why make your programmers wait?

That depends on where the methodology goes. But more and more it's hands off. If the trajectory continues it won't matter because nobody is sitting there waiting / watching the LLM code anyway. It is all happening in the background. We might see hybrid approaches where the weaker / cheaper agent tries to solve it and just "asks for help" from the more expensive agent when it needs it etc.

kaoD 15 minutes ago||
> nobody is sitting there waiting / watching the LLM code anyway

My personal experience is that for production-grade code you need to steer the agent more often than not... so yes, at least some of us are watching the LLM code.

karmakaze 10 hours ago||
Great to find this narrowly focused thing:

> We support the following backends:

    Metal is our primary target. Starting from MacBooks with 96GB of RAM.
    NVIDIA CUDA with special care for the DGX Spark.
    AMD ROCm is only supported in the rocm branch. It is kept separate from main
    since I (antirez) don't have direct hardware access, so the community rebases
    the branch as needed.
> This project would not exist without llama.cpp and GGML, make sure to read the acknowledgements section, a big thank you to Georgi Gerganov and all the other contributors.

Edit: aww, doesn't seem to support offloading to system RAM[0] (yet)

[0] https://github.com/antirez/ds4/issues/108

Guess I'll have to keep watching the llama.cpp issue[1]

[1] https://github.com/ggml-org/llama.cpp/issues/22319

zimmerfrei 5 hours ago||
> AMD ROCm is only supported in the rocm branch.

Has anybody tried it? There is a lot of emphasis on MacBook Pro in this thread, but I would like to use it with an AMD Strix Halo with 128GB of unified RAM.

keyle 6 hours ago||
If only you could still buy Macs with that much RAM
shric 6 hours ago||
You can buy 128GB M5 MacBook Pros?

Configured one just now, delivers in 2 weeks

keyle 4 hours ago||
Interesting, there was news last week or so of Apple removing Mac mini options.
littlecranky67 1 hour ago||
They removed the baseline 8GB RAM / 256GB storage model. My bet is that with increased RAM prices the markup on the lower end is no longer enough to make a profit
albertzeyer 50 minutes ago||
More information about DwarfStar 4 (DS4) in the readme: https://github.com/antirez/ds4

The code seems based on llama.cpp and GGML.

I don't fully understand why it is a standalone project. The readme discusses this: DwarfStar 4 is a small native inference engine specific for DeepSeek V4 Flash. It is intentionally narrow: ...

I think the only big difference in DeepSeek V4 vs other models is maybe the type of self-attention. And that leads to: KV cache is actually a first-class disk citizen.

But I still feel like those changes could have been implemented as part of some of the other local engines.

I also assume more models will come out, not just from DeepSeek but also from others, and they might share similar self-attention approaches, that would benefit from a similar KV cache implementation.
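
For what it's worth, the "first-class disk citizen" idea is roughly this: persist the KV state computed for a prompt prefix, keyed by a hash of the prefix tokens, so a later session sharing that prefix can skip prefill. A minimal concept sketch (not DS4's actual code or on-disk format; names here are hypothetical):

    # Concept sketch only: cache the KV state of a prompt prefix on disk so a
    # later request with the same prefix resumes decoding without re-prefill.
    import hashlib, os, pickle

    CACHE_DIR = "kv_cache"  # hypothetical on-disk location

    def prefix_key(tokens: list[int]) -> str:
        return hashlib.sha256(repr(tokens).encode()).hexdigest()

    def save_kv(tokens, kv_state) -> None:
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(os.path.join(CACHE_DIR, prefix_key(tokens)), "wb") as f:
            pickle.dump(kv_state, f)

    def load_kv(tokens):
        path = os.path.join(CACHE_DIR, prefix_key(tokens))
        if not os.path.exists(path):
            return None              # cache miss: run prefill as usual
        with open(path, "rb") as f:
            return pickle.load(f)    # cache hit: skip prefill for this prefix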

skiwithuge 15 minutes ago|
Because llama.cpp doesn't accept PRs made fully by AI agents, even if they are guided by the author

https://github.com/ggml-org/llama.cpp/blob/master/AGENTS.md

FuckButtons 10 hours ago||
It’s shocking how close this feels to Claude. Obviously it's much slower, but I don’t know that it’s significantly dumber. Interestingly, the imatrix quantization seems to be better than whatever quant the ZDR inference backends on OpenRouter are using. It was self-aware enough yesterday to realize that its own server process was itself, without me telling it, which is not something I’ve ever observed a local model do before.
stavros 10 hours ago|
In my (obviously anecdotal) testing, DeepSeek V4 Pro was better than Sonnet at coding. However, it is much slower; it's also many times cheaper, especially with the promotion right now.
DeathArrow 7 hours ago||
Do they have a coding plan, or do you only pay per API call?
trollbridge 6 hours ago|||
It’s just per token, but burning up 100 million+ tokens is a $3 transaction with their pricing right now
DeathArrow 5 hours ago||
Do you use the official API or another provider?
trollbridge 5 minutes ago|||
Just directly. Paid for it with PayPal. It’s quite simple to set up and use.
stavros 3 hours ago|||
I use the official API; OpenRouter somehow didn't use caching, and one short session with Qwen cost me $5.
ReptileMan 1 hour ago|||
You pay per API call, but you will be challenged to burn through $20 per month. 24/7 usage for a single agent will probably cost you around $100 per month. It is very efficient, especially with modern harnesses.
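
A quick sanity check on those numbers, taking the "~$3 per 100M tokens" figure quoted upthread as a blended rate (promo pricing, heavy cache hits); the monthly volumes are purely illustrative guesses:

    # Blended rate derived from the "$3 per 100M tokens" claim above.
    # Monthly token volumes are illustrative, not measurements.
    price_per_million = 3 / 100      # ~$0.03 per 1M tokens

    def monthly_cost(tokens_millions: float) -> float:
        return tokens_millions * price_per_million

    print(monthly_cost(100))     # casual use, ~100M tokens/month   -> ~$3
    print(monthly_cost(700))     # heavy daily agent use            -> ~$21
    print(monthly_cost(3500))    # a single agent running 24/7      -> ~$105
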
0xbadcafebee 11 hours ago|
I don't see an explanation of why they would make a model-specific inference engine vs just using llamacpp. There are already lots of people working on the llamacpp integration. This is a lot of effort spent on a single model which is likely to become obsolete when a different model comes out that does better. In some discussions, people are now making PRs against both the llamacpp branches and ds4... so it's taking a rare commodity (people investing development time in this model) and fragmenting it
dilap 7 hours ago||
way easier to work on a focussed C codebase you own than a mature, unwieldy C++ codebase you don't. but it's fine, people will take that work and port it to llamacpp and everyone wins.

(the UX of ds4 is fantastic too -- it's dead easy to get a known-good model and a great quant. with llamacpp you're much more hacking in the wilderness, w/ many many knobs.)

flakiness 11 hours ago|||
I believe the assumption is: The code is cheap. The collaboration (eg. upstreaming) is expensive.

Is it true? We'll see, in a few years.

zozbot234 11 hours ago|||
The author has mentioned many times that the llama.cpp maintainers don't want code that's predominantly written by AI with no human revision. If anyone wants to try and get the support upstreamed into that project, they're quite free to do that: the code is MIT licensed.
kristianp 10 hours ago||
Also, Antirez has been able to use GPT to iterate on the code and performance. He/they (others contributed to DS4) have a set of result files to ensure that correctness is maintained, and benchmarks to verify performance, and the LLM is able to iterate within that framework. Having a small, focussed codebase helps here.

Antirez explained the dev process when he posted a pure C implementation of the Flux 2 Klein image gen model, at https://news.ycombinator.com/item?id=46670279

fgfarben 10 hours ago||
At a certain point the level of abstraction / genericization necessary for a big flexible project (like llama.cpp or Linux) blows things up into a huge number of files. Something newer and smaller can move faster.