We got 207 tok/s with Qwen3.5-27B on an RTX 3090 (github.com)

Posted by GreenGames 5 hours ago

151 points | 43 comments

Aurornis 3 hours ago

This is a Claude Code-generated repo that implements some ideas from research papers. If you follow this space, every paper release spawns tens or hundreds of vibecoded repos like this that get spammed to Reddit, Hacker News, and other sites.

It's generally best to overlook the vibecoded repos and go closer to the source for up-to-date information. In this case, z-lab already showed Qwen3.5-27B with DFlash last month: https://huggingface.co/z-lab/Qwen3.5-27B-DFlash

This repo is an example of what you get if you point Claude Code at the upstream repo and have it iterate with some other objective (loading GGUF). They also included DDTree in there somewhere.

You also need to look closely at the claims. A classic trick in these repos is to cherry-pick numbers that make the work in the repo look extraordinary until you start reading the details. From my quick read, this repo is using Q4 quantization on the KV cache, which does not produce good results. Someone who reads everything in detail might find more tricks. This is par for the course for these demo repos because the goal is to impress casual viewers with big numbers.

I'm trying to find where they get the 207 tok/s number, but it only appears in their headline claim. If you read deeper, the real numbers are half that or less.

There are also several (possibly vibecoded, I haven't checked) draft PRs and forks to use these techniques on upstream llama.cpp that would be much more useful for experimenting. One example I picked at random: https://github.com/ggml-org/llama.cpp/pull/22105

j45 2 hours ago

Appreciate the reading and things to go learn more from.

Learning about Qwen 3.5, how Gemma 4 appears to be unique (relatively speaking), and Apple possibly using some type of Gemma model on-device will, I think, also help fill in how to track local model and local device capabilities, which could be additional measures/KPIs as well.

GreenGames 2 hours ago

This reads like you didn’t read the post.

z-lab runs BF16 on B200 (54+ GB). There is no z-lab path that fits on a 24 GB 3090. That is literally the entire point of our work, and it is stated in the second paragraph. If you had checked the HF model card you linked before posting, you would see the same thing. Before this repo, there was no path to run this... SGLang's GGUF path for this model is broken. llama.cpp doesn't have DFlash speculative decoding at all. If you wanted to run this hybrid model fast on a 24 GB consumer card, there was nothing...

That took weeks of real engineering.

Calling that "vibecoded" because we used a bit of AI so the README is clean is the laziest possible critique. An LLM reading the DFlash paper does not catch verify_logits_buf being sized vocab*q_len when DDTree reads vocab*(budget+1). That is hours of debugging with nvidia-smi and memory sanitizers, not prompting.

The 207 and 129.5 numbers are both in the second sentence of the post and again in the TL;DR. 207.6 is peak tok/s in the linked demo video, 129.5 is the HumanEval 10-prompt mean at DDTree budget=22. We specify both just behind the title.

On the Q4 KV cache: the tradeoff is disclosed with actual numbers. AL 8.56 -> 8.33 at short context (3% drop), dramatically better at long context. It’s the only way 128K allocates on 24 GB. The binary is env-selectable, you can run BF16 KV if you don’t need 128K. Both are benchmarked.

Aurornis 1 hour ago

> This reads like you didn’t read the post.

I was discussing details I read in your repo. How did you conclude that I didn't read the post? I'm skeptical a human is writing these comments because everything you're posting reads like LLM output.

> On the Q4 KV cache: the tradeoff is disclosed with actual numbers. AL 8.56 -> 8.33 at short context (3% drop), dramatically better at long context.

I'm sorry, but you're not the first person (or LLM) to think of using Q4 KV cache to fit more context in VRAM.

The degradation is far more than 3% on real evals. Q8 only recently became usable on Qwen3.5 in llama.cpp with the context rotation changes. Before that, BF16 was necessary to get decent performance in real tasks.

Q4 is a non-starter for real work. The fact that you're still trying to defend it tells me you haven't used this for anything other than token/sec racing.

ohyoutravel 44 minutes ago

This is an embarrassing reply. Unfortunately you’ve hit the hour mark so you cannot delete it. :(

refulgentis 1 hour ago

You wrote this reply with Claude, and it's lying about it only being README.md. OP, and I, know this because you and Claude documented it.*

I use the same tools, I'm not mad at you for using it. It's just, idk man, you want to use it tactically in ways that are a net benefit to you. Not in ways that embarrass you or lie.

* https://github.com/Luce-Org/lucebox-hub/commit/cfc38f67275ee...

* * Here's Claude's version of this very post if you want to see an example of Claude voice vs. original and how to spot it: https://gist.githubusercontent.com/jpohhhh/a42060f0f34339c4b...

dirtikiti 3 hours ago

"Local AI should be a default, not a privilege: private data, no per-token bill, no vendor lock-in. The hardware to run capable models already sits on desks. The software to run those chips well doesn't."

So figure out how to run it on Vulkan instead of requiring the user to be locked into expensive CUDA cards.

Aurornis 3 hours ago

So everyone is aware, you can already run Qwen3.5-27B on Vulkan or Apple's hardware. Every major inference engine supports it right now.

This repo is a vibecoded demo implementation of some recent research papers combined with some optimizations that sacrifice quality for speed to get a big number that looks impressive. The 207 tok/s number they're claiming only appears in the headline. The results they show are half that or less, so I already don't trust anything they're saying they accomplished.

If you want to run Qwen3.5-27B you can do it with a project like llama.cpp on CUDA, Vulkan, Apple, or even CPU.

Grimblewald 1 hour ago

This. Even on Android via Termux you can run Ollama with GPU acceleration on a phone. It works, though mileage will vary.

SwellJoe 1 hour ago

You can run pretty much every model on Vulkan, including the Qwen MoE models. You can also run pretty much every model on ROCm, Apple Silicon via MLX, and Intel hardware via OpenVINO. Nvidia got there first, but they're no longer clearly dominant in the self-hosting space, simply because of the high cost. I think Apple probably has the lead there, due to unified memory allowing big models to run without multiple big dedicated GPUs, but stuff like Strix Halo with 128GB of unified memory is also pretty much sold out everywhere. There's a lower bound on how small a model can be and still be useful.

Anyway, I don't have any Nvidia hardware, and I've got several local models running and/or training at all times.

andsoitis 3 hours ago

Why doesn’t Apple?

tpurves 3 hours ago

Like with all new tech trends, it takes them a hot minute to catch up, but it's highly likely they will (eventually) release some killer platforms for local AI. The shared memory, high bandwidth and power efficiency of their M chips make for a near-ideal architecture. If/when they finally push out the M5 Ultra, that could be round one (albeit still not at the best price/performance vs comparable cloud API tokens). A real mass-market killer device for local LLMs is still going to require some remediation of the global DRAM shortages, and maybe the M6/M7 generation.

dvt 3 hours ago

Apple has Metal, which is already pretty well-integrated in llama.cpp, various Python libs, and mistral-rs & candle. Unpopular opinion, but Vulkan is hot garbage and the definition of "design by committee." There's a reason people still prefer CUDA, even though most code could likely be ported programmatically anyway.

coldtea 2 hours ago

Vulkan is not Apple.

Metal is Apple's API.

hansonkd 3 hours ago

After the steep increase in sales of Mac Studios specifically for LLMs, I'm waiting for Apple to release a frontier-level model optimized for the highest end of Apple hardware (it would probably be hardware-locked to a certain neural processor, which would then lock the memory config).

The built-in Apple Intelligence right now is very small, but even just having a small LLM you know is always there, online, fast and ready makes you think about building apps differently. I would love the context to expand from the meager ~4K tokens.

SilverElfin 3 hours ago

Why did they focus on that particular graphics card and not others, and not common laptops used by developers, or something like that?

Aurornis 3 hours ago

The repo is very vibecoded (Claude is co-author, READMEs are obviously AI).

This is the output of someone with a 3090 who pointed Claude Code at some research papers and possibly the upstream implementations of these techniques and then posted the output as original work.

VladVladikoff 3 hours ago

That's a pretty popular budget-friendly GPU people use for local AI. It actually seems like an excellent choice IMHO.

doubled112 3 hours ago

Depends on your definition of budget friendly, I suppose. I was looking around the other day and the cheapest working 24GB RTX 3090 on eBay was $1800 CAD after exchange rate, shipping and all the rest.

Hugely inflated from the $700 they were once going for. Maybe there are still deals around.

suprjami 38 minutes ago

Actually budget-friendly is the RTX 3060 12GB.

With one you can run 9B/12B models which are fine for text tasks like chatting or summarisation. Not for precision like tool calling or code.

With two of them you can run models up to Qwen 27B and 35B with a few-turn context window (8k-16k). Dense at 14t/s and MoE at 68t/s.

With three of them you can run 128k context, though you'll need a large format case and the right motherboard or PCIe riser.

I'm running three and even with a new case this setup cost me less than one 3090.

fluoridation 1 hour ago

That's insane. I bought two in December for ARS 1.2M (a little less than USD 1000). Maybe OpenClaw raised the demand.

VladVladikoff 1 hour ago

Wild. I paid $1000 CAD for mine 2 years ago; I guess things have changed.

dist-epoch 2 hours ago

Because they are hugely more useful now than running some stupid game at 240 fps instead of 60 fps.

CoolGuySteve 1 hour ago

They're not a particularly fast card compared to something like a 5070; they just have lots of VRAM.

That's why they were cheap before.

Also, "some stupid game"? Who woke up and made you king of hobbies?

declan_roberts 3 hours ago

The only thing that compares to this is probably a Mac mini with MLX models.

VHRanger 18 minutes ago

Radeon 9700 Pro or Intel Arc B70 (both $1000-1400, 32GB, 650GB/s bandwidth), or Ryzen AI Max 390 (more VRAM, less bandwidth).

The local inference space is pretty good nowadays.

lostmsu 3 hours ago

No you did not. You got 207 tok/s on an RTX 3090 with speculative decoding which, generally speaking, is not the same quality as serving the model without it.

Greedy-only decoding is even worse. There's a reason every public model comes with suggested sampling parameters. When you don't use them, output tends to degrade severely. In your case simply running a 14B model on the same hardware with the tools you compare against would probably be both faster and produce output of higher quality.

kingstnap 3 hours ago

Speculative decoding doesn't degrade output quality. The distribution it produces is exactly the same if you do it correctly. The original paper on it clearly talks about this. [0]

Speculative decoding is the same as speculative execution on CPUs. As long as you walk back on an incorrect prediction (i.e. the speculated tokens weren't accepted) then everything is mathematically exactly the same. It just uses more parallelism (specifically higher arithmetic intensity).

[0] https://arxiv.org/abs/2211.17192
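
A minimal sketch of the accept/reject rule behind that claim, on a toy three-token vocabulary. The distributions, function names, and sample count are made up for illustration; only the rule itself (accept with probability min(1, p/q), otherwise resample from the normalized positive part of p - q) follows the linked paper. It is not the code from the repo under discussion.

```python
import random


def speculative_sample(p_target, q_draft, rng):
    """Return (token, accepted) for one position.

    p_target and q_draft are probability lists over the same vocabulary.
    The returned token is distributed according to p_target.
    """
    vocab = range(len(p_target))
    # The draft model proposes a token from its own distribution q.
    x = rng.choices(vocab, weights=q_draft)[0]
    # Accept the proposal with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True
    # Rejected: resample from the positive part of (p - q).
    # random.choices normalizes the weights for us.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    return rng.choices(vocab, weights=residual)[0], False


def empirical_distribution(p_target, q_draft, n=20000, seed=0):
    """Run the rule n times; return (token frequencies, acceptance rate)."""
    rng = random.Random(seed)
    counts = [0] * len(p_target)
    accepted = 0
    for _ in range(n):
        tok, ok = speculative_sample(p_target, q_draft, rng)
        counts[tok] += 1
        accepted += ok
    return [c / n for c in counts], accepted / n


freqs, accept_rate = empirical_distribution([0.6, 0.3, 0.1], [0.2, 0.5, 0.3])
print(freqs, accept_rate)
```

The printed frequencies should land close to the target distribution even though the draft distribution is quite different; a worse draft only lowers the acceptance rate, which costs speed and not quality.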

vessenes 3 hours ago

Why is it that speculative decoding lowers quality? My understanding of it is that you use a small/distilled fast model to predict the next token, and when it doesn't match, you generate more. Checking against the large model is quick.

This should maintain exactly the quality of the original model, no?

ndriscoll 1 hour ago

AFAIU it's not that checking against the large model is quick (in the usual P!=NP sense that checking an answer is easier than finding one). It's that you can batch your checks. So you speculate the next 5 tokens, and then you can parallelize the large model running once for the batch of [...,n+1], [...,n+2], [...,n+3], [...,n+4], [...,n+5]. If you guessed right for a prefix, you turned a sequential problem (computing the next token from the current prefix) into a parallel one (doing multiple prefixes together) that the GPU likes. If you guessed wrong, you have to throw away the suffix starting at the wrong guess, and you wasted some extra energy computing.
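
The batching idea above can be sketched with toy stand-ins for the two models. Everything here is hypothetical (toy_target and toy_draft are arithmetic placeholders, not real models); the point is only that the k+1 target calls in step 2 do not depend on each other, and that the committed tokens match plain greedy decoding.

```python
def toy_target(prefix):
    # Hypothetical stand-in for the large model's greedy next token.
    return (31 * sum(prefix) + len(prefix)) % 50


def toy_draft(prefix):
    # Hypothetical cheap draft: wrong whenever the prefix length is a
    # multiple of 3, otherwise it agrees with the target.
    guess = toy_target(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 50


def speculative_greedy(prompt, n_new, k=5):
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        # 1) Draft k tokens sequentially with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(toy_draft(out + draft))
        # 2) Ask the target for its next token after each of the k+1
        #    prefixes. These calls are independent of each other, so a
        #    real engine runs them as a single batched forward pass.
        target_next = [toy_target(out + draft[:i]) for i in range(k + 1)]
        # 3) Keep the longest agreeing prefix, then the target's own
        #    token at the first disagreement (or after a full match).
        n_ok = 0
        while n_ok < k and draft[n_ok] == target_next[n_ok]:
            n_ok += 1
        out += draft[:n_ok] + [target_next[n_ok]]
    return out[: len(prompt) + n_new]


def plain_greedy(prompt, n_new):
    out = list(prompt)
    for _ in range(n_new):
        out.append(toy_target(out))
    return out


assert speculative_greedy([1, 2, 3], 40) == plain_greedy([1, 2, 3], 40)
```

Every loop iteration commits at least one token chosen by the target itself, so a bad draft costs wasted compute but never changes the output.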

lostmsu 2 hours ago

I looked it up, and you are correct with regard to the specific algorithm used. In general, there are approximate algorithms for speculative decoding.

Greedy decoding means it is still not ready though.

nodja 3 hours ago

> speculative decoding which, generally speaking, is not the same quality as serving the model without it.

I've never heard of ANY speculative decoding that wasn't lossless. If it was lossy it'd be called something else.

This page is just a port of DFlash to GGUF format. It only implements greedy decoding, like you said, so the outputs will be inferior, but not inferior to greedy decoding on the original model. Though that's just a matter of implementing temperature, top_k, etc.
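
For reference, a rough sketch of what temperature plus top_k sampling over one logits vector looks like. The function name and default values are placeholders and not taken from the repo or from Qwen's recommended settings; wiring this into a speculative verifier additionally needs the accept/reject rule from the speculative decoding paper, which this sketch does not cover.

```python
import math
import random


def sample_top_k(logits, temperature=0.7, top_k=20, rng=random):
    """Pick a token index from logits using temperature and top-k filtering."""
    if temperature <= 0.0:
        # Temperature 0 degenerates to greedy decoding (argmax).
        return max(range(len(logits)), key=lambda i: logits[i])
    # Keep only the top_k highest-scoring token indices.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax with temperature; subtract the max for numerical stability.
    m = logits[top[0]]
    weights = [math.exp((logits[i] - m) / temperature) for i in top]
    return rng.choices(top, weights=weights)[0]


rng = random.Random(0)
logits = [0.1, 2.0, -1.0, 1.5]
print([sample_top_k(logits, top_k=2, rng=rng) for _ in range(10)])
```

With top_k=2 only indices 1 and 3 can ever be returned for these logits; lowering the temperature pushes the choice further toward index 1.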

GreenGames 5 hours ago

We built a standalone C++/ggml speculative decoder for Qwen3.5-27B Q4_K_M with a DFlash block-diffusion draft.

207.6 tok/s peak (5.46x over AR); the HumanEval (HE) 10-prompt bench averages 129.5 tok/s at DDTree budget=22 on a single RTX 3090, 24 GB. That mean is 3.43x over autoregressive and 2.8x over the best public SGLang AWQ number.

TL;DR

- Peak 207.6 tok/s DFlash vs 38.0 tok/s AR (5.46x). HE bench: 129.5 tok/s mean at DDTree budget=22.
- 3.43x over autoregressive Q4_K_M baseline (37.78 tok/s).
- 2.8x vs SGLang AWQ reference (46.6 tok/s) on the same RTX 3090.
- 128K context fits on 24 GB. Q4_0 KV + rolling 4096-slot target feature buffer. 134.78 tok/s at ctx=131072.
- Only ggml. Never link libllama. ~2000 LOC C++/CUDA in libdflash27b.a around ggml_gated_delta_net.
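
For anyone checking the headline ratios, here is the arithmetic using only the throughput figures quoted in the TL;DR. The variable names are made up; the numbers are copied from the post, and note that the peak and the HE mean are divided by slightly different AR baselines (38.0 vs 37.78 tok/s).

```python
# Throughput figures in tok/s, copied from the TL;DR.
PEAK_DFLASH = 207.6
PEAK_AR = 38.0
HE_MEAN_DDTREE = 129.5
AR_BASELINE = 37.78
SGLANG_AWQ = 46.6


def speedup(new_tok_s, old_tok_s):
    """Ratio of two throughputs."""
    return new_tok_s / old_tok_s


print(f"peak vs AR:        {speedup(PEAK_DFLASH, PEAK_AR):.2f}x")         # 5.46x
print(f"HE mean vs AR:     {speedup(HE_MEAN_DDTREE, AR_BASELINE):.2f}x")  # 3.43x
print(f"HE mean vs SGLang: {speedup(HE_MEAN_DDTREE, SGLANG_AWQ):.1f}x")   # 2.8x
```

The 5.46x figure belongs to the peak number and the 3.43x and 2.8x figures belong to the benchmark mean, so the two should not be mixed when comparing against other setups.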

Why the experiment exists

Qwen3.5-27B is a hybrid model: every 4th layer is full softmax attention, the rest (48 of 64) are Gated DeltaNet. SSM state cache alongside the KV cache. That combo doesn't have a good single-3090 decode path today:

- llama.cpp has the GGUF loader and ggml_gated_delta_net, but no DFlash speculative decoding.
- vLLM / SGLang ship z-lab's DFlash integration, but only on BF16 (54 GB, doesn't fit on 24 GB).
- AWQ target on SGLang runs plain AR at 46.6 tok/s but can't host a BF16 draft + DDTree state in 24 GB.

z-lab's reference benchmarks run BF16 on B200, 54+ GB class. We wanted the fastest single-3090 decode on a 24 GB card. The answer: port only the graph glue to ggml, keep the existing DeltaNet kernel, run DFlash block-diffusion draft with a DDTree verifier, compress KV to Q4_0 for long context.
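
A tiny sketch of the layer split described above. The exact position of the attention layers within each group of four is an assumption (here the 4th, 8th, 12th, ... layer); the post only gives the ratio and the 48-of-64 count.

```python
N_LAYERS = 64
FULL_ATTN_EVERY = 4


def layer_kinds(n_layers=N_LAYERS, every=FULL_ATTN_EVERY):
    """Label each layer as full attention or Gated DeltaNet."""
    # Assumption: the last layer in each group of `every` is full attention.
    return [
        "attention" if (i + 1) % every == 0 else "deltanet"
        for i in range(n_layers)
    ]


kinds = layer_kinds()
n_attn = kinds.count("attention")   # layers that need a KV cache
n_delta = kinds.count("deltanet")   # layers that carry recurrent state
print(n_attn, n_delta)  # 16 48
```

Only the attention layers grow a KV cache with context length, which is why the long-context memory discussion later in the post centers on the KV cache and the target feature buffer.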

From autoregressive to DDTree

Same 10-prompt HE bench, n_gen=256, Q4_K_M target, BF16 draft. AL = average accept length.

DDTree paper reports +35-42% over chain DFlash on pure-attention Qwen3 variants. On our hybrid Q4_K_M/RTX 3090 combo we see +15% over chain. The gap comes from Q4 quantization flattening the draft softmax, partially patched with a chain pre-seed in build_ddtree. Draft-ceiling bound, not verify-memory bound: a bigger tree won't help, only a better draft will.

Key wins

- f16 intermediate cache: half the bandwidth, +5% at the same tree budget. Bit-identical to AR at 40 tokens.
- Persist-write kernel (ggml_gated_delta_net_tree_persist): skips a 9 ms ggml_cpy per step, +11%.
- target_feat compaction after sibling accept: unlocked real tree rescue on 9/10 prompts.
- extract_draft_topk reverse bug: sort_heap + cmp_greater already produces descending order; an extra std::reverse was sending the worst candidate to the tree root. One-line fix.
- verify_logits_buf overflow: sized vocab*q_len but DDTree reads vocab*(budget+1) past budget 15. Silent memory corruption. One-line size fix.
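
The last item in that list comes down to a row count. A small sketch of the sizing logic, assuming q_len is 16 (inferred from "past budget 15" and the block size of 16 mentioned elsewhere in the post, not confirmed) and using made-up helper names:

```python
Q_LEN = 16  # assumption: chain draft block length


def rows_read_by_ddtree(budget):
    """DDTree verify reads one logits row per tree node plus the root."""
    return budget + 1


def overflow_rows(budget, q_len=Q_LEN):
    """Rows read past the end of a buffer sized for q_len rows."""
    return max(0, rows_read_by_ddtree(budget) - q_len)


def rows_to_allocate(budget, q_len=Q_LEN):
    """Size that covers both the chain path and the tree path."""
    return max(q_len, rows_read_by_ddtree(budget))


for budget in (15, 16, 22):
    print(budget, overflow_rows(budget), rows_to_allocate(budget))
```

Under that assumption, budget 15 fits exactly, budget 16 reads one row too many, and budget 22 reads seven rows (seven times the vocab size in floats) past the end, which matches the "silent" nature of the bug: small budgets never trip it.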

128K context on 24 GB

Flash-attention in ggml-cuda supports Q4_0 K+V natively, so KV compression is just ggml_cpy with the F32->Q4_0 quantizer on write. 8x over f16. Combined with a rolling 4096-slot target_feat ring, target_feat shrinks from 6.6 GB to 0.2 GB at 128K.

Tradeoffs: Q4_0 KV costs ~3% quality on HE (AL 8.56 -> 8.33) at short context, dramatically better at long ones. Only thing that lets 128K fit on 24 GB.
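
To make the storage cost concrete, here is a pure-Python sketch of ggml-style Q4_0 block quantization: 32 values share one fp16 scale and are stored as 4-bit codes, 18 bytes per block. It mirrors the reference layout as I understand it and is not the CUDA path the post describes; the helper names are made up. By this count the saving is about 3.6x against f16 and about 7.1x against F32, so the 8x figure quoted above may be measured against a different baseline.

```python
import struct

QK4_0 = 32  # values per Q4_0 block


def quantize_q4_0_block(values):
    """Pack 32 floats into 18 bytes: fp16 scale + 16 bytes of 4-bit codes."""
    assert len(values) == QK4_0
    # Use the element with the largest magnitude (sign kept) to set the scale.
    vmax = max(values, key=abs)
    d = vmax / -8.0
    inv = 1.0 / d if d != 0.0 else 0.0
    codes = [min(15, int(v * inv + 8.5)) for v in values]
    # Element j goes in the low nibble, element j + 16 in the high nibble.
    packed = bytes(codes[j] | (codes[j + 16] << 4) for j in range(QK4_0 // 2))
    return struct.pack("<e", d) + packed


def dequantize_q4_0_block(block):
    """Inverse of quantize_q4_0_block (lossy)."""
    (d,) = struct.unpack("<e", block[:2])
    low = [((b & 0x0F) - 8) * d for b in block[2:]]
    high = [((b >> 4) - 8) * d for b in block[2:]]
    return low + high


values = [i / 10 - 1.6 for i in range(QK4_0)]
block = quantize_q4_0_block(values)
restored = dequantize_q4_0_block(block)
print(len(block), max(abs(a - b) for a, b in zip(values, restored)))

# Size ratios for one block of 32 values.
print(32 * 2 / len(block), 32 * 4 / len(block))

# Rolling ring: 131072 positions mapped onto 4096 slots is a 32x reduction,
# consistent with 6.6 GB shrinking to roughly 0.2 GB.
print(131072 // 4096, 6.6 / (131072 / 4096))
```

The rounding error per value is bounded by roughly one quantization step, which is the source of the accept-length drop the post reports for the Q4_0 KV cache.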

Prefill

- Short prompts (<=2048 tok): PREFILL_UBATCH=16. Matches DFlash block size.
- Long prompts (>2048 tok): auto-bump to PREFILL_UBATCH=192. 13K prefill: 40.9 s -> 15.07 s (2.7x, ~913 tok/s).
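
The selection rule reads like a simple threshold. A sketch under that assumption; PREFILL_UBATCH and the two sizes come from the post, while the function names and the micro-batch count are illustrative only.

```python
import math

SHORT_PROMPT_LIMIT = 2048
UBATCH_SHORT = 16    # matches the DFlash block size per the post
UBATCH_LONG = 192


def pick_prefill_ubatch(n_prompt_tokens):
    """Choose the prefill micro-batch size from the prompt length."""
    if n_prompt_tokens <= SHORT_PROMPT_LIMIT:
        return UBATCH_SHORT
    return UBATCH_LONG


def n_micro_batches(n_prompt_tokens):
    """How many forward passes the prefill takes at the chosen size."""
    return math.ceil(n_prompt_tokens / pick_prefill_ubatch(n_prompt_tokens))


print(pick_prefill_ubatch(2048), pick_prefill_ubatch(2049))
print(n_micro_batches(13000))
print(round(40.9 / 15.07, 1))  # 2.7, the quoted prefill speedup
```

Larger micro-batches mean fewer, better-utilized forward passes on long prompts, which is where the reported 2.7x comes from.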

What comes next

- Daemon mode: keep the model resident, first-token latency 10 s -> ms.
- Temperature / top-k sampling in verify. Currently greedy-only.
- Q5_K_M / Q6_K: better quants should recover most of the ~30-point accept gap vs BF16.
- Full llama.cpp integration: qwen35 arch, llama-speculative-dflash.cpp wiring.
- Metal/Vulkan: not planned. CUDA only, anyone who wants Metal can fork.

As soon as Qwen3.6-27B comes out, we'll do the same for it. Repo in the first comment (open source, MIT).

xiphias2 3 hours ago

> Temperature / top-k sampling in verify. Currently greedy-only

This is interesting, doesn't greedy-only decoding slow down speculative decoding significantly?

In theory the probability of needing resampling (rejection) is (p_real-p_sample)+, which should be much smaller with a non-greedy distribution.
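
For reference, that quantity summed over the vocabulary can be computed directly. A small sketch with made-up distributions, following the standard speculative sampling analysis in which the acceptance probability is the overlap sum of min(p, q); whether greedy verification is slower in practice depends on the actual distributions, which this toy example does not settle.

```python
def rejection_probability(p_real, p_sample):
    """Sum over tokens of the positive part of (p_real - p_sample)."""
    return sum(max(0.0, p - q) for p, q in zip(p_real, p_sample))


def acceptance_probability(p_real, p_sample):
    """Overlap of the two distributions: sum of min(p, q)."""
    return sum(min(p, q) for p, q in zip(p_real, p_sample))


p_real = [0.6, 0.3, 0.1]
p_sample = [0.2, 0.5, 0.3]
print(rejection_probability(p_real, p_sample))   # about 0.4
print(acceptance_probability(p_real, p_sample))  # about 0.6

# Greedy on both sides behaves like one-hot distributions: the draft is
# accepted only when both argmaxes agree, so rejection is all or nothing.
print(rejection_probability([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 1.0
print(rejection_probability([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 0.0
```

The two functions always sum to 1, since both distributions are normalized.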

causal 3 hours ago

Cool. If I understand correctly, though, the single kernel only works on a single GPU, right? No parallelism support to go Q8 on 2x3090?

doctorpangloss 3 hours ago

AI-authored comments are against the rules. That said, what is the point of engaging here if you won't do it in your own words?

Like, do you understand any of what you wrote?

tempaccount5050 1 hour ago

I find these comments hilarious. Are we supposed to build AI and then not use it? Super goofy.

manbash 38 minutes ago

> Don't post generated comments or AI-edited comments. HN is for conversation between humans.

These are the rules.

zdragnar 1 hour ago

It didn't really add anything to the conversation, and if I wanted to know what an LLM thought, I'd ask it. The reason for the rule is people come here to interact with other people.

fluoridation 1 hour ago

Not "not use it". Not use it to make people believe they're talking to real people.

halJordan 53 minutes ago

The crazy thing is how effort posts went from the most valuable part of this site to the most hated part of the site, at the hands of the very people claiming to be protecting the site.

varispeed 2 hours ago

What is the point of this if such a small model generally produces rubbish?

UncleOxidant 1 hour ago

Have you tried out Qwen3.5-27B? It's quite amazing that a model with only 27B parameters can do what it's doing. I've had it working on a project that has C++, Python and Verilog code. It's generating code in all 3, and very competently. I've had it look into other git repos to bring ideas from them into this one. Again, it's doing an amazingly good job with this, and it's running locally on my PC.