I would recommend trying llama.cpp's llama-server with models of increasing size until you hit the best quality / speed tradeoff with your hardware that you're willing to accept.
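As a concrete starting point, a llama-server launch might look like this (a sketch only: the model filename, context size, and port are placeholders, and exact flag availability depends on your llama.cpp build, so check `llama-server --help`):

```shell
# Serve a local GGUF quant with an OpenAI-compatible API on port 8080.
# -m:   path to the quantized model file you downloaded (placeholder name)
# -c:   context window in tokens
# -ngl: number of layers to offload to the GPU (99 = as many as possible)
llama-server -m ./Qwen3-Coder-UD-Q4_K_XL.gguf -c 16384 -ngl 99 --port 8080
```

Once it's up, point your coding agent or any OpenAI-compatible client at http://localhost:8080/v1 and compare quality and tokens/sec across model sizes.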
The Unsloth guides are a great place to start: https://unsloth.ai/docs/models/qwen3-coder-next#llama.cpp-tu...
One more thing: that guide says:
> You can choose UD-Q4_K_XL or other quantized versions.
I see eight different 4-bit quants (I assume that's the size I want?). How do I pick which one to use?
IQ4_XS
Q4_K_S
Q4_1
IQ4_NL
MXFP4_MOE
Q4_0
Q4_K_M
Q4_K_XL

Also, depending on how much regular system RAM you have, you can offload mixture-of-experts models like this one, keeping only the most important layers on your GPU. That may let you use larger, more accurate quants. llama.cpp and other frameworks support this, and it's worth looking into how to do it.
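For the MoE offload described above, llama.cpp has tensor-override flags that let you pin the expert weights to system RAM while the rest stays on the GPU. A sketch (the regex and flag spellings may vary between builds, so verify against `llama-server --help` for your version):

```shell
# Keep attention and shared layers on the GPU (-ngl 99), but force the
# per-expert feed-forward tensors -- the bulk of a MoE model's weights --
# into system RAM. The regex matches tensor names like
# "blk.12.ffn_gate_exps.weight".
llama-server -m ./model-Q4_K_XL.gguf -ngl 99 \
  --override-tensor "\.ffn_.*_exps\.=CPU"
```

Recent builds also offer a shorthand along the lines of `--n-cpu-moe N`, which keeps the expert tensors of the first N layers on the CPU; again, check what your build supports.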
The benchmark consists of a bunch of tasks. The chart shows the distribution of the number of turns taken across all those tasks.
I'm currently using Qwen 2.5 16B, and it works really well.
It's one thing to run the model without any context, but coding agents build it up close to the max, and that slows down generation massively in my experience.
The instability of the tooling outside of the LLM is what keeps me from building anything on the cloud: you're attaching your knowledge and workflow to a tool that can change dramatically based on context, cache, and model changes, and that can arbitrarily raise prices as "adaptable whales" push the cost up.
It's akin to learning everything about Beanie Babies in the early 1990s, and right when you think you understand the value proposition, suddenly they're all worthless.
So we've seen a series of big ones already -- GLM 4.7 Flash, Kimi 2.5, StepFun 3.5, and now this. Still to come is likely a new DeepSeek model, which could be exciting.
And then I expect the Big 3 (OpenAI/Google/Anthropic) to try to clog the airspace at the same time, to get in front of the potential competition.
Compared to RISC core designs or IC optimization, the pace of AI innovation is slow and easy to follow.
On a misc note: What's being used to create the screen recordings? It looks so smooth!