I would recommend trying llama.cpp's llama-server with models of increasing size until you hit the best quality / speed tradeoff with your hardware that you're willing to accept.
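As a concrete starting point, a llama-server launch might look like this (a sketch only: the model filename, context size, and port are placeholders, and exact flag availability depends on your llama.cpp build, so check `llama-server --help`):

```shell
# Serve a local GGUF quant with an OpenAI-compatible API on port 8080.
# -m:   path to the quantized model file you downloaded (placeholder name)
# -c:   context window in tokens
# -ngl: number of layers to offload to the GPU (99 = as many as possible)
llama-server -m ./Qwen3-Coder-UD-Q4_K_XL.gguf -c 16384 -ngl 99 --port 8080
```

Once it's up, point your coding agent or any OpenAI-compatible client at http://localhost:8080/v1 and compare quality and tokens/sec across model sizes.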
The Unsloth guides are a great place to start: https://unsloth.ai/docs/models/qwen3-coder-next#llama.cpp-tu...
One more thing: that guide says:
> You can choose UD-Q4_K_XL or other quantized versions.
I see eight different 4-bit quants (I assume that's the size I want?). How do I pick which one to use?
IQ4_XS
Q4_K_S
Q4_1
IQ4_NL
MXFP4_MOE
Q4_0
Q4_K_M
Q4_K_XL

Also, depending on how much regular system RAM you have, you can offload mixture-of-experts models like this one, keeping only the most important layers on your GPU. That may let you use larger, more accurate quants. llama.cpp and other frameworks support this, and it's worth looking into how to do it.
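For the MoE offload described above, llama.cpp has tensor-override flags that let you pin the expert weights to system RAM while the rest stays on the GPU. A sketch (the regex and flag spellings may vary between builds, so verify against `llama-server --help` for your version):

```shell
# Keep attention and shared layers on the GPU (-ngl 99), but force the
# per-expert feed-forward tensors -- the bulk of a MoE model's weights --
# into system RAM. The regex matches tensor names like
# "blk.12.ffn_gate_exps.weight".
llama-server -m ./model-Q4_K_XL.gguf -ngl 99 \
  --override-tensor "\.ffn_.*_exps\.=CPU"
```

Recent builds also offer a shorthand along the lines of `--n-cpu-moe N`, which keeps the expert tensors of the first N layers on the CPU; again, check what your build supports.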
The benchmark consists of a bunch of tasks. The chart shows the distribution of the number of turns taken across all those tasks.
I'm currently using Qwen 2.5 16B, and it works really well.
It's one thing to run the model without any context, but coding agents build it up close to the max, and that slows down generation massively in my experience.
The instability of the tooling outside of the LLM is what keeps me from building anything on the cloud: you're attaching your knowledge and workflow to a tool that can change dramatically based on context, cache, and model changes, and that can arbitrarily raise prices as "adaptable whales" push the cost up.
It's akin to learning everything about Beanie Babies in the early 1990s, and right when you think you understand the value proposition, suddenly they're all worthless.
So we've seen a series of big ones already -- GLM 4.7 Flash, Kimi 2.5, StepFun 3.5, and now this. Still to come is likely a new DeepSeek model, which could be exciting.
And then I expect the Big 3 (OpenAI/Google/Anthropic) to try to clog the airspace at the same time, to get in front of the potential competition.
Compared to RISC core designs or IC optimization, the pace of AI innovation is slow and easy to follow.
On a misc note: What's being used to create the screen recordings? It looks so smooth!