
Posted by alphabettsy 17 hours ago

Local Qwen isn't a worse Opus, it's a different tool(blog.alexellis.io)
417 points | 224 comments
eurekin 11 hours ago|
> The model is running so hot, that it shoots past the goal and starts looping

later:

> My latest experiment was setting up vLLM (the gold standard for production and concurrent serving) and even with an NVLink (175GBP) and tensor parallelism turned on, it was 3 tokens/second slower than llama.cpp during generation for an equivalent setup.

In all my tests, getting vLLM to run was worth it. It was the single biggest thing that helped with looping issues, agents going whack and losing focus on the task, and long context being essentially useless.

With an FP8 model and an unquantized cache in vLLM, you get a far better overall experience than with any other stack I tested. Then you can actually focus on using the model for other things and stop tinkering with settings.

trey-jones 10 hours ago||
I'm really curious about this, not because I disagree, but because I want to avoid agents going whack. Are you running vLLM for yourself only, or for a team, or for an application, etc.? And do you feel there is a minimum hardware requirement for vLLM to be useful in this way?

My weekend project is going to be building a home inference server (from ancient datacenter parts) and I'm still massaging in my head what the end result will be.

eurekin 8 hours ago||
If I started today, with building a server, I'd jump right into verified set-ups and writeups, like this one:

https://github.com/noonghunna/club-3090

You can find info about running a patched version of vllm for 1x24gb, 2x and 4x. There's also quite a few "blackwell" subreddits, where people seem to share a lot of substantial information, if you're going the 6000 route.

hypfer 7 hours ago||
That writeup is completely unhinged and utterly incomprehensible to follow.

It just throws "you can do <large number>" at you, with no real explainer regarding how it manages that and which trade-offs are made. I still don't know for certain, but I think one of those trade-offs is 3 bit context? Which is a terrible idea.

Please don't share these walls of noise. They shouldn't exist

Iolaum 8 hours ago||
Why unquantized instead of Q8 ?
eurekin 8 hours ago||
Noticed a few cliffs. Sometimes it was a spurious stop (I had to write "go on" or "continue" to restart); other times it would randomly say "Oh the user wants [the thing we already resolved]" and go back in history. It all cleared up on fp16.
nessex 13 hours ago||
This is a great post that covers a lot of the recent ground. I have a very similar setup after a very similar journey, minus the RTX6000. Worth noting though that a lot of the recent changes make a single 3090/4090 much more viable here too. MTP and the recent improvements to kv quantization in particular, as well as model-specific template & quant fixes. I run a 4090 with the 4-bit quantized variant of the same model now and have had a great experience. Qwen3.5 was already a big step up, but with 3.6 and the rest of the improvements it's substantially more reliable as a daily use tool and I find myself reaching for hosted models a lot less. Feels like I could work entirely without them if they were to disappear without going back to typing every line of code myself.

Making 4-bit fit on one card with reasonable (100k+) context needs a bit more care though. And tuning can be highly specific to your machine, GPU and use case. But I use a headless server, offload multi-modal to CPU, use fit-target to reduce wasted memory and use q8_0 KV since the 4090 performs well with it... in addition to most of the same config as the author elsewhere. I get 50-60 tps generation with a power limit of 275W (450W is default), more than enough to offer a roughly Opus-speed feedback loop.
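To see why the KV cache type matters so much for fitting 100k+ context on a 24 GB card, here is a rough back-of-the-envelope sketch. The layer count, KV head count and head dimension are hypothetical placeholders, not the real shape of any Qwen model, and the formula assumes a plain attention stack where every layer stores K and V (models that mix in other layer types will need less).

```python
# Bytes per stored element for two KV cache types.
# q8_0 packs 32 int8 values plus one fp16 scale: 34 bytes per 32 elements.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, cache_type):
    """Approximate KV cache size in GiB, assuming every layer stores K and V."""
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim  # K plus V
    total_bytes = elements_per_token * context_tokens * BYTES_PER_ELEMENT[cache_type]
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(n_layers=64, n_kv_heads=8, head_dim=128)
    for cache_type in ("f16", "q8_0"):
        size = kv_cache_gib(context_tokens=100_000, cache_type=cache_type, **shape)
        print(f"{cache_type}: {size:.1f} GiB")
```

Whatever the real shape is, q8_0 comes out at a bit over half the fp16 size, which is the headroom that lets the weights and a long context share one card.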

I haven't seen many of the issues with looping the author mentions. But I did with Qwen3.5 and in particular other 4-bit quants in the past. But the difference is probably a mix of the improvements above, as well as habits changing to avoid cases where models will loop. For what I'm doing, it seems like I loop Qwen3.6 on the same kind of prompts I'll make Haiku or Sonnet loop on (the latter hide some of their existential loops behind "thinking"). Usually it's cause I was too vague about some aspect of what I'm wanting them to do or I forgot to include some context that smaller models just don't have access to in their smaller knowledge base. But at least for what I'm doing (Rust, React, kubernetes) it's not been a notable problem at all with the latest iteration of this whole stack. And knowledge of standard libraries and default k8s resource kinds has been almost flawless.

There's still plenty of more complex stuff where I'll choose to jump straight to Claude or GLM-5.2, but if it's not worth that jump I've stopped paying for the middle ground as it's usually not much better than just one more iteration through qwen.

All this to say, if you have a 3090/4090, feel free to give the same setup a go. It's come a long way in recent weeks.

piterrro 7 hours ago||
This is amazing, but for everyone out there wanting to buy and build their own AI rig, I recommend connecting to one of the many inference providers and trying out different models for a while. It costs pennies but can give you a nice preview of what you can get with your own rig. Just a friendly tip.
bee_rider 5 hours ago||
Tangential question (since they brought it up in the article) from someone not involved in AI performance optimization:

How big of a deal is looping, practically? Or, I mean, I see thinking models loop occasionally. But it seems to me that every token in the loop should be in the KV cache already, is there really no way to either power through a loop because of the 100% cache hit rate, or identify that you are in a loop that way? (As a human, when thinking hard I sometimes loop, but it is easy enough to identify…)

alexellisuk 3 hours ago|
1. On the technical:

The cache only makes generation fast, it doesn't influence what gets chosen next. The loops that hurt the most (point 2 below) are when the model re-decides to do the same thing in different words, which is much harder to detect automatically. We're experimenting with repetition penalty and turning thinking off to solve for the 1st kind of looping (below)

2. On "why is looping a problem" for us

Practical example, which I covered in the post: "add --json to every command that does a get or list in faas-cli" - this was a small-ish, open source CLI written with Cobra, a very common framework.

If I send that to Claude (any of their models) or Codex (GPT), I would have a fully working solution the next time I opened that terminal - a few seconds - a few minutes.

With the local model, when it loops, you get some progress and start working on something else. Come back, maybe even 30 minutes later and see it's been printing the same 5 lines over and over constantly.

Trust is important for a tool like this, that eroded it.
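For the "same 5 lines over and over" case specifically, a watchdog on the agent's output is easy to sketch. This is only an illustration, not what we run: the function name and thresholds are made up, and it catches verbatim repetition only, not the harder case where the model re-decides the same thing in different words.

```python
def find_repeating_block(lines, max_block=10, min_repeats=3):
    """Return (block_size, repeats) if the tail of `lines` is one block
    repeated back-to-back at least `min_repeats` times, else None."""
    for size in range(1, max_block + 1):
        if len(lines) < size * min_repeats:
            break  # not enough output for this or any larger block size
        block = lines[-size:]
        repeats = 1
        end = len(lines) - size
        # Walk backwards while the preceding chunk equals the final block.
        while end >= size and lines[end - size:end] == block:
            repeats += 1
            end -= size
        if repeats >= min_repeats:
            return size, repeats
    return None


if __name__ == "__main__":
    output = ["start"] + ["a", "b", "c", "d", "e"] * 4
    print(find_repeating_block(output))
```

A harness could call this on the last few hundred lines of output and stop or restart the run when it returns a match, instead of finding out 30 minutes later.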

The other type of loop I mention in the blog post is "unable to solve it" loop - Han ran into that more.

"Oh I need to fix the indent from 8 to 5 characters in main.py" "Wait I don't know how to write Python code" "Oh now it's broken and I don't know what to do, maybe I should stop" "Let me edit ... " etc, etc

krzyk 10 hours ago||
3090 and 2x3090 are quite popular. But if you use a gigantic (for local models) context of 200k it will go south pretty quickly - any quantization of context quickly becomes the issue.
alexellisuk 7 hours ago|
I think it's quite telling that Georgi replied that he uses Qwen with 131k context.

https://x.com/ggerganov/status/2067539416436867230?s=20

We also use it with 200-256k (native) context length.

The issue could be that folks that don't see looping aren't pushing the model as hard, or as enthusiastically.

We also had far fewer issues when thinking was turned off, than with a reasoning budget capped at 2048.

Some fine-tunes like Qwopus-Coder just seem prone to looping - google it, you'll see plenty of reports, even on Reddit.

For what it's worth, we've seen the RTX 6000 Pro loop even at fp16 on the KV cache - and with vLLM.

teh 11 hours ago||
I sometimes wonder how much of intelligence is being good with tools.

I feel pretty averagely smart but give me some good tooling like a good editor, a good type system, semantic grep, good testing and some solvers and I can actually deliver some work.

Maybe the trick isn't 500 billion parameters but a model super integrated with the task at hand for iteration and debugging?

FWIW the article really mirrors my own experience. I can run a small gemma4 for quick edits (and it's fast!) or data cleanup but for other tasks you do need a different tool (claude).

whazor 13 hours ago||
Would be interesting to use local models for:

- tool calling

- code base exploration

- anonymizing / abstracting your request

Such that your local AI communicates to frontier model like an expensive consultant giving high level advice.

I think due to the lower latency of a local model that this could be faster.

alexellisuk 7 hours ago||
One of the things I mentioned in the post:

> Local models can quickly read and explain codebases, even if they can't write them - this is a superpower

Might have been buried lower down.

And yes, latency of local on a fast card with MTP enabled can be blistering: 130-200 tokens per second sustained at full context on Q5. About 100+ on Q8.

On tool calling

> Agent Skills can help immensely - we had a local agent set up Slicer completely from scratch on a new mini PC. It even gave feedback on the usability of slicer CLI which we integrated

There's a link to a post showing some examples.

Occasionally, we'll also have the local model _review_ the changes of GPT/Opus - and it can return duds, but also insights the larger model overlooked, or was too intelligent to pick out.

So yes - absolutely blazing fast at understanding a codebase, very good at running skills "cheaply" and could be used with larger models as a "helper" / sub-agent.

asimovDev 11 hours ago|||
I used the Qwen 27b 8-bit MLX version on a decompiled Android APK recently. It successfully identified how it worked, even the obfuscated classes and methods. It wrote 1000 lines of documentation with examples, but the time was dreadful. At some point it slowed down to 5 t/s, so the whole thing took over an hour; the writing of documentation alone was over 40 minutes, fans blasting the entire time.
trey-jones 10 hours ago||
I know it uses electricity, but part of the benefit of a local model has to be that you can let it do this while you sleep, and not pay Anthropic for an unknown number of tokens.
asimovDev 9 hours ago||
yeah i totally understand and I am thoroughly impressed it works. And the electricity cost isn't that bad since it was on an ARM laptop (MacBook M3 Max) and not a beefy workstation with a GPU. I just let the agent do its work while watching the World Cup.
dofm 12 hours ago||
I doubt your experience of local models would be of lower latency, except for quite small models in edge uses.

In every way, the cloud products from the big two seem optimised for speed and speed of initial response even.

I don’t think most people are running local models for speed. More for control, privacy, interest, bloody-mindedness and general principle.

cptskippy 15 hours ago||
I've been running qwen3-5-9b-q4-k-m and qwen3-6-27b-q6-k simultaneously on an Intel Arc Pro B70 with a lot of success.

https://github.com/cptskippy/battlemage-llm-gateway

Opencode has been a huge productivity accelerator. I have two Hermes agents that I'm training to support my workflow with pretty good success. One is a personal assistant who manages my backlog and keeps me on task, follows up with me on items, and will put together research briefs. The other I use as a general-purpose coder and researcher, and it's about 50:50 with the tasks I've given it. In fairness though, the task it failed at left me scratching my head to figure out as well.

hbbio 15 hours ago||
Interesting setup, thx for sharing.

How many tokens/sec do you get with 27b? Are you using MTP?

askvictor 14 hours ago|||
Does Intel make decent GPUs now? I must be out of the loop...
speedgoose 13 hours ago|||
They released a few good value GPUs for LLM inference about a year ago: more memory than AMD and NVIDIA consumer GPUs, not too expensive, but also not great tokens/watt.

I am not sure whether you can find those in stock anywhere.

cptskippy 3 hours ago|||
I'm using an Intel Arc Pro B70, which has 32 GB of VRAM. It's estimated to get ~35-45 t/s at $21-27 per t/s. An RTX 5090 is ~61 t/s at ~$33 per t/s.

So in terms of raw power Nvidia is effortlessly still king, but in price-to-capacity Intel is best in class.
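The price-per-throughput figures are just card price divided by generation speed, so the prices they imply can be backed out. A small sketch of that arithmetic, using only the numbers quoted above; the resulting prices are implied by those figures, not prices stated here.

```python
def dollars_per_tps(price_usd, tokens_per_second):
    """Cost efficiency: dollars paid per token/second of generation speed."""
    return price_usd / tokens_per_second


def implied_price(cost_per_tps, tokens_per_second):
    """Invert the metric: the card price implied by a quoted $ per t/s figure."""
    return cost_per_tps * tokens_per_second


if __name__ == "__main__":
    # Both ends of the B70 range point at the same implied price.
    print(implied_price(27, 35))      # 945
    print(implied_price(21, 45))      # 945
    # RTX 5090 figures quoted above.
    print(implied_price(33, 61))      # 2013
    print(dollars_per_tps(945, 35))   # 27.0
```

Raw speed favours the 5090 (61 vs 35-45 t/s), while the B70 gets each token/second for roughly two thirds of the cost or less.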

Intel's Battlemage GPUs also natively support SR-IOV and GPU partitioning which allows you to isolate workloads. This is useful in homelab environments if you have workloads that benefit from GPU acceleration. I was able to split the B70 into 4 virtual GPUs and hand them out to Frigate NVR, Plex, and other workloads.

jauntywundrkind 14 hours ago||
What's the value running the smaller model too? Why not just the big model for everything? I note both are dense, as well.
Ritewut 14 hours ago||
Tokens per second. The difference between 8B and something like 16B is not as big as you might think in practical usage and 8B is a lot faster and interactive than 16B but there are certain things where it is useful to farm it out to the large model.
Natalia724 14 hours ago|||
Agree. For local coding help, latency often matters more than raw benchmark quality. A slightly weaker model that answers immediately changes how often you reach for it.
cptskippy 3 hours ago|||
Exactly this.

Creating conversation titles and parsing HTML/JSON don't benefit from 27B models.

The B70 can run both models comfortably side-by-side so it makes better use of time and resources.
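A minimal sketch of what that split can look like in code. The model names are the two mentioned earlier in the thread; the task labels and the `pick_model` helper are hypothetical and not taken from the linked gateway repo.

```python
SMALL_MODEL = "qwen3-5-9b-q4-k-m"   # fast, used for mechanical chores
LARGE_MODEL = "qwen3-6-27b-q6-k"    # slower, used for everything else

# Hypothetical task labels for work that does not benefit from the 27B model.
LIGHT_TASKS = {"title", "parse_html", "parse_json"}


def pick_model(task_kind):
    """Route cheap, mechanical tasks to the small model, the rest to the large one."""
    return SMALL_MODEL if task_kind in LIGHT_TASKS else LARGE_MODEL


if __name__ == "__main__":
    for kind in ("title", "parse_json", "refactor"):
        print(kind, "->", pick_model(kind))
```

Defaulting unknown task kinds to the larger model is a deliberate choice here: a misrouted chore costs some time, while a misrouted hard task costs quality.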

zkmon 13 hours ago||
The article seems to talk a lot about 27B. In my experience, 35B-A3B was equally good in quality and the MoE gave more tg/s.
alexellisuk 7 hours ago|
The important thing about MoEs, which I mention in the conclusion, is that they carry fewer (way fewer) active parameters during inference/generation.

35B-A3B is what we started out with in the days of only having the 3090, but the quality is not as good, and the speed from the cards we have now can blaze at 130-200 tokens per second of generation with q5 and a full context in fp16.

Not to say that MoEs don't have their place. For people running on unified RAM, they're sometimes the only viable option due to the slowness of dense models.

Why is a dense model slower? All model weights have to be loaded and exercised. Passing through 27B vs 3B (active) is maths. So yes you will always get more tokens per second of generation.
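To put rough numbers on "it's maths": if generation is limited by how fast the active weights can be read from memory, an upper bound on tokens per second is bandwidth divided by the bytes of weights touched per token. The sketch below is a simplification under that assumption. It ignores KV cache reads, compute, MTP and MoE routing overhead, and the 1000 GB/s bandwidth and 5 bits per weight are placeholder values.

```python
def max_tokens_per_second(active_params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound on generation speed if every active weight is read once per token."""
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    bandwidth = 1000  # GB/s, hypothetical card
    dense = max_tokens_per_second(27, 5, bandwidth)  # 27B dense, ~5-bit quant
    moe = max_tokens_per_second(3, 5, bandwidth)     # 3B active (A3B), same quant
    print(f"dense 27B bound: {dense:.0f} t/s")
    print(f"MoE A3B bound:   {moe:.0f} t/s")
    print(f"ratio: {moe / dense:.1f}x")
```

The ratio between the two bounds depends only on the active parameter counts (27B vs 3B), not on the assumed bandwidth; real-world gaps are usually smaller because of the overheads this ignores.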

You must (just as we did) evaluate on your own products and daily work. If the MoE gives the results you need with only 3B parameters then you have your answer.

Not prescriptive at all. This is experience-based, from the trenches of an actual software business, so hopefully a different perspective for folks than "Ran Qwen on my macbook, generated a great python script for me".

watt 10 hours ago|
I find it strange that software people will accept this level of flakiness from the hardware. Normally you would just send the card back, and request a replacement.

> One of the cards would only show up if I crossed my fingers when turning it on. Even reboots wouldn't cure it - I had to A/C power off and remove the power cable each time for 30 seconds.

This is ridiculous. Of course we are living through supply crunch, but that card is clearly defective hardware.

alexellisuk 7 hours ago|
Ha, you underestimate how dogged you need to be to get this stuff working well.

The RTX 3090 in question was used from eBay, no way to return it. The RTX 6000 Pro is the "new card" in question here. The 3090s remain an interesting playground for testing things like VFIO passthrough for SlicerVM and other models whilst not interrupting people on the newer card.

In the end, the most stable fix I've found is to install the older proprietary driver and disable the GSP firmware. Have had no issues since.

So "clearly defective hardware" seems like it may not be quite correct. And the thing that kept me coming back - along with not having a suitable replacement - or having to gamble on eBay again was the reliability once it showed up in nvidia-smi.
