Posted by alphabettsy 17 hours ago
later:
> My latest experiment was setting up vLLM (the gold standard for production and concurrent serving) and even with an NVLink (175GBP) and tensor parallelism turned on, it was 3 tokens/second slower than llama.cpp during generation for an equivalent setup.
In all my tests, getting vllm to run is worth it. It was the single biggest thing that helped with looping issues, agents going whack and losing focus on the task, and long context being essentially useless.
With an FP8 model and unquantized cache in vllm, you have a league better overall experience than with any other stack I tested. Then you can actually focus on using the model for other things and stop tinkering with settings.
My weekend project is going to be building a home inference server (from ancient datacenter parts) and I'm still massaging in my head what the end result will be.
https://github.com/noonghunna/club-3090
You can find info about running a patched version of vllm for 1x24gb, 2x and 4x. There's also quite a few "blackwell" subreddits, where people seem to share a lot of substantial information, if you're going the 6000 route.
It just throws "you can do <large number>" at you, with no real explainer regarding how it manages that and which trade-offs are made. I still don't know for certain, but I think one of those trade-offs is 3 bit context? Which is a terrible idea.
Please don't share these walls of noise. They shouldn't exist
To make 4-bit fit on one card with reasonable (100k+) context needs a bit more care though. And tuning can be highly specific to your machine, GPU and use-case. But I use a headless server, offload multi-modal to CPU, use fit-target to reduce wasted memory and use q8_0 kv since the 4090 performs well with it... In addition to most of the same config as the author elsewhere. I get 50-60tps generation with a power limit of 275W (450W is default), more than enough to offer a roughly Opus-speed feedback loop.
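For a rough sense of why the KV cache type matters when squeezing 100k+ context onto a 24 GB card, here's a back-of-envelope sketch. The layer count, KV head count and head dimension are made-up placeholders, not the real values for any Qwen model, so swap in the numbers from your model's config. The 8.5 bits per element for q8_0 comes from llama.cpp's block layout (32 int8 values plus one fp16 scale). It also assumes every layer is a full-attention layer, which will overestimate for hybrid architectures.

```python
# Back-of-envelope KV cache sizing. Model dimensions below are hypothetical
# placeholders; substitute the values from your model's config.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_elem):
    """Bytes needed to hold K and V for n_ctx tokens across all layers."""
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = one K and one V
    return elems * bits_per_elem / 8

# q8_0 stores 32 int8 values plus one fp16 scale per block: 34 bytes / 32 elems.
BITS = {"f16": 16.0, "q8_0": 8.5}

if __name__ == "__main__":
    for name, bits in BITS.items():
        size = kv_cache_bytes(n_layers=32, n_kv_heads=4, head_dim=128,
                              n_ctx=100_000, bits_per_elem=bits)
        print(f"{name}: {size / 2**30:.2f} GiB")
```

Whatever the real dimensions are, the ratio holds: q8_0 takes a bit over half the memory of an fp16 cache for the same context length.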
I haven't seen many of the issues with looping the author mentions. But I did with Qwen3.5 and in particular other 4-bit quants in the past. But the difference is probably a mix of the improvements above, as well as habits changing to avoid cases where models will loop. For what I'm doing, it seems like I loop Qwen3.6 on the same kind of prompts I'll make Haiku or Sonnet loop on (the latter hide some of their existential loops behind "thinking"). Usually it's cause I was too vague about some aspect of what I'm wanting them to do or I forgot to include some context that smaller models just don't have access to in their smaller knowledge base. But at least for what I'm doing (Rust, React, kubernetes) it's not been a notable problem at all with the latest iteration of this whole stack. And knowledge of standard libraries and default k8s resource kinds has been almost flawless.
There's still plenty of more complex stuff where I'll choose to jump straight to Claude or GLM-5.2, but if it's not worth that jump I've stopped paying for the middle ground as it's usually not much better than just one more iteration through qwen.
All this to say, if you have a 3090/4090, feel free to give the same setup a go. It's come a long way in recent weeks.
How big of a deal is looping, practically? Or, I mean, I see thinking models loop occasionally. But it seems to me that every token in the loop should be in the KV cache already, is there really no way to either power through a loop because of the 100% cache hit rate, or identify that you are in a loop that way? (As a human, when thinking hard I sometimes loop, but it is easy enough to identify…)
The cache only makes generation fast, it doesn't influence what gets chosen next. The loops that hurt the most (point 2 below) are when the model re-decides to do the same thing in different words, which is much harder to detect automatically. We're experimenting with repetition penalty and turning thinking off to solve for the 1st kind of looping (below)
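To make the repetition penalty idea concrete, here's a minimal sketch of the usual rule (the one from the CTRL paper, which as far as I know is what most samplers mean by the term). It is plain Python for illustration, not the code from vLLM or llama.cpp, and the function name is my own.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Return a copy of logits with already-generated tokens made less likely.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so both move toward "less likely" when penalty > 1.
    """
    out = list(logits)
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] = out[tok] / penalty
        else:
            out[tok] = out[tok] * penalty
    return out

if __name__ == "__main__":
    logits = [2.0, -1.0, 0.5, 3.0]
    # Tokens 0 and 1 were already generated, so only they get penalised.
    print(apply_repetition_penalty(logits, generated_ids=[0, 1, 0], penalty=2.0))
```

Note that this only discourages reusing the same tokens. It does nothing about the model re-deciding to do the same thing in different words, which is why it only targets the first kind of loop.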
2. On "why is looping a problem" for us
Practical example, which I covered in the post: "add --json to every command that does a get or list in faas-cli" - this was a small-ish, open source CLI written with Cobra, a very common framework.
If I send that to Claude (any of their models) or Codex (GPT), I would have a fully working solution the next time I opened that terminal, anywhere from a few seconds to a few minutes later.
With the local model, when it loops, you get some progress and start working on something else. Come back, maybe even 30 minutes later and see it's been printing the same 5 lines over and over constantly.
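The "same 5 lines over and over" case is at least mechanically detectable. A minimal sketch of a watchdog check, assuming you can get at the agent's output as a list of lines (the function name and thresholds are hypothetical, not something from our harness):

```python
def find_repeating_tail(lines, max_period=10, min_repeats=3):
    """Return the period p if the last p lines repeat min_repeats times
    back to back at the end of the output, else None."""
    n = len(lines)
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if needed > n:
            break  # not enough output yet to hold this many repeats
        tail = lines[n - needed:]
        block = tail[:period]
        if all(tail[i] == block[i % period] for i in range(needed)):
            return period
    return None

if __name__ == "__main__":
    output = ["start"] + ["a", "b", "c", "d", "e"] * 3
    period = find_repeating_tail(output)
    if period is not None:
        print(f"looks like a loop: last {period} lines keep repeating")
```

This only catches verbatim loops. It won't help with the other kind, where the wording changes each time around.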
Trust is important for a tool like this, and that eroded it.
The other type of loop I mention in the blog post is the "unable to solve it" loop - Han ran into that more.
"Oh I need to fix the indent from 8 to 5 characters in main.py" "Wait I don't know how to write Python code" "Oh now it's broken and I don't know what to do, maybe I should stop" "Let me edit ... " etc, etc
https://x.com/ggerganov/status/2067539416436867230?s=20
We also use it with 200-256k (native) context length.
The issue could be that folks that don't see looping aren't pushing the model as hard, or as enthusiastically.
We also had far fewer issues when thinking was turned off, than with a reasoning budget capped at 2048.
Some fine-tunes like Qwopus-Coder just seem prone to looping - google it, you'll see plenty of reports, even on Reddit.
For what it's worth, we've seen the RTX 6000 Pro loop even at fp16 on the KV cache, and with vLLM.
I feel pretty averagely smart but give me some good tooling like a good editor, a good type system, semantic grep, good testing and some solvers and I can actually deliver some work.
Maybe the trick isn't 500 billion parameters but a model super integrated with the task at hand for iteration and debugging?
FWIW the article really mirrors my own experience. I can run a small gemma4 for quick edits (and it's fast!) or data cleanup but for other tasks you do need a different tool (claude).
- tool calling
- code base exploration
- anonymizing / abstracting your request
Such that your local AI communicates to frontier model like an expensive consultant giving high level advice.
I think due to the lower latency of a local model that this could be faster.
> Local models can quickly read and explain codebases, even if they can't write them - this is a superpower
Might have been buried lower down.
And yes, latency of local on a fast card with MTP enabled can be blistering: 130-200 tokens per second sustained at full context on Q5. About 100+ on Q8.
On tool calling
> Agent Skills can help immensely - we had a local agent set up Slicer completely from scratch on a new mini PC. It even gave feedback on the usability of slicer CLI which we integrated
There's a link to a post showing some examples.
Occasionally, we'll also have the local model _review_ the changes of GPT/Opus - and it can return duds, but also insights the larger model overlooked, or was too intelligent to pick out.
So yes - absolutely blazing fast at understanding a codebase, very good at running skills "cheaply" and could be used with larger models as a "helper" / sub-agent.
In every way, the cloud products from the big two seem optimised for speed and speed of initial response even.
I don’t think most people are running local models for speed. More for control, privacy, interest, bloody-mindedness and general principle.
https://github.com/cptskippy/battlemage-llm-gateway
Opencode has been a huge productivity accelerator. I have two Hermes agents that I'm training to support my workflow with pretty good success. One is a personal assistant who manages my backlog and keeps me on task, follows up with me on items, and will put together research briefs. The other I use as a general-purpose coder and researcher and it's about 50:50 with the tasks I've given it. In fairness though, the task it failed at left me scratching my head to figure out as well.
How many tokens/sec do you get with 27b? Are you using MTP?
I am not sure whether you can find those in stock anywhere.
So in terms of raw power Nvidia is effortlessly still king, but in price-to-capacity Intel is best in class.
Intel's Battlemage GPUs also natively support SR-IOV and GPU partitioning which allows you to isolate workloads. This is useful in homelab environments if you have workloads that benefit from GPU acceleration. I was able to split the B70 into 4 virtual GPUs and hand them out to Frigate NVR, Plex, and other workloads.
Creating conversation titles and parsing HTML/JSON don't benefit from 27B models.
The B70 can run both models comfortably side-by-side so it makes better use of time and resources.
35B-A3B is what we started out with in the days of only having the 3090, but the quality is not as good, and the speed from the cards we have now can blaze at 130-200 tokens per second of generation with q5 and a full context in fp16.
Not to say that MoEs don't have their place. For people running on unified RAM, they're sometimes the only viable option due to the slowness of dense models.
Why is a dense model slower? All model weights have to be loaded and exercised for every token. Passing through 27B vs 3B (active) parameters is just maths. So yes, you will always get more tokens per second of generation from the MoE.
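The maths can be sketched as a ceiling: if decode is purely memory-bandwidth bound, each token needs one pass over the active weights, so tokens/s can't exceed bandwidth divided by the bytes of active weights. The assumptions here are mine: 936 GB/s is the RTX 3090's spec-sheet bandwidth, 0.5 bytes per parameter stands in for a 4-bit quant (real quants are a little larger), and KV cache reads, routing overhead and compute are all ignored.

```python
def decode_ceiling_tps(active_params_billion, bytes_per_param, bandwidth_gb_s):
    """Upper bound on tokens/s if decode is purely memory-bandwidth bound."""
    gb_read_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token

if __name__ == "__main__":
    bw = 936.0  # GB/s, RTX 3090 spec-sheet memory bandwidth (assumed card)
    for label, active in [("27B dense", 27.0), ("35B-A3B MoE", 3.0)]:
        tps = decode_ceiling_tps(active, 0.5, bw)
        print(f"{label}: ~{tps:.0f} tok/s ceiling")
```

Real numbers land well below these ceilings, but the 9x gap in active parameters is where the speed difference comes from. MTP spreads one weight pass over several tokens, so it can beat this per-pass bound.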
You must (just as we did) evaluate on your own products and daily work. If the MoE gives the results you need with only 3B parameters then you have your answer.
Not prescriptive at all. This is experience based, from the trenches of an actual software business, so hopefully a different perspective for folks than "Ran Qwen on my macbook, generated a great python script for me"
> One of the cards would only show up if I crossed my fingers when turning it on. Even reboots wouldn't cure it - I had to A/C power off and remove the power cable each time for 30 seconds.
This is ridiculous. Of course we are living through supply crunch, but that card is clearly defective hardware.
The RTX 3090 in question was used from eBay, no way to return it. The RTX 6000 Pro is the "new card" in question here. The 3090s remain an interesting playground for testing things like VFIO passthrough for SlicerVM and other models whilst not interrupting people on the newer card.
In the end, the most stable fix I've found is to install the older proprietary driver and disable the GSP firmware. Have had no issues since.
So "clearly defective hardware" seems like it may not be quite correct. And the thing that kept me coming back, along with not having a suitable replacement and not wanting to gamble on eBay again, was the reliability once it showed up in nvidia-smi.