Posted by mpweiher 1 day ago
I've noticed that I need to be a lot more specific in those cases, up to the point where being more specific is slowing me down, partially because I don't always know what the right thing is.
I found the instructions for this scattered all over the place so I put together this guide to using Claude-Code/Codex-CLI with Qwen3-30B-A3B, 80B-A3B, Nemotron-Nano and GPT-OSS spun up with Llama-server:
https://github.com/pchalasani/claude-code-tools/blob/main/do...
Llama.cpp recently started supporting Anthropic's Messages API for some models, which makes it really straightforward to use Claude Code with these LLMs without having to resort to, say, Claude-Code-Router (an excellent library): you just set ANTHROPIC_BASE_URL.
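For anyone who wants to sanity-check the endpoint before pointing Claude Code at it, here's a minimal sketch (mine, not from the linked guide): talk to a local llama-server through the Anthropic SDK, using the same ANTHROPIC_BASE_URL variable Claude Code reads. The port, API key and model name are placeholders for whatever you launched the server with.

    # Sketch: point the Anthropic SDK at a local llama-server instead of
    # api.anthropic.com. Port and model name are placeholders.
    import os
    from anthropic import Anthropic

    os.environ["ANTHROPIC_BASE_URL"] = "http://127.0.0.1:8080"  # same variable Claude Code uses

    client = Anthropic(
        base_url=os.environ["ANTHROPIC_BASE_URL"],
        api_key="sk-local",  # a local server typically doesn't check this, but the SDK wants one
    )

    reply = client.messages.create(
        model="qwen3-30b-a3b",  # whatever name your server exposes
        max_tokens=512,
        messages=[{"role": "user", "content": "Say hi in one sentence."}],
    )
    print(reply.content[0].text)

If that round-trips, exporting the same ANTHROPIC_BASE_URL in the shell you start Claude Code from is all that's left.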
Latency is not an issue at all for LLMs; even a few hundred ms won't matter.
It doesn't make a lot of sense to me, except when working offline while traveling.
This is something that frequently comes up, and whenever I ask people to share the full prompts, I'm never able to reproduce it locally. I'm running GPT-OSS-120B with the "native" weights in MXFP4, and I've only seen "I cannot fulfill this request" when I actually expect it; not once has that happened for a "normal" request you'd expect to get a proper response for.
Has anyone else come across this when not using the lower quantizations or the 20B (so GPT-OSS-120B proper in MXFP4), and could you share the exact developer/system/user prompt that triggered it?
Just like at launch, from my point of view, this seems to be a myth that keeps propagating, and no one can demonstrate an innocent prompt that actually triggers it on the weights OpenAI themselves published. The author here does seem to have actually hit the issue, but again, no examples of actual prompts, so it's still impossible to reproduce.
I am still toying with the notion of assembling an LLM tower with a few old GPUs but I don't use LLMs enough at the moment to justify it.
Cheap tier is dual 3060 12G. Runs 24B Q6 and 32B Q4 at 16 tok/sec. The limitation is VRAM for large context: 1000 lines of code is ~20k tokens, and 32k tokens of context is ~10G of VRAM.
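Rough back-of-the-envelope for where a number in that range comes from (my own sketch; the model shape below is an assumed 32B-class transformer with GQA, not anything the parent specified):

    # KV-cache sizing, back of the envelope. Shape numbers are assumptions
    # for a roughly 32B-class model with grouped-query attention.
    n_layers   = 64      # transformer blocks
    n_kv_heads = 8       # KV heads under GQA
    head_dim   = 128
    bytes_per  = 2       # fp16/bf16 cache entries
    ctx_tokens = 32_768

    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per * ctx_tokens  # 2x for K and V
    print(f"KV cache @ {ctx_tokens} tokens: {kv_bytes / 2**30:.1f} GiB")      # ~8 GiB

    weight_bytes = 32e9 * 4.5 / 8  # 32B params at ~4.5 bits/param (Q4_K_M-ish)
    print(f"Q4 weights: {weight_bytes / 2**30:.1f} GiB")                      # ~17 GiB

So on 24G of total VRAM the Q4 weights plus a 32k-token cache already leaves very little headroom, which is exactly why context is the limit on the cheap tier.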
Expensive tier is dual 3090 or 4090 or 5090. You'd be able to run 32B Q8 with large context, or a 70B Q6.
For software, llama.cpp and llama-swap. GGUF models from HuggingFace. It just works.
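The "it just works" path is short enough to sketch; the repo id, filename and flag values here are placeholders, not recommendations:

    # Sketch: fetch a GGUF from HuggingFace and hand it to llama-server.
    # Repo id, filename and flag values are placeholders.
    import subprocess
    from huggingface_hub import hf_hub_download

    model_path = hf_hub_download(
        repo_id="Qwen/Qwen3-32B-GGUF",       # hypothetical repo id
        filename="Qwen3-32B-Q4_K_M.gguf",    # hypothetical quant filename
    )

    subprocess.run([
        "llama-server",
        "-m", model_path,
        "-c", "32768",    # context size -- this is what eats the VRAM
        "-ngl", "99",     # offload all layers to the GPUs
        "--port", "8080",
    ])

llama-swap then just sits in front of one or more of these configs and hot-swaps models on demand.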
If you need more than that, you're into enterprise hardware with 4+ PCIe slots, which costs as much as a car and draws the power of a small country. You're better off just paying for Claude Code.
Indeed, his self-hosting inspired me to get Qwen3:32B working locally in Ollama. Fits nicely on my M1 Pro 32GB (running Asahi). Output is a nice read-along speed and I haven't felt the need for anything more powerful.
I'd be more tempted by a maxed-out M2 Ultra as an upgrade, vs a tower with dedicated GPU cards. The unified memory just feels right for this task. Although I've noticed the second-hand value of those machines jumped massively in the last few months.
I know that people turn their noses up at local LLMs, but it more than does the job for me. Plus I made a New Year's resolution: no more subscriptions / Big-AdTech freebies.
I bet you could build a stationary tower for half the price with comparable hardware specs. And unless I'm missing something you should be able to run these things on Linux.
Getting a maxed-out non-Apple laptop will also be cheaper for comparable hardware, if portability is important to you.
Also, if you think Apple's RAM prices are crazy… you might be surprised at current DDR5 pricing. The $800 that Apple charges to upgrade an MBP from 64GB to 128GB is the current price of 64GB of desktop DDR5-6000, which is actually slower memory than the 8533 MT/s memory you're getting in the MacBook.
On Linux your options are the NVIDIA Spark (and other vendors' versions) or the AMD Ryzen AI series.
These are good options, but there are significant trade-offs. I don't think there are Ryzen AI laptops with 128GB RAM for example, and they are pricey compared to traditional PCs.
You also have limited upgradeability either way: the RAM is soldered.
Not an Apple fanboy, but I was under the impression that having access to up to 512GB of usable GPU memory was the main feature in favour of the Mac.
And now with Exo, you can even break the 512GB barrier.
Personally, I run a few local models (around 30B params is the ceiling on my hardware at 8k context), and I still keep a $200 ChatGPT subscription because I'm not spending $5-6k just to run models like K2 or GLM-4.6 (they're usable, but clearly behind OpenAI, Claude, or Gemini for my workflow).
I got excited about aescoder-4b (a model that specializes in web design only) after its DesignArena benchmarks, but it falls apart on large codebases and is mediocre at Tailwind.
That said, I think there's real potential in small, highly specialized models: a 4B model trained only for FastAPI, Tailwind, or a single framework. Until that actually exists and works well, I'm sticking with remote services.
If you scale that setup and add a couple of used RTX 3090s with heavy memory offloading, you can technically run something in the K2 class.
That's around 350,000 tokens in a day. I don't track my Claude/Codex usage, but Kilocode with the free Grok model does, and I'm using between 3.3M and 50M tokens in a day (plus additional usage in Claude + Codex + Mistral Vibe + Amp Coder).
I'm trying to imagine a use case where I'd want this. Maybe running some small coding task overnight? But it just doesn't seem very useful.
For coding-related tasks I consume 30-80M tokens per day, and I want something as fast as it gets.
Yesterday I asked Claude to write one function. I didn't ask it to do anything else because it wouldn't have been helpful.
Have a bunch of other side projects as well as my day job.
It's pretty easy to get through lots of tokens.
Essentially migrating codebases, implementing features, as well as all of the referencing of existing code and writing tests and various automation scripts that are needed to ensure that the code changes are okay. Over 95% of those tokens are reads, since often there’s a need for a lot of consistency and iteration.
It works pretty well if you’re not limited by a tight budget.
It's also unbeatable in price-to-performance, as the next best 24GiB card would be the 4090, which, even used, is almost triple the price these days while only offering about 25%-30% more performance in real-world AI workloads.
You can basically get an SLI-linked dual 3090 setup for less money than a single used 4090 and get about the same or even more performance and double the available VRAM.
> The tensor performance of the 3090 is also abysmal.
I for one compared my 50-series card's performance to my 3090 and didn't see "abysmal performance" on the older card at all. In fact, in actual real-world use (quantised models only; no one runs big fp32 models locally), the difference in performance isn't very noticeable. But I'm sure you'll be able to provide actual numbers (TTFT, TPS) to prove me wrong; a quick way to measure both is sketched at the end of this comment. I don't use diffusion models, so there might be a substantial difference there (I doubt it, though), but for LLMs I can tell you for a fact that you're just wrong.
> if you don't see the performance uplift everyone else sees there is something wrong with your setup and I don't have the time to debug it.
Read these two statements and think about what might be the issue. I only run what you call "toy models" (good enough for my purposes), so of course your experience is fundamentally different from mine. Spending five figures on hardware just to run models locally is usually a bad investment. Repurposing old hardware, OTOH, is just fine for playing with local models and optimising them for specific applications and workflows.
Also, the 3090 supports NVLink, which is actually more useful for inference speed than native BF16 support.
Maybe if you're training bf16 matters?
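For what it's worth, here's roughly how I'd collect those two numbers against any OpenAI-compatible local endpoint (llama-server, vLLM, etc.). The base URL and model name are placeholders, and counting stream chunks as tokens is only an approximation, so treat it as a rough probe rather than a proper benchmark:

    # Rough TTFT/TPS probe against an OpenAI-compatible endpoint (a sketch,
    # not a rigorous benchmark). Base URL and model name are placeholders.
    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")

    t0 = time.perf_counter()
    first = None
    chunks = 0
    stream = client.chat.completions.create(
        model="local",  # most single-model servers ignore this field
        messages=[{"role": "user", "content": "Write 200 words about GPUs."}],
        max_tokens=400,
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices or not chunk.choices[0].delta.content:
            continue  # skip role-only / empty chunks
        if first is None:
            first = time.perf_counter()
        chunks += 1  # roughly one chunk per generated token
    t1 = time.perf_counter()

    print(f"TTFT: {first - t0:.2f} s")
    print(f"TPS:  {chunks / (t1 - first):.1f} (approx., chunk-counted)")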
I didn’t try it long because I got frustrated waiting for it to spit out wrong answers.
But I’m open to trying again.
I do programming on and off with GPT-OSS-120B (though I mostly use Codex with hosted models), and with reasoning_effort set to high it gets things right maybe 95% of the time; it rarely gets anything wrong.
I also use it to start off MVP projects that involve both frontend and API development, but you have to be super verbose, unlike when using Claude. The context window is also small, so you need to know how to break the work up into parts that you can put together on your own.
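In case it helps anyone trying to reproduce the reasoning_effort bit above: with the OpenAI Python SDK against a local OpenAI-compatible server, I'd try passing the field through extra_body. The SDK definitely merges extra_body into the request JSON, but whether the server actually honors a "reasoning_effort" field is server-dependent, so treat both the field name and the model name below as assumptions to verify against your server's docs.

    # Sketch only: ask GPT-OSS-120B for high reasoning effort through an
    # OpenAI-compatible server. extra_body just merges extra JSON fields;
    # whether "reasoning_effort" is honored depends on the server.
    from openai import OpenAI

    client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")

    resp = client.chat.completions.create(
        model="gpt-oss-120b",  # placeholder model name
        messages=[{"role": "user", "content": "Refactor this function ..."}],
        extra_body={"reasoning_effort": "high"},  # assumption: server forwards this to the chat template
    )
    print(resp.choices[0].message.content)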
It seems so obvious to me, but I guess people are happy with Claude Code living in their home directory and slurping up secrets.
(In particular the fact that Claude Code has access to your Anthropic API key is ironic given that Dario and Anthropic spend a lot of time fearmongering about how the AI could go rogue and “attempt to escape”).
These are no more than a few dozen lines I can easily eyeball and verify with confidence; that's done in under 60 seconds and leaves Claude Code with plenty of quota for significant tasks.