You have dense models (qwen 27b, gemma 31b) who are pretty smart, but pretty slow
You have MoE models (gemma 26b, qwen 35b, north mini code 30b) who are pretty fast, but make a lot of mistakes
You need a lot of memory to run these well, quantization makes tool calling weaker, so most run at 4 bit quants and are wondering why it kinda sucks and that's because you've essentially lobotomized the model (I recommend unsloth quants, i recommend 6bit for MoEs and 5bit for dense)
So you need a lot of compute to make the pre-fill fast, you need bandwidth to make the decode fast, you need a lot of memory to hold everything - lot of ifs
On top of that, your laptop becomes a loud hot churning machine, it's uncomfortable to work with.
So are they good? not really. Do they work? yes
edit: just wanna clarify - i think open models are the future, i think they're super important, i'm contributing constantly to the ecosystem - i think people should play around with these models, i think people should use `pi` and learn how it all works - but don't download a model expecting it to be good out of the box, you will have to tune and configure a lot of stuff to replace a "coding agent" that most people are using models for
The best "free" experience I've found is using OpenCode with Big Pickle. It's not especially smart, so it often won't produce the correct result the first time, but the free tier is generous enough that I don't think I've hit the limit more than twice over around a month with frequent multi-hour sessions. If running locally is truly the goal, it's not going to fit the bill, but if the goal is just "get the best experience without having to pay for a sub or tokens", it's the least bad option I've found so far.
Locally I haven’t gone much further than 8k. That is sufficient for small changes on small code bases. And you need condensed tool output.
I haven’t tried any tool that compresses the tokens yet.
I have absolutely zero interest in free. I honestly don't think I'm even remotely in the same demographic as people using free tiers / models.
I want to pay. I don't want my data used for training. I want it to be open. I want it to be consistently up (more than Claude!). I want it to be fast. I don't want it to be subsidized as that's just an excuse for shitty quality. Deepseek flash knocks it out of the park on all of these except you're data is used in training. I'm fine with it being hosted since there's no way I'm using it 24/7, but data MUST be private.
Basically I want Hetzner and OVH to run open model clouds. I'm convinced this is going to happen eventually when everyone realizes this is a commodity.
I wonder if there are competent models trained purely on permissive open-source code like MIT or Apache 2.0.
If you throw the right GPUs at the problem, they become much better - but still not quite in the realm of Sonnet or DeepSeek 4 Flash, let alone Opus / DeepSeek Pro or Mythos/Fable/GPT-5.5.
Given enough budget, power, and cooling, you can run some pretty good data pipelines, but for code, I think it still makes sense to shell out to an API provider most of the time.
> but still not quite in the realm of Sonnet or DeepSeek 4 Flash
these are not mutually exclusive anymore. DS4 has set the bar for me these days. https://github.com/antirez/ds4
Ultimately if you skip over the opportunity to play with these models on your own machine you are losing out on a lot of really interesting educational opportunities — it helps make a lot of stuff feel more concrete in a way that only tinkering can.
But then I think once I had an idea of something that I was building against Gemma 4 or Qwen 3.6 I would be looking at openrouter etc., to stabilise it for the next tier of experimentation (and to get back a kind of multi-device access without tailscale/lm link etc.).
Are they good enough to replace what people seem to want to do with Claude? Maybe not. But it's an unparalleled learning opportunity.
This setup is extremely optimized down to the last flag. Changing any param above the temp flag craters performance.
I don't have enough system RAM to properly handle the large context windows so I don't use local models.
# 1,257 tokens 17s 72.18 t/s
$env:CUDA_DEVICE_SCHEDULE = "SPIN"
cd D:\src\llama.cpp\
.\build\bin\Release\llama-server.exe `
--port 8080 `
--host 127.0.0.1 `
-m "D:\LLM\Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf" `
-fitt 2048 `
-c 98304 `
-n 32768 `
-fa on `
-np 1 `
--kv-unified `
-ctk q8_0 `
-ctv q8_0 `
-ctkd q8_0 `
-ctvd q8_0 `
-ctxcp 64 `
--mlock `
--no-warmup `
--spec-type draft-mtp `
--spec-draft-n-max 2 `
--spec-draft-p-min 0.1 `
--chat-template-kwargs '{\"preserve_thinking\": true}' `
--temp 0.6 `
--top-p 0.95 `
--top-k 20 `
--min-p 0.0 `
--presence-penalty 0.0 `
--repeat-penalty 1.0Also, funny lumping the M4 "and" the M5, I find them 15% to 45% different performance, depending.
And for a good deal of work, an M3 Studio Ultra outpaces the M4 and ties the M5 on single work at a time, outpaces both doing multiple work at a time.
The Q4_K_XL bit for those not in the know.
It outperforms all the Qwen models (even 100B+) for rule following/automation style tasks in my experience. Its image interpretation is also very good, and out-benchmarks Opus.
Qwen seems to ignore instructions and consistently outputs incorrect formats (when token generation format is not explicitly constrained)
But yes, on the DGX Spark Gemma 31B Q4 with MTP runs around 20 tok/s and Gemma 26B A4B around 60 tok/s. Still quite slow. But on a high end Nvidia card would run significantly faster and still fit in memory.
I'd recommend for anyone getting into local models to focus on memory bandwidth over RAM. Models under 100B parameters are now sufficient and hugely useful for automation.
I agree that for coding/creation use cases, there's still not a compelling argument for local models.
But e.g. if you want to scan a list of stocks and interpret news/high pass filtering, interpreting logs, interpreting screenshots, the local models are more than sufficient already.
I'm really surprised how much slower a DGX spark is for the same price.
1. Here's my command.
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \ vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \ --dtype auto \ --gpu-memory-utilization 0.95 \ --kv-cache-dtype fp8 \ --enable-chunked-prefill \ --enable-prefix-caching \ --trust-remote-code \ --enable-auto-tool-choice \ --tool-call-parser gemma4 \ --reasoning-parser gemma4 \ --max-num-batched 16000 \ --max-model-len 64000 \ --max-num-seqs 12 --speculative-config '{"model": "./gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 4}'
You can run multiple instances of these models in parallel on the DGX Spark which somewhat mitigates the difference if your task is parallelizable.
But I'd take the simplicity of a single thread and higher throughput personally.
Overall of course still better to wait for next gen devices if you can.
Gemma will just stop mid-tool call. It's been slower and I've had to reduce context size to run it. Qwen3.6 27b has been rock solid using club 3090's single card setup for agentic use -- https://github.com/noonghunna/club-3090/blob/master/docs/SIN...
E.g. prompt A to achieve X, output in format Y. Use Y to do something in prompt B.
Agentic loops will underperform deterministic control flow pipelines (with non-determinism constrained to LLM calls).
Agents are more general, which is the main advantage. But inherently a more general solution will waste context on unnecessary reasoning.
Try asking the smaller Qwen models to output a JSON in a specific format. It basically can't do it consistently with a moderately sized prompt unless you constrain the token generation via GGML or are extremely repetitive and specific about it. (Thinking disabled)
Gemma 4 will do it correctly pretty much 100% of the time. (Thinking disabled)
Applies to other rule following as well in my experience.
Qwen may be better at toolcalling and certainly probably codegen.
It seems to me Google explicitly designed Gemma for edge device automation, and didn't fine tune for agentic or coding use cases.
If you don’t need the machine to respond instantly (or explain your own business model to you) everything can be local and it’s been like that for a few years now.
Our GPU computer server cost $110k.
I don't care how many tokens per second of nonsense it can generate.
It is probably not smart enough for "design this whole architecture of this complex system from scratch, make no mistakes", but that is not something I want from a coding tool anyway. I want a model that I can point to a file and tell it to make some changes to the file and related files. Or that I can ask to review a PR with regards to certain aspects.
My suggestion is to simply try it and see what it feels like.
I find devstral (even though it’s weak generally) much better at writing and documentation than Opus. I’m actually now delegating all documentation to devstral and away from Claude, which makes a mess.
The carpenter has to get up close and personal with the wood. He can't match the crew's throughput, but maybe that's not what he's trying to do.
you get a macbook for work, you run the macbook
they're not going to start giving GPUs to employees to run local models
You generally want to run q8 or some kind of "6bit" quantization at least.
40GB of VRAM is the entry-point in my experience, you can run qwen 3.6 35b a3b with full context or qwen 27b with about 92k of context.
Before you get fully discouraged, you don't need 1 gpu with 40GBs you can use multiple cards, with minimum impact on performance.
i use it usecases like that latter and they are fine.
It is such a downgrade. I don't understand how that's even possible. The thing has so many strongly-held opinions I did not ever ask it for, talking just way too much and generally feeling somehow dumber.
Of course, being significantly larger, it will encode more knowledge, but that doesn't help me when I hate talking to it. And all that on top of the fact that talking with it costs real money.
I wonder what it might be that makes me hate it so much. Maybe because it doesn't see itself as a tool but almost an equal? As if its opinions would have weight.
Qwen too can act like an overeager intern, but if you tell it that it is an idiot, it will drop that ego. Not so much with Claude. In my experience, anyway.
Anyway, point is: full ack on that headline.
[0] - https://github.com/search?q=%22Assisted-by%22+user%3Aggml-or...
[1] - https://github.com/ggml-org/llama.cpp/blob/master/.pi/gg/SYS...
Gerganov, hope you will consider developing further the CLI cause we suffering with the server.
Curious if you can share the prefill speed too?
I run locally on a crappy desktop (some AMD iGPU with Vulkan llama.cpp, 32 GB DDR4 RAM) for experimentation. I get 15 tok/s on generation for the qwen & gemma4 MoE models. I get around 150 tok/s prefill speed.
Reason I'm asking about the prefill is looking at my stats at work, I use between 20M to peaks of 300M input tokens daily. Some of those token are cached but in general, I seem to have roughly 500x more input tokens than output. So interested in prefill tok/s stats.
Huge Thank you for llama.cpp btw!!
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32109 MiB
| model | size | params | backend | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | -------- | --: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d512 | 3714.02 ± 10.85 |
| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d1024 | 3684.86 ± 15.21 |
| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d2048 | 3650.80 ± 8.53 |
| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d8192 | 3473.88 ± 0.97 |
| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d32768 | 2754.69 ± 4.07 |
ggml_metal_device_init: GPU name: MTL0 (Apple M2 Ultra)
| model | size | params | backend | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | -------- | -: | --------------: | -------------------: |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d512 | 379.75 ± 0.21 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d1024 | 377.15 ± 0.35 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d2048 | 371.46 ± 0.91 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d8192 | 344.84 ± 0.41 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d32768 | 222.42 ± 5.29 |
Btw, based on your numbers, I think our use cases are quite different. I use the agent for very targeted sessions - basically things that are clear to me how to do, just want to automate them. My workflow is usually: new session -> read this, this and this -> do that. I.e. I don't let it wander at all in the codebase, so I rarely exceed the context window.Also, I get a lot of mileage from the ngram-based speculative decoding functionality [0] as it allows me to iterate on the implementation much faster.
I do use it the same way as you're describing on personal projects at home, in a very crude manner (pasting code snippets in llama server web UI prompt. Next will attempt OpenCode)
At work I use it in similar manner with more mature tools, but the vast majority of token spend comes from a totally different workflow: "pretend the AI is a fleet of junior/intern engineer you're delegating work to", where the agent will on its own do the implementation, commit the changes etc.
It does indeed spend a lot of tokens wandering the codebase, talking to MCPs, loading skills etc.
[0] https://huggingface.co/ggerganov/presets/blob/main/preset.in...
Plus, I never have to worry about rate limits, quotas, or sitting in a queue during peak time. And I can always see its full thoughts, don't have to worry about where my data is getting sent, and know it can't get secretly nerfed.
Running on 2x 3090, 500-1000tok/s prefill and 60tok/s output at Q6_K_XL with MTP on llama.cpp, 220k tokens context window (starts to get a bit dumb above 160k ish), no KV quantization
For this reason I wonder if local models are a potential business opportunity. Provide the service to engineering teams to give them a pre-built and setup GPU rig they can run in a closet. No need to worry about all the things you mentioned and clients can rest-assured their data isn't disappearing into a sketchy data center. There might be regulatory reasons that make on-prem setups appealing as well.
OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE=q8_0
OLLAMA_CONTEXT_LENGTH=180000
and that fits in 23GB.[edited for format]
If open models can ever hold roughly 600k token windows, I'll be really excited, I found that around 300 ~ 400k of Claude reading through your codebase results in better outputs. I also have Claude read official docs instead of "guessing" as to how to do something.
I think deepseek v4 pro has 1m context and does pretty well up to around 600k. But if you have the hardware to run that locally, you already know
Even then if there's a smaller model with 1M context, you'll need a ton of RAM to actually run it at full 1M. I guess that's why you don't see it too much. Anyone that could run Qwen 3.6 27B with 1m context would be better off running a much bigger model with smaller context instead, in the same amount of VRAM.
In terms of optimizing further, huge context + KV quantization sounds like a terrible idea, but there's some decent innovation in sparse attention, KV cache rotation allowing Q8 to perform nearly as well as full 16-bit precision, plus some ideas around offloading KV cache to system RAM (but I'm skeptical)
I think the way these models work excludes sane behaviors the larger the context gets as each token introduces potential ambiguities between "USER" and "SYSTEM" messages leading to all the catastrophic behaviors.
Anyway, with AMD395+ I'm finding ~100k is both speed and context usefulness unless it's scoped tightly. with opencode, I manage it with dynamic context pruning: https://github.com/Opencode-DCP/opencode-dynamic-context-pru... ; then anything I touch ends up being refactored so context doesn't get bloated with unecessary functions, etc.
Obviously, this isn't compatible with certain business codebases, so I can see why bloat meets bloat.
OMG this is such an annoying property, just shut the hell up please, and be concise.
I suspect that this is an artifact of the thinking property, but please just summarize the thinking process far more concisely, where a single sentence answer is more than sufficient the frontier models seem devoted to going on to a minimum of 5 paragraphs and offering 3-5 new directions.
And requests to please only offer a single step at once, or single option at once, or to even stop eagerly offering future directions is really hard to prompt correctly.
And look, there I did exactly what I was complaining about...
For example, the Claude web UI has an Instructions field where I have told it never to congratulate or praise me for asking questions. Earlier Copilot models used a ridiculous number of emoji and bullet lists when answering literally every prompt, I told it to knock that off and prefer detailed paragraphs in prose.
Local agents/frameworks/whatever all have their equivalents for overall user preferences.
Asking Claude for this provides incorrect instructions for me, so I'm guessing it moves around a lot.
Edit: also, how can I stop the LLM from all this fake glazing, as if every question I have is some sort of unique genius insight, it's so damn annoying. I just got the third straight round of this while merely trying to get summarization of a PDF:
> Good question — it gets right at a real tension in the paper. Let me check the current state of actual SV-imputation efforts, since this has moved since 2020.
This is likely due to a combination of mass funding for the AI companies, but also they are trying to governmentally restrict which countries get access to these cards so certain countries get a head start. The only way to lock that down is to have them literally locked in their own GPU prisons (data centers). Third reason is it does make it possible to train the models faster by having them in the same data center connected directly. Having them distributed to everyone would slow down training considerably.
The current way to 'own' decent RAM and GPUs right now is through the stock market it seems.
Hmm. I think I might just fundamentally disagree with Anthropic about the idea of what a "tool" should be.
I’d think the volume for that category would be low but LLMs aren’t just for coding.
Sure I could splash out a ton of money for a high ram Mac, but deepseek is so dirt cheap that I think depreciation on a high end machine costs more than my api spend.
Example of what I’m using it for: building a semantic database of podcast content (podcast discoverability sucks on an episode level). I need a cheap LLM, an embedder, a transcriber, none of which Claude will do.
My api costs for coding agents plus running apps are about ~$20/month, but I get more than just chat + Claude code.
If all I was doing was pumping an employers codebase through a coding agent, Claude would be the answer.
[0] https://twelvetables.blog/comparing-claude-fable-5s-system-p...
My Mac only has 16GB of VRAM (20GB total - 8 is reserved for the OS) so I have to leave room for VRAM, I usually find a model that fits in 5 to 7 GB of VRAM and then max the context window as much as I can.
Works perfectly fine in llama.cpp throwing 70+t/s at me with 128k q8 K/V context when using the IQ4_NL quant + MTP at q4 MTP K/V.
Also leaving this here because you might find it useful: https://hypfer.github.io/will-it-fit-llama-cpp/
Crypto (to my knowledge at least) moved away from GPU mining. I guess you could maybe rent out GPU compute, but - being in germany - it's not worth the legal hassle. You could of course always commit tax fraud, though I wouldn't recommend that.
Massive legal liability. Not worth it.
FWIW Codex/GPT models are way less this way. Maybe to a fault.
I'm setting up my DGX Spark to try Qwen 3.6 27B again, as I'm hearing a lot of good reviews. When I tried it some time ago it was still early for support in llama.cpp.
It won't happen with AI models either.
It's almost ingrained in the American business model now. Outsource everything. Nobody wants to manage a room full of servers when they can spend 2-3x as much and outsource that headache along with the responsibility for it.
Same will happen with AI. Whether that means paying Anthropic that premium or paying AWS.
I'm in a relatively small business, we recently had an outage related to our local infrastructure.
I got pressure from the CEO saying it wasn't reliable to host our own infrastructure anymore even though our total internal down time over the last 5 years is significantly less than even a single of the larger recent AWS outages.
Everyone wants to shuck the chore and the responsibility.
AI is different.
Cloud computing genuinely is cheaper on average. It's better than paying for cisco servers, and at scale, it's cheaper than managed platforms (ala Heroku), and it's a coin toss for when you're in the middle ground and constantly approaching the point of rebuilding poor-man versions of existing products but with very very expensive engineering salaries.
In contrast, local models offer dramatic savings, and are magnitude of orders better in certain aspects: like stability - the performance is all over the place with traditional AI companies as they divert compute to their next big thing.
The benefits to maintaining your own infrastructure are pretty moderate to low, with very high risk.
And also, alternate models are pretty easy to use and easy to swap out unlike the vendor lock-in that exists with cloud services.
Same reason people pay for things through the AWS marketplace (like Vanta) instead of having to go through their invoicing process.
But AI is just weights, you can run a reasonably intelligent model at home, or on a few GPUs if you're a small-medium sized company, and it doesn't require dedicated maintenance.
Same here. My job as a software dev does not require me to self-host services we need and use. Quite the opposite. But, I am reluctant to hand over all control to AWS or equivalent for several reasons that I will get into here.
I have found that Infrastructure as Code (IaC) and modern tools like opentofu, ansible, combined with frontier AI models and harnesses gives you superpowers in this space. Almost all of our self-hosted services are fully managed by these tools. e.g. We perform backups and test them more often now than we ever did before. Entirely because it is so much easier to do all of that now.
The OS needs updates, file systems get corrupted.
Fans get dirty.
All the things that you need to deal with in hosting your own server infrastructure you have to deal with when hosting your own AI infrastructure (which runs on servers...)
A lot of the reason people outsource normal software is its brittle security properties, not sure that even applies to an LLM - it can go and look up the latest security best practices just like an engineer can.
You know what gives me headaches? When I'm in the middle of a session and the model gets rug-pulled out from under me because somebody at the model provider didn't pay the Trump bill that month.
Or when someone at the model provider decides that the curve-fitting algorithm in my graphics package looks a little too much like Skynet for comfort.
Or when they do any number of other things to undermine my work for the sake of their business model, some of which I won't even notice until the damage is done.
The sad thing is, if you know how inference works, you know that it really is insanely wasteful for everybody to run it locally. If anything naturally belongs in the cloud, it's inference. But at the same time, what choice are we being given?
I think that's basically Geohot's business model at Tiny Corp.
What's interesting/exciting is that local models are _already_ quite good at tasks we never imagined AI _ever_ doing before ChatGPT hit the scene just a few short years ago.
We're also in an interesting point in time where companies are releasing the fruits of their research/labor (the LLMs) to the general public for free. For now, I think they see it in their best interest to gain mindshare and rapport, as well as advancing the state of the art in smaller LLMs ("a rising tide lifts all boats") but I fear and expect that these will dry up as the major players buy the minor players, and all will seek a return on their considerable investments in AI research.
That's what I mean by diminishing returns.
If things change to token usage billing for everyone, maybe I'll be singing a different tune but on a subscription, I don't think it makes sense financially.
Fun? Yes. Financially sound? No.
Accountants are reasonably good at figuring this out - there are a lot of different things that need a large upfront investment before you can charge anything. People still debate if they are correct in this each case.
But, for smaller more well-defined workflows, or as straight "edit this part to be like this exact" edits, they seem more than enough. Still waiting for them to become mature enough to be able to replace what we have as SOTA today, I'd say it's ready to be switched over then.
Speaking of local models, DiffusionGemma (and diffusion models in general) should not be slept on for local usage! Usually the problem locally is that the LLMs aren't efficiently making use of your hardware, unless you start batching requests and run many at the same time, but that require different approaches in general. Instead, diffusion models work much faster for individual prompts, and not by a small margin either.
Today I finally finished porting diffusiongemma-26B-A4B-it support from Transformers into Candle, and together with some optimizations I now have it basically flying with ~450 tok/s (~19 it/s) in Candle during inference, instead of ~180 tok/s (~11 it/s) from HF's Transformers library. Even using vLLM with similar sized LLMs, I don't think I've ever gotten past the ~250 tok/s threshold for single prompts, exciting stuff for local models :)
Diffusion models can't really be trained beyond low-to-mid size and have lower quality than an equally sized, plain one-token-at-a-time model.
I think people used to say the same about the 8B text-diffusion models too when they came out, like LLaDA. LLaDA2.0 seemingly claims 100B total / 6.1B active MoE diffusion (DiffusionGemma is also MoE). Not saying you're wrong about the current consensus, but it has a way of changing over time, might be a bit early to claim it's infeasible to scale them, especially considering the final artifact being much more suitable for local usage.
The 27B is the smarter, more reliable one - but it is slower. The 35B is faster, still very smart but below 27B, a bit less reliable. The reason is the MoE - Mixture of Experts architecture, which only activates a subset of parameters, making the model much much faster.
I run the 27B on a MacBook Pro M5 Max + 40 GPU cores + 128GB RAM (well, on this beast I can have 27B + 35B in memory at the same time with headroom for all the other stuff). But because this is a laptop, it is not possible to run local LLMs all the time - it just gets too hot and too loud.
What excites me more: I run the 35B model on a MacMini M4 with 64GB RAM. It is fast, it gets a lot of work done (e.g. it scans, extracts and classifies my emails, it watches the mailbox all the time and does work). I also use it as my private Hermes assistant ("when is the next Starship launch?", "who is playing today at the World Cup? Give me some trivia").
Next step I am planning is a RTX Pro 6000 Blackwell workstation I can put in my basement. I want to run qwen really fast, with multiple threads / prompts / agents at once. And MAYBE if the budget allows, a 2x RTX Pro 6000 setup in order to run DeepSeek v4 flash on it (to run research on it).
Also, it will just be faster - and more fun too.
After the recent changes to usage, I've spent an annoyingly long number of hours trying to get this to work.
Claude code supports this by setting the model to "opusplan"- it will automatically use Opus for planning and sonnet for implementation. This was completely necessary with the fable release. I was able to do this with fable and it was necessary to avoid getting quickly rate limited. In settings.json:
"env": { "ANTHROPIC_DEFAULT_OPUS_MODEL": "claude-fable-5" },
Obviously have that set to "claude-opus-4-8" now.