Running local models is good now

Posted by jfb 3 hours ago

Running local models is good now(vickiboykis.com)

495 points | 244 comments

c0rruptbytes 1 hour ago|

I don't know about good, I use a lot of local models and they're still pretty painful to run locally

You have dense models (qwen 27b, gemma 31b) who are pretty smart, but pretty slow

You have MoE models (gemma 26b, qwen 35b, north mini code 30b) who are pretty fast, but make a lot of mistakes

You need a lot of memory to run these well, quantization makes tool calling weaker, so most run at 4 bit quants and are wondering why it kinda sucks and that's because you've essentially lobotomized the model (I recommend unsloth quants, i recommend 6bit for MoEs and 5bit for dense)

So you need a lot of compute to make the pre-fill fast, you need bandwidth to make the decode fast, you need a lot of memory to hold everything - lot of ifs

On top of that, your laptop becomes a loud hot churning machine, it's uncomfortable to work with.

So are they good? not really. Do they work? yes

edit: just wanna clarify - i think open models are the future, i think they're super important, i'm contributing constantly to the ecosystem - i think people should play around with these models, i think people should use `pi` and learn how it all works - but don't download a model expecting it to be good out of the box, you will have to tune and configure a lot of stuff to replace a "coding agent" that most people are using models for

saghm 1 hour ago||

This is basically my experience as well. I have a moderately recent but high spec desktop (Radeon 6900 XT with 16 GB VRAM, Ryzen 9 7900X 12-core, 64 GB system RAM), and I tried out some recommended models with ollama a month or two ago. Anything not geared specifically towards coding seemed to struggled with actually making tool calls instead of just stating the actions they would take without making them (and trying to get help from them to explain what I needed to configure to change that behavior was useless; qwen refused to believe that it was running in ollama and insisted that it was running from the Alibaba cloud without access to my local system), and the models intended for coding were barely thinking faster than I could type (if they had any ability to show thinking at all).

The best "free" experience I've found is using OpenCode with Big Pickle. It's not especially smart, so it often won't produce the correct result the first time, but the free tier is generous enough that I don't think I've hit the limit more than twice over around a month with frequent multi-hour sessions. If running locally is truly the goal, it's not going to fit the bill, but if the goal is just "get the best experience without having to pay for a sub or tokens", it's the least bad option I've found so far.

redmalang 1 minute ago|||

Try llama.cpp it seems to be a lot more performant and a lot more hackable. Also I'm surprised how substantial the impact of some of the inference configs (beyond just temp) can have, though this is much more model specific.

spockz 5 minutes ago||||

For what it is worth, I’m on a similar machine. (9070XT,5900X) and found a lot of performance improvement over ollama by compiling llama.cpp and running with —no-mmap and —perf. The context is still quite small though. With online models I use contexts of at least 200k which is useful for longer running/more complicated commands.

Locally I haven’t gone much further than 8k. That is sufficient for small changes on small code bases. And you need condensed tool output.

I haven’t tried any tool that compresses the tokens yet.

ryukoposting 2 minutes ago||||

I found that, with the heavily quantized Qwen3 models I can cram onto my 3060 Ti, telling the model to use its tools in the system prompt made it a lot more likely to actually do it. YMMV of course, but give it a shot.

rapind 21 minutes ago|||

> The best "free" experience I've found is using OpenCode with Big Pickle.

I have absolutely zero interest in free. I honestly don't think I'm even remotely in the same demographic as people using free tiers / models.

I want to pay. I don't want my data used for training. I want it to be open. I want it to be consistently up (more than Claude!). I want it to be fast. I don't want it to be subsidized as that's just an excuse for shitty quality. Deepseek flash knocks it out of the park on all of these except you're data is used in training. I'm fine with it being hosted since there's no way I'm using it 24/7, but data MUST be private.

Basically I want Hetzner and OVH to run open model clouds. I'm convinced this is going to happen eventually when everyone realizes this is a commodity.

darkmarmot 4 minutes ago|||

Hard to guarantee it's private if you don't keep it local... I don't have a lot of trust for companies in this space.

aamoscodes 11 minutes ago||||

You can pay, and also use deepseek-v4-flash. OpenRouter even lets you "block" or limit your usage to providers that don't train on data. Since the weights are open, other companies are already serving the model on non-DeepSeek owned hardware: https://openrouter.ai/deepseek/deepseek-v4-flash

Bnjoroge 8 minutes ago||||

You can specify which providers you want to serve your model in OpenRouter. Then you can chose US-based ones.

bel8 2 minutes ago|||

These competent open models you want to use were trained on data from people like you and me.

I wonder if there are competent models trained purely on permissive open-source code like MIT or Apache 2.0.

aftbit 1 hour ago|||

IMO running local models "well" still requires an expensive hardware investment. You really want 96GB of VRAM on a modern Blackwell arch to run these models with decent KV cache. Trying to run them on a unified memory Mac, an AI Max AMD processor, or a DGX Spark-alike is really just asking for trouble. Prefill kills perf.

If you throw the right GPUs at the problem, they become much better - but still not quite in the realm of Sonnet or DeepSeek 4 Flash, let alone Opus / DeepSeek Pro or Mythos/Fable/GPT-5.5.

Given enough budget, power, and cooling, you can run some pretty good data pipelines, but for code, I think it still makes sense to shell out to an API provider most of the time.

jtbaker 3 minutes ago|||

> Trying to run them on a unified memory Mac

> but still not quite in the realm of Sonnet or DeepSeek 4 Flash

these are not mutually exclusive anymore. DS4 has set the bar for me these days. https://github.com/antirez/ds4

wincy 5 minutes ago||||

If I could just save up $6000 I could sell off my RTX 5090 for $4,000 and buy an RTX 6000 Blackwell Pro Workstation. I can fit models into the 32GB of vram but my context window ends up being tiny for any halfway capable model.

dofm 26 minutes ago||||

FWIW I think it might be both.

Ultimately if you skip over the opportunity to play with these models on your own machine you are losing out on a lot of really interesting educational opportunities — it helps make a lot of stuff feel more concrete in a way that only tinkering can.

But then I think once I had an idea of something that I was building against Gemma 4 or Qwen 3.6 I would be looking at openrouter etc., to stabilise it for the next tier of experimentation (and to get back a kind of multi-device access without tailscale/lm link etc.).

Are they good enough to replace what people seem to want to do with Claude? Maybe not. But it's an unparalleled learning opportunity.

eek2121 1 hour ago|||

Not really, Qwen 27b offloads to a decent gaming GPU (RTX 4090 in my case) without needing tons of RAM.

mathisfun123 1 hour ago||

can you give more info? llama.cpp vs vllm? config? i wanna try specifically this model

zozbot234 1 hour ago|||

Maybe we shouldn't be running these models on laptops with their thermally constrained form factor, and we shouldn't expect quick inference on a par with a large cloud-based platform either, at least not for near-SOTA model quality. It's still worth it to avoid becoming massively reliant on centralized services.

greenavocado 1 hour ago||

I have a 5070 12 GB laptop GPU and can hit 72 tokens per second in the first couple thousand tokens before dropping to mid-high 50s after about 15k context.

This setup is extremely optimized down to the last flag. Changing any param above the temp flag craters performance.

I don't have enough system RAM to properly handle the large context windows so I don't use local models.

  # 1,257 tokens 17s 72.18 t/s

  $env:CUDA_DEVICE_SCHEDULE = "SPIN"
  cd D:\src\llama.cpp\
  .\build\bin\Release\llama-server.exe `
    --port 8080 `
    --host 127.0.0.1 `
    -m "D:\LLM\Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf" `
    -fitt 2048 `
    -c 98304 `
    -n 32768 `
    -fa on `
    -np 1 `
    --kv-unified `
    -ctk q8_0 `
    -ctv q8_0 `
    -ctkd q8_0 `
    -ctvd q8_0 `
    -ctxcp 64 `
    --mlock `
    --no-warmup `
    --spec-type draft-mtp `
    --spec-draft-n-max 2 `
    --spec-draft-p-min 0.1 `
    --chat-template-kwargs '{\"preserve_thinking\": true}' `
    --temp 0.6 `
    --top-p 0.95 `
    --top-k 20 `
    --min-p 0.0 `
    --presence-penalty 0.0 `
    --repeat-penalty 1.0

themanualstates 58 minutes ago|||

That’s useless without describing WHY you chose those flags, and how you did the optimisation…

ridiculous_leke 18 minutes ago||||

Can you comment on the quality and accuracy of it? People have managed to run Gemma 26b without GPU on old CPUs but I don't think quality is anywhere close to what Gemma 12b offers.

nateb2022 1 hour ago||||

I get over 100 tok/s sustained on my M4 Max and M5 Max, in MacBook Pro's. LM Studio + MLX.

Terretta 25 minutes ago||

With Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf?

Also, funny lumping the M4 "and" the M5, I find them 15% to 45% different performance, depending.

And for a good deal of work, an M3 Studio Ultra outpaces the M4 and ties the M5 on single work at a time, outpaces both doing multiple work at a time.

mattmanser 55 minutes ago|||

That's a quant 4 which the thread OP specifically called out as rubbish.

The Q4_K_XL bit for those not in the know.

stymaar 12 minutes ago|||

Anyone calling Qwen3.6-35B-A3B-Q4_K_XL “rubish” has no idea what they are talking about.

c0rruptbytes 1 minute ago|||

q4 isn't rubbish, but it's a compromise for a good value, q6 is essentially a no-compromise quantization and it's what i recommend for MoEs in my experience for agentic workflows

greenavocado 9 minutes ago|||

He's probably calling me out for this comment https://news.ycombinator.com/item?id=48557579

greenavocado 13 minutes ago|||

I typically find myself using a context of between 150-500k with GPT models so local models are simply not enough and I stopped using them.

stymaar 8 minutes ago||

That's way higher than their optimal ceiling (and absolutely suboptimal from a token cost point of view), why are you doing that?

greenavocado 7 minutes ago||

You're 100% right. I really try to avoid it, but when reconciling APIs across two large codebases you really start pressing north of 200k. I find myself topping out at 800k sometimes and that's with careful context management. I actually had to drop to GPT 5.4 for 1M context in my subscription because GPT 5.5 tops out at 272k. Hitting 800k context is better than repeatedly hitting let's say 200k out of 272k with multiple rounds of compaction. I run Can's snapcompact and while its better than normal compaction it still lobotomizes the model more than running with a very high context window.

adam_arthur 1 hour ago|||

Gemma 4 is particularly good at pipeline/automation tasks.

It outperforms all the Qwen models (even 100B+) for rule following/automation style tasks in my experience. Its image interpretation is also very good, and out-benchmarks Opus.

Qwen seems to ignore instructions and consistently outputs incorrect formats (when token generation format is not explicitly constrained)

But yes, on the DGX Spark Gemma 31B Q4 with MTP runs around 20 tok/s and Gemma 26B A4B around 60 tok/s. Still quite slow. But on a high end Nvidia card would run significantly faster and still fit in memory.

I'd recommend for anyone getting into local models to focus on memory bandwidth over RAM. Models under 100B parameters are now sufficient and hugely useful for automation.

I agree that for coding/creation use cases, there's still not a compelling argument for local models.

But e.g. if you want to scan a list of stocks and interpret news/high pass filtering, interpreting logs, interpreting screenshots, the local models are more than sufficient already.

trouve_search 1 hour ago|||

On a 5090, gemma4 26B runs at 350TPS with the command below [1] and gemma4 31B is around 150TPS with a similar command.

I'm really surprised how much slower a DGX spark is for the same price.

1. Here's my command.

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \ vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \ --dtype auto \ --gpu-memory-utilization 0.95 \ --kv-cache-dtype fp8 \ --enable-chunked-prefill \ --enable-prefix-caching \ --trust-remote-code \ --enable-auto-tool-choice \ --tool-call-parser gemma4 \ --reasoning-parser gemma4 \ --max-num-batched 16000 \ --max-model-len 64000 \ --max-num-seqs 12 --speculative-config '{"model": "./gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 4}'

adam_arthur 51 minutes ago||

Yes, I'd recommend a 5090 over the DGX Spark if your goal is general automation.

You can run multiple instances of these models in parallel on the DGX Spark which somewhat mitigates the difference if your task is parallelizable.

But I'd take the simplicity of a single thread and higher throughput personally.

Overall of course still better to wait for next gen devices if you can.

dstryr 54 minutes ago||||

This is not my experience at all. Even the Nous Research guys have stated that "Qwen3.6-27B is the canonical local model to use Hermes Agent with" [https://old.reddit.com/r/LocalLLaMA/comments/1sz2y76/ama_wit...]. I am finding the same when used with Pi and OpenCode.

Gemma will just stop mid-tool call. It's been slower and I've had to reduce context size to run it. Qwen3.6 27b has been rock solid using club 3090's single card setup for agentic use -- https://github.com/noonghunna/club-3090/blob/master/docs/SIN...

adam_arthur 48 minutes ago||

I'm talking about automation generally, not agent loops.

E.g. prompt A to achieve X, output in format Y. Use Y to do something in prompt B.

Agentic loops will underperform deterministic control flow pipelines (with non-determinism constrained to LLM calls).

Agents are more general, which is the main advantage. But inherently a more general solution will waste context on unnecessary reasoning.

Try asking the smaller Qwen models to output a JSON in a specific format. It basically can't do it consistently with a moderately sized prompt unless you constrain the token generation via GGML or are extremely repetitive and specific about it. (Thinking disabled)

Gemma 4 will do it correctly pretty much 100% of the time. (Thinking disabled)

Applies to other rule following as well in my experience.

Qwen may be better at toolcalling and certainly probably codegen.

It seems to me Google explicitly designed Gemma for edge device automation, and didn't fine tune for agentic or coding use cases.

msp26 10 minutes ago||||

Yep agreed completely. I couldn't imagine torturing myself with a small model for local coding. But Gemma 4 31B is so fucking good for a variety of language modelling tasks.

gopher_space 50 minutes ago|||

In my mind it’s a question of knowing what you want to build and how to divide the project into tasks your local setup can handle.

If you don’t need the machine to respond instantly (or explain your own business model to you) everything can be local and it’s been like that for a few years now.

FuriouslyAdrift 15 minutes ago|||

Kimi 2.6 or 2.8 is what we are playing with locally. They need 512GB to 1TB to run with full capabilities so that's not exactly "desktop"

Our GPU computer server cost $110k.

ridiculous_leke 21 minutes ago|||

A median laptop is no bueno for running a reliable model(which will be qwen 27b as per my reading here and r/localllama). Powerful macs would be prevalent in certain areas of the world but in rest of the world personal machines aren't always that powerful.

heipei 1 hour ago|||

Depends on what you mean by "local". On your Macbook, large dense models like Qwen 3.6 27B will be slow, sure. On a local workstation with a dedicated RTX card you can get > 100 tps, which is more than good enough to work with it, and faster than cloud models in many cases.

jstanley 1 hour ago|||

But how smart is it? All the people running local models never seem to mention that they are way dumber than cloud models.

I don't care how many tokens per second of nonsense it can generate.

throwawayffffas 34 minutes ago|||

Qwen 3.6 35b a3b is about as good as sonnet 4.5. It varies but it's at that level.

notnullorvoid 59 minutes ago||||

Quantized Gemma 4 26B is as smart or better than GPT 5 in most of my testing. Granted GPT 5 is nearly a year old at this point, but I can run Gemma 4 on a ~6 year old consumer GPU (RTX 3090) and get 140 t/s.

heipei 1 hour ago||||

It is smart enough that I use for all my coding tasks, and a lot of other mundane tasks.

It is probably not smart enough for "design this whole architecture of this complex system from scratch, make no mistakes", but that is not something I want from a coding tool anyway. I want a model that I can point to a file and tell it to make some changes to the file and related files. Or that I can ask to review a PR with regards to certain aspects.

My suggestion is to simply try it and see what it feels like.

myaccountonhn 1 hour ago|||

Its not going to be as good as Claude, but if you know what you're doing, it may be good enough to get your work done.

data-ottawa 1 hour ago|||

This is task dependent.

I find devstral (even though it’s weak generally) much better at writing and documentation than Opus. I’m actually now delegating all documentation to devstral and away from Claude, which makes a mess.

garciasn 1 hour ago|||

A highly skilled carpenter may be able to 'get work done' by banging nails in with a heavy-bottomed cocktail glass, doesn't mean it's not painful to do so when it is continuously breaking and leaving shards of glass all over the workshop for you to find every day for the rest of your life until you clean up the mess you made using the wrong tool for the job.

CamperBob2 1 hour ago||

More like, a highly-skilled carpenter can work miracles with a $6 hammer from the hardware store, while the pros on the commercial crew are using fancy compressed-air tools.

The carpenter has to get up close and personal with the wood. He can't match the crew's throughput, but maybe that's not what he's trying to do.

c0rruptbytes 41 minutes ago|||

I'm talking about the common use case that I think hacker news people have:

you get a macbook for work, you run the macbook

they're not going to start giving GPUs to employees to run local models

everdrive 1 hour ago|||

What counts as a lot of memory? What could someone do with 16 GB of RAM?

throwawayffffas 29 minutes ago|||

Not much, the capable models won't fit unless you go with very low quantization but that leads to a lot of loss.

You generally want to run q8 or some kind of "6bit" quantization at least.

40GB of VRAM is the entry-point in my experience, you can run qwen 3.6 35b a3b with full context or qwen 27b with about 92k of context.

Before you get fully discouraged, you don't need 1 gpu with 40GBs you can use multiple cards, with minimum impact on performance.

abalashov 50 minutes ago||||

Not a ton. I'd say 64 GB minimal to play, 96-128 GB better.

throwawayffffas 26 minutes ago||

Nah, you can run the 24b - 35b class with between 90k and 256k of context with about 40GB and they are pretty good. Especially the MOE variants fit neatly in 40GB.

zozbot234 1 hour ago||||

Modern inference engines can stream in weights from SSD in order to save on RAM, but this makes inference very slow, especially for the trivial single-session case. (Jury is still out on whether batching multiple sessions together can mitigate this well enough, but even then that's mostly helpful for the "running lots of inferences overnight and getting fresh results first thing in the morning" case. Which is interesting (the big third-party suppliers don't really offer a way of doing this at reasonable cost) but a bit of a niche.)

ValdikSS 1 hour ago||||

Gemma e2b, Gemma e4b. It's made for smartphones basically. You can run e2b with 8GB RAM.

trouve_search 1 hour ago||||

gemma 12B 4bit quant; try something with MTP and an AWQ quant

monegator 1 hour ago|||

gemma runs pretty well

greenavocado 1 hour ago|||

4 bit unsloth quants are good if you never ask for more than 20k context, use it as autocomplete on steroids, and never delegate serious questions to it

iwontberude 1 hour ago|||

They are good if you were clever enough to buy a powerful enough rig before memory went up. For everyone else I say just wait. M1 Ultra 128GB and higher is sufficient to run gemma4:31b-mlx or qwen3.6:35b-mlx with subagents. It’s only slow if you don’t know how to plan your work effectively.

dominotw 1 hour ago||

maybe painful if you are using it like a chatbot. you are sitting there waiting for response. vs ambient ai like automatically classifying your family pics and discarding random things like parking floor number pic.

i use it usecases like that latter and they are fine.

hypfer 2 hours ago||

After having been a happy user of Qwen3.6-27B for a few weeks, due to being away from the hardware, I'm currently forced to use Claude Sonnet 4.6

It is such a downgrade. I don't understand how that's even possible. The thing has so many strongly-held opinions I did not ever ask it for, talking just way too much and generally feeling somehow dumber.

Of course, being significantly larger, it will encode more knowledge, but that doesn't help me when I hate talking to it. And all that on top of the fact that talking with it costs real money.

I wonder what it might be that makes me hate it so much. Maybe because it doesn't see itself as a tool but almost an equal? As if its opinions would have weight.

Qwen too can act like an overeager intern, but if you tell it that it is an idiot, it will drop that ego. Not so much with Claude. In my experience, anyway.

Anyway, point is: full ack on that headline.

ggerganov 2 hours ago||

I haven't spent a dime on cloud inference, so cannot make a direct comparison like you. But I can 100% attest to the fact that Qwen3.6-27B is a very capable local model for coding tasks. Over the last month and a half I've been using it almost daily, either on my M2 Ultra or on my RTX 5090 box. I use it for small mundane tasks at ggml-org [0] - nothing really impressive, but definitely a helpful tool for a maintainer. I think I would be using it much more, if I didn't have to spend a lot of my time on reviewing PRs. Currently, I have a very lightweight harness - the pi agent with everything stripped (`pi -nc --offline`) and a short system prompt [1] to align it a bit with my style. About the generation speed: ~100-150 t/s on the RTX 5090 and ~40 t/s on the Mac. I definitely prefer running it on the RTX machine - it's so much faster. But for the sake of testing and getting wider experience with local configurations, I often run it on the Mac too.

[0] - https://github.com/search?q=%22Assisted-by%22+user%3Aggml-or...

[1] - https://github.com/ggml-org/llama.cpp/blob/master/.pi/gg/SYS...

trilogic 1 hour ago|||

I also confirm that local inference is on par with proprietary cloud services (with a bit of local setup, simple agents.md and some utils skills). This local models come with tools, that's mind blowing, considering that some months ago we had to .md tools ourselves. What makes a model worth even more is "Memory". We implemented that long ago. Last time I used proprietary services was 3 months ago, don´t really need it, my subscription is going blank.

Gerganov, hope you will consider developing further the CLI cause we suffering with the server.

jayGlow 12 minutes ago||

what are you using for memory with your local models? is there a specific harness you would recommend for local agents?

kpw94 1 hour ago||||

> About the generation speed: ~100-150 t/s on the RTX 5090 and ~40 t/s on the Mac

Curious if you can share the prefill speed too?

I run locally on a crappy desktop (some AMD iGPU with Vulkan llama.cpp, 32 GB DDR4 RAM) for experimentation. I get 15 tok/s on generation for the qwen & gemma4 MoE models. I get around 150 tok/s prefill speed.

Reason I'm asking about the prefill is looking at my stats at work, I use between 20M to peaks of 300M input tokens daily. Some of those token are cached but in general, I seem to have roughly 500x more input tokens than output. So interested in prefill tok/s stats.

Huge Thank you for llama.cpp btw!!

ggerganov 1 hour ago||

Here are the prefill speeds:

    Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32109 MiB
  | model                          |       size |     params | backend  |  fa |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | -------- | --: | --------------: | -------------------: |
  | qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | CUDA     |   1 |   pp2048 @ d512 |      3714.02 ± 10.85 |
  | qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | CUDA     |   1 |  pp2048 @ d1024 |      3684.86 ± 15.21 |
  | qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | CUDA     |   1 |  pp2048 @ d2048 |       3650.80 ± 8.53 |
  | qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | CUDA     |   1 |  pp2048 @ d8192 |       3473.88 ± 0.97 |
  | qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | CUDA     |   1 | pp2048 @ d32768 |       2754.69 ± 4.07 |

  ggml_metal_device_init: GPU name:   MTL0 (Apple M2 Ultra)
  | model                          |       size |     params | backend  | fa |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | -------- | -: | --------------: | -------------------: |
  | qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | MTL      |  1 |   pp2048 @ d512 |        379.75 ± 0.21 |
  | qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | MTL      |  1 |  pp2048 @ d1024 |        377.15 ± 0.35 |
  | qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | MTL      |  1 |  pp2048 @ d2048 |        371.46 ± 0.91 |
  | qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | MTL      |  1 |  pp2048 @ d8192 |        344.84 ± 0.41 |
  | qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | MTL      |  1 | pp2048 @ d32768 |        222.42 ± 5.29 |

Btw, based on your numbers, I think our use cases are quite different. I use the agent for very targeted sessions - basically things that are clear to me how to do, just want to automate them. My workflow is usually: new session -> read this, this and this -> do that. I.e. I don't let it wander at all in the codebase, so I rarely exceed the context window.

Also, I get a lot of mileage from the ngram-based speculative decoding functionality [0] as it allows me to iterate on the implementation much faster.

[0] https://github.com/ggml-org/llama.cpp/pull/19164

kpw94 40 minutes ago||

Thanks! Super helpful.

I do use it the same way as you're describing on personal projects at home, in a very crude manner (pasting code snippets in llama server web UI prompt. Next will attempt OpenCode)

At work I use it in similar manner with more mature tools, but the vast majority of token spend comes from a totally different workflow: "pretend the AI is a fleet of junior/intern engineer you're delegating work to", where the agent will on its own do the implementation, commit the changes etc.

It does indeed spend a lot of tokens wandering the codebase, talking to MCPs, loading skills etc.

celrod 1 hour ago||||

What quant do you run it at? 32GB seems like cutting it close on the rtx 5090 if going 8b, but other commenters are saying 4b lobotomizes the model.

ggerganov 1 hour ago||

As a baseline, I run all models in Q8 [0] because I want to be confident that when I observe a problem, the root cause is not due to the quantization. However, in this specific case, I use Q8 on the mac and Q4 on the RTX machine because the latter does not fit the full context at Q8. So far, I don't have conclusive evidence that the Q4 quantization affects the quality in a significant way for this model and the tasks that I am using it for.

[0] https://huggingface.co/ggerganov/presets/blob/main/preset.in...

fridder 1 hour ago|||

Not too shabby. I like the regular Qwen but prompt prefill on my m1max is slow as hell

StevenWaterman 2 hours ago|||

Yep, I daily drive Qwen3.6-27B (including for work), have done pretty much since it came out. IMO it's the only (small-ish, local) model worth using, if you can run it. It might not be as good as Opus at "add X large feature" but I don't want that in a model. I want to do the thinking while it does the typing. And Qwen 3.6 27B is perfectly good at that (while in my experience models like the 35A3B and gemma are significant downgrades)

Plus, I never have to worry about rate limits, quotas, or sitting in a queue during peak time. And I can always see its full thoughts, don't have to worry about where my data is getting sent, and know it can't get secretly nerfed.

Running on 2x 3090, 500-1000tok/s prefill and 60tok/s output at Q6_K_XL with MTP on llama.cpp, 220k tokens context window (starts to get a bit dumb above 160k ish), no KV quantization

indoordin0saur 2 hours ago|||

> And I can always see its full thoughts, don't have to worry about where my data is getting sent, and know it can't get secretly nerfed.

For this reason I wonder if local models are a potential business opportunity. Provide the service to engineering teams to give them a pre-built and setup GPU rig they can run in a closet. No need to worry about all the things you mentioned and clients can rest-assured their data isn't disappearing into a sketchy data center. There might be regulatory reasons that make on-prem setups appealing as well.

amoshebb 2 hours ago|||

This is, as far as I know, the business model of coys like mistral and cohere

suncemoje 1 hour ago||||

On-premise (1960-2010) -> Cloud (2010-2026) -> On-premise (2026+)?

indoordin0saur 1 hour ago||

I think that's overstated, but the loss of trust companies have with the big AI players is pretty serious. Not a big deal if your app is for sharing cat videos, but if you're medical or wealth management or a government contractor or the like enterprise clients really like to see good data security policies.

suncemoje 1 hour ago||

Agree. I also wonder how zero e.g., Claude Enterprise ZDR really is, and what their data pipeline actually looks like.

cyanydeez 1 hour ago|||

I think the next step to anyone but overbloated USA models is to follow https://chatjimmy.ai/ with one of the qwen models. If they can mass produce something at relative cost, these would be awesome sidecars.

iamtheworstdev 1 hour ago||||

are you running an NVLink? I have the same setup but no NVLink and it feels like it's best just splitting the 3090s to run separate models concurrently. But I also have no idea what I'm doing.

hughw 1 hour ago||||

Just this morning I tweaked my single 3090 setup too:

  OLLAMA_FLASH_ATTENTION=1
  OLLAMA_KV_CACHE_TYPE=q8_0
  OLLAMA_CONTEXT_LENGTH=180000

and that fits in 23GB.

[edited for format]

Andrex 27 minutes ago||||

How long have you been using it?

giancarlostoro 2 hours ago||||

> (starts to get a bit dumb above 160k ish)

If open models can ever hold roughly 600k token windows, I'll be really excited, I found that around 300 ~ 400k of Claude reading through your codebase results in better outputs. I also have Claude read official docs instead of "guessing" as to how to do something.

StevenWaterman 2 hours ago|||

I think we'll get there. Right now it works for me, because I'm naturally pretty verbose in my prompts, and know the codebase well, so I know what it needs to look at. Plus subagents for anything exploratory.

I think deepseek v4 pro has 1m context and does pretty well up to around 600k. But if you have the hardware to run that locally, you already know

Even then if there's a smaller model with 1M context, you'll need a ton of RAM to actually run it at full 1M. I guess that's why you don't see it too much. Anyone that could run Qwen 3.6 27B with 1m context would be better off running a much bigger model with smaller context instead, in the same amount of VRAM.

In terms of optimizing further, huge context + KV quantization sounds like a terrible idea, but there's some decent innovation in sparse attention, KV cache rotation allowing Q8 to perform nearly as well as full 16-bit precision, plus some ideas around offloading KV cache to system RAM (but I'm skeptical)

zozbot234 1 hour ago||

DeepSeek V4 (both Flash and Pro) has very good scaling of context length wrt. RAM use, so this is not an inherent limit of LLMs in general.

0xc133 1 hour ago||||

With yarn and rope scaling arguments for llama.cpp you could run qwen3.6-27B with 1M context… if you have enough memory to store it.

cyanydeez 1 hour ago|||

I don't really think you're making reasonable decisions at that size; but I suppose if you're not allowed to refactor it, maybe.

I think the way these models work excludes sane behaviors the larger the context gets as each token introduces potential ambiguities between "USER" and "SYSTEM" messages leading to all the catastrophic behaviors.

Anyway, with AMD395+ I'm finding ~100k is both speed and context usefulness unless it's scoped tightly. with opencode, I manage it with dynamic context pruning: https://github.com/Opencode-DCP/opencode-dynamic-context-pru... ; then anything I touch ends up being refactored so context doesn't get bloated with unecessary functions, etc.

Obviously, this isn't compatible with certain business codebases, so I can see why bloat meets bloat.

QuantumNoodle 1 hour ago|||

Do you have any resources on hardware necessary for running models and tweaks? I see you mention 2x 3090 and I wanted to do more search on what hardware is satisfactory for what models.

epistasis 2 hours ago|||

> talking just way too much

OMG this is such an annoying property, just shut the hell up please, and be concise.

I suspect that this is an artifact of the thinking property, but please just summarize the thinking process far more concisely, where a single sentence answer is more than sufficient the frontier models seem devoted to going on to a minimum of 5 paragraphs and offering 3-5 new directions.

And requests to please only offer a single step at once, or single option at once, or to even stop eagerly offering future directions is really hard to prompt correctly.

And look, there I did exactly what I was complaining about...

bityard 1 hour ago|||

I'm not sure to what degree you can influence how a model thinks, but you can definitely hide the thinking tokens and tell the model how you want it to talk to you.

For example, the Claude web UI has an Instructions field where I have told it never to congratulate or praise me for asking questions. Earlier Copilot models used a ridiculous number of emoji and bullet lists when answering literally every prompt, I told it to knock that off and prefer detailed paragraphs in prose.

Local agents/frameworks/whatever all have their equivalents for overall user preferences.

epistasis 44 minutes ago||

Thanks for the reminder! For others looking for this setting, it is currently under User Menu (click your account name in the lower left), then "Settings", then the "General" tab there's an "Instructions for Claude" box.

Asking Claude for this provides incorrect instructions for me, so I'm guessing it moves around a lot.

illegalsmile 1 hour ago|||

That's why you have to give claude and others directives/.md at the beginning so it doesn't go off the deep end with suggestions.

epistasis 1 hour ago||

Yeah, I've tried, and I'm sure somebody is going to say "skill issue" but it's not so easy to get the model to do that. Maybe it should be a SKILLS.md issue.

Edit: also, how can I stop the LLM from all this fake glazing, as if every question I have is some sort of unique genius insight, it's so damn annoying. I just got the third straight round of this while merely trying to get summarization of a PDF:

> Good question — it gets right at a real tension in the paper. Let me check the current state of actual SV-imputation efforts, since this has moved since 2020.

bornfreddy 45 minutes ago||

I didn't try telling to be concise and stop pampering me yet (but good idea, tomorrow), however I found that instead of me writing agent instructions, it works much better if I tell claude to write instructions for itself. I do check if they make sense of course, but its wording works much better than mine.

andix 9 minutes ago|||

Sonnet is extremely overpriced. It's a good model, but not worth the money Anthropic charges for it.

radium3d 2 hours ago|||

If you think about it, they're splitting the power across millions of users. Essentially, these AI companies have YOUR hardware that YOU are paying (them) for in a cabinet at some data center. This means the hardware could easily be run locally for inference for these 'big' models. It's just a problem of dynamics-- RAM is being bought in bulk by these companies through these B200 style cards, instead of sold slowly through the open public markets.

This is likely due to a combination of mass funding for the AI companies, but also they are trying to governmentally restrict which countries get access to these cards so certain countries get a head start. The only way to lock that down is to have them literally locked in their own GPU prisons (data centers). Third reason is it does make it possible to train the models faster by having them in the same data center connected directly. Having them distributed to everyone would slow down training considerably.

The current way to 'own' decent RAM and GPUs right now is through the stock market it seems.

kitd 2 hours ago|||

Funny that coding agents have personalities, including "that colleague" you want to avoid even if you know they're probably quite good at what they do!

derethanhausen 2 hours ago|||

I would not generalize based on experiences with Sonnet. The flagship models (Opus being the claude equivalent) are dramatically better.

hypfer 2 hours ago||

Opus in my experience is equally unpleasant "character"-wise, but at least it actually gets stuff done more often, so it's at least slightly more earned at that. It's still a neurotic cargo-culting dogmatic idiot, but one that at least sometimes does produce deliverables instead of only bottom-tier HN-esque opinions.

Hmm. I think I might just fundamentally disagree with Anthropic about the idea of what a "tool" should be.

giancarlostoro 2 hours ago|||

There's a model on Huggingface where someone takes Qwen and makes it think Opus style, and that one seems to be decent, not sure if they have the 27B variant in that style. I do wonder if you can tweak your system prompt to force Qwen to behave better?

StevenWaterman 2 hours ago|||

You read the OP backwards, they said Sonnet is a downgrade from Qwen, and prefer Qwen's tone

giancarlostoro 1 hour ago||

Sure, but my argument still holds, the idea is that Qwen reasons the way that Opus on High (what is now Max or whatever?) level thinking to reason about problems instead of its standard approach.

whythismatters 2 hours ago|||

Yes, Qwopus :) I've been pleasantly surprised by its quality

giancarlostoro 1 hour ago||

Seen that one too, same guy I'm thinking of too, havent had a chance to try all of their models. For anyone curious I believe the username is Jackrong on huggingface? They've got several models out on there each focused on programming from different approaches.

MostlyStable 2 hours ago|||

Curious if you have tried custom instructions. I was never quite as unhappy with Claude's voice as you appear to be, but there were several things I didn't like. A custom prompt fixed almost all of them.

clickety_clack 2 hours ago||

I think it would be very hard to convince someone to pay $100/mo to go back to Claude if they have a local model up and running, particularly now that model improvement has basically been stalled for the last 6 months. It’s so easy to set it up for yourself now too with things like LM studio. That said, there will always be unsophisticated users who can’t figure it out, so there will always be someone there to pay.

MostlyStable 1 hour ago|||

The person I was replying to specifically said that the Claude will "encode more knowledge" and that their problem was that they didn't like talking to Claude. It sounds like they think that Claude is at least slightly more functional. And the "not liking talking to it" is probably fixable. Someone for whom a local model works, and for whom the economics make sense, should absolutely run a local model and I wouldn't try to convince them otherwise. I'm sure it's the right choice for a lot of people. But not liking the personality of Claude is probably not a great reason on its own, given the minuscule amount of effort it takes to fix.

Scoundreller 2 hours ago||||

The third category are the occasional users that won’t have the hardware and won’t stomach a monthly fee for “unlimited” but are happy to pay-per-use.

I’d think the volume for that category would be low but LLMs aren’t just for coding.

dghlsakjg 1 hour ago||

I’m probably the third category. I like experimenting and trying different models and techniques. I want api access for my own apps and Claude subscriptions don’t have that.

Sure I could splash out a ton of money for a high ram Mac, but deepseek is so dirt cheap that I think depreciation on a high end machine costs more than my api spend.

Example of what I’m using it for: building a semantic database of podcast content (podcast discoverability sucks on an episode level). I need a cheap LLM, an embedder, a transcriber, none of which Claude will do.

My api costs for coding agents plus running apps are about ~$20/month, but I get more than just chat + Claude code.

If all I was doing was pumping an employers codebase through a coding agent, Claude would be the answer.

chrisweekly 2 hours ago|||

Not everyone has the right hardware.

clickety_clack 2 hours ago||

I guess I’m thinking of the $100/mo users, for whom it’s probably possible to get the right hardware.

zerd 1 hour ago|||

I noticed Fable was quite a bit terser, and I think it's due to changes in the system prompt [0]. They're literally saying "just give me the TLDR" and "give brief updates". You can tweak a lot of that with an AGENTS.md.

[0] https://twelvetables.blog/comparing-claude-fable-5s-system-p...

dackdel 2 hours ago|||

what kind of hardware do you need in order to run qwen3.6-27b

giancarlostoro 1 hour ago|||

Depends on which variant you pull down, but a single 5090 GPU (I know these are insanely expensive, but for context) could run either the Q8 or Q4_K_M version. It will not fit the 52GB version (BF16) on the other hand. So any modern Mac with a Pro or better processor and more than 52GB of RAM (don't forget VRAM for context window also matters!) would suffice, as someone else noted, probably a 128GB model would do the trick, and give you enough wiggle room to max out the context window.

My Mac only has 16GB of VRAM (20GB total - 8 is reserved for the OS) so I have to leave room for VRAM, I usually find a model that fits in 5 to 7 GB of VRAM and then max the context window as much as I can.

iagooar 1 hour ago||||

I recommend MacBook M5 Max with 128 GB of RAM to run it comfortably and fast. If you have something like a regular M4, go with qwen3.6-35b-a3d - the Mixture of Expert architecture makes it run 2-3x faster than the 27b version.

sbmthakur 1 hour ago|||

I could run it on 7900 XT with 64k context. You could run it more comfortably on a 24 gb vram.

indoordin0saur 2 hours ago|||

Very curious what hardware you're running this on!

hypfer 2 hours ago||

The same 24GB VRAM RTX 4090 I bought to play Cyberpunk 2077 with.

Works perfectly fine in llama.cpp throwing 70+t/s at me with 128k q8 K/V context when using the IQ4_NL quant + MTP at q4 MTP K/V.

Also leaving this here because you might find it useful: https://hypfer.github.io/will-it-fit-llama-cpp/

indoordin0saur 2 hours ago|||

Nice! Do you do anything with that compute when you're not actively using it? Is the crypto-mining hobby still worth it? I've also wondered if such expensive hardware can be rented back out to offset cost. Looks like these cards are going for as much as $4k nowadays.

all2 2 hours ago|||

There are services where you can hook your card up and rent it out to other users. I don't know what any of them are called, but they do exist.

dghlsakjg 1 hour ago||

Salad.com is one. (I’m unaffiliated, just happened to come across it this week while looking for a cheap option)

hypfer 2 hours ago||||

I've paid ~2k€ in 2023. Since I'm usually sitting next to it, I'm only using it when I want to use it. It can get quite loud and warm.

Crypto (to my knowledge at least) moved away from GPU mining. I guess you could maybe rent out GPU compute, but - being in germany - it's not worth the legal hassle. You could of course always commit tax fraud, though I wouldn't recommend that.

esseph 1 hour ago|||

> I've also wondered if such expensive hardware can be rented back out to offset cost.

Massive legal liability. Not worth it.

cdelsolar 2 hours ago|||

What did you call me?

ltononro 1 hour ago|||

Well but comparing with sonnet 4.6 instead of opus 4.6,.7 or .8 doesnt make a real point I mean, pay 200 USD/month (if you have that cash, or your company has it), might not justify using local at all (unless you have some reason to suspect about data leakage)

chrisweekly 2 hours ago|||

Why Sonnet 4.6 not Opus?

calebm 52 minutes ago|||

sync/ack

cmrdporcupine 22 minutes ago||

The Anthropic models have always been annoying this way -- chatty/opinionated and Dunning-Krugerish. And love to run away and do things unprompted with me jamming my ESC ESC ESC key over and over so I can get a word in edgewise.

FWIW Codex/GPT models are way less this way. Maybe to a fault.

I'm setting up my DGX Spark to try Qwen 3.6 27B again, as I'm hearing a lot of good reviews. When I tried it some time ago it was still early for support in llama.cpp.

b3ing 2 minutes ago||

They are ok for simple stuff, coding is weak, chat is alright, writing is ok. But I had many of them write stories for ideas and they kept using the same names regardless of what the story was about. I can’t complain, it’s free. Can’t wait till they get even better, but for local image generation they are good, slow but just create a bunch in the background while you do other things otherwise it’s like 14.4k modems

rmunn 2 hours ago||

This is the kind of thing that Anthropic et al should be worried about. As it becomes easier and easier to run local models, the ceiling of what they'll be able to charge will get lower and lower. Not that nobody will be willing to pay $$$$$ per month, but a lot of people are going to multiply the per-month charge by 12 or 24 and say "Could I set up a local model for less than that, and have it pay for itself within a year or two?" And if a significant portion of customers decide to buy instead of rent, the companies whose business model is entirely centered around renting will suddenly find themselves hurting for customers.

sathackr 2 hours ago||

The opposite of that has been happening for 20 years now with cloud compute.

It won't happen with AI models either.

It's almost ingrained in the American business model now. Outsource everything. Nobody wants to manage a room full of servers when they can spend 2-3x as much and outsource that headache along with the responsibility for it.

Same will happen with AI. Whether that means paying Anthropic that premium or paying AWS.

I'm in a relatively small business, we recently had an outage related to our local infrastructure.

I got pressure from the CEO saying it wasn't reliable to host our own infrastructure anymore even though our total internal down time over the last 5 years is significantly less than even a single of the larger recent AWS outages.

Everyone wants to shuck the chore and the responsibility.

preommr 2 hours ago|||

> The opposite of that has been happening for 20 years now with cloud compute. It won't happen with AI models either.

AI is different.

Cloud computing genuinely is cheaper on average. It's better than paying for cisco servers, and at scale, it's cheaper than managed platforms (ala Heroku), and it's a coin toss for when you're in the middle ground and constantly approaching the point of rebuilding poor-man versions of existing products but with very very expensive engineering salaries.

In contrast, local models offer dramatic savings, and are magnitude of orders better in certain aspects: like stability - the performance is all over the place with traditional AI companies as they divert compute to their next big thing.

The benefits to maintaining your own infrastructure are pretty moderate to low, with very high risk.

And also, alternate models are pretty easy to use and easy to swap out unlike the vendor lock-in that exists with cloud services.

richardwhiuk 24 minutes ago||

There's no economic reason why running a model locally should be better than using a cloud hosted version.

TkTech 1 hour ago||||

For many companies (country-dependent) that's not really why they use cloud services vs purchasing. It's tax shenanigans and business process overhead. OpEx vs CapEx, and a small (%) bump in the huge AWS bill no one will even notice or a $30k+ invoice for hardware that has to go through rigorous review and 3 departments.

Same reason people pay for things through the AWS marketplace (like Vanta) instead of having to go through their invoicing process.

dreambuffer 2 hours ago||||

It's just not comparable though is it? You need cloud services because it's physically impossible to use your single home computer as a server, CDN, load balancer, mass storage, security service, and distributed system.

But AI is just weights, you can run a reasonably intelligent model at home, or on a few GPUs if you're a small-medium sized company, and it doesn't require dedicated maintenance.

cheema33 2 hours ago||||

> I got pressure from the CEO saying it wasn't reliable to host our own infrastructure anymore even though our total internal down time over the last 5 years is significantly less than even a single of the larger recent AWS outages.

Same here. My job as a software dev does not require me to self-host services we need and use. Quite the opposite. But, I am reluctant to hand over all control to AWS or equivalent for several reasons that I will get into here.

I have found that Infrastructure as Code (IaC) and modern tools like opentofu, ansible, combined with frontier AI models and harnesses gives you superpowers in this space. Almost all of our self-hosted services are fully managed by these tools. e.g. We perform backups and test them more often now than we ever did before. Entirely because it is so much easier to do all of that now.

derfurth 2 hours ago||||

That's an interesting take, however there is no ongoing maintenance related to local models, maybe the only effort is giving more capable machines to the workforce; but yeah I can see how it might feel like a barrier.

sathackr 1 hour ago||

The hardware, the power systems, the cooling systems. They need maintenance.

The OS needs updates, file systems get corrupted.

Fans get dirty.

All the things that you need to deal with in hosting your own server infrastructure you have to deal with when hosting your own AI infrastructure (which runs on servers...)

ajb 54 minutes ago||

However, you can get many of the benefits of a "local model" by outsourcing all the hardware maintenance but still using an open model. Guaranteed repeatability for one.

A lot of the reason people outsource normal software is its brittle security properties, not sure that even applies to an LLM - it can go and look up the latest security best practices just like an engineer can.

davidw 1 hour ago||||

Still though, perhaps the existence of low-margin, generic, cloud LLM's puts some downward pressure on the 'brand name' companies?

CamperBob2 53 minutes ago|||

outsource that headache along with the responsibility for it

You know what gives me headaches? When I'm in the middle of a session and the model gets rug-pulled out from under me because somebody at the model provider didn't pay the Trump bill that month.

Or when someone at the model provider decides that the curve-fitting algorithm in my graphics package looks a little too much like Skynet for comfort.

Or when they do any number of other things to undermine my work for the sake of their business model, some of which I won't even notice until the damage is done.

The sad thing is, if you know how inference works, you know that it really is insanely wasteful for everybody to run it locally. If anything naturally belongs in the cloud, it's inference. But at the same time, what choice are we being given?

indoordin0saur 2 hours ago|||

I'm curious when coding-heavy companies will start running their own on-prem AI clusters. Has anyone had the idea to sell something like 4 GPU machine an engineering team could throw in a closet somewhere and run whatever they want on it? I imagine this won't appeal to everybody but with the trust issues the hyperscalers have developed hoovering up people's data and using it to train their models, I imagine some will find value in a machine and model they have transparent control over including the option to walk over and unplug the thing.

CamperBob2 40 minutes ago||

Has anyone had the idea to sell something like 4 GPU machine an engineering team could throw in a closet somewhere and run whatever they want on it?

I think that's basically Geohot's business model at Tiny Corp.

storus 1 hour ago|||

They are working hard on you not being able to run a thing locally. OpenAI buys all RAM on the spot market, causing the rise of RAM/VRAM prices 6x, making GPUs and decent computers unreachable for the majority of the population. OK, some richer folks might be able to get a 512GB MacStudio or a single RTX Pro 6000 for 13k and be able to run some decent local models, but the vast majority will need to use API. And at some point Nvidia might say: "We don't sell that many 6000s, so let's just cancel them altogether as we can gain 4x profit on datacenter-only GPUs" and then they'll become unobtainium and no private person would ever be able to run anything decent (~1 year behind the frontier) locally.

bityard 1 hour ago|||

The general consensus is that local models will continue to improve drastically, but hosted models will as well. There will _always_ be a pretty big gulf of capability between what you can do with a desk full of hardware at home vs a few racks of hardware in a datacenter. That seems to be the real "moat" of hosted models at this point in time: access to capital.

What's interesting/exciting is that local models are _already_ quite good at tasks we never imagined AI _ever_ doing before ChatGPT hit the scene just a few short years ago.

We're also in an interesting point in time where companies are releasing the fruits of their research/labor (the LLMs) to the general public for free. For now, I think they see it in their best interest to gain mindshare and rapport, as well as advancing the state of the art in smaller LLMs ("a rising tide lifts all boats") but I fear and expect that these will dry up as the major players buy the minor players, and all will seek a return on their considerable investments in AI research.

cogman10 1 hour ago||

I believe there's a level of diminishing returns. Sure, SOTA will probably always benchmark better than local models. But do we need it? That's the question that the likes of OpenAI and Anthropic should be worried about.

regularfry 1 hour ago||

The difference won't be in the individual tasks. It'll be in the scale of job they can take on and how you interact with the model. Think of pairing with a junior vs replacing a full delivery team, that's the sort of difference we'll be looking at. We'll be able to get closer to the latter by being more clever with harnesses, I reckon, but the frontier labs will run ahead because for any given harness trick they can lean harder on model smarts.

cogman10 51 minutes ago||

True, but my point is that if/when local models get to the point where they are capable of doing the "delivery team" work what's next? What can these bigger SOTA models offer? And especially what can they offer above and beyond what you might be able to get from much cheaper models which the open models are based on?

That's what I mean by diminishing returns.

wuliwong 2 hours ago|||

These local models can do some of the work the non-frontier models can do but for me, that's not worth much. If I am just using Sonnet 4.6, I can pretty much work all day on the $20/month plan. And Sonnet is still a way more powerful model than a one you could self host on an M2 mac.

If things change to token usage billing for everyone, maybe I'll be singing a different tune but on a subscription, I don't think it makes sense financially.

Fun? Yes. Financially sound? No.

icoder 2 hours ago|||

What I don't understand is that on one hand we read 'what they charge is much less than it costs them' and on the other hand this thread seems to suggest that 'what they charge is more than it would cost me'.

bluGill 2 hours ago|||

What it costs is tricky to measure. A large part of the costs are training the model. Once they have the model they are making a ton of profit from what they charge (or so we think - I haven't seen the numbers). However the sunk costs of getting the model need to be paid for and that means an accounting problem where we have to guess how much the model will be used in the future.

Accountants are reasonably good at figuring this out - there are a lot of different things that need a large upfront investment before you can charge anything. People still debate if they are correct in this each case.

esailija 2 hours ago||||

Bigger models that Antrophic want to sell cost disproportionately more (e.g. 100% more cost for 5% performance improvement) than small models you would use locally

themaninthedark 2 hours ago|||

Maybe that is why they are buying up as much hardware as they can? If their service is the only game in town.

otterdude 2 hours ago||

Data Center providers are buying hardware, not anthropic. Certainly related but alot of the hardware purchased is just sitting in a warehouse waiting for a data center to get built.

sbmthakur 1 hour ago||

Someone was able to run gemma-4-26B-A4B on an i5-8500 with 32 gb ram with NO GPU. Granted this is an extreme example these MoE models are value for money for a lot of use cases.

https://www.reddit.com/r/LocalLLaMA/s/YontVNVRbL

embedding-shape 2 hours ago||

Show us the resulting code of using them! :) I want to use local models, I have the hardware for it, but while trying them out as replacements for GPT 5.5 xhigh or Opus or other SOTA models, they aren't quite ready to be replaced yet, sadly. The quality and bumps they encounter just slows down the workflow so much, even screwing up tool call syntax sometimes.

But, for smaller more well-defined workflows, or as straight "edit this part to be like this exact" edits, they seem more than enough. Still waiting for them to become mature enough to be able to replace what we have as SOTA today, I'd say it's ready to be switched over then.

Speaking of local models, DiffusionGemma (and diffusion models in general) should not be slept on for local usage! Usually the problem locally is that the LLMs aren't efficiently making use of your hardware, unless you start batching requests and run many at the same time, but that require different approaches in general. Instead, diffusion models work much faster for individual prompts, and not by a small margin either.

Today I finally finished porting diffusiongemma-26B-A4B-it support from Transformers into Candle, and together with some optimizations I now have it basically flying with ~450 tok/s (~19 it/s) in Candle during inference, instead of ~180 tok/s (~11 it/s) from HF's Transformers library. Even using vLLM with similar sized LLMs, I don't think I've ever gotten past the ~250 tok/s threshold for single prompts, exciting stuff for local models :)

zozbot234 2 hours ago|

> Instead, diffusion models work much faster for individual prompts, and not by a small margin either.

Diffusion models can't really be trained beyond low-to-mid size and have lower quality than an equally sized, plain one-token-at-a-time model.

embedding-shape 2 hours ago||

As mentioned, I've just finished the implementation and started playing around with it, seems to be doing similarly well inside of my own agent harness as similarly sized "traditional" LLMs. Of course, neither come close to SOTA models, but I suppose if we can figure out the scaling issues you mention, we'd get a bit closer. The performance just feels like it's too good to quickly ditch diffusion. Do you have more info what those "can't be trained beyond low/mid size" issues are in practice today?

zozbot234 1 hour ago||

The issues around training diffusion models are well known among researchers. They're likely to not be feasibly scalable far beyond the 26B size of DiffusionGemma itself, and their lower quality compared to an equally-sized auto-regressive model (the usual one-token-at-a-time flow) is also a matter of broad consensus.

embedding-shape 1 hour ago||

> They're likely to not be feasibly scalable far beyond the 26B size of DiffusionGemma itself

I think people used to say the same about the 8B text-diffusion models too when they came out, like LLaDA. LLaDA2.0 seemingly claims 100B total / 6.1B active MoE diffusion (DiffusionGemma is also MoE). Not saying you're wrong about the current consensus, but it has a way of changing over time, might be a bit early to claim it's infeasible to scale them, especially considering the final artifact being much more suitable for local usage.

iagooar 1 hour ago||

I love running two models locally: qwen3.6 27B 8bit (dense) and qwen3.6 35B 4bit (MoE).

The 27B is the smarter, more reliable one - but it is slower. The 35B is faster, still very smart but below 27B, a bit less reliable. The reason is the MoE - Mixture of Experts architecture, which only activates a subset of parameters, making the model much much faster.

I run the 27B on a MacBook Pro M5 Max + 40 GPU cores + 128GB RAM (well, on this beast I can have 27B + 35B in memory at the same time with headroom for all the other stuff). But because this is a laptop, it is not possible to run local LLMs all the time - it just gets too hot and too loud.

What excites me more: I run the 35B model on a MacMini M4 with 64GB RAM. It is fast, it gets a lot of work done (e.g. it scans, extracts and classifies my emails, it watches the mailbox all the time and does work). I also use it as my private Hermes assistant ("when is the next Starship launch?", "who is playing today at the World Cup? Give me some trivia").

Next step I am planning is a RTX Pro 6000 Blackwell workstation I can put in my basement. I want to run qwen really fast, with multiple threads / prompts / agents at once. And MAYBE if the budget allows, a 2x RTX Pro 6000 setup in order to run DeepSeek v4 flash on it (to run research on it).

zerd 1 hour ago||

I'd love an RTX 6000 Pro, but how can you justify it when it costs 10 years worth of Claude Max?

iagooar 59 minutes ago||

10 years worth of Claude Max today. Also - Anthropic recently removed a model I relied on and isn't giving it back. As a non-US citizen, I would rather pay in advance but be sure, I will keep having access to inference on my own terms.

Also, it will just be faster - and more fun too.

Barbing 1 hour ago||

Did you get a Brave search API key or something for that “Hermes”?

nickthegreek 17 minutes ago|||

I have my mine setup with a searxng instance I run in a docker. Works great and costs zero.

iagooar 58 minutes ago||||

Yes, Brave search is one of these services I highly recommend paying for, the search they provide (similar to Exa, Tavily) is what makes an "OK LLM" become super smart.

dghlsakjg 1 hour ago|||

Hermes is just an agent that can be setup for whatever you want (coding or more commonly personal assistant ala clawdbot). You can set it up with any of the standard tools and MCPs like brave or tavily for search.

jszymborski 6 minutes ago||

I run local models and they work fine for me, but specifically for use in coding harnesses, I'm having a hard time. Tools tend to end up in the same loop, trying to `ls` the same folder or `grep` the same file, over and over and eating up the whole context. Super hard to get it to do anything but that. Any tips?

ta-run 6 minutes ago||

Not related, but, I can't seem to get my copilot-cli (office is an MS shop) use qwen3.5:27b on ollama for some odd reason.

After the recent changes to usage, I've spent an annoyingly long number of hours trying to get this to work.

sosodev 2 hours ago||

I think this is overselling their capabilities. I've used Gemma 4 and Qwen 3.6 quite a bit on my strix halo home server. They're great models and the dense variants are significantly better, but they're still very far behind the frontier. If you boot up Gemma 4 MoE and OpenCode/Pi and expect to perform anything like Claude Code or Codex you're going to be very disappointed.

gregwebs 20 minutes ago|

All these conversations seem like they are missing talking about planning vs execution. I want the best possible frontier model to plan out my changes. I also have a 2nd agent that is a frontier model check the plan. Then at that point the implementation can be done by a lesser and possibly local model. The frontier model can still do a final code review on the implementation of the changes.

Claude code supports this by setting the model to "opusplan"- it will automatically use Opus for planning and sonnet for implementation. This was completely necessary with the fable release. I was able to do this with fable and it was necessary to avoid getting quickly rate limited. In settings.json:

"env": { "ANTHROPIC_DEFAULT_OPUS_MODEL": "claude-fable-5" },

Obviously have that set to "claude-opus-4-8" now.

More comments...