Posted by greenstevester 11 hours ago
Every project races to ship support on launch day so it doesn't lose users, but the output you get may not be correct. Several problems have already been discovered in tokenizer implementations, and quantizations may have problems too if they were made with an importance matrix (imatrix).
So in the coming weeks you're going to see a lot of "I tried it but it sucks because it can't even do tool calls" and other reports that the models don't work at all, from people who don't realize they were using broken implementations.
If you want to try cutting-edge open models, you need to be ready to constantly update your inference engine, check your quantizations for updates, and re-download when they change. The mad rush to support a model on launch day means everything gets shipped as soon as it looks like it can produce output tokens, not when it's been tested to be correct.
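One practical way to notice that a quantization changed upstream is to compare the file's checksum against the one published on the model page. A minimal sketch (the filename and the expected hash are placeholders, not from any real model page):

```python
import hashlib

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so multi-GB GGUFs don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the checksum listed on the model page (placeholder names):
# if file_sha256("model-Q4_K_M.gguf") != expected_from_model_page:
#     print("quant changed upstream; re-download it")
```

Hugging Face shows per-file SHA-256 hashes on each model's "Files" tab, so this is usually a one-line comparison.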
I keep having "I tried it but it sucks" issues, mostly around tool calling, and it's not clear whether it's the model or Ollama. It's not one model in particular; it's any of them, really.
Same thing happened when GPT-OSS launched: a bunch of projects had "day-1" support, but in reality that just meant you could load the model. A bunch of them had broken tool calling, some chat prompt templates were broken, and so on. Even llama.cpp, which in my experience usually has the most recent support, had this issue, and it wasn't until a week or two after launch that GPT-OSS could be fairly evaluated with it. Then Ollama/LM Studio updated their bundled llama.cpp some days after that.
So it's a process thing, not "this software is better than that", and it heavily depends on the model.
It seems to me that most model providers are not running/testing via the most-used backends (e.g. llama.cpp, Ollama), because if they were, they would see how broken their release is.
Tool calling is the Achilles' heel: most models will fail at it unless you either modify the system prompts or run them through proxies so you can inject into or munge the request/reply.
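The "inject/munge" approach boils down to rewriting the request before it reaches the model. A minimal, illustrative sketch of the idea, assuming OpenAI-style chat requests (the JSON reply convention here is made up; real proxies do much more):

```python
def inject_tool_prompt(request: dict) -> dict:
    """Copy tool definitions into the system prompt, for models whose
    native tool-calling support is broken or missing. Illustrative only."""
    tools = request.get("tools") or []
    if not tools:
        return request
    names = ", ".join(t["function"]["name"] for t in tools)
    note = (f"You can call these tools: {names}. "
            'To use one, reply with JSON: {"tool": "<name>", "arguments": {...}}.')
    messages = list(request.get("messages", []))
    if messages and messages[0].get("role") == "system":
        # Append to the existing system prompt rather than clobbering it.
        messages[0] = {**messages[0], "content": messages[0]["content"] + "\n\n" + note}
    else:
        messages.insert(0, {"role": "system", "content": note})
    return {**request, "messages": messages}
```

The matching step on the reply side would parse that JSON convention back into a structured tool call before handing it to the harness.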
Like seriously… how many billions and billions (actually we saw one >$800 billion valuation last week, so almost a whole trillion) go into AI development, and yet 99.999% of all models from the big names do not work straight out of the box with the most common backends. Blows my mind!
Sure, you could make use of a ~20B model if you fine-tune it for a very narrow use case, but at that point there are usually better solutions than LLMs in the first place. For anything general, 32B+ at Q8 is probably the bare minimum for local models, even the "SOTA" ones available today.
For new LLMs I've gotten in the habit of building llama.cpp from upstream HEAD and checking for updated quantizations right before I start using them. You can also download llama.cpp CI builds from the releases page, but on Linux it's easy to set up a local build.
If you don't want to be a guinea pig for untested work, the safe option is to wait 2-3 weeks.
Local is great for experimentation, but production workloads that need to run reliably at specific times still favor APIs, imo. That said, for privacy-sensitive use cases where data can't leave the machine, setups like this are invaluable.
There are many online services that host these models. My advice for anyone thinking about buying hardware to self-host is to try those first; that way you can get an impression of the capabilities and limitations of these models before you commit to buying hardware.
From my experience doing this, they're nowhere close, but it's entertaining to check in once in a while.
However you should manage your expectations. Whatever the benchmarks say, you'll quickly realise they're not at all competing with Sonnet let alone Opus. Even the largest open weights models aren't really doing that.
(I haven’t tried the 120B, which I’ve read is significantly better than 20B)
I'm hoping to replace coding with Claude Sonnet 4.5 with an open-source or open-weights model. Are any of the models in Ollama.com's cloud offering (https://ollama.com/search?c=cloud), or any of the models on OpenRouter.ai, a close replacement? I know that no model right now matches the full performance and capabilities of Claude Sonnet 4.5, but I want to know how close I can get and with which model(s).
If there is a model you say can replace it, talk about how long you have been using it, with what harness (Claude Code, opencode, etc.), and some strengths and weaknesses you have noticed. I'm not interested in what benchmarks say; I want to hear about real-world use from programmers using these models.
Nothing comes close, in my opinion. Sonnet and Opus are still the best models. The Codex variants of the GPT models are also great. I've tried MiniMax, GLM, Qwen and Kimi and for anything even remotely complex these models seriously struggle.
Yes, this is the conclusion I've come to as well. I don't want to continue supporting OpenAI nor Anthropic, but the other models don't seem to be anywhere close yet, despite the hype.
Ran gemma-4-26B-A4B-it-GGUF:Q4_K_M just fine with llama.cpp though. First time in a long time that I have been impressed by a local model. Both speed (~38t/s) and quality are very nice.
This is how all open weight model launches go.
EDIT: The issue is addressed in LM Studio 0.4.9 (build 1), which auto-update wasn't picking up for me for some reason.
https://github.com/ggml-org/llama.cpp/issues/21347#issuecomm...
There's also a step to verify that it doesn't fit on the GPU, with `ollama ps` showing "14%/86% CPU/GPU". Doesn't that mean you'll have really bad performance?
I personally prefer Pi as I like the fact that it's minimalist and extensible. But some people just use Claude Code, some OpenCode, there are a ton of options out there and most of them can be used with local models.
I've got a workaround for that called Petsitter, which sits as a proxy between the harness and the inference engine and emulates additional capabilities through clever prompt engineering and various algorithms.
They're abstractly called "tricks" and you can stack them as you please.
https://github.com/day50-dev/Petsitter
You can run the quantized model on Ollama, put Petsitter in front of it, put the agent harness in front of that, and you're good to go.
If you have trouble, file bugs. Please!
Thank you
edit: just checked, the Ollama version supports everything:
$ llcat -u http://localhost:11434 -m gemma4:latest --info
["completion", "vision", "audio", "tools", "thinking"]
so you can just use that.
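Since that `--info` output is a plain JSON array, a script can gate tool use on it before wiring up a harness. A tiny sketch, assuming the output is exactly the array printed above:

```python
import json

def supports(capabilities_json: str, feature: str) -> bool:
    """Check a JSON capability list like the one printed by --info above."""
    return feature in json.loads(capabilities_json)

info = '["completion", "vision", "audio", "tools", "thinking"]'
print(supports(info, "tools"))  # True
```

If "tools" were missing from the list, that would be the cue to put a prompt-munging proxy in front of the model instead of relying on native tool calling.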