Posted by greenstevester 11 hours ago

April 2026 TLDR Setup for Ollama and Gemma 4 26B on a Mac mini (gist.github.com)
262 points | 105 comments
Aurornis 6 hours ago|
If this is your first time using open weight models right after release, know that there are always bugs in the early implementations and even quantizations.

Every project races to have support on launch day so they don’t lose users, but the output you get may not be correct. There are already several problems being discovered in tokenizer implementations and quantizations may have problems too if they use imatrix.

So you’re going to see a lot of “I tried it but it sucks because it can’t even do tool calls” and other reports about how the models don’t work at all in the coming weeks from people who don’t realize they were using broken implementations.

If you want to try cutting edge open models you need to be ready to constantly update your inference engine and check your quantization for updates and re-download when it’s changed. The mad rush to support it on launch day means everything gets shipped as soon as it looks like it can produce output tokens, not when it’s tested to be correct.

colechristensen 5 hours ago|
You seem like you know what you're talking about... what inference engine should I use? (linux, 4090)

I keep having "I tried it but it sucks" issues mostly around tool calling and it's not clear if it's the model or ollama. And not one model in particular, any of them really.

embedding-shape 4 hours ago|||
For the specific issue parent is talking about, you really need to give various tools a try yourself, and if you're getting really shit results, assume it's the implementation that is wrong, and either find an existing bug tracker issue or create a new one.

Same thing happened when GPT-OSS launched: a bunch of projects had "day-1" support, but in reality that just meant you could basically load the model. A bunch of them had broken tool calling, some chat prompt templates were broken, and so on. Even llama.cpp, which usually has the most recent support (in my experience), had this issue, and it wasn't until a week or two after launch that GPT-OSS could be fairly evaluated with it. Then Ollama/LM Studio updated their llama.cpp some days after that.

So it's a process thing, not "this software is better than that", and it heavily depends on the model.

alfiedotwtf 3 hours ago||
After spending the past few weeks playing with different backends and models, I just can’t believe how buggy most models are.

It seems to me that most model providers are not running/testing via the most used backends, i.e. llama.cpp, Ollama, etc., because if they were, they would see how broken their releases are.

Tool calling is the Achilles' heel where most will fail unless you either modify the system prompts or run via proxies so you can inject/munge the request/reply.

Like seriously… how many billions and billions (actually we saw one >800 billion valuation last week, so almost a whole trillion) go into AI development, and yet 99.999% of all models from the big names do not work straight out of the box with the most common backends. Blows my mind!

embedding-shape 3 hours ago||
Just since I'm curious, what exact models and quantization are you using? In my own experience, anything smaller than ~32B is basically useless, and any quantization below Q8 absolutely trashes the models.

Sure, for single use-cases you could make use of a ~20B model if you fine-tune and have a very narrow use case, but at that point there are usually better solutions than LLMs in the first place. For something general, ~32B at Q8 is probably the bare minimum for local models, even the "SOTA" ones available today.
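To put rough numbers on that (a back-of-envelope sketch; real usage adds KV cache and runtime overhead on top of the weights):

```shell
# Weight memory in GB ≈ params (billions) × bits per weight / 8.
# Q4_K_M averages roughly 4.5 bits per weight.
awk 'BEGIN {
    printf "32B @ Q8_0 (8 bits):      ~%.0f GB\n", 32 * 8   / 8
    printf "32B @ Q4_K_M (~4.5 bits): ~%.0f GB\n", 32 * 4.5 / 8
}'
```

So a 32B model at Q8 wants on the order of 32 GB for the weights alone, which is why 24 GB consumer cards push people toward the quants that (per the above) degrade quality.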

kamranjon 4 hours ago||||
I've had really good success with LM Studio, GLM 4.7 Flash, and the Zed editor, which has a baked-in integration with LM Studio. I am able to one-shot whole projects this way, and it seems to be constantly improving. A recent update even allowed the agent to ask me if it can do a "research" phase, so it'll actually reach out to websites and read docs and code from GitHub if you allow it. GLM 4.7 Flash has been the most adept at tool calling I've found, but the Qwen 3 and 3.5 models are also fairly good, though they run into more snags than I've seen with GLM 4.7 Flash.

Aurornis 5 hours ago||||
I don’t know if any of the engines are fully tested yet.

For new LLMs I get in the habit of building llama.cpp from upstream head and checking for updated quantizations right before I start using it. You can also download llama.cpp CI builds from their release page but on Linux it’s easy to set up a local build.
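For reference, the build loop I mean looks something like this (a sketch; the CUDA flag assumes an NVIDIA card, see llama.cpp's build docs for other backends):

```shell
# One-time setup: build llama.cpp from upstream head.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Later, when a model-specific fix lands upstream:
git pull && cmake --build build --config Release -j
```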

If you don’t want to be a guinea pig for untested work, the safe option is to wait 2-3 weeks.

vardalab 5 hours ago|||
Just use OpenRouter or Google AI playground for the first week until the bugs are ironed out. You still learn the nuances of the model, and then you can switch to local. In addition you might pick up enough nuance to see if quantization is having any effect.
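OpenRouter's API is OpenAI-compatible, so trying a new model is one curl away (a sketch; the model slug below is a placeholder, check their model list for the real ID):

```shell
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "google/gemma-4-26b-it",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```
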
pwr1 27 minutes ago||
Running 26B locally is impressive, but the latency math gets rough once you're doing anything beyond chat. We switched from local inference to API calls for image generation specifically because cold start + generation time on consumer hardware made it impractical for any kind of automated workflow.

Local is great for experimentation, but production workloads that need to run reliably at specific times still favor API, imo. That said, for privacy-sensitive use cases where data can't leave the machine, setups like this are invaluable.

neo_doom 4 hours ago||
Huge Claude user here… can someone help me set some realistic expectations if I bought a Mac mini and spun one up? I use Claude primarily for dev work and Home Lab projects. Are the open models good enough to run locally and replace the Claude workload? Or am I better off with my $20/mo Claude subscription?
NietTim 4 hours ago||
They are good for small tasks, but you would not be able to use it like you use Claude and would most likely be disappointed. But also, I do not know how you use Claude.

There are many services online which offer hosted versions of these models. My advice for anyone thinking about buying hardware to self-host is to try those first; that way you can get an impression of the capabilities and limitations of these models before you commit to buying hardware.

hamdingers 3 hours ago|||
Best way to find out is to buy $10 of OpenRouter credits and try the models for yourself.

From my experience doing this, they're nowhere close, but it's entertaining to check in once in a while.

MrScruff 1 hour ago|||
I've been playing with the open models since the original llama leak. They're getting better over time, are useful for tasks of moderate complexity and it's just cool to have a binary blob of knowledge that you can run locally without an internet connection.

However you should manage your expectations. Whatever the benchmarks say, you'll quickly realise they're not at all competing with Sonnet let alone Opus. Even the largest open weights models aren't really doing that.

alfiedotwtf 3 hours ago||
So far, I’ve found gpt-oss-20B to be pretty good agentic-wise, but it’s nothing like Claude Code using its paid models.

(I haven’t tried the 120B, which I’ve read is significantly better than 20B)

jasonriddle 2 hours ago||
Slightly off topic, but question for folks.

I'm hoping to replace coding with Claude Sonnet 4.5 with an open source or open weights model. Are any of the models in Ollama.com's cloud offering (https://ollama.com/search?c=cloud) or any of the models on OpenRouter.ai a close replacement? I know that no model right now matches the full performance and capabilities of Claude Sonnet 4.5, but I want to know how close I can get, and with which model(s).

If there is a model you say can replace it, talk about how long you have been using it, with what harness (Claude Code, opencode, etc.), and some strengths and weaknesses you have noticed. I'm not interested in what benchmarks say; I want to hear about real-world use from programmers using these models.

dimgl 2 hours ago||
In short: no.

Nothing comes close, in my opinion. Sonnet and Opus are still the best models. The Codex variants of the GPT models are also great. I've tried MiniMax, GLM, Qwen and Kimi and for anything even remotely complex these models seriously struggle.

jasonriddle 2 hours ago||
Thank you for the honest answer.

Yes, this is the conclusion I've come to as well. I don't want to continue supporting OpenAI nor Anthropic, but the other models don't seem to be anywhere close yet, despite the hype.

scottcha 2 hours ago||
Yes, GLM5 and Kimi K2.5 are pretty close replacements for Sonnet.

jasonriddle 2 hours ago|||
What coding harness are you using? What are some example workflows you have used either for? Have you used them only for new/simple projects or for more complicated refactoring or architecture design?
MrScruff 2 hours ago|||
Haven't really tried GLM5 much but I've used 4.7 quite a bit and it was pretty far from competing with Sonnet at the time, although I saw claims online to the contrary.
milchek 7 hours ago||
I tested briefly with a MacBook Pro M4 with 36GB. Ran it in LM Studio with opencode as the frontend, and it failed over and over on tool calls. Switched back to Qwen. Anyone else on a similar setup have better luck?

internet101010 6 hours ago||
It failed to run in LM Studio on an M5 with 32GB at even half the max context. It literally locked up the computer and I had to reboot.

Ran gemma-4-26B-A4B-it-GGUF:Q4_K_M just fine with llama.cpp though. First time in a long time that I have been impressed by a local model. Both speed (~38t/s) and quality are very nice.
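For anyone wanting to reproduce this, serving a local GGUF with llama.cpp looks roughly like the following (a sketch; the file name follows the quant above, and -ngl 99 offloads all layers to the GPU/Metal):

```shell
# Quick smoke test in the terminal:
llama-cli -m gemma-4-26B-A4B-it-Q4_K_M.gguf -ngl 99 -p "Hello"

# Or serve an OpenAI-compatible endpoint (default port 8080):
llama-server -m gemma-4-26B-A4B-it-Q4_K_M.gguf -ngl 99 -c 8192
```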

Aurornis 5 hours ago|||
Tool calls failing is a problem with the inference engine’s implementation and/or the quant. Update and try again in a few days.

This is how all open weight model launches go.

jasonjmcghee 6 hours ago||
Haven't had time to try yet, but heard from others that they needed to update both the main and runtime versions for things to work.
abroadwin 6 hours ago||
Even with the latest version of LM Studio and the latest runtimes I find that tool use fails 100% of the time with the following error: Error rendering prompt with jinja template: "Cannot apply filter "upper" to type: UndefinedValue".

EDIT: The issue is addressed in LM Studio 0.4.9 (build 1), which auto-update wasn't picking up for me for some reason.

jasonjmcghee 5 hours ago||
I googled it; supposedly a fixed template:

https://github.com/ggml-org/llama.cpp/issues/21347#issuecomm...

abroadwin 5 hours ago||
Alas, this does not resolve the issue for me.
kilzimir 36 minutes ago||
Kinda crazy that I can run a 26B model on a €1500 laptop (MacBook Air M5, 32GB). Does anyone know how I can actually use this in a productive way?

spencer-p 4 hours ago||
Weird that the steps are for "Gemma 4 12b", which does not exist, and then switches to 26b midway through.

There's also a step to verify that it doesn't fit on the GPU with ollama ps showing "14%/86% CPU/GPU". Doesn't this mean you'll have really bad performance?

Schiendelman 2 hours ago|
The Mac mini doesn't have different memory for the CPU and GPU, so maybe that's ignorable?
OkGoDoIt 1 hour ago||
Sorry for being off topic, but why can’t I open this without being logged into GitHub? I thought gists are either completely private or publicly accessible. Are they no longer publicly accessible?
OkGoDoIt 1 hour ago|
In case anyone’s wondering, I tried it again and it worked this time, even without logging in. Maybe because this was my first visit to GitHub in a new country (I’m currently on vacation), I triggered some sort of anti-scraping measure or something.
anonyfox 6 hours ago||
M5 Air here with 32GB RAM and 10/10 cores. Anyone had any luck with MLX builds on oMLX so far? Not at my machine right now and would love to know if these models already work, including tool calling.
Yukonv 5 hours ago||
The latest release v0.3.2 has partial support: generation works but not all special tokens are handled. I've done some personal testing to add tool calling and <|channel> thinking support. https://github.com/Yukon/omlx

anonyfox 3 hours ago||
awesome man, can’t wait! And just now checked it out and indeed 0.3.2 does already work for baseline chatting with mlx versions of Gemma 4 … downloading and comparing different variants right now!
smith7018 6 hours ago||
I know that someone got Gemma 4 E4B working with MLX [1] but I don't know much more than that.

1: https://github.com/bolyki01/localllm-gemma4-mlx

aetherspawn 8 hours ago|
Which harness (IDE) works with this if any? Can I use it for local coding right now?
lambda 7 hours ago||
Yes, you can use it for local coding. Most harnesses can be pointed at a local endpoint which provides an OpenAI compatible API, though I've had some trouble using recent versions of Codex with llama.cpp due to an API incompatibility (Codex uses the newer "responses" API, but in a way that llama.cpp hasn't fully supported).
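Concretely, "pointing at a local endpoint" just means swapping the base URL for localhost; e.g. against a llama.cpp server on its default port (a sketch; the model field is mostly ignored or loosely matched by local servers):

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "local",
        "messages": [{"role": "user", "content": "Write a haiku about RAM."}]
      }'
```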

I personally prefer Pi as I like the fact that it's minimalist and extensible. But some people just use Claude Code, some OpenCode, there are a ton of options out there and most of them can be used with local models.

kristopolous 5 hours ago||
It needs to support tool calling, and many of the quantized GGUFs don't, so you have to check.

I've got a workaround for that called Petsitter: it sits as a proxy between the harness and the inference engine and emulates additional capabilities through clever prompt engineering and various algorithms.

They're abstractly called "tricks" and you can stack them as you please.

https://github.com/day50-dev/Petsitter

You can run the quantized model on ollama, put petsitter in front of it, put the agent harness in front of that and you're good to go

If you have trouble, file bugs. Please!

Thank you

edit: just checked, the ollama version supports everything

    $ llcat -u http://localhost:11434 -m gemma4:latest --info
    ["completion", "vision", "audio", "tools", "thinking"]
so you can just use that.