April 2026 TLDR Setup for Ollama and Gemma 4 26B on a Mac mini

Posted by greenstevester 12 hours ago

April 2026 TLDR Setup for Ollama and Gemma 4 26B on a Mac mini(gist.github.com)

262 points | 105 commentspage 2

kristopolous 6 hours ago|

Are you getting tool call and multimodal working? I don't see it in the quantized unsloth ggufs...

easygenes 11 hours ago||

Why is ollama so many people’s go-to? Genuinely curious, I’ve tried it but it feels overly stripped down / dumbed down vs nearly everything else I’ve used.

Lately I’ve been playing with Unsloth Studio and think that’s probably a much better “give it to a beginner” default.

diflartle 10 hours ago||

Ollama is good enough to dabble with, and getting a model is as easy as ollama pull <model name> vs figuring it out by yourself on hugging face and trying to make sense on all the goofy letters and numbers between the forty different names of models, and not needing a hugging face account to download.

So you start there and eventually you want to get off the happy path, then you need to learn more about the server and it's all so much more complicated than just using ollama. You just want to try models, not learn the intricacies of hosting LLMs.

flux3125 7 hours ago||

to be fair, llama.cpp has gotten much easier to use lately with llama-server -hf <model name>. That said, the need to compile it yourself is still a pretty big barrier for most people.

MarsIronPI 1 minute ago|||

[delayed]

dTal 1 hour ago||||

You don't need to compile it yourself though? Unless you want CUDA support on Linux I guess, dunno why you'd need such a silly thing though:

https://github.com/ggml-org/llama.cpp/releases

ryandrake 5 hours ago|||

I started with ollama and now I'm using llama.cpp/llama-server's Router Mode that allows you to manage multiple models through a single server instance.

One thing I haven't figured out: Subjectively, it feels like ollama's model loading was nearly instant, while I feel like I'm always waiting for llama.cpp to load models, but that doesn't make sense because it's ultimately the same software. Maybe I should try ollama again to convince myself that I'm not crazy and that ollama's model loading wasn't actually instant.

polotics 10 hours ago|||

Ollama got some first-mover advantage at the time when actually building and git pulling llama.cpp was a bit of a moat. The devs' docker past probably made them overestimate how much they could lay claim to mindshare. However, no one really could have known how quickly things would evolve... Now I mostly recommend LM-studio to people.

What does unsloth-studio bring on top?

easygenes 10 hours ago||

LM Studio has been around longer. I’ve used it since three years ago. I’d also agree it is generally a better beginner choice then and now.

Unsloth Studio is more featureful (well integrated tool calling, web search, and code execution being headline features), and comes from the people consistently making some of the best GGUF quants of all popular models. It also is well documented, easy to setup, and also has good fine-tuning support.

xenophonf 9 hours ago||

LM Studio isn't free/libre/open source software, which misses the point of using open weights and open source LLMs in the first place.

vonneumannstan 8 hours ago||

Disagree, there are a lot of reasons to use open source local LLMs that aren't related to free/libre/oss principles. Privacy being a major one.

ekianjo 6 hours ago||

If you care about privacy making sure the closed source software does not call home is a concern...

the_lucifer 4 hours ago||

I run Little Snitch[1] on my Mac, and I haven't seen LM Studio make any calls that I feel like it shouldn't be making.

Point it to a local models folder, and you can firewall the entire app if you feel like it.

Digressing, but the issue with open source software is that most OSS software don't understand UX. UX requires a strong hand and opinionated decision making on whether or not something belongs front-and-center and it's something that developers struggle with. The only counterexample I can think of is Blender and it's a rare exception and sadly not the norm.

LM Studio manages the backend well, hides its complexities and serves as a good front-end for downloading/managing models. Since I download the models to a shared common location, If I don't want to deal with the LM Studio UX, I then easily use the downloaded models with direct llama.cpp, llama-swap and mlx_lm calls.

[1]: https://obdev.at

linolevan 5 hours ago|||

What I really don't get is why more people don't talk about LMStudio, I switched to it months ago and it seems like a straight upgrade.

alfiedotwtf 4 hours ago|||

Isn’t LMStudio closed source?

brcmthrowaway 5 hours ago|||

How does LMStudio compare to Unsloth Studio?

DiabloD3 9 hours ago|||

Advertising, mostly.

Ollama's org had people flood various LLM/programming related Reddits and Discords and elsewhere, claiming it was an 'easy frontend for llama.cpp', and tricked people.

Only way to win is to uninstall it and switch to llama.cpp.

jrm4 7 hours ago|||

Ollama user with the opposite question -- why not? What am I missing out on? I'm using it as the backend for playing with other frontend stuff and it seems to work just fine.

And as someone running at 16gb card, I'm especially curious as to if I'm missing out on better performance?

the_lucifer 3 hours ago|||

> Ollama user with the opposite question -- why not? What am I missing out on? I'm using it as the backend for playing with other frontend stuff and it seems to work just fine.

Used to be an Ollama user. Everything that you cite as benefits for Ollama is what I was drawn to in the first place as well, then moved on to using llama.cpp directly. Apart from being extremely unethical, The issue is that they try to abstract away a bit too much, especially when LLM model quality is highly affected by a bunch of parameters. Hell you can't tell what quant you're downloading. Can you tell at a glance what size of model's downloaded? Can you tell if it's optimized for your arch? Or what Quant?

`ollama pull gemma4`

(Yes, I know you can add parameters etc. but the point stands because this is sold as noob-friendly. If you are going to be adding cli params to tweak this, then just do the same with llama.cpp?)

That became a big issue when Deep Seek R1 came out because everyone and their mother was making TikToks saying that you can run the full fat model without explaining that it was a distill, which Ollama had abstracted away. Running `ollama run deepseek-r1` means nothing when the quality ranges from useless to super good.

> And as someone running at 16gb card, I'm especially curious as to if I'm missing out on better performance?

I'd go so far as to say, I can *GUARANTEE* you're missing out on performance if you are using Ollama, no matter the size of your GPU VRAM. You can get significant improvement if you just run underlying llama.cpp.

Secondly, it's chock full of dark patterns (like the ones above) and anti-open source behavior. For some examples:

1. It mangles GGUF files so other apps can't use them, and you can't access them either without a bunch of work on your end (had to script a way to unmangle these long sha-hashed file names) 2. Ollama conveniently fails contribute improvements back to the original codebase (they don't have to technically thanks to MIT), but they didn't bother assisting llama.cpp in developing multimodal capabilities and features such as iSWA. 3. Any innovations to the do is just piggybacking off of llama.cpp that they try to pass off as their own without contributing back to upstream. When new models come out they post "WIP" publicly while twiddling their thumbs waiting for llama.cpp to do the actual work.

It operates in this weird "middle layer" where it is kind of user friendly but it’s not as user friendly as LM Studio.

After all this, I just couldn't continue using it. If the benefits it provides you are good, then by all means continue.

IMO just finding the most optimal parameters for a models and aliasing them in your cli would be a much better experience ngl, especially now that we have llama-server, a nice webui and hot reloading built into llama.cpp

ekianjo 6 hours ago|||

Ollama has had bad defaults forever (stuck on a default CTX of 2048 for like 2 years) and they typically are late to support the latest models vs llamacpp. Absolutely no reason to use it in 2026.

wolvoleo 7 hours ago||

For me it's just the server. I use openwebui as interface. I don't want it all running on the same machine.

boutell 10 hours ago||

Last night I had to install the VO.20 pre-release of ollama to use this model. So I'm wondering if these instructions are accurate.

redrove 11 hours ago||

There is virtually no reason to use Ollama over LM Studio or the myriad of other alternatives.

Ollama is slower and they started out as a shameless llama.cpp ripoff without giving credit and now they “ported” it to Go which means they’re just vibe code translating llama.cpp, bugs included.

alifeinbinary 11 hours ago||

I really like LM Studio when I can use it under Windows but for people like me with Intel Macs + AMD gpu ollama is the only option because it can leverage the gpu using MoltenVK aka Vulkan, unofficially. We're still testing it, hoping to get the Vulkan support in the main branch soon. It works perfectly for single GPUs but some edge cases when using multiple GPUs are unsupported until upstream support from MoltenVK comes through. But yeah, I agree, it wasn't cool to repackage Georgi's work like that.

gen6acd60af 10 hours ago|||

LM Studio is closed source.

And didn't Ollama independently ship a vision pipeline for some multimodal models months before llama.cpp supported it?

zozbot234 7 hours ago||

Yes, they introduced that Golang rewrite precisely to support the visual pipeline and other things that weren't in llama.cpp at the time. But then llama.cpp usually catches up and Ollama is just left stranded with something that's not fully competitive. Right now it seems to have messed up mmap support which stops it from properly streaming model weights from storage when doing inference on CPU with limited RAM, even as faster PCIe 5.0 SSDs are finally making this more practical.

The project is just a bit underwhelming overall, it would be way better if they just focused on polishing good UX and fine-tuning, starting from a reasonably up-to-date version of what llama.cpp provides already.

jrm4 7 hours ago|||

Do y'all mean backend or the Ollama frontend or both? I find it trivially easy to sub in my local Ollama api thing in virtually all of the interesting frontend things. I'm quite curious about the "why not Ollama" here.

faitswulff 10 hours ago|||

Does LM Studio have an equivalent to the ollama launch command? i.e. `ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4`

DiabloD3 9 hours ago||

I don't think it does, but llama.cpp does, and can load models off HuggingFace directly (so, not limited to ollama's unofficial model mirror like ollama is).

There is no reason to ever use ollama.

ffsm8 8 hours ago|||

> I don't think it does, but llama.cpp does

I just checked their docs and can't see anything like it.

Did you mistake the command to just download and load the model?

u8080 8 hours ago||

-hf ModelName:Q4_K_M

ffsm8 7 hours ago||

Did you mistake the command to just download and load the model too?

Actually that shouldn't be a question, you clearly did.

Hint: it also opens Claude code configured to use that model

beanjuiceII 8 hours ago|||

sure there's a reason...it works fine thats the reason

meltyness 10 hours ago|||

I feel like the READMEs for these 3 large popular packages already illustrate tradeoffs better than hacker news argument

iLoveOncall 11 hours ago|||

> There is virtually no reason to use Ollama over LM Studio or the myriad of other alternatives.

Hmm, the fact that Ollama is open-source, can run in Docker, etc.?

DiabloD3 8 hours ago||

Ollama is quasi-open source.

In some places in the source code they claim sole ownership of the code, when it is highly derivative of that in llama.cpp (having started its life as a llama.cpp frontend). They keep it the same license, however, MIT.

There is no reason to use Ollama as an alternative to llama.cpp, just use the real thing instead.

simondotau 8 hours ago||

If it’s MIT code derived from MIT code, in what way is its openness ”quasi”? Issues of attribution and crediting diminish the karma of the derived project, but I don’t see how it diminishes the level of openness.

lousken 10 hours ago|||

lm studio is not opensource and you can't use it on the server and connect clients to it?

jedisct1 10 hours ago||

LM Studio can absolutely run as as server.

walthamstow 9 hours ago||

IIRC it does so as default too. I have loads of stuff pointing at LM Studio on localhost

logicallee 8 hours ago||

>Ollama is slower

I've benchmarked this on an actual Mac Mini M4 with 24 GB of RAM, and averaged 24.4 t/s on Ollama and 19.45 t/s on LM Studio for the same ~10 GB model (gemma4:e4b), a difference which was repeated across three runs and with both models warmed up beforehand. Unless there is an error in my methodology, which is easy to repeat[1], it means Ollama is a full 25% faster. That's an enormous difference. Try it for yourself before making such claims.

[1] script at: https://pastebin.com/EwcRqLUm but it warms up both and keeps them in memory, so you'll want to close almost all other applications first. Install both ollama and LM Studio and download the models, change the path to where you installed the model. Interestingly I had to go through 3 different AI's to write this script: ChatGPT (on which I'm a Pro subscriber) thought about doing so then returned nothing (shenanigans since I was benchmarking a competitor?), I had run out of my weekly session limit on Pro Max 20x credits on Claude (wonder why I need a local coding agent!) and then Google rose to the challenge and wrote the benchmark for me. I didn't try writing a benchmark like this locally, I'll try that next and report back.

dminik 8 hours ago||

It depends on the hardware, backend and options. I've recently tried running some local AIs (Qwen3.5 9B for the numbers here) on an older AMD 8GB VRAM GPU (so vulkan) and found that:

llama.cpp is about 10% faster than LM studio with the same options.

LM studio is 3x faster than ollama with the same options (~13t/s vs ~38t/s), but messes up tool calls.

Ollama ended up slowest on the 9B, Queen3.5 35B and some random other 8B model.

Note that this isn't some rigorous study or performance benchmarking. I just found ollama unnaceptably slow and wanted to try out the other options.

logicallee 9 hours ago||

In case someone would like to know what these are like on this hardware, I tested Gemma 4 32b (the ~20 GB model, the largest Gemma model Google published) and Gemma 4 gemma4:e4b (the ~10 GB model) on this exact setup (Mac Mini M4 with 24 GB of RAM using Ollama), I livestreamed it:

https://www.youtube.com/live/G5OVcKO70ns

The ~10 GB model is super speedy, loading in a few seconds and giving responses almost instantly. If you just want to see its performance, it says hello around the 2 minute mark in the video (and fast!) and the ~20 GB model says hello around 5 minutes 45 seconds in the video. You can see the difference in their loading times and speed, which is a substantial difference. I also had each of them complete a difficult coding task, they both got it correct but the 20 GB model was much slower. It's a bit too slow to use on this setup day to day, plus it would take almost all the memory. The 10 GB model could fit comfortably on a Mac Mini 24 GB with plenty of RAM left for everything else, and it seems like you can use it for small-size useful coding tasks.

zachperkel 6 hours ago||

how many TPS does a build like this achieve on gemma 4 26b?

renewiltord 7 hours ago||

Just told Claude to sort it out and it ran it. 26 tok/s on the Mac mini I use for personal claw type program. Unusable for local agent but it’s okay.

zozbot234 7 hours ago|

Isn't 26 tok/s quite usable for a claw-like agent though? You can chat with it on a IM platform and get notified as soon as it replies, you're not dependent on real-time quick interaction.

robotswantdata 11 hours ago||

Why are you using Ollama? Just use llama.cpp

brew install llama.cpp

use the inbuilt CLI, Server or Chat interface. + Hook it up to any other app

Bigsy 10 hours ago|

For MLX I'd guess.

redrove 9 hours ago|||

https://omlx.ai/

leftnode 7 hours ago||

Does this have a CLI only interface?

redrove 2 hours ago||

Yes. You could also look at the README.md.

wronglebowski 9 hours ago|||

That also comes upstream from llama.cpp https://github.com/ggml-org/llama.cpp/discussions/4345

techpulselab 5 hours ago||

[dead]

aplomb1026 4 hours ago|

[dead]

More comments...