Posted by frontsideair 5 days ago
https://www.devontechnologies.com/blog/20250513-local-ai-in-...
For non-coding: Qwen3-30B-A3B-Instruct-2507 (or the thinking variant, depending on use case)
For coding: Qwen3-Coder-30B-A3B-Instruct
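A minimal way to try one of these locally, assuming llama.cpp's llama-server and a quantized GGUF already on disk (the filename, context size, and port below are placeholders, not from the thread):

    # start an OpenAI-compatible server; -ngl 99 offloads all layers to the GPU,
    # -c sets the context window, --jinja enables the model's chat template
    llama-server -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
        -c 32768 -ngl 99 --jinja --port 8080

llama-server then serves /v1/chat/completions, so any OpenAI-compatible client can point at it.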
---
If you have a bit more VRAM, try GLM-4.5-Air or the full GLM-4.5.
Recommendation: use something other than Ollama to run these models. Ollama is convenient, but its tool-use support falls short for them (see the example after the list below).
* General Q&A
* Specific to programming - mostly Python and Go.
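As a rough sketch of why tool use matters for the runner choice: with an OpenAI-compatible server such as llama-server, you can pass a tools array and let the model decide when to call a function. Everything below (URL, port, the get_weather schema) is an illustrative assumption, not from the thread:

    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "messages": [{"role": "user", "content": "What is the weather in Berlin right now?"}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
              "type": "object",
              "properties": { "city": { "type": "string" } },
              "required": ["city"]
            }
          }
        }]
      }'

If the model and server handle the chat template correctly, the response contains a tool_calls entry instead of plain text; how well the runner handles this is what the recommendation above is about.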
I forget the exact command now, but I ran one that let macOS allocate and use maybe 28 GB of RAM as GPU memory for LLMs.
sudo sysctl iogpu.wired_limit_mb=184320
Source: https://github.com/ggml-org/llama.cpp/discussions/15396
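A sketch of checking and setting the limit (184320 MB is roughly 180 GB, so pick a value below your own machine's total RAM; the setting does not survive a reboot):

    # show the current GPU wired-memory limit in MB (0 usually means the macOS default applies)
    sysctl iogpu.wired_limit_mb

    # total physical RAM in MB, to help pick a safe value
    echo $(( $(sysctl -n hw.memsize) / 1024 / 1024 ))

    # e.g. on a 32 GB machine, allow roughly 28 GB for the GPU
    sudo sysctl iogpu.wired_limit_mb=28672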