Posted by frontsideair 5 days ago
https://www.devontechnologies.com/blog/20250513-local-ai-in-...
For non-coding: Qwen3-30B-A3B-Instruct-2507 (or the thinking variant, depending on use case)
For coding: Qwen3-Coder-30B-A3B-Instruct
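A minimal way to try one of these locally, assuming llama.cpp's llama-server and a quantized GGUF already on disk (the filename, context size, and port below are placeholders, not from the thread):

    # start an OpenAI-compatible server; -ngl 99 offloads all layers to the GPU,
    # -c sets the context window, --jinja enables the model's chat template
    llama-server -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
        -c 32768 -ngl 99 --jinja --port 8080

llama-server then serves /v1/chat/completions, so any OpenAI-compatible client can point at it.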
---
If you have a bit more VRAM, try GLM-4.5-Air or the full GLM-4.5.
Recommendation: use something other than Ollama to run these models. Ollama is convenient, but its tool-use support falls short for them (see the example after the list below).
* General Q&A
* Specific to programming - mostly Python and Go.
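As a rough sketch of why tool use matters for the runner choice: with an OpenAI-compatible server such as llama-server, you can pass a tools array and let the model decide when to call a function. Everything below (URL, port, the get_weather schema) is an illustrative assumption, not from the thread:

    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "messages": [{"role": "user", "content": "What is the weather in Berlin right now?"}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
              "type": "object",
              "properties": { "city": { "type": "string" } },
              "required": ["city"]
            }
          }
        }]
      }'

If the model and server handle the chat template correctly, the response contains a tool_calls entry instead of plain text; how well the runner handles this is what the recommendation above is about.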
I forget the exact command now, but I ran one that let macOS allocate and use maybe 28 GB of RAM as GPU memory for LLMs.
sudo sysctl iogpu.wired_limit_mb=184320
Source: https://github.com/ggml-org/llama.cpp/discussions/15396
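A sketch of checking and setting the limit (184320 MB is roughly 180 GB, so pick a value below your own machine's total RAM; the setting does not survive a reboot):

    # show the current GPU wired-memory limit in MB (0 usually means the macOS default applies)
    sysctl iogpu.wired_limit_mb

    # total physical RAM in MB, to help pick a safe value
    echo $(( $(sysctl -n hw.memsize) / 1024 / 1024 ))

    # e.g. on a 32 GB machine, allow roughly 28 GB for the GPU
    sudo sysctl iogpu.wired_limit_mb=28672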