How to setup a local coding agent on macOS

Posted by kkm 18 hours ago

How to setup a local coding agent on macOS (ikyle.me)

374 points | 92 comments | page 2

hanifbbz 16 hours ago

Here's a visual post for using LM Studio and VS Code (and Pi): https://blog.alexewerlof.com/p/local-llms-for-agentic-coding

One way or another, local AI is the future. I actually find weaker models more interesting because they keep me sharp (at the cost of velocity, of course).

mark_l_watson 14 hours ago

Nice writeup, thanks.

I run something very similar, except that instead of using pi directly as the agentic harness, I use little-coder, which wraps pi with reasonable defaults for running local models. Even though my local setup is a bit slow, it is a thrill to do real work completely locally.

anigbrowl 11 hours ago

"This video is realtime. And shows the agent responding at a perfectly usable speed."

Alas, this video appears not to have been linked to the text that describes it. Perhaps I should ask an AI to generate an artistic rendering of the author's description.

freerunnering 8 hours ago

The video is stuck in an `<img>` tag, so you need to wait for it to load. On a slow connection it might just not show for a while, though the video is only 1 MB, so it should load if you wait.

smetannik 11 hours ago

I wonder why something like LM Studio didn't work for the author?

b3ing 9 hours ago

That’s what I was wondering. LM Studio and Draw Things are easy-to-use apps that handle much of the cruft for you.

freerunnering 8 hours ago

I do a lot of fine-tuning and development with small models themselves (not just using an LLM over an HTTP API). Downloading the models directly and running them from the CLI was natural for me, so that's what I reached for when I wanted to play around with this.

reenorap 14 hours ago

My biggest pet peeve with all these articles on local AI is that the only thing they talk about is tokens per second. No one mentions the quality of the answers. No one. I don't mind waiting a little longer if the quality is better. Quickly serving me slop doesn't make it more useful. Are people really only looking at tokens per second?
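
As a rough illustration of reporting both numbers side by side, here is a minimal sketch. It assumes each benchmark run records the generated token count, the wall-clock time, and whether the answer passed some quality check. The `Run` record and the `summarize` helper are hypothetical names, not something from the article.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Run:
    """One benchmark run (hypothetical record, not from the article)."""
    generated_tokens: int  # tokens the model produced
    seconds: float         # wall-clock generation time
    passed: bool           # did the answer pass whatever quality check you use?


def summarize(runs: Sequence[Run]) -> dict[str, float]:
    """Return throughput and pass rate over all runs."""
    if not runs:
        raise ValueError("need at least one run")
    total_tokens = sum(r.generated_tokens for r in runs)
    total_seconds = sum(r.seconds for r in runs)
    if total_seconds <= 0:
        raise ValueError("total time must be positive")
    return {
        # Speed: generated tokens divided by the time spent generating them.
        "tokens_per_second": total_tokens / total_seconds,
        # Quality: fraction of runs whose answer passed the check.
        "pass_rate": sum(r.passed for r in runs) / len(runs),
    }
```

A fast model with a low pass rate and a slow one with a high pass rate then show up as different trade-offs rather than as a single speed figure.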

frollogaston 12 hours ago

The model already has its own quality benchmarks elsewhere. The article is just about running the model on X hardware, so the remaining question is then how fast it is. Or does the output quality somehow depend on the hardware too?

ozim 13 hours ago

A local model as such will give you "autocomplete on steroids", but it is not going to run off and implement a cross-project feature like a frontier model in, let's say, Cursor.

So there is no value in testing the quality of answers, but there is value in testing token speed.

You just have to have correct expectations.

krzyk 3 hours ago

Is autocomplete using LLMs really useful? Even with frontier models I found it to be about 50% right, so I turned it off and prefer to use IntelliJ's built-in completion, which is way more reliable.

For me, local models are all about quality and how to achieve it, e.g. by providing guardrails that test the job done.
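
One way such a guardrail could look, as a minimal sketch: after the agent edits the code, run a list of check commands and accept the change only if all of them exit with status 0. The helper names and the default pytest check are assumptions for illustration, not part of any particular agent harness.

```python
import subprocess
import sys
from typing import Sequence

# Hypothetical default check: run the project's pytest suite, if it has one.
DEFAULT_CHECKS = [[sys.executable, "-m", "pytest", "-q"]]


def run_check(cmd: Sequence[str], timeout_s: float = 300.0) -> tuple[bool, str]:
    """Run one check command; return (passed, combined stdout and stderr)."""
    try:
        proc = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s} seconds"
    except OSError as exc:
        # For example, the executable does not exist.
        return False, f"could not start command: {exc}"
    return proc.returncode == 0, proc.stdout + proc.stderr


def accept_if_checks_pass(
    checks: Sequence[Sequence[str]] = DEFAULT_CHECKS,
) -> tuple[bool, list[str]]:
    """Run every check; return the verdict plus the output of each failure."""
    failures = []
    for cmd in checks:
        passed, output = run_check(cmd)
        if not passed:
            failures.append(f"$ {' '.join(cmd)}\n{output}")
    return not failures, failures
```

The failure outputs can be fed back to the model as the next prompt, so a weaker model gets several attempts against the same objective test instead of being trusted on the first try.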

akman 14 hours ago

That's fair. There are many dimensions along which to define 'quality', including use case (coding? writing? multimedia?) and prompt. I suppose if you ask testers to provide benchmarks with their analysis, that might hamper their desire to share.

bicepjai 13 hours ago

I assumed LM Studio was the obvious choice after Ollama. Is there a reason LM Studio is not widely used?

krzyk 3 hours ago

Why would anyone use Ollama at all (aside from the obvious reasons one can look up online)? llama.cpp used directly, without this wrapper, is faster.

Basically one has two real choices for local LLMs: llama.cpp (if single user) or vLLM (if multi-user/enterprise).

dofm 12 hours ago

LM Studio is fine. Gorgeous, actually. I've found it really helpful for understanding parameters and settings, and for generally figuring things out.

But there is an incentive not to use it if you want to write an article that uses only open-source tools, because it isn't open source.

stingraycharles 13 hours ago

Yeah, I’ve also been using it on macOS; my experience is that it works better with the Metal API and has better performance.

cdolan 17 hours ago

Is there a link to the video? It did not render when I went to the page. I'm curious about the real-time feel of this.

dewey 17 hours ago

That's the direct link: https://ikyle.me/blog/2026/how-to-setup-a-local-coding-agent...

c-hendricks 16 hours ago

Note that this is cut to just before the model responds, so it is not a great way for people to judge the real-time feel of this.

freerunnering 8 hours ago

The full video is on Twitter: https://x.com/Freerunnering/status/2065275403548168398

Plus a follow-up one where you see me type the question in and press enter (though that video is with Qwen 3.6, not Gemma 4): https://x.com/Freerunnering/status/2065354101878055038

namnnumbr 17 hours ago

oMLX (https://github.com/jundot/omlx) makes running the MLX inference server quite easy for those interested in UI-based hosting. oMLX also supports MTP or DFlash drafting.

amboo7 4 hours ago

What about the tons of caches that just pile up until you notice that you must delete them manually?

w10-1 16 hours ago

Agreed (not sure what you mean by UI-based hosting).

oMLX does the caching I need to fit models that are near gross memory, and it handles most of the work in finding usable models. After cobbling together various solutions over months, I now just use oMLX, often from Xcode. I can tell the difference between Gemma-4 (local/free) and Claude (paid) only on the largest tasks.
