Posted by kkm 18 hours ago
One way or another, local AI is the future. I actually find weaker models more interesting because they keep me sharp (at the cost of velocity, of course).
I run something very similar, except that instead of using pi directly as the agentic harness, I use little-coder, which wraps pi with reasonable defaults for running local models. Even though my local setup is a bit slow, it is a thrill to do real work completely locally.
Alas, this video appears not to have been linked to the text that describes it. Perhaps I should ask an AI to generate an artistic rendering of the author's description.
So there is no value in testing the quality of answers, but there is value in testing token speed.
You just have to have correct expectations.
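For anyone who wants to put a number on token speed, here is a minimal sketch. It assumes you can wrap your runtime's streaming output in a callable that yields one item per generated token; `stream` and `clock` are hypothetical names for illustration, not part of any particular tool. The timer starts before the first token arrives, so prompt processing is included in the figure.

```python
import time
from typing import Callable, Iterable


def tokens_per_second(
    stream: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return items yielded per second for one streamed completion.

    Assumption: each item yielded by `stream()` is one generated token.
    If your runtime streams larger chunks, this undercounts.
    """
    start = clock()
    count = 0
    for _ in stream():
        count += 1
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return count / elapsed
```

Run it several times with the same prompt and compare the spread, since a single run can be skewed by caching or a cold model load.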
For me, local models are all about quality and how to achieve it, e.g. by providing guardrails that test the job done.
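A rough sketch of what such a guardrail could look like: generate, check the result, and feed the failure back on retry. Both `generate` and `check` are hypothetical callables you would supply yourself (for example, a call into your local model and a function that runs your test suite); nothing here is tied to a specific harness.

```python
from typing import Callable, Optional


def generate_with_guardrail(
    generate: Callable[[str], str],
    check: Callable[[str], Optional[str]],
    prompt: str,
    max_attempts: int = 3,
) -> str:
    """Retry generation until `check` passes or attempts run out.

    `check` returns None on success, or an error message on failure.
    """
    feedback = ""
    for _ in range(max_attempts):
        candidate = generate(prompt + feedback)
        error = check(candidate)
        if error is None:
            return candidate
        # Tell the model what went wrong so the next attempt can fix it.
        feedback = f"\n\nPrevious attempt failed the check: {error}"
    raise RuntimeError(f"no passing result after {max_attempts} attempts")
```

The point of the design is that the check, not the model, decides when the job is done.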
Basically one has two real choices for local LLMs: llama.cpp (if single user) or vLLM (if multi-user/enterprise).
But there is an incentive not to use it if you want to write an article that uses only open-source tools, because it isn't.
Plus a followup one where you see me type the question in and press enter (though that video is with Qwen 3.6, not Gemma 4) https://x.com/Freerunnering/status/2065354101878055038
oMLX does the caching I need to fit models that are near gross memory, and it handles most of the work in finding usable models. After cobbling together various solutions over months, I now just use oMLX, often from Xcode. I can tell the difference between Gemma-4 (local/free) and Claude (paid) only on the largest tasks.