Posted by ricardbejarano 15 hours ago
I don't really understand what the interface to the NPU looks like from the perspective of a non-system caller, if one exists at all. This is a Samsung device, but I'm wondering about the general principle.
I've considered buying an M3 Ultra, since it seems to be the most often discussed option for running local LLMs on Apple hardware. Generation speed might be okay, but prompt processing can take ages.
It's pretty obvious that this reasoning scaling is a mirage; parameters are all you need. Everything else is mostly just wasting time while hardware gets better.
The tool is very nice though.
It’s basically an open-source OS layer that standardizes the local AI stack—Kubernetes (K3s) for orchestration, standardized model serving, and GPU scheduling. The goal is to stop fiddling with Python environments/drivers and just treat local agents like standardized containers. It runs on Mac Minis or dedicated hardware.
One thing I do wonder is what sort of solutions there are for running your own model, but using it from a different machine. I don't necessarily want to run the model on the machine I'm also working from.
You can also use the kubernetes operator to run them on a cluster: https://ollama-operator.ayaka.io/pages/en/
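For a simpler single-host setup (no cluster needed), Ollama also exposes a plain HTTP API on port 11434, so the model can run on one box and be called from any other machine on the network. A rough sketch in Python using only the standard library; the hostname and model name here are placeholders, not anything prescribed by Ollama:

```python
import json
import urllib.request

def build_payload(model, prompt):
    """Build the JSON body for Ollama's /api/generate endpoint."""
    # stream=False asks for a single JSON response instead of a stream of chunks
    return {"model": model, "prompt": prompt, "stream": False}

def generate(host, model, prompt, port=11434):
    """Call an Ollama server running on another machine over plain HTTP."""
    req = urllib.request.Request(
        f"http://{host}:{port}/api/generate",
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (assumes a machine named "gpu-box" on the LAN running `ollama serve`):
# print(generate("gpu-box", "llama3", "Why is the sky blue?"))
```

By default Ollama only listens on localhost, so on the serving machine you'd set OLLAMA_HOST to bind a network-reachable address.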