Posted by ricardbejarano 15 hours ago
I don't really understand what the interface to the NPU looks like from the perspective of a non-system caller, if one exists at all. This is a Samsung device, but I'm wondering about the general principle.
I've considered buying an M3 Ultra, since it seems to be the most often discussed option for running local LLMs on Apple hardware. Generation speed might be okay, but prompt processing can take ages.
It's pretty obvious that this reasoning scaling is a mirage; parameters are all you need. Everything else is mostly just wasting time while hardware gets better.
The tool is very nice though.
It’s basically an open-source OS layer that standardizes the local AI stack—Kubernetes (K3s) for orchestration, standardized model serving, and GPU scheduling. The goal is to stop fiddling with Python environments/drivers and just treat local agents like standardized containers. It runs on Mac Minis or dedicated hardware.
One thing I do wonder is what sort of solutions there are for running your own model, but using it from a different machine. I don't necessarily want to run the model on the machine I'm also working from.
You can also use the kubernetes operator to run them on a cluster: https://ollama-operator.ayaka.io/pages/en/
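For a simpler single-host setup (no cluster needed), Ollama also exposes a plain HTTP API on port 11434, so the model can run on one box and be called from any other machine on the network. A rough sketch in Python using only the standard library; the hostname and model name here are placeholders, not anything prescribed by Ollama:

```python
import json
import urllib.request

def build_payload(model, prompt):
    """Build the JSON body for Ollama's /api/generate endpoint."""
    # stream=False asks for a single JSON response instead of a stream of chunks
    return {"model": model, "prompt": prompt, "stream": False}

def generate(host, model, prompt, port=11434):
    """Call an Ollama server running on another machine over plain HTTP."""
    req = urllib.request.Request(
        f"http://{host}:{port}/api/generate",
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (assumes a machine named "gpu-box" on the LAN running `ollama serve`):
# print(generate("gpu-box", "llama3", "Why is the sky blue?"))
```

By default Ollama only listens on localhost, so on the serving machine you'd set OLLAMA_HOST to bind a network-reachable address.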