Local AI needs to be the norm

Posted by cylo 1 day ago

1595 points | 626 commentspage 8

try-working 15 hours ago|

I'm building a protocol and router runtime for hybrid local/cloud AI.

The goal is that you would assign roles to models based on tasks, capabilities and observed performance. The router would then take care of model selection in the background.

It's tricky though. Probably have another two weeks before I can release the runtime.

I have a preview up at https://role-model.dev/

You can follow me on Twitter if you want updates (see profile)

Galanwe 21 hours ago||

I would love for local inference to be possible, but from my experience, Kimi 2.6 is the only model that would be worth it, and its a $10k (M3 Ultra max spec'd - 30s TTFT so kind of slowish) to $30k (RTX6000/700GB+ DDR5) upfront, noise / power consumption aside.

mft_ 21 hours ago|

You're maybe missing the article's point, which is to use local models appropriately:

> “But Local Models Aren’t As Smart”

> Correct.

> But also so what?

> Most app features don’t need a model that can write Shakespeare, explain quantum mechanics, and pass the bar exam. They need a model that can do one of these reliably: summarize, classify, extract, rewrite, or normalize.

> And for those tasks, local models can be truly excellent.

Galanwe 21 hours ago|||

This is a bit naive IMHO...

I have tried quite a bunch of local models, and the reality is that it's not just a matter of of "it's a small model that should be hostable easily". Its also a matter of whats your acceptable prefill TTFT and decode t/s.

All the local models I used, on a _consumer grade_ server (32GB DDR5, AMD Ryzen) have been mostly unusable interactively (no use as coding agent decently possible), and even for things like classification, context size is immediatly an issue.

I say that with 6m experience running various local models for classifying and summarizing my RSS feeds. Just offline summarizing ans tagging HN articles published on the front page barely make the queue sustainable and not growing continuously.

mft_ 20 hours ago||

1) Again, I suspect you're missing the point of the article. The iPhone's on-device LLM is (apparently) ~3 Bn parameters - and runs well/fast enough to be used in the manner described. Of course, the iPhone has its GPU to leverage.

2) It's probably not the time/place to trouble-shoot your "consumer grade server" LLM experience, but if you're running on CPU (you don't mention a GPU) then yeah, your inference speed will be slow.

3) Counterpoint: my consumer-grade Macbook Pro (M1 Max, 64GB) runs Qwen3.6-35B-A3B fast enough to be very usable for regular interactive coding support. (And it would fly with smaller models performing simpler tasks.)

mikrl 21 hours ago|||

One of my hobbyist workflows involved transcribing ETF prospecti into yaml for an optimizer to optimize over.

Used to take me maybe 10-20 minutes per sheet.

Then I got codex to whip up a script that sends each sheet to a fairly low parameter locally running LLM and I have the yaml in a couple seconds.

My dream is to bootstrap myself to local productivity with providers… I know I’ll never get there because hedonic treadmill etc, but I do feel there’s lots more juice to squeeze. I just need to invest more time into AI engineering…

october8140 10 hours ago||

They will never let us have enough RAM every again. RAM will be kept behind locked doors in the name of national security and only trusted corporations will be aloud to run AIs and "safely" run them in the cloud and sell them to us.

j3th9n 10 hours ago||

I’ll make my own RAM, with the help of AI.

yuppiepuppie 10 hours ago||

Is this a conspiracy?

stuaxo 9 hours ago||

Harnessed seem to be a big part of what makes stuff good or not.

I tried Cline and couldn't get it working well and part of this was that at the time it expected OpenAIs output format.

hydra-f 11 hours ago||

Unless there's a breakthrough or a transition to diffusion models, it's hard to imagine them becoming an affordable commodity

Small models are still in their infancy, and there's still much to sort out about and around them, as well

osjxjsjxjs 2 hours ago||

No AI needs to be the norm. Again.

Aleesha_hacker 10 hours ago||

To what extent is this strategy currently feasible for windows of android development? I am interested in how portable local-first AI is across platforms, but it seems promising on Apple devices.

eldenring 15 hours ago||

This article makes 0 sense. Its not up to billing or computer systems or ease of use or anything else that matters. The question is will the scaling laws, which in the asymptote are likely the laws of physics, hold up in converting energy to smarter models. Its not really up to anyone, the labs or developers, to choose if local or remote models will be the norm.

throawayonthe 6 hours ago||

it's not going to happen with LLMs unless ram + storage gets several orders of magnitude cheaper like, yesterday

informatics aren't magic, you'll never be able to compress """knowledge""" into a small model in a way equivalent to the 1.5 TB model

kilroy123 6 hours ago||

I agree. But I also think the future is some kind of hybrid approach where agents run locally, what they can, and then call out to the cloud for what they can't.

acidhousemcnab 6 hours ago||

This will happen, but reconfiguring the infrastructure of the entire planet to train LLMs and run them over networks might be the "bubble", the megalomania.

AuditMind 4 hours ago|

It's almost here. Look at the new Qwen 3.6 models. Solid stuff there.

It runs by now on 8GB Vram, so a Legion 5 for about 1500$ could be a good workhorse.

More comments...