Local AI needs to be the norm

Posted by cylo 18 hours ago

1222 points | 511 commentspage 2

acidhousemcnab 1 hour ago|

We need better GUI and OS integrations with sandboxed local LLMs, before this is thrust on everyone and rolled out as the default in commercial OSes. Here in Berlin, I was functionally surrounded and hounded out of a local meetup, due to confrontation over the naive pushing of OS-level and network access agentic AI, done in the mode of mystical powers and artistic possibilities, which due to recent experiences, comes off as string-pulling, to produce a threat or danger that then must be observed and kept tabs on, according to Goodhart's Law.

jillesvangurp 2 hours ago||

I get the sentiment for self hosting. But there are a few counter arguments:

- Self hosting is expensive. It involves expensive machines with GPUs that cost hundreds per month if you use cloud based ones. You might need multiple of those. And you need people to mind those machines and they are even more expensive per month.

- If you run stuff on your laptop, it consumes a lot of resources and energy. I have qwen running on my laptop. Even minimal usage turns my laptop in a radiator. Nice as a demo, but I can't have it this hot all the time. It would run out of battery, and it's probably not great for longevity of components in the laptop.

- Models are evolving quickly and the self hosted smaller ones aren't as good when it comes to things like tool usage, reasoning, etc. Being able to switch tot he latest model is valuable.

- It's easier to get your use case working with one of the top models than with one of the smaller self hosted ones.

- If you get the wrong hardware, it might not be able to run the latest models very soon.

- Self hosting models is mostly a cost optimization. It only becomes relevant if you hit a certain scale.

- You have alternatives in the form of hosted models via a wide range of service providers. Some of those are EU based and offer all the things you'd be looking for if you are offering your services there. Including legal requirements.

- Reinventing what these companies do in house is technically challenging and possibly more expensive than self hosting models because now you need a lot of engineering capacity dedicated to that. And legal. And all the rest.

If, like most companies/people, you are at the experimenting stage, the cheapest and fastest is just getting an API key from an API provider of your choice. You can take it from there if your experiment actually works. And then it's mostly about optimizing cost. If your API usage goes to the thousands per month or worse, it becomes a cost/quality trade off.

mgrund 4 hours ago||

I really really want to like local AI, but I highly doubt it will see wide adoption for a long time.

The additional up-front cost for hardware designed to run an LLM in addition to normal workload is unlikely to be accepted by most consumers.

The scale will be very constrained (like Apples on-device models which are small, heavily quantized, and have a small 4K token context window). It’s also terrible for battery life.

AI as it is implemented today is simply just computationally expensive and unless you put in dedicated hardware (like the ANE) for only this purpose - a large cost driver - I don’t really see it getting large scale adoption.

Companies will probably need a server-backed solution as fallback if they want reasonable user experience, so why even invest in diverse hardware support.

ninjahawk1 5 hours ago||

In my opinion, this is similar to the earlier internet and computers. Few households or individuals had access to state of the art computers, it was primarily research or more well-off individuals. Most random people didn’t really know what it was and certainly didn’t use one.

Now today, AI is very expensive and not readily accessible to most people without paying a good amount.

The early internet became now you can just get a free phone from phone companies so long as you get their extras. Then you get a ton of subscriptions and ad-ons, but you don’t have to spend money, could just use youtube with ads etc.

Local AI would similarly shift this dynamic to paying for access to plug-in’s and tools for your local AI to be able to use. Like how the subscription model works right now.

With local model advancements, such as specifically Qwen 3.6 35B A3B, this future is becoming more likely by the year IMO.

robot-wrangler 14 hours ago||

Entrenched interests are going to do everything to stop local, but there's at least a few technical reasons to believe small and specialized models could be the norm eventually. If that does happen, local will follow.

TFA is focused on whether big models are necessary for what users want. There's some evidence they may never actually be reliable enough unless a) mechanistic interpretation matures far enough or b) our multi-agent systems all become multi-model.

For (a), advancement in MI might fix problems with big models, but would also mean we can maybe get unified representations, and just slice and dice the useful stuff out of huge models, getting only what we need without the junk. Ability to isolate problems won't really come without bringing the ability to isolate functional subsystems. Only want logic? Only vision? Just cut it out of the big monster and enjoy reduced costs and surface area for problems.

For (b), just look at stuff like the evil vector, or the category of hallucinations specific to tool-use. Without a complete solution for helpful/honest/harmless alignment, it seems likely that creativity and rigor (and many other things) are fundamentally at odds. If you start to need many models for everything anyway, why do we need the huge expensive do-everything ones? So specialization also becomes a pressure to shrink everything towards minimal reliable experts

wrxd 14 hours ago||

The example in the post confirms my theory that for local models to succeed they need to be "good enough", not big enough that they can compete with frontier models.

They need to be able to do a small task well and they need to be able to run reasonably on consumer-class devices. Even better if they can run on mobile phones.

In my experiments with local LLMs I noticed that while increasing the size of the model is nice the real thing that turns a barely useless model into something useful is the ability to use tools. Giving my models the ability to search the web and fetch web pages did way more to solve hallucinations than getting a bigger model. And it doesn't have a training cutoff. Sure, the bigger model is probably better at using tools but I often find the smaller models to be good enough.

Gigachad 8 hours ago|

Will there even be a web to search in the future? These days public access blogs are dying and being replaced with hallucinated AI websites. Sites with original research like Reddit and YouTube are being locked up to prevent 3rd party indexing.

Knowledge and clean data sets are becoming increasingly valuable, and free community knowledge is drying up. The next big programming language won’t have years of Stack Overflow posts to train on.

Maybe we will see some kind of licensing deals where owners of good datasets charge you a fee to let your AI search them.

teiferer 2 hours ago||

Every reply here forgets/overlooks the main reason for why this is not going to happen: The astronomical AI data center investments currently underway. Those place are not just for training. They are for inference too and the way all those investments are expected to eventually pay off. The whole AI sector of our industry depends on running models in these places.

zozbot234 2 hours ago|

These astronomical AI data centers will be used for high-value inference with smarter models that really are too large for running locally. The investments will be fine once they pivot to that use. Currently available open models are not in that range.

throawayonthe 1 hour ago||

it's not going to happen with LLMs unless ram + storage gets several orders of magnitude cheaper like, yesterday

informatics aren't magic, you'll never be able to compress """knowledge""" into a small model in a way equivalent to the 1.5 TB model

kilroy123 1 hour ago||

I agree. But I also think the future is some kind of hybrid approach where agents run locally, what they can, and then call out to the cloud for what they can't.

acidhousemcnab 1 hour ago||

This will happen, but reconfiguring the infrastructure of the entire planet to train LLMs and run them over networks might be the "bubble", the megalomania.

revolvingthrow 16 hours ago||

A local Answer Machine is the dream, especially when the internet is decaying and generally on its last legs, but the hardware requirements seem like a huge mountain to climb. Things are progressing tremendously - deepseek v4 flash is very good for what it is - but even that goes beyond any reasonable local setup, which imo is 128 GB ram + 16 GB vram. 4 ram slots on a consumer board craters ram speed, 256 gb macs are too expensive, and even then the inference is ungodly slow.

On the other hand… v4 flash model is actual magic compared to what was available 2 years ago. If the rate of improvement stays as is, we’ll get a similar performance in a ~120B model in a year, which is viable (if expensive) for everyman hardware. Possibly you’ll be able to run its equivalent on a ~$1200 laptop by 2028, which for me-in-2020 would sound straight out of a scifi movie. A good harness that lets the model fetch data from other sources like a local wikipedia copy from kiwix could do a lot for factual knowledge, too; there’s only so much you can encode in the model itself, but even a cheapish (pre-curent prices) 2TB drive can hold an immense amount of LLM-accessible data.

Big caveat: I don’t see local models for programming or generally demanding agentic tasks being worth it anytime soon. You likely want bleeding edge models for it, and speed is far more important. Chat at 20tok/s is fine; working on even a small codebase at 20tok/s, especially on a noticeably weaker model, is just a waste of time. Maybe it’s a PEBKAC but I have no idea how people make any meaningful use out of qwen 3.6.

zozbot234 14 hours ago|

> and even then the inference is ungodly slow.

This is the wrong way of putting it. Local inference with SOTA models is all about slowing down compute for the sake of fitting on bespoke repurposed hardware. You don't need to go fast if you have the whole machine to yourself 24/7. Cloud AI vendors can't match that kind of economics.

gregjw 3 hours ago|

Is there a place to learn more about Local AI specifically and maybe even more specifically about models for bespoke purposes or curating them yourself for more specific uses? Feels like theres a lot of fat you can trim off because you don't need generic use, but I don't understand where to even begin there.

StevenWaterman 3 hours ago|

/r/localllama is one of the most useful places

More comments...