
Posted by cylo 20 hours ago

Local AI needs to be the norm (unix.foo)
1352 points | 542 comments | page 4
duchenne 10 hours ago|
Cloud models can use batch processing, which is significantly more efficient. A local model runs with a batch of essentially one, which takes about as much time to process as a batch of 100, because the GPU is memory bound: it spends most of its time streaming the model weights from VRAM into the GPU's on-chip cache while the compute cores sit idle. With a batch of 100, the weight-loading time and the compute time are roughly comparable. So local models start out at roughly 100x lower efficiency. Secondly, local models are idle most of the time waiting for the user to write a prompt, so the efficiency gap is probably closer to 1000x.
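A rough back-of-envelope sketch of that effect, with made-up but plausible numbers (the bandwidth, compute and model-size figures below are assumptions, not measurements of any particular GPU):

    # Toy roofline-style model of memory-bound decoding (illustrative numbers only).
    # Each decode step must stream all weights from VRAM once, no matter how
    # many sequences are batched together; compute grows with the batch size.

    WEIGHT_BYTES = 14e9        # ~7B parameters at fp16 (assumed)
    MEM_BANDWIDTH = 1e12       # 1 TB/s VRAM bandwidth (assumed)
    COMPUTE = 100e12           # 100 TFLOP/s usable compute (assumed)
    FLOPS_PER_TOKEN = 2 * 7e9  # ~2 FLOPs per parameter per generated token

    def step_time(batch_size):
        memory_time = WEIGHT_BYTES / MEM_BANDWIDTH           # same for any batch
        compute_time = batch_size * FLOPS_PER_TOKEN / COMPUTE
        return max(memory_time, compute_time)

    for b in (1, 100):
        t = step_time(b)
        print(f"batch={b:3d}: {t * 1000:.1f} ms/step, {b / t:,.0f} tokens/s total")

With these numbers, one decode step costs roughly the same wall-clock time whether it serves 1 request or 100, so the big batch gets about 100x more useful work out of the same GPU-second.
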
DrScientist 4 hours ago||
And what if your local computer essentially has a dedicated model chip with its own memory, where the model stays loaded 100% of the time?
r0b05 8 hours ago||
It's an interesting point, but local GPU efficiency is not something I think about when I'm being rate limited or when my subscription costs keep rising.
jjordan 18 hours ago||
It feels like we're one technological breakthrough away from all of these data centers going up being deemed irrelevant.
Lalabadie 17 hours ago||
The cynical take is increasingly the only rational one:

The promised mega-data center deals are meant to boost valuations today, not serve tons of customers three years from now.

_heimdall 17 hours ago|||
It seems pretty clearly in line with the dotcom bubble to me. Every company claims to be a leading AI company, those building infrastructure are promising the moon and getting 1/3 of the way there, and no one knows how to monetize it enough to justify the hype or the expense.
jjordan 17 hours ago|||
oof, this bubble popping is gonna be brutal.
krupan 16 hours ago|||
It took us only, what, 70-ish years of computer and AI research to get to this point, so yeah, probably just one little thing and then we'll have it </sarcasm>

Seriously. I have never ever seen so many people so willingly drink the marketing kool-aid from companies selling their product before. It's scarier to me than any threats of AI actually disrupting society (because it is so far from being capable of doing that).

i_love_retros 18 hours ago||
What would that breakthrough be?
Waterluvian 18 hours ago|||
Magic math and computer science that allows us to get the same quality response for a fraction of the GPU.
intothemild 17 hours ago|||
That's already happening. Qwen3.6 and Gemma4.

Basically small and medium models that are crazy well trained for their sizes.

Then we have a lot of speculative decoding stuff like MTP and others coming to speed up responses, and finally better quantisation to use less memory.

Local LLMs are the future, and the larger labs know that the open models will eat their lunch once people realise that the gap is only a few months: if we were happy with the closed LLMs of a couple of months ago, we should be happy with the open models now.
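
For anyone unfamiliar with the speculative decoding mentioned above, here's a toy sketch of the draft-and-verify idea (both "models" here are stand-in functions invented for the illustration, not real LLMs):

    # Toy sketch of speculative decoding: a cheap draft model proposes a few
    # tokens, and the expensive target model verifies them in a single pass.

    TRUTH = "the quick brown fox jumps over the lazy dog"

    def target_model(prefix):
        """Expensive model: next token is the next char of a fixed string."""
        return TRUTH[len(prefix)] if len(prefix) < len(TRUTH) else ""

    def draft_model(prefix):
        """Cheap model: usually agrees with the target, occasionally wrong."""
        return "x" if len(prefix) % 7 == 3 else target_model(prefix)

    def speculative_decode(prefix, steps, k=4):
        target_passes = 0
        for _ in range(steps):
            # 1. Draft k tokens cheaply.
            draft, p = [], prefix
            for _ in range(k):
                t = draft_model(p)
                if not t:
                    break
                draft.append(t)
                p += t
            # 2. One target pass: accept the agreeing prefix of the draft...
            target_passes += 1
            for t in draft:
                if t == target_model(prefix):
                    prefix += t
                else:
                    break
            # ...and append the target's own next token (the correction).
            nxt = target_model(prefix)
            if not nxt:
                break
            prefix += nxt
        print(f"target passes: {target_passes}, generated: {prefix!r}")
        return prefix

    speculative_decode("the q", steps=8)

The win is that one pass of the expensive target model can confirm several cheap draft tokens at once, which is why it speeds up responses without changing what gets generated.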

krupan 16 hours ago||
And how were those models developed and trained?
lelanthran 16 hours ago|||
> And how were those models developed and trained?

That's irrelevant to my decision to use local or not.

krupan 16 hours ago|||
That's not what this thread is about? We're saying some new breakthrough is needed, someone said it already has happened, and I'm asking if it really has. Has it? I don't think so; those models are not in any way fundamentally different from other LLMs.
lelanthran 15 hours ago||
> We're saying some new breakthrough is needed, someone said it already has happened, and I'm asking if it really has.

I didn't read "and how were those models trained" as "Are we there yet?"

intothemild 7 hours ago||
There's a percentage of people who love to question how the open models were trained. They almost always try to make some argument about using the closed frontier models for distillation as some form of theft.

Just totally forgetting that the frontier models themselves stole an insane amount to get to where they are.

It's theft all the way across the board, and when someone tries to argue that the open models' theft is bad but Altman's or Amodei's theft is fine, they are revealing a lot about themselves.

benjiro3000 1 hour ago|||
[dead]
YZF 17 hours ago||||
The current LLMs are also "magic" so anything is possible. AFAIK there is no proof that the current architecture is optimal. And we have our brains as a pretty powerful local thinking machine as a counter-example to the idea that thinking has to happen in data centers.
_heimdall 17 hours ago||
I want to ask what makes them magic, but even those building LLMs don't really know what happens when they run inference...

I have to assume current architectures aren't optimal though, the idea that we stumbled into the one and only optimal solution seems almost impossible.

toufka 17 hours ago||||
I mean, the most cutting-edge iPhones, iPads and MacBook Pros _today_ are quite capable of running today's high-end local LLMs in real time.

If you project out that hardware just a couple of years, and the trained models out a couple of years, you end up in a place where it makes so much more sense to run them locally, for all sorts of latency, privacy, efficacy, and domain-specific reasons.

Not all that different from the old terminal & mainframe->pc shifts.

Finally, hardware has seemingly gotten out ahead of the software most folks use: watching YouTube, listening to music, playing a game or two. There was a time when playing an mp3 or watching a 4K video really taxed all but the nicest systems. Hardware fixed that problem, as it very well could fix this one.

sofixa 17 hours ago||
> I mean, the most cutting edge of iPhones, iPads and MacBook Pros _today_ are quite capable of running in realtime today’s high-end local LLMs

Definitely not the high end local LLMs. The small ones, yes, absolutely.

> If you project out that hardware just a couple of years

One of the biggest bottlenecks for LLMs is memory capacity and bandwidth. With the current crunch for memory, it's unlikely we'll see big advances in the average memory available, or its bandwidth, on regular (not super-high-end) devices in the coming years.

Alternatively, it's possible we get dedicated SLMs (small language models) for e.g. phone-specific use cases that are optimised and run well.
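
To put rough numbers on the memory point (both the quantisation choices and the ~100 GB/s phone-class bandwidth figure below are illustrative assumptions, not device specs):

    # Rough sizing arithmetic for on-device models: weight memory at a given
    # quantisation, and the bandwidth-bound ceiling on decode speed.

    def weights_gb(params_billion, bits_per_weight):
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    def max_tokens_per_s(weights, bandwidth_gb_s):
        # Decode is roughly bandwidth-bound: every token re-reads all weights.
        return bandwidth_gb_s / weights

    for params, bits in [(3, 4), (8, 4), (70, 4)]:
        gb = weights_gb(params, bits)
        tps = max_tokens_per_s(gb, bandwidth_gb_s=100)  # assumed phone-class memory
        print(f"{params}B @ {bits}-bit: ~{gb:.1f} GB weights, <= ~{tps:.0f} tokens/s")

A 3B model fits comfortably and decodes quickly; a 70B model neither fits in phone RAM nor decodes at a usable speed, which is roughly the small-vs-high-end split above.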

_heimdall 17 hours ago|||
I'd assume it's a totally different architecture that isn't based on storing a compressed dataset of all digital human text.
acidhousemcnab 3 hours ago||
We need better GUI and OS integrations with sandboxed local LLMs before this is thrust on everyone and rolled out as the default in commercial OSes. Here in Berlin, I was effectively surrounded and hounded out of a local meetup after a confrontation over the naive pushing of agentic AI with OS-level and network access, sold in the mode of mystical powers and artistic possibilities, which, after recent experiences, comes off as string-pulling: manufacturing a threat or danger that then has to be watched and kept tabs on, per Goodhart's Law.
Tepix 6 hours ago||
I'm pretty sure that AI assistants will become widespread.

I consider it to be very careless to entrust your emails, your chats, your calendar, your notes, your calls, your pictures, your contacts, your location history, your waking hours, your files, your TODO list, i.e. stuff including your health data to the for-profit AI companies. The temptation to earn money with your data is just too great, plus the risk of the data being stolen and sold illegally.

Local AI should be the default. For everyone who can't do local AI, we need confidential compute. Yes, it has been hacked before, but it makes attacks a lot harder.

pjerem 6 hours ago|
> I consider it to be very careless to entrust your emails, your chats, your calendar, your notes, your calls, your pictures, your contacts, your location history, your waking hours, your files, your TODO list, i.e. stuff including your health data to the for-profit AI companies.

Still, we all do it with Google. (I don't do it anymore, but I did it for the better part of two decades, so I include myself.)

QuadrupleA 9 hours ago||
Not sure how excited I feel about visiting your website and having it auto-download an 8GB model with GPT-3.5-level hallucinations, and then probably crash because I only have 6GB of VRAM. My dad won't be able to use it, nor will anyone else without a bleeding-edge device. On a device with a powerful enough "neural engine" the battery will drain quickly, while the heatsink burns a hole in my lap.
dgb23 5 hours ago|
Local could also mean self hosted.

The obvious optimization for the case presented would be to generate all the summaries on a server instead of in the client. Then the total compute used would scale with the number of articles instead of the number of users.
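
A minimal sketch of that caching idea (the helper functions are hypothetical stand-ins, not any real API):

    # Summaries are computed once per article and cached, so model compute
    # scales with the number of articles rather than the number of readers.
    from functools import lru_cache

    def fetch_article(article_id):
        return f"Full text of article {article_id} ..."   # stand-in for storage

    def generate_summary(text):
        print("expensive model call")                      # stand-in for the LLM
        return text[:40] + "..."

    @lru_cache(maxsize=None)
    def summarize_article(article_id):
        return generate_summary(fetch_article(article_id))

    # Three readers request the same article -> only one model call happens.
    for _ in range(3):
        summarize_article("local-ai-norm")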

holtkam2 17 hours ago||
I wish I could upvote this twice. We (devs) really REALLY need to consider on-device compute before going to the cloud for LLM inference.
mattlondon 17 hours ago||
Yet there is another post a few rows down where people are losing their shit because Chrome ships a local LLM that uses a couple of GB of space for local inference.

Damned if they do, damned if they don't.

dlcarrier 17 hours ago||
Maybe don't use gigabytes of bandwidth and storage space, without asking.
hparadiz 16 hours ago||
Easy. Stop using Chrome.
userbinator 15 hours ago|||
If I want a model I'll go download one. (And I did, not long ago, to play around with image generation.)
bytecauldron 17 hours ago|||
This is a bit disingenuous. People aren't losing their shit about a local model being installed. It's the lack of user autonomy. Just give the option to download a model instead of a silent install. It's not that hard. This is how every other local option works.
wmf 16 hours ago||
AFAIK Apple and MS auto-download local models.
FridgeSeal 13 hours ago|||
The former has made a big deal about local inference and marketed that as an OS level feature.

You can also…turn it off.

Chrome silently opted people into it _and_ downloaded the model without asking, because they decided that's something they (Chrome) fancied doing.

The difference should be pretty obvious.

bytecauldron 12 hours ago||||
Sorry, I should have been more specific. This is how every *good* local option works.
aabhay 17 hours ago|||
This is a weird take. If it's not opt-in, or you're shoehorning it into a browser, then that sucks. Nobody is getting enraged that an app for running local LLMs downloads data to do so.
avadodin 16 hours ago||
Although you can opt out, and in some cases even disable the download feature when you build them yourself, most of the local LLM tools are too download-happy by default.
fg137 16 hours ago|||
You might want to read the comments to understand what people are actually complaining about.

This comment is quite dishonest about the nature of the discussion.

themafia 17 hours ago|||
If it was such a good and laudable idea, why didn't they tell me about it before they activated it? It seems to me like they avoided it in the hopes that I wouldn't notice, because, presumably, if I had, I would have IMMEDIATELY disabled it.

Also, why doesn't their task manager show that it's actually the one downloading? Why does it go out of its way to hide this activity?

Since I have conky on my desktop, I caught this immediately and took the action I preferred with my own computer, which was to _immediately_ disable it.

StilesCrisis 17 hours ago||
I'm guessing you immediately close the What's New Chrome tab when you update?

https://developer.chrome.com/blog/new-in-chrome-148#prompt-a...

https://www.google.com/chrome/ai-innovations/

They have absolutely not been shy about any of this.

themafia 17 hours ago||
I've never had a "What's new" tab ever open because I disable the customized home page where that's displayed. I'm guessing you're not aware that's an option.

Please show me where in either of those documents it explains it's going to download a 4GB model.

crazygringo 16 hours ago||
I use an extension that gives me a customized homepage, but I still always get the "what's new" tab on every major version upgrade.

It's a totally separate tab that opens. It's got nothing to do with what you use as your homepage.

themafia 5 hours ago||
Thank you for going out of your way to deny my exact experience. Do you think I'm doing this to rag on Google? And you're this eager to defend them?

I'm on Gentoo. I have to update Chrome manually. I updated it. On update I _never_ get a "what's new" page. I've had this profile for more than a decade, so I have no actual idea why, but I can absolutely tell you, I do *not* get one. After the update it started consuming all my bandwidth. This usage did not show in its task manager. I have a metered connection. This is a problem for me. I worried it was a compromised plugin. I had to spend 10 minutes in Firefox discovering why Chrome was doing this, then go to the configuration and disable it.

This was a disappointing experience. I'm sorry you feel differently; other than stating the obvious, I seriously have no idea what you and the other corporate defense squad members are trying to achieve with this gaslighting nonsense.

StilesCrisis 1 hour ago||
That makes sense. You aren't really "updating" at all. You're basically reinstalling a new Chrome on every update. It makes sense then that you aren't seeing "what's new" because that's not how a fresh install starts up.

Note that this package and update is actually not maintained by Google at all, it's done by Gentoo: https://wiki.gentoo.org/wiki/Project:Chromium/How_to_bump_Ch...

I hate to be an apologist for anything but I think you are pointing fingers in the wrong place. The Google-official releases use the built-in automatic updater and do show What's New. This is a Gentoo release and they chose to do their own thing for updates.

ekjhgkejhgk 17 hours ago||
You don't understand the difference between "I run a local LLM because I chose to" vs "The browser chose to run a local LLM and I have no say"? You don't understand?

Not to mention that the LLM that I choose to run requires a monster machine and is infinitely more capable than whatever google chose to put on their browser?

I mean, none of this affects me because I don't use chrome, obviously, but you don't see the difference? Bewildering.

StilesCrisis 17 hours ago||
Did you opt into WebGPU? QUIC? Canvas 2D? Brotli? Browsers don't work that way.
za_creature 16 hours ago||
The size difference between the local LLM and all of the above is about... the size of the local LLM.
timeattack 17 hours ago||
My problem with LLMs (apart from the philosophical aspects and the economic impact) is that it would be unlikely for any of us to be able to train something functional locally (toy-like LLMs, sure, but something really useful, no). Besides requiring immense computing power, it also requires a dataset which, for the most part, is obtained illegally.
kibwen 17 hours ago||
This seems overly pessimistic.

I may personally be of modest intelligence, but to acquire the intelligence that I do have, I did not need to train on every book ever written, every Wikipedia article ever written, every blog post ever written, every reference manual ever written, every line of code ever written, and so on. In fact, I didn't train on even 1% of those materials, or even 0.00000000001% of those. The texts themselves were demonstrably not a prerequisite for intelligence.

At minimum, given that it only took me about 20 years of casual observation of my surroundings to approximate intelligence, this is proof positive that the only "dataset" you need is a bunch of sensors and the world around you.

And yes, of course, the human brain does not start from zero; it had a few million years of evolution to produce a fertile plot for intelligence to take root. But that fundamental architecture is fairly generic, and does not at all seem predicated on any sort of specific training set. You could feasibly evolve it artificially.

krupan 16 hours ago|||
What does this even have to do with the parent? Your capabilities have nothing to do with LLM capabilities. The two work in completely different ways. The reason LLMs work is that they are huge and have been trained on vast amounts of data, full stop. Sure, there's potential someday to get something useful using less data, but we aren't there.
avadodin 15 hours ago||
You are right about the limitations of the architecture, but I wouldn't call LLMs huge. Flagship models, maybe, but that's just because they don't scale very well.

A universal translator with image and voice recognition and a decent breadth of encyclopedic knowledge, in only a small fraction of the size of an English Wikipedia dump (6GB vs 20+GB), is not "huge".

It is probably closer to the theoretical limit than anyone could have expected.

_heimdall 17 hours ago|||
You're also embodied and experiencing the world around you with more senses than only the ability to read text.
rogerrogerr 17 hours ago||
> the only "dataset" you need is a bunch of sensors and the world around you.
dlcarrier 17 hours ago|||
Not the whole thing, at least with current technology, but LoRAs are really good at fine-tuning and can be generated in a few hours on high-end gaming computers. So as long as the base model is in your language, you likely have enough spare computing power, in whatever electronics you own, to train a few LoRAs a month.

In the future, when regular home computers have the capabilities of modern servers, we'll be able to train the entire LLM at home.
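
A minimal sketch of the LoRA idea itself, in plain NumPy with toy sizes (a real fine-tune would use a training framework; the parameter-count arithmetic is the point here):

    # LoRA: keep the big pretrained weight W frozen and train only a small
    # low-rank correction A @ B on top of it.
    import numpy as np

    d, r = 1024, 8                          # hidden size and LoRA rank (toy values)
    rng = np.random.default_rng(0)

    W = rng.standard_normal((d, d))         # frozen pretrained weight
    A = rng.standard_normal((d, r)) * 0.01  # trainable
    B = np.zeros((r, d))                    # trainable; zero init => adapter starts as a no-op

    def forward(x):
        return x @ W + x @ A @ B            # base output + low-rank correction

    full, lora = d * d, 2 * d * r
    print(f"trainable params: {lora:,} vs {full:,} ({100 * lora / full:.1f}%)")

Training only a percent or two of the parameters is what makes a few hours on a gaming GPU plausible.
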

pronik 16 hours ago|||
There is so much technology that we are unable to reproduce locally; I don't think LLMs are any different. There will be large LLM manufacturers, small LLM manufacturers, artisanal LLM makers, LLM enthusiasts and of course LLM consumers, just like with everything else.
krupan 16 hours ago|||
And this is important because even though you are running a model locally, it's still a proprietary model. You have no say in what it was trained on, how that training data is labeled, what the guardrails are, what biases it might have, none of that.
Ucalegon 17 hours ago|||
Depends on the domain. There are plenty of use cases where the data needed for training is available for personal or non-commercial use. At that point, it comes down to the compute and time to do the training, and if you are willing to wait, consumer-grade hardware is perfectly capable of producing useful models.
woah 15 hours ago|||
Can you make your own CPU, locally?
RataNova 16 hours ago|||
That's a fair concern, but I'd separate training from inference here
cyanydeez 17 hours ago||
That sounds like government. So your problem is mostly that you expect to have a collective social effort, but not enough to pay for it as a public good.
diwank 10 hours ago||
In order for us to get there, I think we need a standardized API at the OS layer for local models, so that the OS can optimize, batch and safely allocate resources. Something like an analog of Chrome's local-model "prompt" API, but provided and managed by the OS itself. The user can choose which model they want to primarily use and so on, but all of the heavy lifting and continuous batching is done automatically by the OS.
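
Purely as a thought experiment, such an OS-level service might look something like the sketch below; every name here is invented for illustration, nothing is a real OS interface:

    # Hypothetical OS "prompt" service: one daemon owns the loaded model,
    # apps submit requests, and the OS can queue/batch them and enforce policy.
    from dataclasses import dataclass, field

    @dataclass
    class PromptRequest:
        app_id: str                    # lets the OS apply quotas / permissions
        prompt: str
        max_tokens: int = 256
        priority: str = "background"   # background requests can be batched

    @dataclass
    class LocalModelService:
        model_name: str = "user-chosen-default"
        queue: list = field(default_factory=list)

        def submit(self, req):
            # A real service would enqueue, batch across apps, sandbox, and
            # stream tokens back; here we just echo to show the shape of the API.
            self.queue.append(req)
            return f"[{self.model_name}] reply to: {req.prompt!r}"

    svc = LocalModelService()
    print(svc.submit(PromptRequest(app_id="com.example.notes", prompt="Summarise my day")))
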
andychiare 3 hours ago|
> “AI everywhere” is not the goal. Useful software is the goal.

Great observation! Often the excitement of novelty makes us lose sight of the real goal
