
Posted by cylo 1 day ago

Local AI needs to be the norm (unix.foo)
1722 points | 682 comments
DoctorOetker 19 hours ago|
One advantage of local AI is continual learning.

When I say 'moat' I don't mean moat specific to a company vis-a-vis other companies, but 'moat' specific to the set of inference providers vis-a-vis self-hosted local inference.

The moat consists primarily of being able to batch inference requests.

If we pretend people weren't interested in long context lengths, there would be a moat for inference providers, who can batch many requests so that streaming the model weights (whether from system RAM to GPU RAM, or from GPU RAM to GPU SRAM cache) is amortized over multiple requests.
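Rough back-of-envelope of that amortization (Python; the 70B model at bf16 and the 1.8TB/s bandwidth are illustrative assumptions, not measurements):

    # Each decoded token has to stream the active weights once, but a batch of
    # requests shares that single pass, so per-request cost shrinks with batch size.
    model_bytes = 70e9 * 2            # hypothetical 70B model at bf16
    bandwidth = 1.8e12                # hypothetical 1.8 TB/s weight streaming
    weight_stream_s = model_bytes / bandwidth

    for batch in (1, 8, 64):
        per_request_ms = weight_stream_s / batch * 1000
        print(f"batch={batch:3d}  ~{per_request_ms:.1f} ms of weight streaming per token, per request")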

However people do want longer memory than the native context length.

One approach is continual learning (basically continue training by using the past conversation as extra corpus material; interspersed with training on continuations from the frozen model, so it doesn't drift or catastrophically forget knowledge / politeness / ...).
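A minimal sketch of what that interleaving could look like (plain PyTorch with a toy stand-in model; the mixing weight and the "stay close to the frozen model on replay data" regularizer are my assumptions, not a specific published recipe):

    import copy
    import torch
    import torch.nn.functional as F

    # Toy stand-in for the LLM; a real setup would fine-tune the model (or a LoRA).
    model = torch.nn.Linear(64, 64)
    frozen = copy.deepcopy(model).eval()      # snapshot taken before continual learning
    for p in frozen.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

    def continual_step(conversation_batch, replay_batch):
        # 1) learn from the new conversation (MSE stands in for the next-token loss)
        loss = F.mse_loss(model(conversation_batch), conversation_batch)
        # 2) stay close to the frozen model's outputs on replay data, so general
        #    knowledge and style don't drift or get catastrophically forgotten
        loss = loss + 0.5 * F.mse_loss(model(replay_batch), frozen(replay_batch))
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

    for _ in range(10):
        continual_step(torch.randn(8, 64), torch.randn(8, 64))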

However, this is very expensive for inference providers, since they would have to multiply model weight storage by the number of users. For a single user the memory cost of continual learning is much lower, since they only need to keep one set of weights; some of that cost is recouped by eliminating KV caches, and the answers are higher quality than those from subquadratic approximations of quadratic attention.

An advantage of continual learning is that the conversation / code base / context is continuously rebaked into the model weights, and so doesn't need KV caches! It doesn't need imperfect approximations to quadratic attention; it attends through continuously updated working knowledge.

Nothing prevents local LLM users from implementing this, dropping the KV-cache requirement, and enjoying true quadratic attention implicitly over the whole codebase, or indeed over many overlapping projects.

The only remaining moat of inference providers vis-a-vis continually learning local LLMs is the batching advantage, plus the gradient-update cost of continual learning, minus the KV storage and compute costs, minus the performance loss due to inexact approximations to quadratic attention.

This points towards a stronger incentive for local hosting than is currently realized. None of the popular local LLM tools currently support continual learning; once this genie is out of the bottle it will permanently shrink the inference providers' moat, by an amount that can't be expressed merely in hardware or energy costs, since it is difficult to quantify the financial loss from inexact approximations to quadratic attention, from limited effective context length, and from the concomitant loss in result quality.

DrScientist 12 hours ago|
Anybody know of good real world examples for continual learning?

Does it really work?

DoctorOetker 1 hour ago||
In this case I think you'd want to use Source-Aware Training [0] to associate a "timestamp" vector with each (perhaps overlapping) native-context chunk of conversation, probably encoding the timestamps as a kind of Gray code, so that the immediate out-of-native-context history can be retrieved via the nearby Gray codes of 1, 2, etc. steps before the current timestep's code.

[0] https://arxiv.org/abs/2404.01019
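To make the Gray-code idea concrete, here is a tiny sketch; combining it with the paper's source-aware training is my speculation, not something from [0]:

    def gray(n: int) -> int:
        """Binary-reflected Gray code: consecutive integers differ in exactly one bit."""
        return n ^ (n >> 1)

    # Consecutive conversation chunks get near-identical timestamp codes, so the
    # chunks from 1, 2, ... steps ago sit at a small Hamming distance from "now".
    for t in range(6):
        print(t, format(gray(t), "04b"))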

j45 19 hours ago||
It’s easier to say 32 GB of RAM needs to be the norm to start getting movement on this
jmyeet 23 hours ago||
I've been looking into options for this and we are getting close. There are two main constraints: memory and memory bandwidth.

NVidia segments the market by limiting the amount of memory on GPUs. It currently tops out at 32GB (on a 5090) but it has excellent memory bandwidth (~1.8TB/s). If you want more than that, you need to buy an RTX Pro card (e.g. the RTX 6000 Pro w/ 96GB for ~$10K) or get into high-end solutions like the H100, H200, etc., which have significantly more memory and even higher bandwidth on HBM (e.g. 3.2TB/s+).

NVidia has released the DGX Spark w/ 128GB of memory for ~$4k. The problem is the memory bandwidth. It's only 273GB/s, which is less than the M5 Pro (307GB/s) but more than the M5. You can buy a 16" Macbook Pro with an M5 Max and 128GB of memory for $6k and it has a bandwidth of 614GB/s. So the DGX Spark is a joke, really.
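To see why bandwidth dominates: for single-stream decoding, each generated token has to stream the active weights once, so tokens/s is roughly bandwidth divided by model bytes. A quick sketch using the bandwidth numbers above (the 70B-model-at-4-bit size is an assumption, not a benchmark):

    # Upper bound on single-stream decode speed if memory bandwidth were the
    # only limit: each generated token streams the active weights once.
    model_bytes = 70e9 * 0.5   # hypothetical 70B dense model at ~4-bit quantization
    for name, bw in [("DGX Spark", 273e9), ("M5 Max MBP", 614e9), ("M3 Ultra", 819e9)]:
        print(f"{name:12s} ~{bw / model_bytes:5.1f} tokens/s ceiling (ignores compute, KV cache, MoE sparsity)")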

In case it wasn't clear, Apple is interesting in this space because it has a shared memory architecture so the GPU can use all the memory.

Many, myself included, expect there to be no refresh of the 5000-series consumer GPUs this year, which would otherwise happen based on product cycles. So no 5080 Super, for example. And I wouldn't expect a 6090 before 2028 realistically.

One thing Apple hasn't done yet is release the M5 Mac Studios, which are widely expected in Q3 this year. They are interesting because, for example, the M3 Ultra has a memory bandwidth of 819GB/s and previously had a 512GB max spec, but that configuration got discontinued (as did the 256GB version more recently).

So many expect an M5 Max Mac Studio with 1TB/s+ bandwidth and specs up to 256GB or 512GB, probably for ~$10k later this year.

You really have to use this hardware almost 24x7 for it to be economical, because otherwise H100 compute hours are probably cheaper.

But what happens to the trillions in AI data-center investment when the next generation of GPUs comes out? It's going to halve in value. That's over $1 trillion in capex that will effectively disappear overnight.

I think Apple is the dark horse here because they have no interest in NVidia's pseudo-monopoly. I'm just waiting for them to realize it.

CUDA is still an issue here, but I think as time goes on it's going to matter less. Memory is still a huge constraint, both in price and in general supply, because NVidia can probably justify paying way more for it than you can.

It's still sad to see that 128GB (2x64GB) DDR5 kits are almost $2k now when they were $400 a year ago. Expect that to continue until this bubble pops (which IMHO it will), at which point we're likely in a global recession.

So the other issue is models. OpenAI and Anthropic are built on proprietary models. Their entire valuation depends on this moat. I don't think this will last, so both companies are doomed, because open-source models are going to be sufficiently good.

We can already do some reasonably cool stuff on local hardware that isn't that expensive and even more so once you get to $5-10k hardware. That's going to be so much better in 2 years that I'm hesitant to spend any amount of money now.

Plus the code for running these things is getting better. Just in the last month there have been huge speedups in local LLMs with MTP (multi-token prediction).

zozbot234 23 hours ago||
> So the DGX Spark is a joke, really.

Not at all sure about that. They have really good compute, and DeepSeek V4 (with antirez's 2-bit expert layer quant) may be able to leverage that compute via parallel inference - the jury is still out on that. Now if you had said Strix Halo/Strix Point, or perhaps the close Intel equivalents, that would've been a slightly stronger case.

regexorcist 23 hours ago|||
> So many expect an M5 Max Mac Studio with 1TB/s+ bandwidth and specs up to 256GB or 512GB, probably for ~$10k later this year.

This is what I'm really waiting for. It will enable models comparable to current SOTA at the enthusiast price range.

QuadrupleA 17 hours ago||
This is just emotional rhetoric. Pretty much any app in the last 20 years has depended on a server somewhere, or a cloud provider. Like an AI provider, they can go down, they can turn off if you don't pay your bill, etc.

And local inference requires fairly beefy hardware that is FAR from ubiquitous across today's user bases. Local models are also still far dumber than what frontier labs can serve.

Weird that this is getting such a tidal wave of upvotes.

krupan 1 day ago||
If you don't need a lot of smarts, do you even need an LLM? Aren't older machine learning techniques just as good, or like, you know, old-school algorithms?
holoduke 1 day ago||
We need computers with 128GB or maybe even 192GB of memory before local use makes sense. From my own experience, 32B LLMs are the absolute minimum for proper tool use and decent output quality. But for local AI you also want vision models and maybe even several LLMs, plus some memory for the system, of course. On my 36GB M3 the 24B Gemma model is nice, but nearly the entire system's memory gets allocated to that one thing.
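Rough memory math for why a 36GB machine fills up fast (the quantization levels and the ~20% overhead factor for KV cache, activations, and runtime are my assumptions):

    def resident_gb(params_billion: float, bits: float, overhead: float = 1.2) -> float:
        """Weights at the given quantization plus ~20% overhead for KV cache,
        activations, and runtime (the 20% is a rough assumption)."""
        return params_billion * 1e9 * bits / 8 / 1e9 * overhead

    for params in (24, 32):
        for bits in (16, 8, 4):
            print(f"{params}B @ {bits:>2}-bit ~ {resident_gb(params, bits):.0f} GB")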
artursapek 1 day ago||
I'm someone who is trying to build a subscription-based business to cover underlying LLM costs, and I'm very hopeful I can one day just sell a permanent license to the software instead, with customers using local LLMs to power it.
sgt 1 day ago||
I guess Google got that memo!
cubefox 1 day ago||
Local AI is a bit like wind farms. Everyone is in favor, except when they're in your own backyard. There was recently a huge outcry when Chrome shipped a local 4 GB AI model: https://news.ycombinator.com/item?id=48019219

I have to conclude that people would like to have powerful local AI but it should at the same time only be a tiny model. In which case it wouldn't be powerful.
