As one commenter mentioned, 2x Mac Studio M3 Ultra with 512GB can run frontier models, and it costs ~$30k (with RDMA). Apply an efficiency ratio for running in a datacenter, and you understand why OpenAI and the like spend north of $10k of CAPEX _per customer_.
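For what it's worth, the arithmetic behind that claim only needs two assumed numbers. A back-of-envelope sketch, where every figure is a guess for illustration, not a measurement:

```python
# Back-of-envelope for the CAPEX claim above. All numbers are assumptions.
local_rig_cost = 30_000       # 2x Mac Studio with 512GB, per the comment
datacenter_efficiency = 3.0   # guess: datacenter GPUs deliver ~3x more
                              # tokens per dollar than consumer hardware
users_per_rig_equivalent = 1  # guess: a heavy user keeps one rig busy alone

capex_per_customer = local_rig_cost / (datacenter_efficiency * users_per_rig_equivalent)
print(f"~${capex_per_customer:,.0f} CAPEX per heavy customer")  # ~$10,000
```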
Add to that the electricity costs and you've got a very shaky business model. I, for one, would like to thank the VCs for subsidizing my tokens.
With that said, the VCs are not crazy and have probably factored in an annual decrease in the cost of compute. But how do you make sure users won't just run local LLMs when the hardware becomes affordable -- if it ever does?
The answer has always been the same in our industry: vendor lock-in. They are acquiring users now at a loss, hoping for captive revenue later.
So be careful when maintaining your code requires the full context that yielded that code, and that context lives in [Claude Code|Codex|Cursor].
Why should connecting small models to big models result in higher output quality than just running the big models without the small models?
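One pattern that question might be pointing at: the small model doesn't make the big model's answers better; it decides which prompts ever reach the big model, so the win is cost rather than quality. A minimal sketch, with both model calls as hypothetical stubs, not a real API:

```python
# Cascade pattern: try the cheap model first, escalate when it looks unsure.

def small_model(prompt: str) -> tuple[str, float]:
    """Cheap local model; returns (answer, self-reported confidence)."""
    return "draft answer", 0.4   # stub for illustration

def big_model(prompt: str) -> str:
    """Expensive frontier model behind an API."""
    return "better answer"       # stub for illustration

def cascade(prompt: str, threshold: float = 0.8) -> str:
    answer, confidence = small_model(prompt)
    if confidence >= threshold:
        return answer            # most traffic stays cheap and local
    return big_model(prompt)     # only hard prompts pay frontier prices

print(cascade("Write me a Bezier curve function"))
```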
Assuming we end up in a future where people pay to run multiple smaller models on their machines for specific tasks (e.g. a summariser model, a Python coding model, or however fine-grained/macro you want to go), the people training those models will need to turn a profit.
So how much will that cost? And how often will consumers have to pay? Models have a very short shelf life. Say you have a dedicated Python coding model -- it needs re-training every time there's a significant update to the language itself, to any popular packages, or to related technologies (e.g. servers, cloud infra, etc.). So how often will users need to "upgrade" to the latest version? It's going to be "frequently".
And it still needs the language stuff on top of that. Users aren't going to interact with a Python coding model by writing Python; they're going to use natural language, so the model needs all of that too. And they're going to give it problems to solve. What if you asked the model "Write me a Bezier curve function"? It needs to know about Bezier curves, which have nothing to do with Python. So where do these LLM providers draw the line on what makes it into the training data and what doesn't?
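To make that concrete, here's roughly the answer such a prompt expects. Note that almost all of it is math (de Casteljau's algorithm), not Python-specific knowledge:

```python
def bezier(points: list[tuple[float, float]], t: float) -> tuple[float, float]:
    """Evaluate a Bezier curve at t in [0, 1] via de Casteljau's algorithm."""
    pts = list(points)
    while len(pts) > 1:
        # Repeatedly interpolate between adjacent control points.
        pts = [
            ((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
            for (x0, y0), (x1, y1) in zip(pts, pts[1:])
        ]
    return pts[0]

# Quadratic curve from (0,0) to (2,0), pulled toward (1,2):
print(bezier([(0.0, 0.0), (1.0, 2.0), (2.0, 0.0)], 0.5))  # (1.0, 1.0)
```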
And if an LLM doesn't know what a Bezier curve is, that's not going to stop it from just hallucinating an answer. If a significant proportion of prompts resulted in a response that said "Sorry, I don't know what you're talking about", people would just stop using it. The utility of these things would be quickly overshadowed by the frustrations.
The way these frontier models have been introduced and promoted has set unrealistic expectations, and there's no putting the genie back in the bottle.
Commoditizing complements. If Anthropic/OpenAI/etc. is eating your lunch, make your product work with cheap local LLMs: you can beat them on price by having local inference you don't pay for (and don't need data centers for), and try to keep your (user/data) moat.
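For what that looks like in practice: most local runners (Ollama, llama.cpp's server, LM Studio) expose an OpenAI-compatible endpoint, so moving off the frontier API can be a one-line base_url change. A sketch, assuming a local Ollama on its default port; the model name is just an example:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="unused",                      # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="qwen2.5-coder",                 # any locally pulled model
    messages=[{"role": "user", "content": "Summarise this diff for a changelog."}],
)
print(response.choices[0].message.content)
```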
The more Anth/OAI disrupt, the more likely this is to happen. If they don't disrupt enough (i.e. grow as an ecosystem to defend against the incentives to commoditize them), then yes, those incentives are removed, but they also leave money on the table, which they need.
Not only at the business level, but also the geopolitical one (to a lesser extent? or not, since lots of open-weight models come from China?).
Great observation! Often the excitement of novelty makes us lose sight of the real goal.
The problem is that it's much easier to use the SOTA models (especially while they are subsidized) than to spend time tweaking the knobs on a local one.
I just realized this with coding agents: yeah, you probably shouldn't always use the latest model at xhigh, but you end up doing it because you get the job done in less time, with less "effort", and at basically the same price.
I guess we'll only see a real effort toward local AI when the major vendors start billing based on actual token usage.
That's not a problem, that's a feature; I have something like 8 tabs open to different free-tier providers. ChatGPT, Claude and Gemini are the SOTA ones.
I have no problem maxing one out, then moving to the next. I can do this all day, having them implement specific functions (or classes) in my code. The thing is, because I actually know how to write and design software, I don't need to run an agent in a loop to produce everything in a day. I can use the web chatbots with copy/paste to generate literally thousands of lines of code per hour while still keeping a strong mental model of the code, so I can go in and change whatever I need to.[1]
---------------------
[1] Just did that this morning on a Python project: because I designed what I needed, each generation was me prompting for a single function. So when I needed to add something, I didn't even bother asking a chatbot to do it; I just went directly to the correct place and did it.
You can't do that if you generate the entire thing from specs.
The feature of using all these SOTAs to exhaustion on the free tiers is burning their VC money!
The more I use for free, the more of their money I burn, the closer we'll get to actual 3rd-party and independent setups (local or otherwise).
I have a sneaking suspicion this is kinda like the situation with Linux in the 90s, where it kinda worked but it reeeeeally wasn't ready for the home user, but you had a lot of people who would insist to your face everything was fine, mostly for ideological reasons.
I'm currently running both Sonnet 4.6 and Qwen 3.6-27b on the same codebase (via OpenCode, with the parameters carefully tuned for a good quality/context-size ratio), and on this project they both struggle with complex, non-trivial tasks and both work flawlessly otherwise. Sonnet 4.6 understands intent better when my task is ambiguously formulated, but otherwise the gap is pretty small for coding under a harness.
Different usage patterns - you want to issue a single spec, then walk away and come back later (when it has consumed $10k worth of API tokens inside your $200/m subscription) to a finished product.
Many people issue a spec for a single function, a single class, or similar. When you break it down like that, the advantage of SOTA models shrinks.
What do you mean "trust it"? It sounds like you want to vibe-code (never look at the output), and maybe for that you need SOTA, but like I said in a different comment, I can easily generate 1000s of lines of code per hour just prompting the chatbots.
I don't, because I actually review everything, but I can, and some of those chatbots are actually SOTA anyway.
With subpar models I have to be more careful about providing instructions and check things step by step, because the path they choose is wrong, or isn't what I asked for, or the agent gets stuck in a loop somewhere.
I’ve begun to suspect that most people are running very different hardware. Sure, you run the latest deep flash on your brand-new M5 with 128GB, and maybe you get acceptable performance?
But honestly, how many people have an extra $9000 laying around these days?
Right now, running locally with acceptable performance is kind of a luxury. I wish the people who always say “This is great!” would realize that not everyone has their hardware.
A smaller, cheaper local model can deliver most of the value for coding, while we still use some services for code review and security compliance.
Once the VC money runs out and they start charging the real price, the C-level will have to impose budgets or limits. The current pissing contest over who can expend the most tokens is both ridiculous and shortsighted.