As one commenter mentioned, 2x Mac Studio M3 Ultra with 512GB can run frontier models, and it costs ~$30k (with RDMA). Apply an efficiency ratio for running in a datacenter, and you understand why OpenAI and the like spend north of $10k of CAPEX _per customer_.
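For what it's worth, the arithmetic behind that claim only needs two assumed numbers. A back-of-envelope sketch, where every figure is a guess for illustration, not a measurement:

```python
# Back-of-envelope for the CAPEX claim above. All numbers are assumptions.
local_rig_cost = 30_000       # 2x Mac Studio with 512GB, per the comment
datacenter_efficiency = 3.0   # guess: datacenter GPUs deliver ~3x more
                              # tokens per dollar than consumer hardware
users_per_rig_equivalent = 1  # guess: a heavy user keeps one rig busy alone

capex_per_customer = local_rig_cost / (datacenter_efficiency * users_per_rig_equivalent)
print(f"~${capex_per_customer:,.0f} CAPEX per heavy customer")  # ~$10,000
```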
Add to that the electricity costs and you've got a very shaky business model. I, for one, would like to thank the VCs for subsidizing my tokens.
With that said, the VCs are not crazy and have probably factored in an annual decrease in the cost of compute. But how do you make sure users won't just run local LLMs when the hardware becomes affordable -- if it ever does?
The answer has always been the same in our industry: vendor lock-in. They are acquiring users now at a loss, hoping for captive revenue later.
So be careful when maintaining your code requires the full context that yielded that code, and that context lives in [Claude Code|Codex|Cursor].
Why should connecting small models to big models result in higher output quality than just running the big models without the small models?
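One pattern that question might be pointing at: the small model doesn't make the big model's answers better; it decides which prompts ever reach the big model, so the win is cost rather than quality. A minimal sketch, with both model calls as hypothetical stubs, not a real API:

```python
# Cascade pattern: try the cheap model first, escalate when it looks unsure.

def small_model(prompt: str) -> tuple[str, float]:
    """Cheap local model; returns (answer, self-reported confidence)."""
    return "draft answer", 0.4   # stub for illustration

def big_model(prompt: str) -> str:
    """Expensive frontier model behind an API."""
    return "better answer"       # stub for illustration

def cascade(prompt: str, threshold: float = 0.8) -> str:
    answer, confidence = small_model(prompt)
    if confidence >= threshold:
        return answer            # most traffic stays cheap and local
    return big_model(prompt)     # only hard prompts pay frontier prices

print(cascade("Write me a Bezier curve function"))
```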
Assuming we end up in a future where people pay to run multiple smaller models on their machines for specific tasks (e.g. a summariser model, a Python coding model, or however fine-grained/macro you want to go), the people training those models will need to turn a profit.
So how much will that cost? And how often will consumers have to pay? Models have a very short shelf life. Say you have a dedicated Python coding model -- it needs re-training every time there's a significant update to the language itself, to any popular packages, or to related technologies (e.g. servers, cloud infra, etc.). So how often will users need to "upgrade" to the latest version? It's going to be "frequently".
And it still needs the language stuff on top of that. Users aren't going to interact with a Python coding model by writing Python; they're going to use natural language, so the model needs all of that too. And they're going to give it problems to solve. What if you asked the model "Write me a Bezier curve function"? It needs to know about Bezier curves, which have nothing to do with Python. So where do these LLM providers draw the line on what makes it into the training data and what doesn't?
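To make that concrete, here's roughly the answer such a prompt expects. Note that almost all of it is math (de Casteljau's algorithm), not Python-specific knowledge:

```python
def bezier(points: list[tuple[float, float]], t: float) -> tuple[float, float]:
    """Evaluate a Bezier curve at t in [0, 1] via de Casteljau's algorithm."""
    pts = list(points)
    while len(pts) > 1:
        # Repeatedly interpolate between adjacent control points.
        pts = [
            ((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
            for (x0, y0), (x1, y1) in zip(pts, pts[1:])
        ]
    return pts[0]

# Quadratic curve from (0,0) to (2,0), pulled toward (1,2):
print(bezier([(0.0, 0.0), (1.0, 2.0), (2.0, 0.0)], 0.5))  # (1.0, 1.0)
```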
And if an LLM doesn't know what a Bezier curve is, that's not going to stop it from just hallucinating an answer. If a significant proportion of prompts resulted in a response that said "Sorry, I don't know what you're talking about", people would just stop using it. The utility of these things would be quickly overshadowed by the frustrations.
The way these frontier models have been introduced and promoted has set unrealistic expectations, and there's no putting the genie back in the bottle.
Commoditizing complements. If Anthropic/OpenAI/etc. is eating your lunch, make your product work with cheap local LLMs: you can beat them on price by having local inference you don't pay for (and don't need data centers for), and try to keep your (user/data) moat.
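For what that looks like in practice: most local runners (Ollama, llama.cpp's server, LM Studio) expose an OpenAI-compatible endpoint, so moving off the frontier API can be a one-line base_url change. A sketch, assuming a local Ollama on its default port; the model name is just an example:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="unused",                      # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="qwen2.5-coder",                 # any locally pulled model
    messages=[{"role": "user", "content": "Summarise this diff for a changelog."}],
)
print(response.choices[0].message.content)
```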
The more Anth/OAI disrupt, the more likely this is to happen. If they don't disrupt enough (i.e. grow as an ecosystem to defend against the incentives to commoditize them), then yes, those incentives are removed, but they also leave money on the table, which they need.
Not only at the business level, but also the geopolitical one (to a lesser extent? or not, since lots of open-weight models come from China?).
Great observation! Often the excitement of novelty makes us lose sight of the real goal.
The problem is that it's much easier to use the SOTA models (especially while they are subsidized) than to spend time tweaking the knobs on a local one.
I just realized this with coding agents: yeah, you probably shouldn't always use the latest model at xhigh, but you end up doing it because you get the job done in less time, with less "effort", and at basically the same price.
I guess we'll only see a real effort toward local AI when the major vendors start billing based on actual token usage.
That's not a problem, that's a feature; I have something like 8 tabs open to different free-tier providers. ChatGPT, Claude and Gemini are the SOTA ones.
I have no problem maxing one out, then moving to the next. I can do this all day, having them implement specific functions (or classes) in my code. The thing is, because I actually know how to write and design software, I don't need to run an agent in a loop to produce everything in a day. I can use the web chatbots with copy/paste to generate literally thousands of lines of code per hour while still keeping a strong mental model of the code, so I can go in and change whatever I need to.[1]
---------------------
[1] Just did that this morning on a Python project: because I designed what I needed, each generation was me prompting for a single function. So when I needed to add something, I didn't even bother asking a chatbot to do it; I just went directly to the correct place and did it.
You can't do that if you generate the entire thing from specs.
The feature of using all these SOTAs to exhaustion on the free tiers is burning their VC money!
The more I use for free, the more of their money I burn, the closer we'll get to actual 3rd-party and independent setups (local or otherwise).
I have a sneaking suspicion this is kinda like the situation with Linux in the 90s, where it kinda worked but it reeeeeally wasn't ready for the home user, but you had a lot of people who would insist to your face everything was fine, mostly for ideological reasons.
I'm currently running both Sonnet 4.6 and Qwen 3.6-27b on the same codebase (via OpenCode, with the parameters carefully tuned for a good quality/context-size ratio), and on this project they both struggle with complex, non-trivial tasks and both work flawlessly otherwise. Sonnet 4.6 understands intent better when my task is ambiguously formulated, but otherwise the gap is pretty small for coding under a harness.
Different usage patterns - you want to issue a single spec, then walk away and come back later (when it has consumed $10k worth of API tokens inside your $200/m subscription) to a finished product.
Many people issue a spec for a single function, a single class, or similar. When you break it down like that, the advantage of SOTA models shrinks.
What do you mean "trust it"? It sounds like you want to vibe-code (never look at the output), and maybe for that you need SOTA, but like I said in a different comment, I can easily generate 1000s of lines of code per hour just prompting the chatbots.
I don't, because I actually review everything, but I can, and some of those chatbots are actually SOTA anyway.
With subpar models I have to be more careful about providing instructions and check things step by step, because the path they choose is wrong, or isn't what I asked for, or the agent gets stuck in a loop somewhere.
I’ve begun to suspect that most people are running very different hardware. Sure, you run the latest deep flash on your brand-new M5 with 128GB, and maybe you get acceptable performance?
But honestly, how many people have an extra $9000 laying around these days?
Right now, running locally with acceptable performance is kind of a luxury. I wish the people who always say “This is great!” would realize that not everyone has their hardware.
A smaller, cheaper local model can deliver most of the value for coding, while we still use some services for code review and security compliance.
Once the VC money runs out and they start charging the real price, the C-level will have to impose budgets or limits. The current pissing contest over who can expend the most tokens is both ridiculous and shortsighted.