Posted by virgildotcodes 20 hours ago
I also wouldn’t be surprised if memory providers were intimately involved, as they’ve been caught price fixing in the past: https://en.wikipedia.org/wiki/DRAM_price_fixing_scandal
Alleviating the memory constraint would only really make Nvidia a danger to cloud margins, and their consumer sales are neutered while they focus on the datacenter segment. It feels facetious to insinuate that people would be doing inference on their Macbook Neo or Wintel laptop if they only had a gorbillion gigabytes of memory and a 400W accelerator card plugged into the wall outlet.
There is a pretty large and growing community of us using entirely local models for our agentic flows: from GLM 4.7 Flash on 32GB machines with >60tok/s, to Gemma and Qwen dense and MoE models on 64GB machines, all the way up to Deepseek V4 Flash on 128GB machines with 450tok/s prefill and 25-30tok/s decode.
I use DS4 on the daily - it’s become my main model.
I know it’s in fashion to talk trash about Apple, but their hardware outperforms other options like DGX Spark when it comes to local inference. They have the unified memory, memory bandwidth, and GPU cores to actually be useful in a way that most other hardware just isn’t.
I also use it in local agent mode if I'm coding directly on the machine, which is nice because you can save sessions and resume them, so for personal projects and training-related stuff it's been great.
Even got an autoresearch loop going where the agent looks at the previous run, adjusts parameters and code if needed, and then hands off training to another script (so full system resources are available for training), ad infinitum - it works really well - what antirez has done with that project is pretty incredible.
GLM 4.7 Flash is a 30B model that was far behind SOTA at launch, and I know that because I pay for z.ai inference and have run the model locally. Qwen and Deepseek V4 Flash have the same issue, which raises the question: are you really going to process a 64k agentic context at 450tok/s? That's 2+ minutes spent waiting for the first token to generate! Of course nobody can sell that as competitive inference, and it only gets worse with larger models. We're talking about non-interactive speeds here.
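For anyone who wants to check the "2+ minutes" figure, here is a rough back-of-the-envelope sketch. It assumes "64k" means 64,000 tokens, that the whole context is prefilled from scratch with no prompt caching, and that throughput is constant; the function names and the 1,000-token reply length are made up for illustration.

```python
def prefill_seconds(context_tokens: int, prefill_tok_per_s: float) -> float:
    """Time spent processing the prompt before the first output token appears."""
    return context_tokens / prefill_tok_per_s


def decode_seconds(output_tokens: int, decode_tok_per_s: float) -> float:
    """Time spent generating the reply once prefill has finished."""
    return output_tokens / decode_tok_per_s


# Figures quoted in the thread: 64k context, 450 tok/s prefill.
ttft = prefill_seconds(64_000, 450)
print(f"time to first token: {ttft:.0f} s (~{ttft / 60:.1f} min)")

# Hypothetical 1,000-token reply at the quoted 25-30 tok/s decode range.
slow = decode_seconds(1_000, 25)
fast = decode_seconds(1_000, 30)
print(f"decode for 1,000 tokens: {fast:.0f}-{slow:.0f} s")
```

Under those assumptions the wait before the first token is about 142 seconds, which is where the "2+ minutes" comes from. Any cache reuse between agent turns would shrink the prefill term, so treat this as a worst case for a cold context.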
If you're satisfied with small local models, more power to you. It puts you in the same boat as Strix Halo enthusiasts or the guys that bought 2x3090s on Reddit. You are completely ignoring the market if you think that any of those SoCs are unprecedented or unparalleled for inference workloads, though. The free DS4 API is faster at prefill and decode; you could not give away Mac inference at zero cost and compete with what China provides for free. That's how far behind Macs are for local inference, to put things into perspective.
The datacenter builders and the big hosted AI models. The person you're replying to even mentions OpenAI by name.
There are two things that would prevent people from using local models: pricing and regulations. And we're seeing moves on both of those fronts lately.
Hey, Infantino was ahead of the curve! For the same price as an English MBP, you can get an American one and see the Three Lions disappoint against Panama!
I suspect that these price increases will stick around permanently (or at least for a long while).
How is that calculated?
$500!! I mean, that's not crazy surprising given the price increases in the components I'm trying to buy (RAM and hard drives, maybe an SSD), but damn. The M6 is probably the next laptop I'll get. I can only hope that component prices have calmed down by the time it's released, but I'm not holding my breath.