Jamesob's guide to running SOTA LLMs locally

Posted by livestyle 5 hours ago

Jamesob's guide to running SOTA LLMs locally (github.com)

176 points | 82 comments | page 2

rishabhaiover 39 minutes ago

This is a great guide. However, the economics just do not work in my favor at all. Even if I were to spend $2k, I get much more flexibility of model intelligence and choice from a provider for $20/month.
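As a rough sketch of that arithmetic, using only the $2k and $20/month figures above: the helper name and the optional power-cost parameter are illustrative assumptions, and resale value, electricity, and model quality differences are ignored.

```python
def breakeven_months(hardware_cost: float, monthly_subscription: float,
                     monthly_power_cost: float = 0.0) -> float:
    """Months until the hardware spend equals the cumulative subscription spend."""
    net_monthly_saving = monthly_subscription - monthly_power_cost
    if net_monthly_saving <= 0:
        return float("inf")  # running costs eat the saving; it never pays off
    return hardware_cost / net_monthly_saving


months = breakeven_months(2000, 20)
print(f"{months:.0f} months (~{months / 12:.1f} years)")
```

With those two numbers the hardware takes 100 months, a bit over eight years, to match the subscription spend.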

chompychop 2 hours ago

Is Whisper still considered SOTA for STT? Since it came out years ago, I'd have assumed there are better models by now.

randomblock1 2 hours ago

No, there are quite a few models that are smaller, faster, and more accurate. For example, Parakeet TDT v3 is half the size, much faster, and has a lower WER. There's also Voxtral, which is much larger but even more accurate.

But the ecosystem isn't as mature, so Whisper is still a valid option, even now. For example, Parakeet uses the NeMo framework (made by Nvidia), which normally requires CUDA, so on AMD you need to use an ONNX version instead. Meanwhile, Whisper has vLLM support and desktop apps like Buzz.

There aren't many benchmarks, and they often don't cover all the models, since STT doesn't get nearly as much attention as normal LLMs, but this is one of the more complete ones: https://artificialanalysis.ai/speech-to-text/non-streaming
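For anyone unfamiliar with the metric: WER (word error rate) is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. A minimal sketch, assuming plain lowercase whitespace tokenization (real benchmarks normalize text more carefully, and the function name here is just illustrative):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis takes i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in the output
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

In that example there is one substitution and one deletion against six reference words, so the WER is 2/6.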

venusenvy47 2 hours ago

I don't have anything to compare against, since I have just started using it. But I was fairly happy with it on my personal recordings from my phone. Also, I ran it on my CPU (Core i7) and it was perfectly usable, as something to run when not using the machine for anything else.

simonw 1 hour ago

I'm a big fan of Parakeet v3 - I run it using the MacWhisper app, it's a 494MB model and the quality is excellent.

beardsciences 4 hours ago

I am somewhere in the middle, where I want something with more than 48GB/$2k of VRAM, but less than 384GB/$40k.

I'm curious if GMKtec's EVO-X2, with ~96GB of usable VRAM, is still a good solution for something like this for $3,399.

sampullman 4 hours ago

I picked up the 128GB version when it was $2,199, and it runs Qwen 3.6 reasonably well with a 128K context. It's not very useful for complex tasks, but it can handle some web stuff.

mft_ 4 hours ago

It has lower memory bandwidth than most comparable Macs.
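Bandwidth matters because token generation is usually memory-bound: each new token requires reading roughly all of the active weights once, so bandwidth divided by the size of those weights gives a ceiling on decode speed. A back-of-the-envelope sketch of that rule of thumb, where the bandwidth and model-size figures are illustrative assumptions and not measurements of any particular machine:

```python
def max_decode_tokens_per_sec(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Upper bound on decode speed if every token reads all active weights once."""
    if active_weights_gb <= 0:
        raise ValueError("active_weights_gb must be positive")
    return bandwidth_gb_s / active_weights_gb


# Assumed values: a 16 GB quantized dense model on two hypothetical machines.
for bandwidth in (256.0, 546.0):
    ceiling = max_decode_tokens_per_sec(bandwidth, 16.0)
    print(f"{bandwidth:.0f} GB/s -> at most {ceiling:.0f} tokens/s")
```

Real throughput lands below this ceiling, and MoE models only read their active experts per token, which is why they decode faster than dense models of the same total size.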

maxignol 25 minutes ago

I couldn't seem to find how many tokens per second he achieved with this setup.

SwellJoe 9 minutes ago

I recently wrote up how I run local LLMs, because several folks had asked (https://swelljoe.com/post/how-i-run-local-llms/) and I think even my setup, which I spent maybe $4200 on, half on a Strix Halo and half on upgrades for my desktop, would be too expensive to justify today. I bought before prices went through the roof, and only did so because I like to tinker with hardware...not because I expected it to ever pay for itself vs. buying subsidized tokens from the big guys or the cheap tokens from efficient providers like DeepSeek.

Buying four $13,000 GPUs and several thousand dollars' worth of supporting hardware seems crazy. This supply shortage has to end eventually. In the meantime, for the difference in price once it does, I can buy billions of DeepSeek, MiMo, and GLM tokens and pay for $100 or $200 a month subscriptions from the big guys. And you can't even run the full-sized GLM on that hardware: it is quantized, and so is your KV cache. The degradation is small, but not non-existent. You're not running a model that's equal to what you get when you buy GLM tokens from Z.ai.

My recommendation for self-hosting is this: If you already have a 24GB or 32GB GPU, or two, or a recent Mac with 32GB or more, run the appropriate quantization of Qwen 3.6 27B or Gemma 4 31B. If your hardware is older and too slow for that, use the MoE, but know it'll be dumber. Use the tiny model for the stuff that doesn't need deep smarts: Research (give it a Brave or Exa MCP for web search), summarization, simple Python scripts for basic tasks, simple websites or web apps, categorization of stuff (I used Gemma 4 to review my past writing for friendliness and helpfulness), etc. It can also be a sub-agent for bigger agents (for those same kinds of tasks). Gemma 4 12B is an incredibly good model for its size, particularly for vision tasks, and in the 4-bit quantization (7GB on disk) it runs on anything, even a modern tablet or phone.
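To pick "the appropriate quantization", a rough size estimate is parameters times bits per weight, divided by eight. A small sketch under stated assumptions: the 4.5 bits per weight figure is an assumed effective rate for a typical 4-bit quant (scales and a few higher-precision tensors add overhead), and the helper name is made up for illustration.

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes) for a quantized model."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Assumed effective rate of ~4.5 bits per weight for a "4-bit" quant.
for params in (12, 27, 31):
    size = quantized_size_gb(params, 4.5)
    print(f"{params}B parameters -> roughly {size:.1f} GB of weights")
```

At that assumed rate a 12B model comes out close to the 7GB mentioned above, and a 27B model fits in a 24GB GPU with some room left for the KV cache.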

And, if you don't already have a big GPU or unified memory Mac, just wait. Use the cheap tokens every AI company wants to sell you, for now. A Claude or Codex or Gemini subscription is a good deal. Tokens from DeepSeek are a good deal, especially with the Reasonix agent (which maximizes caching, which DeepSeek is uniquely good at, and cached tokens are uniquely cheap at DeepSeek). GLM is Good Enough and has a cheap coding plan. MiMo has the cheapest tokens for a 1T+ model in the game; DeepSeek and GLM are better models, but MiMo is fine.

When prices come down, I'll be speccing out a beast to run the big models, too. But, I'm not paying 4x for RAM and GPU and storage, and y'all shouldn't either. That's crazy. Computer prices go down over time. It is the law.

c4pt0r 1 hour ago

Local open weight models will definitely be a future trend. Imagine if an Opus-level model could run locally: many more latent use cases would likely emerge, since Opus is priced so high. Perhaps the future will be a multi-model architecture, where frontier models handle planning and local models carry out the concrete execution.

zackify 4 hours ago

You can get amazing local STT using Parakeet, which can use as little as 600MB of VRAM. It's as good as or better than Whisper large-v3.

subhobroto 3 hours ago

[flagged]

QuantumNoodle 28 minutes ago

$2k or $40k? One of those is not "self host."

wxw 4 hours ago

I agree that local LLMs are the likely future and worth investing in… but at $40k for possible-SOTA right now, this isn’t worth it for the average consumer.

I’m pretty bullish that Apple will deliver something very competitive for the average consumer in the next couple years.

maxxxml 59 minutes ago

What harness is the best for local LLMs? I've been researching how to optimize local LLM agent harness performance with context/tools. It's quite the endeavor, and I would love to learn what users prefer for this type of workflow.

npodbielski 52 minutes ago

I like vibe and pi. Vibe just looks nice and is good enough. But pi's extensibility is on another level. There is also Dirac, which is quite OK but seems full of bugs. Zerostack is the simplest harness I've seen. OpenCode is OK too. I haven't tried the rest.
