Running local models is good now

Posted by jfb 10 hours ago

Running local models is good now(vickiboykis.com)

931 points | 401 commentspage 6

jotato 8 hours ago|

I currently have a desktop with a 4060 ti (16gb of vram). Most models I have tested that fit within that are not good enough for anything other then type completion (in regards to coding tasks)

I have been considering getting the 58gb Mac Mini but that is a decent amount of money to spend without confirmation on a) how fast is it and b) will it work for well-defined tasks.

frollogaston 6 hours ago||

"Good" refers to the speed and not the quality. There's so much hype about Macs being great for LLMs, but nobody seems to be seriously using them for that because the open models are unfortunately so far behind.

throwarayes 8 hours ago||

I am happy to pay OpenAI for a cheaper model a few generations behind. But they deprecate models aggressively. They push you to bigger and smarter models, when 95% of my work doesn’t need it.

I’d love it if model providers just let old models run and let us pay less, but the deprecation makes me want to look into local models.

ta-run 6 hours ago||

Not related, but, I can't seem to get my copilot-cli (office is an MS shop) use qwen3.5:27b on ollama for some odd reason.

After the recent changes to usage, I've spent an annoyingly long number of hours trying to get this to work.

blobbers 6 hours ago||

Have you tried optimizing for MLX? It seems like a waste to have neural cores and not use them.

I've often wondered why the hype around apple neural core when 99% of software doesn't use them.

WASDx 7 hours ago||

Looking at some benchmarks, the latest ~30B Gemma/Qwen score similar as Claude or GPT versions that were released just one year earlier. That's crazy progress. I can't imagine how it will be in a few years.

k__ 7 hours ago||

I tried some smaller Gemma4 and Qwen3.6 quants on my MBA with M5/16GB and had like 20-60 tokens per second. At 60 it felt pretty okay and that hardware is on the lower end.

I'd assume a Mac with 32-64GB memory would get some reasonable results.

fridder 8 hours ago||

Is there a local harness designed around the local model use case that is claude code like? Opencode has been problematic at times, pi works for one off for me but not back and forth conversations with the LLM. Considering I only use Qwen or Gemma models I'm close to just writing my own at this point

nikagrawal121 5 hours ago||

I tried for my legal AI application that I'm building and it was able to do majority of the tasks. I used gemma4:26B

anax32 9 hours ago|

I've just made a milestone on my project, moving away from AWS (budget) to self-hosted and the local models are so much faster than in the past. Beyond LLMs, having embeddings, image, video, audio gen available is crazy.

Running locally is the bar; it's hard to make these things a service which scales.

More comments...