Posted by frontsideair 5 days ago
It's a balancing game: how slow a token generation speed can you tolerate? Would you rather get an answer quickly, or wait a few seconds (or sometimes minutes) for reasoning?
For quick answers, Gemma 3 12B is still good. GPT-OSS 20B is pretty quick when reasoning is set to low, in which case it usually doesn't think for longer than a sentence. I haven't gotten much use out of Qwen3 4B Thinking (2507), but at least it's fast while reasoning.
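For what it's worth, here's a minimal sketch of how you can hit one of these models from a script, assuming a local OpenAI-compatible server (llama.cpp's llama-server and Pico AI Server both expose one). The port, model id, and prompt are placeholders for whatever your setup reports; GPT-OSS takes its reasoning level from the system prompt:

    # Minimal sketch, assuming an OpenAI-compatible server on localhost:8080.
    # Base URL and model id are placeholders -- adjust for your own setup.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="gpt-oss-20b",  # whatever id your server lists
        messages=[
            # gpt-oss reads its reasoning level from the system prompt;
            # "low" usually keeps the thinking to a sentence or so.
            {"role": "system", "content": "Reasoning: low"},
            {"role": "user", "content": "Summarize TCP slow start in two sentences."},
        ],
    )
    print(resp.choices[0].message.content)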
Pico AI Server:
https://apps.apple.com/us/app/pico-ai-server-llm-vlm-mlx/id6...
Witsy:
https://github.com/nbonamy/witsy
...and you really want at least 48 GB of RAM to run >24B models.
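The back-of-the-envelope math supports that: weights alone for a 4-bit quant run about half a byte per parameter, before the KV cache and before macOS and your other apps take their cut of unified memory. A rough sketch (illustrative numbers, not a sizing tool):

    # Approximate weight memory for a quantized model: params * bits / 8.
    def weight_gb(params_b: float, bits: float) -> float:
        return params_b * 1e9 * bits / 8 / 1e9

    for params in (12, 24, 32, 70):
        print(f"{params}B @ 4-bit ~ {weight_gb(params, 4):5.1f} GB weights "
              f"(plus KV cache and OS overhead)")

A 32B model at 4-bit already wants ~16 GB for weights alone, and the GPU can only claim part of a Mac's unified RAM, so 48 GB total is about where >24B models stop being a squeeze.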
Reads like someone starting on their daily drinks, already using them for "company" and fun, and saying "I'm not an alcoholic, I can quit anytime."
Luckily llama.cpp has come a long way and is at a point where I can easily recommend it as the open-source option instead.
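And if you'd rather drive llama.cpp from Python than from the CLI, the llama-cpp-python bindings make it a few lines; a minimal sketch, with a placeholder GGUF path:

    # Minimal llama-cpp-python sketch; the model path is a placeholder.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/gemma-3-12b-it-Q4_K_M.gguf",  # placeholder
        n_ctx=8192,       # context window
        n_gpu_layers=-1,  # offload everything to the GPU (Metal on macOS)
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Explain mmap in one paragraph."}]
    )
    print(out["choices"][0]["message"]["content"])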
Also, let’s not forget they are first and foremost designers of hardware, and the arms race is only getting started.