- 4090: 27b-q4_k_m
- A100: 27b-q6_k
- 3*A100: 122b-a10b-q6_k_L
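
For a rough sense of why these pairings land where they do, here is a back-of-envelope sketch of weight size per quant. The bits-per-weight figures are my own assumptions (q4_k_m is a mixed quant, so about 4.85 is only an approximate average; q6_k works out to 6.5625 from its 210-byte blocks of 256 weights), and the `_L` variant is treated as plain q6_k, so it will run a bit larger in practice. KV cache and runtime overhead are not counted, and `weight_size_gb` is just a hypothetical helper name.

```python
# Approximate bits per weight for each quant type.
# These are assumptions for illustration, not exact for any given GGUF file.
BITS_PER_WEIGHT = {
    "q4_k_m": 4.85,    # mixed quant, approximate average
    "q6_k": 6.5625,    # 210 bytes per 256-weight block
}


def weight_size_gb(params_billion: float, quant: str) -> float:
    """Estimate weight storage in GB (10^9 bytes).

    Excludes KV cache, activations, and runtime overhead.
    """
    bits = BITS_PER_WEIGHT[quant]
    # (params_billion * 1e9 weights) * bits / 8 bytes, divided by 1e9 bytes per GB
    return params_billion * bits / 8


if __name__ == "__main__":
    configs = [
        ("27b-q4_k_m", 27, "q4_k_m"),
        ("27b-q6_k", 27, "q6_k"),
        # The _L variant is treated as plain q6_k here (an assumption).
        ("122b-a10b-q6_k_L", 122, "q6_k"),
    ]
    for label, params_b, quant in configs:
        print(f"{label}: ~{weight_size_gb(params_b, quant):.1f} GB of weights")
```

Note that for the MoE model the full 122B parameters have to be resident even though only about 10B are active per token, so the estimate uses the total count.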
Using the Qwen team's "thinking" presets, I found that non-agentic coding performance doesn't feel like a significant leap over unquantized GPT-OSS-120B. It shows some hallucination and repetition on MuJoCo code with the default presence penalty. 27b-q4_k_m on the 4090 generates 30~35 tok/s with good quality.

https://github.com/ollama/ollama/issues/14419
https://github.com/ollama/ollama/issues/14503
So for now I'm back to Qwen 3 30B A3B. Kind of a bummer, because the previous model is pretty fast but kinda dumb, even for simple tasks like on-prem code review!