Posted by ricardbejarano 10 hours ago
I've been working with quite a few open-weight models over the last year. Especially for things like images, models from 6 months ago would return garbage data quickly, but these days Qwen 3.5 is incredible, even the 9B model.
But yes, if there is a choice I want quality over speed. At the same quality, I definitely want speed.
It's using WebGPU as a proxy to estimate system resources. Chrome tends to use as many resources (compute + memory) as the OS makes available. Safari tends to be more efficient.
Maybe this was obvious to everyone else, but it's worth reiterating for those of us who skim HN :)
$ ./llama-cli unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -c 65536 -p "Hello"
[snip 73 lines]
[ Prompt: 86,6 t/s | Generation: 34,8 t/s ]
$ ./llama-cli unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -c 262144 -p "Hello"
[snip 128 lines]
[ Prompt: 78,3 t/s | Generation: 30,9 t/s ]
I suspect the ROCm build will be faster, but it doesn't work out of the box for me.
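To put the two runs above side by side: going from -c 65536 to -c 262144 is a 4x larger context window. Here is a rough Python sketch that parses the two summary lines (including the decimal commas) and computes the relative change. The helper names `parse_rates` and `pct_change` are made up for illustration and are not part of llama.cpp, and the regex only assumes the summary format shown in the output above.

```python
import re

# Matches the summary line format quoted above, e.g.
# "[ Prompt: 86,6 t/s | Generation: 34,8 t/s ]"
SUMMARY = re.compile(
    r"Prompt:\s*([\d.,]+)\s*t/s\s*\|\s*Generation:\s*([\d.,]+)\s*t/s"
)


def parse_rates(summary: str) -> tuple[float, float]:
    """Return (prompt_tps, generation_tps) from a summary line.

    Decimal commas (locale formatting) are converted to decimal points.
    """
    match = SUMMARY.search(summary)
    if match is None:
        raise ValueError(f"unrecognized summary line: {summary!r}")
    prompt, generation = (float(v.replace(",", ".")) for v in match.groups())
    return prompt, generation


def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    ctx_64k = parse_rates("[ Prompt: 86,6 t/s | Generation: 34,8 t/s ]")
    ctx_256k = parse_rates("[ Prompt: 78,3 t/s | Generation: 30,9 t/s ]")
    print(f"prompt:     {pct_change(ctx_64k[0], ctx_256k[0]):+.1f}%")  # -9.6%
    print(f"generation: {pct_change(ctx_64k[1], ctx_256k[1]):+.1f}%")  # -11.2%
```

So with the same "Hello" prompt, allocating 4x the context costs roughly 10% on prompt processing and 11% on generation in these two runs. This is a single measurement per setting, so treat it as a ballpark rather than a benchmark.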
Currently, Nemotron 3 Super using Unsloth's UD Q4_K_XL quant is running nearly everything I do locally (replacing Qwen3.5 122b).
In reality, gpt-oss-120b fits great on the machine with plenty of room to spare and easily runs inference north of 50 t/s depending on context.
The tool is very nice though.