Posted by dev-experiments 2 days ago
For larger sizes you still can, it just becomes slower and slower. For a simple classification task (small input, tiny output, and you can constrain output to a couple tokens) you could even run something like a 4B or 8B model on the CPU
GPU and VRAM (or fast unified RAM) is generally the option that is both available and performant, but especially really small models also run quite well on CPU and system RAM.
The advantage is mainly in memory bandwidth. External GPUs' internal memory is slightly faster than DDR attached to your CPU.
Other types of "AI" models do make use of the extra compute in GPUs but not LLMs.
Half of the times I ask qwen 0.6b "what is 1 + 2?" it ends up in a thinking loop of "but wait, the user is asking me to ..."
providers:
- # llama-server
id: openai:chat:qwen
config:
apiBaseUrl: http://localhost:7876
apiKey: "..."
passthrough:
chat_template_kwargs:
enable_thinking: false
The looping may be due to quantization -- I've seen it on locally quantized Q6_K Qwen 3.5/3.6 models. I recall seeing somewhere (here or r/LocalLlama) that Qwen models are sensitive to quantization of the keys, though I haven't yet experimented with/looked into fixing this. (I've been building up my promptfoo tests/infrastructure to detect looping, etc. on Qwen and other models.)I'm also interested in it as a student for distillation.