Top
Best
New

Posted by dev-experiments 2 days ago

Good results fine tuning a local LLM like Qwen 3:0.6B to categorize questions(www.teachmecoolstuff.com)
210 points | 49 commentspage 2
abhashanand1501 1 day ago|
Do small language models run on cpus or you still need a gpus to run them?
wongarsu 1 day ago||
Anything below one billion parameters you can run on the CPU at acceptable speed

For larger sizes you still can, it just becomes slower and slower. For a simple classification task (small input, tiny output, and you can constrain output to a couple tokens) you could even run something like a 4B or 8B model on the CPU

a96 1 day ago|||
I guess that technically depends on the software used to run the model, but in general it's always been possible to run on a CPU (and may even be possible to run on TPU or something else). It's just been slower. Likewise GPU RAM vs system RAM and the bandwidths involved can make hard bottlenecks.

GPU and VRAM (or fast unified RAM) is generally the option that is both available and performant, but especially really small models also run quite well on CPU and system RAM.

avadodin 1 day ago||
iGPUs are often slower or only as fast as CPUs when it comes to LLM text generation.

The advantage is mainly in memory bandwidth. External GPUs' internal memory is slightly faster than DDR attached to your CPU.

Other types of "AI" models do make use of the extra compute in GPUs but not LLMs.

throwa356262 1 day ago||
Are 0.6b models useful without fine tuning?

Half of the times I ask qwen 0.6b "what is 1 + 2?" it ends up in a thinking loop of "but wait, the user is asking me to ..."

rhdunn 1 day ago||
If you don't want the thinking, you can pass `enable_thinking: false` to the `chat_template_kwargs`. If using promptfoo, this can be done via:

    providers:
      - # llama-server
        id: openai:chat:qwen
        config:
          apiBaseUrl: http://localhost:7876
          apiKey: "..."
          passthrough:
            chat_template_kwargs:
              enable_thinking: false
The looping may be due to quantization -- I've seen it on locally quantized Q6_K Qwen 3.5/3.6 models. I recall seeing somewhere (here or r/LocalLlama) that Qwen models are sensitive to quantization of the keys, though I haven't yet experimented with/looked into fixing this. (I've been building up my promptfoo tests/infrastructure to detect looping, etc. on Qwen and other models.)
kamranjon 1 day ago||
A fun thing I do with Qwen 3.5 0.8b is to take a screenshot of the Hackernews homepage and ask it to give me a JSON representation of the data and it does surprisingly well. With a well structured prompt I think it could be made to be pretty reliable tool for that type of task out of the box.
Zambyte 1 day ago||
While a fun poc, surely it would be better to just use the API (see the footer)? Or just `curl | x2j | jq` and map the HTML directly to JSON?
kamranjon 1 day ago||
Yes apologies, Hackernews was just an example, you can do this with any website - it’s just a simple benchmark I like to use for testing vision models.
jszymborski 1 day ago||
I think the Qwen 0.6B is so cool. It is super fast and as illustrated here it has a clear niche, esp. when fine-tuned.

I'm also interested in it as a student for distillation.

armcat 1 day ago||
I mean it's always nice to play around with sLLM finetuning, but for practical purposes I would always start with a lazy learner using embeddings (something like a small Stella model), pre-embed the topics/categories, embed the question, perform a kNN using cosine distance. You can use an LLM to "expand" the topics before embedding to make them more contextual. This is usually super fast and super simple and gives you a nice baseline. Then I would add a classification head after embedding layer (with maybe some dropout + 2-3 MLP layers) and train my own classifier, and compare that to lazy learner. Only after that would I start finetuning an LLM.
danielhanchen 1 day ago||
Very cool write-up and GitHub repo!
crimsoneer 1 day ago||
Tangentially related, but the UK Gov Incubator for AI has quite a nifty LLM driven classification pipeline for survey answers.

https://github.com/i-dot-ai/consult

737max 1 day ago||
Is it just me or half these comments read like AI
VaporJournalAPP 1 day ago||
[flagged]
mlpicker 1 day ago||
[flagged]
lastdrop 1 day ago|
[dead]