
Posted by ricardbejarano 19 hours ago

Can I run AI locally? (www.canirun.ai)
1134 points | 279 comments
Readerium 11 hours ago|
Qwen 3.5 4B is the goat then
casey2 6 hours ago||
Something notable is that Qwen3.5:0.8B does better on benchmarks than GPT3.5, and runs much faster on local hardware than GPT3.5 did at release. However, Qwen3.5:0.8B is dumber and slower than GPT3.5 in practice. It's dumber: it can do 3*3, but if asked to explain it in terms of the definition (i.e. 3+3+3=9) it fails. It's slower: it's a thinking model, so your 900 T/s are mainly spent "thinking", and most of the time it will just repeat itself until it hangs.

It's pretty obvious that this reasoning scaling is a mirage; parameters are all you need. Everything else is mostly just wasting time while hardware gets better.

reactordev 13 hours ago||
This shows no models work with my hardware, but that's the furthest thing from the truth, as I'm running Qwen3.5…

This isn’t nearly complete.

kennywinker 12 hours ago|
Well… don’t keep us guessing - what hardware? And which size Qwen3.5?
g_br_l 15 hours ago||
could you add raspi to the list to see which ridiculously small models it can run?
varispeed 15 hours ago||
Does it make any sense? I tried a few models at 128GB and it's all pretty much rubbish. Yes, they do give coherent answers, and sometimes they are even correct, but most of the time it is just plain wrong. I find it a massive waste of time.
mongrelion 10 hours ago||
Apparently there is a whole science behind running models. I have seen the instructions that unsloth publishes for their quants, and depending on the model they'll tweak things like the temperature, top-k, etc.

The size of the quantization you choose also makes a difference.

The GPU driver also plays an important role.

What was your approach? What software did you use to run the models?
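For anyone unfamiliar with the knobs mentioned above: temperature and top-k are sampling settings, not part of the model itself. A minimal illustrative sketch of what they do (plain Python over a toy logit vector; not any particular runtime's implementation, and the numbers are made up):

```python
import math
import random

def sample_top_k(logits, temperature=0.7, top_k=40, rng=random):
    """Sample a token id from raw logits using top-k filtering
    plus temperature-scaled softmax."""
    # Keep only the indices of the top_k highest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one surviving token id according to those probabilities.
    return rng.choices(top, weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
# With top_k=2, only token ids 0 and 1 can ever be chosen.
token = sample_top_k(toy_logits, temperature=0.7, top_k=2)
```

This is why a model that seems broken at one temperature can behave fine at the settings its publisher recommends: the weights are identical, but the sampling distribution changes.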

boutell 15 hours ago||
I'm not sure how long ago you tried it, but look at Qwen 3.5 32B on a fast machine. It's usually best to turn off thinking if you're not doing tool use.
metalliqaz 15 hours ago||
Hugging Face can already do this for you (with a much more up-to-date list of available models), and so can LM Studio. They don't attempt to estimate tok/sec, though, so that's a cool feature. That said, I don't really trust those numbers much, because they don't incorporate information about the CPU, etc. Full GPU offload often isn't possible on consumer PC hardware. Also, there are different quants available that make a big difference.
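The "different quants make a big difference" point is easy to see from a back-of-the-envelope calculation: weight memory is roughly parameter count times bits per weight. A quick sketch (the bits-per-weight figures are approximate averages for common GGUF quant types, and this ignores KV cache and runtime overhead):

```python
def weight_gb(n_params_billion, bits_per_weight):
    """Rough memory needed for the weights alone, in GB
    (ignores KV cache, activations, and runtime overhead)."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 32B model at a few quantization levels (approximate bits/weight):
for name, bits in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.85)]:
    print(f"{name}: ~{weight_gb(32, bits):.0f} GB for weights")
```

So the same 32B model spans roughly 64 GB down to under 20 GB depending on quant, which is the difference between "doesn't fit" and "fits with room for context" on a 24 GB card.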
charcircuit 15 hours ago||
On mobile it does not show the name of the model in favor of the other stats.
butILoveLife 8 hours ago||
This is borderline irresponsible. Conflating first-token speed with the speed of all tokens is terrible. It makes Apple look far better than it actually is.

Just ask any Apple user, they don't actually use local models.

bheadmaster 13 hours ago||
Missing 5060 Ti 16GB
lagrange77 13 hours ago|
Finally! I've been waiting for something like this.