Posted by jfb 4 days ago

Running local models is good now (vickiboykis.com)
1573 points | 602 comments | page 12
Patchistry 4 days ago|
Do you run your local models alongside some of your "paid" models?
atulmy 4 days ago||
Exact reason I'm building csuite.so, do check it out and let me know if you need early access!
henryoman 4 days ago||
Will there be a gemma4n?
sn0n 4 days ago||
Qwen 3? Qwen 2.5 coder?? Is this an llm article written on an outdated model?? LoL
matrix12 4 days ago||
gemma:12b at 75% of frontier? Yeah....
etoxin 4 days ago|
I think 75% is about right. It calls tools pretty well and has a good knowledge base. It's absolutely not 90% there, but 75% feels right.
jingw222 4 days ago||
open source must win
monegator 4 days ago||
I've been trying local models for the boring stuff you might be thinking about: writing small docs.

So i've tested a couple, and the speed is finally impressive. My colleague uses paid tiers of claude and GPT, and the speed is comparable. Maybe even slightly faster on my end.

The problem is: i'm running the model on my work laptop, a 12th gen i5 with 16GB of RAM (which, you know, i asked to upgrade to 64, but that was right at the time of the great RAM shortage of the '20s), so i'm pretty limited in what i can use. And this is running alongside the usual suspects: web browser hogging 1.5GB, MPLABX hogging 3, Windows taking at least 5 just to sit idle, thermal throttled to 1GHz... And yet its speed is comparable to a paid service. A lunch's worth of tokens vs a few cents of power.
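
To put a number on "pretty limited", here is a back-of-the-envelope sketch of that RAM budget. The figures are the ones quoted above; the function name is made up for illustration, and since Windows takes "at least" 5GB the result is an upper bound on what is left for the model and its context.

```python
def headroom_gb(total_gb: float, usage_gb: dict) -> float:
    """Return the RAM left over after subtracting the listed consumers."""
    return total_gb - sum(usage_gb.values())


# Figures quoted in the comment, in GB. Windows is "at least" 5,
# so the real headroom can only be smaller than this.
usage = {"browser": 1.5, "MPLABX": 3.0, "windows_idle": 5.0}

left = headroom_gb(16.0, usage)
print(f"Headroom for model weights + context: at most {left} GB")
```

Whatever is loaded has to fit in that remainder, weights and context together, which is why a bigger context window is off the table on this machine.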

So, what i found is that i need AT LEAST 16k of context window, otherwise the models will halt when i pass a small C file for analysis. And coding models will shit the bed with 4k. But we all know that: context size is king.
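
A quick way to guess in advance whether a file will fit is sketched below. It is only an estimate under loud assumptions: the 4-characters-per-token ratio is a rough rule of thumb, not any model's real tokenizer (code often tokenizes less efficiently), and the 2048-token reserve for the prompt and the answer is an arbitrary illustrative value.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; chars_per_token is an assumed ratio, not a real tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(text: str, context_tokens: int,
                    reserve_tokens: int = 2048,
                    chars_per_token: float = 4.0) -> bool:
    """True if the estimated input plus a reserve for prompt/output fits the window."""
    return estimate_tokens(text, chars_per_token) + reserve_tokens <= context_tokens


# Stand-in for a small C file of about 40,000 characters.
source = "x" * 40_000

for window in (4096, 16384):
    print(window, fits_in_context(source, window))
```

For a real check you would read the file with `open(path).read()` and, ideally, count tokens with the model's own tokenizer instead of the ratio.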

I found out that Qwen will keep looping while thinking, but that's not a surprise to you, either. But give it enough time and you will get a useful answer. I was hoping to use it as a better warning system for some languages, but i fear i need muuuch more context, because i tried to feed it a file that had a function with an endless loop:

At 4k context it almost shit the bed if i gave it just the offending function, then told it where to look. At 16k context, with the whole file, it needed some guidance on what the problem was, and after 10-15 minutes of thinking it found the issue. Problem is, it kept second-guessing itself for another 20 minutes on the same unrelated thing before giving the output. The fix it proposed was wrong, but the semantics were correct. Good enough. Maybe it will be faster if i don't ask for a fix (which i didn't, i just asked it to look for a specific issue).

Wish i had 3 times the RAM so i could see what happens with more context.

Then i gave it the task to analyze a C file to make an API document. It took half an hour, but then i had a good starting point, which i had to keep changing because it would confuse commands with IDs and things like that.

This was the Qwen 3.5 9B model.

I then tested Gemma 4, being impressed at the tokens per second it gives on my Pixel 8A. Same tasks: same issues with short context; with long context it gave absolutely useless answers when looking at code, but it took 1/3 the time of Qwen.

In producing documentation, instead, it was much faster, and it never hallucinated data. Good. In 15 minutes i had everything done.

Not bad for stuff running on a business laptop, while doing actual work.

Tomorrow i will try Qwen 3.6, let's see how it goes..

pauljeba 3 days ago||
How do I believe this? You wrote this blog by hand. lol
pauljeba 3 days ago||
How do I believe you? You wrote this post by hand. lol
TaniaDictee 3 days ago|
[flagged]