Posted by cafkafk 13 hours ago

A 10 year old Xeon is all you need (point.free)
580 points | 240 comments | page 3
kristjansson 5 hours ago|
Noting for reference that Gemma4 MTP work is in progress[0] on llama.cpp; similar work for Qwen3.6 landed recently and has been great thus far.

[0]: https://github.com/ggml-org/llama.cpp/pull/23398

cykros 9 hours ago||
Does this mean my 15 year old Phenom is too old? But it has 16 GB of DDR3 RAM!

Admittedly, it and web browsers don't get along that well. They're literally the only thing that drags on my Slackware 15 system, and even then usually only once I get to around 15 or so open tabs.

anon-3988 9 hours ago||
I tried to run gemma 4 on this CPU and it did not go well

https://www.techpowerup.com/cpu-specs/ryzen-7-4800u.c2281

It is way too slow

potus_kushner 12 hours ago||
@cafkafk, got a recommendation for a good model that fits into 64GB and leaves a couple of GB free for other tasks?
cafkafk 12 hours ago|
Honestly, at this point you're probably looking at a smaller model. For the Gemma series I'd go with Gemma 4 E4B with drafters, but that's just a hunch from using it on my laptop (where I do have an RTX 4060 M and 96 GB of RAM).

So you'd change the invocation slightly here, but a lot of things you can potentially reuse.

That said, the Gemma 4 E4B models have so far in my experience been... not great when it comes to long context, but they are very passable for basic tasks, and even seem surprisingly okay at tool calls.

sleepyeldrazi 9 hours ago|||
Have you tested Qwen3.6 35B? Putting aside the capability claims for that model (which I support, but which are not my point here), that 35B has a smaller active parameter count than the Gemma 4 26B, potentially making both prefill and decode faster out of the box. It also has MTP heads built into the model and well supported (you may need to make sure you download a quant that didn't strip them off, as some do to save space). I would be curious to see your numbers there too. And if you do test this, please go for a clean one and not a fine-tuned one.
potus_kushner 11 hours ago|||
i tried the Q4_K_M model from unsloth with your Q4_K_M drafter, but the required memory to load everything is 72GB. odd. otoh i could load Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled.IQ4_XS.gguf and it requires just ~18 GB:

~/ik_llama.cpp[main]$ build/bin/llama-cli --model ~/models/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled.IQ4_XS.gguf --spec-type mtp --draft-max 3 --draft-p-min 0.0 --spec-autotune -cnv --color --jinja --special -smgs -sas -mea 256 --temp 0.7 -t 6 --parallel 6 --cpu-moe --merge-up-gate-experts --flash-attn on --mla-use 3 --mlock --run-time-repack --no-kv-offload

works pretty fast, at about 15 t/s:

llama_print_timings: sample time = 45.28 ms / 404 runs ( 0.11 ms per token, 8921.67 tokens per second)
llama_print_timings: prompt eval time = 949.42 ms / 51 tokens ( 18.62 ms per token, 53.72 tokens per second)
llama_print_timings: eval time = 24067.08 ms / 400 runs ( 60.17 ms per token, 16.62 tokens per second)
llama_print_timings: total time = 242192.55 ms / 451 tokens
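
For anyone reading these timings: the per-token and tokens-per-second figures are just the elapsed time divided by the count, and the reverse. A minimal Python sketch that recomputes them from the numbers in the log above; the function names are made up for illustration and are not part of llama.cpp or ik_llama.cpp.

```python
def ms_per_token(elapsed_ms: float, count: int) -> float:
    """Average milliseconds spent per token (or per sampling run)."""
    return elapsed_ms / count


def tokens_per_second(elapsed_ms: float, count: int) -> float:
    """Throughput implied by an elapsed time in milliseconds and a count."""
    return count / (elapsed_ms / 1000.0)


# (elapsed ms, count) pairs copied from the llama_print_timings lines above.
timings = {
    "sample": (45.28, 404),
    "prompt eval": (949.42, 51),
    "eval": (24067.08, 400),
}

for name, (elapsed_ms, count) in timings.items():
    print(
        f"{name}: {ms_per_token(elapsed_ms, count):.2f} ms per token, "
        f"{tokens_per_second(elapsed_ms, count):.2f} tokens per second"
    )
```

The "eval" row is the decode rate that people usually quote as t/s. Small differences from the logged values can appear because the log rounds the elapsed times before printing them.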

so i wonder why the quantized qwen model uses way less memory than the gemma one.
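
A rough way to sanity-check the ~18 GB figure is to multiply the parameter count by the average bits per weight of the quant. The sketch below is a back-of-the-envelope estimate under stated assumptions: roughly 35e9 parameters (taken from the "35B" in the file name) and about 4.25 bits per weight for IQ4_XS. It covers weights only; KV cache, context size, compute buffers, and any separately loaded drafter are not counted, so it cannot by itself account for the 72GB seen with the other setup.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in decimal gigabytes.

    Ignores KV cache, compute buffers, and any tensors kept at higher precision.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Assumptions (not measured): ~35e9 parameters, ~4.25 bits per weight for IQ4_XS.
print(f"{weight_size_gb(35e9, 4.25):.1f} GB")
```

Under those assumptions the estimate lands close to the ~18 GB reported above, which suggests the weights dominate memory use in that run.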

Liftyee 6 hours ago||
Very intriguing. This might be the use for my e5-2430 V2 X2 server that's been lying around. DDR3 is (relatively) cheap now too. Could fit 192GB of RAM in it and play around for much cheaper than a new GPU.
alimbada 7 hours ago||
What's the best way to apply this to slightly more modern hardware - i.e. 5800XT 32GB DDR4, 9060XT 16GB?
mv4 5 hours ago||
I have an old 192GB DDR4 Dell Precision with dual Intel Xeon Gold 6130 that I've considered spinning up. What's giving me pause is 250W at idle.
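
To put the 250W idle figure in perspective, here is a small Python sketch of the yearly energy use if the machine idles around the clock. The electricity price is a purely hypothetical placeholder; substitute a local rate.

```python
HOURS_PER_YEAR = 24 * 365  # non-leap year


def annual_kwh(watts: float) -> float:
    """Energy used in a year by a constant load, in kilowatt-hours."""
    return watts * HOURS_PER_YEAR / 1000.0


def annual_cost(watts: float, price_per_kwh: float) -> float:
    """Yearly cost of a constant load at a flat price per kWh."""
    return annual_kwh(watts) * price_per_kwh


idle_watts = 250.0            # idle draw quoted in the comment above
hypothetical_price = 0.30     # assumed price per kWh, for illustration only

print(f"{annual_kwh(idle_watts):.0f} kWh per year")
print(f"{annual_cost(idle_watts, hypothetical_price):.2f} per year at the assumed price")
```

At a constant 250W that comes to 2190 kWh per year before any load is applied.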
mtoner23 5 hours ago|
Surely that number can go lower with some tweaks
mv4 4 hours ago||
I am sure it can. It will still generate a lot of heat when under load.

Are you telling me I should go for it? :)

I do have a dual DGX Spark cluster running MiniMax M2.7 already, so I am all for on-prem. But it will be interesting to see how this old machine performs!

shovas 6 hours ago||
I have run llama.cpp on an i7-2600 with a 1050. It's too slow for everyday usage but it's not too slow to make it obvious AI is going to be everywhere and in everything. It's too easy to run.
b65e8bee43c2ed0 2 hours ago||
so how many tokens/s do you get, pp and tg? did I miss it in the article?
bombcar 5 hours ago|
Is this John Siracusa? It sounds like it could be something he’d say…

(He has a fully maxed out “last Intel” Mac Pro and laments the lack of replacement).
