
Posted by cafkafk 14 hours ago

A 10 year old Xeon is all you need (point.free)
599 points | 248 comments | page 4
asimovDev 12 hours ago
I have an ancient DDR3 Xeon system that doesn't support any AVX (dual X5690 and 96 GB of 1333 MHz RAM). You reckon it would even build / run at all?
qwertox 12 hours ago
CPU (2012)

  Model name: Intel(R) Xeon(R) CPU E3-1265L V2 @ 2.50GHz
Mainboard

  Product Name: P8Z77 WS
GPU

  05:00.0 VGA compatible controller: NVIDIA Corporation AD106 [GeForce RTX 4060 Ti 16GB] (rev a1)
  05:00.1 Audio device: NVIDIA Corporation AD106M High Definition Audio Controller (rev a1)
Memory: 32GB

This works.

cafkafk 12 hours ago
Loading will take a few minutes, but with 96 GB you can squeeze the model in and still have around 10 GB of headroom, although depending on the Xeon you may have to downgrade to E4B instead. Should still work, though.
tgtweak 12 hours ago
It may work. Depending on your RAM speed, it might not even be that much slower.
burnt-resistor 11 hours ago
I run Win 11 Enterprise on an el cheapo spare parts Xeon E3-1275 V2 + 32 GiB DDR3-2133 + Gigabyte GA-B75M-D3H rev. 1.2 (TPM support)
haunter 10 hours ago
And this is one of those CPUs that had dual-socket motherboards, so you can have double the fun (and power bill).

https://pcpartpicker.com/products/motherboard/#s=20028,20029...

bombcar 6 hours ago
Is this John Siracusa? It sounds like it could be something he’d say…

(He has a fully maxed out “last Intel” Mac Pro and laments the lack of replacement).

robotswantdata 9 hours ago
Granite Rapids and Sapphire Rapids are very underrated for MoE inference loads. But you need a GPU for the KV cache.

Plus many boards also support CXL for RAM expansion over PCIe 5.0!

Source: building a hybrid inference business for regulated industry workloads.

Eonexus 13 hours ago
I wonder what the tokens per second actually are. Yes, it does say "reading speed" but that varies for everyone, no?
cafkafk 13 hours ago
That is a very fair point! I just ran a not very scientific benchmark with the system under load and posted the raw logs in a sibling comment above, but the short answer is that it's hitting 11.94 tokens per second for generation, while the machine is also serving as a binary cache and CI build server.

Purely vibes-based, but I think it goes up to 20+ tps when it's not under load (and that's me trying to be conservative). For context, reading speed at 250 wpm would be around 5 to 6 tokens per second.
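
The reading-speed comparison can be checked with a few lines of arithmetic. This is a minimal sketch that assumes roughly 1.3 tokens per English word, a common rule of thumb rather than a figure from the article; the function name is made up for illustration.

```python
def reading_speed_tokens_per_second(words_per_minute: float,
                                    tokens_per_word: float = 1.3) -> float:
    """Convert a reading speed in words per minute to tokens per second.

    tokens_per_word is an assumed average (about 1.3 for English text);
    the real ratio depends on the tokenizer and on the text itself.
    """
    words_per_second = words_per_minute / 60.0
    return words_per_second * tokens_per_word


rate = reading_speed_tokens_per_second(250)
print(f"250 wpm is about {rate:.2f} tokens/s")
```

Under that assumption, 250 wpm lands inside the 5 to 6 tokens per second range quoted above, so the measured 11.94 tokens per second is roughly twice reading speed.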

Eonexus 13 hours ago
Huh, that's actually not bad at all! Sure, it's not at the speed of a GPU, but still, 20 tps is cromulent for a CPU.
SirMaster 7 hours ago
Either they have an E5-2620 V2 from 13 years ago, or they have DDR4, not DDR3. The V3 and V4 only support DDR4.
qingcharles 6 hours ago
Would there be any advantage to running this as a dual Xeon? The CPUs are $5 and a dual mobo is $50...
bee_rider 6 hours ago
More memory bandwidth, presumably. Not sure how well the ecosystem handles thread pinning, though.
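
To put rough numbers on the bandwidth argument: token generation on CPU is usually limited by how fast weights can be streamed from RAM, so a back-of-the-envelope ceiling is bandwidth divided by the bytes read per token. The sketch below assumes 64-bit (8-byte) memory channels and uses hypothetical inputs (4 channels at 1600 MT/s per socket, 3 GB of active weights per token); none of these figures come from the article, and the two-socket number only holds if threads and memory are placed NUMA-aware.

```python
def peak_bandwidth_gb_s(channels: int, mt_per_s: float,
                        bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth of one socket in GB/s.

    Each DDR channel is assumed to be 64 bits wide, i.e. 8 bytes per transfer.
    """
    return channels * mt_per_s * 1e6 * bytes_per_transfer / 1e9


def generation_ceiling_tps(bandwidth_gb_s: float,
                           active_weights_gb: float) -> float:
    """Upper bound on tokens/s if every token streams the active weights once."""
    return bandwidth_gb_s / active_weights_gb


# Hypothetical configuration, chosen only for illustration.
one_socket = peak_bandwidth_gb_s(channels=4, mt_per_s=1600)
two_sockets = 2 * one_socket  # best case: perfect NUMA-aware placement
active_weights_gb = 3.0       # assumed bytes touched per generated token

for label, bandwidth in (("1 socket", one_socket), ("2 sockets", two_sockets)):
    ceiling = generation_ceiling_tps(bandwidth, active_weights_gb)
    print(f"{label}: {bandwidth:.1f} GB/s peak, ceiling of {ceiling:.1f} tokens/s")
```

Real throughput will sit below these ceilings, and without pinning, cross-socket memory traffic can eat much of the second socket's gain.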
Hasan121212 9 hours ago
I think one overlooked advantage of older Xeon systems is their availability. Many people can experiment with local AI deployments at a fraction of the cost of building a brand-new setup.
b65e8bee43c2ed0 3 hours ago
So how many tokens/s do you get for prompt processing (pp) and token generation (tg)? Did I miss it in the article?
egorfine 10 hours ago
This and the previous one are insanely good articles. Thank you!