A 10 year old Xeon is all you need

Posted by cafkafk 14 hours ago

A 10 year old Xeon is all you need (point.free)

599 points | 248 comments | page 4

asimovDev 12 hours ago

I have an ancient DDR3 Xeon box that doesn't support any AVX (dual X5690 and 96 GB of 1333 MHz RAM). You reckon it would even build / run at all?
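For anyone wanting to check their own box first, here is a minimal sketch, assuming a Linux x86 machine where `/proc/cpuinfo` has a `flags` line. It only reports which SIMD flags the kernel advertises; whether a given inference runtime builds or runs without AVX is a separate question this does not answer. The function names are made up for illustration.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # The first "flags" line is enough: all cores normally report the same set.
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def simd_support(flags):
    """Map a few SIMD flag names to whether they appear in the flag set."""
    return {name: name in flags for name in ("sse4_2", "avx", "avx2", "avx512f")}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or the file is unreadable: everything reports False
    print(simd_support(cpu_flags(text)))
```

If `avx` comes back `False`, any build would have to fall back to SSE-only code paths, assuming the runtime in question offers them.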

qwertox 12 hours ago

CPU (2012)

  Model name: Intel(R) Xeon(R) CPU E3-1265L V2 @ 2.50GHz

Mainboard

  Product Name: P8Z77 WS

GPU

  05:00.0 VGA compatible controller: NVIDIA Corporation AD106 [GeForce RTX 4060 Ti 16GB] (rev a1)
  05:00.1 Audio device: NVIDIA Corporation AD106M High Definition Audio Controller (rev a1)

Memory: 32GB

This works.

cafkafk 12 hours ago

Loading will take some minutes, but at 96 GB you can squeeze the model in and still have around 10 GB of headroom, although depending on the Xeon, you may have to downgrade to E4B instead. Should still work though.

tgtweak 12 hours ago

It may work - depending on your RAM speeds it might not even be that much slower.

burnt-resistor 11 hours ago

I run Win 11 Enterprise on an el cheapo spare parts Xeon E3-1275 V2 + 32 GiB DDR3-2133 + Gigabyte GA-B75M-D3H rev. 1.2 (TPM support)

haunter 10 hours ago

And this is one of those CPUs which had dual-socket motherboards, so you can have double the fun (and power bill)

https://pcpartpicker.com/products/motherboard/#s=20028,20029...

bombcar 6 hours ago

Is this John Siracusa? It sounds like it could be something he’d say…

(He has a fully maxed out “last Intel” Mac Pro and laments the lack of replacement).

robotswantdata 9 hours ago

Granite Rapids and Sapphire Rapids are very underrated for MoE inference loads. But you need a GPU for the KV cache.

Plus many boards also support CXL for RAM expansion over PCIe 5!

Source: building a hybrid inference business for regulated industry workloads.

Eonexus 13 hours ago

I wonder what the tokens per second actually are. Yes, it does say "reading speed" but that varies for everyone, no?

cafkafk 13 hours ago

That is a very fair point! I just ran a not very scientific benchmark with the system under load, and posted the raw logs in a sibling comment above, but the short answer is that it's hitting 11.94 tokens per second for generation - while it's also being a binary cache and CI build server.

Totally just vibes based, I think it goes up to 20+ tps when it's not under load (and that's me trying to be conservative). For context, reading speed at 250 wpm would be around 5 to 6 tokens per second.
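The reading-speed figure can be sanity-checked with a quick back-of-the-envelope sketch. The 1.3 tokens-per-word ratio is an assumption (a common rule of thumb for English text, and it varies by tokenizer), and the function name is made up for illustration.

```python
def reading_speed_tps(words_per_minute, tokens_per_word=1.3):
    """Convert a reading speed in words per minute to tokens per second.

    tokens_per_word is an assumed average; real tokenizers differ.
    """
    return words_per_minute / 60.0 * tokens_per_word


reading = reading_speed_tps(250)
print(round(reading, 2))  # 5.42

# Ratios between 1.2 and 1.4 tokens per word bracket the "5 to 6" range.
print(reading_speed_tps(250, 1.2), reading_speed_tps(250, 1.4))

# How the measured 11.94 tokens/s compares to that reading speed.
print(11.94 / reading)
```

Under that assumption, 11.94 tokens per second is a bit more than twice the 250 wpm reading speed.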

Eonexus 13 hours ago

Huh, that's actually not bad at all! Sure, it's not at the speed of a GPU, but still, 20 tps is cromulent for a CPU.

SirMaster 7 hours ago

Either they have an E5-2620 V2 from 13 years ago, or they have DDR4, not DDR3. The V3 and V4 only support DDR4.

qingcharles 6 hours ago

Would there be any advantage of running this as dual Xeon? The CPUs are $5 and a dual mobo is $50...

bee_rider 6 hours ago

More memory bandwidth presumably. Not sure how well the ecosystem handles thread pinning though.
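A rough sketch of the arithmetic behind that, assuming the usual theoretical-peak formula (transfer rate x 8-byte bus x channels x sockets). The DDR4-2133 quad-channel numbers are illustrative only, not anyone's actual configuration in this thread, and real throughput will land below the peak, especially if threads end up reading memory attached to the other socket.

```python
def peak_bandwidth_gbs(mega_transfers_per_s, channels, sockets=1, bus_bytes=8):
    """Theoretical peak memory bandwidth in GB/s.

    mega_transfers_per_s: the DDR rating, e.g. 2133 for DDR4-2133.
    bus_bytes: width of one memory channel in bytes (64-bit bus = 8).
    """
    return mega_transfers_per_s * 1e6 * bus_bytes * channels * sockets / 1e9


# Illustrative values only: quad-channel DDR4-2133, one socket vs two.
single = peak_bandwidth_gbs(2133, channels=4)
dual = peak_bandwidth_gbs(2133, channels=4, sockets=2)
print(f"{single:.1f} GB/s vs {dual:.1f} GB/s")
```

The doubling only shows up in practice if the work and the memory it touches are split across both NUMA nodes, which is where the thread-pinning question comes in.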

Hasan121212 9 hours ago

I think one overlooked advantage of older Xeon systems is their availability. Many people can experiment with local AI deployments at a fraction of the cost of building a brand-new setup.

b65e8bee43c2ed0 3 hours ago

So how many tokens/s do you get for pp (prompt processing) and tg (token generation)? Did I miss it in the article?

egorfine 10 hours ago

This and the previous one are insanely good articles. Thank you!
