A 10 year old Xeon is all you need (point.free)

Posted by cafkafk 12 hours ago

547 points | 236 comments

FartyMcFarter 7 hours ago|

I may have missed this in the article, but:

What was the net effect of the optimisations? How much faster did it get?

tomega2134 2 hours ago||

I wish this were somehow tagged with AI, so I would know that it's not about say, general computing or cost-efficiency (e.g. using an old xeon machine from ebay instead of new, in these cost-conscious times.)

As it is, the title reads as clickbait to me: 1) it implies I somehow need at least a Xeon, and 2) it doesn't say what I would actually need it for.

ryandrake 3 hours ago||

I've got an old HP Z-620 workstation with dual E5-2697 v2 CPUs (24 cores total, 48 threads @ 2.7GHz) and 128GB of DDR3 RAM. The docs say it supports up to 192GB, but I wasn't able to get it to POST with all the RAM slots full.

It's still a "homelab" beast and does great with development and GIS/Mapping applications. I was not able to figure out how to run AI workloads on it with decent performance, however, so I finally broke down and got a dedicated GPU for it. It's pretty great what can still be done with older hardware.

FpUser 3 hours ago|

I self-host on an old HP Z-840 with 2x 3.6 GHz Xeons (24 cores total) and 512 GB RAM. It cost me peanuts used and has worked like a charm for many years already.

Aurornis 3 hours ago||

llama.cpp includes a benchmarking tool called llama-bench https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...

ik_llama includes llama-sweep-bench https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples...

When comparing hardware, the output of these tools is very helpful to let others put it into context. The post says the output is "reading speed" but knowing the prefill and token generation speeds would be a lot more helpful.

vhaudiquet 10 hours ago||

The E5 2620-v4 only supports DDR4.

bobmcnamara 6 hours ago|

Probably in an X99 motherboard.

mwpmaybe 5 hours ago||

The memory controller is integrated into the CPU, so the motherboard chipset is irrelevant. There are some OEM-only v3/v4 parts with dual memory controllers, but the E5-2620 v4 is not one of them.

NSUserDefaults 9 hours ago||

How about the iMac Pro? Would that work? I was able to put 128 GB in it (not as easy as the regular iMac, but possible).

wazoox 9 hours ago|

I've been running various models on a Mac Pro 2013 (8 cores, 32 GB RAM) at about 8 to 10 t/s for months. It's not fast, but it's more than enough for many actual tasks, in particular background tasks. An iMac Pro will do just as well, I suppose.

fooker 8 hours ago||

What are the tasks that do well with 8-10 t/s ?

wazoox 6 hours ago||

The sort of task you don't expect to end immediately. If extracting data from a bunch of PDFs takes 1 hour or the whole night, that doesn't make much difference to me. It's not fast enough for auto completion and slightly too slow for chat (but bearable IMO).

fooker 3 hours ago||

Running a local LLM at 10 t/s overnight to extract data from a few PDFs will burn more in electricity than paying cents for the hosted Kimi models.

You can (sometimes) break even if you have a workstation GPU.
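A rough sketch of the arithmetic behind that comparison. Every number here (power draw, electricity price, hosted price per million tokens) is a placeholder assumption rather than a measurement, and it only counts generated tokens, so plug in your own figures:

```python
def local_cost_usd(tokens, tokens_per_s, watts, usd_per_kwh):
    """Electricity cost of generating `tokens` locally at a steady rate."""
    hours = tokens / tokens_per_s / 3600
    return hours * (watts / 1000) * usd_per_kwh


def hosted_cost_usd(tokens, usd_per_million_tokens):
    """API cost of generating the same number of tokens on a hosted model."""
    return tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical overnight job: 8 hours at 10 t/s.
TOKENS = 8 * 3600 * 10

# All three values below are assumptions chosen for illustration only.
local = local_cost_usd(TOKENS, tokens_per_s=10, watts=250, usd_per_kwh=0.25)
hosted = hosted_cost_usd(TOKENS, usd_per_million_tokens=1.00)

print(f"local electricity: ${local:.2f}, hosted API: ${hosted:.2f}")
```

Which side wins depends entirely on the inputs: a faster machine at the same wattage, cheaper power, or a pricier hosted model moves the break-even point.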

cbdevidal 6 hours ago||

Old hardware is surprisingly effective. I've been considering a side hustle selling offline AI to local businesses who are privacy-sensitive. Medical, legal, places like that.

At the low end, I'd use old Xeons with gobs of DDR3, install some V100s, run a smaller agent for general chat inquiries, and a frontier model for the deeper stuff, with a router that passes between them depending on the complexity.

The frontier model would perform very slowly, but if it's a deep task the user can submit it in a batch in the evening e.g. "Correlate all of these cases and look for patterns" then receive the output with morning coffee.

Of course, AI helped me work out a plan for this. Haha

danbruc 3 hours ago||

Has anyone tried to estimate what it would take to bake inference for a capable large language model into silicon, so that one can pipeline inputs through it and produce outputs at one token per clock cycle?

knorker 3 hours ago|

I'd expect it to require too much RAM bandwidth to be feasible.

RAM is really slow at silicon speeds. Very little is reachable in one clock cycle, unless the clock cycle is abysmally slow.
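To put a rough number on that, here is a back-of-the-envelope sketch. It assumes a dense model where every weight is read from memory once per generated token, with no batching or caching. The parameter count, quantization, and clock rate are illustrative assumptions, not figures from the thread:

```python
def required_bandwidth_bytes_per_s(n_params, bits_per_weight, tokens_per_s):
    """Memory bandwidth needed if all weights are read once per token."""
    bytes_per_token = n_params * bits_per_weight / 8
    return bytes_per_token * tokens_per_s


# Assumed example: 7 billion weights at 4 bits each, one token per cycle
# of a 1 GHz clock (that is, 1e9 tokens per second).
needed = required_bandwidth_bytes_per_s(
    n_params=7_000_000_000, bits_per_weight=4, tokens_per_s=1_000_000_000
)
print(f"{needed:.2e} bytes/s")
```

Even with generous assumptions, the result lands many orders of magnitude above what any RAM interface delivers, which is the core of the objection.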

danbruc 3 hours ago||

No RAM. Instead of having a general-purpose multiplier that multiplies an input with a weight stored in RAM, just have a multiplier that hardcodes the weight. In some sense, replace each weight with a specialized multiplier and wire them together with accumulators and activation functions in between, plus some registers for pipelining. If one goes for four-bit quantization, one could have sixteen optimized multipliers, one for each possible weight, and then just select and connect them according to the model weights and structure.

Example. If you have a neuron with 16 inputs each 8 bit wide and with a 4 bit weight per input, you will have 16 specialized multipliers each scaling its input by the corresponding weight and then the 16 scaled inputs feed into an adder tree and finally an activation function.
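A minimal software sketch of that neuron, to make the wiring concrete. This only models the dataflow, not timing or gate counts. The signed 4-bit weight range (-8 to 7), the ReLU activation, and all function names are assumptions for illustration:

```python
def make_const_multiplier(weight):
    """Stand-in for a multiplier circuit with its weight hardwired in."""
    return lambda x: x * weight


# One specialized multiplier per possible 4-bit weight value
# (assuming signed weights in the range -8..7).
MULTIPLIERS = {w: make_const_multiplier(w) for w in range(-8, 8)}


def adder_tree(values):
    """Sum values pairwise, level by level, like a balanced adder tree."""
    values = list(values)
    while len(values) > 1:
        if len(values) % 2:
            values.append(0)  # pad odd levels with a zero input
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    return values[0]


def relu(x):
    """Assumed activation function; any fixed function would do here."""
    return max(0, x)


def build_neuron(weights):
    """'Wire up' a neuron by selecting one fixed multiplier per input."""
    units = [MULTIPLIERS[w] for w in weights]

    def neuron(inputs):
        assert len(inputs) == len(units)
        return relu(adder_tree(u(x) for u, x in zip(units, inputs)))

    return neuron


# 16 inputs (8-bit values, 0..255), each with its own hardwired 4-bit weight.
neuron = build_neuron([1] * 16)
print(neuron(list(range(16))))
```

Note that `build_neuron` never stores the weights as data that is read at inference time; they only decide which fixed unit each input is connected to, which is the point of the proposal.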

b65e8bee43c2ed0 1 hour ago||

so how many tokens/s do you get, pp and tg? did I miss it in the article?

lreeves 6 hours ago|

Doesn't accepting 100% of the MTP draft tokens mean you should just be using the smaller model? Usually the acceptance rate in Qwen36 at least is around 60-70%, and the "wrong" tokens are still filled in entirely by the base model. But when you just accept 100% of the draft tokens, it seems kind of self-defeating, unless I'm wrong.

Also, I feel like everyone leaves off prompt processing/prefill speeds in these articles. If you are using a very small prompt and asking for mostly generated tokens, sure, but I'd love to know the time-to-response when asking for an analysis of an image or a few hundred lines of code.

dvdkon 6 hours ago||

As far as I know, speculative decoding still verifies that the proposed tokens are what the "big" model would generate; it just uses the guesses to make that process faster. Setting the probability threshold too low therefore shouldn't affect correctness, just speed (time will be wasted verifying bad guesses).
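A toy sketch of the greedy variant of that idea, to show why the draft quality changes speed but not output. This is not llama.cpp's implementation: the two "models" are hypothetical next-token functions, verification is done one token at a time rather than in one batched pass, and sampling-based variants use a probabilistic acceptance rule instead of exact matching:

```python
def speculative_decode(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decoding with toy next-token functions.

    Returns (generated tokens, fraction of checked draft tokens accepted).
    """
    out = list(prompt)
    accepted = checked = 0
    while len(out) - len(prompt) < n_new:
        # 1) The draft model cheaply proposes k tokens.
        ctx = list(out)
        draft = []
        for _ in range(k):
            token = draft_next(ctx)
            draft.append(token)
            ctx.append(token)
        # 2) The target model verifies them in order.
        for token in draft:
            checked += 1
            expected = target_next(out)
            if token == expected:
                out.append(token)
                accepted += 1
            else:
                out.append(expected)  # target's token replaces the bad guess
                break                 # discard the rest of this draft
            if len(out) - len(prompt) >= n_new:
                break
    rate = accepted / checked if checked else 0.0
    return out[len(prompt):], rate


# Hypothetical toy models over tokens 0..9.
def target(ctx):
    return (ctx[-1] + 1) % 10


def good_draft(ctx):
    return (ctx[-1] + 1) % 10  # always agrees with the target


def bad_draft(ctx):
    return 0  # almost always wrong


good_out, good_rate = speculative_decode(target, good_draft, [0], n_new=5)
bad_out, bad_rate = speculative_decode(target, bad_draft, [0], n_new=5)
print(good_out == bad_out, good_rate, bad_rate)
```

Both runs emit the same tokens because every emitted token is either confirmed or supplied by the target; only the acceptance rate, and so the amount of wasted draft work, differs.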

lreeves 6 hours ago||

But won't setting it to accept 100% of the proposed tokens skip the verification?

ac29 5 hours ago|||

None of those settings make the speculative decoder accept 100% of drafted tokens. I assume you are looking at --draft-p-min 0.0; if so, you are misunderstanding what it does.

naasking 6 hours ago||

It depends on the type of MTP. If you're using two models, draft + full, then arguably yes, the larger model isn't providing much benefit if you really are seeing 100% acceptance rates. There are other forms of speculative decoding that work within the larger model by itself, though; e.g., Qwen has additional speculative decoding attention heads, so there is no secondary drafting model.
