Posted by cafkafk 12 hours ago
What was the net effect of the optimisations? How much faster did it get?
As it stands, the title reads as click-bait to me: 1) it implies I need at least a Xeon, and 2) it doesn't say what I would actually need it for.
It's still a "homelab" beast and does great with development and GIS/Mapping applications. I was not able to figure out how to run AI workloads on it with decent performance, however, so I finally broke down and got a dedicated GPU for it. It's pretty great what can still be done with older hardware.
ik_llama includes llama-sweep-bench https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples...
When comparing hardware, the output of these tools is very helpful for letting others put it into context. The post describes the output as "reading speed", but knowing the prefill and token generation speeds separately would be a lot more helpful.
You can (sometimes) break even if you have a workstation GPU.
At the low end, I'd use old Xeons with gobs of DDR3, install some V100s, run a smaller agent for general chat inquiries, and a frontier model for the deeper stuff, with a router that passes between them depending on the complexity.
The frontier model would run very slowly, but for a deep task the user can submit it as a batch job in the evening (e.g. "Correlate all of these cases and look for patterns") and receive the output with morning coffee.
Of course, AI helped me work out a plan for this. Haha
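To make the routing idea concrete, here is a minimal sketch in Python of a router that sends simple prompts to the small model and deep ones to the frontier model as an overnight batch. Everything in it is an assumption for illustration: the `Route` type, the keyword list, the 200-word cutoff, and the model labels `"small"` and `"frontier"` are hypothetical, and a real setup would probably need a better complexity signal than keyword matching.

```python
from dataclasses import dataclass

# Hypothetical hints that a prompt is a "deep" task; tune for your own workload.
DEEP_HINTS = ("correlate", "analyze", "compare", "patterns")


@dataclass
class Route:
    model: str   # "small" or "frontier" (illustrative labels, not real model names)
    batch: bool  # True means: queue it and run overnight


def complexity_score(prompt: str) -> int:
    """Crude heuristic: one point per matching hint, plus one for long prompts."""
    text = prompt.lower()
    score = sum(1 for hint in DEEP_HINTS if hint in text)
    if len(text.split()) > 200:  # assumed cutoff for a "long" prompt
        score += 1
    return score


def route(prompt: str, threshold: int = 1) -> Route:
    """Send deep tasks to the slow frontier model as a batch, the rest to the small model."""
    if complexity_score(prompt) >= threshold:
        return Route(model="frontier", batch=True)
    return Route(model="small", batch=False)


if __name__ == "__main__":
    print(route("What time is it?"))
    print(route("Correlate all of these cases and look for patterns"))
```

The point of returning a `batch` flag rather than calling a model directly is that the slow path can be queued and drained overnight, which is what makes the old hardware tolerable.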
RAM is really slow at silicon speeds. Very little is reachable in one clock cycle, unless the clock cycle is abysmally slow.
For example, if you have a neuron with 16 inputs, each 8 bits wide, and a 4-bit weight per input, you get 16 specialized multipliers, each scaling its input by the corresponding weight. The 16 scaled inputs then feed into an adder tree and finally an activation function.
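As a software model of that datapath, here is a rough Python sketch. The comment above doesn't specify signedness or the activation function, so this assumes unsigned 8-bit inputs, signed 4-bit weights (-8 to 7), and ReLU; the function names are made up for illustration.

```python
def adder_tree(values):
    """Sum values pairwise, level by level, the way a hardware adder tree would."""
    level = list(values)
    if not level:
        return 0
    while len(level) > 1:
        if len(level) % 2:          # pad odd levels with a zero input
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def relu(x):
    """Assumed activation function; the original comment does not name one."""
    return max(0, x)


def neuron(inputs, weights):
    """16 multipliers -> adder tree -> activation."""
    assert len(inputs) == len(weights) == 16
    assert all(0 <= x <= 255 for x in inputs)    # assumed unsigned 8-bit inputs
    assert all(-8 <= w <= 7 for w in weights)    # assumed signed 4-bit weights
    products = [x * w for x, w in zip(inputs, weights)]  # one multiplier per input
    return relu(adder_tree(products))


if __name__ == "__main__":
    print(neuron([1] * 16, [1] * 16))
```

With 16 inputs the tree is four levels deep (16, 8, 4, 2, 1), so in hardware the sum settles after four adder delays instead of fifteen sequential additions.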
Also, I feel like everyone leaves off prompt processing/prefill speeds in these articles. If you are using a very small prompt and asking for mostly generated tokens, sure, but I'd love to know the time-to-response when asking for an analysis of an image or a few hundred lines of code.
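To show why the prefill number matters, here is a back-of-the-envelope estimate in Python. It assumes the simplest possible model: the whole prompt is processed at a constant prefill rate before the first token appears, then tokens are generated at a constant decode rate. The function name and the numbers in the example are placeholders, not measurements from the post.

```python
def estimate_latency(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (time_to_first_token, total_time) in seconds.

    Simplified model: constant prefill and decode rates, no batching,
    no prompt caching, no slowdown as the context grows.
    """
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("token rates must be positive")
    ttft = prompt_tokens / prefill_tps           # wait before anything appears
    total = ttft + output_tokens / decode_tps    # plus time to stream the answer
    return ttft, total


if __name__ == "__main__":
    # Placeholder numbers: a 1000-token prompt, 100 tokens out,
    # 100 tokens/s prefill and 10 tokens/s generation.
    ttft, total = estimate_latency(1000, 100, 100.0, 10.0)
    print(f"time to first token: {ttft:.1f}s, total: {total:.1f}s")
```

With those placeholder rates, half of the total wait is spent before the first token shows up, which a generation-only figure hides.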