Performance per dollar is getting faster and cheaper

Posted by latchkey 23 hours ago

(www.wafer.ai)

333 points | 132 comments | page 3

beffjezos 17 hours ago|

This is very interesting and yet not at the same time. This looks to be optimized for single-stream LLM traffic which is not viable to serve in a production setting. It's only interesting to hobbyists that want to run the model locally.

It's genuinely neat that AI can find the right optimization pathways in an AMD inference server to unlock this, but by the same token (pun intended) this is a classic case of benchmark hacking that doesn't stand up to real-world application.

wmf 17 hours ago||

You got it backwards; it's ~200 on single stream so the 2,600 is achieved with ~13 streams.
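For anyone following the arithmetic, here is a rough sketch of how the two numbers relate. It assumes per-stream speed stays at roughly 200 tok/s as concurrency grows, which real batching will not guarantee; the function names are made up for illustration.

```python
def implied_streams(total_tok_s: float, per_stream_tok_s: float) -> float:
    """Concurrent streams implied by a node total, assuming every stream
    runs at the same per-stream rate (an assumption, not a measurement)."""
    if per_stream_tok_s <= 0:
        raise ValueError("per_stream_tok_s must be positive")
    return total_tok_s / per_stream_tok_s


def per_user_tok_s(total_tok_s: float, streams: int) -> float:
    """Average tokens per second each user sees when the node total is
    split evenly across concurrent streams."""
    if streams <= 0:
        raise ValueError("streams must be positive")
    return total_tok_s / streams


# Figures quoted in the thread: ~2,600 tok/s node total, ~200 tok/s single stream.
print(implied_streams(2600, 200))  # 13.0
print(per_user_tok_s(2600, 13))    # 200.0
```

Reporting both the node total and the per-user figure makes it clear which one a benchmark is optimizing for.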

beffjezos 17 hours ago||

Yeah that makes sense. I'm more familiar with seeing tok/s/user + TTFT rather than the total node throughput.

technoabsurdist 17 hours ago||

hi yes it’s not optimized for single stream it’s optimized for total node throughput

foobar10000 7 minutes ago|||

Well, for a lot of agentic stuff nowadays, 250K-500K context is where things live. Unfortunately the benchmarks don't really show that, but they could :)

beffjezos 17 hours ago|||

Oh, that's much better then. A good metric to share is the tokens per second per user for the node rather than the total throughput of the node. It disambiguates what's being optimized for much better than your blog post currently does.

technoabsurdist 15 hours ago||

sounds good feedback taken, thanks beffjezos

gowthamsaiyadav 10 hours ago||

world is not limited by Nvidia, AMD can be used

calin2k 13 hours ago||

then why is token per dollar getting more expensive?

ilaksh 6 hours ago||

There are a limited number of these available in comparison to demand. I think people figured out that LLMs and VLMs can do real work that can replace a lot of humans. And for plenty of jobs, it's good enough to reduce already outsourced staff by 75-90% at a fraction of the cost.

FeepingCreature 10 hours ago|||

Because lots of people are willing to pay more dollar for smarter token.

AtlasBarfed 13 hours ago||

Because they are dumping/subsidizing token processing to try to get companies to fire as many people as possible, so those companies will be dependent on them when they have to jack up the rates.

yieldcrv 21 hours ago||

Agentic coding drivers for different architectures is a massive unlock for the world

So much compute is underutilized, waiting for a savant or company to prioritize an architecture, and now all the other engineers can tackle this at any time if they get inspired and find the right prompts.

technoabsurdist 20 hours ago||

this is exactly our thesis at wafer :) thank you for the support

yieldcrv 16 hours ago||

well done

yogthos 20 hours ago|||

Personally, I can't wait till something like this starts getting to consumer level. https://www.anuragk.com/blog/posts/Taalas.html

yieldcrv 20 hours ago||

That's pretty fascinating. Apple has some innocuous LLMs and transformers baked into its devices that leverage its neural chipset.

So I could see something like this, where the neural chipset has an LLM baked into it that can't be easily updated until you get a new device.

yogthos 7 hours ago||

Exactly, it'd be the same as regular chip design evolving. You get a specific model version baked into the chip, and if it does what you need then it's fine. If you need more capability in the future, you just buy a new chip.

I also think the dynamic would be really different if model inference can run at ridiculous speeds. You could make a genetic algorithm loop around it, so it can generate a population of proposals at each step, then have those tested and whittled down iteratively. If inference happens at thousands of tokens per second, then from user perspective it would still be really fast, and even a small model could solve complex problems.
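To make the loop concrete, here is a minimal sketch of the propose, test, and whittle-down cycle. Everything in it is hypothetical: `propose` stands in for a fast model call that rewrites a candidate, `score` stands in for whatever test harness you have, and the toy problem (guessing an integer) is only there so the code runs without a model.

```python
import random
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def evolve(
    seed_population: List[T],
    propose: Callable[[T, random.Random], T],  # hypothetical model call
    score: Callable[[T], float],               # hypothetical test harness
    generations: int = 20,
    population_size: int = 16,
    survivors: int = 4,
    rng: Optional[random.Random] = None,
) -> T:
    """Generate proposals, test them, keep the best, and repeat."""
    if not seed_population:
        raise ValueError("seed_population must not be empty")
    rng = rng or random.Random(0)
    population = list(seed_population)
    for _ in range(generations):
        # Test every candidate and keep only the top scorers.
        population.sort(key=score, reverse=True)
        parents = population[:survivors]
        # Ask the (stand-in) model for new variations of the survivors.
        children = [
            propose(rng.choice(parents), rng)
            for _ in range(population_size - len(parents))
        ]
        # Survivors carry over, so the best score never gets worse.
        population = parents + children
    return max(population, key=score)


# Toy stand-ins so the sketch runs without any model.
TARGET = 42


def toy_propose(parent: int, rng: random.Random) -> int:
    return parent + rng.randint(-5, 5)


def toy_score(candidate: int) -> float:
    return -abs(candidate - TARGET)


best = evolve([0, 100], toy_propose, toy_score, rng=random.Random(0))
print(best, toy_score(best))
```

With inference at thousands of tokens per second, each generation is `population_size - survivors` model calls, so the wall-clock cost is dominated by how fast the test step runs.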

zuzululu 15 hours ago||

yeah but we are still far far away from being able to run the frontier model equivalents locally without significant quantization

even having something like opus 4.8 locally would completely change the landscape

villgax 16 hours ago||

They fail to mention non-speculative numbers and whether the baseline was nvfp4 as well. So much for erosion against an older gen.

bitwize 13 hours ago||

(in a high-pitched, pathetic regency-era British orphan voice) Please sir, may I have some compute as well?
