Posted by sidnarsipur 22 hours ago

The path to ubiquitous AI (17k tokens/sec) (taalas.com)
709 points | 402 comments
baq 21 hours ago|
One step closer to being able to purchase a box of LLMs on AliExpress, though 1.7k tok/s would be quite enough
d2ou 14 hours ago||
Would it make sense for the big players to buy them? There seems to be a huge avenue here to kill inference costs, which have always made me dubious about LLMs in general.
japoneris 20 hours ago||
I am super happy to see people working on hardware for local LLMs. Yet, isn't it premature? The space is still evolving. Today, I refuse to buy a GPU because I don't know what the best model will be tomorrow. I'm waiting for an off-the-shelf device that can run an Opus-like model.
coppsilgold 14 hours ago||
Performance like that may open the door to brute-forcing solutions to problems for which you have a verifier (problems such as decompilation).
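A minimal sketch of the generate-and-verify loop this describes; generate_candidate and verify are hypothetical placeholders for the model call and the checker (e.g. compile-and-compare for decompilation):

    def brute_force(problem, generate_candidate, verify, max_attempts=100_000):
        """Keep sampling cheap candidates until one passes the verifier;
        only viable when each sample is very fast and cheap."""
        for attempt in range(1, max_attempts + 1):
            candidate = generate_candidate(problem)  # one cheap, fast sample
            if verify(problem, candidate):           # e.g. recompile and diff against the target
                return candidate, attempt
        return None, max_attempts
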
saivishwak 19 hours ago||
But as models are changing rapidly and new architectures are coming up, how do they scale? We also don't yet know whether the current transformer architecture will scale much beyond where it already is. So many open questions, but VCs seem to be pouring in money.
hbbio 21 hours ago||
Strange that they apparently raised $169M (really?) and the website looks like this. Don't get me wrong: plain HTML would do if it were done well, or you'd expect something heavily designed. But the script-kiddie vibe-coded look seems off.

The idea is good though and could work.

ACCount37 20 hours ago|
Strange that they raised money at all with an idea like this.

It's a bad idea that can't work well. Not while the field is advancing the way it is.

Manufacturing silicon is a long pipeline - and in the world of AI, one year of capability gap isn't something you can afford. You build a SOTA model into your chips, and by the time you get those chips, it's outperformed at its tasks by open weights models half their size.

Now, if AI progress somehow grinds to a screeching halt, with model upgrades coming out every 4 years instead of every 4 months? Maybe it'll be viable. As is, it's a waste of silicon.

small_model 20 hours ago|||
Poverty of imagination here; there are plenty of uses for this, and it's a prototype at this stage.
ACCount37 20 hours ago||
What uses, exactly?

The prototype is: silicon with a Llama 3.1 8B etched into it. Today's 4B models already outperform it.

A token rate in five digits is a major technical flex, but does anyone really need to run a very dumb model at this speed?

The only things that come to mind that could reap a benefit are asymmetric exotics like VLA action policies and voice stages for V2V models. Both of those are "small fast low-latency model backed by a large smart model" setups, and both depend on model-to-model comms, which this doesn't demonstrate (see the sketch below).

In a way, it's an I/O accelerator rather than an inference engine. At best.
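
For reference, a rough sketch of the "small fast model backed by a large smart model" pattern mentioned above; fast_model and smart_model are hypothetical clients, with the fast one returning a draft plus a confidence score:

    def cascade(query, fast_model, smart_model, threshold=0.8):
        # Low-latency local pass first; escalate to the big model only
        # when the small model isn't confident in its own draft.
        draft, confidence = fast_model(query)
        if confidence >= threshold:
            return draft                            # good enough, stay on-device
        return smart_model(query, context=draft)    # hand off to the large model
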

MITSardine 20 hours ago|||
With LLMs this fast, you could imagine using them as any old function in programs.
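A minimal sketch of what that could look like, assuming a hypothetical low-latency complete(prompt) -> str binding to such a chip:

    import json

    def classify_sentiment(text: str, complete) -> str:
        """Treat the LLM call like any ordinary function: prompt in, parsed value out.
        `complete` is the assumed local completion callable."""
        prompt = (
            'Classify the sentiment of the text as "positive", '
            '"negative" or "neutral". Reply only with JSON like '
            '{"sentiment": "..."}.\n\nText: ' + text
        )
        return json.loads(complete(prompt))["sentiment"]
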
ACCount37 14 hours ago||
You could always have. Assuming you have an API or a local model.

Which was always the killer assumption, and this changes little.

leoedin 20 hours ago|||
Even if this first generation is not useful, the learning and architecture decisions in this generation will be. You really can't think of any value to having a chip which can run LLMs at high speed and locally for 1/10 of the energy budget and (presumably) significantly lower cost than a GPU?

If you look at any development in computing, ASICs are the next step. It seems almost inevitable. Yes, they will always trail the state of the art, but the value will come quickly within a few generations.

xav_authentique 19 hours ago|||
Maybe they're betting on model improvements plateauing, and that having a fairly stabilized, capable model that is orders of magnitude faster than running on GPUs will be valuable in the future?
TheServitor 10 hours ago||
I don't know the use of this yet but I'm certain there will be one.
bloggie 21 hours ago||
I wonder if this is the first step towards AI as an appliance rather than a subscription?
impossiblefork 21 hours ago|
So I'm guessing this is some kind of weights-as-ROM type of thing? At least that's how I interpret the product page. Or maybe even a sort of ROM that you can only access by doing matrix multiplies.
readitalready 21 hours ago|
You shouldn't need any ROM. It's likely the architecture is just fixed hardware with weights loaded in via scan flip-flops. If it were me making it, I'd just design a systolic array: multipliers feeding into multipliers, without even going through RAM.
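A toy software model of that dataflow, purely for illustration: an output-stationary systolic array simulated in NumPy (the real thing would be fixed multipliers and registers, not code):

    import numpy as np

    def systolic_matmul(A, B):
        """Cycle-by-cycle model: each PE holds one accumulator, multiplies the
        operands streaming through it, and forwards them right and down."""
        m, k = A.shape
        k2, n = B.shape
        assert k == k2
        acc = np.zeros((m, n))     # one accumulator per PE
        a_reg = np.zeros((m, n))   # A operands held in the PEs
        b_reg = np.zeros((m, n))   # B operands held in the PEs
        for t in range(m + n + k - 2):           # cycles for the skewed wavefronts to drain
            a_reg = np.roll(a_reg, 1, axis=1)    # A values move one PE to the right
            b_reg = np.roll(b_reg, 1, axis=0)    # B values move one PE down
            for i in range(m):                   # skewed feed on the left edge
                s = t - i
                a_reg[i, 0] = A[i, s] if 0 <= s < k else 0.0
            for j in range(n):                   # skewed feed on the top edge
                s = t - j
                b_reg[0, j] = B[s, j] if 0 <= s < k else 0.0
            acc += a_reg * b_reg                 # every PE multiply-accumulates
        return acc

    A, B = np.random.rand(4, 6), np.random.rand(6, 5)
    assert np.allclose(systolic_matmul(A, B), A @ B)
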