Posted by sidnarsipur 15 hours ago

The path to ubiquitous AI (17k tokens/sec) (taalas.com)
650 points | 373 comments
max8539 4 hours ago|
This is crazy! These chips could make high-reasoning models run so fast that they could generate lots of solution variants and automatically choose the best. Or you could have a smart chip in your home lab and run local models fast, without needing a lot of expensive hardware or electricity.
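
A minimal sketch of that generate-many-and-pick-one idea; generate and score are placeholders for whatever fast backend and ranking heuristic (verifier model, unit tests, log-probs) you'd actually plug in:

    from typing import Callable

    def best_of_n(prompt: str,
                  generate: Callable[[str], str],   # fast inference backend (placeholder)
                  score: Callable[[str, str], float],  # ranking heuristic, higher is better
                  n: int = 16) -> str:
        """Sample n candidates and return the highest-scoring one.
        At ~17k tokens/sec the wall-clock cost of n samples stays small."""
        candidates = [generate(prompt) for _ in range(n)]
        return max(candidates, key=lambda c: score(prompt, c))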
segmondy 10 hours ago||
Pretty cool. What they need is to build a tool that can take any model to a chip in as short a time as possible. How quickly can they give me DeepSeek, Kimi, Qwen or GLM on a chip? I'll take 5k tok/sec for those!
throwaw12 9 hours ago|
Also imagine it costs $300/unit; we'd all host our own set of models locally. One can dream.
bmc7505 8 hours ago||
17k TPS is slow compared to other probabilistic models. It was possible to hit ~10-20 million TPS decades ago with n-gram and PDFA models, without custom silicon. A more informative KPI would be Pass@k on a downstream reasoning task - for many such benchmarks, increasing token throughput by several orders of magnitude does not even move the needle on sample efficiency.
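
For reference, the usual unbiased pass@k estimator is cheap to compute from n samples per task of which c pass; a minimal sketch:

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k),
        where n = samples drawn per task and c = samples that passed."""
        if n - c < k:
            return 1.0
        return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

    # e.g. pass_at_k(200, 3, 1) ≈ 0.015, pass_at_k(200, 3, 100) ≈ 0.88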
luyu_wu 5 hours ago||
I think this is quite interesting for local AI applications. Since this technology basically scales with parameter size, an ASIC for a Qwen 0.5B or Google 0.3B model thrown onto a laptop motherboard would be very compelling.

Obviously not for any hard applications, but for significantly better autocorrect, local next word predictions, file indexing (tagging I suppose).

The efficiency of such a small model should theoretically be great!

aetherspawn 13 hours ago||
This is what’s gonna be in the brain of the robot that ends the world.

The sheer speed at which this thing can “think” is insanity.

arjie 7 hours ago||
This is incredible. At this speed I can use LLMs for a lot of pre-filtering and similar tasks. As a trivial example, I have a personal OpenClaw-like bot that I use to do a bunch of things. Some of those things just require it to do trivial tool calling and tell me what's up. Things like skill or tool pre-filtering become a lot more feasible if they can run on every request.
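
Roughly what I mean by tool pre-filtering, with fast_complete standing in for a hypothetical wrapper around one of these fast endpoints:

    import json
    from typing import Callable

    def prefilter_tools(task: str,
                        tools: dict[str, str],                # name -> short description
                        fast_complete: Callable[[str], str],  # hypothetical fast-model wrapper
                        keep: int = 5) -> list[str]:
        """Ask a fast, cheap model which tools are plausibly relevant,
        so the main agent only ever sees a short list instead of everything."""
        prompt = (
            f"Task: {task}\n"
            "Available tools:\n"
            + "\n".join(f"- {name}: {desc}" for name, desc in tools.items())
            + f"\nReturn a JSON list of at most {keep} relevant tool names."
        )
        try:
            names = json.loads(fast_complete(prompt))
        except (json.JSONDecodeError, TypeError):
            return list(tools)  # on a malformed reply, fall back to the full list
        if not isinstance(names, list):
            return list(tools)
        return [n for n in names if n in tools][:keep]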

Anyway, I imagine these are incredibly expensive, but if they ever sell them with Linux drivers, slotting into a standard PCIe slot, it would be absolutely sick. At 3 kW that seems unlikely, but for that kind of speed I bet I could find space in my cabinet and just rip it. I just can't justify $300k, you know.

rhodey 10 hours ago||
I wanted to try the demo, so I found the link.

> Write me 10 sentences about your favorite Subway sandwich

Clicked the button.

Instant! It was so fast I started laughing. This kind of speed will really, really change things.

TheServitor 3 hours ago||
I don't know what the use of this is yet, but I'm certain there will be one.
jtr1 8 hours ago||
The demo was so fast it highlighted a UX aspect of LLMs I hadn’t considered before: there’s such a thing as too fast, at least in the chatbot context. The demo answered with a page of text so quickly that I had to scroll up every time to see where the answer started. It completely broke the illusion of conversation, where I can usually interrupt if we’re headed in the wrong direction. In some contexts it may become useful to artificially slow down the delivery of output, or to tune it to the reader’s speed based on how quickly they reply. TTS probably does this naturally, but for text-based interactions it’s still something to think about.
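
One simple option is to throttle only the display, not the generation; a rough sketch (the words-per-minute default is just a guess at a comfortable reading pace):

    import sys
    import time
    from typing import Iterable

    def paced_print(chunks: Iterable[str], words_per_minute: int = 300) -> None:
        """Display already-generated text at a readable pace instead of all at once.
        The model can finish instantly; only the presentation is throttled."""
        delay = 60.0 / words_per_minute
        for chunk in chunks:
            for word in chunk.split():
                sys.stdout.write(word + " ")
                sys.stdout.flush()
                time.sleep(delay)
        sys.stdout.write("\n")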
troyvit 9 hours ago|
So they create a new chip for every model they want to support, is that right? Looking at that from 2026, when new large models are coming out every week, that seems troubling, but that's also a surface take. As many people here know better than I do, a lot of the new models the big players release are just incremental changes with little optimization going into how they're used, so maybe there's plenty of room for a model-as-hardware model.

Which brings me to my second thing. We mostly pitch the AI wars as OpenAI vs Meta vs Claude vs Google vs etc. But another take is the war between open, locally run models and SaaS models, which really is about the war for general computing. Maybe a business model like this is a great tool to help keep general computing in the fight.

gordonhart 7 hours ago||
We’re reaching a saturation threshold where older models are good enough for many tasks, certainly at 100x faster inference speeds. Llama 3.1 8B might be a little too old to be directly useful for, e.g., coding, but it certainly gets the gears turning about what you could do with one Opus orchestrator and a few of these blazing-fast minions to spit out boilerplate…
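
A minimal sketch of that orchestrator-plus-minions fan-out, with fast_complete standing in for a hypothetical fast-worker endpoint:

    from concurrent.futures import ThreadPoolExecutor
    from typing import Callable

    def fan_out(subtasks: list[str],
                fast_complete: Callable[[str], str],  # hypothetical fast-minion endpoint
                max_workers: int = 8) -> list[str]:
        """Hand boilerplate subtasks to fast worker models in parallel and
        collect their drafts for the stronger orchestrator model to review."""
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(fast_complete, subtasks))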
g-mork 7 hours ago||
One of these things, however old, coupled with robust tool calling is a chip that could remain useful for decades. Baking in incremental updates of world knowledge isn't all that useful. It's kinda horrifying if you think about it: this chip, among other things, contains knowledge of Donald Trump encoded in silicon. I think this is a way cooler legacy for Melania than the movie, haha.