Posted by sidnarsipur 1 day ago

The path to ubiquitous AI (17k tokens/sec) (taalas.com)
744 points | 412 comments
retrac98 1 day ago|
Wow. I’m finding it hard to even conceive of what it’d be like to have one of the frontier models on hardware at this speed.
btbuildem 23 hours ago||
This is impressive. If you can scale it to larger models, and somehow make the ROM writeable, wow, you win the game.
dsign 1 day ago||
This is like microcontrollers, but for AI? Awesome! I want one for my electric guitar; and please add an AI TTS module...
brazzy 1 day ago|
No, it's ASICs, but for AI.
xnx 22 hours ago||
Gemini 2.5 Flash Lite does 400 tokens/sec. Is there a benefit to going faster than a person can read?
atls 21 hours ago||
There is also the use case of delegating tasks programmatically to an LLM, for example, transforming unstructured data into structured data. This task often can’t be done reliably without either 1. lots of manual work, or 2. intelligence, especially when the structure of the individual data pieces is unknown. Problems like these can be solved much more efficiently by LLMs, and if you imagine these programs processing very large datasets, then sub-millisecond inference is crucial.
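A minimal sketch of that kind of pipeline, assuming an OpenAI-compatible endpoint in front of a fast model (the URL, model name, and field names below are placeholders, not anything Taalas ships):

    # Minimal sketch: unstructured -> structured extraction with a fast LLM.
    import json
    import requests

    ENDPOINT = "http://localhost:8000/v1/chat/completions"  # hypothetical fast-inference box
    MODEL = "llama-3.1-8b-instruct"                          # placeholder model name

    def extract(record: str) -> dict:
        """Turn one unstructured record into a fixed JSON shape via the model."""
        prompt = (
            "Extract the fields name, company, amount, currency, date from the text "
            "below and reply with JSON only.\n\n" + record
        )
        resp = requests.post(ENDPOINT, json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,
        }, timeout=30)
        resp.raise_for_status()
        return json.loads(resp.json()["choices"][0]["message"]["content"])

    print(extract("Invoice from Acme Corp: $1,200 payable by Jane Doe on 2025-03-01."))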
xnx 21 hours ago||
Aren't such tasks inherently parallelizable?
booli 22 hours ago|||
Agents also "read", so yes there is. Think about spinning up 10, 20, or 100 sub-agents for a small task and having them all return near-instantly. That's the use case, not the chatbot.
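A minimal sketch of that fan-out, assuming an OpenAI-compatible endpoint (the URL and model name are placeholders):

    # Fan one small task out to N "sub-agents" in parallel.
    from concurrent.futures import ThreadPoolExecutor
    import requests

    ENDPOINT = "http://localhost:8000/v1/chat/completions"  # placeholder
    MODEL = "some-8b-model"                                  # placeholder

    def ask(task: str) -> str:
        """One 'sub-agent': a single small prompt sent to the fast model."""
        resp = requests.post(ENDPOINT, json={
            "model": MODEL,
            "messages": [{"role": "user", "content": task}],
        }, timeout=30)
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    tasks = [f"Summarize section {i} of the report in one sentence." for i in range(100)]

    # With near-instant inference, wall-clock time is dominated by network overhead,
    # not by generation -- which is the point above.
    with ThreadPoolExecutor(max_workers=20) as pool:
        results = list(pool.map(ask, tasks))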
xi_studio 20 hours ago|||
Agents already operate past human reading speed; if a model can go from input to output instantly, you can also loop it and run through long chained tasks near-instantly.
cheema33 22 hours ago||
Yes. You can allow multiple people to use a single chip. A slower solution will be able to service far fewer users.
xnx 21 hours ago||
Right, but it is also possible it's cheaper to use 42 Google TPUs for a second than one of these.
brainless 21 hours ago||
I know it is not easy to see the benefits of small models, but this is what I am building for (1). I created a product for the Google Gemini 3 Hackathon using Gemini 3 Flash (2). I tested locally with Ministral 3B and it was promising. It will definitely need work, but 8B/14B models may give awesome results.

I am building data extraction software on top of emails, attachments, and cloud/local files. I use reverse template generation, with only the variable translation done by LLMs (3); a toy sketch of that split follows the links below. Small models are awesome for this (4).

I just applied for API access. If privacy policies are a fit, I would love to enable this for MVP launch.

1. https://github.com/brainless/dwata

2. https://youtu.be/Uhs6SK4rocU

3. https://github.com/brainless/dwata/tree/feature/reverse-temp...

4. https://github.com/brainless/dwata/tree/feature/reverse-temp...
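A toy sketch of the split mentioned above, not the actual dwata code (the template, field names, and endpoint here are invented for illustration):

    import json
    import requests

    ENDPOINT = "http://localhost:8000/v1/chat/completions"  # placeholder
    MODEL = "ministral-3b"                                   # small model, as in the local test

    # Invented template; the real templates come from the reverse-generation step.
    TEMPLATE = "Hi {name}, your order {order_id} ships on {ship_date}."

    def fill_variables(email_body: str) -> dict:
        """Only the variable mapping goes through the LLM; the template stays fixed."""
        prompt = (
            f"Template: {TEMPLATE}\n"
            f"Email: {email_body}\n"
            "Return JSON with the values of name, order_id and ship_date only."
        )
        resp = requests.post(ENDPOINT, json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,
        }, timeout=30)
        resp.raise_for_status()
        return json.loads(resp.json()["choices"][0]["message"]["content"])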

PeterStuer 18 hours ago||
Not sure, but is this just an ASIC for a particular model release?
Adexintart 1 day ago||
The token throughput improvements are impressive. This has direct implications for usage-based billing in AI products — faster inference means lower cost per request, which changes the economics of credits-based pricing models significantly.
stego-tech 1 day ago||
I still believe this is the right - and inevitable - path for AI, especially as I use more premium AI tooling and evaluate its utility (I’m still a societal doomer on it, but even I gotta admit its coding abilities are incredible to behold, albeit lacking in quality).

Everyone in Capital wants the perpetual rent-extraction model of API calls and subscription fees, which makes sense given how well it worked in the SaaS boom. However, as Taalas points out, new innovations often end up consumed closer to the point of service rather than in monopolized centers, and I expect AI to be no different. When it’s being used sparsely for odd prompts or agentically to produce larger outputs, local (or near-local) inferencing is the inevitable end goal: if a model like Qwen or Llama, running on an affordable accelerator at home or on the office server, can output something close to Opus or Codex, then why bother with the subscription fees or API bills? That compounds when technical folks (hi!) point out that any process done agentically can instead be output once as software and repeated indefinitely, maintained by existing technical talent on the same accelerator you bought with CapEx, rather than by a fleet of pricey AI seats billed as OpEx.

The big push seems to be building processes dependent upon recurring revenue streams, but I’m gradually seeing more and more folks work the slop machines for the output they want and then put it away or cancel their sub. I think Taalas - conceptually, anyway - is on to something.

niek_pas 1 day ago||
> Though society seems poised to build a dystopian future defined by data centers and adjacent power plants, history hints at a different direction. Past technological revolutions often started with grotesque prototypes, only to be eclipsed by breakthroughs yielding more practical outcomes.

…for a privileged minority, yes, and to the detriment of billions of people whose names the history books conveniently forget. AI, like past technological revolutions, is a force multiplier for both productivity and exploitation.

clbrmbr 1 day ago|
What would it take to put Opus on a chip? Can it be done? What’s the minimum size?
cheema33 22 hours ago|
Maybe not today. Opus is quite large. This demo works with a very small 8B model. But, maybe one day. Hopefully soon. Opus on a chip would be very awesome, even if it can never be upgraded.

Someone mentioned that maybe we'd see a future where these things come in something like Nintendo cartridges. Want a newer model? Pop in the right cartridge.
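For a rough sense of the size gap, back-of-the-envelope only (Opus's parameter count isn't public; the bytes-per-weight figure and the 400B number below are assumptions):

    # Back-of-the-envelope weight storage. 0.5 bytes/param assumes 4-bit weights;
    # 400e9 is only a placeholder for "frontier scale", not Opus's real size.
    def weight_gb(params: float, bytes_per_param: float = 0.5) -> float:
        return params * bytes_per_param / 1e9

    print(weight_gb(8e9))    # ~4 GB   -- roughly the 8B-class model in this demo
    print(weight_gb(400e9))  # ~200 GB -- a hypothetical frontier-scale model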
