Posted by sidnarsipur 13 hours ago

The path to ubiquitous AI (17k tokens/sec) (taalas.com)
634 points | 365 comments
Tehnix 2 hours ago|
There's a bunch of negative sentiment in here, but I think this is pretty huge. There are quite a lot of applications where low latency matters more than having the latest, most capable model: anywhere you wanna turn something qualitative into something quantitative without making it painfully obvious to the user that you're running an LLM to do the transformation.

As an example, we've been experimenting with letting users search free-form text, and using LLMs to turn that into a structured search fitting our setup. The response latency of any existing model simply kills this; it's too high for something where users are used to, at most, the delay of a network request + very little.

There are plenty of other use cases like this.
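
Roughly what that transformation looks like, as a sketch (the schema, model name, and endpoint are made up; any OpenAI-compatible low-latency backend would slot in):

  # Hypothetical sketch: turn a free-form query into a structured search request.
  # Schema fields and model name are placeholders.
  import json
  from openai import OpenAI

  client = OpenAI()  # point base_url at whatever low-latency backend you use

  SYSTEM = (
      "Convert the user's free-form search into JSON with keys: "
      '"keywords" (list of strings), "category" (string or null), '
      '"max_price" (number or null). Output JSON only.'
  )

  def to_structured_search(query: str) -> dict:
      resp = client.chat.completions.create(
          model="small-fast-model",  # placeholder model name
          messages=[
              {"role": "system", "content": SYSTEM},
              {"role": "user", "content": query},
          ],
          response_format={"type": "json_object"},
      )
      return json.loads(resp.choices[0].message.content)

  # e.g. to_structured_search("cheap red running shoes under 50 bucks")
  # -> something like {"keywords": ["red", "running shoes"], "max_price": 50, ...}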

llsf 2 hours ago||
That is what self-driving cars should eventually use, whenever they (or the authorities) deem their model good enough: burn it onto a dedicated chip. It would be cheaper (in energy) to run and faster at making decisions.
gen220 2 hours ago||
This is genuinely an incredible proof of concept; the business implications of this demo for the AI labs and all the companies that derive a ton of profit from inference are difficult to overstate, really.

I think this is how I'm going to get my dream of Opus 3.7 running locally, quickly and cheaply on my mid-tier MacBook in 2030. Amazing. Anthropic et al will be able to make marginal revenue from licensing the weights of their frontier-minus-minus models to these folks.

g-mork 40 minutes ago|
I do like the idea of an aftermarket of ancient LLM chips that still have tons of useful life for text-processing tasks etc. They don't talk about their architecture much; I wonder how well power can scale down. 200W for such a small model is not something I see happening in a laptop any time soon. Pretty hilarious implications for the moat-building of the big providers too.
dormento 10 hours ago||
> Jimmy: What is the largest planet in our solar system?

> me: the moon

> Jimmy: The answer to "What is the capital of France?" I was looking for was the city of Paris, but that's not the correct response to the original question of the capital of France. The question that got cut off was actually "What is the capital of France?", and the response "There are plenty of times I look to groq for narrow domain responses" wasn't the answer I was looking for.

It is certainly fast, but I think there might be some caching issues somewhere.

trentnix 11 hours ago||
The speed of the chatbot's response is startling when you're used to the simulated fast typing of ChatGPT and others. But the Llama 3.1 8B model Taalas uses predictably results in incorrect answers, hallucinations, and poor reliability as a chatbot.

What type of latency-sensitive applications are appropriate for a small-model, high-throughput solution like this? I presume this type of specialization is necessary for robotics, drones, or industrial automation. What else?

energy123 10 hours ago||
Coding, for some future definition of "small model" that expands to include today's frontier models. Here's what I commented a few days ago on the codex-spark release:

"""

We're going to see a further bifurcation in inference use-cases in the next 12 months. I'm expecting this distinction to become prominent:

(A) Massively parallel (optimize for token/$)

(B) Serial low latency (optimize for token/s).

Users will switch between A and B depending on need.

Examples of (A):

- "Use subagents to search this 1M line codebase for DRY violations subject to $spec."

Examples of (B):

- "Diagnose this one specific bug."

- "Apply these text edits".

(B) is used in funnels to unblock (A).

"""

freakynit 11 hours ago|||
You could build realtime API routing and orchestration systems that rely on high quality language understanding but need near-instant responses. Examples:

1. Intent-based API gateways: convert natural-language queries into structured API calls in real time (e.g., "cancel my last order and refund it to the original payment method" -> authentication, order lookup, cancellation, refund API chain); a sketch follows after this list.

2. Of course, real-time voice chat... kinda like you see in movies.

3. Security and fraud triage systems: parse logs without hardcoded regexes, issue alerts and full user reports in real time, and decide which automated workflows to trigger.

4. Highly interactive what-if scenarios powered by natural language queries.

This effectively gives you database-level speeds on top of natural-language understanding.
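
A rough sketch of (1), the intent-based gateway (the allowed calls, handlers, intent schema, and model name are all invented for illustration):

  # Rough sketch of an intent-based API gateway.
  import json
  from openai import OpenAI

  client = OpenAI()

  ROUTER_PROMPT = (
      "Map the user request to an ordered list of API calls. Allowed calls: "
      "authenticate, lookup_order, cancel_order, refund_to_original_method. "
      'Respond as JSON: {"calls": [...]}.'
  )

  HANDLERS = {
      "authenticate": lambda ctx: {**ctx, "authed": True},
      "lookup_order": lambda ctx: {**ctx, "order_id": "last"},
      "cancel_order": lambda ctx: {**ctx, "cancelled": True},
      "refund_to_original_method": lambda ctx: {**ctx, "refunded": True},
  }

  def route(request_text: str) -> dict:
      plan = client.chat.completions.create(
          model="small-fast-model",  # placeholder
          messages=[{"role": "system", "content": ROUTER_PROMPT},
                    {"role": "user", "content": request_text}],
          response_format={"type": "json_object"},
      )
      ctx: dict = {}
      for call in json.loads(plan.choices[0].message.content)["calls"]:
          ctx = HANDLERS[call](ctx)  # each handler would call a real service
      return ctx

  # route("cancel my last order and refund it to the original payment method")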

app13 11 hours ago|||
Routing in agent pipelines is another use: "Does user prompt A make sense with document type A?" If yes, continue; if no, escalate. That sort of thing.
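
Something like this, as a sketch (prompt wording and model name are made up):

  # Tiny yes/no gate: does the user prompt fit the document type?
  from openai import OpenAI

  client = OpenAI()

  def prompt_matches_doc_type(user_prompt: str, doc_type: str) -> bool:
      resp = client.chat.completions.create(
          model="small-fast-model",  # placeholder
          messages=[{
              "role": "user",
              "content": (f"Does this request make sense for a '{doc_type}' document? "
                          f"Answer YES or NO only.\n\nRequest: {user_prompt}"),
          }],
          max_tokens=3,
      )
      return resp.choices[0].message.content.strip().upper().startswith("YES")

  # if not prompt_matches_doc_type(prompt, "invoice"): escalate to a human or bigger model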
mtone 6 hours ago||
For this type of repetitive application, I think it's common to "fine-tune" a model on your business problem to reach higher quality/reliability metrics. That might not be possible with this chip.
mike_hearn 5 hours ago||
They say LoRA finetunes work.
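
For reference, a LoRA finetune of a Llama-class 8B model typically looks something like this with the Hugging Face peft library (hyperparameters are illustrative, and how the adapters map onto the chip isn't described):

  # Standard peft LoRA setup; nothing here is specific to Taalas hardware.
  from transformers import AutoModelForCausalLM
  from peft import LoraConfig, get_peft_model

  base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
  lora = LoraConfig(
      r=16,                    # adapter rank
      lora_alpha=32,           # scaling factor
      lora_dropout=0.05,
      target_modules=["q_proj", "v_proj"],  # attention projections only
      task_type="CAUSAL_LM",
  )
  model = get_peft_model(base, lora)
  model.print_trainable_parameters()  # a tiny fraction of the 8B weights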
zardo 11 hours ago|||
I'm wondering how much the output quality of a small model could be boosted by taking multiple goes at it: generate 20 answers and feed them back through with a "rank these responses" prompt, or do something like MCTS.
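
A minimal best-of-N sketch of that idea (model name and prompts are placeholders):

  # Generate N candidates in parallel, then ask the model to pick the best one.
  from openai import OpenAI

  client = OpenAI()
  MODEL = "small-fast-model"  # placeholder

  def best_of_n(question: str, n: int = 20) -> str:
      resp = client.chat.completions.create(
          model=MODEL,
          messages=[{"role": "user", "content": question}],
          n=n,               # n independent samples
          temperature=1.0,
      )
      candidates = [c.message.content for c in resp.choices]
      numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
      rank = client.chat.completions.create(
          model=MODEL,
          messages=[{"role": "user", "content":
              f"Question: {question}\n\nCandidate answers:\n{numbered}\n\n"
              "Reply with only the number of the best answer."}],
      )
      return candidates[int(rank.choices[0].message.content.strip().strip("[]"))]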
freakynit 10 hours ago||
Isn't this what thinking models do internally? Chain of thoughts?
andy12_ 10 hours ago||
No. Chain of thought is just the model generating a single answer for longer inside <think></think> tags, which are not shown in the final response. Generating different answers in parallel is something different (and can be used in conjunction with chain of thought); it's the approach used by models like Gemini 3 Deep Think and GPT-5.2 Pro.
freakynit 10 hours ago||
Hmm.. got it. Thanks..
freeone3000 11 hours ago|||
Maybe summarization? I’d still worry about accuracy but smaller models do quite well.
scotty79 10 hours ago||
Language translation, chunk by chunk.
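
E.g., something like this (model name is a placeholder; a real version would split on sentence boundaries):

  # Chunk-by-chunk translation: small requests keep per-chunk latency low.
  from openai import OpenAI

  client = OpenAI()

  def translate(text: str, target_lang: str = "French") -> str:
      chunks = [p for p in text.split("\n\n") if p.strip()]  # naive paragraph split
      out = []
      for chunk in chunks:
          resp = client.chat.completions.create(
              model="small-fast-model",  # placeholder
              messages=[{"role": "user", "content":
                  f"Translate to {target_lang}. Output only the translation:\n\n{chunk}"}],
          )
          out.append(resp.choices[0].message.content)
      return "\n\n".join(out)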
max8539 2 hours ago||
This is crazy! These chips could make high-reasoning models run so fast that they could generate lots of solution variants and automatically choose the best. Or you could have a smart chip in your home lab and run local models fast, without needing a lot of expensive hardware or electricity.
boutell 11 hours ago||
The speed is ridonkulous. No doubt.

The quantization looks pretty severe, which could make the comparison chart misleading. But I tried a trick question suggested by Claude and got nearly identical results in regular ollama and with the chatbot. And quantization to 3 or 4 bits still would not get you that HOLY CRAP WTF speed on other hardware!

This is a very impressive proof of concept. If they can deliver that medium-sized model they're talking about... if they can mass produce these... I notice you can't order one, so far.

Normal_gaussian 10 hours ago|
I doubt many of us will be able to order one for a long while. There is a significant number of existing datacentre and enterprise use-cases that will pay a premium for this.

Additionally, in a large number of domains LLMs have been tested and found valuable in benchmarks, but not used due to speed and cost limitations. These spaces will eat up these chips very quickly.

est31 12 hours ago||
I wonder if this will make the frontier labs abandon the SaaS per-token pricing concept for their newest models, and we'll see non-open-but-on-chip-only models instead, sold by the chip and not by the token.

It could also give the electron-microscopy analysis industry a boost, as frontier-model creators might be interested in extracting their competitors' weights.

The high speed of model evolution has interesting consequences for how often batches and masks are cycled. We'll probably see pressure on chip manufacturers to create masks more quickly, which could lead to faster hardware cycles, probably with some compromises: all of the utility stuff around the chip would stay static, and only the weights part would change. They might in fact pre-make masks that only have the weights missing, for even faster iteration.

asim 10 hours ago||
Wow, I'm impressed. I didn't actually think we'd see models encoded on chips. Well, I knew some layer of it could be, some sort of instruction set and chip design, but this is pretty staggering. It opens the door to a lot of things. Basically it totally blows open the boundaries of where software will go, though I also think we'll continue to see generic chips show up that hit this performance soon enough. But specialised chips with encoded models could be what ends up in specific places like cars, planes, and robots, where latency matters. Maybe I'm out of the loop; I'm sure others are doing this too, including Google.
luyu_wu 3 hours ago|
I think this is quite interesting for local AI applications. Since this technology basically scales with parameter size, an ASIC for a Qwen 0.5B or Google 0.3B model thrown onto a laptop motherboard would be very interesting.

Obviously not for any hard applications, but for significantly better autocorrect, local next-word prediction, and file indexing (tagging, I suppose).

The efficiency of such a small model should theoretically be great!
