Posted by sidnarsipur 16 hours ago

The path to ubiquitous AI (17k tokens/sec) (taalas.com)
665 points | 380 comments
notsylver 12 hours ago|
I always thought eventually someone would come along and make a hardware accelerator for LLMs, but I thought it would be like Google TPUs where you can load up whatever model you want. Baking the model into hardware sounds like the monkey's paw curled, but it might be interesting selling an old... MPU...? because it wasn't smart enough for your latest project
troyvit 11 hours ago||
So they create a new chip for every model they want to support, is that right? Looking at that from 2026, when new large models are coming out every week, that seems troubling, but that's also a surface take. As many people here know better than I do, a lot of the new models the big guys release are just incremental changes with little optimization going into how they're used, so maybe there's plenty of room for a model-as-hardware model.

Which brings me to my second thing. We mostly pitch the AI wars as OpenAI vs Meta vs Claude vs Google vs etc. But another take is the war between open, locally run models and SaaS models, which really is about the war for general computing. Maybe a business model like this is a great tool to help keep general computing in the fight.

gordonhart 9 hours ago||
We’re reaching a saturation threshold where older models are good enough for many tasks, certainly at 100x faster inference speeds. Llama3.1 8B might be a little too old to be directly useful for e.g. coding but it certainly gets the gears turning about what you could do with one Opus orchestrator and a few of these blazing fast minions to spit out boilerplate…
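
Something like this split is what I'm imagining. Purely a sketch: the endpoints and model names are placeholders, and it assumes the fast chip sits behind an OpenAI-compatible API.

    # Sketch only: a big "planner" model decides what boilerplate is needed,
    # fast baked-in minions write it. Endpoints and model names are made up.
    from openai import OpenAI

    orchestrator = OpenAI()                                  # the slow, smart model
    minion = OpenAI(base_url="http://fast-chip.local/v1",    # hypothetical fast endpoint
                    api_key="unused")

    def build_feature(spec: str) -> list[str]:
        # 1. The orchestrator turns the spec into a list of boilerplate items.
        plan = orchestrator.chat.completions.create(
            model="big-orchestrator",                        # placeholder name
            messages=[{"role": "user", "content":
                       f"List the boilerplate files needed for: {spec}, one per line."}],
        ).choices[0].message.content.splitlines()

        # 2. Fan the grunt work out to the fast minions, one item at a time.
        files = []
        for item in plan:
            files.append(minion.chat.completions.create(
                model="llama-3.1-8b",                        # the model baked into the chip
                messages=[{"role": "user", "content": f"Write this file: {item}"}],
            ).choices[0].message.content)
        return files
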
g-mork 9 hours ago||
One of these things, however old, coupled with robust tool calling is a chip that could remain useful for decades. Baking in incremental updates of world knowledge isn't all that useful. It's kinda horrifying if you think about it, this chip among other things contains knowledge of Donald Trump encoded in silicon. I think this is a way cooler legacy for Melania than the movie haha.
jtr1 10 hours ago||
The demo was so fast it highlighted a UX component of LLMs I hadn’t considered before: there’s such a thing as too fast, at least in the chatbot context. The demo answered with a page of text so fast I had to scroll up every time to see where it started. It completely broke the illusion of conversation where I can usually interrupt if we’re headed in the wrong direction. At least in some contexts, it may become useful to artificially slow down the delivery of output or somehow tune it to the reader’s speed based on how quickly they reply. TTS probably does this naturally, but for text-based interactions it's still a thing to think about.
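
Even something naive like pacing the already-finished answer to reading speed would help. Just a sketch, and the words-per-minute figure is a guess:

    import sys, time

    def paced_print(text: str, wpm: int = 300) -> None:
        # Drip out text the model produced instantly at roughly reading speed.
        delay = 60.0 / wpm                 # seconds per word; 300 wpm is a guess
        for word in text.split():
            sys.stdout.write(word + " ")
            sys.stdout.flush()
            time.sleep(delay)
        print()
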
andai 14 hours ago||
>Founded 2.5 years ago, Taalas developed a platform for transforming any AI model into custom silicon. From the moment a previously unseen model is received, it can be realized in hardware in only two months.

So this is very cool, though I'm not sure how the economics work out. 2 months is a long time in the model space. Although for many tasks, the models are now "good enough", especially when you put them in a "keep trying until it works" loop and run them at high inference speed.
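
The loop I mean is basically this (sketch only; `generate` stands in for whatever client hits the fast model, and the pytest check is just one example of a pass/fail signal):

    import subprocess
    from typing import Callable, Optional

    def retry_until_tests_pass(generate: Callable[[str], str], prompt: str,
                               max_tries: int = 20) -> Optional[str]:
        # "Keep trying until it works": cheap when each attempt takes milliseconds.
        for _ in range(max_tries):
            code = generate(prompt)
            with open("attempt.py", "w") as f:
                f.write(code)
            result = subprocess.run(["python", "-m", "pytest", "-q", "attempt.py"],
                                    capture_output=True, text=True)
            if result.returncode == 0:
                return code                                  # tests pass, keep it
            prompt += f"\n\nTests failed:\n{result.stdout[-2000:]}\nFix it and try again."
        return None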

Seems like a chip would only be good for a few months though, so they'd have to be upgrading them on a regular basis.

Unless model growth plateaus, or we exceed "good enough" for the relevant tasks, or both. The latter part seems quite likely, at least for certain types of work.

On that note I've shifted my focus from "best model" to "fastest/cheapest model that can do the job". For example, testing Gemini Flash against Gemini Pro on simple tasks: both complete the task fine, but Flash does it 3x cheaper and 3x faster. (Also had good results with Grok Fast in that category of bite-sized "realtime" workflows.)
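
In practice that turns into a little escalation ladder, roughly like this (illustrative only; `models` and `good_enough` are whatever clients and checks you already have):

    from typing import Callable

    def cheapest_that_works(task: str,
                            models: list[Callable[[str], str]],
                            good_enough: Callable[[str], bool]) -> str:
        # Try the cheap/fast model first (e.g. Flash), escalate (e.g. to Pro)
        # only when the output fails a cheap check like "parses as JSON".
        answer = ""
        for model in models:                # ordered cheapest -> most capable
            answer = model(task)
            if good_enough(answer):
                return answer
        return answer                       # fall back to the best model's answer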

FieryTransition 15 hours ago||
If it's not reprogrammable, it's just expensive glass.

If you etch the bits into silicon, you have to pay for every bit in physical area, which is set by the transistor density of whatever modern process they use. That gives you a lower bound on the size of the wafers.

That can mean huge wafers for a single fixed model, which may already be old by the time it's finalized.
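
Back-of-envelope, using SRAM cell area as a stand-in for "a bit needs physical area"; all numbers are rough guesses, not Taalas's actual process figures or storage scheme:

    params = 8e9              # Llama-3.1-8B
    bits_per_weight = 4       # assume 4-bit quantization
    sram_cell_um2 = 0.021     # ballpark bit-cell area on a ~5nm-class node

    bits = params * bits_per_weight
    area_mm2 = bits * sram_cell_um2 / 1e6          # um^2 -> mm^2
    print(f"{bits / 8 / 1e9:.0f} GB of weights -> ~{area_mm2:.0f} mm^2 of storage cells")
    # ~672 mm^2 before any logic or routing, i.e. already near the reticle limit.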

Etching generic functions used in ML and common fused kernels would seem much more viable, as they could be used as building blocks.

audunw 14 hours ago||
Models don’t get old as fast as they used to. A lot of the improvements seem to go into making the models more efficient, or the infrastructure around the models. If newer models mainly compete on efficiency it means you can run older models for longer on more efficient hardware while staying competitive.

If power costs are significantly lower, they can pay for themselves by the time they are outdated. It also means you can run more instances of a model in one datacenter, and that seems to be a big challenge these days: simply building enough data centres and getting power to them. (See the ridiculous plans for building data centres in space.)
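
Toy version of that payback argument; every figure below is an assumption picked for illustration, not a vendor number:

    watts_gpu, watts_asic = 700.0, 100.0    # assumed draw for comparable throughput
    usd_per_kwh = 0.10                      # assumed industrial electricity rate

    kwh_saved = (watts_gpu - watts_asic) / 1000 * 24 * 365
    print(f"~{kwh_saved:.0f} kWh/yr -> ~${kwh_saved * usd_per_kwh:.0f}/yr saved per unit")
    # Whether that covers the price delta before the model is obsolete depends
    # entirely on the real numbers, which is exactly the gamble here.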

A huge part of the cost of making chips is the masks. The transistor masks are expensive; metal masks less so.

I figure they will eventually freeze the transistor layer and use metal masks to reconfigure the chips when the new models come out. That should further lower costs.

I don’t really know if this makes sense. Depends on whether we get new breakthroughs in LLM architecture or not. It’s a gamble essentially. But honestly, so is buying Nvidia Blackwell chips for inference. I could see them getting uneconomical very quickly if any of the alternative inference-optimised hardware pans out.

FieryTransition 10 hours ago|||
From my own experience, models are at the tipping point of being useful for prototyping software, and those are very large frontier models that aren't feasible to get onto wafers unless someone does something smart.

I really don't like the hallucination rate for most models but it is improving, so that is still far in the future.

What I could see, though, is the whole unit they made being power-efficient enough to run on a robotics platform for human-computer interaction.

It makes sense that they would try to make their tech as repurposable as they can, since making changes is fraught with long lead times and risk.

But if we look long term and pretend that they get it to work, they just need to stay afloat until better smaller models can be made with their technology, so it becomes a waiting game for investors and a risk assessment.

johnsimer 12 hours ago|||
“Models don’t get old as fast as they used to”

^^^ I think the opposite is true

Anthropic and OpenAI are releasing new versions every 60-90 days it seems now, and you could argue they’re going to start releasing even faster

robotpepi 11 hours ago||
Are they becoming better at the same rate as before though?
Ancapistani 3 hours ago|||
Per release, I’d say no.

Per period of time, I’d say yes.

FieryTransition 10 hours ago||||
In my unscientific experience, yes, but being better at a certain rate is hard to really quantify, unless you just pull some random benchmark numbers.
otabdeveloper4 7 hours ago||||
No.
turnsout 7 hours ago|||
yes, pretty much
booli 13 hours ago|||
Reading the in-depth article also linked in this thread, they say that only 2 layers need to change most of the time. They claim from new model to PCB in 2 months. Let's see, but it sounds promising.
MagicMoonlight 14 hours ago||
You don’t need it to be reprogrammable if it can use tools and RAG.
gchadwick 14 hours ago||
This is an interesting piece of hardware, though when they go multi-chip for larger models the speed will no doubt suffer.

They'll also be severely limited on context length as it needs to sit in SRAM. Looks like the current one tops out at 6144 tokens, which I presume is a whole chip's worth. You'd also have to dedicate a chip to a whole user as there's likely only enough SRAM for one user's worth of context. I wonder how much time it takes them to swap users in/out? I wouldn't be surprised if this chip is severely underutilized (you can't use it all when running decode, as you have to run token by token with one user and then sit idle while you swap users in/out).

Maybe a more realistic deployment would have chips for linear layers and chips for attention? You could batch users through the shared-weight chips and then provision more or fewer attention chips as you want, which would be per user (or shared amongst a small group of 2-4 users).
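
Rough numbers for why the context is so tight, assuming a Llama-3.1-8B-shaped KV cache (the 8-bit cache precision is my assumption):

    layers, kv_heads, head_dim = 32, 8, 128   # Llama-3.1-8B attention shape (GQA)
    bytes_per_value = 1                       # assume an 8-bit KV cache
    ctx = 6144

    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value   # K and V
    total_mb = per_token * ctx / 2**20
    print(f"{per_token // 1024} KB per token -> {total_mb:.0f} MB for a {ctx}-token context")
    # ~64 KB/token, ~384 MB per user: a big chunk of on-die SRAM, and most of
    # the reason a chip probably holds only one user's context at a time.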

soleveloper 14 hours ago||
There are so many use cases for small, super fast models that already fit within this size capacity:

* Many top-quality TTS and STT models

* Image recognition, object tracking

* Speculative decoding, attached to a much bigger model (big/small architecture? see the sketch after this list)

* agentic loop trying 20 different approaches / algorithms, and then picking the best one

* edited to add! Put 50 such small models to create a SOTA super fast model
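
For the speculative decoding item above, the greedy version is roughly this. Toy sketch only: `draft_next` and `target_next` are placeholders for the two models, and a real implementation verifies the drafted tokens in one batched pass rather than one call at a time.

    from typing import Callable, List

    def speculative_step(tokens: List[int],
                         draft_next: Callable[[List[int]], int],
                         target_next: Callable[[List[int]], int],
                         k: int = 8) -> List[int]:
        # The fast draft model guesses k tokens ahead...
        draft = list(tokens)
        for _ in range(k):
            draft.append(draft_next(draft))
        # ...and the big model keeps only the prefix it agrees with.
        out = list(tokens)
        for guess in draft[len(tokens):]:
            tok = target_next(out)          # big model's own pick at this position
            out.append(tok)
            if tok != guess:                # first disagreement: stop trusting the draft
                break
        return out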

xnorswap 14 hours ago||
Asking it what its knowledge cut-off is is interesting; it doesn't seem to be consistent even within a single response. Sometimes it responds saying it cuts off in 2020, too.

    You
    What is your knowledge cut-off?

    Jimmy
    My knowledge cut-off is 2022, which means that my training data is current up to 2021, but 
    I have been trained on a dataset that is updated periodically. If you have any specific
    questions about events or topics that occurred after 2021, I may not have information on 
    those topics. However, I can still provide general information and context on those topics 
    to help guide further research.

The instantaneous response is impressive though. I'm sure there will be applications for this, I just lack the imagination to know what they'll be.
wat10000 13 hours ago|
Sounds like they aren’t providing a system prompt which tells it what its knowledge cutoff is. It doesn’t know unless it’s told. Given the limited context, it makes sense they wouldn’t use up valuable space for that.
mike_hearn 9 hours ago||
The system prompt is configurable in the sidebar and minimal. It doesn't give a knowledge cutoff. This is a tech demo of the fact it works at all, it's not meant to be a good chatbot.
piker 13 hours ago||
The company slogan is great: "The Model is The Computer"

It's an homage to Jensen: "The display is the computer"

https://www.wired.com/2002/07/nvidia/

mips_avatar 15 hours ago|
I think the thing that makes 8B-sized models interesting is the ability to train in unique custom domain knowledge, and this is the opposite of that. Like, if you could deploy any 8B-sized model on it and be this fast, that would be super interesting, but being stuck with Llama 3 8B isn't that interesting.
ACCount37 15 hours ago|
The "small model with unique custom domain knowledge" approach has a very low capability ceiling.

Model intelligence is, in many ways, a function of model size. A small model tuned for a given domain is still crippled by being small.

Some things don't benefit from general intelligence much. Sometimes a dumb narrow specialist really is all you need for your tasks. But building that small specialized model isn't easy or cheap.

Engineering isn't free, models tend to grow obsolete as the price/capability frontier advances, and AI specialists are less of a commodity than AI inference is. I'm inclined to bet against approaches like this on principle.

matu3ba 5 hours ago|||
> Engineering isn't free, models tend to grow obsolete as the price/capability frontier advances, and AI specialists are less of a commodity than AI inference is. I'm inclined to bet against approaches like this on principle.

This does not sound like it will simplify the training and data side, unless their or subsequent models can somehow be efficiently utilized for that. However, this development may lead to (open source) hardware and distributed-system compilation, EDA tooling, bus system design, etc. getting more deserved attention and funding. In turn, new hardware may lead to more competition in training and data instead of the current NVIDIA model-training monopoly. So I think you're correct for ~5 years.

mips_avatar 3 hours ago|||
A fine-tuned 1.7B model probably is still too crippled to do anything useful. But around 8B the capabilities really start to change. I'm also extremely unemployed right now so I can provide the engineering.