The path to ubiquitous AI (17k tokens/sec)

Posted by sidnarsipur 11 hours ago

The path to ubiquitous AI (17k tokens/sec)(taalas.com)

634 points | 365 comments

dust42 11 hours ago|

This is not a general purpose chip but specialized for high speed, low latency inference with small context. But it is potentially a lot cheaper than Nvidia for those purposes.

Tech summary:

  - 15k tok/sec on 8B dense 3bit quant (llama 3.1) 
  - limited KV cache
  - 880mm^2 die, TSMC 6nm, 53B transistors
  - presumably 200W per chip
  - 20x cheaper to produce
  - 10x less energy per token for inference
  - max context size: flexible
  - mid-sized thinking model upcoming this spring on same hardware
  - next hardware supposed to be FP4 
  - a frontier LLM planned within twelve months

This is all from their website, I am not affiliated. The founders have 25 years of career across AMD, Nvidia and others, $200M VC so far.

Certainly interesting for very low latency applications which need < 10k tokens context. If they deliver in spring, they will likely be flooded with VC money.

Not exactly a competitor for Nvidia but probably for 5-10% of the market.

Back of napkin, the cost for 1mm^2 of 6nm wafer is ~$0.20. So 1B parameters need about $20 of die. The larger the die size, the lower the yield. Supposedly the inference speed remains almost the same with larger models.

Interview with the founders: https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...

vessenes 10 hours ago||

This math is useful. Lots of folks scoffing in the comments below. I have a couple reactions, after chatting with it:

1) 16k tokens / second is really stunningly fast. There’s an old saying about any factor of 10 being a new science / new product category, etc. This is a new product category in my mind, or it could be. It would be incredibly useful for voice agent applications, realtime loops, realtime video generation, .. etc.

2) https://nvidia.github.io/TensorRT-LLM/blogs/H200launch.html Has H200 doing 12k tokens/second on llama 2 12b fb8. Knowing these architectures that’s likely a 100+ ish batched run, meaning time to first token is almost certainly slower than taalas. Probably much slower, since Taalas is like milliseconds.

3) Jensen has these pareto curve graphs — for a certain amount of energy and a certain chip architecture, choose your point on the curve to trade off throughput vs latency. My quick math is that these probably do not shift the curve. The 6nm process vs 4nm process is likely 30-40% bigger, draws that much more power, etc; if we look at the numbers they give and extrapolate to an fp8 model (slower), smaller geometry (30% faster and lower power) and compare 16k tokens/second for taalas to 12k tokens/s for an h200, these chips are in the same ballpark curve.

However, I don’t think the H200 can reach into this part of the curve, and that does make these somewhat interesting. In fact even if you had a full datacenter of H200s already running your model, you’d probably buy a bunch of these to do speculative decoding - it’s an amazing use case for them; speculative decoding relies on smaller distillations or quants to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model.

Upshot - I think these will sell, even on 6nm process, and the first thing I’d sell them to do is speculative decoding for bread and butter frontier models. The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.

I hope these guys make it! I bet the v3 of these chips will be serving some bread and butter API requests, which will be awesome.

rbanffy 8 hours ago|||

> any factor of 10 being a new science / new product category,

I often remind people two orders of quantitative change is a qualitative change.

> The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.

The real product they have is automation. They figured out a way to compile a large model into a circuit. That's, in itself, pretty impressive. If they can do this, they can also compile models to an HDL and deploy them to large FPGA simulators for quick validation. If we see models maturing at a "good enough" state, even a longer turnaround between model release and silicon makes sense.

While I also see lots of these systems running standalone, I think they'll really shine combined with more flexible inference engines, running the unchanging parts of the model while the coupled inference engine deals with whatever is too new to have been baked into silicon.

I'm concerned with the environmental impact. Chip manufacture is not very clean and these chips will need to be swapped out and replaced at a cadence higher than we currently do with GPUs.

ttul 6 hours ago|||

Having dabbled in VLSI in the early-2010s, half the battle is getting a manufacturing slot with TSMC. It’s a dark art with secret handshakes. This demonstrator chip is an enormous accomplishment.

vessenes 1 hour ago||

Yeah and a team I’m not familiar with — I didn’t check bios but they don’t lead with ‘our team made this or that gpu for this or that bigco’.

The design ip at 6nm is still tough; I feel like this team must have at least one real genius and some incredibly good support at tsmc. Or they’ve been waiting a year for a slot :)

dust42 1 hour ago||

From the article:

"Ljubisa Bajic desiged video encoders for Teralogic and Oak Technology before moving over to AMD and rising through the engineering ranks to be the architect and senior manager of the company’s hybrid CPU-GPU chip designs for PCs and servers. Bajic did a one-year stint at Nvidia as s senior architect, bounced back to AMD as a director of integrated circuit design for two years, and then started Tenstorrent."

His wife (COO) worked at Altera, ATI, AMD and Testorrent.

"Drago Ignjatovic, who was a senior design engineer working on AMD APUs and GPUs and took over for Ljubisa Bajic as director of ASIC design when the latter left to start Tenstorrent. Nine months later, Ignjatovic joined Tenstorrent as its vice president of hardware engineering, and he started Taalas with the Bajices as the startup’s chief technology officer."

Not a youngster gang...

VagabundoP 7 hours ago|||

There might be a foodchain of lower order uses when they become "obsolete".

rbanffy 5 hours ago||

I think there will be a lot of space for sensorial models in robotics, as the laws of physics don't change much, and a light switch or automobile controls have remained stable and consistent over the last decades.

Gareth321 9 hours ago||||

I think the next major innovation is going to be intelligent model routing. I've been exploring OpenClaw and OpenRouter, and there is a real lack of options to select the best model for the job and execute. The providers are trying to do that with their own models, but none of them offer everything to everyone at all times. I see a future with increasingly niche models being offered for all kinds of novel use cases. We need a way to fluidly apply the right model for the job.

nylonstrung 9 hours ago|||

Agree that routing is becoming the critical layer here. Vllm iris is really promising for this https://blog.vllm.ai/2026/01/05/vllm-sr-iris.html

There's already some good work on router benchmarking which is pretty interesting

condiment 7 hours ago||||

At 16k tokens/s why bother routing? We're talking about multiple orders of magnitude faster and cheaper execution.

Abundance supports different strategies. One approach: Set a deadline for a response, send the turn to every AI that could possibly answer, and when the deadline arrives, cancel any request that hasn't yet completed. You know a priori which models have the highest quality in aggregate. Pick that one.

IanCal 5 hours ago||

The best coding model won’t be the best roleplay one which won’t be the best at tool use. It depends what you want to do in order to pick the best model.

PhunkyPhil 4 hours ago||

I'm not saying you're wrong, but why is this the case?

I'm out of the loop on training LLMs, but to me it's just pure data input. Are they choosing to include more code rather than, say fiction books?

jmalicki 2 hours ago|||

There is the pre-training, where you passively read stuff from the web.

From there you go to RL training, where humans are grading model responses, or the AI is writing code to try to pass tests and learning how to get the tests to pass, etc. The RL phase is pretty important because it's not passive, and it can focus on the weaker areas of the model too, so you can actually train on a larger dataset than the sum of recorded human knowledge.

refulgentis 4 hours ago|||

I’ll go ahead and say they’re wrong (source: building and maintaining llm client with llama.cpp integrated & 40+ 3p models via http)

I desperately want there to be differentiation. Reality has shown over and over again it doesn’t matter. Even if you do same query across X models and then some form of consensus, the improvements on benchmarks are marginal and UX is worse (more time, more expensive, final answer is muddied and bound by the quality of the best model)

monooso 8 hours ago||||

I came across this yesterday. Haven't tried it, but it looks interesting:

https://agent-relay.com/

eshaham78 9 hours ago|||

[dead]

btown 9 hours ago||||

For speculative decoding, wouldn’t this be of limited use for frontier models that don’t have the same tokenizer as Llama 3.1? Or would it be so good that retokenization/bridging would be worth it?

Zetaphor 9 hours ago|||

My understanding as well is that speculative decoding only works with a smaller quant of the same model. You're using the faster sampling of the smaller models representation of the larger models weights in order to attempt to accurately predict its token output. This wouldn't work cross-model as the token probabilities are completely different.

jasonjmcghee 7 hours ago|||

This is not correct.

Families of model sizes work great for speculative decoding. Use the 1B with the 32B or whatever.

It's a balance as you want it to be guessing correctly as much as possible but also be as fast as possible. Validation takes time and every guess needs to be validated etc

The model you're using to speculate could be anything, but if it's not guessing what the main model would predict, it's useless.

ashirviskas 8 hours ago|||

Smaller quant or smaller model?

Afaik it can work with anything, but sharing vocab solves a lot of headaches and the better token probs match, the more efficient it gets.

Which is why it is usually done with same family models and most often NOT just different quantizations of the same model.

vessenes 8 hours ago|||

I think they’d commission a quant directly. Benefits go down a lot when you leave model families.

joha4270 10 hours ago||||

The guts of a LLM isn't something I'm well versed in, but

> to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model

suggests there is something I'm unaware of. If you compare the small and big model, don't you have to wait for the big model anyway and then what's the point? I assume I'm missing some detail here, but what?

connorbrinton 9 hours ago|||

Speculative decoding takes advantage of the fact that it's faster to validate that a big model would have produced a particular sequence of tokens than to generate that sequence of tokens from scratch, because validation can take more advantage of parallel processing. So the process is generate with small model -> validate with big model -> then generate with big model only if validation fails

More info:

* https://research.google/blog/looking-back-at-speculative-dec...

* https://pytorch.org/blog/hitchhikers-guide-speculative-decod...

sails 9 hours ago||

See also speculative cascades which is a nice read and furthered my understanding of how it all works

https://research.google/blog/speculative-cascades-a-hybrid-a...

speedping 9 hours ago||||

Verification is faster than generation, one forward pass for verification of multiple tokens vs a pass for every new token in generation

vanviegen 9 hours ago||||

I don't understand how it would work either, but it may be something similar to this: https://developers.openai.com/api/docs/guides/predicted-outp...

ml_basics 9 hours ago||||

They are referring to a thing called "speculative decoding" I think.

cma 9 hours ago|||

When you predict with the small model, the big model can verify as more of a batch and be more similar in speed to processing input tokens, if the predictions are good and it doesn't have to be redone.

empath75 8 hours ago|||

Think about this for solving questions in math where you need to explore a search space. You can run 100 of these for the same cost and time of doing one api call to open ai.

xnx 26 minutes ago|||

Hardware decoders make sense for fixed codecs like MPEG, but I can't see it making sense for small models that improve every 6 months.

soleveloper 9 hours ago|||

In 20$ a die, they could sell Gameboy style cartridges for different models.

noveltyaccount 7 hours ago|||

That would be very cool, get an upgraded model every couple of months. Maybe PCIe form factor.

soleveloper 2 hours ago||

Yes, and even holding couple of cartridges for different scenarios e.g image generation, coding, tts/stt, etc

pennomi 6 hours ago|||

Make them shaped like floppy disks to confuse the younger generations.

Aissen 9 hours ago|||

> 880mm^2 die

That's a lot of surface, isn't it? As big an M1 Ultra (2x M1 Max at 432mm² on TSMC N5P), a bit bigger than an A100 (820mm² on TSMC N7) or H100 (814mm² on TSMC N5).

> The larger the die size, the lower the yield.

I wonder if that applies? What's the big deal if a few parameter have a few bit flips?

rbanffy 9 hours ago||

> I wonder if that applies? What's the big deal if a few parameter have a few bit flips?

We get into the sci-fi territory where a machine achieves sentience because it has all the right manufacturing defects.

Reminds me of this https://en.wikipedia.org/wiki/A_Logic_Named_Joe

sowbug 7 hours ago|||

Also see Adrian Thompson's Xilinx 6200 FPGA, programmed by a genetic algorithm that worked but exploited nuances unique to that specific physical chip, meaning the software couldn't be copied to another chip. https://news.ycombinator.com/item?id=43152877

rbanffy 5 hours ago||

I love that story.

philipwhiuk 9 hours ago|||

2000s movie line territory:

> There have always been ghosts in the machine. Random segments of code, that have grouped together to form unexpected protocols.

elternal_love 10 hours ago|||

Were we go towards really smart roboters. It is interesting what kind of diferent model chips they can produce.

varispeed 10 hours ago||

There is nothing smart about current LLMs. They just regurgitate text compressed in their memory based on probability. None of the LLMs currently have actual understanding of what you ask them to do and what they respond with.

adamtaylor_13 8 hours ago|||

If LLMs just regurgitate compressed text, they'd fail on any novel problem not in their training data. Yet, they routinely solve them, which means whatever's happening between input and output is more than retrieval, and calling it "not understanding" requires you to define understanding in a way that conveniently excludes everything except biological brains.

sfn42 6 hours ago|||

Yes there are some fascinating emergent properties at play, but when they fail it's blatantly obvious that there's no actual intelligence nor understanding. They are very cool and very useful tools, I use them on a daily basis now and the way I can just paste a vague screenshot with some vague text and they get it and give a useful response blows my mind every time. But it's very clear that it's all just smoke and mirrors, they're not intelligent and you can't trust them with anything.

pennomi 6 hours ago||

When humans fail a task, it’s obvious there is no actual intelligence nor understanding.

Intelligence is not as cool as you think it is.

sfn42 6 hours ago||

I assure you, intelligence is very cool.

varispeed 8 hours ago||||

They don't solve novel problems. But if you have such strong belief, please give us examples.

ainch 2 hours ago||

Depends how precisely you define novel - I don't think LLMs are yet capable of posing and solving interesting problems, but they have been used to address known problems, and in doing so have contributed novel work. Examples include Erdos Problem #728[0] (Terence Tao said it was solved "more or less autonomously" by an LLM), IMO problems (Deepmind, OpenAI and Huang 2025), GPT-5.2 Pro contributing a conjecture in particle physics[1], systems like AlphaEvolve leveraging LLMs + evolutionary algorithms to generate new, faster algorithms for certain problems[2].

[0] https://mathstodon.xyz/@tao/115855840223258103

[1] https://huggingface.co/blog/dlouapre/gpt-single-minus-gluons

[2] https://deepmind.google/blog/alphaevolve-a-gemini-powered-co...

otabdeveloper4 2 hours ago|||

> they'd fail on any novel problem not in their training data

Yes, and that's exactly what they do.

No, none of the problems you gave to the LLM while toying around with them are in any way novel.

adamtaylor_13 1 hour ago||

None of my codebases are in their training data, yet they routinely contribute to them in meaningful ways. They write code that I'm happy with that improves the codebases I work in.

Do you not consider that novel problem solving?

bsenftner 9 hours ago||||

We know that, but that does not make them unuseful. The opposite in fact, they are extremely useful in the hands of non-idiots.We just happen to have a oversupply of idiots at the moment, which AI is here to eradicate. /Sort of satire.

visarga 8 hours ago||||

So you are saying they are like copy, LLMs will copy some training data back to you? Why do we spend so much money training and running them if they "just regurgitate text compressed in their memory based on probability"? billions of dollars to build a lossy grep.

I think you are confused about LLMs - they take in context, and that context makes them generate new things, for existing things we have cp. By your logic pianos can't be creative instruments because they just produce the same 88 notes.

small_model 10 hours ago||||

Thats not how they work, pro-tip maybe don't comment until you have a good understanding?

fyltr 10 hours ago|||

Would you mind rectifying the wrong parts then?

retsibsi 9 hours ago|||

Phrases like "actual understanding", "true intelligence" etc. are not conducive to productive discussion unless you take the trouble to define what you mean by them (which ~nobody ever does). They're highly ambiguous and it's never clear what specific claims they do or don't imply when used by any given person.

But I think this specific claim is clearly wrong, if taken at face value:

> They just regurgitate text compressed in their memory

They're clearly capable of producing novel utterances, so they can't just be doing that. (Unless we're dealing with a very loose definition of "regurgitate", in which case it's probably best to use a different word if we want to understand each other.)

mhl47 9 hours ago|||

The fact that the outputs are probabilities is not important. What is important is how that output is computed.

You could imagine that it is possible to learn certain algorithms/ heuristics that "intelligence" is comprised of. No matter what you output. Training for optimal compression of tasks /taking actions -> could lead to intelligence being the best solution.

This is far from a formal argument but so is the stubborn reiteration off "it's just probabilities" or "it's just compression". Because this "just" thing is getting more an more capable of solving tasks that are surely not in the training data exactly like this.

100721 10 hours ago|||

Huh? Their words are an accurate, if simplified, description of how they work.

beyondCritics 9 hours ago|||

Just HI slop. Ask any decent model, it can explain what's wrong this this description.

aurareturn 11 hours ago|||

Don’t forget that the 8B model requires 10 of said chips to run.

And it’s a 3bit quant. So 3GB ram requirement.

If they run 8B using native 16bit quant, it will use 60 H100 sized chips.

dust42 10 hours ago||

> Don’t forget that the 8B model requires 10 of said chips to run.

Are you sure about that? If true it would definitely make it look a lot less interesting.

aurareturn 10 hours ago||

Their 2.4 kW is for 10 chips it seems based on the next platform article.

I assume they need all 10 chips for their 8B q3 model. Otherwise, they would have said so or they would have put a more impressive model as the demo.

https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...

audunw 10 hours ago||

It doesn’t make any sense to think you need the whole server to run one model. It’s much more likely that each server runs 10 instances of the model

1. It doesn’t make sense in terms of architecture. It’s one chip. You can’t split one model over 10 identical hardwire chips

2. It doesn’t add up with their claims of better power efficiency. 2.4kW for one model would be really bad.

aurareturn 9 hours ago|||

We are both wrong.

First, it is likely one chip for llama 8B q3 with 1k context size. This could fit into around 3GB of SRAM which is about the theoretical maximum for TSMC N6 reticle limit.

Second, their plan is to etch larger models across multiple connected chips. It’s physically impossible to run bigger models otherwise since 3GB SRAM is about the max you can have on an 850mm2 chip.

  followed by a frontier-class large language model running inference across a collection of HC cards by year-end under its HC2 architecture

https://mlq.ai/news/taalas-secures-169m-funding-to-develop-a...

pigpop 16 minutes ago||

Aren't they only using the SRAM for the KV cache? They mention that the hardwired weights have a very high density. They say about the ROM part:

> We have got this scheme for the mask ROM recall fabric – the hard-wired part – where we can store four bits away and do the multiply related to it – everything – with a single transistor. So the density is basically insane.

I'm not a hardware guy but they seem to be making a strong distinction between the techniques they're using for the weights vs KV cache

> In the current generation, our density is 8 billion parameters on the hard wired part of the chip., plus the SRAM to allow us to do KV caches, adaptations like fine tuning, and etc.

moralestapia 10 hours ago|||

Thanks for having a brain.

Not sure who started that "split into 10 chips" claim, it's just dumb.

This is Llama 3B hardcoded (literally) on one chip. That's what the startup is about, they emphasize this multiple times.

aurareturn 9 hours ago||

It’s just dumb to think that one chip per model is their plan. They stated that their plan is to chain multiple chips together.

I was indeed wrong about 10 chips. I thought they would use llama 8B 16bit and a few thousand context size. It turns out, they used llama 8B 3bit with around 1k context size. That made me assume they must have chained multiple chips together since the max SRAM on TSMC n6 for reticle sized chip is only around 3GB.

WhitneyLand 8 hours ago|||

There’s a bit of a hidden cost here… the longevity of GPU hardware is going to be longer, it’s extended every time there’s an algorithmic improvement. Whereas any efficiency gains in software that are not compatible with this hardware will tend to accelerate their depreciation.

bsenftner 9 hours ago|||

Do not overlook traditional irrational investor exuberance, we've got an abundance of that right now. With the right PR manouveurs these guys could be a tulip craze.

mikhail-ramirez 7 hours ago|||

Yea its fast af but very quickly loses context/hallucinates from my own tests with large chunks of text

robotnikman 4 hours ago|||

Sounds perfect for use in consumer devices.

oliwary 11 hours ago|||

This is insane if true - could be super useful for data extraction tasks. Sounds like we could be talking in the cents per millions of tokens range.

zozbot234 10 hours ago|||

Low-latency inference is a huge waste of power; if you're going to the trouble of making an ASIC, it should be for dog-slow but very high throughput inference. Undervolt the devices as much as possible and use sub-threshold modes, multiple Vt and body biasing extensively to save further power and minimize leakage losses, but also keep working in fine-grained nodes to reduce areas and distances. The sensible goal is to expend the least possible energy per operation, even at increased latency.

dust42 10 hours ago|||

Low latency inference is very useful in voice-to-voice applications. You say it is a waste of power but at least their claim is that it is 10x more efficient. We'll see but if it works out it will definitely find its applications.

zozbot234 10 hours ago||

This is not voice-to-voice though, end-to-end voice chat models (the Her UX) are completely different.

dust42 10 hours ago||

I haven't found any end-to-end voice chat models useful. I had much better results with separate STT-LLM-TTS. One big problem is the turn detection and having inference with 150-200ms latency would allow for a whole new level of quality. I would just use it with a prompt: "You think the user is finished talking?" and then push it to a larger model. The AI should reply within the ballpark of 600ms-1000ms. Faster is often irritating, slower will make the user to start talking again.

PhunkyPhil 4 hours ago|||

I think it's really useful for agent to agent communication, as long as context loading doesn't become a bottleneck. Right now there can be noticeable delays under the hood, but at these speeds we'll never have to worry about latency when chain calling hundreds or thousands of agents in a network (I'm presuming this is going to take off in the future). Correct me if I'm wrong though.

Tepix 9 hours ago|||

Doesn't the blog state that it's now 4bit (the first gen was 3bit + 6bit)?

empath75 8 hours ago||

An on-device reasoning model what that kind of speed and cost would completely change the way people use their computers. It would be closer to star trek than anything else we've ever had. You'd never have to type anything or use a mouse again.

Alifatisk 5 hours ago||

What's happening in the comment section? How come so many cannot understand that his is running Llama 3.1 8B? Why are people judging its accuracy? It's almost a 2 years old 8B param model, why are people expecting to see Opus level response!?

The focus here should be on the custom hardware they are producing and its performance, that is whats impressive. Imagine putting GLM-5 on this, that'd be insane.

This reminds me a lot of when I tried the Mercury coder model by Inceptionlabs, they are creating something called a dLLM which is like a diffusion based llm. The speed is still impressive when playing aroun with it sometimes. But this, this is something else, it's almost unbelievable. As soon as I hit the enter key, the response appears, it feels instant.

I am also curious about Taalas pricing.

> Taalas’ silicon Llama achieves 17K tokens/sec per user, nearly 10X faster than the current state of the art, while costing 20X less to build, and consuming 10X less power.

Do we have an idea of how much a unit / inference / api will cost?

Also, considering how fast people switch models to keep up with the pace. Is there really a potential market for hardware designed for one model only? What will they do when they want to upgrade to a better version? Throw the current hardware and buy another one? Shouldn't there be a more flexible way? Maybe only having to switch the chip on top like how people upgrade CPUs. I don't know, just thinking out loudly.

mike_hearn 4 hours ago||

They don't give cost figures in their blog post but they do here:

https://www.nextplatform.com/wp-content/uploads/2026/02/taal...

Probably they don't know what the market will bear and want to do some exploratory pricing, hence the "contact us" API access form. That's fair enough. But they're claiming orders of magnitude cost reduction.

> Is there really a potential market for hardware designed for one model only?

I'm sure there is. Models are largely interchangeable especially as the low end. There are lots of use cases where you don't need super smart models but cheapness and fastness can matter a lot.

Think about a simple use case: a company has a list of one million customer names but no information about gender or age. They'd like to get a rough understanding of this. Mapping name -> guessed gender, rough guess of age is a simple problem for even dumb LLMs. I just tried it on ChatJimmy and it worked fine. For this kind of exploratory data problem you really benefit from mass parallelism, low cost and low latency.

> Shouldn't there be a more flexible way?

The whole point of their design is to sacrifice flexibility for speed, although they claim they support fine tunes via LoRAs. LLMs are already supremely flexible so it probably doesn't matter.

pigpop 7 minutes ago||

Yes, there are all kinds of fuzzy NLP tasks that this would be great for. Jobs where you can chunk the text into small units and add instructions and only need a short response. You could burn through huge data sets very quickly using these chips.

himata4113 4 hours ago||

I personally don't buy it, cerebras is way more advanced than this, comparing this tok/s to cerebras is disingenious.

freakynit 10 hours ago||

Holy cow their chatapp demo!!! I for first time thought i mistakenly pasted the answer. It was literally in a blink of an eye.!!

https://chatjimmy.ai/

qingcharles 7 hours ago||

I asked it to design a submarine for my cat and literally the instant my finger touched return the answer was there. And that is factoring in the round-trip time for the data too. Crazy.

The answer wasn't dumb like others are getting. It was pretty comprehensive and useful.

  While the idea of a feline submarine is adorable, please be aware that building a real submarine requires significant expertise, specialized equipment, and resources.

robotpepi 4 hours ago||

it's incredible how many people are commenting here without having read the article. they completely lost the point.

smusamashah 8 hours ago|||

With this speed, you can keep looping and generating code until it passes all tests. If you have tests.

Generate lots of solutions and mix and match. This allows a new way to look at LLMs.

Retr0id 8 hours ago|||

Not just looping, you could do a parallel graph search of the solution-space until you hit one that works.

xi_studio 7 hours ago|||

Infinite Monkey Theory just reached its peak

dave1010uk 3 hours ago|||

You could also parse prompts into an AST, run inference, run evals, then optimise the prompts with something like a genetic algorithm.

Epskampie 8 hours ago||||

And then it's slow again to finally find a correct answer...

34679 5 hours ago||

It won't find the correct answer. Garbage in, garbage out.

otabdeveloper4 2 hours ago||||

This doesn't work. The model outputs the most probable tokens. Running it again and asking for less probable tokens just results in the same but with more errors.

therealdrag0 20 minutes ago||

Do you not have experience with agents solving problems? They already successfully do this. They try different things until they get a solution.

turnsout 6 hours ago||||

Agreed, this is exciting, and has me thinking about completely different orchestrator patterns. You could begin to approach the solution space much more like a traditional optimization strategy such as CMA-ES. Rather than expect the first answer to be correct, you diverge wildly before converging.

MattRix 8 hours ago|||

This is what people already do with “ralph” loops using the top coding models. It’s slow relative to this, but still very fast compared to hand-coding.

amelius 9 hours ago|||

OK investors, time to pull out of OpenAI and move all your money to ChatJimmy.

freakynit 9 hours ago|||

A related argument I raised a few days back on HN:

What's the moat with with these giant data-centers that are being built with 100's of billions of dollars on nvidia chips?

If such chips can be built so easily, and offer this insane level of performance at 10x efficiency, then one thing is 100% sure: more such startups are coming... and with that, an entire new ecosystem.

jzymbaluk 5 hours ago|||

You'd still need those giant data centers for training new frontier models. These Taalas chips, if they work, seem to do the job of inference well, but training will still require general purpose GPU compute

bonoboTP 1 hour ago||

Next up: wire up a specialized chip to run the training loop of a specific architecture.

codebje 9 hours ago||||

RAM hoarding is, AFAICT, the moat.

freakynit 9 hours ago||

lol... true that for now though

Windchaser 5 hours ago||

Yeah, just cause Cisco had a huge market lead on telecom in the late '90s, it doesn't mean they kept it.

(And people nowadays: "Who's Cisco?")

bee_rider 8 hours ago||||

I think their hope is that they’ll have the “brand name” and expertise to have a good head start when real inference hardware comes out. It does seem very strange, though, to have all these massive infrastructure investment on what is ultimately going to be useless prototyping hardware.

elictronic 7 hours ago||

Tools like openclaw start making the models a commodity.

I need some smarts to route my question to the correct model. I wont care which that is. Selling commodities is notorious for slow and steady growth.

wmf 2 hours ago||||

Nvidia bought all the capacity so their competitors can't be manufactured at scale.

mlboss 4 hours ago|||

If I am not mistaken this chip was build specifically for the llama 8b model. Nvidia chips are general purpose.

raincole 9 hours ago|||

You mean Nvidia?

gwd 9 hours ago|||

I dunno, it pretty quickly got stuck; the "attach file" didn't seem to work, and when I asked "can you see the attachment" it replied to my first message rather than my question.

scosman 9 hours ago|||

It’s llama 3.1 8B. No vision, not smart. It’s just a technical demo.

anthonypasq 7 hours ago||

why is everyone seemingly incapable of understanding this? waht is going on here? Its like ai doomers consistently have the foresight of a rat. yeah no shit it sucks its running llama 3 8b, but theyre completely incapable of extrapolation.

freakynit 9 hours ago|||

Hmm.. I had tried simple chat converation without file attachments.

zwaps 9 hours ago|||

I got 16.000 tokens per second ahaha

bsenftner 9 hours ago|||

I get nothing, no replies to anything.

freakynit 9 hours ago||

Maybe hn and reddit crowd have overloaded them lol

elliotbnvl 10 hours ago|||

That… what…

PlatoIsADisease 8 hours ago|||

Well it got all 10 incorrect when I asked for top 10 catchphrases from a character in Plato's books. It confused the baddie for Socrates.

rvz 8 hours ago|||

Fast, but stupid.

   Me: "How many r's in strawberry?"

   Jimmy: There are 2 r's in "strawberry".

   Generated in 0.001s • 17,825 tok/s

The question is not about how fast it is. The real question(s) are:

   1. How is this worth it over diffusion LLMs (No mention of diffusion LLMs at all in this thread)

(This also assumes that diffusion LLMs will get faster)

   2. Will Talaas also work with reasoning models, especially those that are beyond 100B parameters and with the output being correct? 

   3. How long will it take to create newer models to be turned into silicon? (This industry moves faster than Talaas.)

   4. How does this work when one needs to fine-tune the model, but still benefit from the speed advantages?

mike_hearn 4 hours ago|||

The blog answers all those questions. It says they're working on fabbing a reasoning model this summer. It also says how long they think they need to fab new models, and that the chips support LoRAs and tweaking context window size.

I don't get these posts about ChatJimmy's intelligence. It's a heavily quantized Llama 3, using a custom quantization scheme because that was state of the art when they started. They claim they can update quickly (so I wonder why they didn't wait a few more months tbh and fab a newer model). Llama 3 wasn't very smart but so what, a lot of LLM use cases don't need smart, they need fast and cheap.

Also apparently they can run DeepSeek R1 also, and they have benchmarks for that. New models only require a couple of new masks so they're flexible.

simlevesque 5 hours ago||||

LLMs can't count. They need tool use to answer these questions accurately.

refsys 6 hours ago|||

[dead]

b0ner_t0ner 9 hours ago|||

I asked, “What are the newest restaurants in New York City?”

Jimmy replied with, “2022 and 2023 openings:”

0_0

freakynit 8 hours ago|||

Well, technically it's answer is correct when you consider it's knowledge cutoff date... it just gave you a generic always right answer :)

xi_studio 7 hours ago|||

chatjimmy's trained on LLama 3.1

jvidalv 8 hours ago|||

Is super fast but also super inaccurate, I would say not even gpt-3 levels.

roywiggins 4 hours ago|||

That's because it's llama3 8b.

empath75 8 hours ago|||

There are a lot of people here that are completely missing the point. What is it called where you look at a point of time and judge an idea without seemingly being able to imagine 5 seconds into the future.

Alifatisk 5 hours ago||

“static evaluation”

Etheryte 9 hours ago||

It is incredibly fast, on that I agree, but even simple queries I tried got very inaccurate answers. Which makes sense, it's essentially a trade off of how much time you give it to "think", but if it's fast to the point where it has no accuracy, I'm not sure I see the appeal.

andrewdea 9 hours ago|||

the hardwired model is Llama 3.1 8B, which is a lightweight model from two years ago. Unlike other models, it doesn't use "reasoning:" the time between question and answer is spent predicting the next tokens. It doesn't run faster because it uses less time to "think," It runs faster because its weights are hardwired into the chip rather than loaded from memory. A larger model running on a larger hardwired chip would run about as fast and get far more accurate results. That's what this proof of concept shows

Etheryte 8 hours ago||

I see, that's very cool, that's the context I was missing, thanks a lot for explaining.

kaashif 9 hours ago||||

If it's incredibly fast at a 2022 state of the art level of accuracy, then surely it's only a matter of time until it's incredibly fast at a 2026 level of accuracy.

PrimaryExplorer 9 hours ago|||

yeah this is mindblowing speed. imagine this with opus 4.6 or gpt 5.2. probably coming soon

scotty79 9 hours ago||

I'd be happy if they can run GLM 5 like that. It's amazing at coding.

Gud 9 hours ago|||

Why do you assume this?

I can produce total jibberish even faster, doesn’t mean I produce Einstein level thought if I slow down

Closi 4 hours ago|||

Better models already exist, this is just proving you can dramatically increase inference speeds / reduce inference costs.

It isn't about model capability - it's about inference hardware. Same smarts, faster.

andy12_ 8 hours ago|||

Not what he said.

scotty79 9 hours ago|||

I think it might be pretty good for translation. Especially when fed with small chunks of the content at a time so it doesn't lose track on longer texts.

mbh159 30 minutes ago||

So cool, what's underappreciated imo: 17k tokens/sec doesn't just change deployment economics. It changes what evaluation means, static MMLU-style tests were designed around human-paced interaction. At this throughput you can run tens of thousands of adversarial agent interactions in the time a standard benchmark takes. Speed doesn't make static evals better it makes them even more obviously inadequate.

jameslk 10 minutes ago||

The implications for RLM is really interesting. RLM is expensive because of token economics. But when tokens are so cheap and fast to generate, context size of the model matters a lot less

Also interesting implications for optimization-driven frameworks like DSPy. If you have an eval loop and useful reward function, you can iterate to the best possible response every time and ignore the cost of each attempt

alexc05 3 hours ago||

If I could have one of these cards in my own computer do you think it would be possible to replace claude code?

1. Assume It's running a better model, even a dedicated coding model. High scoring but obviously not opus 4.5 2. Instead of the standard send-receive paradigm we set up a pipeline of agents, each of whom parses the output of the previous.

At 17k/tps running locally, you could effectively spin up tasks like "you are an agent who adds semicolons to the end of the line in javascript", with some sort of dedicated software in the style of claude code you could load an array of 20 agents each with a role to play in improving outpus.

take user input and gather context from codebase -> rewrite what you think the human asked you in the form of an LLM-optimized instructional prompt -> examine the prompt for uncertainties and gaps in your understanding or ability to execute -> <assume more steps as relevant> -> execute the work

Could you effectively set up something that is configurable to the individual developer - a folder of system prompts that every request loops through?

Do you really need the best model if you can pass your responses through a medium tier model that engages in rapid self improvement 30 times in a row before your claude server has returned its first shot response?

AmazingTurtle 3 hours ago|

Models can't improve themselves with their own (model) input, they need to be grounded in truth and reality.

jjcm 10 hours ago||

A lot of naysayers in the comments, but there are so many uses for non-frontier models. The proof of this is in the openrouter activity graph for llama 3.1: https://openrouter.ai/meta-llama/llama-3.1-8b-instruct/activ...

10b daily tokens growing at an average of 22% every week.

There are plenty of times I look to groq for narrow domain responses - these smaller models are fantastic for that and there's often no need for something heavier. Getting the latency of reponses down means you can use LLM-assisted processing in a standard webpage load, not just for async processes. I'm really impressed by this, especially if this is its first showing.

jtr1 4 hours ago||

Maybe this is a naive question, but why wouldn't there be market for this even for frontier models? If Anthropic wanted to burn Opus 4.6 into a chip, wouldn't there theoretically be a price point where this would lower inference costs for them?

ethmarks 2 hours ago||

Because we don't know if this would scale well to high-quality frontier models. If you need to manufacture dedicated hardware for each new model, that adds a lot of expense and causes a lot of e-waste once the next model releases. In contrast, even this current iteration seems like it would be fantastic for low-grade LLM work.

For example, searching a database of tens of millions of text files. Very little "intelligence" is required, but cost and speed are very important. If you want to know something specific on Wikipedia but don't want to figure out which article to search for, you can just have an LLM read the entire English Wikipedia (7,140,211 articles) and compile a report. Doing that would be prohibitively expensive and glacially slow with standard LLM providers, but Taalas could probably do it in a few minutes or even seconds, and it would probably be pretty cheap.

redman25 7 hours ago|||

Many older models are still better at "creative" tasks because new models have been benchmarking for code and reasoning. Pre-training is what gives a model its creativity and layering SFT and RL on top tends to remove some of it in order to have instruction following.

freakynit 9 hours ago|||

Exactly. One easily relatable use-case is structured content extraction or/and conversion to markdown for web page data. I used to use groq for same (gpt-oss20b model), but even that used to feel slow when doing theis task at scale.

LLM's have opened-up natural language interface to machines. This chip makes it realtime. And that opens a lot of use-cases.

spot5010 7 hours ago||

These seem ideal for robotics applications, where there is a low-latency narrow use case path that these chips can serve, maybe locally.

baalimago 9 hours ago||

I've never gotten incorrect answers faster than this, wow!

Jokes aside, it's very promising. For sure a lucrative market down the line, but definitely not for a model of size 8B. I think lower level intellect param amount is around 80B (but what do I know). Best of luck!

Derbasti 9 hours ago||

Amazing! It couldn't answer my question at all, but it couldn't answer it incredibly quickly!

Snarky, but true. It is truly astounding, and feels categorically different. But it's also perfectly useless at the moment. A digital fidget spinner.

anthonypasq 7 hours ago||

does no one understand what a tech demo is anymore? do you think this piece of technology is just going to be frozen in time at this capability for eternity?

do you have the foresight of a nematode?

otabdeveloper4 2 hours ago|||

Make it for Qwen 2.5 and I'd buy it.

You don't actually need "frontier models" for Real Work (c).

(Summarization, classification and the rest of the usual NLP suspects.)

PlatoIsADisease 8 hours ago|||

As someone with a 3060, I can attest that there are really really good 7-9B models. I still use berkeley-nest/Starling-LM-7B-alpha and that model is a few years old.

If we are going for accuracy, the question should be asked multiple times on multiple models and see if there is agreement.

But I do think once you hit 80B, you can struggle to see the difference between SOTA.

That said, GPT4.5 was the GOAT. I can't imagine how expensive that one was to run.

edot 8 hours ago||

Yeah, two p’s in the word pepperoni …

aurareturn 11 hours ago||

Edit: it seems like this is likely one chip and not 10. I assumed 8B 16bit quant with 4K or more context. This made me think that they must have chained multiple chips together since N6 850mm2 chip would only yield 3GB of SRAM max. Instead, they seem to have etched llama 8B q3 with 1k context instead which would indeed fit the chip size.

This requires 10 chips for an 8 billion q3 param model. 2.4kW.

10 reticle sized chips on TSMC N6. Basically 10x Nvidia H100 GPUs.

Model is etched onto the silicon chip. So can’t change anything about the model after the chip has been designed and manufactured.

Interesting design for niche applications.

What is a task that is extremely high value, only require a small model intelligence, require tremendous speed, is ok to run on a cloud due to power requirements, AND will be used for years without change since the model is etched into silicon?

danpalmer 11 hours ago||

Alternatively, you could run far more RAG and thinking to integrate recent knowledge, I would imagine models designed for this putting less emphasis on world knowledge and more on agentic search.

freeone3000 10 hours ago||

Maybe; models with more embedded associations are also better at search. (Intuitively, this tracks; a model with no world knowledge has no awareness of synonyms or relations (a pure markov model), so the more knowledge a model has, the better it can search.) It’s not clear if it’s possible to build such a model, since there doesn’t seem to be a scaling cliff.

pjc50 11 hours ago|||

Where are those numbers from? It's not immediately clear to me that you can distribute one model across chips with this design.

> Model is etched onto the silicon chip. So can’t change anything about the model after the chip has been designed and manufactured.

Subtle detail here: the fastest turnaround that one could reasonably expect on that process is about six months. This might eventually be useful, but at the moment it seems like the model churn is huge and people insist you use this week's model for best results.

mike_hearn 4 hours ago|||

Well they claim two month turnaround. Big If True. How does the six months break down in your estimation? Maybe they have found a way to reduce the turnaround time.

aurareturn 10 hours ago||||

  > The first generation HC1 chip is implemented in the 6 nanometer N6 process from TSMC. Each HC1 chip has 53 billion transistors on the package, most of it very likely for ROM and SRAM memory. The HC1 card burns about 200 watts, says Bajic, and a two-socket X86 server with ten HC1 cards in it runs 2,500 watts.

https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...

darkwater 9 hours ago|||

And what of that makes you assume that having a server with 10 HC1 cards is needed to run a single model version on that server?

dakolli 10 hours ago|||

So it lights money on fire extra fast, AI focused VCs are going to really love it then!!

adityashankar 10 hours ago||||

This depends on how much better the models will get from now in, if Claude Opus 4.6 was transformed into one of these chips and ran at a hypothetical 17k tokens/second, I'm sure that would be astounding, this depends on how much better claude Opus 5 would be compared to the current generation

aurareturn 10 hours ago|||

I’m pretty sure they’d need a small data center to run a model the size of Opus.

empath75 8 hours ago|||

Even an O3 quality model at that speed would be incredible for a great many tasks. Not everything needs to be claude code. Imagine Apple fine tuning a mid tier reasoning model on personal assistant/MacOs/IOS sorts of tasks and burning a chip onto the mac studio motherboard. Could you run claude code on it? Probably not, would it be 1000x better than Siri? absolutely.

JKCalhoun 18 minutes ago||

Yeah, waiting for Apple to cut a die that can do excellent local AI.

empath75 8 hours ago|||

100x of a less good model might be better than 1 of a better model for many many applications.

This isn't ready for phones yet, but think of something like phones where people buy new ones every 3 years and even having a mediocre on-device model at that speed would be incredible for something like siri.

machiaweliczny 9 hours ago|||

A lot of NLP tasks could benefit from this

thrance 10 hours ago|||

> What is a task that is extremely high value, only require a small model intelligence, require tremendous speed, is ok to run on a cloud due to power requirements, AND will be used for years without change since the model is etched into silicon?

Video game NPCs?

aurareturn 10 hours ago||

Doesn’t pass the high value and require tremendous speed tests.

mike_hearn 4 hours ago|||

Speed = capacity = cost.

thrance 7 hours ago|||

Video games are a huge market, and speed and cost of current models are definitely huge barriers to integrating LLMs in video games.

Shaanveer 11 hours ago|||

ceo

charcircuit 10 hours ago||

No one would never give such a weak model that much power over a company.

teaearlgraycold 11 hours ago|||

I'm thinking the best end result would come from custom-built models. An 8 billion parameter generalized model will run really quickly while not being particularly good at anything. But the same parameter count dedicated to parsing emails, RAG summarization, or some other specialized task could be more than good enough while also running at crazy speeds.

metabrew 11 hours ago|

I tried the chatbot. jarring to see a large response come back instantly at over 15k tok/sec

I'll take one with a frontier model please, for my local coding and home ai needs..

grzracz 11 hours ago||

Absolute insanity to see a coherent text block that takes at least 2 minutes to read generated in a fraction of a second. Crazy stuff...

pjc50 10 hours ago|||

Accelerating the end of the usable text-based internet one chip at a time.

VMG 9 hours ago||||

Not at all if you consider the internet pre-LLM. That is the standard expectation when you load a website.

The slow word-by-word typing was what we started to get used to with LLMs.

If these techniques get widespread, we may grow accustomed to the "old" speed again where content loads ~instantly.

Imagine a content forest like Wikipedia instantly generated like a Minecraft word...

kleiba 10 hours ago|||

Yes, but the quality of the output leaves to be desired. I just asked about some sports history and got a mix of correct information and totally made up nonsense. Not unexpected for an 8k model, but raises the question of what the use case is for such small models.

kgeist 9 hours ago|||

8b models are great at converting unstructured data to a structured format. Say, you want to transcribe all your customer calls and get a list of issues they discussed most often. Currently with the larger models it takes me hours.

A chatbot which tells you various fun facts is not the only use case for LLMs. They're language models first and foremost, so they're good at language processing tasks (where they don't "hallucinate" as much).

Their ability to memorize various facts (with some "hallucinations") is an interesting side effect which is now abused to make them into "AI agents" and what not but they're just general-purpose language processing machines at their core.

eternauta3k 7 hours ago||

Would be nice to point this at (pre-LLM) Wikipedia and fill out Wikidata!

djb_hackernews 10 hours ago|||

You have a misunderstanding of what LLMs are good at.

cap11235 10 hours ago|||

Poster wants it to play Jeopardy, not process text.

kleiba 10 hours ago||||

Care to enlighten me?

vntok 9 hours ago||

Don't ask a small LLM about precise minutiae factual information.

Alternatively, ask yourself how plausible it sounds that all the facts in the world could be compressed into 8k parameters while remaining intact and fine-grained. If your answer is that it sounds pretty impossible... well it is.

kleiba 16 minutes ago||

Did you see the part in my original post where it said "Not unexpected for an 8k model"?

IshKebab 10 hours ago||||

I don't think he does. Larger models are definitely better at not hallucinating. Enough that they are good at answering questions on popular topics.

Smaller models, not so much.

paganel 10 hours ago|||

Not sure if you're correct, as the market is betting trillions of dollars on these LLMs, hoping that they'll be close to what the OP had expected to happen in this case.

raincole 9 hours ago||

The market didn't throw trillions of dollars to develop Llama 3 8B.

What GP is expected to happen has happened around late 2024 ~ early 2025 when LLM frontends got web search feature. It's old tech now.

paganel 8 hours ago||

The GP’s point was about LLMs generally, no matter the interface. I agree that this particular model is (relatively speaking) ancient in AI the world, but go back 3 or 4 years and this (pretty complex “reasoning” at almost instant speed) would have seemed taken out of a science-fiction book.

stabbles 10 hours ago|||

Reminds me of that solution to Fermi's paradox, that we don't detect signals from extraterrestrial civilizations because they run on a different clock speed.

dintech 10 hours ago|||

Iain M Banks’ The Algebraist does a great job of covering that territory. If an organism had a lifespan of millions of years, they might perceive time and communication differently to say a house fly or us.

xyzsparetimexyz 10 hours ago|||

:eyeroll:

pennomi 6 hours ago|||

Yeah, feeding that speed into a reasoning loop or a coding harness is going to revolutionize AI.

More comments...