Posted by sidnarsipur 20 hours ago

The path to ubiquitous AI (17k tokens/sec)(taalas.com)
694 points | 397 comments
flux3125 13 hours ago|
I imagine how advantageous it would be to have something like llama.cpp encoded on a chip instead, allowing us to run more than a single model. It would be slower than Jimmy, for sure, but depending on the speed, it could be an acceptable trade-off.
ThePhysicist 17 hours ago||
This is really cool! I am trying to find a way to accelerate LLM inference for PII detection, where speed really matters since we want to process millions of log lines per minute. How fast could we get e.g. Llama 3.1 to run on a conventional NVIDIA card? 10k tokens per second would be fantastic, but even at 1k this would be very useful.
freakynit 17 hours ago||
PII redaction is a really good use-case.

Also, "10k tokens per second would be fantastic" might not be sufficient (even remotely) if you want to "process millions of log lines per minute".

Assuming a single log line is just 100 tokens, you need (100 * 2 million / 60) ≈ 3.3 million tokens per second of processing speed :)
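
Spelled out (same assumptions as above: 100 tokens/line, 2 million lines/minute):

    # Back-of-the-envelope: tokens/sec needed at 100 tokens/line, 2M lines/min
    tokens_per_line = 100
    lines_per_minute = 2_000_000
    required_tps = tokens_per_line * lines_per_minute / 60
    print(f"{required_tps:,.0f} tokens/sec")  # ~3,333,333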

ThePhysicist 15 hours ago||
Yeah, we have a mechanism that can bypass the AI model for log lines where we are pretty sure there's no PII (a kind of smart caching that uses fuzzy template matching to recognize things we have seen many times before, since logs tend to contain the same stuff over and over with tiny variations, e.g. different timestamps). So we only need to pass the lines we can't be sure about to the AI for inspection, and we can of course parallelize. Currently we use a homebrew CRF model with lots of tweaks and it's quite good, but an LLM would of course be much better still and catch a lot of cases that evade the simpler model.
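
Roughly the shape of the bypass (not our actual code; the regexes and the threshold are purely illustrative):

    import re

    # Normalize the volatile parts of a log line and only send templates we
    # haven't repeatedly cleared to the (slow) model.
    VOLATILE = [
        (re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}\S*"), "<TS>"),
        (re.compile(r"\b\d+\b"), "<NUM>"),
    ]

    def template_of(line):
        for pattern, placeholder in VOLATILE:
            line = pattern.sub(placeholder, line)
        return line

    seen_clean = {}  # template -> times the model found no PII in it

    def needs_model(line, threshold=5):
        return seen_clean.get(template_of(line), 0) < threshold

    def mark_clean(line):
        tpl = template_of(line)
        seen_clean[tpl] = seen_clean.get(tpl, 0) + 1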
freakynit 4 hours ago||
Oh okay... that's fine. Most log lines are indeed similar looking.
lopuhin 12 hours ago||
For that you only need high throughput, which is much easier to achieve than low latency thanks to batching -- assuming the log lines or chunks can be processed independently. You can check the TensorRT-LLM benchmarks (https://nvidia.github.io/TensorRT-LLM/developer-guide/perf-o...), or try running vllm on a card you have access to.
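
A minimal sketch with vLLM's offline API (the model name and prompt are just placeholders; real throughput depends on the card, batch size, and sequence lengths):

    # Batched offline inference with vLLM; it batches the prompts internally.
    from vllm import LLM, SamplingParams

    log_lines = ["user=alice logged in from 10.0.0.1"]  # placeholder input
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    params = SamplingParams(temperature=0.0, max_tokens=8)

    prompts = [f"Does this log line contain PII? Answer yes or no.\n{line}"
               for line in log_lines]
    for out in llm.generate(prompts, params):
        print(out.prompt[:40], "->", out.outputs[0].text.strip())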
rbanffy 17 hours ago||
This makes me think: how large would an FPGA-based system have to be to do this? Obviously there is no single-chip FPGA that can do this kind of job, but I wonder how many we would need.

Also, what if Cerebras decided to make a wafer-sized FPGA array and turned large language models into lots and lots of logical gates?

33a 18 hours ago||
If they made a low-power/mobile version, this could be really huge for embedded electronics. Mass-produced, highly efficient, "good enough" but still sort of dumb AIs could put intelligence in household devices like toasters, light switches, and toilets. Truly we could be entering the golden age of curses.
left-struck 17 hours ago|
Oh god, this is the new version of every device having Bluetooth and an app and being called “smart”.

I just wanted some toast, but here I am installing an app, dismissing 10 popups, and maybe now arguing with a chat bot about how I don’t in fact want to turn on notifications.

loufe 19 hours ago||
Jarring to see these other comments so blindly positive.

Show me something at a model size of 80GB+, or this feels like "positive results in mice".

viraptor 19 hours ago||
There are a lot of problems solved by tiny models. The huge ones are fun for large programming tasks, exploration, analysis, etc., but there's a massive amount of <10GB-model processing happening every day, including on portable devices.

This is great even if it can't ever run Opus. Many people will be extremely happy about something like Phi accessible at lightning speed.

johnsimer 16 hours ago|||
Parameter density is doubling every 3-4 months

What does that mean for 8b models 24mo from now?
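
Taking that doubling rate at face value (it's a ballpark claim, not an established figure), 24 months is 6-8 doublings:

    # 24 months at one doubling every 3-4 months
    print(2 ** (24 / 4), 2 ** (24 / 3))  # roughly 64x to 256x "density" vs today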

hkt 19 hours ago||
Positive results in mice, also known as a promising proof of concept. At this point, anything that deflates the enormous bubble around GPUs, memory, etc. is a welcome remedy. A decent amount of efficient, "good enough" AI will change the market very considerably, adding a segment for people who don't need frontier models. I'd be surprised if they didn't end up releasing something a lot bigger than they have.
kamranjon 15 hours ago||
It would be pretty incredible if they could host an embedding model on this same hardware; I would pay for that immediately. It would change the kinds of things you could build by enabling on-the-fly embeddings with negligible latency.
stuxf 19 hours ago||
I totally buy the thesis on specialization here; I think it makes total sense.

Aside from the obvious concern that this is a tiny 8B model, I'm also a bit skeptical of the power draw. 2.4 kW feels a little bit high, but someone should try doing the napkin math on the throughput-to-power ratio compared to the H200 and other chips.
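
A rough version of that napkin math with the two numbers from the thread (the 17k tok/s headline figure and the 2.4 kW draw); the H200 throughput is left blank to fill in from a real benchmark run, and ~700 W is the SXM card's TDP:

    # tokens/sec per watt, for whatever the numbers are worth
    taalas_tps, taalas_watts = 17_000, 2_400
    h200_tps, h200_watts = None, 700  # fill in measured H200 throughput here
    print("Taalas:", round(taalas_tps / taalas_watts, 2), "tok/s per W")
    if h200_tps:
        print("H200:  ", round(h200_tps / h200_watts, 2), "tok/s per W")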

TheServitor 8 hours ago||
I don't know what the use for this is yet, but I'm certain there will be one.
d2ou 12 hours ago|
Would it make sense for the big players to buy them? There seems to be a huge avenue here to cut inference costs, which have always made me dubious about LLMs in general.