Posted by sidnarsipur 1 day ago

The path to ubiquitous AI (17k tokens/sec) (taalas.com)
779 points | 424 comments
notenlish 1 day ago|
Impressive stuff.
DeathArrow 21 hours ago||
It's amazingly fast, but since the model is quantized and pretty limited, I don't know what it's useful for.
petesergeant 23 hours ago||
The future is these as small, swappable, SD-card-sized bits of hardware that you stick into your devices.
Aerroon 23 hours ago||
Imagine this thing for autocomplete.

I'm not sure how good llama 3.1 8b is for that, but it should work, right?

Autocomplete models don't have to be very big, but they gotta be fast.
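
A rough latency sketch for that: at the article's 17k tokens/sec, even multi-line completions come back in single-digit milliseconds. The completion lengths below are assumed typical values, not numbers from the article.

    # Back-of-the-envelope autocomplete latency at the article's 17k tok/s.
    # Completion lengths are assumed typical values, not from the article.
    TOKENS_PER_SEC = 17_000

    for completion_tokens in (8, 32, 128):  # word, line, and block suggestions
        latency_ms = completion_tokens / TOKENS_PER_SEC * 1000
        print(f"{completion_tokens:>3} tokens -> {latency_ms:.2f} ms")

Even a 128-token suggestion lands in about 7.5 ms, far inside the ~100 ms window where autocomplete still feels instant.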

nickpsecurity 1 day ago||
My concept was to do this with two pieces:

1. Generic mask layers and a board to handle what's common across models, especially memory and interface.

2. Specific layers for the model implementation.

Masks are the most expensive part of ASIC design. So keeping the custom part small, with the rest pre-proven in silicon and even shared across companies, would drop the costs significantly (see the sketch at the end of this comment). This is already done in the hardware industry in many ways, but not for model acceleration.

Then, do 8B, 30-40B, 70B, and 405B models in hardware. Make sure they're RLHF-tuned well, since changes will be impossible or limited. Prompts will drive most useful functionality. Keep cranking out chips. There's maybe a chance to keep the weights changeable on-chip, but it should still be useful even if only the inputs can change.

The other concept is to use analog neural networks, with the analog layers on older, cheaper nodes. We only have to customize that per model. The rest is pre-built digital with standard interfaces on a modern node. Given that the chips would be distributed, one might get away with 28nm for the shared part and develop it with shuttle runs.
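
A toy amortization showing why keeping the custom part small matters. Every figure here (NRE, custom fraction, volume) is an illustrative assumption, not an industry quote.

    # Toy NRE amortization for the split-mask concept: only the
    # model-specific layers pay for new masks; the base is shared.
    # All dollar figures and ratios below are assumptions.
    FULL_NRE        = 20_000_000  # assumed full mask-set + design NRE (USD)
    CUSTOM_FRACTION = 0.15        # assumed share of masks that are model-specific
    UNITS_PER_MODEL = 1_000_000   # assumed production volume per model

    print(f"fully custom: ${FULL_NRE / UNITS_PER_MODEL:.2f}/chip")
    print(f"shared base:  ${FULL_NRE * CUSTOM_FRACTION / UNITS_PER_MODEL:.2f}/chip")

Under these assumptions, per-chip NRE drops from $20 to $3; the same logic scales to whatever the real figures are.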

GaggiX 1 day ago||
For fun, I'm imagining a future where you could buy an ASIC with a hard-wired 1B LLM in it for cents, and it could be used everywhere.
YetAnotherNick 1 day ago||
17k tokens/sec works out to $0.18/chip/hr for an H100-sized chip if they want to compete with the market rate [1]. But 17k tokens/sec could enable some new use cases.

[1]: https://artificialanalysis.ai/models/llama-3-1-instruct-8b/p...
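
Back-solving the math: at 17k tokens/sec a chip produces about 61.2M tokens per hour, so the $0.18/chip/hr figure corresponds to roughly $0.003 per million tokens. That per-million rate is inferred from the comment and treated as an assumption; see the link for current prices.

    # Parity math behind the $0.18/chip/hr figure. The per-million-token
    # market rate is back-solved from the comment and is an assumption.
    TOKENS_PER_SEC   = 17_000
    MARKET_USD_PER_M = 0.003  # assumed market price per 1M tokens

    tokens_per_hr = TOKENS_PER_SEC * 3600 / 1e6  # ~61.2M tokens/hr
    print(f"{tokens_per_hr:.1f}M tokens/hr -> "
          f"${tokens_per_hr * MARKET_USD_PER_M:.2f}/chip/hr to break even")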

standeven 1 day ago||
Holy shit, this is fast. It generated a legible, original two-paragraph story on given topics in 0.025s.
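
For scale, 0.025 s at the quoted 17k tokens/sec is roughly 425 tokens, which is about right for two short paragraphs (a rough sanity check, counting output tokens only).

    # Sanity check: tokens producible in 0.025 s at 17k tok/s
    print(0.025 * 17_000)  # 425.0 -- roughly a two-paragraph story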
OrvalWintermute 1 day ago|
wow that is fast!