Posted by anemll 8 hours ago
They didn't make special purpose hardware to run a model. They crafted a large model so that it could run on consumer hardware (a phone).
We haven't had phones running laptop-grade CPUs/GPUs for that long, and that is a very real hardware feat. Likewise, nobody would've said running a 400b LLM on a low-end laptop was feasible, and that is very much a software triumph.
Agree to disagree, we've had laptop-grade smartphone hardware for longer than we've had LLMs.
We've had solid CPUs for a while, but GPUs have lagged behind, and GPUs are the ones that matter for this particular application. iPhones still lead by a comfortable margin on this front, but have historically been pretty limited on I/O (they only supported USB 2 speeds until recently).
It's been a lot of years, but all I can hear after reading that is "I'm making a note here: huge success."
Remember when people were arguing about whether to use mmap? What a ridiculous argument.
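For anyone who missed that argument: the idea was to map the weights file instead of reading it all into RAM, so the OS pages chunks in on demand. A minimal sketch in Python (the file here is a throwaway stand-in for a real checkpoint, and the layout is made up for illustration):

```python
# mmap approach: nothing is loaded until pages are actually touched,
# so a file far larger than RAM can still be "opened" instantly.
import mmap
import os
import tempfile

def open_weights(path):
    """Map a weights file read-only; the OS faults pages in on demand."""
    f = open(path, "rb")
    return f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def read_tensor(mm, offset, nbytes):
    """Slicing the map touches only the pages that back this range."""
    return mm[offset:offset + nbytes]

# Demo with a temp file standing in for a checkpoint: two 4 KiB "tensors".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x01" * 4096 + b"\x02" * 4096)
    path = tmp.name

f, mm = open_weights(path)
chunk = read_tensor(mm, 4096, 4096)  # only the second page must be resident
assert chunk == b"\x02" * 4096
mm.close()
f.close()
os.unlink(path)
```

The trade-off people argued over: mmap makes cold starts cheap and lets the kernel manage eviction, but every cache miss becomes a page fault on the critical path of token generation.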
At some point someone will figure out how to tile the weights and the memory requirements will drop again.
Experts are selected layer by layer, and the individual expert reads are quite small, so this is not really feasible. There's just not enough information to guide a prefetch.
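To make that concrete, here's a toy simulation of why prefetching experts is hard: the expert needed at layer i+1 is only known once layer i's router has run, and selections are close to independent across layers, so a naive "prefetch whatever was used last" guess almost never hits. The expert/layer counts below are made-up illustration values, and the random choice stands in for a real router:

```python
# Toy model of MoE expert selection: if each layer's router picks roughly
# independently among N experts, guessing the next expert from the previous
# layer's choice hits about 1/N of the time, i.e. almost every expert read
# is a cold, small, latency-bound fetch.
import random

NUM_EXPERTS = 64   # assumed experts per layer
NUM_LAYERS = 40    # assumed layer count
TOKENS = 1000

random.seed(0)
hits = total = 0
for _ in range(TOKENS):
    prev = None
    for _ in range(NUM_LAYERS):
        chosen = random.randrange(NUM_EXPERTS)  # stand-in for the router
        if prev is not None:
            total += 1
            hits += (chosen == prev)            # naive prefetch guess
        prev = chosen

hit_rate = hits / total
assert hit_rate < 0.05  # ~1/64 expected; far too low to hide fetch latency
```

Under these assumptions a prefetcher would need the router's output before it runs, which is exactly the "not enough information" problem.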
It's just so slow that nobody pursued it seriously. It's fun to see these tricks implemented, but even on this 2025 top-spec iPhone Pro the output is 100X slower than output from hosted services.
It is objectively slow: around 100X slower than what most people consider usable.
The quality is also severely degraded to get that speed.
> but the point of this is that you can run cheap inference in bulk on very low-end hardware.
You always could, if you didn't care about speed or efficiency.
iPhone 17 Pro outperforms AMD’s Ryzen 9 9950X per https://www.igorslab.de/en/iphone-17-pro-a19-pro-chip-uebert...
Practical LLMs on mobile devices are at least a few years away.
That said, it's a fun quote and I've jokingly said it myself; I think of it more as part of 'popular' culture lol