Posted by anemll 8 hours ago
They didn't make special purpose hardware to run a model. They crafted a large model so that it could run on consumer hardware (a phone).
We haven't had phones running laptop-grade CPUs/GPUs for that long, and that is a very real hardware feat. Likewise, nobody would've said running a 400b LLM on a low-end laptop was feasible, and that is very much a software triumph.
Agree to disagree, we've had laptop-grade smartphone hardware for longer than we've had LLMs.
We've had solid CPUs for a while, but GPUs have lagged behind, and GPUs are the ones that matter for this particular application. iPhones still lead by a comfortable margin on this front, but have historically been pretty limited on I/O (they only supported USB 2 speeds until recently).
It's been a lot of years, but all I can hear after reading that is "I'm making a note here: huge success."
Remember when people were arguing about whether to use mmap? What a ridiculous argument.
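For anyone who missed that argument: the idea was to map the weights file instead of reading it all into RAM, so the OS pages chunks in on demand. A minimal sketch in Python (the file here is a throwaway stand-in for a real checkpoint, and the layout is made up for illustration):

```python
# mmap approach: nothing is loaded until pages are actually touched,
# so a file far larger than RAM can still be "opened" instantly.
import mmap
import os
import tempfile

def open_weights(path):
    """Map a weights file read-only; the OS faults pages in on demand."""
    f = open(path, "rb")
    return f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def read_tensor(mm, offset, nbytes):
    """Slicing the map touches only the pages that back this range."""
    return mm[offset:offset + nbytes]

# Demo with a temp file standing in for a checkpoint: two 4 KiB "tensors".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x01" * 4096 + b"\x02" * 4096)
    path = tmp.name

f, mm = open_weights(path)
chunk = read_tensor(mm, 4096, 4096)  # only the second page must be resident
assert chunk == b"\x02" * 4096
mm.close()
f.close()
os.unlink(path)
```

The trade-off people argued over: mmap makes cold starts cheap and lets the kernel manage eviction, but every cache miss becomes a page fault on the critical path of token generation.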
At some point someone will figure out how to tile the weights and the memory requirements will drop again.
Experts are selected layer by layer, and the individual expert reads are quite small, so this is not really feasible. There's just not enough information to guide a prefetch.
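To make that concrete, here's a toy simulation of why prefetching experts is hard: the expert needed at layer i+1 is only known once layer i's router has run, and selections are close to independent across layers, so a naive "prefetch whatever was used last" guess almost never hits. The expert/layer counts below are made-up illustration values, and the random choice stands in for a real router:

```python
# Toy model of MoE expert selection: if each layer's router picks roughly
# independently among N experts, guessing the next expert from the previous
# layer's choice hits about 1/N of the time, i.e. almost every expert read
# is a cold, small, latency-bound fetch.
import random

NUM_EXPERTS = 64   # assumed experts per layer
NUM_LAYERS = 40    # assumed layer count
TOKENS = 1000

random.seed(0)
hits = total = 0
for _ in range(TOKENS):
    prev = None
    for _ in range(NUM_LAYERS):
        chosen = random.randrange(NUM_EXPERTS)  # stand-in for the router
        if prev is not None:
            total += 1
            hits += (chosen == prev)            # naive prefetch guess
        prev = chosen

hit_rate = hits / total
assert hit_rate < 0.05  # ~1/64 expected; far too low to hide fetch latency
```

Under these assumptions a prefetcher would need the router's output before it runs, which is exactly the "not enough information" problem.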
It's just so slow that nobody pursued it seriously. It's fun to see these tricks implemented, but even on this 2025 top-spec iPhone Pro the output is 100X slower than output from hosted services.
It is objectively slow: around 100X slower than what most people consider usable.
The quality is also severely degraded to get that speed.
> but the point of this is that you can run cheap inference in bulk on very low-end hardware.
You always could, if you didn't care about speed or efficiency.
iPhone 17 Pro outperforms AMD’s Ryzen 9 9950X per https://www.igorslab.de/en/iphone-17-pro-a19-pro-chip-uebert...
Practical LLMs on mobile devices are at least a few years away.
That said, it's a fun quote and I've jokingly said it myself; I think of it more as part of 'popular' culture lol