
Posted by anemll 6 hours ago

iPhone 17 Pro Demonstrated Running a 400B LLM (twitter.com)
https://xcancel.com/anemll/status/2035901335984611412
325 points | 187 comments | page 2
r4m18612 3 hours ago|
Impressive. Running a 400B model on-device, even at low throughput, is pretty wild.
Mr_RxBabu 2 hours ago|
+1
MaxikCZ 29 seconds ago||
-1
redwood 3 hours ago||
It will be funny if we go back to lugging around brick-size batteries with us everywhere!
wiether 2 hours ago||
A backpack full of batteries!

https://www.youtube.com/watch?v=MI69LUXWiBc

gizajob 3 hours ago|||
Seeing as we have the power in our pockets we may as well utilise it. To…type…expert answers… very slowly.
wayeq 3 hours ago|||
might be worth it to keep Sam Altman from reading our AI generated fanfic
pokstad 3 hours ago||
Backpack computers!
skiing_crawling 2 hours ago||
I can't understand why this is a surprise to anyone. An iPhone is still a computer; of course it can run any model that fits in storage, albeit very slowly. The implementation is impressive, I guess, but I don't see how this is a novel capability. And at 0.6 t/s, it's not cost-efficient hardware for doing it. The iPhone can also render Pixar movies if you let it run long enough, mine bitcoin with a pathetic hashrate, and do weather simulations, just not in time for the forecast to be relevant.
anemll 2 hours ago|
SSD streaming to compute units is new. The M4 Max can do 15 t/s with its 15 GB/s drives
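Back-of-envelope, that figure is consistent with decode being SSD-bandwidth-bound. A rough sketch, using illustrative assumptions (a hypothetical ~20B active params per token for a 400B MoE, 4-bit weights, and some fraction of hot experts cached in RAM), not measured numbers:

```python
# Back-of-envelope: decode speed when MoE weights stream from SSD.
# All numbers below are illustrative assumptions, not measurements.

def tokens_per_sec(ssd_gbps, active_params_b, bits_per_weight, ram_hit_rate):
    """Throughput when the per-token SSD read is the bottleneck.

    ssd_gbps: sustained SSD read bandwidth in GB/s
    active_params_b: parameters touched per token (billions) -- the
        active experts of a MoE model, not the full parameter count
    bits_per_weight: quantization level
    ram_hit_rate: fraction of those weights already cached in RAM
    """
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    streamed = bytes_per_token * (1 - ram_hit_rate)
    return ssd_gbps * 1e9 / streamed if streamed else float("inf")

# Hypothetical 400B MoE, ~20B active params, 4-bit weights, 15 GB/s SSD:
print(round(tokens_per_sec(15.0, 20, 4, 0.0), 2))  # stream everything: 1.5 t/s
print(round(tokens_per_sec(15.0, 20, 4, 0.9), 2))  # 90% of hot experts cached: 15.0 t/s
```

The point of the sketch: raw streaming of every active weight caps out well below the claimed rate, so the interesting engineering is in how much of the working set stays resident between tokens.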
dv_dt 4 hours ago||
CPU, memory, storage, time tradeoffs rediscovered by AI model developers. There is something new here, add GPU to the trade space.
alephnerd 4 hours ago|
It's been known to people working in the space for a long time. Heck, I was working on similar stuff for the Maxwell and later Pascal over a decade ago.

You do have a lot of "MLEs" and "Data Scientists" who only know basic PyTorch and SKLearn, but that kind of fat is being trimmed industry wide now.

Domain experience remains gold, especially in a market like today's.

russellbeattie 4 hours ago||
I have some macro opinions about Apple - not sure if I'm correct, but tell me what you think.

Apple has always seen RAM as an economic advantage for their platform: make the development effort to ensure that the OS and apps work well with minimal memory, and save billions every year in hardware costs. In 2026, iPhones still come with 8GB of RAM; the Pro/Max come with 12GB.

The problem is that AI (ML/LLM training and inference) is an area where you can't get around the need for copious amounts of fast working memory. (Thus the critical shortage of RAM at the moment, as AI data centers consume as many memory chips as possible.)

Unless there's something I don't know (which is more than possible) Apple can't code their way around this problem, nor create specialized SoCs with ML cores that obviate the need for lots and lots of RAM.

So, it's going to be interesting to see whether they accept this reality and we start seeing future iPhones with 16GB, 32GB or more as standard in order to make AI performant. And whether they give up on adding AI to the billions of iPhones with minimal RAM already out there.

As a side note, 8GB of RAM hasn't been enough for a decade. It prevents basic tasks like keeping web tabs live in the background. My pet peeve is having just a few websites open and having the page refresh when I swap between them because of aggressive memory management.

To me, Apple's obvious strength is pushing AI to the edge as much as possible. While other companies are investing in massive data centers which will have millions of chips that will be outdated within the next couple years, Apple will be able to incrementally improve their ML/AI features by running on the latest and greatest chips every year. Apple has a huge advantage in that they can design their chips with a mega high speed bus, which is just as important as the quantity of RAM.

But all that depends on Apple's willingness to accept that RAM isn't an area they can skimp on any more, and I'm not sure they will.

Sorry for the brain dump. I'd love to be educated on this in case I'm totally off base.

mlsu 3 hours ago||
Models on the phone are never going to make sense.

If you're loading gigabytes of model weights into memory, you're also pushing gigabytes through the compute for inference. No matter how you slice it, no matter how dense you make the chips, that's going to cost a lot of energy. It's too energy intensive, simple as.

"On device" inference (for large LLM I mean) is a total red herring. You basically never want to do it unless you have unique privacy considerations and you've got a power cable attached to the wall. For a phone maybe you would want a very small model (like 3B something in that size) for Siri-like capabilities.

On a phone, each query/response is going to cost you 0.5% of your battery. That just isn't tenable for the way these models are being used.

Try this for yourself. Load a 7B model on your laptop and talk to it for 30 minutes. These things suck energy like a vacuum, even the shitty models. A network round trip gets you hundreds of tokens from a SOTA model and costs about 1 joule. By contrast, a single forward pass (one token) of a shitty 7B model costs 1 joule. It's just not tenable.
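The arithmetic behind those figures, with assumed numbers (a ~30 W laptop package at ~30 tok/s for a 7B model, and a ~15 Wh phone battery; all hypothetical, just to show the shape of the estimate):

```python
# Rough per-token energy estimate; every number here is an assumption
# for illustration, not a measurement.

def joules_per_token(watts, tokens_per_sec):
    # Energy per token = sustained power / decode throughput
    return watts / tokens_per_sec

# Local 7B model: ~30 W package power at ~30 tok/s (assumed)
local_j = joules_per_token(30.0, 30.0)   # 1.0 J/token

# Typical flagship phone battery: ~15 Wh = 15 * 3600 J (assumed)
battery_j = 15 * 3600

# A 500-token response at 1 J/token, as a fraction of the battery:
response_frac = 500 * local_j / battery_j
print(f"{response_frac:.2%} of battery per response")
```

Under these assumptions a single 500-token response burns just under 1% of the battery, which is the same ballpark as the "0.5% per query" figure above.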

russellbeattie 1 hour ago||
Huh, I hadn't thought of battery limitations. Good call. My initial reaction is that bigger/better batteries, hyper fast recharge times and more efficient processors might address this issue, but I need to learn more about it.

That said, power consumption is one of the reasons I think pushing this stuff to the edge is the only real path for AI in terms of a business model. It basically spreads the load and passes the cost of power to the end user, rather than trying to figure out how to pay for it at the data center level.

ecshafer 3 hours ago|||
In a recent episode of Dwarkesh the guest who is a semiconductor industry analyst predicted that an iPhone will increase in price by about $250 for the same stuff due to increased ram/chip costs from AI. Apple will not be able to afford to put a bunch more RAM into the phones and still sell them.
alwillis 2 hours ago||
> In a recent episode of Dwarkesh the guest who is a semiconductor industry analyst predicted that an iPhone will increase in price by about $250 for the same stuff due to increased ram/chip costs from AI. Apple will not be able to afford to put a bunch more RAM into the phones and still sell them.

Apple recently stated on an earnings call they signed contracts with RAM vendors before prices got out of control, so they should be good for a while. Nvidia also uses TSMC for their chips, which may affect A series and M series chip production.

Yes, TSMC has a plant in Arizona but my understanding is they can't make the cutting edge chips there; at least not yet.

zozbot234 4 hours ago|||
RAM is just too expensive. We need to bring back non-DRAM persistent memory that doesn't have the wearout issues of NAND.
anemll 3 hours ago||
Multiple NAND, and Apple already used it in the Mac Studio. Plus better cooling
big_toast 3 hours ago|||
I think this is roughly true, but RAM will instead remain a discriminator even more so. If the scaling laws Apple has domain over are compute and model size, then they'll pretty easily be able to map that into their existing price tiers.

Pros will want higher intelligence or throughput. Less demanding or knowledgeable customers will get price-funneled to what Apple thinks is the market premium for their use case.

It'll probably be a little harder to keep their developers RAM-disciplined (if that's even still true) for typical concerns. But model swap will be a big deal. The same exit-vs-voice issues will exist for Apple customers, but the margin logic seems to remain.

GTP 3 hours ago|||
> nor create specialized SoCs with ML cores that obviate the need for lots and lots of RAM

Why do you say they can't do this?

ottah 4 hours ago||
Possibly this just isn't the generation of hardware to solve this problem in? We're, what, three or four years in at most, and barely two in toward AI-assisted development being practical. I wouldn't want to be the first mover here, and I don't know if it's a good point in history to try to solve the problem. Everything we're doing right now with AI, we will likely not be doing in five years. If I were running a company like Apple, I'd just sit on the problem until the technology stabilizes and matures.
bigyabai 4 hours ago||
If I was running a company like Apple, I'd be working with Khronos to kill CUDA since yesterday. There are multiple trillions of dollars that could be Apple's if they sign CUDA drivers on macOS, or create a CUDA-compatible layer. Instead, Apple is spinning their wheels and promoting nothingburger technology like the NPU and MPS.

It's not like Apple's GPU designs are world-class anyways, they're basically neck-and-neck with AMD for raster efficiency. Except unlike AMD, Apple has all the resources in the world to compete with Nvidia and simply chooses to sit on their ass.

zozbot234 4 hours ago||
CUDA is not the real issue: AMD's HIP offers source-level compatibility with CUDA code, and ZLUDA even provides raw binary compatibility. Nvidia GPUs really are quite good, and the projected advantages of going multi-vendor just aren't worth the hassle given the amount of architecture-specificity GPUs are going to have.
bigyabai 4 hours ago||
Okay, then don't kill CUDA, just sign CUDA drivers on macOS instead and quit pretending like MPS is a world-class solution. There are trillions on the table, this is not an unsolvable issue.
atultw 2 hours ago||
Admittedly, my use of CUDA and Metal is fairly surface-level. But I have had great success using LLMs to convert whole gaussian splatting CUDA codebases to Metal. It's not ideal for maintainability and not 1:1, but if CUDA was a moat for NVIDIA, I believe LLMs have dealt a blow to it.
1970-01-01 2 hours ago||
"400 bytes should be enough for anybody"
Insanity 2 hours ago|
The 'B' in 400B is billion, not bytes. And there's no evidence Bill Gates actually said '640k ought to be enough for everyone': https://www.computerworld.com/article/1563853/the-640k-quote....

That said, it'd be a fun quote and I've jokingly said it as well, as I think of it more as part of 'popular' culture lol

ashwinnair99 6 hours ago||
A year ago this would have been considered impossible. The hardware is moving faster than anyone's software assumptions.
cogman10 6 hours ago||
This isn't a hardware feat, this is a software triumph.

They didn't make special purpose hardware to run a model. They crafted a large model so that it could run on consumer hardware (a phone).

pdpi 5 hours ago|||
It's both.

We haven't had phones running laptop-grade CPUs/GPUs for that long, and that is a very real hardware feat. Likewise, nobody would've said running a 400b LLM on a low-end laptop was feasible, and that is very much a software triumph.

bigyabai 4 hours ago||
> We haven't had phones running laptop-grade CPUs/GPUs for that long

Agree to disagree, we've had laptop-grade smartphone hardware for longer than we've had LLMs.

pdpi 3 hours ago||
Kind of.

We've had solid CPUs for a while, but GPUs have lagged behind (and they're the ones that matter for this particular application). iPhones still lead by a comfortable margin on this front, but have historically been pretty limited on the IO front (only supported USB2 speeds until recently).

smallerize 5 hours ago||||
The iPhone 17 Pro launched 8 months ago with 50% more RAM and about double the inference performance of the previous iPhone Pro (also 10x prompt processing speed).
SV_BubbleTime 4 hours ago||||
>triumph

It’s been a lot of years, but all I can hear after reading that is … I’m making a note here, huge success

GorbachevyChase 3 hours ago|||
There’s no use crying over every mistake. You just keep on trying until you run out of cake.
breggles 3 hours ago|||
It's hard to overstate my satisfaction!
anemll 3 hours ago|||
both, tbh
mannyv 5 hours ago|||
The software has real software engineers working on it instead of researchers.

Remember when people were arguing about whether to use mmap? What a ridiculous argument.

At some point someone will figure out how to tile the weights and the memory requirements will drop again.

snovv_crash 5 hours ago||
The real improvement will be when the software engineers get into the training loop. Then we can have MoE that use cache-friendly expert utilisation and maybe even learned prefetching for what the next experts will be.
zozbot234 4 hours ago||
> maybe even learned prefetching for what the next experts will be

Experts are predicted by layer and the individual layer reads are quite small, so this is not really feasible. There's just not enough information to guide a prefetch.

yorwba 3 hours ago|||
It's feasible to put the expert routing logic in a previous layer. People have done it: https://arxiv.org/abs/2507.20984
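A toy sketch of that idea in plain Python (made-up shapes and names, not the paper's): score the next layer's experts from the current layer's hidden state, so their weights can be fetched from storage while the current layer is still computing.

```python
import random

# Toy "early routing" for MoE prefetch. Everything here is hypothetical:
# a real router is learned, and real hidden states come from the model.
random.seed(0)
d_model, n_experts, top_k = 16, 8, 2

# Hidden state after layer L, and layer L+1's router weights
h = [random.gauss(0, 1) for _ in range(d_model)]
router = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_experts)]

# Score every expert of layer L+1 from layer L's output...
logits = [sum(w * x for w, x in zip(row, h)) for row in router]

# ...and kick off prefetch of the top-k experts' weights from SSD
prefetch_ids = sorted(range(n_experts), key=lambda e: logits[e])[-top_k:]
print(sorted(prefetch_ids))
```

The trade-off debated above shows up here: the earlier you compute the routing decision, the staler the hidden state it sees, so prefetch accuracy is bought with some routing quality.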
snovv_crash 4 hours ago|||
Manually no. It would have to be learned, and making the expert selection predictable would need to be a training metric to minimize.
zozbot234 4 hours ago||
Making the expert selection more predictable also means making it less effective. There's no real free lunch.
Aurornis 5 hours ago|||
It wasn't considered impossible. There are examples of large MoE LLMs running on small hardware all over the internet, like giant models on Raspberry Pi 5.

It's just so slow that nobody pursued it seriously. It's fun to see these tricks implemented, but even on this 2025 top spec iPhone Pro the output is 100X slower than output from hosted services.

zozbot234 4 hours ago||
If the bottleneck is storage bandwidth that's not "slow". It's only slow if you insist on interactive speeds, but the point of this is that you can run cheap inference in bulk on very low-end hardware.
Aurornis 2 hours ago|||
> If the bottleneck is storage bandwidth that's not "slow"

It is objectively slow: around 100X slower than what most people consider usable.

The quality is also degraded severely to get that speed.

> but the point of this is that you can run cheap inference in bulk on very low-end hardware.

You always could, if you didn't care about speed or efficiency.

zozbot234 2 hours ago||
You're simply pointing out that most people who use AI today expect interactive speeds. You're right that the point here is not raw power efficiency (having to read from storage will impact energy per operation, and datacenter-scale AI hardware beats edge hardware on that metric anyway), but the ability to repurpose cheaper, lesser-scale hardware is also compelling.
Terretta 4 hours ago|||
> very low-end hardware

iPhone 17 Pro outperforms AMD’s Ryzen 9 9950X per https://www.igorslab.de/en/iphone-17-pro-a19-pro-chip-uebert...

pinkgolem 3 hours ago||
In single threaded workloads, still impressive
t00 2 hours ago|||
/FIFY A year ago this would have been considered impossible. The software is moving faster than anyone's hardware assumptions.
ottah 4 hours ago|||
I mean, by any reasonable standard it still is. Almost any computer can run an LLM; it's just a matter of how fast, and 0.4 t/s (peak, before the first token) is not really considered running. It's a demo, but practically speaking entirely useless.
alephnerd 3 hours ago||
Devils advocate - this actually shows how promising TinyML and EdgeML capabilities are. SoCs comparable to the A19 Pro are highly likely to be commodified in the next 3-5 years in the same manner that SoCs comparable to the A13 already are.
iberator 3 hours ago||
Does the iPhone have some kind of hardware acceleration for neural networks/AI?
HardCodedBias 3 hours ago||
The power draw is going to be crazy (today).

Practical LLMs on mobile devices are at least a few years away.

pier25 5 hours ago||
https://xcancel.com/anemll/status/2035901335984611412
dang 4 hours ago|
Added to toptext. Thanks!