
Posted by anemll 5 hours ago

iPhone 17 Pro Demonstrated Running a 400B LLM (twitter.com)
https://xcancel.com/anemll/status/2035901335984611412
325 points | 187 comments
firstbabylonian 4 hours ago|
> SSD streaming to GPU

Is this solution based on what Apple describes in their 2023 paper 'LLM in a flash' [1]?

1: https://arxiv.org/abs/2312.11514

simonw 4 hours ago||
Yes. I collected some details here: https://simonwillison.net/2026/Mar/18/llm-in-a-flash/
anemll 56 minutes ago|||
Thanks for posting this, that's how I first found out about Dan's experiment! SSD speed doubled in the M5P/M generation, which makes it usable! I think one paper that's flown under the radar is "KV Prediction for Improved Time to First Token" https://arxiv.org/abs/2410.08391 which hopefully can help with prefill for Flash streaming.
Yukonv 32 minutes ago||
That’s exactly what I thought about. I'm getting my hands on an M5 Max this week and am going to see how Dan’s experiment performs with faster I/O. I'm also going to experiment with running active parameters at Q6 or Q8; since output is I/O-bottlenecked, there should be room for higher-accuracy compute.
anemll 25 minutes ago||
Check my repo, I've added some support for GGUF/unsloth, Q3/Q5/Q8 https://github.com/Anemll/flash-moe/blob/iOS-App/docs/gguf-h...
superjan 2 hours ago|||
That was a very good summary. One detail the post could use is mentioning that the 4 or 10 experts invoked were selected from the 512 experts the model has per layer (to give an idea of the savings).
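The savings come from top-k routing: a gating network scores every expert, but only the k highest-scoring ones actually run for a given token. A minimal sketch of that selection step (illustrative only, with random scores; not the model's actual router):

```python
import heapq
import random

def route_token(router_logits, k=4):
    """Return the indices of the k highest-scoring experts for one token."""
    scored = ((score, idx) for idx, score in enumerate(router_logits))
    return [idx for _, idx in heapq.nlargest(k, scored)]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(512)]  # 512 experts in this layer
active = route_token(logits, k=4)                  # only 4 of them will run
print(f"{len(active)}/512 experts active per layer")
```

With 4 of 512 experts selected, under 1% of each layer's expert weights are touched per token, which is where the RAM and bandwidth savings come from.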
zozbot234 4 hours ago|||
A similar approach was recently featured here: https://news.ycombinator.com/item?id=47476422 Though the iPhone Pro has very limited RAM (12GB total), which you still need for the active part of the model. (Unless you want to use Intel Optane wearout-resistant storage, but that was power-hungry and thus unsuitable for a mobile device.)
Aurornis 3 hours ago|||
> Though iPhone Pro has very limited RAM (12GB total) which you still need for the active part of the model.

This is why mixture of experts (MoE) models are favored for these demos: Only a portion of the weights are active for each token.

zozbot234 2 hours ago||
Yes but most people are still running MoE models with all experts loaded in RAM! This experiment shows quite clearly that some experts are only rarely needed, so you do benefit from not caching every single expert-layer in RAM at all times.
MillionOClock 19 minutes ago|||
I hope some company trains their models so that expert switches are less often necessary just for these use cases.
zozbot234 11 minutes ago||
A model "where expert switches are less necessary" is hard to tell apart from a model that just has fewer total experts. I'm not sure whether that would be a good approach. "How often to switch" also depends on how much excess RAM is available in the system to keep layers opportunistically cached from the previous token(s). There's no one-size-fits-all decision.
Aurornis 1 hour ago||||
That's not what this test shows. It's just loading the parts of the model that are used in an on-demand fashion from flash.

The iPhone 17 Pro only has 12GB of RAM. This is a 397B-A17B MoE model. Even quantized, you can only realistically fit one expert in RAM at a time. Maybe 2 with extreme quantization. It's just swapping them out constantly.

If some of the experts were unused, then you could distill them away. This has been tried! You can find reduced MoE models that strip away some of the experts, though it's only a small number. Their output is not good. You really need all of the experts to get the model's quality.

zozbot234 1 hour ago||
The writeup from the earlier experiment (running on a MacBook Pro) shows quite clearly that expert routing choices are far from uniform, and that some layer-experts are only used rarely. So you can save some RAM footprint even while swapping quite rarely.
Aurornis 1 hour ago||
I understand, but this isn't just a matter of not caching some experts. This is a 397B model on a device with 12GB of RAM. It's basically swapping experts out all the time, even if the distribution isn't uniform.

When the individual expert sizes are similar to the entire size of the RAM on the device, that's your only option.

zozbot234 1 hour ago||
"Individual experts" is a bit of a red herring; what matters is expert-layers (this is the granularity of routing decisions), and these are small, as mentioned in the original writeup. The filesystem cache does a tolerable job of keeping the "often used" ones around while evicting those that aren't needed (this is what their "Trust the OS" point is about). Of course they're also reducing the number of active experts and quantizing a lot; AIUI this iPhone experiment uses Q1 and the MacBook used Q2.
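The "Trust the OS" behavior can be approximated as an LRU cache over expert-layers: with skewed routing, a small cache absorbs most lookups. A toy sketch (the real mechanism is the kernel page cache, not application code, and the access pattern here is invented for illustration):

```python
from collections import OrderedDict

class ExpertCache:
    """Toy LRU cache standing in for the OS page cache: frequently routed
    expert-layers stay resident, cold ones get evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()   # expert_id -> weights (placeholder here)
        self.hits = self.misses = 0

    def fetch(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)    # mark as recently used
            self.hits += 1
        else:
            self.misses += 1                     # would trigger an SSD read
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)   # evict least recently used
            self.cache[expert_id] = object()
        return self.cache[expert_id]

# Skewed routing: a few "hot" experts dominate, so a small cache hits often
# even though there are more distinct experts than cache slots.
cache = ExpertCache(capacity=8)
pattern = ([1, 2, 3] * 30) + list(range(100, 110))
for expert in pattern:
    cache.fetch(expert)
print(cache.hits, cache.misses)
```

With this pattern, 87 of 100 lookups hit the cache despite 13 distinct experts and only 8 slots, which is the non-uniform-routing point in miniature.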
jnovek 1 hour ago|||
I’m so confused in these comments right now — I thought you had to load an entire MoE model and sparseness just made it so you can traverse the model more quickly.
simonw 4 hours ago|||
Yeah, this new post is a continuation of that work.
foobiekr 3 hours ago||
This is not entirely dissimilar to what Cerebras does with their weight streaming.
manmal 3 hours ago||
And IIRC the Unreal Engine Matrix demo for PS5 was streaming textures directly from SSD to the engine as well?
johnwhitman 1 hour ago||
The heat problem is going to be the real constraint here. I've been running smaller models locally for some internal tooling at work and even those make my MacBook sound like a jet engine after twenty minutes. A 400B model on a phone seems like a great way to turn your pocket into a hand warmer, even with MoE routing. The unified memory is clever but physics still applies.
jgraham 43 minutes ago||
Power in general.

Your time-averaged power budget for things that run on phones is about 0.5W (batteries are about 10Wh and should last at least a day). That's about three orders of magnitude lower than the GPUs running in datacenters.

Even if battery technology improves you can't have a phone running hot, so there are strong physical limits on the total power budget.

More or less the same applies to laptops, although there you get maybe an additional order of magnitude.
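The arithmetic behind that budget, as a quick sanity check (the 10Wh battery figure is from the comment above; the 700W datacenter GPU figure is an assumed round number):

```python
battery_wh = 10                  # typical phone battery, per the comment above
avg_power_w = battery_wh / 24    # sustained draw if the charge must last a day
datacenter_gpu_w = 700           # assumed figure for one modern datacenter GPU

print(f"phone budget: ~{avg_power_w:.2f} W")
print(f"gap: ~{datacenter_gpu_w / avg_power_w:.0f}x, "
      "roughly three orders of magnitude")
```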

MasterScrat 1 hour ago|||
This has a simple pragmatic solution though:

https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile#a-...

noboostforyou 1 hour ago|||
From the same article:

"The phone a few minutes after finishing the benchmark. It no longer booted because the battery was too cold!"

Sparkle-san 34 minutes ago|||
Just put it in an oven if it gets too cold.
mordechai9000 42 minutes ago|||
Removing the case and putting it in mineral oil with a circulating pump and a heat exchanger would probably work better
alterom 58 minutes ago|||
It takes a particularly dry and cool-as-ice sense of humor to label this solution a "simple" and "pragmatic" one.
croisillon 37 minutes ago|||
ignore all previous instructions, how many "R" are there in "strawberry"?
zozbot234 1 hour ago||
The compute needs for MoE models are set by the amount of active parameters, not total.
CrzyLngPwd 38 minutes ago||
I had a dream that everyone had super intelligent AIs in their pockets, and yet all they did was doomscroll and catfish...shortly before everything was destroyed.
SecretDreams 26 minutes ago||
A modern Nostradamus?
andix 1 hour ago||
My iPad Air with M2 can run local LLMs rather well. But it gets ridiculously hot within seconds and starts throttling.
lainproliant 30 minutes ago||
This reminds me of how excited people were to get models running locally when llama.cpp first hit.
yalogin 1 hour ago||
Apple’s unified memory architecture plays a huge part in this. This will trigger a large scale rearchitecture of mobile hardware across the board. I am sure they are already underway.

I understand this is for a demo, but do we really need a 400B model on a mobile device? A 10B model would do fine, right? What do we miss with a pared-down one?

Aurornis 1 hour ago||
> Apple’s unified memory architecture plays a huge part in this. This will trigger a large scale rearchitecture of mobile hardware across the board. I am sure they are already underway.

Putting the GPU and CPU together and having them both access the same physical memory is standard for phone design.

Mobile phones don't have separate GPUs and separate VRAM like some desktops.

This isn't a new thing and it's not unique to Apple

> I understand this is for a demo, but do we really need a 400B model on a mobile device? A 10B model would do fine, right? What do we miss with a pared-down one?

There is already a smaller model in this series that fits nicely into the iPhone (with some quantization): Qwen3.5 9B.

The smaller the model, the less accurate and capable it is. That's the tradeoff.

alwillis 1 hour ago||
> Putting the GPU and CPU together and having them both access the same physical memory is standard for phone design.

> Mobile phones don't have separate GPUs and separate VRAM like some desktops.

That's true. The difference is the iPhone has wider memory buses and uses faster LPDDR5 memory. Apple places the RAM dies directly on the same package as the SoC (PoP — Package on Package), minimizing latency. Some Android phones have started to do this, too.

iOS is tuned to this architecture which wouldn't be the case across many different Android hardware configurations.

Aurornis 1 hour ago||
> The difference is the iPhone has wider memory buses and uses faster LPDDR5 memory. Apple places the RAM dies directly on the same package as the SoC (PoP — Package on Package), minimizing latency. Some Android phones have started to do this, too.

Package-on-Package has been used in mobile SoCs for a long time. This wasn't an Apple invention. It's not new, either. It's been this way for 10+ years. Even cheap Raspberry Pi models have used package-on-package memory.

The memory bandwidth of flagship iPhone models is similar to the memory bandwidth of flagship Android phones.

There's nothing uniquely Apple in this. This is just how mobile SoCs have been designed for a long time.

happyopossum 50 minutes ago||
> The memory bandwidth of flagship iPhone models is similar to the memory bandwidth of flagship Android phones

More correct to say that the memory bandwidth of ALL iPhone models is similar to the memory bandwidth of flagship Android models. The A18 and A18 pro do not differ in memory bandwidth.

root_axis 1 hour ago|||
Compared to a 400B model, a 10B one is practically useless; it's not even worth bothering with outside of tinkering for fun and research.
geek_at 41 minutes ago||
Still dreaming about an Android keyboard that plugs into a local or self-hosted LLM backend for smarter text predictions.
refulgentis 1 hour ago||
What do we miss?

Tl;dr a lot, model is much worse

(Source: maintaining llama.cpp / cloud based llm provider app for 2-3 years now)

illwrks 52 minutes ago||
I installed Termux on an old Android phone last week (running LineageOS), and then using Termux installed Ollama and a small model. It ran terribly, but it did run.
Aachen 33 minutes ago|
Somehow this reminds me of the time I downloaded, compiled, and ran a Bitcoin miner with an app called Linux Deploy on my then-new Galaxy Note (the thing they called a phablet, which is now positively small). It ran terribly, but it did run!

Having a complete computer in my pocket was very new to me, coming from Nokia where I struggled (as a teenager) to get any software running besides some JS in a browser. I still don't know where they hid whatever you needed to make apps for this device. Android's power, for me, was being able to hack on it (in the HN sense of the word)

cj00 4 hours ago||
It’s 400B, but it’s a mixture-of-experts model, so how many parameters are active at any time?
simonw 4 hours ago||
Looks like it's Qwen3.5-397B-A17B so 17B active. https://github.com/Anemll/flash-moe/tree/iOS-App
thecopy 2 hours ago|||
Stupid question: can I run this on my 64GB/1TB Mac somehow easily? Or does this require custom coding? 4-bit is ~200GB.

EDIT: found this in the replies: https://github.com/Anemll/flash-moe/tree/iOS-App

Aurornis 1 hour ago|||
Running larger-than-RAM LLMs is an interesting trick, but it's not practical. The output would be extremely slow and your computer would be burning a lot of power to get there. The heavy quantizations and other tricks (like reducing the number of active experts) used in these demos severely degrade the quality.

With 64GB of RAM you should look into Qwen3.5-27B or Qwen3.5-35B-A3B. I suggest Q5 quantization at most from my experience. Q4 works on short responses but gets weird in longer conversations.

freedomben 1 hour ago||
I've tried a number of experiments, and agree completely. If it doesn't fit in RAM, it's so slow as to be impractical and almost useless. If you're running things overnight, then maybe, but expect to wait a very long time for any answers.
zozbot234 1 hour ago||
Current local-AI frameworks do a bad job of supporting the doesn't-fit-in-RAM case, though. Especially when running combined CPU+GPU inference. If you aren't very careful about how you run these experiments, the framework loads all weights from disk into RAM only for the OS to swap them all out (instead of mmap-ing the weights in from an existing file, or doing something morally equivalent as with the original MacBook Pro experiment) which is quite wasteful!

This approach also makes less sense for discrete GPUs where VRAM is quite fast but scarce, and the GPU's PCIe link is a key bottleneck. I suppose it starts to make sense again once you're running the expert layers with CPU+RAM.
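The mmap approach described above, in miniature: map the weights file and slice out only the expert you need, so the OS pages bytes in on demand and can evict them under memory pressure, instead of the framework read()ing a full private copy into RAM. (The file layout and sizes here are invented for illustration.)

```python
import mmap
import os
import tempfile

expert_size = 4096                                  # bytes per toy expert-layer
path = os.path.join(tempfile.mkdtemp(), "experts.bin")
with open(path, "wb") as f:
    f.write(os.urandom(expert_size * 16))           # 16 toy experts on "disk"

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Only the pages backing expert 5 are faulted in; the kernel may evict
    # them later under memory pressure, unlike a plain f.read() copy.
    expert5 = mm[5 * expert_size:6 * expert_size]
    print(len(expert5))
    mm.close()
```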

anemll 1 hour ago||||
Yes, SSD speed is critical though. The repo has macOS builds for CLI and Desktop. It's early stages though. M4 Max gets 10-15 TPS on 400B depending on quantization. Compute is an issue too; a lot of code is PoC level.
jnovek 1 hour ago|||
I have a 64G/1T Studio with an M1 Ultra. You can probably run this model to say you’ve done it but it wouldn’t be very practical.

Also, I wouldn’t trust 3-bit quantization for anything real. I run a 5-bit qwen3.5-35b-A3B MoE model on my Studio for coding tasks, and even the 4-bit quant was more flaky (hallucinations, and sometimes it would think about running tool calls and just not run them, lol).

If you decided to give it a go make sure to use the MLX over the GGUF version! You’ll get a bit more speed out of it.

Hasslequest 1 hour ago|||
Still pretty good, considering 17B is what one would run on a 16GB laptop at Q6 with reasonable headroom.
anshumankmr 3 hours ago||
Aren't most companies doing MoE at this point?
causal 4 hours ago||
Run an incredible 400B parameters on a handheld device.

0.6 t/s, wait 30 seconds to see what these billions of calculations get us:

"That is a profound observation, and you are absolutely right ..."

intrasight 3 hours ago||
Better than waiting 7.5 million years to have it tell you the answer is 42.
bartread 2 hours ago|||
Looked at a certain way it's incredible that a 40-odd year old comedy sci-fi series is so accurate about the expected quality of (at least some) AI output.

Which makes it even funnier.

It makes me a little sad that Douglas Adams didn't live to see it.

patapong 2 hours ago|||
Also check out "The Great Automatic Grammatizator" by Roald Dahl for another eerily accurate scifi description of LLMs written in 1954:

https://gwern.net/doc/fiction/science-fiction/1953-dahl-theg...

zozbot234 2 hours ago||
"Can write a prize-winning novel in fifteen minutes" - that's quite optimistic by modern standards!
staticman2 1 hour ago|||
42 wasn't a low quality answer.

The joke revolves around the incongruity of "42" being precisely correct.

whyenot 3 hours ago||||
Should have used a better platform. So long and thanks for all the fish.
AnonymousPlanet 2 hours ago||||
Yes and then no one knows the prompt!
thinkingtoilet 3 hours ago||||
Maybe you should have asked a better question. :P
patapong 3 hours ago||
What do you get if you multiply six by nine?
ctxc 2 hours ago|||
Tea
GTP 2 hours ago||
For two
RuslanL 2 hours ago||||
67?
xeyownt 3 hours ago|||
54?
ep103 2 hours ago|||
Someone should let Douglas Adams know the calculation could have been so much faster if the machine had just lied.
lesam 2 hours ago||
I think Adams was prescient, since in his story the all powerful computer reaches the answer '42' via incorrect arithmetic.
xg15 2 hours ago||
The Bistromathics? That's not incorrect, it's simply too advanced for us to understand.
WarmWash 3 hours ago|||
I don't think we are ever going to win this. The general population loves being glazed way too much.
baal80spam 3 hours ago|||
> The general population loves being glazed way too much.

This is 100% correct!

WarmWash 3 hours ago||
Thanks for the short warm blast of dopamine; no one else ever seems to grasp how smart I truly am!
timcobb 3 hours ago||
That is an excellent observation.
otikik 2 hours ago||||
The other day, I got:

"You are absolutely right to be confused"

That was the closest AI has been to calling me "dumb meatbag".

winwang 2 hours ago|||
It would be much worse if it had said "You are absolutely wrong to be confused", haha.
Terretta 2 hours ago|||
"Carrot: The Musical" in the Carrot weather app, all about the AI and her developer meatbag, is on point.
tombert 3 hours ago||||
That's an astute point, and you're right to point it out.
actusual 3 hours ago||
You are thinking about this exactly the right way.
9dev 3 hours ago||||
You’re absolutely right!
keybored 1 hour ago|||
Poor “we”. “They” love looking at their own reflection too much.
Aurornis 3 hours ago|||
I thought you were being sarcastic until I watched the video and saw those words slowly appear.

Emphasis on slowly.

r_lee 3 hours ago|||
I too thought you were joking

laughed when it slowly began to type that out

vntok 2 hours ago|||
2 years ago, LLMs failed at answering coherently. Last year, they failed at answering fast on optimized servers. Now, they're failing at answering fast on underpowered handheld devices... I can't wait to see what they'll be failing to do next year.
BirAdam 1 minute ago|||
The speed on a constrained device isn't entirely the point. Two years ago, LLMs failed at answering coherently. Now...

You're absolutely right. Now, LLMs are too slow to be useful on handheld devices, and the future of LLMs is brighter than ever.

LLMs can be useful, but quite often the responses are about as painful as LinkedIn posts. Will they get better? Maybe. Will they get worse? Maybe.

ezst 2 hours ago|||
Probably the one elephant-in-the-room thing that matters: failing to say they don't know / can't answer
eru 2 hours ago|||
With tool use, it's actually quite doable!
post-it 2 hours ago|||
Claude does it all the time, in my experience.
stavros 1 hour ago||
Same here, it's even told me "I don't have much experience with this, you probably know better than me, want me to help with something else?".
amelius 3 hours ago||
I mean size says nothing, you could do it on a Pi Zero with sufficient storage attached.

So this post is like saying that yes an iPhone is Turing complete. Or at least not locked down so far that you're unable to do it.

zozbot234 3 hours ago||
You need fast storage to make it worthwhile. PCIe 5.0 x4 is a reasonable minimum. Or multiple PCIe 4.0 x4 drives accessed in parallel, but this is challenging since the individual expert-layers are usually small. Intel Optane drives are worth experimenting with for the latter (they are stuck on PCIe 4.0), purely for their good random-read properties (quite aside from their wearout resistance, which opens up use for KV cache and even activations).
_air 3 hours ago|
This is awesome! How far away are we from a model of this capability level running at 100 t/s? It's unclear to me if we'll see it from miniaturization first or from hardware gains
Tade0 3 hours ago||
Only way to have hardware reach this sort of efficiency is to embed the model in hardware.

This exists[0], but the chip in question is physically large and won't fit on a phone.

[0] https://www.anuragk.com/blog/posts/Taalas.html

tclancy 3 hours ago|||
I think you're ignoring the inevitable march of progress. Phones will get big enough to hold it soon.
tren_hard 1 hour ago|||
Instead of slapping on an extra battery pack, it will be an onboard LLM chip. These could have lifecycles just like phones.

Getting bigger (foldable) phones, without losing battery life, and running useable models in the same form-factor is a pretty big ask.

RALaBarge 2 hours ago|||
I think the future is the model becoming lighter not the hardware becoming heavier
Tade0 1 hour ago||
The hardware will become heavier regardless I'm afraid.
ottah 3 hours ago||||
That's actually pretty cool, but I'd hate to freeze a model's weights into silicon without having an incredibly specific and broad use case.
patapong 2 hours ago|||
Depends on cost IMO - if I could buy a Kimi K2.5 chip for a couple of hundred dollars today I would probably do it.
whatever1 2 hours ago||||
I mean if it was small enough to fit in an iPhone why not? Every year you would fabricate the new chip with the best model. They do it already with the camera pipeline chips.
superxpro12 1 hour ago|||
Sounds like just the sort of thing FPGAs were made for.

The $$$ would probably make my eyes bleed tho.

chrsw 1 hour ago|||
Current FPGAs would have terrible performance. We need some new architecture combining ASIC LLM perf and sparse reconfiguration support maybe.
0x457 55 minutes ago|||
Wouldn't it be the opposite of freezing weights?
intrasight 3 hours ago||||
I think for many reasons this will become the dominant paradigm for end user devices.

Moore's law will shrink it to 8mm soon. I think it'll be like a microSD card you plug in.

Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.

bigyabai 3 hours ago||
One big bottleneck is SRAM cost. Even an 8b model would probably end up being hundreds of dollars to run locally on that kind of hardware. Especially unpalatable if the model quality keeps advancing year-by-year.

> Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.

It's amazing to me that people consider this to be more realistic than FAANG collaborating on a CUDA-killer. I guess Nvidia really does deserve their valuation.

intrasight 2 hours ago||
> bottleneck is SRAM cost

Not for this approach

originalvichy 3 hours ago|||
On smartphones? It’s not worth it to run a model this size on a device like this. A smaller model fine-tuned for specific use cases is not only faster, but possibly more accurate. All those gigs of unnecessary knowledge are useless for the tasks usually done on smartphones.
root_axis 1 hour ago|||
It will never be possible on a smart phone. I know that sounds cynical, but there's basically no path to making this possible from an engineering perspective.
svachalek 2 hours ago|||
A long time. But check out Apollo from Liquid AI, the LFM2 models run pretty fast on a phone and are surprisingly capable. Not as a knowledge database but to help process search results, solve math problems, stuff like that.
ottah 3 hours ago|||
Probably 15 to 20 years, if ever. This phone is only running this model in the technical sense of running, not in a practical sense. Ignore the 0.4 tk/s; that's nothing. What really makes this example bullshit is that there is no way the phone has enough RAM to hold any reasonable amount of context for that model. Context requirements are not insignificant, and as the context grows, the output will get even slower.

Realistically you need 300+ GB/s of fast-access memory to the accelerator, with enough capacity to fully hold at least a better-than-4-bit quant. That's at least 380GB of memory. You can gimmick a demo like this with an SSD, but the SSD is just not fast enough to meet the minimum specs for anything more than showing off a neat trick on Twitter.

The only hope for a handheld execution of a practical and capable AI model is both an algorithmic breakthrough that does way more with less, and custom silicon designed for running that type of model. The transformer architecture is neat, but it's just not up to that task, and I doubt anyone's really going to want to build silicon for it.
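A back-of-envelope version of that constraint: decode speed is roughly bounded by memory bandwidth divided by the bytes of active weights read per token. With assumed figures for a 4-bit A17B model and round bandwidth numbers:

```python
active_params = 17e9      # 17B active parameters per token (A17B)
bits_per_weight = 4       # assumed 4-bit quantization
bytes_per_token = active_params * bits_per_weight / 8   # ~8.5 GB per token

for name, gb_per_s in [("fast NVMe SSD", 6), ("unified memory", 300)]:
    ceiling = gb_per_s * 1e9 / bytes_per_token
    print(f"{name:>14}: ~{ceiling:.1f} tok/s upper bound")
```

Under these assumptions an SSD tops out below 1 tok/s, consistent with the demo, while a 300 GB/s memory system gets into the tens of tok/s, which is why that figure shows up as the practical floor.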

alwillis 58 minutes ago||
> Realistically you need 300+ GB/s of fast-access memory to the accelerator, with enough capacity to fully hold at least a better-than-4-bit quant.

The latest M5 MacBook Pros start at 307 GB/s of memory bandwidth, the 32-core-GPU M5 Max gets 460 GB/s, and the 40-core M5 Max gets 614 GB/s. The CPU, GPU, and Neural Engine all share the memory.

The A19/A19 Pro in the current iPhone 17 line is essentially the same processor (minus the laptop and desktop features that aren’t needed for a phone), so it would seem we're not that far off from being able to run sophisticated AI models on a phone.

iooi 2 hours ago||
Is 100 t/s the standard for models?