Posted by anemll 5 hours ago
Is this solution based on what Apple describes in their 2023 paper 'LLM in a flash' [1]?
This is why mixture of experts (MoE) models are favored for these demos: Only a portion of the weights are active for each token.
The iPhone 17 Pro only has 12GB of RAM. This is a ~17B MoE model. Even quantized, you can realistically fit only one expert in RAM at a time, maybe two with extreme quantization. It's just swapping them out constantly.
If some of the experts were unused, you could distill them away. This has been tried! You can find reduced MoE models that strip away some of the experts, though only a small number. Their output is not good. You really need all of the experts to get the model's full quality.
When the individual expert sizes are similar to the entire size of the RAM on the device, that's your only option.
Your time-averaged power budget for things that run on phones is about 0.5W (batteries are about 10Wh and should last at least a day). That's about three orders of magnitude lower than the GPUs running in datacenters.
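Back-of-envelope on that budget (the 700W datacenter GPU figure is my assumption for comparison):

```python
# Time-averaged power budget for a phone: battery capacity / required runtime.
battery_wh = 10           # typical phone battery, ~10 Wh
target_hours = 24         # should last at least a day
budget_w = battery_wh / target_hours
print(f"{budget_w:.2f} W")                     # 0.42 W

datacenter_gpu_w = 700    # assumption: a modern datacenter GPU draws ~700 W
print(f"{datacenter_gpu_w / budget_w:.0f}x")   # vs. the phone budget
```

So the gap really is on the order of a thousand-fold before you even get to thermals.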
Even if battery technology improves you can't have a phone running hot, so there are strong physical limits on the total power budget.
More or less the same applies to laptops, although there you get maybe an additional order of magnitude.
https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile#a-...
"The phone a few minutes after finishing the benchmark. It no longer booted because the battery was too cold!"
I understand this is for a demo, but do we really need a 400B model on a phone? A 10B model would do fine, right? What do we miss with a pared-down one?
Putting the GPU and CPU together and having them both access the same physical memory is standard for phone design.
Mobile phones don't have separate GPUs and separate VRAM like some desktops.
This isn't a new thing, and it's not unique to Apple.
> I understand this is for a demo, but do we really need a 400B model on a phone? A 10B model would do fine, right? What do we miss with a pared-down one?
There is already a smaller model in this series that fits nicely into the iPhone (with some quantization): Qwen3.5 9B.
The smaller the model, the less accurate and capable it is. That's the tradeoff.
> Mobile phones don't have separate GPUs and separate VRAM like some desktops.
That's true. The difference is the iPhone has wider memory buses and uses faster LPDDR5 memory. Apple places the RAM dies directly on the same package as the SoC (PoP — Package on Package), minimizing latency. Some Android phones have started to do this, too.
iOS is tuned for this specific architecture, which wouldn't be possible across the many different Android hardware configurations.
Package-on-Package has been used in mobile SoCs for a long time. This wasn't an Apple invention. It's not new, either. It's been this way for 10+ years. Even cheap Raspberry Pi models have used package-on-package memory.
The memory bandwidth of flagship iPhone models is similar to the memory bandwidth of flagship Android phones.
There's nothing uniquely Apple in this. This is just how mobile SoCs have been designed for a long time.
More correct to say that the memory bandwidth of ALL iPhone models is similar to the memory bandwidth of flagship Android models. The A18 and A18 pro do not differ in memory bandwidth.
Tl;dr: a lot; the model is much worse.
(Source: maintaining llama.cpp / cloud based llm provider app for 2-3 years now)
Having a complete computer in my pocket was very new to me, coming from Nokia where I struggled (as a teenager) to get any software running besides some JS in a browser. I still don't know where they hid whatever you needed to make apps for this device. Android's power, for me, was being able to hack on it (in the HN sense of the word)
EDIT: found this in the replies: https://github.com/Anemll/flash-moe/tree/iOS-App
With 64GB of RAM you should look into Qwen3.5-27B or Qwen3.5-35B-A3B. I suggest Q5 quantization at most from my experience. Q4 works on short responses but gets weird in longer conversations.
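Rough sizing, if it helps (the per-weight bit counts are simplified and the 10% overhead is my assumption; real Q4/Q5 GGUF files use mixed quant types, so treat these as ballpark figures):

```python
def quant_size_gb(params_b, bits, overhead=1.1):
    """Rough quantized model size: params * bits/8, plus ~10% for
    embeddings/metadata (assumption, not an exact GGUF accounting)."""
    return params_b * 1e9 * bits / 8 * overhead / 1e9

for name, p in [("Qwen3.5-27B", 27), ("Qwen3.5-35B-A3B", 35)]:
    for bits in (4, 5):
        print(f"{name} Q{bits}: ~{quant_size_gb(p, bits):.0f} GB")
```

Either one at Q5 lands around 19–24 GB, so with 64GB you have plenty of headroom for KV cache and the OS.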
This approach also makes less sense for discrete GPUs where VRAM is quite fast but scarce, and the GPU's PCIe link is a key bottleneck. I suppose it starts to make sense again once you're running the expert layers with CPU+RAM.
Also I wouldn’t trust 3-bit quantization for anything real. I run a 5-bit Qwen3.5-35B-A3B MoE model on my Studio for coding tasks, and even the 4-bit quant was flakier (hallucinations, and sometimes it would think about running tool calls and just not run them, lol).
If you decide to give it a go, make sure to use the MLX version over the GGUF one! You’ll get a bit more speed out of it.
0.6 t/s, wait 30 seconds to see what these billions of calculations get us:
"That is a profound observation, and you are absolutely right ..."
Which makes it even funnier.
It makes me a little sad that Douglas Adams didn't live to see it.
https://gwern.net/doc/fiction/science-fiction/1953-dahl-theg...
The joke revolves around the incongruity of "42" being precisely correct.
This is 100% correct!
"You are absolutely right to be confused"
That was the closest AI has been to calling me "dumb meatbag".
Emphasis on slowly.
laughed when it slowly began to type that out
You're absolutely right. Now, LLMs are too slow to be useful on handheld devices, and the future of LLMs is brighter than ever.
LLMs can be useful, but quite often the responses are about as painful as LinkedIn posts. Will they get better? Maybe. Will they get worse? Maybe.
So this post is like saying that yes, an iPhone is Turing complete. Or at least not locked down so far that you're unable to do it.
This exists[0], but the chip in question is physically large and won't fit on a phone.
Getting bigger (foldable) phones without losing battery life, and running usable models in the same form factor, is a pretty big ask.
The $$$ would probably make my eyes bleed tho.
Moore's law will shrink it to 8mm soon. I think it'll be like a microSD card you plug in.
Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.
> Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.
It's amazing to me that people consider this to be more realistic than FAANG collaborating on a CUDA-killer. I guess Nvidia really does deserve their valuation.
Not for this approach
Realistically you need 300+ GB/s of fast-access memory attached to the accelerator, with enough capacity to fully hold at least 4-bit quants. That's at least 380GB of memory. You can gimmick a demo like this with an SSD, but the SSD just isn't fast enough to meet the minimum specs for anything more than showing off a neat trick on Twitter.
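The SSD math is easy to sketch (the active-parameter count and SSD bandwidth here are my assumptions, just for illustration):

```python
# Upper bound on decode speed when expert weights stream from SSD:
# every token has to read its active weights off the drive.
active_params = 17e9       # assumption: active params per token for a big MoE
bits = 4
bytes_per_token = active_params * bits / 8   # weight bytes touched per token

ssd_bw = 4e9               # assumption: ~4 GB/s sustained NVMe read
tps = ssd_bw / bytes_per_token
print(f"upper bound: {tps:.2f} tokens/s")    # 0.47
```

Sub-1 token/s as a hard ceiling, before any compute cost, which matches what the demo shows.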
The only hope for handheld execution of a practical and capable AI model is both an algorithmic breakthrough that does way more with less and custom silicon designed for running that type of model. The transformer architecture is neat, but it's just not up to the task, and I doubt anyone's really going to want to build silicon for it.
The latest M5 MacBook Pros start at 307 GB/s of memory bandwidth, the 32-core-GPU M5 Max gets 460 GB/s, and the 40-core M5 Max gets 614 GB/s. The CPU, GPU, and Neural Engine all share the same memory.
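When the weights are resident, memory bandwidth sets the decode ceiling: each token reads the active weights once. A sketch using those M5 figures (the 3B-active model and 4-bit quant are my assumptions):

```python
# Bandwidth-bound ceiling on tokens/s: bandwidth / (active weight bytes per token).
def max_tps(bandwidth_gbs, active_params_b, bits=4):
    bytes_per_token = active_params_b * 1e9 * bits / 8
    return bandwidth_gbs * 1e9 / bytes_per_token

# M5 family bandwidths from above; a 3B-active MoE (e.g. an A3B model) at 4-bit:
for name, bw in [("M5", 307), ("M5 Max 32-core", 460), ("M5 Max 40-core", 614)]:
    print(f"{name}: ~{max_tps(bw, 3):.0f} tokens/s max")
```

Hundreds of tokens/s in theory, which is why unified memory with high bandwidth matters so much more than raw FLOPS for decode.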
The A19/A19 Pro in the current iPhone 17 line is essentially the same processor (minus the laptop and desktop features that aren’t needed for a phone), so it would seem we're not that far off from being able to run sophisticated AI models on a phone.