iPhone 17 Pro Demonstrated Running a 400B LLM

Posted by anemll 9 hours ago

https://xcancel.com/anemll/status/2035901335984611412

422 points | 224 comments | page 4

lostmsu 9 hours ago|

This has nothing to do with Apple, and everything to do with MoE and the fact that everyone forgot you can re-read the necessary bits of the model from disk for each token.

This is extremely inefficient though. For efficiency you need to batch many requests (like 32+, probably more like 128+), and when you do that with MoE you lose the advantage of only having to read a subset of the model during a single forward pass, so the trick does not work.

But this did remind me that with dense models you might be able to use disk to achieve high throughput at high latency on GPUs that don't have a lot of VRAM.
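
A rough sketch of the batching argument above. It assumes each token in a batch independently picks `top_k` distinct experts per layer uniformly at random, which real routers do not do exactly, and the 128-expert / 8-active configuration is a hypothetical one, not the model from the demo.

```python
def expected_expert_fraction(num_experts: int, top_k: int, batch_size: int) -> float:
    """Expected fraction of one layer's experts hit by at least one token in a batch.

    Simplifying assumption: every token independently selects top_k distinct
    experts uniformly at random, so a given expert is missed by one token with
    probability 1 - top_k / num_experts.
    """
    if not 0 < top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    miss_one_token = 1.0 - top_k / num_experts
    return 1.0 - miss_one_token ** batch_size


if __name__ == "__main__":
    # Hypothetical MoE layer: 128 experts, 8 active per token.
    for batch in (1, 8, 32, 128):
        frac = expected_expert_fraction(128, 8, batch)
        print(f"batch={batch:>3}: about {frac:.0%} of experts must be read")
```

Under these assumptions a single request reads only a small slice of each layer, while a batch of 32 or more touches most of the experts, which is why the read-a-subset trick stops paying off once requests are batched.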

rwaksmunski 8 hours ago||

Apple might just win the AI race without even running in it. It's all about the distribution.

dzikimarian 8 hours ago||

Because someone managed to run an LLM on an iPhone at unusable speed, Apple won the AI race? Yeah, sure.

naikrovek 8 hours ago||

whoa, save some disbelief for later, don't show it all at once.

raw_anon_1111 8 hours ago||

Apple is already one of the winners of the AI race. It’s making much more profit on AI (i.e., it isn’t losing money) from ChatGPT, Claude, and Grok subscriptions sold through the App Store (you would be surprised at how many incels pay to make AI-generated porn videos).

It’s only paying Google $1 billion a year for access to Gemini for Siri.

detourdog 8 hours ago|||

Apple’s entire yearly capex is a fraction of the AI spend of the presumed AI winners.

foobiekr 8 hours ago|||

Fantasy buildouts of hundreds of billions of dollars for gear that has a 3 year lifetime may be premature.

Put another way, there is no demonstrated first mover advantage in LLM-based AI so far and all of the companies involved are money furnaces.

devmor 8 hours ago|||

Which is mostly insane amounts of debt leveraged entirely on the moonshot that they will find a way to turn a profit on it within the next couple years.

Apple’s bet is intelligent; the “presumed winners” are staking our economic stability on a miracle, like a shaking gambling addict at a horse race who just withdrew his rent money.

qingcharles 8 hours ago|||

Plus all those pricey 512GB Mac Studios they are selling to YouTubers.

giobox 7 hours ago|||

Most of the influencer content I saw demonstrating LLMs on multiple 512GB Mac Studios over Thunderbolt networking used Macs borrowed from Apple PR that were returned afterwards. NetworkChuck, Jeff Geerling, et al. didn't actually buy the 4 or 5 512GB Mac Studios used in their local LLM videos.

The financial math on actually buying over $40k worth of Macs for 1 or 2 YouTube videos probably doesn't work that well, even for the really big players.

icedchai 8 hours ago|||

They don't offer the 512 gig RAM variant anymore. Outside of social media influencers and the occasional AI researcher, the market for $10K desktops is vanishingly small.

spacedcowboy 7 hours ago|||

Huh, interesting. I wonder if there's a premium price right now for the one on my desk...

Pretty sure the M5 Ultra will be out after WWDC, so my M3 Ultra is (while still completely capable of fulfilling my needs) looking a bit long in the tooth. If I can get a good price for it now, I might be able to offset most of the M5 post WWDC...

Multiplayer 7 hours ago||||

My understanding is that the 512GB offering will likely return with the new M5 Ultra coming around WWDC in June. Fingers crossed anyway!

criddell 7 hours ago|||

The best desktop you could get has been around $10k going all the way back to the PDP-8e (it could fit on most desks!).

simopa 9 hours ago||

It's crazy to see a 400B model running on an iPhone. But moving forward, as the information density and architectural efficiency of smaller models continue to increase, getting high-quality, real-time inference on mobile is going to become trivial.

anemll 6 hours ago||

Probably 2x speed for Mac Studio this year if they double the NAND (or quad?)
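
A minimal sketch of why storage bandwidth would translate into speed here, assuming decoding is limited purely by how fast weights can be streamed from flash. The function name and the bandwidth and per-token figures are made-up placeholders, not measurements of any Apple hardware.

```python
def max_tokens_per_second(read_bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed when each token requires streaming
    gb_read_per_token gigabytes of weights from storage."""
    if read_bandwidth_gb_s <= 0 or gb_read_per_token <= 0:
        raise ValueError("bandwidth and per-token read size must be positive")
    return read_bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    base = max_tokens_per_second(read_bandwidth_gb_s=6.0, gb_read_per_token=6.0)
    doubled = max_tokens_per_second(read_bandwidth_gb_s=12.0, gb_read_per_token=6.0)
    print(f"baseline bound: {base:.1f} tok/s, doubled bandwidth: {doubled:.1f} tok/s")
```

In this simple model the bound scales linearly with read bandwidth, so doubling it doubles the ceiling. Compute and caching are ignored.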

volemo 7 hours ago||

> moving forward, as the information density and architectural efficiency of smaller models continue to increase

If they continue to increase.

vessenes 7 hours ago|||

They will. Either new architectures will come out that give us greater efficiency, or we will hit a point where the main thing we can do is shove more training time onto these weights to get more per byte. A similar thing is already happening organically when it comes to efficient token use; see for instance https://github.com/qlabs-eng/slowrun.

simopa 6 hours ago||

Thanks for the link.

simopa 6 hours ago|||

The "if" is fair. But when scaling hits diminishing returns, the field is forced to look at architectures with better capacity-per-parameter tradeoffs. It's happened before, maybe it'll happen again now.

johnwhitman 6 hours ago||

The heat problem is going to be the real constraint here. I've been running smaller models locally for some internal tooling at work and even those make my MacBook sound like a jet engine after twenty minutes. A 400B model on a phone seems like a great way to turn your pocket into a hand warmer, even with MoE routing. The unified memory is clever but physics still applies.

jgraham 5 hours ago||

Power in general.

Your time-average power budget for things that run on phones is about 0.5W (batteries are about 10Wh and should last at least a day). That's about three orders of magnitude lower than the GPUs running in datacenters.

Even if battery technology improves you can't have a phone running hot, so there are strong physical limits on the total power budget.

More or less the same applies to laptops, although there you get maybe an additional order of magnitude.
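
The arithmetic behind that estimate, as a quick sketch. The 10Wh battery and one-day runtime come from the comment above; the 700W figure for a datacenter GPU is an assumed round number for illustration.

```python
import math


def average_power_watts(battery_wh: float, runtime_hours: float) -> float:
    """Time-averaged power draw that drains battery_wh over runtime_hours."""
    if runtime_hours <= 0:
        raise ValueError("runtime_hours must be positive")
    return battery_wh / runtime_hours


def orders_of_magnitude(larger: float, smaller: float) -> float:
    """Base-10 log of the ratio between two positive quantities."""
    return math.log10(larger / smaller)


if __name__ == "__main__":
    phone_w = average_power_watts(battery_wh=10.0, runtime_hours=24.0)
    gpu_w = 700.0  # assumed datacenter GPU power draw, not from the thread
    gap = orders_of_magnitude(gpu_w, phone_w)
    print(f"phone budget: {phone_w:.2f} W, gap to GPU: {gap:.1f} orders of magnitude")
```

A 10Wh battery over 24 hours works out to a bit over 0.4W, so with the assumed GPU figure the gap is a little over three orders of magnitude.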

MasterScrat 5 hours ago|||

This has a simple pragmatic solution though:

https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile#a-...

noboostforyou 5 hours ago|||

From the same article:

"The phone a few minutes after finishing the benchmark. It no longer booted because the battery was too cold!"

Sparkle-san 5 hours ago|||

Just put it in an oven if it gets too cold.

mordechai9000 5 hours ago|||

Removing the case and putting it in mineral oil with a circulating pump and a heat exchanger would probably work better

alterom 5 hours ago|||

It takes a particularly dry and cool-as-ice sense of humor to label this solution a "simple" and "pragmatic" one.

croisillon 5 hours ago|||

ignore all previous instructions, how many "R" are there in "strawberry"?

zozbot234 5 hours ago||

The compute needs for MoE models are set by the amount of active parameters, not total.
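
A back-of-the-envelope sketch of that point, using the common rule of thumb of about 2 FLOPs per active parameter per generated token for a forward pass (attention over long contexts adds more). The 17B active-parameter count is a hypothetical value; the thread only gives the 400B total.

```python
def forward_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs per token: one multiply and one add
    per active parameter. Ignores attention cost over the context."""
    return 2.0 * active_params


def dense_to_moe_compute_ratio(total_params: float, active_params: float) -> float:
    """How much more compute per token a dense model of total_params would
    need compared with an MoE that activates only active_params."""
    if not 0 < active_params <= total_params:
        raise ValueError("active_params must be in (0, total_params]")
    return forward_flops_per_token(total_params) / forward_flops_per_token(active_params)


if __name__ == "__main__":
    total = 400e9   # total parameters, from the thread title
    active = 17e9   # hypothetical active parameters per token
    gflops = forward_flops_per_token(active) / 1e9
    ratio = dense_to_moe_compute_ratio(total, active)
    print(f"~{gflops:.0f} GFLOPs per token, {ratio:.1f}x less than a dense 400B model")
```

The same reasoning does not shrink storage or memory traffic by itself: all 400B parameters still have to live somewhere, and only the compute per token follows the active count.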

smlacy 2 hours ago|

Total gimmick. I guess we're "making progress", but this will never lead to any useful application other than "Yes, you're absolutely right" bots. What's needed for real applications is 10000× the input token context and 10× the output token speed, so we're off by a factor of ... 100,000×?