Posted by egnehots 10/24/2024

Quantized Llama models with increased speed and a reduced memory footprint (ai.meta.com)
508 points | 122 comments | page 2
nikolayasdf123 10/25/2024|
what's your opinion on LlamaStack?

for me it is nothing short of a bad experience. it is way over-engineered with poor quality and just plain does not work, and the maintainers are questionable. I would rather call HuggingFace Python code for inference, or anything else.

is ExecuTorch any better?

SoLoMo123 10/25/2024|
Hi, I'm Mergen and I work on ExecuTorch.

ExecuTorch is a runtime for mobile and embedded devices to run PyTorch models directly. Currently it runs pretty fast on CPU, but we're expanding our use cases to mobile accelerators and GPUs.

We're still in our early stages (just turned beta status). But try it out and let us know.
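
For anyone curious, the typical export path looks roughly like the sketch below (API details may shift between releases, and the toy module is a placeholder, not the Llama export path): trace with torch.export, lower to the Edge dialect, and serialize a .pte file that the on-device runtime loads.

    # Sketch of the usual ExecuTorch export flow; TinyModel stands in for a
    # real model, and exact APIs may differ across versions.
    import torch
    from executorch.exir import to_edge

    class TinyModel(torch.nn.Module):
        def forward(self, x):
            return torch.nn.functional.relu(x) * 2

    example_inputs = (torch.randn(1, 8),)
    exported = torch.export.export(TinyModel().eval(), example_inputs)
    edge_program = to_edge(exported)           # lower to the Edge dialect
    et_program = edge_program.to_executorch()  # lower to an ExecuTorch program
    with open("tiny_model.pte", "wb") as f:
        f.write(et_program.buffer)             # loaded by the mobile runtime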

Regarding Llama Stack, it is built by my colleagues. What concrete issues have you experienced? If you have error/bug reports, I'll be happy to pass them along.

nikolayasdf123 10/26/2024||
will give executorch a try.

with llamastack, well, making it work with CUDA for starters would be great.

it is also bloated. something that is supposed to take 100 lines of code and a couple of files ends up taking dozens of files, multiple frameworks, generators... which in the end do not work at all, and nobody knows why. very obscure framework. can't believe this code is coming from Meta.

Tepix 10/25/2024||
From TFA:

> At Connect 2024 last month, we open sourced Llama 3.2 1B and 3B

No you did not. There is no source (in this case: training data) included. Stop changing the meaning of "open source", Meta!

justanotheratom 10/24/2024||
Any pointers on how to fine-tune this on my dataset, then package and run it in my Swift iOS app?
behnamoh 10/24/2024||
Does anyone know why the most common method to speed up inference is quantization? I keep hearing about all sorts of new methods, but nearly none of them are implemented in practice (except for flash attention).
regularfry 10/25/2024||
In addition to the other answers in this thread, there's a practical one: sometimes (ok, often) you want to run a model on a card that doesn't have enough VRAM for it. Quantisation is a way to squeeze it down so it fits. For instance I've got a 4090 that won't fit the original Llama3 70b at 16 bits per param, but it will give me usable token rates at 2 bits.
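
As a quick back-of-envelope on that (24 GB is the 4090's capacity; KV cache and activation overhead are ignored here, so real headroom is tighter):

    # Rough weight-memory estimate for a 70B-parameter model at various bit
    # widths versus a 24 GB card. Ignores KV cache and activations.
    params = 70e9
    vram_gb = 24  # RTX 4090

    for bits in (16, 8, 4, 2):
        weights_gb = params * bits / 8 / 1e9
        verdict = "fits" if weights_gb < vram_gb else "does not fit"
        print(f"{bits:2d} bits/param: {weights_gb:6.1f} GB of weights -> {verdict}")
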
formalsystem 10/24/2024|||
It's particularly useful in memory-bound workflows like batch size = 1 LLM inference, where you're bottlenecked by how quickly you can send weights to your GPU. This is why, at least in torchao, we strongly recommend people try out int4 quantization.

At larger batch sizes you become compute-bound, so quantization matters less and you have to rely on hardware support to accelerate smaller dtypes like fp8.
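
For reference, weight-only int4 with torchao looks roughly like the sketch below (the model name is just an example, and option names may have moved between torchao releases):

    # Sketch: apply weight-only int4 quantization to a Hugging Face model with
    # torchao. quantize_ swaps the linear layers' weights in place; activations
    # stay in bf16.
    import torch
    from transformers import AutoModelForCausalLM
    from torchao.quantization import quantize_, int4_weight_only

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-3B-Instruct",
        torch_dtype=torch.bfloat16,
        device_map="cuda",
    )
    quantize_(model, int4_weight_only())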

o11c 10/24/2024|||
Because the way LLMs work is more or less "for every token, read the entire matrix from memory and do math on it". Math is fast, so if you manage to use only half the bits to store each item in the matrix, you only have to do half as much work. Of course, sometimes those least-significant bits were relied upon in the original training.
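
To put rough numbers on that (the 3B parameter count and ~1 TB/s bandwidth are illustrative assumptions, not measurements):

    # Back-of-envelope: at batch size 1, each generated token streams roughly
    # all of the weights through memory once, so the bandwidth-bound ceiling
    # on tokens/sec scales directly with bits per weight.
    params = 3e9          # e.g. a 3B-parameter model
    bandwidth = 1e12      # ~1 TB/s of GPU memory bandwidth (illustrative)

    for bits in (16, 8, 4):
        bytes_per_token = params * bits / 8
        print(f"{bits:2d}-bit weights: {bytes_per_token / 1e9:.1f} GB/token, "
              f"ceiling ~{bandwidth / bytes_per_token:,.0f} tok/s")
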
slimsag 10/25/2024||
Has anyone worked on making tokens 'clusters of words with specific semantic meaning'?

e.g. instead of tokens ['i', 'am', 'beautiful'] having tokens ['I am', 'beautiful'] on the premise that 'I am' is a common set of bytes for a semantic token that identifies a 'property of self'?

Or taking that further and having much larger tokens based on statistical analysis of common phrases of ~5 words or such?

pizza 10/25/2024|||
I think you might be thinking of applying a kind of low-rank decomposition to the vocabulary embeddings. A quick search on Google Scholar suggests that this might be useful in the context of multilingual tokenization.
visarga 10/25/2024||||
yes, look up Byte Pair Encoding

https://huggingface.co/learn/nlp-course/chapter6/5
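
A quick way to see what current BPE vocabularies actually do (requires the transformers package and a network fetch of the GPT-2 tokenizer; the splits noted in the comments are typical, not guaranteed): frequent byte sequences get merged into single tokens, but standard pre-tokenization splits on word boundaries first, so merges rarely cross spaces and true multi-word tokens stay uncommon.

    # Inspect how a stock BPE tokenizer handles the example from above.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    print(tok.tokenize("I am beautiful"))  # typically one token per word, leading space folded in
    print(tok.tokenize("tokenization"))    # a rarer word gets split into subword merges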

dragonwriter 10/25/2024|||
Much larger tokens require a much larger token vocabulary.
xcodevn 10/25/2024||
During inference, it is not a matrix x matrix operation, but rather a weight matrix x input vector operation, as we are generating one token at a time. The bottleneck now is how fast we can load the weight matrix from memory to tensor cores, hence the need for weight quantization.
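
A tiny sketch of why that matrix-vector step is memory-bound, and why larger batches shift it toward compute-bound (the hidden size is an illustrative assumption):

    # Arithmetic intensity of the weight multiply at different batch sizes:
    # a matvec does about 2 FLOPs per weight loaded (1 FLOP per byte at fp16),
    # far below what a GPU can sustain per byte of bandwidth. Bigger batches
    # reuse each loaded weight across more inputs.
    d = 4096               # illustrative hidden size
    bytes_per_weight = 2   # fp16/bf16

    for batch in (1, 8, 64):
        flops = 2 * d * d * batch               # multiply-accumulates per weight matrix
        bytes_moved = d * d * bytes_per_weight  # weight traffic dominates at small batch
        print(f"batch {batch:3d}: {flops / bytes_moved:.1f} FLOPs per byte of weights")
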
EliBullockPapa 10/24/2024||
Anyone know a nice iOS app to run these locally?
simonw 10/24/2024||
MLC Chat is a great iPhone app for running models (it's on Android too) and currently ships with Llama 3.2 3B Instruct - not the version Meta released today, but a quantized version of their previous release.

I wouldn't be surprised to see it add the new ones shortly, it's quite actively maintained.

https://apps.apple.com/us/app/mlc-chat/id6448482937

Havoc 10/25/2024||
Seems much more stable than the last time I tried it too
Arcuru 10/24/2024|||
I access them by running the models in Ollama (on my own hardware) and then using my app Chaz[1] to access them through my normal Matrix client.

[1] - https://github.com/arcuru/chaz

drilbo 10/24/2024|||
https://github.com/a-ghorbani/pocketpal-ai

This was just recently open sourced and is pretty nice. The only issue I've had is some very minor UI stuff (on Android; it sounds like it runs better on iOS, from skimming the comments).

evbogue 10/24/2024|||
I'm on Android; however, my somewhat elaborate solution was to install Ollama on my home laptop and then ssh in when I want to query a model. I figured that'd be better for my phone battery. Since my home computer is behind NAT, I run yggdrasil on everything so I can access my AI on the go.
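
For anyone wanting to do something similar, the remote Ollama instance can be hit over its HTTP API (port 11434 is Ollama's default) once the port is forwarded; the host and model names below are made up:

    # Assumes the Ollama port has already been forwarded, e.g.:
    #   ssh -L 11434:localhost:11434 user@home-laptop
    # after which the remote model answers as if it were local.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2:3b", "prompt": "Hello from my phone!", "stream": False},
        timeout=120,
    )
    print(resp.json()["response"])
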
behnamoh 10/24/2024||
I've been using PocketGPT.
arnaudsm 10/24/2024||
How do they compare to the original quants on Ollama, like q4_K_S?
tcdent 10/24/2024|
These undergo additional fine-tuning (QLoRA) using some or all of the original dataset, so they're able to get the weights to align better with the nf4 dtype, which increases accuracy.
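
Roughly what that looks like with the usual bitsandbytes + PEFT stack (a generic sketch, not Meta's actual recipe; the model name and hyperparameters are placeholders):

    # QLoRA-style setup: freeze the base weights in 4-bit NF4 and train small
    # LoRA adapters on top, so the adapters learn to compensate for the
    # quantization error.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-3B-Instruct",
        quantization_config=bnb,
        device_map="auto",
    )
    lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)  # only the LoRA adapter weights train
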
newfocogi 10/24/2024||
TLDR: Quantized versions of Llama 3.2 1B and 3B models with "competitive accuracy" to the original versions (meaning some degraded performance; plots included in the release notes).
newfocogi 10/24/2024||
Quantization schemes include post-training quantization (PTQ), SpinQuant, and QLoRA.
grahamj 10/25/2024||
Thx, I prefer not to visit meta properties :X

They were already pretty small but I guess the smaller the better as long as accuracy doesn't suffer too much.

mmaunder 10/24/2024|
[flagged]
accrual 10/24/2024||
Two days ago there was a pretty big discussion on this topic:

    Computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku
    https://news.ycombinator.com/item?id=41914989
    1421 points, 717 comments
refulgentis 10/24/2024|||
I wouldn't be so haughty, or presume that your understanding of things is how they actually are: this doesn't have practical applications.

No one serious is going to build on some horror of a Python interpreter running inside your app to run an LLM when llama.cpp is right there, with more quants available. In practice, on mobile, you run out of RAM headroom way more quickly than CPU headroom. You've been able to run 3B models with llama.cpp for almost a year now on iOS, whereas here, they're just starting to be able to. (allocating 6 GB is a quick way to get autokill'd on iOS...2.5 GB? Doable)

It looks like SpinQuant is effectively Q8. In widespread blind testing over months, we empirically found Q5 to be assuredly indistinguishable from the base model.

(edit: just saw your comment. oy. best of luck! generally, I don't bother with these sorts of 'lived experience' details, because no one wants to hear they don't get it, and most LLM comments on HN are from people who aren't lucky enough to work on this full-time. so you're either stuck aggressively asserting you're right in practice and they don't know what you're talking about, or you're stuck being talked down to about things you've seen, even if they don't match a first pass based on theory) https://news.ycombinator.com/item?id=41939841

pryelluw 10/24/2024|||
I don’t get the comment. For one, I’m excited about developments in the field. I’m not afraid it will “replace me”, as technology has replaced me multiple times over. I’m looking forward to working with these models more and more.
mmaunder 10/24/2024||
No, I meant that a lot of us are working very fast on pre-launch products, implementing some cutting-edge ideas, e.g. using the incredible speedup of a small, fast inference model like a quantized 3B in combination with other tools, and I think there's quite a bit of paranoia out there that someone else will beat you to market. So there's not a lot of sharing going on in the comments. At least not as much as previously, and not as much technical discussion compared to other non-AI threads on HN.
pryelluw 10/24/2024|||
Ok, thank you for pointing that out.

I’m focused on making models play nice with each other, rather than building a feature that relies on them. That’s where I see the more relevant work being, which is why news like this is exciting!

mattgreenrocks 10/24/2024|||
This thread attracts a smaller audience than, say, a new version of ChatGPT.
keyle 10/24/2024|||
Aren't we all just tired of arguing the same points?
lxgr 10/24/2024|||
What kind of fundamental discussion are you hoping to see under an article about an iterative improvement to a known model?

"AI will destroy the world"? "AI is great and will save humanity"? If you're seriously missing that, there's really enough platforms (and articles for more fundamental announcements/propositions on this one) where you can have these.

flawn 10/24/2024|||
A sign of the ongoing commoditization?
yieldcrv 10/24/2024||
I mean, this outcome for LLMs is expected, and LLM drops come too frequently, definitely too frequently to wait for Meta to hold an annual conference with a ton of hype. Furthermore, these releases are just prerequisites for a massive lemming rush of altering these models for the real fun, which happens in other communities.