Posted by egnehots 10/24/2024

Quantized Llama models with increased speed and a reduced memory footprint (ai.meta.com)
508 points | 122 comments | page 2
nikolayasdf123 10/25/2024|
what's your opinion on LlamaStack?

for me it is nothing short of a bad experience. it is way over-engineered with poor quality and just plain does not work, and the maintainers are questionable. I would rather call HuggingFace Python code for inference, or anything else.

is ExecuTorch any better?

SoLoMo123 10/25/2024|
Hi, I'm Mergen and I work on ExecuTorch.

ExecuTorch is a runtime for mobile and embedded devices to run PyTorch models directly. Currently it runs pretty fast on CPU, but we're expanding our use cases to mobile accelerators and GPUs.

We're still in our early stages (just turned beta status). But try it out and let us know.
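
For anyone curious, the typical export path looks roughly like the sketch below (API details may shift between releases, and the toy module is a placeholder, not the Llama export path): trace with torch.export, lower to the Edge dialect, and serialize a .pte file that the on-device runtime loads.

    # Sketch of the usual ExecuTorch export flow; TinyModel stands in for a
    # real model, and exact APIs may differ across versions.
    import torch
    from executorch.exir import to_edge

    class TinyModel(torch.nn.Module):
        def forward(self, x):
            return torch.nn.functional.relu(x) * 2

    example_inputs = (torch.randn(1, 8),)
    exported = torch.export.export(TinyModel().eval(), example_inputs)
    edge_program = to_edge(exported)           # lower to the Edge dialect
    et_program = edge_program.to_executorch()  # lower to an ExecuTorch program
    with open("tiny_model.pte", "wb") as f:
        f.write(et_program.buffer)             # loaded by the mobile runtime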

Regarding Llama Stack, it is built by my colleagues. What concrete issues have you experienced? If you have error/bug reports, I'll be happy to pass them along.

nikolayasdf123 10/26/2024||
will give executorch a try.

with llamastack, well, making it work with CUDA for starters would be great.

it is also bloated. something that is supposed to take 100 lines of code and a couple of files ends up taking dozens of files, multiple frameworks, generators... which in the end do not work at all, and nobody knows why. very obscure framework. can't believe this code is coming from Meta.

Tepix 10/25/2024||
From TFA:

> At Connect 2024 last month, we open sourced Llama 3.2 1B and 3B

No you did not. There is no source (in this case: training data) included. Stop changing the meaning of "open source", Meta!

justanotheratom 10/24/2024||
Any pointers on how to fine-tune this on my dataset, then package and run it in my Swift iOS app?
behnamoh 10/24/2024||
Does anyone know why the most common method to speed up inference is quantization? I keep hearing about all sorts of new methods, but nearly none of them are implemented in practice (except for flash attention).
regularfry 10/25/2024||
In addition to the other answers in this thread, there's a practical one: sometimes (ok, often) you want to run a model on a card that doesn't have enough VRAM for it. Quantisation is a way to squeeze it down so it fits. For instance I've got a 4090 that won't fit the original Llama3 70b at 16 bits per param, but it will give me usable token rates at 2 bits.
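
As a quick back-of-envelope on that (24 GB is the 4090's capacity; KV cache and activation overhead are ignored here, so real headroom is tighter):

    # Rough weight-memory estimate for a 70B-parameter model at various bit
    # widths versus a 24 GB card. Ignores KV cache and activations.
    params = 70e9
    vram_gb = 24  # RTX 4090

    for bits in (16, 8, 4, 2):
        weights_gb = params * bits / 8 / 1e9
        verdict = "fits" if weights_gb < vram_gb else "does not fit"
        print(f"{bits:2d} bits/param: {weights_gb:6.1f} GB of weights -> {verdict}")
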
formalsystem 10/24/2024|||
It's particularly useful in memory-bound workflows like batch size = 1 LLM inference, where you're bottlenecked by how quickly you can send weights to your GPU. This is why, at least in torchao, we strongly recommend people try out int4 quantization.

At larger batch sizes you become compute-bound, so quantization matters less and you have to rely on hardware support to accelerate smaller dtypes like fp8.
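
For reference, weight-only int4 with torchao looks roughly like the sketch below (the model name is just an example, and option names may have moved between torchao releases):

    # Sketch: apply weight-only int4 quantization to a Hugging Face model with
    # torchao. quantize_ swaps the linear layers' weights in place; activations
    # stay in bf16.
    import torch
    from transformers import AutoModelForCausalLM
    from torchao.quantization import quantize_, int4_weight_only

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-3B-Instruct",
        torch_dtype=torch.bfloat16,
        device_map="cuda",
    )
    quantize_(model, int4_weight_only())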

o11c 10/24/2024|||
Because the way LLMs work is more or less "for every token, read the entire matrix from memory and do math on it". Math is fast, so if you manage to use only half the bits to store each item in the matrix, you only have to do half as much work. Of course, sometimes those least-significant bits were relied upon in the original training.
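
To put rough numbers on that (the 3B parameter count and ~1 TB/s bandwidth are illustrative assumptions, not measurements):

    # Back-of-envelope: at batch size 1, each generated token streams roughly
    # all of the weights through memory once, so the bandwidth-bound ceiling
    # on tokens/sec scales directly with bits per weight.
    params = 3e9          # e.g. a 3B-parameter model
    bandwidth = 1e12      # ~1 TB/s of GPU memory bandwidth (illustrative)

    for bits in (16, 8, 4):
        bytes_per_token = params * bits / 8
        print(f"{bits:2d}-bit weights: {bytes_per_token / 1e9:.1f} GB/token, "
              f"ceiling ~{bandwidth / bytes_per_token:,.0f} tok/s")
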
slimsag 10/25/2024||
Has anyone worked on making tokens 'clusters of words with specific semantic meaning'?

e.g. instead of tokens ['i', 'am', 'beautiful'] having tokens ['I am', 'beautiful'] on the premise that 'I am' is a common set of bytes for a semantic token that identifies a 'property of self'?

Or taking that further and having much larger tokens based on statistical analysis of common phrases of ~5 words or such?

pizza 10/25/2024|||
I think you might be thinking of applying a kind of low-rank decomposition to the vocabulary embeddings. A quick search on Google Scholar suggests that this might be useful in the context of multilingual tokenization.
visarga 10/25/2024||||
yes, look up Byte Pair Encoding

https://huggingface.co/learn/nlp-course/chapter6/5
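
A quick way to see what current BPE vocabularies actually do (requires the transformers package and a network fetch of the GPT-2 tokenizer; the splits noted in the comments are typical, not guaranteed): frequent byte sequences get merged into single tokens, but standard pre-tokenization splits on word boundaries first, so merges rarely cross spaces and true multi-word tokens stay uncommon.

    # Inspect how a stock BPE tokenizer handles the example from above.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    print(tok.tokenize("I am beautiful"))  # typically one token per word, leading space folded in
    print(tok.tokenize("tokenization"))    # a rarer word gets split into subword merges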

dragonwriter 10/25/2024|||
Much larger tokens require a much larger token vocabulary.
xcodevn 10/25/2024||
During inference, it is not a matrix x matrix operation, but rather a weight matrix x input vector operation, as we are generating one token at a time. The bottleneck now is how fast we can load the weight matrix from memory to tensor cores, hence the need for weight quantization.
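
A tiny sketch of why that matrix-vector step is memory-bound, and why larger batches shift it toward compute-bound (the hidden size is an illustrative assumption):

    # Arithmetic intensity of the weight multiply at different batch sizes:
    # a matvec does about 2 FLOPs per weight loaded (1 FLOP per byte at fp16),
    # far below what a GPU can sustain per byte of bandwidth. Bigger batches
    # reuse each loaded weight across more inputs.
    d = 4096               # illustrative hidden size
    bytes_per_weight = 2   # fp16/bf16

    for batch in (1, 8, 64):
        flops = 2 * d * d * batch               # multiply-accumulates per weight matrix
        bytes_moved = d * d * bytes_per_weight  # weight traffic dominates at small batch
        print(f"batch {batch:3d}: {flops / bytes_moved:.1f} FLOPs per byte of weights")
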
EliBullockPapa 10/24/2024||
Anyone know a nice iOS app to run these locally?
simonw 10/24/2024||
MLC Chat is a great iPhone app for running models (it's on Android too) and currently ships with Llama 3.2 3B Instruct - not the version Meta released today, but a quantized version of their previous release.

I wouldn't be surprised to see it add the new ones shortly, it's quite actively maintained.

https://apps.apple.com/us/app/mlc-chat/id6448482937

Havoc 10/25/2024||
Seems much more stable than the last time I tried it too
Arcuru 10/24/2024|||
I access them by running the models in Ollama (on my own hardware) and then using my app Chaz[1] to access them through my normal Matrix client.

[1] - https://github.com/arcuru/chaz

drilbo 10/24/2024|||
https://github.com/a-ghorbani/pocketpal-ai

This was just recently open sourced and is pretty nice. The only issue I've had is some very minor UI stuff (on Android; it sounds like it runs better on iOS, from skimming the comments).

evbogue 10/24/2024|||
I'm on Android; however, my somewhat elaborate solution was to install Ollama on my home laptop and then ssh in when I want to query a model. I figured that'd be better for my phone battery. Since my home computer is behind NAT, I run yggdrasil on everything so I can access my AI on the go.
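
For anyone wanting to do something similar, the remote Ollama instance can be hit over its HTTP API (port 11434 is Ollama's default) once the port is forwarded; the host and model names below are made up:

    # Assumes the Ollama port has already been forwarded, e.g.:
    #   ssh -L 11434:localhost:11434 user@home-laptop
    # after which the remote model answers as if it were local.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2:3b", "prompt": "Hello from my phone!", "stream": False},
        timeout=120,
    )
    print(resp.json()["response"])
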
behnamoh 10/24/2024||
I've been using PocketGPT.
arnaudsm 10/24/2024||
How do they compare to the original quants on Ollama, like q4_K_S?
tcdent 10/24/2024|
These undergo additional fine-tuning (QLoRA) using some or all of the original dataset, so they're able to get the weights to align better with the nf4 dtype, which increases accuracy.
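
Roughly what that looks like with the usual bitsandbytes + PEFT stack (a generic sketch, not Meta's actual recipe; the model name and hyperparameters are placeholders):

    # QLoRA-style setup: freeze the base weights in 4-bit NF4 and train small
    # LoRA adapters on top, so the adapters learn to compensate for the
    # quantization error.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-3B-Instruct",
        quantization_config=bnb,
        device_map="auto",
    )
    lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)  # only the LoRA adapter weights train
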
newfocogi 10/24/2024||
TLDR: Quantized versions of Llama 3.2 1B and 3B models with "competitive accuracy" to the original versions (meaning some degraded performance; plots included in the release notes).
newfocogi 10/24/2024||
Quantization schemes include post-training quantization (PTQ), SpinQuant, and QLoRA.
grahamj 10/25/2024||
Thx, I prefer not to visit meta properties :X

They were already pretty small but I guess the smaller the better as long as accuracy doesn't suffer too much.

mmaunder 10/24/2024|
[flagged]
accrual 10/24/2024||
Two days ago there was a pretty big discussion on this topic:

    Computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku
    https://news.ycombinator.com/item?id=41914989
    1421 points, 717 comments
refulgentis 10/24/2024|||
I wouldn't be so haughty, or presume that your understanding of things is how they actually are: this doesn't have practical applications.

No one serious is going to build on some horror of a Python interpreter running inside your app to run an LLM when llama.cpp is right there, with more quants available. In practice, on mobile, you run out of RAM headroom way more quickly than CPU headroom. You've been able to run 3B models with llama.cpp for almost a year now on iOS, whereas here, they're just starting to be able to. (allocating 6 GB is a quick way to get autokill'd on iOS...2.5 GB? Doable)

It looks like SpinQuant is effectively Q8. In widespread blind testing over months, we empirically found Q5 to be assuredly indistinguishable from the base model.

(edit: just saw your comment. oy. best of luck! generally, I don't bother with these sorts of 'lived experience' details, because no one wants to hear they don't get it, and most LLM comments on HN are from people who aren't lucky enough to work on this full-time. so you're either stuck aggressively asserting you're right in practice and they don't know what you're talking about, or you're stuck being talked down to about things you've seen, even if they don't match a first pass based on theory) https://news.ycombinator.com/item?id=41939841

pryelluw 10/24/2024|||
I don’t get the comment. For one, I’m excited about developments in the field. I’m not afraid it will “replace me”, as technology has replaced me multiple times over. I’m looking forward to working with these models more and more.
mmaunder 10/24/2024||
No, I meant that a lot of us are working very fast on pre-launch products, implementing some cutting-edge ideas, e.g. using the incredible speedup of a small, fast inference model like a quantized 3B in combination with other tools, and I think there's quite a bit of paranoia out there that someone else will beat you to market. So there's not a lot of sharing going on in the comments. At least not as much as previously, and not as much technical discussion compared to other non-AI threads on HN.
pryelluw 10/24/2024|||
Ok, thank you for pointing that out.

I’m focused on making models play nice with each other, rather than building a feature that relies on them. That’s where I see the more relevant work being, which is why news like this is exciting!

mattgreenrocks 10/24/2024|||
This thread attracts a smaller audience than, say, a new version of ChatGPT.
keyle 10/24/2024|||
Aren't we all just tired of arguing the same points?
lxgr 10/24/2024|||
What kind of fundamental discussion are you hoping to see under an article about an iterative improvement to a known model?

"AI will destroy the world"? "AI is great and will save humanity"? If you're seriously missing that, there's really enough platforms (and articles for more fundamental announcements/propositions on this one) where you can have these.

flawn 10/24/2024|||
A sign of the ongoing commoditization?
yieldcrv 10/24/2024||
I mean, this outcome for LLMs is expected, and LLM drops come too frequently, definitely too frequently to wait for Meta to hold an annual conference with a ton of hype. Furthermore, these releases are just prerequisites for a massive lemming rush of altering these models for the real fun, which happens in other communities.