Posted by lairv 12 hours ago

Ggml.ai joins Hugging Face to ensure the long-term progress of Local AI (github.com)
638 points | 152 comments
forty 4 hours ago|
Looks like someone tried to type "Gmail" while drunk...
rkomorn 4 hours ago|
Looks like Gargamel of Smurfs fame to me.
dhruv3006 11 hours ago||
Huggingface is actually something that's driving good in the world. Good to see this collab.
lukebechtel 5 hours ago||
Thank you Georgi <3
superkuh 10 hours ago||
I'm glad that llama.cpp and its ggml backing are getting consistent, reliable economic support. I'm glad that ggerganov is getting rewarded for making such excellent tools.

I am somewhat anxious about "integration with the Hugging Face transformers library" and the Python-ecosystem entanglements that might cause. I know llama.cpp and ggml already have plenty of Python tooling, but it's not strictly required unless you're quantizing models yourself or doing other such things.

moralestapia 3 hours ago||
I hope Georgi gets a big fat check out of this, he deserves it 100%.
dmezzetti 11 hours ago||
This is really great news. I've been one of the strongest supporters of local AI, dedicating thousands of hours towards building a framework to enable it. I'm looking forward to seeing what comes of it!
logicallee 11 hours ago|
>I've been one of the strongest supporters of local AI, dedicating thousands of hours towards building a framework to enable it.

Sounds like you're very serious about supporting local AI. I have a query for you (and anyone else who feels like donating) about whether you'd be willing to donate some memory/bandwidth resources, peer to peer, to hosting an offline model:

We have a local model we would like to distribute but don't have a good CDN.

As a user/supporter question: would you be willing to donate some spare memory/bandwidth via a simple dedicated browser tab you keep open on your desktop? The tab plays silent audio (so it isn't backgrounded and unloaded), allocates 100 MB to 1 GB of RAM, and acts as a WebRTC peer serving checksummed models.[1] Our server then only has to check from time to time that you still have the file, by sending you some salt and a part of the file to hash; your tab proves it still has the file by returning the digest. This doesn't require any trust, and the receiving user will also hash it and report if there's a mismatch.
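
For concreteness, the audit reply inside the tab would be roughly this (a sketch only; it assumes SHA-256 via the Web Crypto API, and the function name and message shape are illustrative, not our final protocol):

  // Illustrative sketch: prove to the coordinating server that this tab still
  // holds the cached model. The server sends a random salt plus a byte range;
  // the tab returns SHA-256(salt || chunk) as hex.
  async function proveChunk(model: Blob, saltHex: string, start: number, end: number): Promise<string> {
    const chunk = new Uint8Array(await model.slice(start, end).arrayBuffer());
    const salt = Uint8Array.from(saltHex.match(/../g)!.map(h => parseInt(h, 16)));
    const buf = new Uint8Array(salt.length + chunk.length);
    buf.set(salt);
    buf.set(chunk, salt.length);
    const digest = await crypto.subtle.digest("SHA-256", buf);
    return Array.from(new Uint8Array(digest), b => b.toString(16).padStart(2, "0")).join("");
  }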

Our server federates the p2p connections, so when someone downloads, they do so from a trusted peer like you (one who has contributed and passed the audits). We considered building a binary for people to run, but we figured people couldn't trust our binaries, or that someone would target our build process; we are paranoid about trust, whereas the web's model is inherently untrusted and therefore safer. Why do all this?

The purpose of this would be to host an offline model: we successfully ported a 1 GB model from C++ and Python to WASM and WebGPU (you can see Claude doing so here; we livestreamed some of it[2]), but the model weights, at 1 GB, are too much for us to host.

Please let us know whether this is something you would contribute a background tab on your desktop to. It wouldn't impact you much, and you could set how much memory to dedicate to it, but you would have the good feeling of knowing that you're helping people run a trusted offline model, if they want, from their very own browser, no download required. The model we ported is fast enough for anyone to run on their own machine. Let me know if this is something you'd be willing to keep a tab open for.

[1] File sharing over WebRTC works like this: https://taonexus.com/p2pfilesharing/ (you can try it in two browser tabs).
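
And for the serving side, a rough sketch of what the tab does once our server has brokered a connection (this assumes an already-open RTCDataChannel and an illustrative chunk-request message shape):

  // Illustrative sketch: answer chunk requests from a downloading peer over an
  // already-negotiated WebRTC data channel (signaling is brokered by our server).
  function serveModel(channel: RTCDataChannel, model: Blob, chunkSize = 64 * 1024): void {
    channel.onmessage = async (ev) => {
      const { offset } = JSON.parse(ev.data as string); // peer asks for a byte offset
      const chunk = await model.slice(offset, offset + chunkSize).arrayBuffer();
      channel.send(chunk); // peer reassembles the file and verifies its checksum
    };
  }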

[2] https://www.youtube.com/watch?v=tbAkySCXyp0 and some other videos

HanClinto 9 hours ago|||
Hosting model weights for projects like this is, I think, something you could do by uploading them to a Space on Hugging Face?

What services would you need that Hugging Face doesn't provide?

echoangle 9 hours ago||||
Maybe stupid question but why not just put it in a torrent?
liuliu 8 hours ago|||
It is very simple. Storage/bandwidth is not expensive; residential bandwidth is. If you can convince people to install bandwidth-sharing software on their residential machines, you can then charge other people $5 to $10 per 1 GiB of bandwidth (useful mostly for botnets: getting around DDoS protections, reCAPTCHA tasks, and the like).
logicallee 6 hours ago||
Thank you for your suggestion. Below are only our plans/intentions; we welcome feedback on them:

We are not going to do what you suggest. Instead, our approach is to use the RAM people aren't using at the moment as a fast edge cache close to their area.

We've tried this architecture and get very low latency and high bandwidth. People would not be contributing their resources to anything they don't know about.

logicallee 9 hours ago|||
Torrents require users to download and install a torrent client! In addition, we would like to retain the possibility of giving live updates to the latest version of a sovereign fine-tuned file; torrents don't auto-update. We want to keep improving what people get.

Finally, we would like the possibility of setting up market dynamics in the future: if you aren't currently using all your RAM, why not rent it out? This matches the p2p edge architecture we envision.

In addition, our work on WebGPU would allow you to rent out your GPU to a background tab whenever you're not using it. Why have all that silicon sit idle when you could rent it out?

You could also donate it to help fine-tune our own sovereign model.

All of this will let us bootstrap to the point where we could be trusted with a download.

We have a rather paranoid approach to security.

liuliu 9 hours ago|||
> We have a local model we would like to distribute but don't have a good CDN.

That is not true. I am serving models off Cloudflare R2. It is 1 petabyte per month in egress, and I basically pay peanuts (~$200 everything included).

logicallee 9 hours ago||
1 petabyte per month is 1 million downloads of a 1 GB file. We intend to scale to more than 1 million downloads per month. We have a specific scaling architecture in mind. We're qualified to say this because we've ported a billion-parameter model to run in your browser, fast, on either WebGPU or WASM. (You can see us doing it live at the YouTube link in my comment above.) There is a lot of demand for that.
liuliu 8 hours ago||
The bandwidth is free on Cloudflare R2. I paid money for storage (~10 TiB of different models). If you only host a 1 GiB file there, you are only paying $0.01 per month, I believe.
geooff_ 11 hours ago||
As someone who's been in the "AI" space for a while, it's strange how Hugging Face went from one of the biggest names to not being part of the discussion at all.
r_lee 11 hours ago||
I think that's because there's less local AI usage now, since there are all kinds of image models from the big labs, so there's really no rush of people self-hosting Stable Diffusion etc. anymore.

The space moved from consumer to enterprise pretty fast due to models getting bigger.

zozbot234 11 hours ago||
Today's free models are not really bigger when you account for the use of MoE (with ever increasing sparsity, meaning a smaller fraction of active parameters), and better ways of managing KV caching. You can do useful things with very little RAM/VRAM, it just gets slower and slower the more you try to squeeze it where it doesn't quite belong. But that's not a problem if you're willing to wait for every answer.
r_lee 8 hours ago||
Yeah, but I mean more like the old setups where you'd just load a model on a 4090 or something; even with MoE it's a lot more complex and takes more VRAM, right? It just seems hard to justify for most hobbyists.

But maybe I'm just slightly out of the loop.

zozbot234 7 hours ago||
With sparse MoE it's worth running the experts in system RAM since that allows you to transparently use mmap and inactive experts can stay on disk. Of course that's also a slowdown unless you have enough RAM for the full set, but it lets you run much larger models on smaller systems.
segmondy 10 hours ago|||
Part of what discussion? Anyone in the AI space knows and uses HF, but the public doesn't care, and why should they? It's just an advanced site where nerds download AI stuff. HF is super valuable with their transformers library, their code, tutorials, smol models, etc., but how does that translate to investor dollars?
LatencyKills 11 hours ago||
It isn't necessary to be part of the discussion if you are truly adding value (which HF continues to do). It's nice to see a company doing what it does best without constantly driving the hype train.
option 11 hours ago||
Isn't HF banned in China? Also, how come so many Chinese labs are on Twitter all the time?

In either case - huge thanks to them for keeping AI open!

dragonwriter 10 hours ago||
> Isn't HF banned in China?

I think, for some definition of “banned”, that’s the case. It doesn’t stop the Chinese labs from having organization accounts on HF and distributing models there. ModelScope is apparently the HF-equivalent for reaching Chinese users.

disiplus 10 hours ago|||
I think in the West we assume everything is blocked. But, for example, if you buy an eSIM, when you visit you already get direct access to Western services, because they route the traffic through another server. Hong Kong is totally different: people there basically use WhatsApp and Google Maps, and everything worked when I was there.
embedding-shape 9 hours ago||
But also yes, the parent is right: HF is more or less inaccessible, and ModelScope is frequently cited as the mirror to use (although many Chinese labs seem to treat HF as the mirror and ModelScope as the "real" origin).
woadwarrior01 10 hours ago||
HF is indeed banned in China. The Chinese equivalent of HF is ModelScope[1].

[1]: https://modelscope.cn/

periodjet 9 hours ago||
Prediction: Amazon will end up buying HuggingFace. Screenshot this.
cyanydeez 4 hours ago|
Is there a local web UI that integrates with Hugging Face?

Ollama and webui seem to be rapidly losing their charm. Ollama now includes cloud APIs, which makes no sense for a local tool.
