
Posted by lairv 8 hours ago

Ggml.ai joins Hugging Face to ensure the long-term progress of Local AI (github.com)
608 points | 143 comments
simonw 5 hours ago|
It's hard to overstate the impact Georgi Gerganov and llama.cpp have had on the local model space. He pretty much kicked off the revolution in March 2023, making LLaMA work on consumer laptops.

Here's that README from March 10th 2023 https://github.com/ggml-org/llama.cpp/blob/775328064e69db1eb...

> The main goal is to run the model using 4-bit quantization on a MacBook. [...] This was hacked in an evening - I have no idea if it works correctly.

Hugging Face have been a great open source steward of Transformers, I'm optimistic the same will be true for GGML.

I wrote a bit about this here: https://simonwillison.net/2026/Feb/20/ggmlai-joins-hugging-f...

ushakov 4 hours ago|
[flagged]
carbocation 4 hours ago|||
Because many of us think simonw has discerning taste on this topic and like to read what he has to say about it, so we upvote his comments.
ushakov 4 hours ago||
i don't doubt this. i just find it questionable that one particular poster always gets in the spotlight when AI is the topic - while other conversations in my opinion offer more interesting angles.
jonas21 4 hours ago|||
Upvote the conversations that you find to be more interesting. If enough people do the same, they too will make it to the top.
colesantiago 3 hours ago|||
Agreed.

I would like to see others being promoted to the top, rather than Simon's constant shilling for backlinks to his blog every time an AI topic is on the front page.

simonw 4 hours ago||||
At a guess that's because my comment attracted more up-votes than the other top-level comments in the thread.

I generally try to include something in a comment that's not information already under discussion - in this case that was the link and quote from the original README.

ushakov 4 hours ago||
of course your comment attracts more upvotes - it's at the top.
seanhunter 2 hours ago|||
It’s at the top because of upvotes. They don’t have an “if simonw: boost” branch in the code.
ushakov 2 hours ago||
the code is not public, so we can't know. i think it's much more nuanced and certain users' comments might get a preferential treatment, based on factors other than the upvote count - which itself is hidden from us.
ComplexSystems 2 hours ago|||
> the code is not public, so we can't know.

I feel like you're making this statement in bad faith, rather than honestly believing the developers of the forum software here have built in a clause to pin simonw's comments to the top.

satvikpendem 1 hour ago|||
> certain users' comments might get a preferential treatment

This does not happen. It hasn't even happened when pg made the forum in the first place.

dcrazy 1 hour ago||
I thought dang explicitly said it does happen? It certainly happens for stories.
ontouchstart 3 hours ago|||
Attention feeds attention.

Attention is ALL You Need.

satvikpendem 1 hour ago||||
They aren't pinned, people just vote on them, and more so because simonw is a recognizable name with lots of posts and comments.
llm_nerd 3 hours ago||||
HN goes through phases. I remember when patio11 was the star of the hour on here. At another time it was that security guy (can't remember his name).

And for those who think it's just organic with all of the upvotes, HN absolutely does have a +/- comment bias for users, and it does automatically feature certain people and suppress others.

rymc 3 hours ago|||
the security guy you mean is probably tptacek (https://news.ycombinator.com/user?id=tptacek)
imiric 3 hours ago|||
> And for those who think it's just organic with all of the upvotes, HN absolutely does have a bias for authors, and it does automatically feature certain people and suppress others.

Exactly.

There are configurable settings for each account, which might be set automatically or manually (I'm not sure), that control the initial position of a comment in threads and how long it stays there. There might be a reward system where comments from high-karma accounts are prioritized over others, and accounts with "strikes", e.g. direct warnings from moderators, are penalized.

The difference in upvotes an account ultimately receives, and thus its impact on the discussion, is quite stark. The more visible a comment is, i.e. the closer to the top it sits, the more upvotes it can collect, which in turn keeps it at the top, and so on.

It's safe to assume that certain accounts, such as those of YC staff, mods, or alumni, or tech celebrities like simonw, are given the highest priority.

I've noticed this on my own account. After being warned for an IMO bullshit reason, my comments started to appear near the middle and quickly float down to the bottom, whereas before they would usually stay at the top for a few minutes. The quality of what I say hasn't changed, though the account's standing, and certainly the community itself, has.

I don't mind, nor particularly care about an arbitrary number. This is a proprietary platform run by a VC firm. It would be silly to expect that they've cracked the code of online discourse, or that their goal is to keep it balanced. The discussions here are better on average than elsewhere because of the community, although that also has been declining over the years.

I still find it jarring that most people vote on a comment based on whether they agree with it, instead of engaging with it intellectually, which often pushes interesting comments to the bottom. This is an unsolved problem here, as much as it is on other platforms.

throwaway2027 2 hours ago||||
Time flies, and simonw's AI feedback isn't always received favorably; sometimes he pushes it too much.
francispauli 2 hours ago||||
thanks for reminding me i need to follow his blog weekly again
mythz 8 hours ago||
I consider HuggingFace more "Open AI" than OpenAI - one of the few quiet heroes (along with Chinese OSS) helping bring on-premise AI to the masses.

I'm old enough to remember when traffic was expensive, so I've no idea how they've managed to offer free hosting for so many models. Hopefully it's backed by a sustainable business model, as the ecosystem would be meaningfully worse without them.

We still need good value hardware to run Kimi/GLM in-house, but at least we've got the weights and distribution sorted.

data-ottawa 7 hours ago||
Can we toss in the work unsloth does too as an unsung hero?

They provide excellent documentation and they’re often very quick to get high quality quants up in major formats. They’re a very trustworthy brand.

disiplus 7 hours ago|||
Yeah, they're the good guys. I suspect the open source work is mostly advertising that helps them sell consulting and services to enterprises. Otherwise, offering that work for free wouldn't make sense.
arcanemachiner 4 hours ago||
I hope that is exactly what is happening. It benefits them, and it benefits us.
cubie 7 hours ago||||
I'm a big fan of their work as well, good shout.
Tepix 6 hours ago|||
It's insane how much traffic HF must be pushing out of the door. I routinely download models that are hundreds of gigabytes in size from them. A fantastic service to the sovereign AI community.
razster 3 hours ago|||
My fear is that these large "AI" companies will lobby to have these open source options removed or banned; it's a growing concern. I'm not sure how else to convey how much I enjoy using what HF provides; I religiously browse their site for new and exciting models to try.
culi 3 hours ago||
ModelScope is the Chinese equivalent of Hugging Face and a good backup. All the open models are Chinese anyway.
thot_experiment 11 minutes ago||
Not true! Mistral is really really good, but I agree that there isn't a single decent open model from the USA.
vardalab 4 hours ago||||
Yup, I have downloaded probably a terabyte in the last week, especially with the Step 3.5 model being released and Minimax quants. I wonder what my ISP thinks. I hope they don't cut me off. They gave me a fast lane, they better let me use it, lol
fc417fc802 2 hours ago||
Even fairly restrictive data caps are in the range of 6 TB per month. P2P at a mere 100 Mb/s works out to about 1 TiB per 24 hours.

Hypothetically my ISP will sell me unmetered 10 Gb/s service, but I wonder if they would actually make good on their word ...
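
For anyone who wants to sanity-check that, here's the back-of-envelope math (the link speed and cap are assumed round numbers, not measurements):

    # sustained P2P throughput vs. a monthly data cap
    link_mbps = 100                          # assumed: a modest 100 Mb/s link
    bytes_per_day = link_mbps / 8 * 1e6 * 86_400
    tib_per_day = bytes_per_day / 2**40      # ~0.98 TiB per day
    days_to_cap = 6e12 / bytes_per_day       # ~5.6 days to hit a 6 TB cap
    print(f"{tib_per_day:.2f} TiB/day, cap hit in {days_to_cap:.1f} days")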

Onavo 3 hours ago|||
Bandwidth is not that expensive. The Big 3 clouds just want to milk customers via egress. Look at Hetzner or Cloudflare R2 if you want to get an idea of commodity bandwidth costs.
zozbot234 7 hours ago|||
> We still need good value hardware to run Kimi/GLM in-house

If you stream weights in from SSD storage and freely use swap to extend your KV cache, it will be really slow (multiple seconds per token!) but will run on basically anything. And that's still really good for stuff that can be computed overnight, perhaps even by batching many requests simultaneously. It gets progressively better as you add more compute, of course.
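
A minimal sketch of that setup with llama-cpp-python: mmap keeps the weights on disk and pages them in on demand, so the model file can be far larger than RAM (the model path here is hypothetical):

    from llama_cpp import Llama

    # use_mmap=True (the default) maps the GGUF from disk instead of
    # loading it up front; the OS pages weights in as layers are touched.
    llm = Llama(
        model_path="kimi-k2-Q2_K.gguf",  # hypothetical multi-hundred-GB file
        use_mmap=True,
        use_mlock=False,   # let the OS evict pages freely
        n_gpu_layers=0,    # pure CPU; raise as VRAM allows
        n_ctx=4096,
    )

    # expect seconds per token; fine for batch jobs left to run overnight
    out = llm("Summarize the following report: ...", max_tokens=256)
    print(out["choices"][0]["text"])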

Aurornis 4 hours ago|||
> it will be really slow (multiple seconds per token!)

This is fun for proving that it can be done, but that's 100X slower than hosted models and 1000X slower than GPT-Codex-Spark.

That's like going from real time conversation to e-mailing someone who only checks their inbox twice a day if you're lucky.

HPsquared 7 hours ago|||
At a certain point the energy starts to cost more than renting some GPUs.
vardalab 4 hours ago|||
Yeah, that is hard to argue with because I just go to OpenRouter and play around with a lot of models before I decide which ones I like. But there's something special about running it locally in your basement
fc417fc802 2 hours ago|||
Aren't decent GPU boxes in excess of $5 per hour? At $0.20 per kWh (which is on the high side in the US), running a 1 kW workstation 24/7 works out to the same daily cost as 1 hour of GPU time.

The issue you'll actually run into is that most residential housing isn't wired for more than ~2kW per room.
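
Spelled out, with all prices assumed:

    # electricity for a local box vs. renting GPU time (assumed prices)
    workstation_kw = 1.0        # 1 kW under load
    usd_per_kwh = 0.20          # high end for US residential power
    day_cost = workstation_kw * 24 * usd_per_kwh
    print(f"${day_cost:.2f}/day")  # $4.80: about one hour at $5/hr GPU rates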

sowbug 7 hours ago|||
Why doesn't HF support BitTorrent? I know about hf-torrent and hf_transfer, but those aren't nearly as accessible as a link in the web UI.
embedding-shape 6 hours ago||
> Why doesn't HF support BitTorrent?

Harder to track downloads then. Only when clients hit the tracker would they be able to get download stats, and forget about private repositories or the "gated" ones that Meta/Facebook does for their "open" models.

Still, if vanity metrics weren't so important, it'd be a great option. I've even thought of creating my own torrent mirror of HF to provide as a public service, since eventually access to models will be restricted, and it would be nice to be a bit better prepared for that moment.

sowbug 6 hours ago|||
I thought of the tracking and gating questions too, when I vibed up an HF torrent service a few nights ago. (Super annoying, BTW, to have to download the files just to hash the pieces, especially when webseeds exist.) Model owners could disable or gate torrents the same way they gate the models, and HF could still measure traffic by .torrent downloads and magnet clicks.

It's a bit like any legalization question -- the black market exists anyway, so a regulatory framework could bring at least some of it into the sunlight.
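
For reference, creating such a torrent with the torf library looks roughly like this (one option among several; the repo path and webseed URL are hypothetical), and generate() is the step that forces you to hold every byte locally:

    from torf import Torrent

    # wrap an already-downloaded model repo in a torrent whose webseed
    # points back at HF, so plain HTTP acts as a permanent extra peer
    t = Torrent(
        path="models/SomeOrg-SomeModel-GGUF",  # hypothetical local copy
        webseeds=["https://huggingface.co/SomeOrg/SomeModel-GGUF/resolve/main/"],
        comment="mirror of a hypothetical HF repo",
    )
    t.generate()   # hashes every piece: this is why you need the full files
    t.write("SomeModel-GGUF.torrent")
    print(t.magnet())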

embedding-shape 6 hours ago||
> Model owners could disable or gate torrents the same way they gate the models, and HF could still measure traffic by .torrent downloads and magnet clicks.

But that'll only stop a small part of it: anyone could share the infohash, and if you're using DHT/magnet links without .torrent files or clicks on a website, no one can count those downloads unless they also scrape the DHT for peers reporting they've completed the download.

fc417fc802 2 hours ago|||
> unless they also scrape the DHT for peers reporting they've completed the download.

Which can be falsified. Head over to your favorite tracker and sort by completed downloads to see what I mean.

sowbug 6 hours ago|||
Right, but that's already happening today. That's the black-market point.
taminka 4 hours ago||||
Most of the traffic is probably from open weights; just seed those and host the private ones as is.
jimbob45 3 hours ago||||
Wouldn’t it still provide massive benefits if they could convince/coerce the owners of their most-downloaded models to move to torrenting?
homarp 4 hours ago|||
how are all the private trackers tracking ratios?
Fin_Code 6 hours ago||
I still don't know why they are not running on torrents. It's the perfect use case.
heliumtera 6 hours ago|||
How can you be the man in the middle in a truly P2P environment?
freedomben 6 hours ago|||
That would shut out most people working for big corp, which is probably a huge percentage of the user base. It's dumb, but that's just the way corp IT is (no torrenting allowed).
zozbot234 6 hours ago||
It's a sensible option, even if not everyone can really use it. Linux distros are routinely transferred via torrent, so why not other massive, openly licensed data?
thot_experiment 7 minutes ago|||
I have terabytes of linux isos I got via torrents, many such cases!
freedomben 6 hours ago|||
Oh as an option, yeah I agree it makes a ton of sense. I just would expect a very, very small percentage of people to use the torrent over the direct download. With Linux distros, the vast majority of downloads still come from standard web servers. When I download distro images I opt for torrents, but very few people do the same
Const-me 1 hour ago|||
> very small percentage of people to use the torrent over the direct download

The BitTorrent protocol is IMO better for downloading large files. When I want to download something that exceeds a couple of GB, and I see both a direct download link and a BitTorrent link, I always click on the torrent.

On paper, HTTP supports range requests to resume partial downloads. IME, modern web browsers have neglected to implement this properly: they won’t resume after the browser is reopened or the computer is restarted. Command-line HTTP clients like wget are more reliable, but many web servers these days require session cookies or one-time query-string tokens, and it’s hard to pass that stuff from the browser to the command line.

I live in Montenegro, and CDN connectivity is not great here. Only a few of them, like Steam and GOG, saturate my 300 megabit/sec download link. Others are much slower; e.g. Windows updates download at about 100 megabit/sec. The BitTorrent protocol almost always delivers the full 300 megabit/sec.
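
The protocol side really is simple. Here's a sketch of a resuming HTTP client in Python (URL hypothetical), i.e. what browsers could do but often don't:

    import os
    import requests

    url = "https://example.com/files/big-model.gguf"  # hypothetical
    dest = "big-model.gguf"

    # resume from wherever the previous attempt died
    start = os.path.getsize(dest) if os.path.exists(dest) else 0
    headers = {"Range": f"bytes={start}-"} if start else {}

    with requests.get(url, headers=headers, stream=True, timeout=60) as r:
        r.raise_for_status()
        # 206 means the server honored the range; 200 means start over
        mode = "ab" if r.status_code == 206 else "wb"
        with open(dest, mode) as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)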

zrm 5 hours ago|||
With Linux distros they typically put the web link right on the main page and have a torrent available if you go look for it, because they want you to try their distro more than they want to save some bandwidth.

I suppose HF did the opposite because the bandwidth saved is greater, and they're not as concerned that you might download a different model from someone else.

HanClinto 8 hours ago||
I'm regularly amazed that HuggingFace is able to make money. It does so much good for the world.

How solid is its business model? Is it long-term viable? Will they ever "sell out"?

microsoftedging 6 hours ago||
FT had a solid piece a few weeks back: "Why AI start-up Hugging Face turned down a $500mn Nvidia deal"

https://giftarticle.ft.com/giftarticle/actions/redeem/9b4eca...

jackbravo 6 hours ago||
sounds very interesting, but even though it says giftarticle.ft, I got blocked by a paywall.
nerevarthelame 6 hours ago|||
https://archive.is/zSyUc

To summarize, they rejected Nvidia's offer because they didn't want one outsized investor who could sway decisions. And "the company was also able to turn down Nvidia due to its stable finances. Hugging Face operates a 'freemium' business model. Three per cent of customers, usually large corporations, pay for additional features such as more storage space and the ability to set up private repositories."

bee_rider 5 hours ago||
Freemium seems to be working pretty well for them; what’s the alternative website, after all? They seem to command their niche.
culi 3 hours ago|||
find the Bypass Paywalls Clean extension. Never worry about a paywall again
bityard 5 hours ago|||
Their business model is essentially the same as GitHub. Host lots of stuff for free and build a community around it, sell the upscaled/private version to businesses. They are already profitable.
HanClinto 4 hours ago||
This is what SourceForge did too, and they still had the DevShare adware thing, didn't they?

GitHub is great -- huge fan. To some degree they "sold out" to Microsoft and things could have gone more south, but thankfully Microsoft has ruled them with a very kind hand, and overall I'm extremely happy with the way they've handled it.

I guess I always retain a bit of skepticism with such things, and the long-term viability and goodness of such things never feels totally sure.

dmezzetti 8 hours ago|||
They have paid hosting - https://huggingface.co/enterprise and paid accounts. Also consulting services. Seems like a pretty good foundation to me.
julien_c 6 hours ago||
and a lot of traction on paid (private in particular) storage these days; sneak peek at new landing page: https://huggingface.co/storage
heliumtera 6 hours ago|||
>Will they ever "sell out"?

Oh no, never. Don't worry, the usual investors are very well known for fighting for user autonomy (AMD, Nvidia, Intel, IBM, Qualcomm).

They are all very pro-consumer, and all backers are certainly here for your enjoyment only.

zozbot234 6 hours ago||
These are all big hardware firms, which makes a lot of sense as a classic 'commoditize the complement' play. Not exactly pro-consumer, but not quite anti-consumer either!
5o1ecist 4 hours ago|||
> AMD, Nvidia, Intel, IBM, Qualcomm

> but not quite anti-consumer either!

All of them are public companies, which means that their default state is anti-consumer and pro-shareholder. By law they are required to do whatever they can to maximize profits. History teaches that shareholders can demand whatever they want, with the respective companies following orders, since nobody ever really has to suffer consequences and any and all potential fines are already priced in, in advance, anyway.

Conversely, this is why Valve is such a great company. Valve is probably one of the few actual pro-consumer companies out there.

Fun Fact! Rarely is it ever mentioned anywhere, but Valve is not a public company! Valve is a private company! That's why they can operate the way they do! If Valve was a public company, then greedy, crooked billionaire shareholders would have managed to get rid of Gabe a long time ago.

RussianCow 1 hour ago|||
> By law they are required to do whatever they can to maximize profits.

I know it's a nit-pick, but I hate that this always gets brought up when it's not actually true. Public corporations face pressure from investors to maximize returns, sure, but there is no law stating that they have to maximize profits at all costs. Public companies can (and often do) act against the interest of immediate profits for some other gain. The only real leverage that investors have is the board's ability to fire executives, but that assumes that they have the necessary votes to do so. As a counter-example, Mark Zuckerberg still controls the majority of voting power at Meta, so he can effectively do whatever he wants with the company without major consequence (assuming you don't consider stock price fluctuations "major").

But I say this not to take away from your broader point, which I agree with: the short-term profit-maximizing culture is indeed the default when it comes to publicly traded corporations. It just isn't something inherent in being publicly traded, and in the inverse, private companies often have the same kind of culture, so that's not a silver bullet either.

HanClinto 4 hours ago|||
Great points.

Valve is one of my top favorite companies right now. Love the work they're doing, and their products are amazing.

Can hardly wait for the Steam Frame.

smallerize 3 hours ago|||
heliumtera is being sarcastic.
I_am_tiberius 8 hours ago||
I once tried Hugging Face because I wanted to work through some tutorial. They wanted my credit card details during registration, as far as I remember. After a month they invoiced me some amount of money, and I had no idea what it was for. To be honest, I don't understand what exactly they do or what services I was paying for, but I cancelled my account and never touched it again. For me it was a totally opaque process.
shafyy 8 hours ago||
Their pricing seems pretty transparent: https://huggingface.co/pricing
mnewme 8 hours ago||
Huggingface is the silent GOAT of the AI space, such a great community and platform
lairv 8 hours ago|
Truly amazing that they've managed to build an open and profitable platform without shady practices
al_borland 7 hours ago||
It’s such a sad state of affairs when shady practices are so normal that finding a company without them is noteworthy.
0xbadcafebee 5 hours ago||
> The community will continue to operate fully autonomously and make technical and architectural decisions as usual. Hugging Face is providing the project with long-term sustainable resources, improving the chances of the project to grow and thrive. The project will continue to be 100% open-source and community driven as it is now.

I want this to be true, but business interests win out in the end. Llama.cpp is now the de-facto standard for local inference; more and more projects depend on it. If a company controls it, that company controls the local LLM ecosystem. And yeah, Hugging Face seems nice now... so did Google originally. If we don't all want to be locked in, we either need a llama.cpp competitor (with a universal abstraction), or it should be controlled by an independent nonprofit.

zozbot234 5 hours ago|
Llama.cpp is an open source project that anyone can fork as needed, so any "control" over it really only extends to facilitating development of certain features.
0xbadcafebee 1 hour ago||
In practice, nobody does this, because you then have to keep the fork up to date with upstream plus your changes, and this is an endless amount of work.
jgrahamc 6 hours ago||
This is great news. I've been sponsoring ggml/llama.cpp/Georgi since 2023 via GitHub. Glad to see this outcome. I hope you don't mind, Georgi, but I'm going to cancel my sponsorship now that you and the code have found a home!
beoberha 8 hours ago||
Seems like a great fit - kinda surprised it didn’t happen sooner. I think we are deep in the valley of local AI, but I’d be willing to bet it breaks out in the next 2-3 years. Here’s hoping!
breisa 2 hours ago|
I mean, they already supported the project quite a bit. @ngxson (and maybe others?) from Hugging Face are big contributors to llama.cpp.
tkp-415 7 hours ago||
Can anyone point me in the direction of getting a model to run locally and efficiently inside something like a Docker container on a system with not-so-strong computing power (aka a MacBook M1 with 8 GB of memory)?

Is my only option to invest in a system with more computing power? These local models look great, especially something like https://huggingface.co/AlicanKiraz0/Cybersecurity-BaronLLM_O... for assisting in penetration testing.

I've experimented with a variety of configurations on my local system, but in the end it turns into a makeshift heater.

0xbadcafebee 4 hours ago||
8GB is not enough for complex reasoning, but you can do very small, simple things. Models like Whisper, SmolVLM, Qwen2.5-0.5B, Phi-3-mini, Granite-4.0-micro, Mistral-7B, Gemma3, and Llama-3.2 all work on very little memory. Tiny models can do a lot if you tune/train them. They also need to be used differently: system prompt preloaded with information, few-shot examples, reasoning guidance, single-task purpose, strict output guidelines (see the sketch at the end of this comment). See https://github.com/acon96/home-llm for an example. For each small model, check if Unsloth has a tuned version of it; that reduces your memory footprint and makes inference faster.

For your Mac, you can use Ollama, or MLX (Mac ARM specific; it requires a different engine and a different on-disk model format, but is faster). Ramalama may help fix bugs or ease the process with MLX. Use either Docker Desktop or Colima for the VM + Docker.

For today's coding & reasoning models, you need a minimum of 32GB of VRAM combined (graphics + system); the more in the GPU, the better. Copying memory between CPU and GPU is too slow, so the model needs to "live" in GPU space. If it can't fit entirely in GPU space, your CPU has to work hard, and you get a space heater. That Mac M1 will do 5-10 tokens/s with 8GB (and the CPU on full blast), or 50 tokens/s with 32GB of RAM (CPU idling). And now you know why there's a RAM shortage.
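
To make the "use tiny models differently" point concrete, here's a minimal single-task sketch with llama-cpp-python (the model file and the task are hypothetical): the system prompt carries the task, the facts, and a strict output contract, and a one-shot example anchors the format.

    from llama_cpp import Llama

    llm = Llama(model_path="qwen2.5-0.5b-instruct-q4_k_m.gguf",  # hypothetical
                n_ctx=2048)

    out = llm.create_chat_completion(
        messages=[
            # preload task, knowledge, and strict output rules
            {"role": "system", "content":
                "You classify support emails. Categories: billing, bug, "
                "feature. Reply with exactly one category word."},
            # one-shot example to anchor the output format
            {"role": "user", "content": "The app crashes when I tap Save."},
            {"role": "assistant", "content": "bug"},
            {"role": "user", "content": "Why was I charged twice this month?"},
        ],
        max_tokens=4,
        temperature=0.0,
    )
    print(out["choices"][0]["message"]["content"])  # expected: "billing"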

mft_ 7 hours ago|||
There’s no way around needing a powerful-enough system to run the model. So you either choose a model that can fit on what you have (i.e. a small model, or a quantised slightly larger model) or you get access to more powerful hardware, either by buying it or renting it. (IME you don’t need Docker. For an easy start, just install LM Studio and have a play.)

I picked up a second-hand 64GB M1 Max MacBook Pro a while back for not too much money for such experimentation. It’s sufficiently fast at running any LLM models that it can fit in memory, but the gap between those models and Claude is considerable. However, this might be a path for you? It can also run all manner of diffusion models, but there the performance suffers (vs. an older discrete GPU) and you’re waiting sometimes many minutes for an edit or an image.

ryandrake 6 hours ago|||
I wasn't able to have very satisfying success until I bit the bullet and threw a GPU at the problem. Found an actually reasonably priced A4000 Ada generation 20GB GPU on eBay and never looked back. I still can't run the insanely large models, but 20GB should hold me over for a while, and I didn't have to upgrade my 10 year old Ivy Bridge vintage homelab.
sigbottle 7 hours ago|||
Are Mac kernels optimized compared to CUDA kernels? I know the unified-memory GPU approach is inherently slower, but I thought a ton of optimizations happen at the kernel level too (CUDA itself is a moat)
liuliu 2 hours ago|||
It depends on what you're doing. For token generation, compute-dense kernel optimization is less interesting (it's memory-bound) than latency optimizations elsewhere (data transfers, kernel invocations, etc.). And for those, Mac devices actually have a leg up on CUDA: Metal shader pipelines are pretty much optimized for latency (a.k.a. games), while CUDA's were not until the introduction of CUDA Graphs, and of course there are other issues.
bigyabai 3 hours ago|||
Mac kernels are almost always compute shaders written in Metal. That's the bare-minimum of acceleration, being done in a non-portable proprietary graphics API. It's optimized in the loosest sense of the word, but extremely far from "optimal" relative to CUDA (or hell, even Vulkan Compute).

Most people will not choose Metal if they're picking between the two moats. CUDA is far-and-away the better hardware architecture, not to mention better-supported by the community.

zozbot234 7 hours ago|||
The general rule of thumb is that you should feel free to quantize even as low as 2 bits average if this helps you run a model with more active parameters. Quantized models are not perfect at all, but they're preferable to the models with fewer, bigger parameters. With 8GB usable, you could run models with up to 32B active at heavy quantization.
xrd 7 hours ago|||
I think a better bet is to ask on reddit.

https://www.reddit.com/r/LocalLLM/

Every time I ask the same thing here, people point me there.

yjftsjthsd-h 4 hours ago|||
With only 8 GB of memory, you're going to be running a really small quant, and it's going to be slow and lower quality. But yes, it should be doable. In the worst case, find a tiny gguf and run it on CPU with llamafile.
ontouchstart 5 hours ago|||
This is the easiest setup on a Mac. You need at least 16 GB on a MacBook:

https://github.com/ggml-org/llama.cpp/discussions/15396

HanClinto 6 hours ago|||
Maybe check out Docker Model Runner -- it's built on llama.cpp (in a good way -- not like Ollama) and handles I think most of what you're looking for?

https://www.docker.com/blog/run-llms-locally/

As far as how to find good models to run locally, I found this site recently, and I liked the data it provides:

https://localclaw.io/

Hamuko 5 hours ago||
I tried to run some models on my M1 Max (32 GB) Mac Studio and it was a pretty miserable experience. Slow performance and awful results.
kristianp 2 hours ago||
> Towards seamless “single-click” integration with the transformers library

That's interesting. I thought they would be somewhat redundant; they do similar things, after all, except for training.

fancy_pantser 2 hours ago|
Was Georgi ever approached by Meta? I wonder what they offered (I'm glad they didn't succeed, just morbid curiosity).