
Posted by bundie 4 days ago

Introducing Gemma 3n (developers.googleblog.com)
403 points | 190 comments
eabeezxjc 3 days ago|
https://github.com/Mozilla-Ocho/llamafile not working ;(

    gemma-3n-E4B-it-Q8_0
    import_cuda_impl: initializing gpu module...
    get_rocm_bin_path: note: hipcc not found on $PATH
    [...]
    llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma3n'
    llama_load_model_from_file: failed to load model
    llama_init_from_gpt_params: error: failed to load model 'gemma-3n-E4B-it-Q8_0.gguf'
    main: error: unable to load model
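
The unknown-architecture error means the llama.cpp bundled into that llamafile build predates gemma3n support. A minimal sketch to check the same GGUF outside llamafile, assuming a recent llama-cpp-python build that already knows the gemma3n architecture (the model path is whatever your local file is):

    # Hypothetical check: load the same GGUF with an up-to-date llama-cpp-python.
    # If this also fails with "unknown model architecture: 'gemma3n'", the runtime
    # is still too old; the model file itself is fine.
    from llama_cpp import Llama

    try:
        llm = Llama(model_path="gemma-3n-E4B-it-Q8_0.gguf", n_ctx=4096)
        out = llm("Say hello in one short sentence.", max_tokens=32)
        print(out["choices"][0]["text"])
    except Exception as e:
        print("failed to load or run the model:", e)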

lostmsu 4 days ago||
I made a simple website[0] to quickly check online models' MMLU scores (it runs a subset), and Gemma 3n consistently loses to LLaMA 3.3 (~61% vs ~66%), and definitely loses to LLaMA 4 Scout (~86%). I suspect that means its rating on the LMArena Leaderboard is just some form of gaming the metric.

What's interesting is that it beats smarter models in my Turing Test Battle Royale[1]. I wonder if that means it's a better talker.

0. https://mmlu.borgcloud.ai/

1. https://trashtalk.borg.games/

lucb1e 4 days ago||
I read the general parts and skimmed the inner workings but I can't figure out what the high-level news is. What does this concretely do that Gemma didn't already do, or what benchmark/tasks did it improve upon?

Before it goes into the inner details (MatFormer, per-layer embeddings, caching...), the only sentence I've found that concretely mentions a new thing is "the first model under 10 billion parameters to reach [an LMArena score over 1300]". So it's supposed to be better than other models up to those that use 10GB+ RAM, if I understand that right?

awestroke 4 days ago|
> What does this concretely do that Gemma didn't already do

Open weights

lucb1e 4 days ago|||
Huh? I'm pretty sure I ran Gemma on my phone last month. Or is there a difference between downloadable (you get the weights because it's necessary to run the thing) and "open" weights?
ColonelPhantom 4 days ago|||
I think the other poster is confused. Both Gemma 3 and Gemma 3n are open-weight models.

Google's proprietary model line is called Gemini. There is a variant called Gemini Nano that can be run offline, but I don't think it can be freely distributed; it's only allowed as part of Android.

As for what's new, Gemma 3n seems to have optimizations that make it better than the 'small' Gemma 3 models (such as the 4B) at a similar speed or footprint.

lucb1e 3 days ago||
Thank you!
throwaway2087 4 days ago|||
Wasn't it a preview version?
lucb1e 4 days ago||
Oh, that could be. So this is the first full release of Google's on-device model, and that's the news?
zknowledge 4 days ago||
Anyone know how much it costs to use the deployed version of Gemma 3n? The docs indicate you can use the Gemini API for deployed Gemma 3n, but the pricing page just shows "unavailable".
turnsout 4 days ago||
This looks amazing given the parameter sizes and capabilities (audio, visual, text). I like the idea of keeping simple tasks local. I’ll be curious to see if this can be run on an M1 machine…
Fergusonb 4 days ago||
Sure it can. The easiest way is to get Ollama, then `ollama run gemma3n`. You can pair it with tools like simonw's llm to pipe stuff to it.
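
A minimal sketch of driving it from Python once the tag is pulled, assuming the Ollama server is running locally (the prompt is only an example):

    # Requires `pip install ollama` and a prior `ollama pull gemma3n`.
    import ollama

    resp = ollama.chat(
        model="gemma3n",
        messages=[{"role": "user", "content": "Summarize in one sentence: Gemma 3n is a small multimodal model."}],
    )
    print(resp["message"]["content"])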
bigyabai 4 days ago||
This should run fine on most hardware: CPU inference of the E2B model on my Pixel 8 Pro gives me ~9 tok/s decode speed.
lxgr 4 days ago||
> It’s supported by your favorite tools including Hugging Face Transformers, llama.cpp, Google AI Edge, Ollama, MLX, and many others

Does anybody know how to actually run these using MLX? mlx-lm does not currently seem to support them, so I wonder what Google means exactly by "MLX support".

mh- 4 days ago|
I had success using the models from lmstudio-community here.

https://huggingface.co/lmstudio-community

lxgr 4 days ago||
Thank you!

Do you know if these actually preserve the structure of Gemma 3n that makes these models more memory-efficient on consumer devices? I feel like the modified inference architecture described in the article is what makes this possible, but it probably needs additional software support.

But given that they were uploaded a day ago (together with the blog post), maybe these are actually the real deal? In that case, I wish Google could just link to these instead of to https://huggingface.co/mlx-community/gemma-3n-E4B-it-bf16.

Edit: Ah, these are just non-MLX models. I might give them a try, but not what I was looking for. Still, thank you!

mh- 4 days ago||
That's a great question that is beyond my technical competency in this area, unfortunately. I fired up LM Studio when I saw this HN post, and saw it updated its MLX runtime [0] for gemma3n support. Then went looking for an MLX version of the model and found that one.

[0]: https://github.com/lmstudio-ai/mlx-engine

nsingh2 4 days ago||
What are some use cases for these small local models, for individuals? It seems like for programming-related work the proprietary models are significantly better, and that's all I really use LLMs for personally.

Though I can imagine a few commercial applications where something like this would be useful. Maybe in some sort of document processing pipeline.

jsphweid 4 days ago||
For me? Handling data like private voice memos, pictures, videos, calendar information, emails, some code, etc. Stuff I wouldn't want to share on the internet / have a model potentially slurp up and regurgitate as part of its memory when the data is invariably used in some future training process.
toddmorey 4 days ago|||
I think speech-to-text is the highlight use case for local models, because they are now really good at it and there's no network latency.
oezi 4 days ago||
How does it compare to Whisper? Does it hallucinate less, or is it more capable?
msabalau 4 days ago|||
I just like having quick access to a reasonable model that runs comfortably on my phone, even if I'm somewhere without connectivity.
thimabi 4 days ago|||
I’m thinking about building a pipeline to mass generate descriptions for the images in my photo collection, to facilitate search. Object recognition in local models is already pretty good, and perhaps I can pair it with models to recognize specific people by name as well.
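
A rough sketch of that kind of pipeline with the ollama Python client; the folder name and prompt are placeholders, and whether a given local gemma3n build accepts image inputs depends on the runtime, so treat it as illustrative:

    # Caption every JPEG in ./photos and write the results to captions.json.
    # Assumes `ollama pull gemma3n` and that the local build handles images.
    import json
    from pathlib import Path
    import ollama

    captions = {}
    for photo in sorted(Path("photos").glob("*.jpg")):
        resp = ollama.chat(
            model="gemma3n",
            messages=[{
                "role": "user",
                "content": "Describe this photo in one short, searchable sentence.",
                "images": [str(photo)],
            }],
        )
        captions[photo.name] = resp["message"]["content"]

    Path("captions.json").write_text(json.dumps(captions, indent=2))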
russdill 4 days ago|||
Hoping to try it out with home assistant.
androng 4 days ago||
filtering out spam SMS messages without sending all SMS to the cloud
thimabi 4 days ago||
Suppose I'd like to use models like this one to perform web searches. Is there anything available in the open-source world that would let me do that without much tinkering needed?

I think it’s something that even Google should consider: publishing open-source models with the possibility of grounding their replies in Google Search.

vorticalbox 4 days ago||
I have been using Ollama + Open WebUI. Open WebUI already has a web search tool; all you would need to do is click the toggle for it under the chat.
zettabomb 4 days ago||
Unfortunately the OWUI web search is really slow and just not great overall. I would suggest using an MCP integration instead.
thimabi 4 days ago||
Can you recommend a specific piece of software for using an MCP integration for web searches with local LLMs? That’s the first time I’ve heard of this.
nickthegreek 4 days ago||
I think the ones I've heard mentioned on YouTube use an MCP server that interfaces with SearXNG.
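
For reference, a minimal sketch of the kind of query such an integration runs under the hood, assuming a self-hosted SearXNG instance with `format=json` enabled (the URL is a placeholder):

    # Query SearXNG and return the top results as context for a local model.
    import requests

    SEARXNG_URL = "http://localhost:8080/search"  # placeholder instance

    def web_search(query, max_results=5):
        r = requests.get(SEARXNG_URL, params={"q": query, "format": "json"}, timeout=10)
        r.raise_for_status()
        results = r.json().get("results", [])[:max_results]
        return [{"title": x.get("title"), "url": x.get("url"), "snippet": x.get("content")}
                for x in results]

    print(web_search("gemma 3n memory footprint"))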
joerick 4 days ago||
Google does have an API for this. It has limits, but it's perfectly good for personal use.

https://developers.google.com/custom-search/v1/overview
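
A minimal sketch of calling that Custom Search JSON API, assuming you have created a Programmable Search Engine; the API key and engine ID are placeholders:

    # Free tier: 100 queries/day; additional queries cost $5 per 1,000, up to 10k/day.
    import requests

    API_KEY = "YOUR_API_KEY"
    CX = "YOUR_SEARCH_ENGINE_ID"

    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX, "q": "gemma 3n benchmarks"},
        timeout=10,
    )
    resp.raise_for_status()
    for item in resp.json().get("items", []):
        print(item["title"], "-", item["link"])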

thimabi 4 days ago||
Unfortunately 100 queries per day is quite low for LLMs, which tend to average 5-10 searches per prompt in my experience. And paying for the search API doesn’t seem to be worth it compared to something like a ChatGPT subscription.
gkbrk 4 days ago|||
You're not limited to 100 queries per day though. You're limited to 10,000 queries per day.
thimabi 4 days ago|||
> Custom Search JSON API provides 100 search queries per day for free. If you need more, you may sign up for billing in the API Console. Additional requests cost $5 per 1000 queries, up to 10k queries per day.

What I meant is that paying for 10k queries per day does not make sense compared to a simple $20/month ChatGPT subscription.

scrivna 4 days ago|||
10k queries would cost about $50/day (9,900 billed queries at $5 per 1,000 is ~$49.50).
gkbrk 4 days ago||
If you actually did 10k queries a day with any free search service you'd quickly find yourself banned.

ChatGPT deep research does a lot of searches, but it's also heavily rate-limited even on paid accounts.

Building a search index, running a lot of servers, storing and querying all that data costs money. Even a cheap search API is gonna cost a bit for 10k queries if the results are any good.

Kagi charges me 15 USD, and last month I did 863 searches with it; more than worth it for the result quality. For me the Google API would be cheaper. I'm pretty sure Kagi would kick me out if I did 10k searches a day, even though I'm paying for "unlimited searches".

A similar API from Bing costs between 15-25 USD for 1000 searches. Exa.ai costs 5 dollars for 1000 searches, and it goes up to 25 dollars if you want more than 25 results for your query.

Good web search is quite expensive.

sadeshmukh 4 days ago|||
The Google programmable search engine is unlimited and can search the web: https://programmablesearchengine.google.com/about/
yorwba 4 days ago||
That's intended for human users. If you try to use it for automated requests, you'll get banned for botting fairly quickly.
sadeshmukh 3 days ago||
They have a 10k/day API. I'm sure that's enough for one person.
yorwba 3 days ago||
That's the one that's described upthread as "paying for the search API doesn’t seem to be worth it."
lowbatt 4 days ago||
If I wanted to run this locally at somewhat decent speeds, is an RK3588S board (like OrangePi 5) the cheapest option?
ThatPlayer 3 days ago||
The RK3588 is a bit interesting because of its NPU. You can find models that have been converted to take advantage of that on HuggingFace: https://huggingface.co/models?search=rk3588 .

No clue how performance compares. Not sure it's worth dealing with the lesser software support compared to getting an AMD mini PC and using Vulkan on llama.cpp for standard GGUF models.

zipping1549 4 days ago|||
Tried it on an S25+ (SD 8 Elite): 0.82 tok/s (4B L model). It's a barely useful speed, but it's pretty impressive all the same.
jm4 4 days ago|||
It depends on your idea of decent speeds and what you would use it for. I just tried it on a laptop with an AMD HX 370 running on battery in power save mode and it's not especially impressive, although it runs much better in balanced or performance mode. I gave it the prompt "write a fizzbuzz program in rust" and it took almost a minute and a half. I expect it to be pretty terrible on an SBC. Your best bet is to try it out on the oldest hardware you have and figure out if you can tolerate worse performance.
lowbatt 4 days ago||
good idea, will test that out
ac29 4 days ago|||
The RK3588 uses a 7-year-old CPU design, and the OrangePi 5 looks expensive (well over $100).

A used sub-$100 x86 box is going to be much better

lowbatt 4 days ago||
you're right. For my purposes, I was thinking of something I could use if I wanted to manufacture a new (smallish) product
babl-yc 4 days ago||
I'm going to attempt to get it running on the BeagleY-AI https://www.beagleboard.org/boards/beagley-ai

Similar form factor to a Raspberry Pi, but with 4 TOPS of performance and enough RAM.

impure 4 days ago|
I've been playing around with E4B in AI Studio and it has been giving me really great results, much better than what you'd expect from an 8B model. In fact I'm thinking of trying to install it on a VPS so I can have an alternative to pricy APIs.