Posted by bundie 4 days ago
gemma-3n-E4B-it-Q8_0:

    import_cuda_impl: initializing gpu module...
    get_rocm_bin_path: note: hipcc not found on $PATH
    [...]
    llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma3n'
    llama_load_model_from_file: failed to load model
    llama_init_from_gpt_params: error: failed to load model 'gemma-3n-E4B-it-Q8_0.gguf'
    main: error: unable to load model
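In case anyone else hits this: "unknown model architecture: 'gemma3n'" just means the runtime predates Gemma 3n support, so the GGUF itself may be fine. Here is a minimal llama-cpp-python sketch of the same load, assuming a build recent enough to know the gemma3n architecture (the file name is taken from the error above; everything else is a placeholder):

    # Assumes llama-cpp-python built against a llama.cpp version that knows 'gemma3n';
    # older builds fail with exactly the "unknown model architecture" error above.
    from llama_cpp import Llama

    llm = Llama(
        model_path="gemma-3n-E4B-it-Q8_0.gguf",  # the GGUF from the error message
        n_ctx=4096,        # context size, adjust to taste
        n_gpu_layers=-1,   # offload everything if a GPU backend is available
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Say hello."}],
        max_tokens=64,
    )
    print(out["choices"][0]["message"]["content"])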
What's interesting is that it beats smarter models in my Turing Test Battle Royale[1]. I wonder if that means it's a better talker.
Until the article gets into the inner details (MatFormer, per-layer embeddings, caching...), the only sentence I've found that concretely mentions something new is "the first model under 10 billion parameters to reach [an LMArena score over 1300]". So it's supposed to be better than all other models up to the ones that use 10 GB+ of RAM, if I understand that right?
Open weights
Google's proprietary model line is called Gemini. There is a variant that can be run offline called Gemini Nano, but I don't think it can be freely distributed; it's only available as part of Android.
As for what's new, Gemma 3n seems to have some optimizations that make it better than the 'small' Gemma 3 models (such as the 4B) at a similar speed and footprint.
Does anybody know how to actually run these using MLX? mlx-lm does not currently seem to support them, so I wonder what Google means exactly by "MLX support".
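For reference, this is roughly the standard mlx-lm pattern I was expecting to work (the repo name is the mlx-community upload from Google's page; treat this as a sketch of the attempt, not a known-working recipe):

    # Standard mlx-lm load/generate usage; as of writing, mlx-lm doesn't seem
    # to accept the gemma3n architecture, so this is what I expected to work.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/gemma-3n-E4B-it-bf16")
    text = generate(model, tokenizer, prompt="Hello, who are you?", max_tokens=64)
    print(text)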
Do you know if these actually preserve the structure of Gemma 3n that makes these models more memory-efficient on consumer devices? I feel like the modified inference architecture described in the article is what makes this possible, but it probably needs additional software support.
But given that they were uploaded a day ago (together with the blog post), maybe these are actually the real deal? In that case, I wish Google could just link to these instead of to https://huggingface.co/mlx-community/gemma-3n-E4B-it-bf16.
Edit: Ah, these are just non-MLX models. I might give them a try, but they're not what I was looking for. Still, thank you!
Though I can imagine a few commercial applications where something like this would be useful. Maybe in some sort of document processing pipeline.
I think it’s something that even Google should consider: publishing open-source models with the possibility of grounding their replies in Google Search.
What I meant is that the pricing for 10k queries per day doesn't make sense compared to a simple $20/month ChatGPT subscription.
ChatGPT deep research does a lot of searches, but it's also heavily rate-limited even on paid accounts.
Building a search index, running a lot of servers, storing and querying all that data costs money. Even a cheap search API is gonna cost a bit for 10k queries if the results are any good.
Kagi charges me 15 USD, and last month I did 863 searches with it; more than worth it for the result quality. For me the Google API would be cheaper. I'm pretty sure Kagi would kick me out if I did 10k searches a day, even though I'm paying for "unlimited searches".
A similar API from Bing costs 15-25 USD per 1,000 searches. Exa.ai costs 5 USD per 1,000 searches, and that goes up to 25 USD if you want more than 25 results per query.
Good web search is quite expensive.
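Quick back-of-envelope with the prices quoted above, just to show the scale (these are the thread's numbers, rounded, not official quotes):

    # 10k searches/day at the per-1,000 prices mentioned in this thread.
    searches_per_month = 10_000 * 30             # 300,000 searches

    bing_low  = searches_per_month / 1000 * 15   # ~$4,500/month at $15 per 1k
    bing_high = searches_per_month / 1000 * 25   # ~$7,500/month at $25 per 1k
    exa       = searches_per_month / 1000 * 5    # ~$1,500/month at $5 per 1k

    print(bing_low, bing_high, exa)              # 4500.0 7500.0 1500.0

So even the cheapest option quoted here works out to 75-375x the $20/month ChatGPT subscription mentioned upthread.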
No clue how performance compares. Not sure it's worth dealing with the lesser software support compared to getting an AMD mini PC and using Vulkan on llama.cpp for standard GGUF models.
A used sub-$100 x86 box is going to be much better
Similar form factor to a Raspberry Pi, but with 4 TOPS of performance and enough RAM.