Posted by denysvitali 9/2/2025
It’s easy to become jaded with so many huge models being released, but the reality is that they still come from a relatively small group of countries.
For example, India has no indigenous models this big despite having a world-class talent pool.
Capital though ;)
[I am a grad student here in reinforcement learning]
Anyways, among all the VC/made-at-home driven snake oil, I'd say you should look at sarvam.ai; they are the most focused and no-nonsense group. They have a few good from-scratch models (I believe up to 7B or 14B), as well as a few Llama finetunes. Their API is pretty good.
The main thing folks here are attempting is to get LLMs good at local Indian languages (and I don't mean Hindi). I don't think people see value in creating an "indigenous Llama" that doesn't have that property. For this, the main bottleneck is data (relatively speaking, there is zero data in those languages on the internet), so there's a team, AI4Bharat, whose main job is curating datasets good enough to get stuff like _translation_ and other NLP benchmarks working well. LLMs too, for which they frequently work with Sarvam.
Are there any plans for further models after this one?
I can't imagine that this actually complies with the law.
>> "The "o" stands for "open", "openness", "open source" and is placed where a "TM" symbol (indicating patents, trademarks, protection) would normally reside. Instead openness is the apertus° trademark."
It's also a completely different kind of thing so trademark probably wouldn't come into it even if they had one.
> Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for copyrighted, non-permissive, toxic, and personally identifiable content.
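For the curious, here's a minimal sketch of what "retroactively respecting robots.txt exclusions" could look like over an already-crawled URL list, using Python's stdlib urllib.robotparser. The function name and structure are illustrative assumptions, not Apertus's actual pipeline:

    import urllib.robotparser
    from urllib.parse import urlparse

    def filter_crawled_urls(urls, user_agent="*"):
        # Re-check each previously crawled URL against the host's
        # *current* robots.txt and keep only the allowed ones.
        parsers = {}  # cache one parser per host
        kept = []
        for url in urls:
            host = "{0.scheme}://{0.netloc}".format(urlparse(url))
            if host not in parsers:
                rp = urllib.robotparser.RobotFileParser()
                rp.set_url(host + "/robots.txt")
                rp.read()  # network fetch; a real pipeline would cache/batch this
                parsers[host] = rp
            if parsers[host].can_fetch(user_agent, url):
                kept.append(url)
        return kept

Note that robots.txt only covers crawler exclusion; the copyright, toxicity, and PII filtering they describe would be separate content-level passes.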