Posted by j0e1 3 days ago
Google Translate is a good default, but LLMs are really good at translation, as they're better at understanding context and producing culturally appropriate translations.
(I live in Cambodia, where they speak Khmer.)
I actually found Facebook's translations pretty good (better than Google Translate for anything longer than a sentence). From my understanding, Khmer is a bit more verbose and context-dependent, so LLMs would be a big help in capturing those nuances.
In the inverse case (LLMs generating Khmer from English), I've heard from locals that it sounds formal and "robotic", which I found quite interesting.
I'm interested in finding some thorough testing of translations across different LLMs vs translation APIs.
(Sorry I had to)
I'm currently concentrating on better data gathering for low-resource languages.
When you look in detail at data like Common Crawl, finepdfs, and fineweb, (1) they are missing quality data sources that are out there if you know where to look, and (2) the sources they do have are not processed "finely" enough (e.g. finepdfs classifies each page of a PDF as having a single language, whereas many language-learning sources contain language pairs, etc.).
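To illustrate the per-page classification problem: a bilingual page (Khmer–English pairs, say) gets a single dominant-language label, which hides exactly the parallel data you want. Here's a toy sketch using a character-script heuristic (real pipelines use trained classifiers like fastText's lid.176; the function names are mine, not from any of those projects):

```python
import unicodedata

def dominant_script(text):
    # Toy language ID: count alphabetic characters per Unicode script.
    # Only distinguishes Khmer vs Latin for this illustration.
    counts = {}
    for ch in text:
        if ch.isalpha():
            script = "KHMER" if "KHMER" in unicodedata.name(ch, "") else "LATIN"
            counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else None

# A "language learning" page: alternating Khmer/English lines.
page = "ជំរាបសួរ\nHello\nអរគុណ\nThank you"

# Page-level labeling collapses the whole thing to one language,
# losing the pair structure:
print(dominant_script(page))

# Line-level labeling preserves the bilingual pairs:
for line in page.splitlines():
    print(dominant_script(line), line)
```

The point being that the granularity of the classification step decides whether parallel text survives into the corpus at all.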
What languages are you prioritizing?
I'm living in Guatemala, so I've been focusing on the Mayan languages here (22 languages, millions of speakers).
In one of the villages we visited, there was a language school where foreigners were learning Jacalteco. One student was from Israel: where most of the students had vocabulary lists in three columns (Jacalteco - Spanish - English), his had four, with one more step of translation into Hebrew.
Is it open weight? If so, why isn't there just a straight link to the models?
They say their leaderboard and evaluation datasets are freely available. The closest statement I've seen in the paper is: "Our translation models are built on top of freely available models."
It looks like Meta found a way forward.
Reading Meta's abstract, it seems they have found ways to improve the quality of the training data, and also built new evaluation tools?
They also say that OMT-LLaMA does a better job at text generation than other baseline models.
And the errors are really basic, like translating "shortly" as "short", which is not the same thing at all!