Posted by doener 4 days ago
When I ask smaller models a question in English, the model does well. When I ask the same model a question in Turkish, the answer is mediocre. When I ask the model to translate my question into English, get the answer, and translate the answer back to Turkish, the model again does well.
For example, I tried the above with Llama 3.3 70B, and asked it to plan me a 3-day trip to Istanbul. When I asked Llama to do the translations between English <> Turkish, the answer was notably better.
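For anyone who wants to reproduce this, the workaround is roughly the following pipeline. Note that ask_llm(prompt) is a hypothetical helper standing in for however you call the model (a local Llama 3.3 endpoint, an API client, etc.):

    # Minimal sketch of the translate -> answer -> translate-back workaround.
    # ask_llm(prompt) is a hypothetical stand-in for whatever client you use to
    # send a prompt to the model; it is assumed to return the completion text.

    def answer_via_english(question_tr: str, ask_llm) -> str:
        # 1. Have the model translate the Turkish question into English.
        question_en = ask_llm(
            "Translate the following Turkish text to English:\n\n" + question_tr
        )
        # 2. Answer the question in English, where the model is strongest.
        answer_en = ask_llm(question_en)
        # 3. Translate the English answer back into Turkish.
        return ask_llm(
            "Translate the following English text to Turkish:\n\n" + answer_en
        )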
Anyone else observed a similar behavior?
Last I checked, no open-weight LLM has a language other than English as the dominant language in its training data.
LLMs are actually designed to have some randomness in their responses.
To make the answer reproducible, set the temperature to 0 (eliminating randomness) and provide a static seed (ensuring consistent results) in the LLM's configuration.
Setting it to 0 in theory eliminates all randomness: instead of sampling one token from the list of likely next tokens, only the MOST PROBABLE token is ever chosen.
However, in practice, setting the temperature to 0 in most GUIs does not actually set the temperature to 0, but to a "very small" value ("epsilon"), the reason being to avoid a division-by-zero exception/crash in the underlying formula (the temperature divides the logits). So don't be surprised if you cannot get rid of random behavior entirely.
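For the curious, here is a minimal sketch (assuming NumPy; real samplers differ in the details) of where the temperature and the seed enter, and why 0 has to be handled specially rather than actually divided by:

    import numpy as np

    def sample_next_token(logits, temperature, rng):
        # Temperature divides the logits before the softmax, so temperature = 0
        # would literally divide by zero; samplers either clamp it to a tiny
        # epsilon or special-case it to greedy argmax, as sketched here.
        if temperature == 0.0:
            return int(np.argmax(logits))  # greedy: always the most probable token

        scaled = logits / temperature          # the division that breaks at 0
        probs = np.exp(scaled - scaled.max())  # numerically stable softmax
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))

    rng = np.random.default_rng(seed=42)        # fixed seed -> reproducible draws
    logits = np.array([2.0, 1.0, 0.1])
    print(sample_next_token(logits, 0.0, rng))  # always index 0
    print(sample_next_token(logits, 0.8, rng))  # same result every run with this seed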
Why don't they just special-case it?
Would be interesting to see whether they actually score better on LeetCode questions when using Python.
But in this case the LLM is not exposed to explicit translation pairs between these two languages; rather, by seeing enough examples in similar contexts, LLMs transfer some of their learnings in Python to Ruby (for better or worse results).
Presumably there is a lot more public info about, and code in, JavaScript and Python, hence this "preference".
Maybe the LLM preferring English is because of a similar phenomenon - it has been trained on a mostly Western, English-speaking internet?
For example consider Pascal or C89 requiring all variables to be declared at the start of the function body. That makes it much harder to generate code in a linear fashion. In Python you can just make up a variable the moment you decide you need it. In Pascal or C89 you would have to go back and change previous code, which LLMs can't easily do.
Similar things likely apply to strict typing. Typing makes it easier to reason about existing code, but it makes it harder to write new code if you don't have the ability to go back and change your mind on a type choice.
Both could be solved if we selected tokens in a beam search, searching for the path with the highest combined token probability instead of greedily selecting one token at a time. But that's much more expensive and I'm not sure anyone still does that with large-scale LLMs.
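For illustration, here is a bare-bones sketch of that idea (my own toy version, not what any production decoder does): keep the beam_width best partial sequences ranked by summed log-probability instead of committing greedily to one token at a time.

    import math

    def beam_search(next_token_probs, beam_width, max_len, eos):
        # next_token_probs(seq) -> list of (token, probability) pairs for the next step
        beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
        for _ in range(max_len):
            candidates = []
            for seq, score in beams:
                if seq and seq[-1] == eos:
                    candidates.append((seq, score))  # finished beams carry over unchanged
                    continue
                for tok, p in next_token_probs(seq):
                    candidates.append((seq + (tok,), score + math.log(p)))
            # keep only the beam_width highest-scoring partial sequences
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        return beams[0][0]

Systems that still use beam search usually also add length normalization so longer sequences aren't unfairly penalized.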
Human programmers also did this more often in those days than is probably the case now.
This likely plays a major - probably dominant - role.
It's interesting to think of other factors too though. The relatively concise syntax of those languages might make them easier for LLMs to work with. If resources are in any way token limited then reading and writing Spring Boot apps is going to be burdensome.
Those languages also have a lot of single file applications, which might make them easier for LLMs to learn. So much of iOS development for example is split across many files and I wonder if that affects the quality of the training data.
So, part of the improved performance as models grow in parameter count is probably due not only to the expanded raw material they are trained on, but also to a greater ability to ultimately "realize" and connect the apparent meanings of words, so that a German speaker might benefit more and more from training material in Korean.
> These results show that features at the beginning and end of models are highly language-specific (consistent with the {de, re}-tokenization hypothesis [31] ), while features in the middle are more language-agnostic. Moreover, we observe that compared to the smaller model, Claude 3.5 Haiku exhibits a higher degree of generalization, and displays an especially notable generalization improvement for language pairs that do not share an alphabet (English-Chinese, French-Chinese).
Source: https://transformer-circuits.pub/2025/attribution-graphs/bio...
However, they do see that Claude 3.5 Haiku seems to have an English "default" with more direct connections. It's possible that an LLM needs to take a more roundabout route via generalizations to communicate in other languages, and that this causes a drop-off in performance the smaller the model is?
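We obviously can't inspect Claude's internals, but a rough analogue of that layer-wise picture can be probed on an open multilingual model. The sketch below (assuming the transformers and torch packages, with xlm-roberta-base purely as an example model) compares hidden states for a sentence and its Turkish translation, layer by layer:

    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "xlm-roberta-base"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name, output_hidden_states=True)

    def layer_embeddings(text):
        batch = tok(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**batch)
        # mean-pool each layer's token vectors into one vector per layer
        return [h.mean(dim=1).squeeze(0) for h in out.hidden_states]

    en = layer_embeddings("The capital of Turkey is Ankara.")
    tr = layer_embeddings("Türkiye'nin başkenti Ankara'dır.")

    for i, (a, b) in enumerate(zip(en, tr)):
        sim = torch.cosine_similarity(a, b, dim=0).item()
        print(f"layer {i:2d}: cosine similarity {sim:.3f}")

If the "language-agnostic middle" picture holds for this model too, the similarity should be noticeably higher in the middle layers than at the ends.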
It is like a student in school who is brilliant at learning by heart and repeating the words they studied without understanding the concept, versus a student who actually understands the topic and can reason about the concepts.
My point is, those language pairs aren't random examples. Chinese isn't some completely foreign, new thing when it comes to the differences between it and English.
It's clear from the start that language modelling is not yet there. It can't reason about low-level structure (letters, syllables, rhyme, rhythm), and it can't map all languages to a single clear representation. The representation is a mushy, distributed mess out of which you get good or bad results.
It's brilliant how relevant the responses are, and how often they're correct, but the underlying process is driven by very weird internal representations.
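The letters/syllables point is easy to see from the model's side of the interface: a BPE tokenizer hands over subword ids, not characters. A quick check, assuming the tiktoken package (cl100k_base is just one example vocabulary):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for word in ["strawberry", "rhythm", "Istanbul"]:
        ids = enc.encode(word)
        pieces = [enc.decode([i]) for i in ids]
        # the model "sees" the ids/pieces, never the individual letters
        print(f"{word!r} -> {ids} -> {pieces}")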
Yep, that there seems like the definition of knowing. Don't worry, your humanity isn't at risk.
Knowing implies reasoning. LLMs don't "know" things. These statistical models continuate text. Having a mental model that they "know" things, that they can "reason" or "follow instructions" is driving all sorts of poor decisions.
Software has an abstraction fetish. So much of the material available for learners is riddled with analogies and a "you don't need to know that" attitude. That is counterproductive, and I think having accurate mental models matters.
That's not really clear-cut, that's simply a position you're taking. JTB could (I reckon) say that a model's "knowledge" is justified by the training process and reward functions.
> LLMs don't "know" things. These statistical models continuate text.
I don't think it's clear to anyone at this point whether or not the steps taken before token selection (eg: the journey through their dimensional knowledge space provided by attention) are close to or far from how our own thought processes work, but the description of LLMs as "simply" continuating text reduces them to their outputs. From my perspective, as someone on the other side of a text-based web-app from you, you also are an entity that simply continuates text.
You have no way of knowing whether this comment was written by a sentient entity -- with thoughts and agency -- or an LLM.
And while accurate mental models can help in certain contexts, they're not always necessary. I don't need a detailed model of how my OS handles file operations to use it effectively. A high-level understanding is usually enough. Insisting on deep internal accuracy in every case seems more like gatekeeping than good practice.
There is a steep drop in quality in any non-English language, but in general fewer native speakers = worse results. The outputs tend to have a certain "voice" which is extremely easy to spot, and the accuracy of the results goes out the window (way worse than in English).
This kind of training data typically looks like ChatGPT-style conversations where all the prompts are templated like "Translate the following text from X to Y: [text]" and the LLM's expected answer is the translated text.
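As a rough illustration of what one such templated example might look like (the field names and exact wording here are my assumptions, not any particular vendor's format):

    import json

    TEMPLATE = "Translate the following text from {src} to {tgt}:\n\n{text}"

    def make_translation_example(src, tgt, text, translation):
        return {
            "prompt": TEMPLATE.format(src=src, tgt=tgt, text=text),
            # the expected answer is just the translated text
            "response": translation,
        }

    example = make_translation_example(
        "English", "Turkish",
        "Plan a 3-day trip to Istanbul.",
        "İstanbul'a 3 günlük bir gezi planla.",
    )
    print(json.dumps(example, ensure_ascii=False, indent=2))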
LLMs can generalize through transfer learning (to a certain extent) from these translation pairs to some understanding (strong) and even answering (weak) in the target language. It also means that the LLM's actual sweet spot is translation itself, since that's what it was trained on, not just a generalization.
This is probably the case for the "deep reasoning" models as well. If you for example try DeepSeek R1, it will likely reason in either English or Chinese (where it presumably is well trained) even if the prompt is in other languages.
This should fix your issue, right?
[0] I am simplifying here, but it would make sense for an LLM to learn this, even though the intermediate representation is not exactly English, given the fact that much of the internet is in English and the empirical fact that they are good at translating.
In the past few hours a related, seemingly important article appeared - see https://www.quantamagazine.org/to-make-language-models-work-...
https://www.anthropic.com/research/tracing-thoughts-language...
Was pretty good with Latvian (better than other models this size as well as variants of Llama or Qwen that I could run) and I assume probably with other EU languages as well.
one of them with 50k population
It was called ScandEval until recently.
Meltemi: A large foundation Language Model for the Greek language
Also, Stable Diffusion was originally (and still is I believe) developed in Munich.
It's true though that raising capital and finding investors works wayyy better in the US (kindof needless to say on HN) and so was getting top talent - at least in the past. Don't get me started on energy prices ;) but I don't believe those contribute significantly in the end anyway.
I think a pile of money and talent is largely the cause of where they're at.
But this is an image-like benchmark. Has anyone looked at the article about EU-ARC - what is the difference? Why can't it be measured on a regular one?
I glanced through it and didn't find it right away, but judging by their tokenizer, they are training from scratch. In general, I don't like this approach for the task at hand. For the large languages, there are already good models that they don't want to compare with. And for low-resource languages, it is very important to take more languages from the same language group, which are not necessarily part of the EU.
We need more tokens, more variety of topics in texts and more complexity.
(That amount is equivalent to 50,000 books, which is more than almost any native speaker will have read.)