Posted by abnry 6 days ago
A word's "difficulty" would be some function of how rare it is. Once you have a reasonable estimate of the user's "skill" you can infer that a user won't know more difficult words. The benefit of this is you're not spending time asking the user about words they probably know.
Of course, at the individual level, difficulty does not necessarily increase monotonically with how rare the word is: a person might be very familiar with a domain-specific subset of English. But the "stratified sampling" approach has the same problem.
There is a similar problem in chess, where players have ratings which really only change on one dimension. So there can theoretically be a mismatch when puzzles are also scored on a single axis, since a "harder" puzzle that contains a motif a player is familiar with will actually be easier for the player.
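To make the single-axis idea concrete, here is a rough sketch of an Elo-style estimator. Everything in it is an assumption for illustration, not how the test in question works: difficulty is taken to be the log of a word's frequency rank, the chance of knowing a word is a logistic curve, and the names (`difficulty_from_rank`, `p_known`, `update_skill`, `next_rank`, `estimate_skill`) and constants are made up.

```python
import math

def difficulty_from_rank(rank: int) -> float:
    """Map a frequency rank (1 = most common) to a difficulty score.
    The log10 scale is an assumption, not a measured relationship."""
    return math.log10(rank)

def p_known(skill: float, difficulty: float, scale: float = 0.5) -> float:
    """Logistic guess at the chance that the user knows the word."""
    return 1.0 / (1.0 + math.exp(-(skill - difficulty) / scale))

def update_skill(skill: float, difficulty: float, knew_it: bool, k: float = 0.4) -> float:
    """Elo-style update: nudge the skill estimate toward the observed outcome."""
    expected = p_known(skill, difficulty)
    return skill + k * ((1.0 if knew_it else 0.0) - expected)

def next_rank(skill: float) -> int:
    """Pick a word whose difficulty is close to the current skill estimate,
    so no questions are spent on words the user almost certainly knows."""
    return max(1, round(10 ** skill))

def estimate_skill(knows_word, rounds: int = 30, skill: float = 3.0) -> float:
    """Run the adaptive loop; knows_word(rank) stands in for asking the user."""
    for _ in range(rounds):
        rank = next_rank(skill)
        skill = update_skill(skill, difficulty_from_rank(rank), knows_word(rank))
    return skill

# Simulated user who knows exactly the 20,000 most common words.
estimate = estimate_skill(lambda rank: rank <= 20000)
print(f"estimated vocabulary: about {round(10 ** estimate)} words")
```

With a fixed `k` the estimate keeps bouncing around the true threshold rather than settling on it; shrinking `k` over time would tighten it. The sketch also has exactly the weakness described above: one number per user and one per word cannot represent someone who knows a rare, domain-specific slice of the vocabulary.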
At least I learned a bunch of «faux-amis» in the process.
So not surprising perhaps that many of the more obscure words end up being French.
Of course, for a native speaker at least, but for people with English as a second language there are many lower-class words that we never encountered before, because they simply don't occur in books or in online discussions. I got 88 correct out of 100 in this list but I'm almost certain I'd have fared much worse had the list been about niche household or agricultural items.
What counts as "obscure" is highly context dependent.
Fun fact: according to a quick count by AI using web search, the previous sentence contains 21 words of Germanic origin, 2 of Latin origin, 2 of Greek origin and 1 of French origin. Also, the etymology of the word Germanic is Latin, while that of the word French is Germanic.
A lot of the more common and simpler words are Germanic, as is the grammar (e.g. compound words like cupboard).
At some point the word becomes both. Sourced from its mother language and maybe even still meaning the same thing in both, but no less an English word than any other at this point.
To be fair, I think I messed up a few advanced words by accident, but I think the general pattern would hold because many of the expert-level words seemed to have French roots. So it felt like it got easier towards the end for me. Grandmaster words were a bit weirder on the whole.
I'm an engineer and read mostly non-fiction so this probably explains the gap too.
Latin isn't really any sort of parent to Old English afaik, even though the Romans ran Britain for a while.
The alternatives to choose between appear to be LLM-generated; you can see several patterns ("now" and "forever" appear a lot).
Years ago, I used to play a similar game that you could keep playing and where you levelled up when you had enough words correct in a row, or down for a single mistake. A fun thing about it was that at very high levels, it got easier for me because they mixed in some old English words which were essentially the same as in Dutch, my native language. There was a charity aspect to it as well, I think it was https://freerice.com/ , but they seem to have simplified the game now.
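For what it's worth, the level-up rule described here is easy to sketch. This is only a guess at the general shape, not the actual rule of that game: the streak length of 3, the floor at level 1, and the name `play` are all assumptions.

```python
def play(answers, streak_needed: int = 3, level: int = 1) -> int:
    """Return the final level after a sequence of True/False answers.
    Level goes up after `streak_needed` correct answers in a row
    and down by one after any single mistake (never below 1)."""
    streak = 0
    for correct in answers:
        if correct:
            streak += 1
            if streak == streak_needed:
                level += 1
                streak = 0  # a new streak is needed for the next level
        else:
            level = max(1, level - 1)
            streak = 0
    return level

# Six correct answers earn two levels; one mistake then costs one of them.
print(play([True] * 6 + [False]))
```

Because going up takes several correct answers and going down takes only one mistake, a rule like this tends to settle at a level where the player gets most words right.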
The University of Ghent (Belgium) also used to have an interesting test which rated your proficiency against average scores at certain education levels. There I got 41.000 (IIRC), which was rated as average for a university-level native English speaker. An update at the bottom of https://languagehat.com/ghent-vocabulary-test/ discusses where that test went and has a few alternatives. Edit: https://www.myvocab.info/en is pretty similar to this test (found in another comment).