Posted by ingve 4 days ago
LLMs are notoriously bad at counting letters in words or performing simple oulipos of letter omission. GPT-4o, for example, writes a small Python program and executes it in order to count letter instances. We all know that tokenization effectively erases knowledge about letters in prompts and directly and negatively impacts performance at these tasks, yet we haven't found a way to solve it.
In other words, if you train a model using word2vec's preprocessing and GloVe's algorithm, the result looks more like a "standard-issue" word2vec model than a "standard-issue" GloVe model.
However, Word2Vec and GloVe were fundamentally different; when used as designed, GloVe worked better pretty uniformly.
Since they're part of the pre-processing pipeline, you can't quickly test them out for effectiveness. You have to restart a pretraining run to test downstream effectiveness.
Separately, as much as an attention module can do universal nonlinear transformations... I wonder if it makes sense to add specific modules for some math primitives as well. I remember that the executor paper [1] (a slight precursor to the "Attention Is All You Need" paper) created self-contained modules for operations like less-than, count, and sum, and then explicitly orchestrated them in the decoder.
I'm surprised we haven't seen such solutions produce SOTA results from the math-AI or code-AI research communities.
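Something like the sketch below is what I have in mind (purely illustrative PyTorch, not the executor paper's actual modules; the soft comparisons are my own stand-ins):

    # Illustrative only: tiny differentiable stand-ins for "count", "sum", and "less than"
    # that a controller/decoder could learn to orchestrate, as in the executor paper.
    import torch
    import torch.nn as nn

    class CountOp(nn.Module):
        def forward(self, values, query):
            # values: (batch, seq), query: (batch, 1)
            match = torch.exp(-(values - query) ** 2)          # soft equality, 1.0 when equal
            return match.sum(dim=-1, keepdim=True)             # differentiable count

    class SumOp(nn.Module):
        def forward(self, values, mask):
            return (values * mask).sum(dim=-1, keepdim=True)   # masked sum

    class LessThanOp(nn.Module):
        def forward(self, a, b):
            return torch.sigmoid(b - a)                        # soft "a < b", > 0.5 when true

    # A decoder would predict a distribution over these ops at each step and mix their
    # outputs, which is roughly the explicit orchestration described in the paper.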
Besides that, for alphabetic languages, there exists almost no relation between form and meaning. E.g., “ring” and “wing” differ by one letter but have no real common meaning. By picking the character or byte as your choice of representation, the model basically has to learn to distinguish ring and wing in context. This is a lot of work!
So, while working on the character or byte level saves you some embeddings and thus makes your model smaller, it puts all of the work of distinguishing similar sequences with divergent meanings on the model itself, which means you need a larger model.
By having subwords, a part of this distinguishing work already has been done by the vocabulary itself. As the article points out, this sometimes fails.
Also true for abugida-based languages, e.g. சரம் (saram = string) vs மரம் (maram = tree), and many more. I think your intention with specifying "alphabetic languages" was to say "non-logographic languages", right?
And even in Chinese it's a fairly weak relationship. A large portion of the meanings of individual characters come from sound loan. For example the 英 in 英雄 means "hero", in 英语 means "England", and in 精英 means "flower". The relationship there is simple homophony.
On the other hand, one thing you do get with written Chinese is that "1 character = 1 morpheme" very nearly works. So mechanistically breaking a text into a sequence of morphemes can be done pretty reliably without the aid of a semantic model or exhaustive hard-coded mapping. I think that for many other languages you can't even get close using only syntactic analysis.
Written Japanese is much more ideographic than written Chinese. Japanese spelling is determined, such as it is, by semantics. Chinese spelling is determined by sound. Thus, 女的, 娘们, and 妮子, all meaning 'girl' or 'woman', have no spelling in common because they are different words, while Japanese uses 女 for "jo" and "onna" despite a total lack of any relationship between those words.
In either case, isn't this something we already do well?
And no, at least for the languages with which I'm familiar, SOTA tokenizers tend to capture only the easy cases.
For example, the GPT-4 tokenizer breaks the first sentence of your post like so:
What/ do/ you/ mean/ by/ non/-m/orp/heme/ lexical/ units/?
Notice how "morpheme" gets broken into three tokens, and none of them matches "morpheme"'s two morphemes. "Lexical" and "units" are each a single token, when they have three and two morphemes respectively. Or in French, the word "cafetière" gets chopped willy-nilly into "c/afet/ière". The canonical breakdown is "cafe/t/ière".
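If you want to poke at this yourself, tiktoken makes it easy to inspect the splits (assuming the cl100k_base encoding; the exact breakdown also depends on leading whitespace and casing, so it may not match the above character-for-character):

    # Inspect tokenizer splits (assumes the tiktoken package, cl100k_base encoding)
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for word in ["morpheme", "lexical", "units", "cafetière"]:
        pieces = [enc.decode_single_token_bytes(t) for t in enc.encode(word)]
        print(word, "->", pieces)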
The problem with pure numerical values representing a class as input to a neural network layer is that the byte-encoding number is going to be very hard for the transformer to memorize as an exact value, especially when numbers relatively close to each other often do not share much meaning. Categories are usually encoded somehow, like a one-hot embedding layer or, more recently, a learned embedding, so that the different categories can be easily distinguished (different categories are close to orthogonal).
My prediction would be that using the numerical value directly would not work at all, and that using learnable embeddings would work, but you would have to reserve that part of the token embedding for each token, which would hurt performance a lot on non-spelling tasks relative to just letting the whole embedding represent the token however the model sees fit.
But, IDK! It would be easy enough to try! You should, on a small toy model. And then try a small learnable re-usable character embedding. And write a blog post. Would be happy to coach / offer some GPU time / answer any other questions you have while building it.
And then use something like HellaSwag to measure how much you've lost on general text completion compared to a vanilla LLM with the same embedding size all dedicated to just the token.
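A rough sketch of what I mean by a re-usable character embedding reserved alongside the token embedding, in PyTorch (all names and dimensions here are made up for illustration):

    # Made-up sketch: reserve a slice of each token's embedding for a small, shared
    # character/byte embedding; the rest stays free for the model to use as it likes.
    import torch
    import torch.nn as nn

    class CharAwareEmbedding(nn.Module):
        def __init__(self, vocab_size, d_model, d_char=64, n_bytes=256):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, d_model - d_char)  # free part
            self.char = nn.Embedding(n_bytes, d_char)              # shared spelling part

        def forward(self, token_ids, token_bytes):
            # token_ids: (batch, seq); token_bytes: (batch, seq, max_chars), 0-padded
            free = self.tok(token_ids)
            spelling = self.char(token_bytes).mean(dim=-2)  # crude: padding is averaged in
            return torch.cat([free, spelling], dim=-1)      # (batch, seq, d_model)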
For example “parent” and “parents” are aligned, they share letters in the same position, but “skew” and “askew” share no letters in the same position.
You can hypothetically try to ameliorate this by other means, but if you just naively drop from tokenization to character- or byte-level models, this is what goes wrong.
I feel this way about embeddings
This line of thought seems related to the old wisdom of finding innovative solutions by mucking around in the layer below whatever the "tools of the trade" are for your domain
If it were so simple, why hasn’t this already been dealt with?
Multimodal VQA models also have had a hard time generalizing counting. Counting is not as simple as changing the tokenizer.
When you ask an LLM to count the number of "r" in the word Strawberry, the LLM will output a random number. If you ask it to separate the letters into S t r a w b e r r y, then each letter is tokenized independently and the attention mechanism is capable of performing the task.
What you are doing is essentially denying that the problem exists.
"How many letters "r" are in the word Frurirpoprar"
And it didn't use program execution (at least it didn't show the icon, and the answer was very fast, so it's unlikely it generated and executed a program to count).
You interpret the token sequence by constructing a parse tree, but that doesn't require you to forget that the tokens exist.
The point is, you have a choice. You can do the tokenization however you like. The reason 23 is interesting is that there is a case to be made that a model will be more likely to understand that 23 is related to Jordan if it's one token; if it's two tokens, it's more difficult. The opposite is true for math problems.
The reality is whatever we want to make it. It's likely that current schemes are... suboptimal. In practice it would be great if every token were geometrically well spaced after embedding and preserved semantic information, among other things. The "other things" have taken precedence thus far.
?
IMHO that's the main reason people turn to any sort of automated data-processing tools in the first place: they don't want to look at the input data. They'd rather have "the computer" look at it and maybe query them back with some additional info gathering requests. But thinking on their own? Ugh.
So I boldly propose the new definition of AGI: it's the data-processing entity that will (at last!) reliably liberate you from having to look at your data before you start shoving this data into that processing entity.
And this is an aside, but I see folks using LLMs to do this correction in the first place. I don't think using LLMs to do correction in a multi-pass system is inherently bad but I haven't been able to get good results out of "call/response" (i.e. a prompt to clean up this text). The best results are when you're running an LLM locally and cleaning incrementally by using token probabilities to help guide you. You get some candidate words from your wordlist based on the fuzzy match of the text you do have, and candidate words predicted from the previous text and when both align -- ding! It's (obviously) not the fastest method however.
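Very roughly, the shape of it looks like this (a toy sketch assuming Hugging Face transformers with gpt2 as the local model and difflib for the fuzzy wordlist match; my actual setup differs in the details):

    # Toy sketch: fuzzy wordlist candidates re-ranked by a local LM's token probabilities.
    import difflib
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def lm_score(prefix, candidate):
        # Log-probability the LM assigns to `candidate` following `prefix`.
        ids = tok(prefix + " " + candidate, return_tensors="pt").input_ids
        n_prefix = tok(prefix, return_tensors="pt").input_ids.shape[1]
        with torch.no_grad():
            logits = lm(ids).logits
        logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
        targets = ids[0, 1:]
        per_token = logprobs[torch.arange(len(targets)), targets]
        return per_token[n_prefix - 1:].sum().item()

    def correct_word(prefix, noisy, wordlist):
        # Candidates from the wordlist by fuzzy match, then pick the one the LM likes best.
        candidates = difflib.get_close_matches(noisy, wordlist, n=5, cutoff=0.6) or [noisy]
        return max(candidates, key=lambda w: lm_score(prefix, w))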
We were discussing this earlier this week -- I'm helping with a RAG-like application for a project right now, and we're concerned with how much small typos or formatting differences in users' queries can throw off our embedding distances.
One thought was: Should we be augmenting our training data (or at the very least, our pretraining data) with intentional typos / substitutions / capitalizations, just to help it learn that "wrk" and "work" are probably synonyms? I looked briefly around for typo augmentation for (pre)training, and didn't see anything at first blush, so I'm guessing that if this is a common practice, that it's called something else.
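Rolling our own augmenter seems cheap enough, though. Something like this toy version (the noise types and rates are arbitrary choices, not an established recipe) is roughly what I had in mind:

    # Arbitrary noise types/rates, just to illustrate the idea of typo augmentation.
    import random

    def add_typos(text, rate=0.05):
        out = []
        for ch in text:
            r = random.random()
            if r < rate / 3:
                continue                   # drop a character: "work" -> "wrk"
            elif r < 2 * rate / 3:
                out.append(ch.swapcase())  # random capitalization change
            elif r < rate:
                out.append(ch + ch)        # doubled character
            else:
                out.append(ch)
        return "".join(out)

    print(add_typos("how do I submit my work order"))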
Stemming: Reducing words to their base or root form (e.g., “working,” “worked” becoming “work”).
Lemmatization: Similar to stemming, but more sophisticated, accounting for context (e.g., “better” lemmatizes to “good”).
Token normalization: Standardizing tokens, such as converting “wrk” to “work” through predefined rules (case folding, character replacement).
Fuzzy matching: Allowing approximate matches based on edit distance (e.g., “wrk” matches “work” due to minimal character difference).
Phonetic matching: Matching words that sound similar, sometimes used to match abbreviations or common misspellings.
Thesaurus-based search: Using a predefined list of synonyms or alternative spellings to expand search queries.
Most of these have open and free lists you can use; check the sources of Manticore Search, for example.
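A few of them are a couple of lines with standard tooling (assuming NLTK with the wordnet data downloaded, plus the standard library; illustrative only, not how Manticore implements them):

    # Assumes NLTK with the wordnet data (nltk.download("wordnet")); fuzzy matching here
    # is just the standard library.
    import difflib
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    print([PorterStemmer().stem(w) for w in ["working", "worked", "works"]])  # ['work', 'work', 'work']
    print(WordNetLemmatizer().lemmatize("better", pos="a"))                   # 'good'
    print(difflib.get_close_matches("wrk", ["work", "walk", "week"], n=1))    # ['work']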
I don't understand. How is that different from stemming? What's the base form of "better" if not "good"? The nature of the relationship between "better" and "good" is no different from that between "work" and "worked".
This is because most words in most languages follow patterns of affixes/prefixes (e.g. worse/worst, harder/hardest), but not always (good/better/best)
The problem was that word/term-frequency-based modelling would inappropriately fail to link terms that actually had the same root (stam or stem).
Stemming removed those affixes so it turned "worse and worst" into "wor and wor" and "harder/hardest" into "hard", etc.
However it failed for cases like good/better.
Lemmatizing took in a larger context and built up databases of word senses, linking such cases so that words are processed more reliably. So lemmatizing is rule-based, plus more.
Fundamentally, the rule of lemmatizing is that you encounter a word, you look it up in a table, and your output is whatever the table says. There are no other rules. Thus, the lemma of seraphim is seraph and the lemma of interim is interim. (I'm also puzzled by your invocation of "context", since this is an entirely context-free process.)
There has never been any period in linguistic analysis or its ancestor, philology, in which this wasn't done. The only reason to do it on a computer is that you don't have a digital representation of the mapping from token to lemma. But it's not an approach to language processing, it's an approach to lack of resources.
Context today may mean more (e.g. the whole sentence, or string, or the prompt context for an LLM), and obviously context has a meaning in computational linguistics (e.g. "context-free grammar"), but the point here is that stemmers arbitrarily follow the same process without a second stage. If a stemmer encounters "best" and "good", it by definition does not have a stage that would use the same lemma for them. Context is just one of those overloaded terms, unfortunately.
Lemmatizing, in terms of how it works in simple scenarios (let's imagine reviews), helps to lump those words together and correctly identify the proportion of term frequencies for words we might be interested in, more consistently than stemming can. It's still limited by using word breaks like spaces or punctuation, of course.
Typos should minimally impact your RAG.
In general, fine-tuned models often fail to generalize well on inputs that aren’t very close to examples in the fine-tuning data set.
You can use SetFit, fewer examples, an SVM, etc., depending on how much separation, recall, and other aspects matter to you for the task at hand.
Sensitivity to biasing toward the dataset is a choice of training method, not an inherent attribute.
It's just not really a major issue unless you finetune with an entirely new or unseen language in the present day.
We are doing a fair bit of task-specific fine-tuning for an asymmetric embeddings model (connecting user-entered descriptions of symptoms with the service solutions that resolved their issues).
I would like to run more experiments with this and see if introducing typos into the user-entered descriptions will help it not forget as much.
Thank you again!
This might also work for indexing your data, but has the potential to get really expensive quickly.
It was fascinating how much tokenization strategies could affect a particular subset of queries. A really great example is "W-4" or "W4". Standard tokenization might split on the "-" or split on letter/number boundaries. That input now becomes completely unidentifiable in the index, when it otherwise would have been a very rich factor in matching HR / salary / tax related content.
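A tiny illustration of the failure mode with a naive regex tokenizer (real analyzers such as Lucene's standard one behave similarly on hyphens):

    # Naive regex "standard" tokenizer vs. one that keeps hyphenated terms together:
    import re

    query = "How do I update my W-4?"
    print(re.findall(r"\w+", query))     # ['How', 'do', 'I', 'update', 'my', 'W', '4']
    print(re.findall(r"[\w-]+", query))  # keeps 'W-4' intact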
Different domain, but this doesn't shock me at all.
He goes through why we need them instead of raw byte sequences (too expensive) and how the Byte Pair Encoding algorithm works. Worth spending 2h for the deeper understanding if you deal with LLMs.
I’m a developer and don’t struggle with this, where I really struggle is trying to explain this to users.
1. chunk the corpus of data (various strategies but they're all somewhat intuitive)
2. compute embedding for each chunk
3. generate search query/queries
4. compute embedding for each query
5. rank corpus chunks by distance to query (vector search)
6. construct return values (e.g. chunk + surrounding context, or whole doc, etc)
So this article really gets at the importance of a hidden, relatively mundane-feeling operation that can have an outsized impact on the performance of the system. I do wish it had more concrete recommendations in the last section, and a code sample of a robust project with normalization, fine-tuning, and eval.
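For what it's worth, the skeleton of steps 1-6 fits in a few lines (a bare-bones sketch assuming numpy and a hypothetical embed() function standing in for whatever embedding model you use; chunking here is a naive fixed-size split):

    # Hypothetical embed() stands in for your embedding model; chunking is a naive
    # fixed-size split. numpy only.
    import numpy as np

    def embed(text):
        ...  # call your embedding model of choice; return a 1-D vector

    def chunk(doc, size=500):
        return [doc[i:i + size] for i in range(0, len(doc), size)]

    def build_index(docs):
        chunks = [c for d in docs for c in chunk(d)]
        vecs = np.stack([embed(c) for c in chunks])
        vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
        return chunks, vecs

    def search(query, chunks, vecs, k=5):
        q = embed(query)
        q = q / np.linalg.norm(q)
        scores = vecs @ q                     # cosine similarity (vectors are normalized)
        top = np.argsort(-scores)[:k]
        return [(chunks[i], float(scores[i])) for i in top]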
> Chunking is more or less a fixable problem with some clever techniques: these are pretty well documented around the internet;
Curious about what chunking solutions are out there for different sets of data/problems
That said, the chunking people are doing is worse than the SOTA. The core thing you want to do is understand your data well enough to ensure that any question, as best as possible, has relevant data within a single chunk. Details vary (maybe the details are what you're asking for?).
1. do naive chunking like before
2. calculate the embeddings of each chunk
3. clusterize the chunks by their embeddings to see which chunks actually bring new information to the corpus
4. summarize similar chunks into smaller chunks
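Roughly, in code (a sketch assuming a recent scikit-learn; summarize() is a stand-in for however you condense near-duplicate chunks, e.g. an LLM call):

    # Assumes a recent scikit-learn; summarize() is a hypothetical stand-in.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def dedupe_chunks(chunks, vecs, threshold=0.2):
        labels = AgglomerativeClustering(
            n_clusters=None, distance_threshold=threshold,
            metric="cosine", linkage="average",
        ).fit(vecs).labels_
        merged = []
        for label in set(labels):
            members = [chunks[i] for i in np.where(labels == label)[0]]
            merged.append(members[0] if len(members) == 1 else summarize(members))
        return merged

    def summarize(similar_chunks):
        ...  # condense near-duplicate chunks into one smaller chunk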
Sounds like a smart way of using embeddings to reduce the amount of context misses. I'm not sure it works well, though :)