Run a spell check before tokenizing? Maybe even tokenize the misspelled word and the potential corrected word next to each other, like "misspld (misspelled)"? (A sketch of this idea follows below.)
As for brand names the tokenizer doesn't know, I have no idea how to handle them. This problem is probably even worse in less common languages, or in languages that use a lot of compound words.
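A minimal sketch of the "keep the original next to the correction" idea, assuming the third-party pyspellchecker package (the library choice and the helper name are mine, not something established in this thread):

from spellchecker import SpellChecker

spell = SpellChecker()

def annotate_misspellings(text):
    # Keep the original word and append the suggested correction,
    # producing e.g. "misspld (misspelled)"
    out = []
    for word in text.split():
        lower = word.lower()
        if lower in spell.unknown([lower]):
            correction = spell.correction(lower)
            if correction and correction != lower:
                out.append(f"{word} ({correction})")
                continue
        out.append(word)
    return " ".join(out)

print(annotate_misspellings("I hve recieved wrong package"))
# likely something like: "I hve (have) recieved (received) wrong package"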
>> Do not panic! A lot of the large LLM vocabularies are pretty huge (30k-300k tokens large)
Seems small by an order of magnitude (at least). English alone has over a million words.
It is also probably possible now to go for even larger vocabularies, in the 1-2 million range (by factorising the embedding matrix, for example), but this does not lead to noticeable improvements in performance, AFAIK.
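For what it's worth, the factorisation trick just inserts a low-rank bottleneck between the vocabulary and the model dimension (ALBERT did this), so the embedding parameter count scales with vocab_size x r instead of vocab_size x d_model. A rough PyTorch sketch, with made-up sizes:

import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    # V x d_model parameters become V x r + r x d_model,
    # so a ~2M-token vocabulary stays affordable when r << d_model
    def __init__(self, vocab_size, d_model, r):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, r)       # V x r lookup table
        self.proj = nn.Linear(r, d_model, bias=False)  # shared r x d_model up-projection

    def forward(self, token_ids):
        return self.proj(self.embed(token_ids))

emb = FactorizedEmbedding(vocab_size=2_000_000, d_model=4096, r=128)
print(emb(torch.tensor([[1, 42, 7]])).shape)  # torch.Size([1, 3, 4096])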
BPE (in its byte-level variant) basically starts with a token for every possible byte value and then creates new tokens by merging the most frequent adjacent pairs ('t' followed by 'h' becomes a new token 'th').
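A toy version of that merge loop, just to make the mechanics concrete (the corpus and the number of merges here are made up; real tokenizers train on huge corpora and run tens of thousands of merges):

from collections import Counter

corpus = "the thin thug thought"
tokens = [bytes([b]) for b in corpus.encode("utf-8")]  # start from raw bytes

for _ in range(3):
    # Count adjacent pairs and merge the most frequent one into a new token
    pairs = Counter(zip(tokens, tokens[1:]))
    (a, b), _ = pairs.most_common(1)[0]
    merged, out, i = a + b, [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(merged)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    tokens = out
    print(merged, tokens)

The first merge already produces b'th', since 't' followed by 'h' is the most common pair in that corpus.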
Dates in different locales can be written as yyyy-MM-dd, but a string of that shape can also be a catalog/reference number. So it seems right that their embedding similarity is not perfectly aligned.
So it's not a tokenizer problem: the texts simply mean different things, as far as the LLM can tell.
from sentence_transformers import SentenceTransformer, util

# Small general-purpose sentence-embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Both sentences contain the deliberate typo "recieved";
# they differ only in "have" vs "hve"
data_to_check = [
    "I have recieved wrong package",
    "I hve recieved wrong package"
]

embeddings = model.encode(data_to_check)
util.cos_sim(embeddings, embeddings)
Outputs: tensor([[1.0000, 0.9749],
                 [0.9749, 1.0000]])