
Posted by ingve 10/23/2024

Probably pay attention to tokenizers (cybernetist.com)
321 points | 94 comments | page 2
halyax7 10/23/2024|
An issue I've seen in several RAG implementations is assuming that the target documents, however cleverly they're chunked, will be good search keys for incoming queries. Unless your incoming search text looks semantically like the documents you're searching over (not the case in general), you'll get bad hits. On a recent project, we saw a big improvement in retrieval relevance when we separated the search keys from the returned values (chunked documents) and used an LM to generate appropriate keys, which were then embedded. Appropriate in this case means "sentences like what the user might input if they're expecting this chunk back".
marlott 10/23/2024|
Interesting! So you basically got an LM to rephrase the search phrases/keys into the style of the target documents, then used that in the RAG pipeline? Did you do an initial search first to limit the documents?
NitpickLawyer 10/23/2024||
IIUC they're doing some sort of "q/a" pass for each document chunk, where they ask an LLM to "play the user role and ask a question that would be answered by this chunk". They then embed those questions, match live user queries against those questions first, and then maybe re-rank on the retrieved document chunks.
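A minimal sketch of that key/value split, reusing the same sentence-transformers model that shows up in a snippet further down the thread. The example questions are hand-written stand-ins; in the setup described above they would be LLM-generated per chunk, and the real pipelines may differ in detail:

  from sentence_transformers import SentenceTransformer, util

  model = SentenceTransformer('all-MiniLM-L6-v2')

  # The chunked documents are the *values* we actually want to return.
  chunks = [
      "Refunds are issued to the original payment method within 5-7 days.",
      "Damaged or incorrect items can be returned free of charge.",
  ]

  # The *keys* are questions a user might type if they wanted each chunk back.
  # Hand-written here; in practice they come from an LLM prompt per chunk.
  questions = [
      "When do I get my money back after a return?",
      "I got the wrong item, can I send it back?",
  ]
  key_embeddings = model.encode(questions)

  def retrieve(query, top_k=1):
      query_embedding = model.encode([query])
      scores = util.cos_sim(query_embedding, key_embeddings)[0]
      best = scores.argsort(descending=True)[:top_k]
      return [chunks[int(i)] for i in best]   # map matched keys back to chunks

  print(retrieve("my package was wrong, how do I return it?"))
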
andix 10/23/2024||
This is an awesome article, but I’m missing the part where solutions for each of the problems were discussed.

Run a spell check before tokenizing? Maybe even tokenize the misspelled word and the potential correction next to each other, like „misspld (misspelled)“? (A rough sketch of this follows below.)

For the issue with the brand names the tokenizer doesn’t know, I have no idea how to handle it. This problem is probably even worse in less common languages, or in languages which use a lot of compound words.
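
A rough sketch of the spell-check-before-embedding idea from the comment above, assuming the pyspellchecker package (the package choice and the "keep both forms" normalisation are assumptions; correction quality matters a lot in practice):

  from spellchecker import SpellChecker   # pip install pyspellchecker

  spell = SpellChecker()

  def normalize(text):
      fixed = []
      for word in text.split():
          if spell.unknown([word.lower()]):
              suggestion = spell.correction(word.lower())
              if suggestion and suggestion != word.lower():
                  # Keep both forms next to each other, e.g. "misspld (misspelled)"
                  fixed.append(f"{word} ({suggestion})")
                  continue
          fixed.append(word)
      return " ".join(fixed)

  print(normalize("I hve recieved wrong pckage"))
  # e.g. "I hve (have) recieved (received) wrong pckage (package)"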

quirkot 10/23/2024||
Is this true?

>> Do not panic! A lot of the large LLM vocabularies are pretty huge (30k-300k tokens large)

Seems small by an order of magnitude (at least). English alone has 1+ million words.

macleginn 10/23/2024||
Most of these 1+ million words are almost never used, so 200k is plenty for English. Optimistically, we hope that rarer words would be longer and to some degree compositional (optim-ism, optim-istic, etc.), but unfortunately this is not what tokenisers arrive at (and you are more likely to get "opt-i-mis-m" or something like that). People have tried to optimise tokenisation and the main part of LLM training jointly, which leads to more sensible results, but this is unworkable for larger models, so we are stuck with inflated basic vocabularies.

It is also probably possible now to go for even larger vocabularies, in the 1-2 million range (by factorising the embedding matrix, for example), but this does not lead to noticeable improvements in performance, AFAIK.
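
For the curious, it is easy to check what a real vocabulary does with a given word. A quick sketch with tiktoken (the encoding name is an assumption; the exact splits depend on the vocabulary):

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")
  print(enc.n_vocab)   # vocabulary size, roughly 100k for this encoding

  for word in ["optimism", "optimistic", "antidisestablishmentarianism"]:
      ids = enc.encode(word)
      pieces = [enc.decode([i]) for i in ids]
      print(word, "->", pieces)
  # Common words tend to be a single token; rarer ones split into
  # sub-word pieces that don't necessarily line up with morphemes.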

Der_Einzige 10/23/2024||
Performance would be massively improved on constrained text tasks. That alone makes it worth it to expand the vocabulary size.
mmoskal 10/23/2024|||
Tokens are often sub-word, all the way down to bytes (which are implicitly understood as UTF8 but models will sometimes generate invalid UTF8...).
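The byte-level fallback is easy to see directly. A small sketch, again assuming tiktoken and the cl100k_base encoding (other vocabularies split differently):

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")
  ids = enc.encode("🤖")
  for i in ids:
      raw = enc.decode_single_token_bytes(i)
      print(i, raw)   # individual tokens can be partial UTF-8 byte sequences

  # Decoding only a prefix of these tokens can yield bytes that are not valid
  # UTF-8, which is exactly what a model emitting one token at a time can produce.
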
spott 10/24/2024|||
BPE is complete. Every valid Unicode string can be encoded with any BPE tokenizer.

BPE basically starts with a token for every possible byte value and then creates new tokens by merging the most common adjacent pairs ('t' followed by 'h' becomes a new token 'th').
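
A toy version of that merge loop, just to show the mechanics (real tokenizer training runs over a large corpus and keeps a lot more bookkeeping):

  from collections import Counter

  def toy_bpe(text, num_merges=5):
      # Byte-level start: all 256 byte values are already valid tokens,
      # so any string can be encoded before a single merge happens.
      tokens = list(text.encode("utf-8"))
      merges = {}
      next_id = 256
      for _ in range(num_merges):
          pairs = Counter(zip(tokens, tokens[1:]))
          if not pairs:
              break
          (a, b), count = pairs.most_common(1)[0]
          if count < 2:
              break
          merges[(a, b)] = next_id
          # Replace every occurrence of the most common pair with the new token.
          merged, i = [], 0
          while i < len(tokens):
              if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                  merged.append(next_id)
                  i += 2
              else:
                  merged.append(tokens[i])
                  i += 1
          tokens = merged
          next_id += 1
      return tokens, merges

  tokens, merges = toy_bpe("the theory of the thing")
  print(merges)   # first merge is (116, 104), i.e. b"th" -> token 256
  print(tokens)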

maytc 10/24/2024||
The difference in the dates example seems right to me: 20 October 2024 and 2024-20-10 are not the same.

Dates in different locales can be written in different orders (yyyy-MM-dd, yyyy-dd-MM, ...), and a string like 2024-20-10 can just as well be a catalog/reference number. So it seems right that their embedding similarity is not perfectly aligned.

So it's not a tokenizer problem: the texts mean different things to the LLM.
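
A quick sanity check that the two strings from the example aren't interchangeable (this assumes an English locale for the month name):

  from datetime import datetime

  # "20 October 2024" is a well-formed date...
  print(datetime.strptime("20 October 2024", "%d %B %Y").date())   # 2024-10-20

  # ...but "2024-20-10" doesn't parse as yyyy-MM-dd at all (there is no month 20),
  # so it could just as easily be a reference number or a yyyy-dd-MM date.
  try:
      datetime.strptime("2024-20-10", "%Y-%m-%d")
  except ValueError as err:
      print(err)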

woolr 10/23/2024||
Can't repro some of the numbers in this blog post, for example:

  from sentence_transformers import SentenceTransformer, util

  model = SentenceTransformer('all-MiniLM-L6-v2')

  data_to_check = [
    "I have recieved wrong package",
    "I hve recieved wrong package"
  ]
  embeddings = model.encode(data_to_check)
  # pairwise cosine similarity between the two sentences
  util.cos_sim(embeddings, embeddings)
Outputs:

  tensor([[1.0000, 0.9749],
        [0.9749, 1.0000]])
1986 10/23/2024|
Your data differs from theirs: they have "I have received wrong package" vs. "I hve received wrong pckage". You misspelled "received" in both and didn't omit the "a" from "package" in the "bad" data.
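For reference, the same check with the strings as quoted above; whether it reproduces the blog's exact figure will still depend on the model and library versions:

  from sentence_transformers import SentenceTransformer, util

  model = SentenceTransformer('all-MiniLM-L6-v2')

  data_to_check = [
    "I have received wrong package",
    "I hve received wrong pckage"
  ]
  embeddings = model.encode(data_to_check)
  print(util.cos_sim(embeddings, embeddings))
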
gavin_gee 10/24/2024||
Do pictograms represent a way to reduce tokens?