Posted by ingve 4 days ago
LLMs are notoriously bad at counting letters in words or performing simple oulipos of letter omission. GPT-4o, for example, writes a small Python program and executes it in order to count letter instances. We all know that tokenization effectively erases knowledge about letters in prompts and directly and negatively impacts performance at these tasks, yet we haven't found a way to solve it.
In other words, if you train a model using word2vec's preprocessing and GloVe's algorithm, the result looks more like a "standard-issue" word2vec model than a "standard-issue" GloVe model.
However, Word2Vec and GloVe were fundamentally different; when used as designed, GloVe worked better pretty uniformly.
Since they're part of the pre-processing pipeline, you can't quickly test them out for effectiveness. You have to restart a pretraining run to test downstream effectiveness.
Separately, as much as an attention module can do universal nonlinear transformations... I wonder if it makes sense to add specific modules for some math primitives as well. I remember that the executor paper [1] (a slight precursor to the "Attention Is All You Need" paper) created self-contained modules for operations like less-than, count, and sum, and then explicitly orchestrated them in the decoder.
I'm surprised we haven't seen such solutions produce SOTA results from the math-AI or code-AI research communities.
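Something like the sketch below is what I have in mind (purely illustrative PyTorch, not the executor paper's actual modules; the soft comparisons are my own stand-ins):

    # Illustrative only: tiny differentiable stand-ins for "count", "sum", and "less than"
    # that a controller/decoder could learn to orchestrate, as in the executor paper.
    import torch
    import torch.nn as nn

    class CountOp(nn.Module):
        def forward(self, values, query):
            # values: (batch, seq), query: (batch, 1)
            match = torch.exp(-(values - query) ** 2)          # soft equality, 1.0 when equal
            return match.sum(dim=-1, keepdim=True)             # differentiable count

    class SumOp(nn.Module):
        def forward(self, values, mask):
            return (values * mask).sum(dim=-1, keepdim=True)   # masked sum

    class LessThanOp(nn.Module):
        def forward(self, a, b):
            return torch.sigmoid(b - a)                        # soft "a < b", > 0.5 when true

    # A decoder would predict a distribution over these ops at each step and mix their
    # outputs, which is roughly the explicit orchestration described in the paper.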
Besides that, for alphabetic languages, there exists almost no relation between form and meaning. E.g., “ring” and “wing” differ by one letter but have no real common meaning. By picking the character or byte as your choice of representation, the model basically has to learn to distinguish ring and wing in context. This is a lot of work!
So, while working on the character or byte level saves you some embeddings and thus makes your model smaller, it puts all of the work of distinguishing similar sequences with divergent meanings on the model itself, which means you need a larger model.
By having subwords, a part of this distinguishing work already has been done by the vocabulary itself. As the article points out, this sometimes fails.
Also true for abugida-based languages, e.g. சரம் (saram = string) vs மரம் (maram = tree), and many more. I think your intention with specifying "alphabetic languages" was to say "non-logographic languages", right?
And even in Chinese it's a fairly weak relationship. A large portion of the meanings of individual characters come from sound loan. For example the 英 in 英雄 means "hero", in 英语 means "England", and in 精英 means "flower". The relationship there is simple homophony.
On the other hand, one thing you do get with written Chinese is that "1 character = 1 morpheme" very nearly works. So mechanistically breaking a text into a sequence of morphemes can be done pretty reliably without the aid of a semantic model or exhaustive hard-coded mapping. I think that for many other languages you can't even get close using only syntactic analysis.
Written Japanese is much more ideographic than written Chinese. Japanese spelling is determined, such as it is, by semantics. Chinese spelling is determined by sound. Thus, 女的, 娘们, and 妮子, all meaning 'girl' or 'woman', have no spelling in common because they are different words, while Japanese uses 女 for "jo" and "onna" despite a total lack of any relationship between those words.
In either case, isn't this something we already do well?
And no, at least for the languages with which I'm familiar, SOTA tokenizers tend to capture only the easy cases.
For example, the GPT-4 tokenizer breaks the first sentence of your post like so:
What/ do/ you/ mean/ by/ non/-m/orp/heme/ lexical/ units/?
Notice how "morpheme" gets broken into three tokens, and none of them matches "morpheme"'s two morphemes. "Lexical" and "units" are each a single token, when they have three and two morphemes respectively. Or in French, the word "cafetière" gets chopped willy-nilly into "c/afet/ière". The canonical breakdown is "cafe/t/ière".
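If you want to poke at this yourself, tiktoken makes it easy to inspect the splits (assuming the cl100k_base encoding; the exact breakdown also depends on leading whitespace and casing, so it may not match the above character-for-character):

    # Inspect tokenizer splits (assumes the tiktoken package, cl100k_base encoding)
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for word in ["morpheme", "lexical", "units", "cafetière"]:
        pieces = [enc.decode_single_token_bytes(t) for t in enc.encode(word)]
        print(word, "->", pieces)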
The problem with pure numerical values representing a class as input to a neural network layer is that the byte-encoding number is going to be very hard for the transformer to memorize as an exact value, especially when numbers relatively close to each other often do not share much meaning. Categories are usually encoded somehow, like a one-hot embedding layer or, more recently, a learned embedding, so that the different categories can be easily distinguished (different categories are close to orthogonal).
My prediction would be that using the numerical value directly would not work at all, and that using learnable embeddings would work, but you would have to reserve that part of the token embedding for each token, which would hurt performance a lot on non-spelling tasks relative to just letting the whole embedding represent the token however the model sees fit.
But, IDK! It would be easy enough to try! You should, on a small toy model. And then try a small learnable re-usable character embedding. And write a blog post. Would be happy to coach / offer some GPU time / answer any other questions you have while building it.
And then use something like HellaSwag to measure how much you've lost on general text completion compared to a vanilla LLM with the same embedding size all dedicated to just the token.
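A rough sketch of what I mean by a re-usable character embedding reserved alongside the token embedding, in PyTorch (all names and dimensions here are made up for illustration):

    # Made-up sketch: reserve a slice of each token's embedding for a small, shared
    # character/byte embedding; the rest stays free for the model to use as it likes.
    import torch
    import torch.nn as nn

    class CharAwareEmbedding(nn.Module):
        def __init__(self, vocab_size, d_model, d_char=64, n_bytes=256):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, d_model - d_char)  # free part
            self.char = nn.Embedding(n_bytes, d_char)              # shared spelling part

        def forward(self, token_ids, token_bytes):
            # token_ids: (batch, seq); token_bytes: (batch, seq, max_chars), 0-padded
            free = self.tok(token_ids)
            spelling = self.char(token_bytes).mean(dim=-2)  # crude: padding is averaged in
            return torch.cat([free, spelling], dim=-1)      # (batch, seq, d_model)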
For example “parent” and “parents” are aligned, they share letters in the same position, but “skew” and “askew” share no letters in the same position.
You can hypothetically try to ameliorate this by other means, but if you just naively drop from tokenization to character- or byte-level models, this is what goes wrong.
I feel this way about embeddings
This line of thought seems related to the old wisdom of finding innovative solutions by mucking around in the layer below whatever the "tools of the trade" are for your domain
If it were so simple, why hasn’t this already been dealt with?
Multimodal VQA models also have had a hard time generalizing counting. Counting is not as simple as changing the tokenizer.
When you ask an LLM to count the number of "r" in the word Strawberry, the LLM will output a random number. If you ask it to separate the letters into S t r a w b e r r y, then each letter is tokenized independently and the attention mechanism is capable of performing the task.
What you are doing is essentially denying that the problem exists.
"How many letters "r" are in the word Frurirpoprar"
And it didn't use program execution (at least it didn't show the icon, and the answer was very fast, so it's unlikely it generated and executed a program to count).
You interpret the token sequence by constructing a parse tree, but that doesn't require you to forget that the tokens exist.
The point is, you have a choice. You can do the tokenization however you like. The reason 23 is interesting is that there is a case to be made that a model will be more likely to understand that 23 is related to Jordan if it's one token; if it's two tokens, it's more difficult. The opposite is true for math problems.
The reality is whatever we want to make it. It's likely that current schemes are... suboptimal. In practice it would be great if every token were geometrically well spaced after embedding and preserved semantic information, among other things. The "other things" have taken precedence thus far.
?
IMHO that's the main reason people turn to any sort of automated data-processing tools in the first place: they don't want to look at the input data. They'd rather have "the computer" look at it and maybe query them back with some additional info gathering requests. But thinking on their own? Ugh.
So I boldly propose the new definition of AGI: it's the data-processing entity that will (at last!) reliably liberate you from having to look at your data before you start shoving this data into that processing entity.
And this is an aside, but I see folks using LLMs to do this correction in the first place. I don't think using LLMs to do correction in a multi-pass system is inherently bad but I haven't been able to get good results out of "call/response" (i.e. a prompt to clean up this text). The best results are when you're running an LLM locally and cleaning incrementally by using token probabilities to help guide you. You get some candidate words from your wordlist based on the fuzzy match of the text you do have, and candidate words predicted from the previous text and when both align -- ding! It's (obviously) not the fastest method however.
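Very roughly, the shape of it looks like this (a toy sketch assuming Hugging Face transformers with gpt2 as the local model and difflib for the fuzzy wordlist match; my actual setup differs in the details):

    # Toy sketch: fuzzy wordlist candidates re-ranked by a local LM's token probabilities.
    import difflib
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def lm_score(prefix, candidate):
        # Log-probability the LM assigns to `candidate` following `prefix`.
        ids = tok(prefix + " " + candidate, return_tensors="pt").input_ids
        n_prefix = tok(prefix, return_tensors="pt").input_ids.shape[1]
        with torch.no_grad():
            logits = lm(ids).logits
        logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
        targets = ids[0, 1:]
        per_token = logprobs[torch.arange(len(targets)), targets]
        return per_token[n_prefix - 1:].sum().item()

    def correct_word(prefix, noisy, wordlist):
        # Candidates from the wordlist by fuzzy match, then pick the one the LM likes best.
        candidates = difflib.get_close_matches(noisy, wordlist, n=5, cutoff=0.6) or [noisy]
        return max(candidates, key=lambda w: lm_score(prefix, w))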
We were discussing this earlier this week -- I'm helping with a RAG-like application for a project right now, and we're concerned with how much small typos or formatting differences in users' queries can throw off our embedding distances.
One thought was: Should we be augmenting our training data (or at the very least, our pretraining data) with intentional typos / substitutions / capitalizations, just to help it learn that "wrk" and "work" are probably synonyms? I looked briefly around for typo augmentation for (pre)training, and didn't see anything at first blush, so I'm guessing that if this is a common practice, that it's called something else.
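Rolling our own augmenter seems cheap enough, though. Something like this toy version (the noise types and rates are arbitrary choices, not an established recipe) is roughly what I had in mind:

    # Arbitrary noise types/rates, just to illustrate the idea of typo augmentation.
    import random

    def add_typos(text, rate=0.05):
        out = []
        for ch in text:
            r = random.random()
            if r < rate / 3:
                continue                   # drop a character: "work" -> "wrk"
            elif r < 2 * rate / 3:
                out.append(ch.swapcase())  # random capitalization change
            elif r < rate:
                out.append(ch + ch)        # doubled character
            else:
                out.append(ch)
        return "".join(out)

    print(add_typos("how do I submit my work order"))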
Stemming: Reducing words to their base or root form (e.g., “working,” “worked” becoming “work”).
Lemmatization: Similar to stemming, but more sophisticated, accounting for context (e.g., “better” lemmatizes to “good”).
Token normalization: Standardizing tokens, such as converting “wrk” to “work” through predefined rules (case folding, character replacement).
Fuzzy matching: Allowing approximate matches based on edit distance (e.g., “wrk” matches “work” due to minimal character difference).
Phonetic matching: Matching words that sound similar, sometimes used to match abbreviations or common misspellings.
Thesaurus-based search: Using a predefined list of synonyms or alternative spellings to expand search queries.
Most of these have open and free lists you can use; check the sources of Manticore Search, for example.
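A few of them are a couple of lines with standard tooling (assuming NLTK with the wordnet data downloaded, plus the standard library; illustrative only, not how Manticore implements them):

    # Assumes NLTK with the wordnet data (nltk.download("wordnet")); fuzzy matching here
    # is just the standard library.
    import difflib
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    print([PorterStemmer().stem(w) for w in ["working", "worked", "works"]])  # ['work', 'work', 'work']
    print(WordNetLemmatizer().lemmatize("better", pos="a"))                   # 'good'
    print(difflib.get_close_matches("wrk", ["work", "walk", "week"], n=1))    # ['work']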
I don't understand. How is that different from stemming? What's the base form of "better" if not "good"? The nature of the relationship between "better" and "good" is no different from that between "work" and "worked".
This is because most words in most languages follow patterns of affixes/prefixes (e.g. worse/worst, harder/hardest), but not always (good/better/best)
The problem was that word/term-frequency-based modelling would inappropriately fail to link terms that actually had the same root (stam or stem).
Stemming removed those affixes so it turned "worse and worst" into "wor and wor" and "harder/hardest" into "hard", etc.
However it failed for cases like good/better.
Lemmatizing took in a larger context and built up databases of word senses, linking such cases so that words are processed more reliably. So lemmatizing is rule-based, plus more.
Fundamentally, the rule of lemmatizing is that you encounter a word, you look it up in a table, and your output is whatever the table says. There are no other rules. Thus, the lemma of seraphim is seraph and the lemma of interim is interim. (I'm also puzzled by your invocation of "context", since this is an entirely context-free process.)
There has never been any period in linguistic analysis or its ancestor, philology, in which this wasn't done. The only reason to do it on a computer is that you don't have a digital representation of the mapping from token to lemma. But it's not an approach to language processing, it's an approach to lack of resources.
Context today may mean more (e.g. the whole sentence, or string, or the prompt context for an LLM), and obviously context has a meaning in computational linguistics (e.g. "context-free grammar"), but the point here is that stemmers arbitrarily follow the same process without a second stage. If a stemmer encounters "best" and "good", it by definition does not have a stage that would use the same lemma for them. Context is just one of those overloaded terms, unfortunately.
Lemmatizing, in terms of how it works in simple scenarios (let's imagine reviews), helps to lump those words together and correctly identify the proportion of term frequencies for words we might be interested in, more consistently than stemming can. It's still limited by using word breaks like spaces or punctuation, of course.
Typos should minimally impact your RAG.
In general, fine-tuned models often fail to generalize well on inputs that aren’t very close to examples in the fine-tuning data set.
You can use SetFit, fewer examples, an SVM, etc., depending on how much separation, recall, and other aspects matter to you for the task at hand.
Sensitivity to biasing toward the dataset is a choice of training method, not an inherent attribute.
It's just not really a major issue unless you finetune with an entirely new or unseen language in the present day.
We are doing a fair bit of task-specific fine-tuning for an asymmetric embeddings model (connecting user-entered descriptions of symptoms with the service solutions that resolved their issues).
I would like to run more experiments with this and see if introducing typos into the user-entered descriptions will help it not forget as much.
Thank you again!
This might also work for indexing your data, but has the potential to get really expensive quickly.
It was fascinating how much tokenization strategies could affect a particular subset of queries. A really great example is "W-4" or "W4". Standard tokenization might split on the "-" or split on letter/number boundaries. That input now becomes completely unidentifiable in the index, when it otherwise would have been a very rich factor in matching HR / salary / tax related content.
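A tiny illustration of the failure mode with a naive regex tokenizer (real analyzers such as Lucene's standard one behave similarly on hyphens):

    # Naive regex "standard" tokenizer vs. one that keeps hyphenated terms together:
    import re

    query = "How do I update my W-4?"
    print(re.findall(r"\w+", query))     # ['How', 'do', 'I', 'update', 'my', 'W', '4']
    print(re.findall(r"[\w-]+", query))  # keeps 'W-4' intact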
Different domain, but this doesn't shock me at all.
He goes through why we need them instead of raw byte sequences (too expensive) and how the Byte Pair Encoding algorithm works. Worth spending 2h for the deeper understanding if you deal with LLMs.
I’m a developer and don’t struggle with this, where I really struggle is trying to explain this to users.
1. chunk the corpus of data (various strategies but they're all somewhat intuitive)
2. compute embedding for each chunk
3. generate search query/queries
4. compute embedding for each query
5. rank corpus chunks by distance to query (vector search)
6. construct return values (e.g. chunk + surrounding context, or whole doc, etc)
So this article really gets at the importance of a hidden, relatively mundane-feeling operation that can have an outsized impact on the performance of the system. I do wish it had more concrete recommendations in the last section, and a code sample of a robust project with normalization, fine-tuning, and eval.
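For what it's worth, the skeleton of steps 1-6 fits in a few lines (a bare-bones sketch assuming numpy and a hypothetical embed() function standing in for whatever embedding model you use; chunking here is a naive fixed-size split):

    # Hypothetical embed() stands in for your embedding model; chunking is a naive
    # fixed-size split. numpy only.
    import numpy as np

    def embed(text):
        ...  # call your embedding model of choice; return a 1-D vector

    def chunk(doc, size=500):
        return [doc[i:i + size] for i in range(0, len(doc), size)]

    def build_index(docs):
        chunks = [c for d in docs for c in chunk(d)]
        vecs = np.stack([embed(c) for c in chunks])
        vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
        return chunks, vecs

    def search(query, chunks, vecs, k=5):
        q = embed(query)
        q = q / np.linalg.norm(q)
        scores = vecs @ q                     # cosine similarity (vectors are normalized)
        top = np.argsort(-scores)[:k]
        return [(chunks[i], float(scores[i])) for i in top]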
> Chunking is more or less a fixable problem with some clever techniques: these are pretty well documented around the internet;
Curious about what chunking solutions are out there for different sets of data/problems
That said, the chunking people are doing is worse than the SOTA. The core thing you want to do is understand your data well enough to ensure that any question, as best as possible, has relevant data within a single chunk. Details vary (maybe the details are what you're asking for?).
1. do naive chunking like before
2. calculate the embeddings of each chunk
3. clusterize the chunks by their embeddings to see which chunks actually bring new information to the corpus
4. summarize similar chunks into smaller chunks
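Roughly, in code (a sketch assuming a recent scikit-learn; summarize() is a stand-in for however you condense near-duplicate chunks, e.g. an LLM call):

    # Assumes a recent scikit-learn; summarize() is a hypothetical stand-in.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def dedupe_chunks(chunks, vecs, threshold=0.2):
        labels = AgglomerativeClustering(
            n_clusters=None, distance_threshold=threshold,
            metric="cosine", linkage="average",
        ).fit(vecs).labels_
        merged = []
        for label in set(labels):
            members = [chunks[i] for i in np.where(labels == label)[0]]
            merged.append(members[0] if len(members) == 1 else summarize(members))
        return merged

    def summarize(similar_chunks):
        ...  # condense near-duplicate chunks into one smaller chunk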
Sounds like a smart way of using embeddings to reduce the amount of context misses. I'm not sure it works well, though :)