Posted by matthewolfe 22 hours ago

Show HN: TokenDagger – A tokenizer faster than OpenAI's Tiktoken (github.com)
TokenDagger is a drop-in replacement for OpenAI’s Tiktoken (the tokenizer behind Llama 3, Mistral, GPT-3.*, etc.). It’s written in C++ 17 with thin Python bindings, keeps the exact same BPE vocab/special-token rules, and focuses on raw speed.
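
For illustration, a minimal sketch of what "drop-in" means in practice, assuming TokenDagger exposes a tiktoken-compatible Encoding interface (the tokendagger import path below is an assumption, not taken from the repo):

    import tiktoken
    # import tokendagger  # assumed package name; swap in for tiktoken

    # Existing call sites stay unchanged; only the construction of `enc` differs.
    enc = tiktoken.get_encoding("cl100k_base")
    # enc = tokendagger.get_encoding("cl100k_base")  # hypothetical drop-in

    ids = enc.encode("Hello, world!", disallowed_special=())
    print(ids)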

I’m teaching myself LLM internals by re-implementing the stack from first principles. Profiling Tiktoken’s Python/Rust implementation showed a lot of time was spent doing regex matching. Most of my perf gains come from (a) using a faster JIT-compiled regex engine, and (b) simplifying the algorithm to forgo regex matching of special tokens entirely.

Benchmarking code is included. Notable results:

- 4x faster tokenization of code samples on a single thread.
- 2-3x higher throughput on a 1 GB natural-language text file.
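
A rough sketch of how such a single-thread throughput comparison could be run (the tokendagger side is commented out because its exact API is an assumption; the tiktoken side uses the real library):

    import time
    import tiktoken

    def throughput_mb_s(enc, text, iters=3):
        """Best-of-N MB/s for enc.encode over `text`."""
        size_mb = len(text.encode("utf-8")) / 1e6
        best = float("inf")
        for _ in range(iters):
            t0 = time.perf_counter()
            enc.encode(text, disallowed_special=())
            best = min(best, time.perf_counter() - t0)
        return size_mb / best

    text = open("sample.txt", encoding="utf-8").read()  # e.g. the 1 GB corpus
    print("tiktoken:", throughput_mb_s(tiktoken.get_encoding("cl100k_base"), text))
    # print("dagger :", throughput_mb_s(tokendagger.get_encoding("cl100k_base"), text))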

256 points | 71 comments
pamelafox 19 hours ago|
Just curious whether it's possible to push any of your performance improvements to tiktoken itself?
matthewolfe 19 hours ago|
I probably will. I was hesitant initially, because adding PCRE2 as a dependency might cause issues for existing projects. I believe this was discussed briefly in a closed PR with other performance improvements.
isjustintime 13 hours ago||
Very cool. We use Tiktoken and I'd love to see the performance impact. Pretty great decision to make it drop-in compatible.
b0a04gl 19 hours ago||
If Dagger builds a byte-level DFA for special tokens and resolves overlaps via longest match, how does it handle inputs with partial matches at chunk boundaries? Say a stream ends mid-token, like <|endo: does it buffer forward or require lookahead?
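(For readers unfamiliar with the boundary problem, here is a generic way a streaming matcher can handle it; this is a sketch, not TokenDagger's actual code: hold back any chunk suffix that is still a prefix of some special token and carry it into the next chunk.)

    # Generic sketch: avoid splitting a special token like "<|endoftext|>"
    # across chunk boundaries by buffering any suffix of the current chunk
    # that could still grow into a special token.
    SPECIALS = {"<|endoftext|>", "<|fim_prefix|>"}

    def split_safe(chunk: str) -> tuple[str, str]:
        """Return (emit_now, carry), where `carry` may complete a special token."""
        for i in range(len(chunk)):
            suffix = chunk[i:]
            if any(s.startswith(suffix) for s in SPECIALS):
                return chunk[:i], suffix
        return chunk, ""

    emit, carry = split_safe("some text <|endo")
    # emit == "some text ", carry == "<|endo"  -> buffered until more bytes arrive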
semiinfinitely 12 hours ago||
I'm relieved to see that it's not written in Rust.
matthewolfe 12 hours ago|
haha, I thought about it.
konsalexee 21 hours ago||
> simplifying the algorithm to forgo regex matching of special tokens entirely

Does that mean there could be cases where the tokenization quality is worse?

matthewolfe 21 hours ago|
The output should be identical, assuming no bugs.

The Tiktoken implementation takes a collection of all special tokens upon initialization and compiles them into a regex by joining them with `|` [0]. Then the actual encoding process checks for matches on this expression.

Models like Llama 4 define a list of 1,135 special tokens. Notably, 1,115 of those are "reserved" special tokens! So this yields a huge regexp of special tokens that shouldn't be considered at all.

TokenDagger does not do this. Instead, simple string matching is used. This works because we don't need to consider the entire special vocabulary every time. The caller of `encode` must explicitly define which special tokens should be considered [1]. So it's faster to check against the much smaller list we _know_ is being used.

[0] https://github.com/openai/tiktoken/blob/main/src/lib.rs#L476

[1] https://github.com/openai/tiktoken/blob/main/tiktoken/core.p...
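A rough Python paraphrase of the two strategies (simplified; not the actual tiktoken or TokenDagger code):

    import re

    special_vocab = ["<|endoftext|>", "<|fim_prefix|>"] + [
        f"<|reserved_{i}|>" for i in range(1115)  # reserved tokens, Llama-4 style
    ]

    # Tiktoken-style: one big alternation over the *entire* special vocabulary,
    # scanned on every encode call (paraphrase of [0]).
    special_re = re.compile("|".join(re.escape(t) for t in special_vocab))
    hits = [m.group() for m in special_re.finditer("foo <|endoftext|> bar")]

    # TokenDagger-style: only the specials the caller explicitly allowed are
    # checked, with plain substring search instead of a regex (paraphrase of [1]).
    def find_specials(text: str, allowed: set[str]) -> list[tuple[int, str]]:
        found = []
        for tok in allowed:          # small set supplied by the caller of encode
            start = 0
            while (idx := text.find(tok, start)) != -1:
                found.append((idx, tok))
                start = idx + len(tok)
        return sorted(found)

    hits2 = find_specials("foo <|endoftext|> bar", {"<|endoftext|>"})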

matrix2596 18 hours ago||
Is it possible for your tokenizer to ever give a different tokenization than the OpenAI tokenizer? I'm asking because there are multiple ways to tokenize the same string. Sorry if I'm mistaken.
matthewolfe 18 hours ago|
Should be the same. Both use Byte-Pair Encoding (BPE) as the underlying algorithm.
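To see why, here is a toy BPE encode loop: with a fixed, shared rank table, the lowest-rank merge at each step is fully determined, so two correct implementations cannot diverge (the ranks below are made up for illustration):

    # Toy BPE encode over bytes: repeatedly merge the adjacent pair with the
    # lowest rank. Identical rank tables give identical, deterministic output.
    RANKS = {(b"h", b"e"): 0, (b"l", b"l"): 1, (b"he", b"ll"): 2}  # made-up ranks

    def bpe(word: bytes) -> list[bytes]:
        parts = [bytes([b]) for b in word]
        while True:
            pairs = [(RANKS[(parts[i], parts[i + 1])], i)
                     for i in range(len(parts) - 1)
                     if (parts[i], parts[i + 1]) in RANKS]
            if not pairs:
                return parts
            _, i = min(pairs)                      # lowest-rank pair wins
            parts[i:i + 2] = [parts[i] + parts[i + 1]]

    print(bpe(b"hello"))  # [b'hell', b'o']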
polynomial 19 hours ago||
Just to note that Tiktoken is still the tokenizer behind the GPT-4.x series; it just uses a different token model. (The post only says GPT-3.*, implying something else was used for subsequent iterations.)
EGreg 20 hours ago||
What about pairing this with BigBird and Mamba?
manishsharan 20 hours ago||
Is there a tokenizer someone can recommend for code? I have tried CodeBERT, but maybe I am using it wrong, as my results with it were pretty bad.
sheerun 13 hours ago|
Now that byte-patch-level embeddings are discovered?