
Posted by matthewolfe 6/30/2025

Show HN: TokenDagger – A tokenizer faster than OpenAI's Tiktoken (github.com)
TokenDagger is a drop-in replacement for OpenAI’s Tiktoken (the tokenizer behind Llama 3, Mistral, GPT-3.*, etc.). It’s written in C++17 with thin Python bindings, keeps the exact same BPE vocab/special-token rules, and focuses on raw speed.

I’m teaching myself LLM internals by re-implementing the stack from first principles. Profiling Tiktoken’s Python/Rust implementation showed a lot of time was spent doing regex matching. Most of my perf gains come from a) using a faster JIT-compiled regex engine; and b) simplifying the algorithm to forego regex matching special tokens at all.

Benchmarking code is included. Notable results:

- 4x faster tokenization of code samples on a single thread.
- 2-3x higher throughput on a 1GB natural-language text file.
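A rough sketch of what the drop-in swap looks like. The tiktoken calls are real; the tokendagger lines are illustrative placeholders only, since the exact module name and import path may differ — check the README:

    # Illustrative only: assumes TokenDagger mirrors tiktoken's interface.
    import tiktoken
    # import tokendagger  # hypothetical module name; see the repo README

    text = "The quick brown fox jumps over the lazy dog."

    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)

    # A drop-in replacement should produce identical token IDs:
    # dagger_enc = tokendagger.get_encoding("cl100k_base")
    # assert dagger_enc.encode(text) == tokens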

281 points | 73 comments
konsalexee 6/30/2025|
> simplifying the algorithm to forego regex matching special tokens at all

Does that mean there could be cases with less quality in terms of tokenization?

matthewolfe 6/30/2025|
The output should be identical, assuming no bugs.

The Tiktoken implementation takes a collection of all special tokens upon initialization and compiles them into a regex by joining them with `|` [0]. Then the actual encoding process checks for matches on this expression.

Models like Llama 4 define a list of 1,135 special tokens. Notably, 1,115 of those are "reserved" special tokens! So this yields a huge regexp of special tokens that shouldn't be considered at all.

TokenDagger does not do this. Instead, simple string matching is used. This works because we don't need to consider the entire special vocabulary every time. The caller of `encode` must explicitly define which special tokens should be considered [1]. So it's faster to check against the much smaller list we _know_ is being used.
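Very roughly, the difference is something like this (a simplified sketch, not the actual code in either library):

    import re

    SPECIAL_TOKENS = {"<|begin_of_text|>": 128000, "<|end_of_text|>": 128001}
    # ... plus ~1,100 "reserved" tokens in a model like Llama 4

    # Tiktoken-style: one big alternation over *every* special token,
    # scanned during each encode call.
    special_regex = re.compile("|".join(re.escape(t) for t in SPECIAL_TOKENS))

    def find_specials_regex(text):
        return [(m.start(), m.group()) for m in special_regex.finditer(text)]

    # TokenDagger-style: plain substring search, but only over the tokens
    # the caller actually allowed for this encode call.
    def find_specials_scan(text, allowed_special):
        hits = []
        for tok in allowed_special:  # usually a handful, not ~1,135
            start = 0
            while (i := text.find(tok, start)) != -1:
                hits.append((i, tok))
                start = i + len(tok)
        return sorted(hits)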

[0] https://github.com/openai/tiktoken/blob/main/src/lib.rs#L476

[1] https://github.com/openai/tiktoken/blob/main/tiktoken/core.p...

anonymoushn 7/14/2025||
Isn't this incorrect? If the user doesn't specify what to do with almost all of the special tokens, you still must detect them so you can raise an error.
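For example, with tiktoken as it stands (if I remember the defaults correctly, a special token appearing in plain text raises unless the caller opts in):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    # By default all special tokens are disallowed, so this raises:
    try:
        enc.encode("hello <|endoftext|>")
    except ValueError as e:
        print("raised:", e)

    # Encoding succeeds only when the caller opts in, or opts out of the check:
    enc.encode("hello <|endoftext|>", allowed_special={"<|endoftext|>"})
    enc.encode("hello <|endoftext|>", disallowed_special=())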
pamelafox 6/30/2025||
Just curious whether it's possible to push any of your performance improvements to tiktoken itself?
matthewolfe 6/30/2025|
I probably will. I was hesitant initially because adding PCRE2 as a dependency might cause issues for existing projects. I believe this was discussed briefly in a closed PR with other performance improvements.
polynomial 6/30/2025||
Just to note that Tiktoken is still the tokenizer behind the GPT-4x series, it just uses a different token model. (Post only says GPT-3, implying they were using something else for subsequent iterations.)
manishsharan 6/30/2025||
Is there a tokenizer someone can recommend for code? I have tried CodeBERT, but maybe I am using it wrong, as my results with it were pretty bad.
isjustintime 6/30/2025||
Very cool. We use Tiktoken and I'd love to see the performance impact. Pretty great decision to make it drop-in compatible.
matrix2596 6/30/2025||
Is it possible for your tokenizer to ever give a different tokenization than the OpenAI tokenizer? I am asking because there are multiple ways to tokenize the same string. Sorry if I am mistaken.
matthewolfe 6/30/2025|
Should be the same. Both use Byte-Pair Encoding (BPE) as the underlying algorithm, and given the same vocabulary and merge ranks, BPE is deterministic.
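The merge procedure is greedy and fully determined by the rank table, roughly like this toy sketch (made-up ranks; real vocabularies have ~100k ranked merges):

    # Toy sketch of why BPE output is deterministic: given the same ranked
    # merge table, the lowest-ranked adjacent pair is always merged first.
    def bpe_encode(text, ranks):
        parts = [bytes([b]) for b in text.encode("utf-8")]
        while len(parts) > 1:
            # find the adjacent pair whose merge has the best (lowest) rank
            pairs = [(ranks.get(parts[i] + parts[i + 1], float("inf")), i)
                     for i in range(len(parts) - 1)]
            best_rank, i = min(pairs)
            if best_rank == float("inf"):
                break  # no more merges apply
            parts[i:i + 2] = [parts[i] + parts[i + 1]]
        return parts

    ranks = {b"lo": 0, b"lower": 2, b"low": 3, b"er": 4}
    print(bpe_encode("lower", ranks))  # same ranks -> same output, every time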
semiinfinitely 6/30/2025||
I'm relieved to see that it's not written in Rust.
matthewolfe 6/30/2025|
haha, I thought about it.
singularity2001 7/1/2025||
This is still the outdated architecture without special tokens for numbers, i.e. out-of-vocab tokens like NUM_FLOAT(3.1415), right?
EGreg 6/30/2025||
What about pairing this with BigBird and Mamba?
sheerun 6/30/2025|
Now that byte-patch-level embeddings have been discovered?