Posted by matthewolfe 22 hours ago
I’m teaching myself LLM internals by re-implementing the stack from first principles. Profiling tiktoken’s Python/Rust implementation showed that a lot of time was spent on regex matching. Most of my perf gains come from a) using a faster JIT-compiled regex engine, and b) simplifying the algorithm to forgo regex matching of special tokens entirely.
Benchmarking code is included. Notable results:
- 4x faster tokenization of code samples on a single thread.
- 2-3x higher throughput on a 1GB natural-language text file.
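For context, a bare-bones single-threaded throughput measurement against tiktoken's public API looks roughly like this; the corpus path and encoding name are placeholders, and the actual benchmark script in the repo is more involved:

    import time
    import tiktoken  # pip install tiktoken

    # Rough single-threaded throughput check; swap in another tokenizer to compare.
    enc = tiktoken.get_encoding("cl100k_base")
    with open("sample.txt", encoding="utf-8") as f:  # placeholder corpus
        text = f.read()

    start = time.perf_counter()
    tokens = enc.encode(text)
    elapsed = time.perf_counter() - start
    mb = len(text.encode("utf-8")) / 1e6
    print(f"{len(tokens)} tokens in {elapsed:.2f}s ({mb / elapsed:.1f} MB/s)")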
Does that mean there could be cases where tokenization quality suffers?
The tiktoken implementation takes a collection of all special tokens at initialization and compiles them into a single regex by joining them with `|` [0]. The actual encoding step then checks the input for matches against this expression.
Models like Llama 4 define a list of 1,135 special tokens. Notably, 1,115 of those are "reserved" special tokens! So this yields a huge regex full of special tokens that should never need to be considered at all.
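A rough Python sketch of what that init-time construction amounts to (the real code is Rust, see [0]; the reserved-token names below are an assumption modelled on Llama 3's naming scheme):

    import re

    # Sketch of the single alternation pattern built at init time.
    # Reserved-token names are assumed for illustration, not taken from the real vocab.
    named = ["<|begin_of_text|>", "<|end_of_text|>", "<|eot_id|>"]
    reserved = [f"<|reserved_special_token_{i}|>" for i in range(1115)]
    special_tokens = named + reserved

    special_regex = re.compile("|".join(re.escape(tok) for tok in special_tokens))

    # Every encode call then scans the input against this one huge alternation,
    # even though almost none of the reserved tokens can ever appear.
    print(len(special_regex.pattern))  # tens of thousands of characters
    print([m.group() for m in special_regex.finditer("Hello<|eot_id|>")])  # ['<|eot_id|>']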
TokenDagger does not do this. Instead, it uses simple string matching. This works because we don't need to consider the entire special vocabulary on every call: the caller of `encode` must explicitly specify which special tokens should be considered [1], so it's faster to check against the much smaller list we _know_ is being used.
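A minimal sketch of that idea, assuming nothing about TokenDagger's internals beyond the description above (tiktoken's own API already requires the caller to opt in via `encode(text, allowed_special={"<|endoftext|>"})`):

    def find_special_tokens(text: str, allowed_special: set[str]) -> list[tuple[int, str]]:
        """Plain string search over only the special tokens the caller opted into,
        instead of one regex compiled from the full special vocabulary.
        Hypothetical helper for illustration, not TokenDagger's actual code."""
        hits = []
        for tok in allowed_special:
            start = 0
            while (i := text.find(tok, start)) != -1:
                hits.append((i, tok))
                start = i + len(tok)
        return sorted(hits)

    print(find_special_tokens("Hello<|endoftext|>world", {"<|endoftext|>"}))
    # [(5, '<|endoftext|>')]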
[0] https://github.com/openai/tiktoken/blob/main/src/lib.rs#L476
[1] https://github.com/openai/tiktoken/blob/main/tiktoken/core.p...