Unverified Commit 1c7c34bc authored by Yih-Dar, committed by GitHub

Improve `PreTrainedTokenizerFast` loading time when there are many added tokens (#31404)

* use hash

* use hash

* update

---------
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
parent 6e56b834
@@ -172,10 +172,12 @@ class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
             # allows converting a slow -> fast, non-legacy: if the `tokenizer.json` does not have all the added tokens
             # uses the information stored in `added_tokens_decoder`.
             # this is costly for fast tokenizers as we re-compute the regex again. But not all tokens are added tokens
+            # Use hash to speed up the very slow operation `token not in added_tokens_decoder`.
+            added_tokens_decoder_hash = {hash(repr(token)) for token in self.added_tokens_decoder}
             tokens_to_add = [
                 token
                 for index, token in sorted(added_tokens_decoder.items(), key=lambda x: x[0])
-                if token not in self.added_tokens_decoder
+                if hash(repr(token)) not in added_tokens_decoder_hash
             ]
             encoder = list(self.added_tokens_encoder.keys()) + [str(token) for token in tokens_to_add]
             # if some of the special tokens are strings, we check if we don't already have a token
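The change precomputes a hash for every existing added token once, so the membership test inside the list comprehension becomes a constant-time set lookup instead of re-evaluating `token not in self.added_tokens_decoder` on every iteration. Below is a minimal, self-contained sketch of the same pattern; the `Token` class and the token counts are hypothetical stand-ins, not transformers code:

```python
class Token:
    """Hypothetical stand-in for an AddedToken-like object."""

    def __init__(self, content, special=False):
        self.content = content
        self.special = special

    def __repr__(self):
        # repr captures every field, so identical tokens hash identically.
        return f"Token(content={self.content!r}, special={self.special})"


existing = [Token(f"<tok_{i}>") for i in range(100_000)]
candidates = [Token(f"<tok_{i}>") for i in range(90_000, 110_000)]

# Precompute the hashes of the existing tokens once up front...
existing_hashes = {hash(repr(t)) for t in existing}

# ...then each membership check is an O(1) set lookup rather than a
# repeated, expensive containment test against the token collection.
tokens_to_add = [t for t in candidates if hash(repr(t)) not in existing_hashes]
assert len(tokens_to_add) == 10_000  # only the genuinely new tokens remain
```

Hashing `repr(token)` rather than the token object itself sidesteps any custom or missing `__hash__`/`__eq__` on the token class while still comparing all of its fields.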