"vscode:/vscode.git/clone" did not exist on "32090c729f8ac3c07ee909ace64f36d2a5a363f2"
Unverified commit 2da88537 authored by Arthur, committed by GitHub

🚨🚨 🚨🚨 [`Tokenizer`] attempt to fix add_token issues 🚨🚨 🚨🚨 (#23909)



* fix test for bart. Order is correct now let's skip BPEs

* phew

* styling

* fix bert....

* slow refactoring

* current updates

* massive refactoring

* update

* NICE!

* update to see where I am at

* updates

* update

* update

* revert

* updates

* updates

* start supporting legacy_save

* styling

* big update

* revert some changes

* nits

* nniiiiiice

* small fixes

* kinda fix t5 with new behaviour

* major update

* fixup

* fix copies

* today's updates

* fix byt5

* update

* update

* update

* updates

* update vocab size test

* Barthez does not need the fairseq offset ids

* super call must be after

* call super

* move all super init

* move other super init

* fixup

* nits

* more fixes

* nits

* more fixes

* nits

* more fix

* remove useless files

* ouch all of them are affected

* and more!

* small improvements

* no more sanitize token

* more changes around unique no split tokens

* partially fix more things

* keep legacy save but add warning

* so... more fixes

* updates

* guess deberta tokenizer could be nuked

* fixup

* fixup did some bad things

* nuke it if it breaks

* remove prints and pretrain fast from slow with new format.

* fixups

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* phew

* nit

* by default specials should not be normalized?

* update

* remove breakpoint

* updates

* a lot of updates

* fixup

* fixes: revert some changes to match fast

* small nits

* that makes it cleaner

* fix camembert accordingly

* update

* some less breaking changes

* update

* fixup

* fix byt5 and whisper mostly

* some more fixes, canine's byte vocab

* fix gpt2

* fix most of the perceiver tests (4 left)

* fix layout lmv3

* fixup

* fix copies for gpt2 style

* make sure to only warn once

* fix perceiver and gpt2 tests

* some more backward compatibility: also read the special tokens map because some people use it

* fixup

* add else when reading

* nits

* fresh updates

* fix copies

* will this make everything faster?

* fixes

* more fixes

* update

* more fixes

* fixup

* is the source of truth right?

* sorry camembert for the troubles

* current updates

* fixup

* update led

* update

* fix regression

* fix single word

* more model specific fixes

* fix t5 tests

* fixup

* more comments

* update

* fix nllb

* rstrip removed

* small fixes

* better handle additional_special_tokens and vocab sizes

* fixing

* styling

* fix 4 / 21

* fixup

* fix nllb's tests

* some fixes

* fix t5

* fixes

* style

* fix canine tests

* damn this is nice

* nits

* m2m100 nit

* fixups

* fixes!

* fixup

* stash

* fix merge

* revert bad change

* fixup

* correct order for code Llama

* fix speecht5 post merge

* styling

* revert source of 11 fails

* small nits

* all changes in one go

* fnet hack

* fix 2 more tests

* update based on main branch of tokenizers

* fixup

* fix VITS issues

* more fixes

* fix mgp test

* fix camembert issues

* oops, camembert still has 2 failing tests

* mluke fixes

* decode fixes

* small nits

* nits

* fix llama and vits

* fix camembert

* small nits

* more fixes when initialising a fast tokenizer from a slow one, etc.

* fix one of the last test

* fix CPM tokenizer test

* fixups

* fix pop2piano

* fixup

* Change tokenizers required version

* Change tokenizers required version

* "tokenizers>=0.14,<0.15", don't forget smaller than

* fix musicgen tests and PreTrainedTokenizerFast

* fix owlvit and all

* update t5

* fix 800 red

* fix tests

* fix the fix of the fix of t5

* styling

* documentation nits

* cache _added_tokens_encoder

* fixups

* Nit

* fix red tests

* one last nit!

* make everything a lot simpler

* Now it's over 😉



* few small nits

* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* updates that work for now

* tests that should not be skipped / changed and fixed next

* fixup

* i am ashamed

* push the fix

* update

* fixups

* nits

* fix added_tokens_encoder

* fix canine test

* fix pegasus vocab

* fix transfoXL

* fixup

* whisper needs to be fixed for train new

* pegasus nits

* more pegasus fixes

* minor update

* better error message in failed test

* fix whisper failing test

* fix whisper failing test

* fix pegasus

* fixup

* fix **** pegasus

* reset things

* remove another file

* attempts to fix the strange custom encoder and offset

* nits here and there

* update

* fixup

* nit

* fix the whisper test

* nits nits

* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* updates based on review

* some small update to potentially remove

* nits

* import lru cache

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Lysandre Debut <hi@lysand.re>

* move warning to `from_pretrained`

* update test results now that the special tokens are always added

---------
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Lysandre Debut <hi@lysand.re>
parent 835b0a05
@@ -19,7 +19,7 @@ from functools import lru_cache
 from typing import List, Optional, Tuple
 import numpy as np
-from tokenizers import pre_tokenizers, processors
+from tokenizers import AddedToken, pre_tokenizers, processors
 from ...tokenization_utils_base import BatchEncoding
 from ...tokenization_utils_fast import PreTrainedTokenizerFast
@@ -148,6 +148,22 @@ class WhisperTokenizerFast(PreTrainedTokenizerFast):
         predict_timestamps=False,
         **kwargs,
     ):
+        bos_token = (
+            AddedToken(bos_token, lstrip=False, rstrip=False, normalized=False, special=True)
+            if isinstance(bos_token, str)
+            else bos_token
+        )
+        eos_token = (
+            AddedToken(eos_token, lstrip=False, rstrip=False, normalized=False, special=True)
+            if isinstance(eos_token, str)
+            else eos_token
+        )
+        unk_token = (
+            AddedToken(unk_token, lstrip=False, rstrip=False, normalized=False, special=True)
+            if isinstance(unk_token, str)
+            else unk_token
+        )
         super().__init__(
             vocab_file,
             merges_file,
@@ -444,11 +460,10 @@ class WhisperTokenizerFast(PreTrainedTokenizerFast):
     @property
     # Copied from transformers.models.whisper.tokenization_whisper.WhisperTokenizer.prefix_tokens
     def prefix_tokens(self) -> List[int]:
-        all_special_ids = self.all_special_ids
-        bos_token_id = all_special_ids[-106]
-        translate_token_id = all_special_ids[-6]
-        transcribe_token_id = all_special_ids[-5]
-        notimestamps_token_id = all_special_ids[-1]
+        bos_token_id = self.convert_tokens_to_ids("<|startoftranscript|>")
+        translate_token_id = self.convert_tokens_to_ids("<|translate|>")
+        transcribe_token_id = self.convert_tokens_to_ids("<|transcribe|>")
+        notimestamps_token_id = self.convert_tokens_to_ids("<|notimestamps|>")
         langs = tuple(LANGUAGES.keys())
         if self.language is not None:
......
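For context on the `prefix_tokens` change above: the old code indexed into `all_special_ids` at hard-coded positions, which silently breaks whenever the number or order of special tokens changes; the new code looks the control tokens up by content. A minimal sketch of the difference (the checkpoint name is only an example and Hub access is assumed):

    # Sketch: resolving Whisper control tokens by content instead of by position.
    from transformers import WhisperTokenizerFast

    tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-tiny")

    # Fragile (old): relies on "<|startoftranscript|>" always sitting at a fixed offset from the end.
    # bos_token_id = tokenizer.all_special_ids[-106]

    # Robust (new): ask for the id of the token string directly.
    bos_token_id = tokenizer.convert_tokens_to_ids("<|startoftranscript|>")
    notimestamps_token_id = tokenizer.convert_tokens_to_ids("<|notimestamps|>")
    print(bos_token_id, notimestamps_token_id)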
@@ -137,17 +137,6 @@ class XGLMTokenizer(PreTrainedTokenizer):
                 word for word in madeup_words if word not in kwargs["additional_special_tokens"]
             ]
-        super().__init__(
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            sp_model_kwargs=self.sp_model_kwargs,
-            **kwargs,
-        )
         self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
         self.sp_model.Load(str(vocab_file))
         self.vocab_file = vocab_file
@@ -170,6 +159,17 @@ class XGLMTokenizer(PreTrainedTokenizer):
         self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}
+        super().__init__(
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            sp_model_kwargs=self.sp_model_kwargs,
+            **kwargs,
+        )
     def __getstate__(self):
         state = self.__dict__.copy()
         state["sp_model"] = None
......
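The XGLM hunk above (and the XLM, XLM-ProphetNet, XLM-RoBERTa and XLNet hunks that follow) all apply the same pattern: build the SentencePiece model and any id-offset tables first, then call `super().__init__()` last, because the base `__init__` now registers missing special tokens and therefore needs a readable vocabulary. A minimal sketch of the pattern with a purely illustrative custom slow tokenizer (the class name and methods below are assumptions, not part of the PR):

    import sentencepiece as spm
    from transformers import PreTrainedTokenizer


    class MySentencePieceTokenizer(PreTrainedTokenizer):
        def __init__(self, vocab_file, unk_token="<unk>", **kwargs):
            # Everything the vocab depends on must exist before super().__init__,
            # because the base class may call get_vocab() to register special tokens.
            self.vocab_file = vocab_file
            self.sp_model = spm.SentencePieceProcessor()
            self.sp_model.Load(vocab_file)
            # Only now hand the special tokens over to the base class.
            super().__init__(unk_token=unk_token, **kwargs)

        @property
        def vocab_size(self):
            return len(self.sp_model)

        def get_vocab(self):
            vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
            vocab.update(self.added_tokens_encoder)
            return vocab

        def _tokenize(self, text):
            return self.sp_model.encode(text, out_type=str)

        def _convert_token_to_id(self, token):
            return self.sp_model.piece_to_id(token)

        def _convert_id_to_token(self, index):
            return self.sp_model.id_to_piece(index)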
@@ -613,20 +613,6 @@ class XLMTokenizer(PreTrainedTokenizer):
         do_lowercase_and_remove_accent=True,
         **kwargs,
     ):
-        super().__init__(
-            unk_token=unk_token,
-            bos_token=bos_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            additional_special_tokens=additional_special_tokens,
-            lang2id=lang2id,
-            id2lang=id2lang,
-            do_lowercase_and_remove_accent=do_lowercase_and_remove_accent,
-            **kwargs,
-        )
         try:
             import sacremoses
         except ImportError:
@@ -660,6 +646,19 @@ class XLMTokenizer(PreTrainedTokenizer):
         merges = [tuple(merge.split()[:2]) for merge in merges]
         self.bpe_ranks = dict(zip(merges, range(len(merges))))
         self.cache = {}
+        super().__init__(
+            unk_token=unk_token,
+            bos_token=bos_token,
+            sep_token=sep_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            mask_token=mask_token,
+            additional_special_tokens=additional_special_tokens,
+            lang2id=lang2id,
+            id2lang=id2lang,
+            do_lowercase_and_remove_accent=do_lowercase_and_remove_accent,
+            **kwargs,
+        )
     @property
     def do_lower_case(self):
......
@@ -145,18 +145,6 @@ class XLMProphetNetTokenizer(PreTrainedTokenizer):
     ) -> None:
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
-        super().__init__(
-            bos_token=bos_token,
-            eos_token=eos_token,
-            sep_token=sep_token,
-            unk_token=unk_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            sp_model_kwargs=self.sp_model_kwargs,
-            **kwargs,
-        )
         try:
             import sentencepiece as spm
         except ImportError:
@@ -186,8 +174,20 @@ class XLMProphetNetTokenizer(PreTrainedTokenizer):
         # The first "real" token "," has position 15 in the embedding vocab and position 3 in the spm vocab
         self.fairseq_offset = 12
         self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}
-        for k in self.fairseq_tokens_to_ids.keys():
-            self.unique_no_split_tokens.append(k)
+        # TODO ArthurZ fairseq_ids_to_tokens should be removed
+        super().__init__(
+            bos_token=bos_token,
+            eos_token=eos_token,
+            sep_token=sep_token,
+            unk_token=unk_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            mask_token=mask_token,
+            sp_model_kwargs=self.sp_model_kwargs,
+            **kwargs,
+        )
     @property
     def can_save_slow_tokenizer(self) -> bool:
......
@@ -152,18 +152,6 @@ class XLMRobertaTokenizer(PreTrainedTokenizer):
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
-        super().__init__(
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            sp_model_kwargs=self.sp_model_kwargs,
-            **kwargs,
-        )
         self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
         self.sp_model.Load(str(vocab_file))
         self.vocab_file = vocab_file
@@ -183,6 +171,18 @@ class XLMRobertaTokenizer(PreTrainedTokenizer):
         self.fairseq_tokens_to_ids["<mask>"] = len(self.sp_model) + self.fairseq_offset
         self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}
+        super().__init__(
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            sp_model_kwargs=self.sp_model_kwargs,
+            **kwargs,
+        )
     def __getstate__(self):
         state = self.__dict__.copy()
         state["sp_model"] = None
@@ -288,6 +288,7 @@ class XLMRobertaTokenizer(PreTrainedTokenizer):
         return vocab
     def _tokenize(self, text: str) -> List[str]:
+        # TODO check if the t5/llama PR also applies here
         return self.sp_model.encode(text, out_type=str)
     def _convert_token_to_id(self, token):
......
@@ -152,6 +152,14 @@ class XLNetTokenizer(PreTrainedTokenizer):
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
+        self.do_lower_case = do_lower_case
+        self.remove_space = remove_space
+        self.keep_accents = keep_accents
+        self.vocab_file = vocab_file
+        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+        self.sp_model.Load(vocab_file)
         super().__init__(
             do_lower_case=do_lower_case,
             remove_space=remove_space,
@@ -170,14 +178,6 @@ class XLNetTokenizer(PreTrainedTokenizer):
         self._pad_token_type_id = 3
-        self.do_lower_case = do_lower_case
-        self.remove_space = remove_space
-        self.keep_accents = keep_accents
-        self.vocab_file = vocab_file
-        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
-        self.sp_model.Load(vocab_file)
     @property
     def vocab_size(self):
         return len(self.sp_model)
......
@@ -57,6 +57,7 @@ class Trie:
     def __init__(self):
         self.data = {}
+        self._tokens = set()
     def add(self, word: str):
         """
@@ -81,6 +82,8 @@ class Trie:
         if not word:
             # Prevent empty string
             return
+        self._tokens.add(word)
         ref = self.data
         for char in word:
             ref[char] = char in ref and ref[char] or {}
@@ -344,17 +347,48 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
     """
     def __init__(self, **kwargs):
+        # 1. Init the parent class
         super().__init__(**kwargs)
-        # Added tokens - We store this for both slow and fast tokenizers
-        # until the serialization of Fast tokenizers is updated
-        self.added_tokens_encoder: Dict[str, int] = {}
-        self.added_tokens_decoder: Dict[int, str] = {}
-        self.unique_no_split_tokens: List[str] = []
         self.tokens_trie = Trie()
+        # 2. init `_added_tokens_decoder` if child class did not
+        if not hasattr(self, "_added_tokens_decoder"):
+            self._added_tokens_decoder: Dict[int, AddedToken] = {}
+        # 3. if a `added_tokens_decoder` is passed, we are loading from a saved tokenizer, we overwrite
+        if "added_tokens_decoder" in kwargs:
+            # overwriting the class's added_tokens_decoder. This is the source of truth!
+            self._added_tokens_decoder.update(kwargs.get("added_tokens_decoder"))
+        self._added_tokens_encoder: Dict[str, int] = {k.content: v for v, k in self._added_tokens_decoder.items()}
+        # 4. If some of the special tokens are not part of the vocab, we add them, at the end.
+        # the order of addition is the same as self.SPECIAL_TOKENS_ATTRIBUTES following `tokenizers`
+        self._add_tokens(self.all_special_tokens_extended, special_tokens=True)
         self._decode_use_source_tokenizer = False
+    @property
+    def added_tokens_decoder(self) -> Dict[int, AddedToken]:
+        """
+        Returns the added tokens in the vocabulary as a dictionary of index to AddedToken.
+        Returns:
+            `Dict[str, int]`: The added tokens.
+        """
+        return dict(sorted(self._added_tokens_decoder.items(), key=lambda item: item[0]))
+    @added_tokens_decoder.setter
+    def added_tokens_decoder(self, value: Dict[int, Union[AddedToken, str]]) -> Dict[int, AddedToken]:
+        # Always raise an error if string because users should define the behavior
+        for index, token in value.items():
+            if not isinstance(token, (str, AddedToken)) or not isinstance(index, int):
+                raise ValueError(
+                    f"The provided `added_tokens_decoder` has an element of type {index.__class__, token.__class__}, should be a dict of {int, Union[AddedToken, str]}"
+                )
+            self._added_tokens_decoder[index] = AddedToken(token) if isinstance(token, str) else token
+            self._added_tokens_encoder[str(token)] = index
     @property
     def is_fast(self) -> bool:
         return False
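With the rework above, `_added_tokens_decoder` (index -> `AddedToken`) becomes the single source of truth for added tokens on slow tokenizers, and the encoder view is derived from it. A hedged sketch of how it can be inspected and extended, based on the property and setter introduced in this hunk (the checkpoint name is only an example):

    from transformers import AutoTokenizer
    from transformers.tokenization_utils_base import AddedToken

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)

    # Sorted mapping index -> AddedToken; special tokens now show up here even if they
    # already live in the base vocabulary.
    print(tokenizer.added_tokens_decoder)

    # The setter accepts {int: str | AddedToken} and keeps the reverse (string -> index) map in sync.
    new_id = len(tokenizer)
    tokenizer.added_tokens_decoder = {new_id: AddedToken("<new_tok>", lstrip=True)}
    print(tokenizer.added_tokens_decoder[new_id])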
@@ -368,28 +402,34 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
     def get_added_vocab(self) -> Dict[str, int]:
         """
-        Returns the added tokens in the vocabulary as a dictionary of token to index.
+        Returns the added tokens in the vocabulary as a dictionary of token to index. Results might be different from
+        the fast call because for now we always add the tokens even if they are already in the vocabulary. This is
+        something we should change.
         Returns:
             `Dict[str, int]`: The added tokens.
         """
-        return self.added_tokens_encoder
+        return self._added_tokens_encoder
     def __len__(self):
         """
-        Size of the full vocabulary with the added tokens.
+        Size of the full vocabulary with the added tokens. Counts the `keys` and not the `values` because otherwise if
+        there is a hole in the vocab, we will add tokenizers at a wrong index.
         """
-        return self.vocab_size + len(self.added_tokens_encoder)
+        return len(set(self.get_vocab().keys()))
     def _add_tokens(self, new_tokens: Union[List[str], List[AddedToken]], special_tokens: bool = False) -> int:
         """
         Add a list of new tokens to the tokenizer class. If the new tokens are not in the vocabulary, they are added to
-        it with indices starting from length of the current vocabulary.
+        it with indices starting from length of the current vocabulary. Special tokens are sometimes already in the
+        vocab which is why they have to be handled specifically.
         Args:
             new_tokens (`List[str]`or `List[tokenizers.AddedToken]`):
-                Token(s) to add in vocabulary. A token is only added if it's not already in the vocabulary (tested by
-                checking if the tokenizer assign the index of the `unk_token` to them).
+                Token(s) to add in vocabulary. A token is counted as added if it's not already in the vocabulary
+                (tested by checking if the tokenizer assign the index of the `unk_token` to them). If a token is part
+                of the vocabulary then we simply mark this token as an `AddedToken` which allows to control the
+                stripping and normalization of this token. This is NOT possible in `tokenizers`.
             special_tokens (`bool`, *optional*, defaults to `False`):
                 Whether or not the tokens should be added as special tokens.
@@ -408,52 +448,52 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
         # Note: resize_token_embeddings expects to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
         model.resize_token_embeddings(len(tokenizer))
         ```"""
-        new_tokens = [str(tok) for tok in new_tokens]
-        tokens_to_add = []
+        added_tokens = 0
+        if new_tokens is None:
+            return added_tokens
+        current_vocab = self.get_vocab().copy()
+        new_idx = len(current_vocab)  # only call this once, len gives the last index + 1
         for token in new_tokens:
-            if not isinstance(token, str):
+            if not isinstance(token, (str, AddedToken)):
                 raise TypeError(f"Token {token} is not a string but a {type(token)}.")
-            if not special_tokens and hasattr(self, "do_lower_case") and self.do_lower_case:
-                token = token.lower()
-            if (
-                token != self.unk_token
-                and self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token)
-                and token not in tokens_to_add
-            ):
-                tokens_to_add.append(token)
-                if self.verbose:
-                    logger.info(f"Adding {token} to the vocabulary")
-        added_tok_encoder = {tok: len(self) + i for i, tok in enumerate(tokens_to_add)}
-        added_tok_decoder = {v: k for k, v in added_tok_encoder.items()}
-        self.added_tokens_encoder.update(added_tok_encoder)
-        self.added_tokens_decoder.update(added_tok_decoder)
-        # Make sure we don't split on any special tokens (even they were already in the vocab before e.g. for Albert)
-        if special_tokens:
-            if len(new_tokens) == 1:
-                _insert_one_token_to_ordered_list(self.unique_no_split_tokens, new_tokens[0])
-            else:
-                self.unique_no_split_tokens = sorted(set(self.unique_no_split_tokens).union(set(new_tokens)))
-        else:
-            # Or on the newly added tokens
-            if len(tokens_to_add) == 1:
-                _insert_one_token_to_ordered_list(self.unique_no_split_tokens, tokens_to_add[0])
-            else:
-                self.unique_no_split_tokens = sorted(set(self.unique_no_split_tokens).union(set(tokens_to_add)))
-        self._create_trie(self.unique_no_split_tokens)
-        return len(tokens_to_add)
-    def _create_trie(self, unique_no_split_tokens):
-        trie = Trie()
+            if str(token) == "":
+                continue
+            if isinstance(token, str):
+                # for legacy AddedTokens strip left and right by default
+                # TODO this will be remove to have the same default behavior as rust
+                token = AddedToken(token, normalized=not special_tokens, rstrip=True, lstrip=True)
+            if special_tokens:
+                token.special = True
+            if token in self._added_tokens_decoder:
+                continue
+            if not token.special and token.normalized and hasattr(self, "do_lower_case") and self.do_lower_case:
+                # Normalize if requested
+                token.content = token.content.lower()
+            if token.content not in current_vocab:
+                token_index = new_idx + added_tokens
+                current_vocab[token.content] = token_index
+                added_tokens += 1
+            else:
+                token_index = current_vocab[token.content]
+            if token.special and str(token) not in self.all_special_tokens:
+                self._additional_special_tokens.append(token)
+            # the setter automatically updates the reverse map
+            self._added_tokens_decoder[token_index] = token
+            self._added_tokens_encoder[token.content] = token_index
+            if self.verbose:
+                logger.info(f"Adding {token} to the vocabulary")
+        self._update_trie()
+        return added_tokens
+    def _update_trie(self, unique_no_split_tokens: Optional[str] = []):
+        for token in self._added_tokens_decoder.values():
+            if token not in self.tokens_trie._tokens:
+                self.tokens_trie.add(token.content)
         for token in unique_no_split_tokens:
-            if hasattr(self, "do_lower_case") and self.do_lower_case and token not in self.all_special_tokens:
-                trie.add(token.lower())
-            else:
-                trie.add(token)
-        self.tokens_trie = trie
+            if token not in self.tokens_trie._tokens:
+                self.tokens_trie.add(token)
     def num_special_tokens_to_add(self, pair: bool = False) -> int:
         """
@@ -494,10 +534,6 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
         Returns:
             `List[str]`: The list of tokens.
         """
-        # Simple mapping string => AddedToken for special tokens with specific tokenization behaviors
-        all_special_tokens_extended = {
-            str(t): t for t in self.all_special_tokens_extended if isinstance(t, AddedToken)
-        }
         split_special_tokens = kwargs.pop("split_special_tokens", self.split_special_tokens)
         text, kwargs = self.prepare_for_tokenization(text, **kwargs)
@@ -505,27 +541,29 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
         if kwargs:
             logger.warning(f"Keyword arguments {kwargs} not recognized.")
+        # TODO: should this be in the base class?
         if hasattr(self, "do_lower_case") and self.do_lower_case:
             # convert non-special tokens to lowercase
-            escaped_special_toks = [
-                re.escape(s_tok) for s_tok in (self.unique_no_split_tokens + self.all_special_tokens)
+            escaped_special_toks = [re.escape(s_tok) for s_tok in (self.all_special_tokens)]
+            escaped_special_toks += [
+                re.escape(s_tok.content)
+                for s_tok in (self._added_tokens_decoder.values())
+                if not s_tok.special and s_tok.normalized
             ]
             pattern = r"(" + r"|".join(escaped_special_toks) + r")|" + r"(.+?)"
             text = re.sub(pattern, lambda m: m.groups()[0] or m.groups()[1].lower(), text)
+        # split_special_tokens: empty `no_split_token`
         if split_special_tokens:
             no_split_token = []
             tokens = [text]
         else:
-            no_split_token = set(self.unique_no_split_tokens)
+            no_split_token = set(self._added_tokens_encoder.keys())  # don't split on any of the added tokens
+            # "This is something<special_token_1> else"
             tokens = self.tokens_trie.split(text)
         # ["This is something", "<special_token_1>", " else"]
         for i, token in enumerate(tokens):
             if token in no_split_token:
-                tok_extended = all_special_tokens_extended.get(token, None)
+                tok_extended = self._added_tokens_decoder.get(self._added_tokens_encoder[token], None)
                 left = tokens[i - 1] if i > 0 else None
                 right = tokens[i + 1] if i < len(tokens) - 1 else None
                 if isinstance(tok_extended, AddedToken):
@@ -536,12 +574,18 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
                     # Strip white spaces on the left
                     if tok_extended.lstrip and left:
                         tokens[i - 1] = left.rstrip()  # Opposite here
+                    if tok_extended.single_word and left and left[-1] != " ":
+                        tokens[i - 1] += token
+                        tokens[i] = ""
+                    elif tok_extended.single_word and right and right[0] != " ":
+                        tokens[i + 1] = token + tokens[i + 1]
+                        tokens[i] = ""
                 else:
-                    # We strip left and right by default
-                    if right:
-                        tokens[i + 1] = right.lstrip()
-                    if left:
-                        tokens[i - 1] = left.rstrip()
+                    raise ValueError(
+                        f"{tok_extended} cannot be tokenized because it was not properly added"
+                        f" to the tokenizer. This means that it is not an `AddedToken` but a {type(tok_extended)}"
+                    )
         # ["This is something", "<special_token_1>", "else"]
         tokenized_text = []
         for token in tokens:
@@ -590,8 +634,8 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
         if token is None:
             return None
-        if token in self.added_tokens_encoder:
-            return self.added_tokens_encoder[token]
+        if token in self._added_tokens_encoder:
+            return self._added_tokens_encoder[token]
         return self._convert_token_to_id(token)
     def _convert_token_to_id(self, token):
@@ -904,8 +948,8 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
             `str` or `List[str]`: The decoded token(s).
         """
         if isinstance(ids, int):
-            if ids in self.added_tokens_decoder:
-                return self.added_tokens_decoder[ids]
+            if ids in self._added_tokens_decoder:
+                return self._added_tokens_decoder[ids].content
             else:
                 return self._convert_id_to_token(ids)
         tokens = []
@@ -913,8 +957,8 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
             index = int(index)
             if skip_special_tokens and index in self.all_special_ids:
                 continue
-            if index in self.added_tokens_decoder:
-                tokens.append(self.added_tokens_decoder[index])
+            if index in self._added_tokens_decoder:
+                tokens.append(self._added_tokens_decoder[index].content)
             else:
                 tokens.append(self._convert_id_to_token(index))
         return tokens
@@ -935,19 +979,29 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
     ) -> str:
         self._decode_use_source_tokenizer = kwargs.pop("use_source_tokenizer", False)
+        if spaces_between_special_tokens:
+            logger.warning_once(
+                "spaces_between_special_tokens is deprecated and will be removed in transformers v5. It was adding spaces between `added_tokens`, not special tokens, "
+                "and does not exist in our fast implementation. Future tokenizers will handle the decoding process on a per-model rule."
+            )
         filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
+        legacy_added_tokens = set(self._added_tokens_encoder.keys()) - set(self.all_special_tokens) | {
+            token for token in self.additional_special_tokens if self.convert_tokens_to_ids(token) >= self.vocab_size
+        }
         # To avoid mixing byte-level and unicode for byte-level BPT
         # we need to build string separately for added tokens and byte-level tokens
         # cf. https://github.com/huggingface/transformers/issues/1133
         sub_texts = []
         current_sub_text = []
+        # TODO @ArthurZ in version 5, special tokens should be handled in convert_tokens_to_string, while _convert_tokens_to_string
         for token in filtered_tokens:
             if skip_special_tokens and token in self.all_special_ids:
                 continue
-            if token in self.added_tokens_encoder:
+            if token in legacy_added_tokens:
                 if current_sub_text:
-                    sub_texts.append(self.convert_tokens_to_string(current_sub_text))
+                    string = self.convert_tokens_to_string(current_sub_text)
+                    if len(string) > 0:
+                        sub_texts.append(string)
                     current_sub_text = []
                 sub_texts.append(token)
             else:
......
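On the decoding side, the hunks above keep the old "build sub-strings around added tokens" behaviour only for legacy added tokens (tokens that live outside the underlying vocab), deprecate `spaces_between_special_tokens`, and drop empty sub-strings. A small sketch of the observable behaviour, assuming the example checkpoint is available:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=False)
    tokenizer.add_tokens(["<custom_tok>"])

    ids = tokenizer.encode("hello <custom_tok> world")
    print(tokenizer.decode(ids))
    # skip_special_tokens only drops tokens in all_special_ids; plain added tokens like
    # <custom_tok> stay in the output.
    print(tokenizer.decode(ids, skip_special_tokens=True))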
@@ -23,10 +23,10 @@ import json
 import os
 import re
 import warnings
-from collections import OrderedDict, UserDict
+from collections import UserDict
 from collections.abc import Mapping, Sized
 from contextlib import contextmanager
-from dataclasses import dataclass, field
+from dataclasses import dataclass
 from functools import lru_cache
 from typing import TYPE_CHECKING, Any, Dict, List, NamedTuple, Optional, Sequence, Tuple, Union
@@ -78,18 +78,25 @@ if is_tokenizers_available():
     from tokenizers import Encoding as EncodingFast
 else:
-    @dataclass(frozen=True, eq=True)
+    @dataclass(frozen=False, eq=True)
     class AddedToken:
         """
         AddedToken represents a token to be added to a Tokenizer An AddedToken can have special options defining the
         way it should behave.
+        The `normalized` will default to `not special` if it is not specified, similarly to the definition in
+        `tokenizers`.
         """
-        content: str = field(default_factory=str)
-        single_word: bool = False
-        lstrip: bool = False
-        rstrip: bool = False
-        normalized: bool = True
+        def __init__(
+            self, content: str, single_word=False, lstrip=False, rstrip=False, special=False, normalized=None
+        ):
+            self.content = content
+            self.single_word = single_word
+            self.lstrip = lstrip
+            self.rstrip = rstrip
+            self.special = special
+            self.normalized = normalized if normalized is not None else not special
         def __getstate__(self):
             return self.__dict__
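The fallback `AddedToken` above (used when the `tokenizers` library is not installed) is no longer a frozen dataclass and gains `special`, with `normalized` defaulting to `not special`. A tiny sketch of the resulting defaults, valid for the pure-Python fallback shown here (the Rust class shipped with `tokenizers` exposes the same fields, though its defaults live on the Rust side):

    from transformers.tokenization_utils_base import AddedToken

    regular = AddedToken("<watermark>")          # special=False -> normalized defaults to True
    special = AddedToken("<eot>", special=True)  # special=True  -> normalized defaults to False
    print(regular.normalized, special.normalized)
    # The default can still be overridden explicitly.
    explicit = AddedToken("<eot>", special=True, normalized=True)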
@@ -806,7 +813,8 @@ class SpecialTokensMixin:
             A special token representing a masked token (used by masked-language modeling pretraining objectives, like
             BERT).
         additional_special_tokens (tuple or list of `str` or `tokenizers.AddedToken`, *optional*):
-            A tuple or a list of additional special tokens.
+            A tuple or a list of additional tokens, which will be marked as `special`, meaning that they will be
+            skipped when decoding if `skip_special_tokens` is set to `True`.
     """
     SPECIAL_TOKENS_ATTRIBUTES = [
@@ -845,21 +853,20 @@ class SpecialTokensMixin:
                         isinstance(t, (str, AddedToken)) for t in value
                     ), "One of the tokens is not a string or an AddedToken"
                     setattr(self, key, value)
-                elif isinstance(value, (str, AddedToken)):
+                elif isinstance(value, (str)):
+                    value = AddedToken(value, normalized=False, special=True)
+                    setattr(self, key, value)
+                elif isinstance(value, AddedToken):
                     setattr(self, key, value)
                 else:
-                    raise TypeError(f"special token {key} has to be either str or AddedToken but got: {type(value)}")
+                    raise TypeError(f"Special token {key} has to be either str or AddedToken but got: {type(value)}")
     def sanitize_special_tokens(self) -> int:
         """
-        Make sure that all the special tokens attributes of the tokenizer (`tokenizer.mask_token`,
-        `tokenizer.cls_token`, etc.) are in the vocabulary.
-        Add the missing ones to the vocabulary if needed.
-        Return:
-            `int`: The number of tokens added in the vocabulary during the operation.
+        The `sanitize_special_tokens` is now deprecated kept for backward compatibility and will be removed in
+        transformers v5.
         """
+        logger.warning_once("The `sanitize_special_tokens` will be removed in transformers v5.")
         return self.add_tokens(self.all_special_tokens_extended, special_tokens=True)
     def add_special_tokens(
@@ -870,14 +877,15 @@ class SpecialTokensMixin:
         special tokens are NOT in the vocabulary, they are added to it (indexed starting from the last index of the
         current vocabulary).
-        Note,None When adding new tokens to the vocabulary, you should make sure to also resize the token embedding
-        matrix of the model so that its embedding matrix matches the tokenizer.
+        When adding new tokens to the vocabulary, you should make sure to also resize the token embedding matrix of the
+        model so that its embedding matrix matches the tokenizer.
         In order to do that, please use the [`~PreTrainedModel.resize_token_embeddings`] method.
         Using `add_special_tokens` will ensure your special tokens can be used in several ways:
-        - Special tokens are carefully handled by the tokenizer (they are never split).
+        - Special tokens can be skipped when decoding using `skip_special_tokens = True`.
+        - Special tokens are carefully handled by the tokenizer (they are never split), similar to `AddedTokens`.
         - You can easily refer to special tokens using tokenizer class attributes like `tokenizer.cls_token`. This
           makes it easy to develop model-agnostic training and fine-tuning scripts.
@@ -893,10 +901,12 @@ class SpecialTokensMixin:
                 Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer
                 assign the index of the `unk_token` to them).
            replace_additional_special_tokens (`bool`, *optional*,, defaults to `True`):
-                If `True`, the existing list of additional special tokens will be replaced by the one specified in
-                `special_tokens_dict`. Otherwise, `self._additional_special_tokens` is updated. In the former case, the
-                tokens will NOT be removed from the tokenizer's full vocabulary - they are only being flagged as
-                non-special tokens.
+                If `True`, the existing list of additional special tokens will be replaced by the list provided in
+                `special_tokens_dict`. Otherwise, `self._additional_special_tokens` is just extended. In the former
+                case, the tokens will NOT be removed from the tokenizer's full vocabulary - they are only being flagged
+                as non-special tokens. Remember, this only affects which tokens are skipped during decoding, not the
+                `added_tokens_encoder` and `added_tokens_decoder`. This means that the previous
+                `additional_special_tokens` are still added tokens, and will not be split by the model.
         Returns:
             `int`: Number of tokens added to the vocabulary.
@@ -920,7 +930,7 @@ class SpecialTokensMixin:
         if not special_tokens_dict:
             return 0
-        added_tokens = 0
+        added_tokens = []
         for key, value in special_tokens_dict.items():
             assert key in self.SPECIAL_TOKENS_ATTRIBUTES, f"Key {key} is not a special token"
@@ -932,28 +942,32 @@ class SpecialTokensMixin:
                     isinstance(t, (str, AddedToken)) for t in value
                 ), f"Tokens {value} for key {key} should all be str or AddedToken instances"
+                to_add = set()
+                for token in value:
+                    if isinstance(token, str):
+                        # for legacy purpose we default to stripping. `test_add_tokens_tokenizer` depends on this
+                        token = AddedToken(token, normalized=False, rstrip=True, lstrip=True)
+                    if str(token) not in self.additional_special_tokens:
+                        to_add.add(token)
                 if replace_additional_special_tokens:
-                    setattr(self, key, value)
+                    setattr(self, key, list(to_add))
                 else:
-                    # This is a copy of `self._additional_special_tokens`
-                    additional_special_tokens = getattr(self, key)
-                    additional_special_tokens_set = set(additional_special_tokens)
-                    to_add = []
-                    for token in value:
-                        if str(token) not in additional_special_tokens_set and str(token) not in to_add:
-                            to_add.append(token)
-                    # update the property
-                    additional_special_tokens.extend(to_add)
-                    self.additional_special_tokens = additional_special_tokens
-                added_tokens += self.add_tokens(value, special_tokens=True)
+                    self._additional_special_tokens.extend(to_add)
+                added_tokens += to_add
             else:
-                assert isinstance(
-                    value, (str, AddedToken)
-                ), f"Token {value} for key {key} should be a str or an AddedToken instance"
-                setattr(self, key, value)
-                added_tokens += self.add_tokens([value], special_tokens=True)
+                if not isinstance(value, (str, AddedToken)):
+                    raise ValueError(f"Token {value} for key {key} should be a str or an AddedToken instance")
+                if isinstance(value, (str)):
+                    # for legacy purpose we default to stripping. `test_add_tokens_tokenizer` depends on this
+                    value = AddedToken(value, normalized=False, rstrip=True, lstrip=True)
+                if isinstance(value, AddedToken):
+                    setattr(self, key, value)
+                if value not in added_tokens:
+                    added_tokens.append(value)
+        # if we are adding tokens that were not part of the vocab, we ought to add them
+        added_tokens = self.add_tokens(added_tokens, special_tokens=True)
         return added_tokens
     def add_tokens(
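With the reworked `add_special_tokens`, plain strings are wrapped into `AddedToken(..., rstrip=True, lstrip=True, normalized=False)` and everything is funnelled through a single `add_tokens(..., special_tokens=True)` call at the end. A hedged usage sketch (checkpoint name is only an example):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=False)

    # Returns the number of tokens that actually had to be appended to the vocabulary.
    num_added = tokenizer.add_special_tokens(
        {"pad_token": "[PAD]", "additional_special_tokens": ["<user>", "<assistant>"]}
    )
    print(num_added, tokenizer.pad_token, tokenizer.additional_special_tokens)

    # With replace_additional_special_tokens=False the previous extra specials are kept;
    # either way they remain added tokens and are never split.
    tokenizer.add_special_tokens(
        {"additional_special_tokens": ["<system>"]}, replace_additional_special_tokens=False
    )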
@@ -1102,35 +1116,74 @@ class SpecialTokensMixin:
     @bos_token.setter
     def bos_token(self, value):
+        if isinstance(value, str) and value != "":
+            value = AddedToken(value, normalized=False, rstrip=True, lstrip=True, special=True)
+        elif not isinstance(value, AddedToken) and value is not None:
+            raise ValueError("Cannot set a non-string value as the BOS token")
         self._bos_token = value
     @eos_token.setter
     def eos_token(self, value):
+        if isinstance(value, str) and value != "":
+            value = AddedToken(value, normalized=False, rstrip=True, lstrip=True, special=True)
+        elif not isinstance(value, AddedToken) and value is not None:
+            raise ValueError("Cannot set a non-string value as the EOS token")
         self._eos_token = value
     @unk_token.setter
     def unk_token(self, value):
+        if isinstance(value, str) and value != "":
+            value = AddedToken(value, normalized=False, rstrip=True, lstrip=True, special=True)
+        elif not isinstance(value, AddedToken) and value is not None:
+            raise ValueError("Cannot set a non-string value as the UNK token")
         self._unk_token = value
     @sep_token.setter
     def sep_token(self, value):
+        if isinstance(value, str) and value != "":
+            value = AddedToken(value, normalized=False, rstrip=True, lstrip=True, special=True)
+        elif not isinstance(value, AddedToken) and value is not None:
+            raise ValueError("Cannot set a non-string value as the SEP token")
         self._sep_token = value
     @pad_token.setter
     def pad_token(self, value):
+        if isinstance(value, str) and value != "":
+            value = AddedToken(value, normalized=False, rstrip=True, lstrip=True, special=True)
+        elif not isinstance(value, AddedToken) and value is not None:
+            raise ValueError("Cannot set a non-string value as the PAD token")
         self._pad_token = value
     @cls_token.setter
     def cls_token(self, value):
+        if isinstance(value, str) and value != "":
+            value = AddedToken(value, normalized=False, rstrip=True, lstrip=True, special=True)
+        elif not isinstance(value, AddedToken) and value is not None:
+            raise ValueError("Cannot set a non-string value as the CLS token")
         self._cls_token = value
     @mask_token.setter
     def mask_token(self, value):
+        if isinstance(value, str) and value != "":
+            value = AddedToken(value, normalized=False, rstrip=True, lstrip=True, special=True)
+        elif not isinstance(value, AddedToken) and value is not None:
+            raise ValueError("Cannot set a non-string value as the MASK token")
         self._mask_token = value
     @additional_special_tokens.setter
     def additional_special_tokens(self, value):
-        self._additional_special_tokens = value
+        if value is None:
+            self._additional_special_tokens = value
+            return
+        if self._additional_special_tokens is None:
+            self._additional_special_tokens = []
+        # We store the `AddedToken` to allow adding tokens via `tokenizer.add_special_tokens`
+        for token in value:
+            if isinstance(token, str) and token != "":
+                token = AddedToken(token, normalized=False, rstrip=True, lstrip=True, special=True)
+            elif not isinstance(token, AddedToken):
+                raise ValueError(f"Cannot add instance of type {type(value)} to additional_special_tokens!")
+            self._additional_special_tokens.append(token)
     @property
     def bos_token_id(self) -> Optional[int]:
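Each special-token setter above now wraps bare strings into an `AddedToken(..., special=True)` and rejects anything that is neither a string nor an `AddedToken`. Note that the setters only assign the attribute; they do not add the token to the vocabulary (use `add_special_tokens` for that). Minimal sketch, with an example checkpoint:

    from transformers import AutoTokenizer
    from transformers.tokenization_utils_base import AddedToken

    tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=False)

    tokenizer.pad_token = "<pad>"                           # wrapped into AddedToken(special=True)
    tokenizer.eos_token = AddedToken("<eot>", lstrip=True)  # explicit AddedToken is kept as-is
    print(tokenizer.pad_token)

    try:
        tokenizer.bos_token = 123  # neither str nor AddedToken
    except ValueError as err:
        print(err)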
@@ -1259,13 +1312,9 @@ class SpecialTokensMixin:
         """
         set_attr = {}
         for attr in self.SPECIAL_TOKENS_ATTRIBUTES:
-            attr_value = getattr(self, "_" + attr)
+            attr_value = getattr(self, attr)
             if attr_value:
-                set_attr[attr] = (
-                    type(attr_value)(str(attr_value_sub) for attr_value_sub in attr_value)
-                    if isinstance(attr_value, (list, tuple))
-                    else str(attr_value)
-                )
+                set_attr[attr] = attr_value
         return set_attr
     @property
@@ -1285,29 +1334,34 @@ class SpecialTokensMixin:
         return set_attr
     @property
-    def all_special_tokens(self) -> List[str]:
+    def all_special_tokens_extended(self) -> List[Union[str, AddedToken]]:
         """
-        `List[str]`: All the special tokens (`'<unk>'`, `'<cls>'`, etc.) mapped to class attributes.
+        `List[Union[str, tokenizers.AddedToken]]`: All the special tokens (`'<unk>'`, `'<cls>'`, etc.), the order has
+        nothing to do with the index of each tokens. If you want to know the correct indices, check
+        `self.added_tokens_encoder`. We can't create an order anymore as the keys are `AddedTokens` and not `Strings`.
-        Convert tokens of `tokenizers.AddedToken` type to string.
+        Don't convert tokens of `tokenizers.AddedToken` type to string so they can be used to control more finely how
+        special tokens are tokenized.
         """
-        all_toks = [str(s) for s in self.all_special_tokens_extended]
-        return all_toks
+        all_tokens = []
+        seen = set()
+        for value in self.special_tokens_map_extended.values():
+            if isinstance(value, (list, tuple)):
+                tokens_to_add = [token for token in value if str(token) not in seen]
+            else:
+                tokens_to_add = [value] if str(value) not in seen else []
+            seen.update(map(str, tokens_to_add))
+            all_tokens.extend(tokens_to_add)
+        return all_tokens
     @property
-    def all_special_tokens_extended(self) -> List[Union[str, AddedToken]]:
+    def all_special_tokens(self) -> List[str]:
         """
-        `List[Union[str, tokenizers.AddedToken]]`: All the special tokens (`'<unk>'`, `'<cls>'`, etc.) mapped to class
-        attributes.
+        `List[str]`: A list of the unique special tokens (`'<unk>'`, `'<cls>'`, ..., etc.).
-        Don't convert tokens of `tokenizers.AddedToken` type to string so they can be used to control more finely how
-        special tokens are tokenized.
+        Convert tokens of `tokenizers.AddedToken` type to string.
         """
-        all_toks = []
-        set_attr = self.special_tokens_map_extended
-        for attr_value in set_attr.values():
-            all_toks = all_toks + (list(attr_value) if isinstance(attr_value, (list, tuple)) else [attr_value])
-        all_toks = list(OrderedDict.fromkeys(all_toks))
+        all_toks = [str(s) for s in self.all_special_tokens_extended]
         return all_toks
     @property
@@ -1322,7 +1376,10 @@ class SpecialTokensMixin:
 ENCODE_KWARGS_DOCSTRING = r"""
             add_special_tokens (`bool`, *optional*, defaults to `True`):
-                Whether or not to encode the sequences with the special tokens relative to their model.
+                Whether or not to add special tokens when encoding the sequences. This will use the underlying
+                `PretrainedTokenizerBase.build_inputs_with_special_tokens` function, which defines which tokens are
+                automatically added to the input ids. This is usefull if you want to add `bos` or `eos` tokens
+                automatically.
             padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `False`):
                 Activates and controls padding. Accepts the following values:
@@ -1492,9 +1549,9 @@ INIT_TOKENIZER_DOCSTRING = r"""
             A special token representing a masked token (used by masked-language modeling pretraining objectives, like
             BERT). Will be associated to `self.mask_token` and `self.mask_token_id`.
         additional_special_tokens (tuple or list of `str` or `tokenizers.AddedToken`, *optional*):
-            A tuple or a list of additional special tokens. Add them here to ensure they won't be split by the
-            tokenization process. Will be associated to `self.additional_special_tokens` and
-            `self.additional_special_tokens_ids`.
+            A tuple or a list of additional special tokens. Add them here to ensure they are skipped when decoding with
+            `skip_special_tokens` is set to True. If they are not part of the vocabulary, they will be added at the end
+            of the vocabulary.
         clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`):
             Whether or not the model should cleanup the spaces that were added when splitting the input text during the
             tokenization process.
@@ -1614,12 +1671,26 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
         """Sets processor class as an attribute."""
         self._processor_class = processor_class
+    @property
+    def added_tokens_encoder(self) -> Dict[str, int]:
+        """
+        Returns the sorted mapping from string to index. The added tokens encoder is cached for performance
+        optimisation in `self._added_tokens_encoder` for the slow tokenizers.
+        """
+        return {k.content: v for v, k in sorted(self._added_tokens_decoder.items(), key=lambda item: item[0])}
+    @property
+    def added_tokens_decoder(self) -> Dict[int, AddedToken]:
+        raise NotImplementedError()
     def __repr__(self) -> str:
+        added_tokens_decoder_rep = "\n\t".join([f"{k}: {v.__repr__()}," for k, v in self.added_tokens_decoder.items()])
         return (
             f"{self.__class__.__name__}(name_or_path='{self.name_or_path}',"
             f" vocab_size={self.vocab_size}, model_max_length={self.model_max_length}, is_fast={self.is_fast},"
             f" padding_side='{self.padding_side}', truncation_side='{self.truncation_side}',"
-            f" special_tokens={self.special_tokens_map_extended}, clean_up_tokenization_spaces={self.clean_up_tokenization_spaces})"
+            f" special_tokens={self.special_tokens_map}, clean_up_tokenization_spaces={self.clean_up_tokenization_spaces}), "
+            " added_tokens_decoder={\n\t" + added_tokens_decoder_rep + "\n}"
         )
     def __len__(self) -> int:
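`added_tokens_encoder` is now exposed on the base class as a sorted view derived from `_added_tokens_decoder`, and `repr()` additionally prints the added-token mapping, which makes serialization issues easier to debug. Quick sketch (checkpoint name is illustrative):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("t5-small", use_fast=False)
    print(tokenizer)                       # repr now ends with an added_tokens_decoder={...} section
    print(tokenizer.added_tokens_encoder)  # added-token string -> id (contents depend on the checkpoint)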
...@@ -1878,12 +1949,13 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin): ...@@ -1878,12 +1949,13 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
else: else:
# At this point pretrained_model_name_or_path is either a directory or a model identifier name # At this point pretrained_model_name_or_path is either a directory or a model identifier name
additional_files_names = { additional_files_names = {
"added_tokens_file": ADDED_TOKENS_FILE, "added_tokens_file": ADDED_TOKENS_FILE, # kept only for legacy
"special_tokens_map_file": SPECIAL_TOKENS_MAP_FILE, "special_tokens_map_file": SPECIAL_TOKENS_MAP_FILE, # kept only for legacy
"tokenizer_config_file": TOKENIZER_CONFIG_FILE, "tokenizer_config_file": TOKENIZER_CONFIG_FILE,
# tokenizer_file used to initialize a slow from a fast. Properly copy the `addedTokens` instead of adding in random orders
"tokenizer_file": FULL_TOKENIZER_FILE,
} }
vocab_files = {**cls.vocab_files_names, **additional_files_names} vocab_files = {**cls.vocab_files_names, **additional_files_names}
if "tokenizer_file" in vocab_files: if "tokenizer_file" in vocab_files:
# Try to get the tokenizer config to see if there are versioned tokenizer files. # Try to get the tokenizer config to see if there are versioned tokenizer files.
fast_tokenizer_file = FULL_TOKENIZER_FILE fast_tokenizer_file = FULL_TOKENIZER_FILE
...@@ -2019,6 +2091,8 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
# First attempt. We get tokenizer_class from tokenizer_config to check mismatch between tokenizers.
config_tokenizer_class = init_kwargs.get("tokenizer_class")
init_kwargs.pop("tokenizer_class", None)
if not has_tokenizer_file:
init_kwargs.pop("tokenizer_file", None)
saved_init_inputs = init_kwargs.pop("init_inputs", ())
if not init_inputs:
init_inputs = saved_init_inputs
...@@ -2084,19 +2158,6 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
# Update with newly provided kwargs
init_kwargs.update(kwargs)
# Convert AddedTokens serialized as dict to class instances
def convert_added_tokens(obj: Union[AddedToken, Any]):
if isinstance(obj, dict) and "__type" in obj and obj["__type"] == "AddedToken":
obj.pop("__type")
return AddedToken(**obj)
elif isinstance(obj, (list, tuple)):
return [convert_added_tokens(o) for o in obj]
elif isinstance(obj, dict):
return {k: convert_added_tokens(v) for k, v in obj.items()}
return obj
init_kwargs = convert_added_tokens(init_kwargs)
# Set max length if needed
if pretrained_model_name_or_path in cls.max_model_input_sizes:
# if we're using a pretrained model, ensure the tokenizer
...@@ -2116,16 +2177,75 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
# Merge resolved_vocab_files arguments in init_kwargs.
added_tokens_file = resolved_vocab_files.pop("added_tokens_file", None)
special_tokens_map_file = resolved_vocab_files.pop("special_tokens_map_file", None)
for args_name, file_path in resolved_vocab_files.items():
if args_name not in init_kwargs:
init_kwargs[args_name] = file_path
if slow_tokenizer is not None:
init_kwargs["__slow_tokenizer"] = slow_tokenizer
init_kwargs["name_or_path"] = pretrained_model_name_or_path
additional_special_tokens = init_kwargs.pop("additional_special_tokens", None) or []
added_tokens_decoder = {}
legacy_saved = "added_tokens_decoder" not in init_kwargs
if not legacy_saved:
for idx, token in init_kwargs["added_tokens_decoder"].items():
if isinstance(token, dict):
token = AddedToken(**token)
if isinstance(token, AddedToken):
added_tokens_decoder[int(idx)] = token
else:
raise ValueError(
f"Found a {token.__class__} in the saved `added_tokens_decoder`, should be a dictionary."
)
else:
logger.warning_once(
"Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`, "
" it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again."
" You will see the new `added_tokens_decoder` attribute that will store the relevant information."
)
# begin legacy: read the added_tokens_file and update kwargs with special_tokens_map if modified
if special_tokens_map_file is not None:
with open(special_tokens_map_file, encoding="utf-8") as special_tokens_map_handle:
special_tokens_map = json.load(special_tokens_map_handle)
for key, value in special_tokens_map.items():
if key in kwargs and kwargs[key]:
# This value has already been redefined by the kwargs
# We keep this new value and ignore the one stored in the special_tokens_map_file
continue
if isinstance(value, dict):
value = AddedToken(**value)
elif key == "additional_special_tokens" and isinstance(value, list):
for token in value:
token = AddedToken(**token) if isinstance(token, dict) else token
if token not in additional_special_tokens:
additional_special_tokens.append(token)
else:
init_kwargs[key] = value
# slow -> slow|fast, legacy: convert the `"added_tokens.json"` file to `added_tokens_decoder`.
if added_tokens_file is not None:
with open(added_tokens_file, encoding="utf-8") as added_tokens_handle:
added_tok_encoder = json.load(added_tokens_handle)
# legacy: we have to init with (rstrip=True, lstrip=True)
added_tokens_decoder = {
index: AddedToken(token, rstrip=True, lstrip=True) for token, index in added_tok_encoder.items()
}
# end legacy
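# A short sketch of the legacy conversion just above: a hypothetical `added_tokens.json`
# mapping token -> index is rebuilt as index -> AddedToken with lstrip/rstrip forced to True.
from tokenizers import AddedToken
added_tok_encoder = {"<extra_tok_0>": 32000, "<extra_tok_1>": 32001}  # made-up legacy file content
added_tokens_decoder = {index: AddedToken(token, rstrip=True, lstrip=True) for token, index in added_tok_encoder.items()}
assert sorted(added_tokens_decoder) == [32000, 32001]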
# slow -> fast, non-legacy: we need to make sure the `added_tokens_decoder` is used to add tokens if the `fast` was not properly saved!
# thus we delay adding special tokens in the init using `slow_to_fast` flag.
if added_tokens_decoder != {} and "Fast" in cls.__name__:
init_kwargs["slow_to_fast"] = True
if len(additional_special_tokens) > 0:
init_kwargs["additional_special_tokens"] = additional_special_tokens
init_kwargs["added_tokens_decoder"] = added_tokens_decoder
# convert {'__type': 'AddedToken', 'content': '<ent>', 'lstrip': False, 'normalized': True, ...} to AddedTokens
init_kwargs = cls.convert_added_tokens(init_kwargs, False)
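# A sketch of the non-legacy path handled above: `added_tokens_decoder` entries from
# `tokenizer_config.json` (serialized AddedToken fields keyed by the token id) are rebuilt into
# AddedToken instances keyed by int. The fields shown here are illustrative.
from tokenizers import AddedToken
saved = {"32000": {"content": "<new_tok>", "single_word": False, "lstrip": False, "rstrip": False, "normalized": False}}
rebuilt = {int(idx): AddedToken(**fields) for idx, fields in saved.items()}
assert rebuilt[32000].content == "<new_tok>"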
# Instantiate the tokenizer.
try:
tokenizer = cls(*init_inputs, **init_kwargs)
except OSError:
...@@ -2134,79 +2254,43 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
"Please check that the provided vocabulary is accessible and not corrupted."
)
# allows converting a fast -> slow: add the `tokenizer.json`'s `"added_tokens"` to the slow tokenizer
# if `added_tokens_decoder` not in `tokenizer_config.json` and `added_tokens.json` is `None`
tokenizer_file = resolved_vocab_files.pop("tokenizer_file", None)
if legacy_saved and "Fast" not in cls.__name__ and added_tokens_file is None and tokenizer_file is not None:
tokens_to_add_from_fast = []
with open(tokenizer_file, encoding="utf-8") as tokenizer_file_handle:
tokenizer_file_handle = json.load(tokenizer_file_handle)
added_tokens = tokenizer_file_handle.pop("added_tokens")
for serialized_tokens in added_tokens:
serialized_tokens.pop("id")
# for legacy purposes, we ignore whether or not these tokens are special.
serialized_tokens.pop("special")
tokens_to_add_from_fast.append(AddedToken(**serialized_tokens))
tokenizer.add_tokens(tokens_to_add_from_fast)
# allows converting a slow -> fast, non-legacy: if the `tokenizer.json` does not have all the added tokens
# uses the information stored in `added_tokens_decoder`. Checks after addition that we have the same ids
if init_kwargs.get("slow_to_fast", False):
tokenizer.add_tokens([token for _, token in sorted(added_tokens_decoder.items(), key=lambda x: x[0])])
warnings = ""
for index, token in sorted(added_tokens_decoder.items(), key=lambda x: x[0]):
if tokenizer.convert_tokens_to_ids(str(token)) != index:
warnings += f"\texpected id: {tokenizer.convert_tokens_to_ids(str(token))}, found: {index}, token: `{token}`,\n"
if len(warnings) > 1:
logger.warning(
f"You are converting a {slow_tokenizer.__class__.__name__} to a {cls.__name__}, but"
f" wrong indexes were found when adding the `added_tokens` from the `slow` tokenizer to the `fast`."
f" The following tokens had an unexpected id:\n{warnings}. You should try using `from_slow`."
)
# finally we add all the special_tokens to make sure everything is initialized
tokenizer.add_tokens(tokenizer.all_special_tokens_extended, special_tokens=True)
if len(added_tokens_decoder) > 0:
logger.warning_advice(
"Special tokens have been added in the vocabulary, make sure the associated word embeddings are"
" fine-tuned or trained."
)
return tokenizer
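# A sketch of the fast -> slow branch above: entries of a hypothetical `tokenizer.json`
# `"added_tokens"` list are stripped of their "id" and "special" fields and re-created as
# AddedToken objects before being passed to `tokenizer.add_tokens(...)`.
from tokenizers import AddedToken
serialized = [{"id": 32000, "content": "<new_tok>", "single_word": False, "lstrip": False, "rstrip": False, "normalized": False, "special": True}]
tokens_to_add_from_fast = []
for entry in serialized:
    entry = dict(entry)
    entry.pop("id")
    entry.pop("special")  # this legacy path ignores the special flag
    tokens_to_add_from_fast.append(AddedToken(**entry))
assert tokens_to_add_from_fast[0].content == "<new_tok>"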
@staticmethod
...@@ -2217,6 +2301,21 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
# which we will correct in Transformers v5.
return max_model_length
@classmethod
def convert_added_tokens(cls, obj: Union[AddedToken, Any], add_type_field=True):
if isinstance(obj, dict) and "__type" in obj and obj["__type"] == "AddedToken":
obj.pop("__type")
return AddedToken(**obj)
if isinstance(obj, AddedToken):
if add_type_field:
obj = obj.content
return obj
elif isinstance(obj, (list, tuple)):
return [cls.convert_added_tokens(o, add_type_field=add_type_field) for o in obj]
elif isinstance(obj, dict):
return {k: cls.convert_added_tokens(v, add_type_field=add_type_field) for k, v in obj.items()}
return obj
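# A standalone sketch of the `"__type": "AddedToken"` convention handled by `convert_added_tokens`
# above: tagged dicts are rebuilt into AddedToken objects and containers are walked recursively.
# This is a simplified stand-in for illustration, not the method itself.
from typing import Any
from tokenizers import AddedToken

def _convert(obj: Any) -> Any:
    if isinstance(obj, dict) and obj.get("__type") == "AddedToken":
        fields = {k: v for k, v in obj.items() if k != "__type"}
        return AddedToken(**fields)
    if isinstance(obj, (list, tuple)):
        return [_convert(o) for o in obj]
    if isinstance(obj, dict):
        return {k: _convert(v) for k, v in obj.items()}
    return obj

restored = _convert({"mask_token": {"__type": "AddedToken", "content": "<mask>", "lstrip": True}})
assert restored["mask_token"].lstrip is True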
def save_pretrained(
self,
save_directory: Union[str, os.PathLike],
...@@ -2295,7 +2394,7 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
# TODO: Ensure the modified attributes (those are also in the __init__ kwargs) will give identical tokenizers
# target_keys = self.init_kwargs.keys()
target_keys = ["model_max_length", "clean_up_tokenization_spaces", "additional_special_tokens"]
for k in target_keys:
if hasattr(self, k):
tokenizer_config[k] = getattr(self, k)
...@@ -2308,21 +2407,13 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
for file_id in self.vocab_files_names.keys():
tokenizer_config.pop(file_id, None)
# Sanitize AddedTokens
def convert_added_tokens(obj: Union[AddedToken, Any], add_type_field=True):
if isinstance(obj, AddedToken):
out = obj.__getstate__()
if add_type_field:
out["__type"] = "AddedToken"
return out
elif isinstance(obj, (list, tuple)):
return [convert_added_tokens(o, add_type_field=add_type_field) for o in obj]
elif isinstance(obj, dict):
return {k: convert_added_tokens(v, add_type_field=add_type_field) for k, v in obj.items()}
return obj
# add_type_field=True to allow dicts in the kwargs / differentiate from AddedToken serialization
tokenizer_config = self.convert_added_tokens(tokenizer_config, add_type_field=True)
added_tokens = {}
for key, value in self.added_tokens_decoder.items():
added_tokens[key] = value.__getstate__()
tokenizer_config["added_tokens_decoder"] = added_tokens
# Add tokenizer class to the tokenizer config to be able to reload it with from_pretrained
tokenizer_class = self.__class__.__name__
...@@ -2351,7 +2442,9 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
logger.info(f"tokenizer config file saved in {tokenizer_config_file}")
# Sanitize AddedTokens in special_tokens_map
write_dict = convert_added_tokens(self.special_tokens_map_extended, add_type_field=False)
# kept for forward compatibility, will be removed in transformers 5
write_dict = self.convert_added_tokens(self.special_tokens_map_extended, add_type_field=True)
with open(special_tokens_map_file, "w", encoding="utf-8") as f:
out_str = json.dumps(write_dict, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
f.write(out_str)
...
...@@ -96,6 +96,7 @@ class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
slow_tokenizer = kwargs.pop("__slow_tokenizer", None)
fast_tokenizer_file = kwargs.pop("tokenizer_file", None)
from_slow = kwargs.pop("from_slow", False)
slow_to_fast = kwargs.pop("slow_to_fast", False)
if from_slow and slow_tokenizer is None and self.slow_tokenizer_class is None:
raise ValueError(
...@@ -154,6 +155,10 @@ class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
# We call this after having initialized the backend tokenizer because we update it.
super().__init__(**kwargs)
# We add the additional tokens that are not part of the vocab
if not slow_to_fast:
self._add_tokens(self.all_special_tokens_extended, special_tokens=True)
@property
def is_fast(self) -> bool:
return True
...@@ -180,6 +185,16 @@ class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
def vocab(self) -> Dict[str, int]:
return self.get_vocab()
@property
def added_tokens_decoder(self) -> Dict[int, AddedToken]:
"""
Returns the added tokens in the vocabulary as a dictionary of index to AddedToken.
Returns:
`Dict[int, AddedToken]`: The added tokens, keyed by their index in the vocabulary.
"""
return self._tokenizer.get_added_tokens_decoder()
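# Example use of the property above; the checkpoint name is only illustrative and downloading it
# requires network access. Special tokens typically show up here alongside user-added tokens.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
for index, token in sorted(tok.added_tokens_decoder.items()):
    print(index, repr(token.content))  # e.g. 0 '[PAD]', 100 '[UNK]', 101 '[CLS]', ...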
def get_added_vocab(self) -> Dict[str, int]:
"""
Returns the added tokens in the vocabulary as a dictionary of token to index.
...@@ -779,6 +794,7 @@ class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
lstrip=special_token_full.lstrip,
rstrip=special_token_full.rstrip,
normalized=special_token_full.normalized,
special=True,
)
else:
kwargs[token] = special_token
...
...@@ -170,7 +170,6 @@ class TestTokenizationBart(TokenizerTesterMixin, unittest.TestCase):
tokens_r_str = tokenizer_r.convert_ids_to_tokens(tokens_r["input_ids"])
tokens_p_str = tokenizer_p.convert_ids_to_tokens(tokens_p["input_ids"])
# Rust correctly handles the space before the mask while python doesnt
self.assertSequenceEqual(tokens_p["input_ids"], [0, 250, 6, 50264, 3823, 487, 21992, 3645, 4, 2])
self.assertSequenceEqual(tokens_r["input_ids"], [0, 250, 6, 50264, 3823, 487, 21992, 3645, 4, 2])
...
...@@ -42,6 +42,10 @@ class BloomTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
kwargs.update(self.special_tokens_map)
return BloomTokenizerFast.from_pretrained(self.tmpdirname, **kwargs)
@unittest.skip("This needs a slow tokenizer. Bloom does not have one!")
def test_encode_decode_with_spaces(self):
return
def test_encodings_from_sample_data(self):
"""
Assert that the created tokens are the same as the hard-coded ones
...
...@@ -205,7 +205,9 @@ class ByT5TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokenizer.add_tokens(["bim", "bambam"])
additional_special_tokens = tokenizer.additional_special_tokens
additional_special_tokens.append("new_additional_special_token")
tokenizer.add_special_tokens(
{"additional_special_tokens": additional_special_tokens}, replace_additional_special_tokens=False
)
before_tokens = tokenizer.encode(sample_text, add_special_tokens=False)
tokenizer.save_pretrained(tmpdirname)
...
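# The `replace_additional_special_tokens=False` flag used in the test above extends the existing
# `additional_special_tokens` instead of overwriting them. A rough sketch (checkpoint name
# illustrative, requires network access):
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/byt5-small")
n_before = len(tok.additional_special_tokens)
tok.add_special_tokens({"additional_special_tokens": ["<new_special>"]}, replace_additional_special_tokens=False)
assert len(tok.additional_special_tokens) == n_before + 1  # the previously registered extra ids are kept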
...@@ -43,13 +43,19 @@ class CamembertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokenizer = CamembertTokenizer(SAMPLE_VOCAB)
tokenizer.save_pretrained(self.tmpdirname)
@unittest.skip(
"Token maps are not equal because someone set the probability of ('<unk>NOTUSED', -100), so it's never encoded for fast"
)
def test_special_tokens_map_equal(self):
return
def test_convert_token_and_id(self):
"""Test ``_convert_token_to_id`` and ``_convert_id_to_token``."""
token = "<pad>"
token_id = 1  # 1 is the offset id, but in the spm vocab it's 3
self.assertEqual(self.get_tokenizer().convert_tokens_to_ids(token), token_id)
self.assertEqual(self.get_tokenizer().convert_ids_to_tokens(token_id), token)
def test_get_vocab(self):
vocab_keys = list(self.get_tokenizer().get_vocab().keys())
...@@ -57,10 +63,10 @@ class CamembertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
self.assertEqual(vocab_keys[0], "<s>NOTUSED")
self.assertEqual(vocab_keys[1], "<pad>")
self.assertEqual(vocab_keys[-1], "<mask>")
self.assertEqual(len(vocab_keys), 1_005)
def test_vocab_size(self):
self.assertEqual(self.get_tokenizer().vocab_size, 1_000)
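# The two assertions above distinguish the base sentencepiece vocabulary (`vocab_size`) from the
# full vocab returned by `get_vocab()`, which also includes the added tokens. A rough sketch
# (checkpoint name illustrative, requires network access; exact numbers depend on the checkpoint):
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("camembert-base")
print(tok.vocab_size)             # size of the underlying model vocabulary only
print(len(tok.get_vocab()))       # base vocabulary plus the added tokens
print(len(tok.added_tokens_decoder))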
def test_rust_and_python_bpe_tokenizers(self):
tokenizer = CamembertTokenizer(SAMPLE_BPE_VOCAB)
...
...@@ -122,7 +122,9 @@ class CanineTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
# We can add a new special token for Canine as follows:
new_additional_special_token = chr(0xE007)
additional_special_tokens.append(new_additional_special_token)
tokenizer.add_special_tokens(
{"additional_special_tokens": additional_special_tokens}, replace_additional_special_tokens=False
)
before_tokens = tokenizer.encode(sample_text, add_special_tokens=False)
tokenizer.save_pretrained(tmpdirname)
...@@ -167,11 +169,7 @@ class CanineTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
with self.subTest(f"{tokenizer.__class__.__name__}"):
SPECIAL_TOKEN_1 = chr(0xE005)
SPECIAL_TOKEN_2 = chr(0xE006)
# `add_tokens` method stores special tokens only in `tokenizer.unique_no_split_tokens`. (in tokenization_utils.py)
tokenizer.add_tokens([SPECIAL_TOKEN_1], special_tokens=True)
# `add_special_tokens` method stores special tokens in `tokenizer.additional_special_tokens`,
# which also occur in `tokenizer.all_special_tokens`. (in tokenization_utils_base.py)
tokenizer.add_special_tokens({"additional_special_tokens": [SPECIAL_TOKEN_2]})
token_1 = tokenizer.tokenize(SPECIAL_TOKEN_1)
...
...@@ -65,6 +65,10 @@ class CodeLlamaTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokenizer.pad_token = tokenizer.eos_token
tokenizer.save_pretrained(self.tmpdirname)
def get_tokenizers(self, **kwargs):
kwargs.update({"pad_token": "<PAD>"})
return super().get_tokenizers(**kwargs)
def test_no_infilling_init(self):
tokenizer = CodeLlamaTokenizer(SAMPLE_VOCAB, prefix_token=None, keep_accents=True)
with self.assertRaises(ValueError):
...@@ -518,7 +522,7 @@ class LlamaIntegrationTest(unittest.TestCase):
def test_special_token_special_word(self):
# the word inform should be split as ['in', 'form']
tokenizer = CodeLlamaTokenizer.from_pretrained("codellama/CodeLlama-7b-hf", legacy=False)
tokenizer.add_tokens(["<REPR_END>"], special_tokens=False)
out1 = tokenizer.decode(
tokenizer.encode("<REPR_END>inform", add_special_tokens=False), spaces_between_special_tokens=False
)
...@@ -526,7 +530,8 @@ class LlamaIntegrationTest(unittest.TestCase):
out2 = tokenizer.decode(
tokenizer.encode("<REPR_END>inform", add_special_tokens=False), spaces_between_special_tokens=True
)
# the added prefix token should not be decoded
self.assertEqual(out2, "<REPR_END> inform")
input_ids = tokenizer.encode("<REPR_END>inform", add_special_tokens=False)
self.assertEqual(input_ids, [29871, 32016, 262, 689])  # 29871 is the spiece underline, '▁'
...
...@@ -244,8 +244,8 @@ class CodeGenTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
decode_s = tokenizer.decode(out_s.input_ids)
decode_s2 = tokenizer.batch_decode(out_s2.input_ids)
self.assertTrue(decode_s.startswith(bos_token))
self.assertTrue(all(d.startswith(bos_token) for d in decode_s2))
@slow
def test_truncation(self):
...@@ -258,6 +258,7 @@ class CodeGenTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
truncation_pattern = ["^#", re.escape("<|endoftext|>"), "^'''", '^"""', "\n\n\n"]
decoded_text = tokenizer.decode(input_ids, truncate_before_pattern=truncation_pattern)
self.assertEqual(decoded_text, expected_trucated_text)
# TODO @ArthurZ outputs of the fast tokenizer are different in this case, un-related to the PR
# tokenizer has no padding token
def test_padding_different_model_input_name(self):
...
...@@ -68,12 +68,12 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokens_target = ["▁hello", "!", "how", "▁are", "▁you", "?"]
# fmt: on
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=True)
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(tokens, tokens_target)
rust_tokenizer = DebertaV2TokenizerFast(SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=True)
rust_tokens = rust_tokenizer.convert_ids_to_tokens(rust_tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(rust_tokens, tokens_target)
...@@ -92,12 +92,12 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokens_target = ["▁", "<unk>", "▁was", "▁born", "▁in", "▁9", "2000", "▁", ",", "▁and", "▁this", "▁is", "▁fal", "s", "<unk>", "▁", ".", ]
# fmt: on
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>", split_by_punct=True)
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(tokens, tokens_target)
rust_tokenizer = DebertaV2TokenizerFast(SAMPLE_VOCAB, unk_token="<unk>", split_by_punct=True)
rust_tokens = rust_tokenizer.convert_ids_to_tokens(rust_tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(rust_tokens, tokens_target)
...@@ -108,11 +108,13 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokens_target = ["▁i", "▁was", "▁born", "▁in", "▁9", "2000", "▁", ",", "▁and", "▁this", "▁is", "▁fal", "s", "<unk>", "▁", ".", ]
# fmt: on
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=True, split_by_punct=True)
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(tokens, tokens_target)
rust_tokenizer = DebertaV2TokenizerFast(
SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=True, split_by_punct=True
)
rust_tokens = rust_tokenizer.convert_ids_to_tokens(rust_tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(rust_tokens, tokens_target)
...@@ -122,12 +124,14 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokens_target = ["▁i", "▁was", "▁born", "▁in", "▁9", "2000", ",", "▁and", "▁this", "▁is", "▁fal", "s", "<unk>", ".", ]
# fmt: on
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=True, split_by_punct=False)
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(tokens, tokens_target)
rust_tokenizer = DebertaV2TokenizerFast(
SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=True, split_by_punct=False
)
rust_tokens = rust_tokenizer.convert_ids_to_tokens(rust_tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(rust_tokens, tokens_target)
...@@ -138,12 +142,14 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokens_target = ["▁", "<unk>", "▁was", "▁born", "▁in", "▁9", "2000", "▁", ",", "▁and", "▁this", "▁is", "▁fal", "s", "<unk>", "▁", ".", ]
# fmt: on
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=False, split_by_punct=True)
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(tokens, tokens_target)
rust_tokenizer = DebertaV2TokenizerFast(
SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=False, split_by_punct=True
)
rust_tokens = rust_tokenizer.convert_ids_to_tokens(rust_tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(rust_tokens, tokens_target)
...@@ -154,12 +160,14 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokens_target = ["▁", "<unk>", "e", "<unk>", "o", "!", "how", "▁", "<unk>", "re", "▁yo", "<unk>", "?"]
# fmt: on
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=False, split_by_punct=False)
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(tokens, tokens_target)
rust_tokenizer = DebertaV2TokenizerFast(
SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=False, split_by_punct=False
)
rust_tokens = rust_tokenizer.convert_ids_to_tokens(rust_tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(rust_tokens, tokens_target)
...@@ -189,8 +197,8 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokens_target = ["▁", "T", "his", "▁is", "▁a", "▁test"]
back_tokens_target = ["▁", "<unk>", "his", "▁is", "▁a", "▁test"]
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>", keep_accents=True)
rust_tokenizer = DebertaV2TokenizerFast(SAMPLE_VOCAB, unk_token="<unk>", keep_accents=True)
ids = tokenizer.encode(sequence, add_special_tokens=False)
self.assertListEqual(ids, ids_target)
...
...@@ -243,8 +243,8 @@ class GPT2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
decode_s = tokenizer.decode(out_s.input_ids)
decode_s2 = tokenizer.batch_decode(out_s2.input_ids)
self.assertTrue(decode_s.startswith(bos_token))
self.assertTrue(all(d.startswith(bos_token) for d in decode_s2))
# tokenizer has no padding token
def test_padding_different_model_input_name(self):
...
...@@ -145,10 +145,10 @@ class GPTSw3TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokenized_chats = [tokenizer.apply_chat_template(test_chat) for test_chat in test_chats]
# fmt: off
expected_tokens = [
[2000, 1, 575, 541, 419, 530, 339, 265, 878, 708, 727, 275, 347, 541, 260, 1, 968, 263, 314, 419, 366, 354, 294, 360, 1, 575, 541, 419],
[2000, 1, 575, 541, 419, 530, 339, 265, 878, 708, 727, 275, 347, 541, 260, 1, 968, 263, 314, 419, 366, 354, 294, 360, 1, 575, 541, 419, 984, 429, 281, 264, 1261, 291, 260, 1, 575, 541, 419],
[2000, 1, 575, 541, 419, 984, 429, 281, 264, 1261, 291, 260, 1, 968, 263, 314, 419, 366, 354, 294, 360, 1, 575, 541, 419]
]
# fmt: on
for tokenized_chat, expected_tokens in zip(tokenized_chats, expected_tokens):
self.assertListEqual(tokenized_chat, expected_tokens)
...@@ -210,9 +210,9 @@ class GPTSanJapaneseTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokenized_chats = [tokenizer.apply_chat_template(test_chat) for test_chat in test_chats]
# fmt: off
expected_tokens = [
[35993, 35998, 35637, 35659, 35665, 35716, 35645, 35662, 35649, 35716, 35645, 35716, 35652, 35649, 35656, 35660, 35650, 35665, 35656, 35716, 35647, 35652, 35645, 35664, 35646, 35659, 35664, 35595, 35716, 35999, 35993, 35998, 35620, 35649, 35656, 35656, 35659, 35582, 35716, 35999],
[35993, 35998, 35637, 35659, 35665, 35716, 35645, 35662, 35649, 35716, 35645, 35716, 35652, 35649, 35656, 35660, 35650, 35665, 35656, 35716, 35647, 35652, 35645, 35664, 35646, 35659, 35664, 35595, 35716, 35999, 35993, 35998, 35620, 35649, 35656, 35656, 35659, 35582, 35716, 35999, 35993, 35998, 35626, 35653, 35647, 35649, 35716, 35664, 35659, 35716, 35657, 35649, 35649, 35664, 35716, 35669, 35659, 35665, 35595, 35716, 35999],
[35993, 35998, 35626, 35653, 35647, 35649, 35716, 35664, 35659, 35716, 35657, 35649, 35649, 35664, 35716, 35669, 35659, 35665, 35595, 35716, 35999, 35993, 35998, 35620, 35649, 35656, 35656, 35659, 35582, 35716, 35999]
]
# fmt: on
for tokenized_chat, expected_tokens in zip(tokenized_chats, expected_tokens):
...