"vscode:/vscode.git/clone" did not exist on "32090c729f8ac3c07ee909ace64f36d2a5a363f2"
Unverified commit 2da88537 authored by Arthur, committed by GitHub

🚨🚨 🚨🚨 [`Tokenizer`] attempt to fix add_token issues 🚨🚨 🚨🚨 (#23909)



* fix test for bart. Order is correct now let's skip BPEs

* phew

* styling

* fix bert....

* slow refactoring

* current updates

* massive refactoring

* update

* NICE!

* update to see where I am at

* updates

* update

* update

* revert

* updates

* updates

* start supporting legacy_save

* styling

* big update

* revert some changes

* nits

* nniiiiiice

* small fixes

* kinda fix t5 with new behaviour

* major update

* fixup

* fix copies

* today's updates

* fix byt5

* update

* update

* update

* updates

* update vocab size test

* Barthez does not need the fairseq offset ids

* super call must be after

* call super

* move all super init

* move other super init

* fixup

* nits

* more fixes

* nits

* more fixes

* nits

* more fix

* remove useless files

* ouch all of them are affected

* and more!

* small improvements

* no more sanitize token

* more changes around unique no split tokens

* partially fix more things

* keep legacy save but add warning

* so... more fixes

* updates

* guess deberta tokenizer could be nuked

* fixup

* fixup did some bad things

* nuke it if it breaks

* remove prints and pretrain fast from slow with new format.

* fixups

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* phew

* nit

* by default specials should not be normalized?

* update

* remove breakpoint

* updates

* a lot of updates

* fixup

* fixes: revert some changes to match fast

* small nits

* that makes it cleaner

* fix camembert accordingly

* update

* some less breaking changes

* update

* fixup

* fix byt5 and whisper mostly

* some more fixes, canine's byte vocab

* fix gpt2

* fix most of the perceiver tests (4 left)

* fix layout lmv3

* fixup

* fix copies for gpt2 style

* make sure to only warn once

* fix perceiver and gpt2 tests

* some more backward compatibility: also read the special tokens map because some people use it

* fixup

* add else when reading

* nits

* fresh updates

* fix copies

* will this make everything faster?

* fixes

* more fixes

* update

* more fixes

* fixup

* is the source of truth right?

* sorry camembert for the troubles

* current updates

* fixup

* update led

* update

* fix regression

* fix single word

* more model specific fixes

* fix t5 tests

* fixup

* more comments

* update

* fix nllb

* rstrip removed

* small fixes

* better handle additional_special_tokens and vocab sizes

* fixing

* styling

* fix 4 / 21

* fixup

* fix nllb's tests

* some fixes

* fix t5

* fixes

* style

* fix canine tests

* damn this is nice

* nits

* m2m100 nit

* fixups

* fixes!

* fixup

* stash

* fix merge

* revert bad change

* fixup

* correct order for code Llama

* fix speecht5 post merge

* styling

* revert source of 11 fails

* small nits

* all changes in one go

* fnet hack

* fix 2 more tests

* update based on main branch of tokenizers

* fixup

* fix VITS issues

* more fixes

* fix mgp test

* fix camembert issues

* oops, camembert still has 2 failing tests

* mluke fixes

* decode fixes

* small nits

* nits

* fix llama and vits

* fix camembert

* small nits

* more fixes when initialising a fast tokenizer from a slow one, etc.

* fix one of the last test

* fix CPM tokenizer test

* fixups

* fix pop2piano

* fixup

* Change tokenizers required version

* Change tokenizers required version

* "tokenizers>=0.14,<0.15", don't forget smaller than

* fix musicgen tests and PreTrainedTokenizerFast

* fix owlvit and all

* update t5

* fix 800 red

* fix tests

* fix the fix of the fix of t5

* styling

* documentation nits

* cache _added_tokens_encoder

* fixups

* Nit

* fix red tests

* one last nit!

* make everything a lot simpler

* Now it's over 😉



* few small nits

* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* updates that work for now

* tests that should not be skipped / changed and fixed next

* fixup

* i am ashamed

* push the fix

* update

* fixups

* nits

* fix added_tokens_encoder

* fix canine test

* fix pegasus vocab

* fix transfoXL

* fixup

* whisper needs to be fixed for train new

* pegasus nits

* more pegasus fixes

* minor update

* better error message in failed test

* fix whisper failing test

* fix whisper failing test

* fix pegasus

* fixup

* fix **** pegasus

* reset things

* remove another file

* attempts to fix the strange custom encoder and offset

* nits here and there

* update

* fixup

* nit

* fix the whisper test

* nits nits

* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* updates based on review

* some small update to potentially remove

* nits

* import lru cache

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Lysandre Debut <hi@lysand.re>

* move warning to `from_pretrained`

* update test results now that the special tokens are always added

---------
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Lysandre Debut <hi@lysand.re>
parent 835b0a05
@@ -19,7 +19,7 @@ from functools import lru_cache
 from typing import List, Optional, Tuple
 import numpy as np
-from tokenizers import pre_tokenizers, processors
+from tokenizers import AddedToken, pre_tokenizers, processors
 from ...tokenization_utils_base import BatchEncoding
 from ...tokenization_utils_fast import PreTrainedTokenizerFast
@@ -148,6 +148,22 @@ class WhisperTokenizerFast(PreTrainedTokenizerFast):
         predict_timestamps=False,
         **kwargs,
     ):
+        bos_token = (
+            AddedToken(bos_token, lstrip=False, rstrip=False, normalized=False, special=True)
+            if isinstance(bos_token, str)
+            else bos_token
+        )
+        eos_token = (
+            AddedToken(eos_token, lstrip=False, rstrip=False, normalized=False, special=True)
+            if isinstance(eos_token, str)
+            else eos_token
+        )
+        unk_token = (
+            AddedToken(unk_token, lstrip=False, rstrip=False, normalized=False, special=True)
+            if isinstance(unk_token, str)
+            else unk_token
+        )
         super().__init__(
             vocab_file,
             merges_file,
@@ -444,11 +460,10 @@ class WhisperTokenizerFast(PreTrainedTokenizerFast):
     @property
     # Copied from transformers.models.whisper.tokenization_whisper.WhisperTokenizer.prefix_tokens
     def prefix_tokens(self) -> List[int]:
-        all_special_ids = self.all_special_ids
-        bos_token_id = all_special_ids[-106]
-        translate_token_id = all_special_ids[-6]
-        transcribe_token_id = all_special_ids[-5]
-        notimestamps_token_id = all_special_ids[-1]
+        bos_token_id = self.convert_tokens_to_ids("<|startoftranscript|>")
+        translate_token_id = self.convert_tokens_to_ids("<|translate|>")
+        transcribe_token_id = self.convert_tokens_to_ids("<|transcribe|>")
+        notimestamps_token_id = self.convert_tokens_to_ids("<|notimestamps|>")
         langs = tuple(LANGUAGES.keys())
         if self.language is not None:
......
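For context on the `prefix_tokens` change above: the old code indexed into `all_special_ids` at hard-coded positions, which silently breaks whenever the number or order of special tokens changes; the new code looks the control tokens up by content. A minimal sketch of the difference (the checkpoint name is only an example and Hub access is assumed):

    # Sketch: resolving Whisper control tokens by content instead of by position.
    from transformers import WhisperTokenizerFast

    tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-tiny")

    # Fragile (old): relies on "<|startoftranscript|>" always sitting at a fixed offset from the end.
    # bos_token_id = tokenizer.all_special_ids[-106]

    # Robust (new): ask for the id of the token string directly.
    bos_token_id = tokenizer.convert_tokens_to_ids("<|startoftranscript|>")
    notimestamps_token_id = tokenizer.convert_tokens_to_ids("<|notimestamps|>")
    print(bos_token_id, notimestamps_token_id)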
@@ -137,17 +137,6 @@ class XGLMTokenizer(PreTrainedTokenizer):
                 word for word in madeup_words if word not in kwargs["additional_special_tokens"]
             ]
-        super().__init__(
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            sp_model_kwargs=self.sp_model_kwargs,
-            **kwargs,
-        )
         self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
         self.sp_model.Load(str(vocab_file))
         self.vocab_file = vocab_file
@@ -170,6 +159,17 @@ class XGLMTokenizer(PreTrainedTokenizer):
         self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}
+        super().__init__(
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            sp_model_kwargs=self.sp_model_kwargs,
+            **kwargs,
+        )
     def __getstate__(self):
         state = self.__dict__.copy()
         state["sp_model"] = None
......
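The XGLM hunk above (and the XLM, XLM-ProphetNet, XLM-RoBERTa and XLNet hunks that follow) all apply the same pattern: build the SentencePiece model and any id-offset tables first, then call `super().__init__()` last, because the base `__init__` now registers missing special tokens and therefore needs a readable vocabulary. A minimal sketch of the pattern with a purely illustrative custom slow tokenizer (the class name and methods below are assumptions, not part of the PR):

    import sentencepiece as spm
    from transformers import PreTrainedTokenizer


    class MySentencePieceTokenizer(PreTrainedTokenizer):
        def __init__(self, vocab_file, unk_token="<unk>", **kwargs):
            # Everything the vocab depends on must exist before super().__init__,
            # because the base class may call get_vocab() to register special tokens.
            self.vocab_file = vocab_file
            self.sp_model = spm.SentencePieceProcessor()
            self.sp_model.Load(vocab_file)
            # Only now hand the special tokens over to the base class.
            super().__init__(unk_token=unk_token, **kwargs)

        @property
        def vocab_size(self):
            return len(self.sp_model)

        def get_vocab(self):
            vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
            vocab.update(self.added_tokens_encoder)
            return vocab

        def _tokenize(self, text):
            return self.sp_model.encode(text, out_type=str)

        def _convert_token_to_id(self, token):
            return self.sp_model.piece_to_id(token)

        def _convert_id_to_token(self, index):
            return self.sp_model.id_to_piece(index)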
@@ -613,20 +613,6 @@ class XLMTokenizer(PreTrainedTokenizer):
         do_lowercase_and_remove_accent=True,
         **kwargs,
     ):
-        super().__init__(
-            unk_token=unk_token,
-            bos_token=bos_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            additional_special_tokens=additional_special_tokens,
-            lang2id=lang2id,
-            id2lang=id2lang,
-            do_lowercase_and_remove_accent=do_lowercase_and_remove_accent,
-            **kwargs,
-        )
         try:
             import sacremoses
         except ImportError:
@@ -660,6 +646,19 @@ class XLMTokenizer(PreTrainedTokenizer):
         merges = [tuple(merge.split()[:2]) for merge in merges]
         self.bpe_ranks = dict(zip(merges, range(len(merges))))
         self.cache = {}
+        super().__init__(
+            unk_token=unk_token,
+            bos_token=bos_token,
+            sep_token=sep_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            mask_token=mask_token,
+            additional_special_tokens=additional_special_tokens,
+            lang2id=lang2id,
+            id2lang=id2lang,
+            do_lowercase_and_remove_accent=do_lowercase_and_remove_accent,
+            **kwargs,
+        )
     @property
     def do_lower_case(self):
......
@@ -145,18 +145,6 @@ class XLMProphetNetTokenizer(PreTrainedTokenizer):
     ) -> None:
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
-        super().__init__(
-            bos_token=bos_token,
-            eos_token=eos_token,
-            sep_token=sep_token,
-            unk_token=unk_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            sp_model_kwargs=self.sp_model_kwargs,
-            **kwargs,
-        )
         try:
             import sentencepiece as spm
         except ImportError:
@@ -186,8 +174,20 @@ class XLMProphetNetTokenizer(PreTrainedTokenizer):
         # The first "real" token "," has position 15 in the embedding vocab and position 3 in the spm vocab
         self.fairseq_offset = 12
         self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}
-        for k in self.fairseq_tokens_to_ids.keys():
-            self.unique_no_split_tokens.append(k)
+        # TODO ArthurZ fairseq_ids_to_tokens should be removed
+        super().__init__(
+            bos_token=bos_token,
+            eos_token=eos_token,
+            sep_token=sep_token,
+            unk_token=unk_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            mask_token=mask_token,
+            sp_model_kwargs=self.sp_model_kwargs,
+            **kwargs,
+        )
     @property
     def can_save_slow_tokenizer(self) -> bool:
......
@@ -152,18 +152,6 @@ class XLMRobertaTokenizer(PreTrainedTokenizer):
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
-        super().__init__(
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            sp_model_kwargs=self.sp_model_kwargs,
-            **kwargs,
-        )
         self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
         self.sp_model.Load(str(vocab_file))
         self.vocab_file = vocab_file
@@ -183,6 +171,18 @@ class XLMRobertaTokenizer(PreTrainedTokenizer):
         self.fairseq_tokens_to_ids["<mask>"] = len(self.sp_model) + self.fairseq_offset
         self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}
+        super().__init__(
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            sp_model_kwargs=self.sp_model_kwargs,
+            **kwargs,
+        )
     def __getstate__(self):
         state = self.__dict__.copy()
         state["sp_model"] = None
@@ -288,6 +288,7 @@ class XLMRobertaTokenizer(PreTrainedTokenizer):
         return vocab
     def _tokenize(self, text: str) -> List[str]:
+        # TODO check if the t5/llama PR also applies here
         return self.sp_model.encode(text, out_type=str)
     def _convert_token_to_id(self, token):
......
@@ -152,6 +152,14 @@ class XLNetTokenizer(PreTrainedTokenizer):
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
+        self.do_lower_case = do_lower_case
+        self.remove_space = remove_space
+        self.keep_accents = keep_accents
+        self.vocab_file = vocab_file
+        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+        self.sp_model.Load(vocab_file)
         super().__init__(
             do_lower_case=do_lower_case,
             remove_space=remove_space,
@@ -170,14 +178,6 @@ class XLNetTokenizer(PreTrainedTokenizer):
         self._pad_token_type_id = 3
-        self.do_lower_case = do_lower_case
-        self.remove_space = remove_space
-        self.keep_accents = keep_accents
-        self.vocab_file = vocab_file
-        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
-        self.sp_model.Load(vocab_file)
     @property
     def vocab_size(self):
         return len(self.sp_model)
......
@@ -57,6 +57,7 @@ class Trie:
     def __init__(self):
         self.data = {}
+        self._tokens = set()
     def add(self, word: str):
         """
@@ -81,6 +82,8 @@ class Trie:
         if not word:
             # Prevent empty string
             return
+        self._tokens.add(word)
         ref = self.data
         for char in word:
             ref[char] = char in ref and ref[char] or {}
@@ -344,17 +347,48 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
     """
     def __init__(self, **kwargs):
+        # 1. Init the parent class
         super().__init__(**kwargs)
-        # Added tokens - We store this for both slow and fast tokenizers
-        # until the serialization of Fast tokenizers is updated
-        self.added_tokens_encoder: Dict[str, int] = {}
-        self.added_tokens_decoder: Dict[int, str] = {}
-        self.unique_no_split_tokens: List[str] = []
         self.tokens_trie = Trie()
+        # 2. init `_added_tokens_decoder` if child class did not
+        if not hasattr(self, "_added_tokens_decoder"):
+            self._added_tokens_decoder: Dict[int, AddedToken] = {}
+        # 3. if a `added_tokens_decoder` is passed, we are loading from a saved tokenizer, we overwrite
+        if "added_tokens_decoder" in kwargs:
+            # overwriting the class's added_tokens_decoder. This is the source of truth!
+            self._added_tokens_decoder.update(kwargs.get("added_tokens_decoder"))
+        self._added_tokens_encoder: Dict[str, int] = {k.content: v for v, k in self._added_tokens_decoder.items()}
+        # 4. If some of the special tokens are not part of the vocab, we add them, at the end.
+        # the order of addition is the same as self.SPECIAL_TOKENS_ATTRIBUTES following `tokenizers`
+        self._add_tokens(self.all_special_tokens_extended, special_tokens=True)
         self._decode_use_source_tokenizer = False
+    @property
+    def added_tokens_decoder(self) -> Dict[int, AddedToken]:
+        """
+        Returns the added tokens in the vocabulary as a dictionary of index to AddedToken.
+        Returns:
+            `Dict[str, int]`: The added tokens.
+        """
+        return dict(sorted(self._added_tokens_decoder.items(), key=lambda item: item[0]))
+    @added_tokens_decoder.setter
+    def added_tokens_decoder(self, value: Dict[int, Union[AddedToken, str]]) -> Dict[int, AddedToken]:
+        # Always raise an error if string because users should define the behavior
+        for index, token in value.items():
+            if not isinstance(token, (str, AddedToken)) or not isinstance(index, int):
+                raise ValueError(
+                    f"The provided `added_tokens_decoder` has an element of type {index.__class__, token.__class__}, should be a dict of {int, Union[AddedToken, str]}"
+                )
+            self._added_tokens_decoder[index] = AddedToken(token) if isinstance(token, str) else token
+            self._added_tokens_encoder[str(token)] = index
     @property
     def is_fast(self) -> bool:
         return False
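With the rework above, `_added_tokens_decoder` (index -> `AddedToken`) becomes the single source of truth for added tokens on slow tokenizers, and the encoder view is derived from it. A hedged sketch of how it can be inspected and extended, based on the property and setter introduced in this hunk (the checkpoint name is only an example):

    from transformers import AutoTokenizer
    from transformers.tokenization_utils_base import AddedToken

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)

    # Sorted mapping index -> AddedToken; special tokens now show up here even if they
    # already live in the base vocabulary.
    print(tokenizer.added_tokens_decoder)

    # The setter accepts {int: str | AddedToken} and keeps the reverse (string -> index) map in sync.
    new_id = len(tokenizer)
    tokenizer.added_tokens_decoder = {new_id: AddedToken("<new_tok>", lstrip=True)}
    print(tokenizer.added_tokens_decoder[new_id])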
@@ -368,28 +402,34 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
     def get_added_vocab(self) -> Dict[str, int]:
         """
-        Returns the added tokens in the vocabulary as a dictionary of token to index.
+        Returns the added tokens in the vocabulary as a dictionary of token to index. Results might be different from
+        the fast call because for now we always add the tokens even if they are already in the vocabulary. This is
+        something we should change.
         Returns:
             `Dict[str, int]`: The added tokens.
         """
-        return self.added_tokens_encoder
+        return self._added_tokens_encoder
     def __len__(self):
         """
-        Size of the full vocabulary with the added tokens.
+        Size of the full vocabulary with the added tokens. Counts the `keys` and not the `values` because otherwise if
+        there is a hole in the vocab, we will add tokenizers at a wrong index.
         """
-        return self.vocab_size + len(self.added_tokens_encoder)
+        return len(set(self.get_vocab().keys()))
     def _add_tokens(self, new_tokens: Union[List[str], List[AddedToken]], special_tokens: bool = False) -> int:
         """
         Add a list of new tokens to the tokenizer class. If the new tokens are not in the vocabulary, they are added to
-        it with indices starting from length of the current vocabulary.
+        it with indices starting from length of the current vocabulary. Special tokens are sometimes already in the
+        vocab which is why they have to be handled specifically.
         Args:
             new_tokens (`List[str]`or `List[tokenizers.AddedToken]`):
-                Token(s) to add in vocabulary. A token is only added if it's not already in the vocabulary (tested by
-                checking if the tokenizer assign the index of the `unk_token` to them).
+                Token(s) to add in vocabulary. A token is counted as added if it's not already in the vocabulary
+                (tested by checking if the tokenizer assign the index of the `unk_token` to them). If a token is part
+                of the vocabulary then we simply mark this token as an `AddedToken` which allows to control the
+                stripping and normalization of this token. This is NOT possible in `tokenizers`.
             special_tokens (`bool`, *optional*, defaults to `False`):
                 Whether or not the tokens should be added as special tokens.
@@ -408,52 +448,52 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
         # Note: resize_token_embeddings expects to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
         model.resize_token_embeddings(len(tokenizer))
         ```"""
-        new_tokens = [str(tok) for tok in new_tokens]
-        tokens_to_add = []
+        added_tokens = 0
+        if new_tokens is None:
+            return added_tokens
+        current_vocab = self.get_vocab().copy()
+        new_idx = len(current_vocab)  # only call this once, len gives the last index + 1
         for token in new_tokens:
-            if not isinstance(token, str):
+            if not isinstance(token, (str, AddedToken)):
                 raise TypeError(f"Token {token} is not a string but a {type(token)}.")
-            if not special_tokens and hasattr(self, "do_lower_case") and self.do_lower_case:
-                token = token.lower()
-            if (
-                token != self.unk_token
-                and self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token)
-                and token not in tokens_to_add
-            ):
-                tokens_to_add.append(token)
-                if self.verbose:
-                    logger.info(f"Adding {token} to the vocabulary")
-        added_tok_encoder = {tok: len(self) + i for i, tok in enumerate(tokens_to_add)}
-        added_tok_decoder = {v: k for k, v in added_tok_encoder.items()}
-        self.added_tokens_encoder.update(added_tok_encoder)
-        self.added_tokens_decoder.update(added_tok_decoder)
-        # Make sure we don't split on any special tokens (even they were already in the vocab before e.g. for Albert)
-        if special_tokens:
-            if len(new_tokens) == 1:
-                _insert_one_token_to_ordered_list(self.unique_no_split_tokens, new_tokens[0])
-            else:
-                self.unique_no_split_tokens = sorted(set(self.unique_no_split_tokens).union(set(new_tokens)))
-        else:
-            # Or on the newly added tokens
-            if len(tokens_to_add) == 1:
-                _insert_one_token_to_ordered_list(self.unique_no_split_tokens, tokens_to_add[0])
-            else:
-                self.unique_no_split_tokens = sorted(set(self.unique_no_split_tokens).union(set(tokens_to_add)))
-        self._create_trie(self.unique_no_split_tokens)
-        return len(tokens_to_add)
-    def _create_trie(self, unique_no_split_tokens):
-        trie = Trie()
+            if str(token) == "":
+                continue
+            if isinstance(token, str):
+                # for legacy AddedTokens strip left and right by default
+                # TODO this will be remove to have the same default behavior as rust
+                token = AddedToken(token, normalized=not special_tokens, rstrip=True, lstrip=True)
+            if special_tokens:
+                token.special = True
+            if token in self._added_tokens_decoder:
+                continue
+            if not token.special and token.normalized and hasattr(self, "do_lower_case") and self.do_lower_case:
+                # Normalize if requested
+                token.content = token.content.lower()
+            if token.content not in current_vocab:
+                token_index = new_idx + added_tokens
+                current_vocab[token.content] = token_index
+                added_tokens += 1
+            else:
+                token_index = current_vocab[token.content]
+            if token.special and str(token) not in self.all_special_tokens:
+                self._additional_special_tokens.append(token)
+            # the setter automatically updates the reverse map
+            self._added_tokens_decoder[token_index] = token
+            self._added_tokens_encoder[token.content] = token_index
+            if self.verbose:
+                logger.info(f"Adding {token} to the vocabulary")
+        self._update_trie()
+        return added_tokens
+    def _update_trie(self, unique_no_split_tokens: Optional[str] = []):
+        for token in self._added_tokens_decoder.values():
+            if token not in self.tokens_trie._tokens:
+                self.tokens_trie.add(token.content)
         for token in unique_no_split_tokens:
-            if hasattr(self, "do_lower_case") and self.do_lower_case and token not in self.all_special_tokens:
-                trie.add(token.lower())
-            else:
-                trie.add(token)
-        self.tokens_trie = trie
+            if token not in self.tokens_trie._tokens:
+                self.tokens_trie.add(token)
     def num_special_tokens_to_add(self, pair: bool = False) -> int:
         """
@@ -494,10 +534,6 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
         Returns:
             `List[str]`: The list of tokens.
         """
-        # Simple mapping string => AddedToken for special tokens with specific tokenization behaviors
-        all_special_tokens_extended = {
-            str(t): t for t in self.all_special_tokens_extended if isinstance(t, AddedToken)
-        }
         split_special_tokens = kwargs.pop("split_special_tokens", self.split_special_tokens)
         text, kwargs = self.prepare_for_tokenization(text, **kwargs)
@@ -505,27 +541,29 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
         if kwargs:
             logger.warning(f"Keyword arguments {kwargs} not recognized.")
+        # TODO: should this be in the base class?
         if hasattr(self, "do_lower_case") and self.do_lower_case:
             # convert non-special tokens to lowercase
-            escaped_special_toks = [
-                re.escape(s_tok) for s_tok in (self.unique_no_split_tokens + self.all_special_tokens)
+            escaped_special_toks = [re.escape(s_tok) for s_tok in (self.all_special_tokens)]
+            escaped_special_toks += [
+                re.escape(s_tok.content)
+                for s_tok in (self._added_tokens_decoder.values())
+                if not s_tok.special and s_tok.normalized
             ]
             pattern = r"(" + r"|".join(escaped_special_toks) + r")|" + r"(.+?)"
             text = re.sub(pattern, lambda m: m.groups()[0] or m.groups()[1].lower(), text)
+        # split_special_tokens: empty `no_split_token`
         if split_special_tokens:
             no_split_token = []
             tokens = [text]
         else:
-            no_split_token = set(self.unique_no_split_tokens)
+            no_split_token = set(self._added_tokens_encoder.keys())  # don't split on any of the added tokens
+            # "This is something<special_token_1> else"
             tokens = self.tokens_trie.split(text)
         # ["This is something", "<special_token_1>", " else"]
         for i, token in enumerate(tokens):
             if token in no_split_token:
-                tok_extended = all_special_tokens_extended.get(token, None)
+                tok_extended = self._added_tokens_decoder.get(self._added_tokens_encoder[token], None)
                 left = tokens[i - 1] if i > 0 else None
                 right = tokens[i + 1] if i < len(tokens) - 1 else None
                 if isinstance(tok_extended, AddedToken):
@@ -536,12 +574,18 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
                     # Strip white spaces on the left
                     if tok_extended.lstrip and left:
                         tokens[i - 1] = left.rstrip()  # Opposite here
+                    if tok_extended.single_word and left and left[-1] != " ":
+                        tokens[i - 1] += token
+                        tokens[i] = ""
+                    elif tok_extended.single_word and right and right[0] != " ":
+                        tokens[i + 1] = token + tokens[i + 1]
+                        tokens[i] = ""
                 else:
-                    # We strip left and right by default
-                    if right:
-                        tokens[i + 1] = right.lstrip()
-                    if left:
-                        tokens[i - 1] = left.rstrip()
+                    raise ValueError(
+                        f"{tok_extended} cannot be tokenized because it was not properly added"
+                        f" to the tokenizer. This means that it is not an `AddedToken` but a {type(tok_extended)}"
+                    )
         # ["This is something", "<special_token_1>", "else"]
         tokenized_text = []
         for token in tokens:
@@ -590,8 +634,8 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
         if token is None:
             return None
-        if token in self.added_tokens_encoder:
-            return self.added_tokens_encoder[token]
+        if token in self._added_tokens_encoder:
+            return self._added_tokens_encoder[token]
         return self._convert_token_to_id(token)
     def _convert_token_to_id(self, token):
@@ -904,8 +948,8 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
             `str` or `List[str]`: The decoded token(s).
         """
         if isinstance(ids, int):
-            if ids in self.added_tokens_decoder:
-                return self.added_tokens_decoder[ids]
+            if ids in self._added_tokens_decoder:
+                return self._added_tokens_decoder[ids].content
             else:
                 return self._convert_id_to_token(ids)
         tokens = []
@@ -913,8 +957,8 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
             index = int(index)
             if skip_special_tokens and index in self.all_special_ids:
                 continue
-            if index in self.added_tokens_decoder:
-                tokens.append(self.added_tokens_decoder[index])
+            if index in self._added_tokens_decoder:
+                tokens.append(self._added_tokens_decoder[index].content)
             else:
                 tokens.append(self._convert_id_to_token(index))
         return tokens
@@ -935,19 +979,29 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
     ) -> str:
         self._decode_use_source_tokenizer = kwargs.pop("use_source_tokenizer", False)
+        if spaces_between_special_tokens:
+            logger.warning_once(
+                "spaces_between_special_tokens is deprecated and will be removed in transformers v5. It was adding spaces between `added_tokens`, not special tokens, "
+                "and does not exist in our fast implementation. Future tokenizers will handle the decoding process on a per-model rule."
+            )
         filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
+        legacy_added_tokens = set(self._added_tokens_encoder.keys()) - set(self.all_special_tokens) | {
+            token for token in self.additional_special_tokens if self.convert_tokens_to_ids(token) >= self.vocab_size
+        }
         # To avoid mixing byte-level and unicode for byte-level BPT
         # we need to build string separately for added tokens and byte-level tokens
         # cf. https://github.com/huggingface/transformers/issues/1133
         sub_texts = []
         current_sub_text = []
+        # TODO @ArthurZ in version 5, special tokens should be handled in convert_tokens_to_string, while _convert_tokens_to_string
         for token in filtered_tokens:
             if skip_special_tokens and token in self.all_special_ids:
                 continue
-            if token in self.added_tokens_encoder:
+            if token in legacy_added_tokens:
                 if current_sub_text:
-                    sub_texts.append(self.convert_tokens_to_string(current_sub_text))
+                    string = self.convert_tokens_to_string(current_sub_text)
+                    if len(string) > 0:
+                        sub_texts.append(string)
                     current_sub_text = []
                 sub_texts.append(token)
             else:
......
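On the decoding side, the hunks above keep the old "build sub-strings around added tokens" behaviour only for legacy added tokens (tokens that live outside the underlying vocab), deprecate `spaces_between_special_tokens`, and drop empty sub-strings. A small sketch of the observable behaviour, assuming the example checkpoint is available:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=False)
    tokenizer.add_tokens(["<custom_tok>"])

    ids = tokenizer.encode("hello <custom_tok> world")
    print(tokenizer.decode(ids))
    # skip_special_tokens only drops tokens in all_special_ids; plain added tokens like
    # <custom_tok> stay in the output.
    print(tokenizer.decode(ids, skip_special_tokens=True))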
@@ -23,10 +23,10 @@ import json
 import os
 import re
 import warnings
-from collections import OrderedDict, UserDict
+from collections import UserDict
 from collections.abc import Mapping, Sized
 from contextlib import contextmanager
-from dataclasses import dataclass, field
+from dataclasses import dataclass
 from functools import lru_cache
 from typing import TYPE_CHECKING, Any, Dict, List, NamedTuple, Optional, Sequence, Tuple, Union
@@ -78,18 +78,25 @@ if is_tokenizers_available():
     from tokenizers import Encoding as EncodingFast
 else:
-    @dataclass(frozen=True, eq=True)
+    @dataclass(frozen=False, eq=True)
     class AddedToken:
         """
         AddedToken represents a token to be added to a Tokenizer An AddedToken can have special options defining the
         way it should behave.
+        The `normalized` will default to `not special` if it is not specified, similarly to the definition in
+        `tokenizers`.
         """
-        content: str = field(default_factory=str)
-        single_word: bool = False
-        lstrip: bool = False
-        rstrip: bool = False
-        normalized: bool = True
+        def __init__(
+            self, content: str, single_word=False, lstrip=False, rstrip=False, special=False, normalized=None
+        ):
+            self.content = content
+            self.single_word = single_word
+            self.lstrip = lstrip
+            self.rstrip = rstrip
+            self.special = special
+            self.normalized = normalized if normalized is not None else not special
         def __getstate__(self):
             return self.__dict__
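The fallback `AddedToken` above (used when the `tokenizers` library is not installed) is no longer a frozen dataclass and gains `special`, with `normalized` defaulting to `not special`. A tiny sketch of the resulting defaults, valid for the pure-Python fallback shown here (the Rust class shipped with `tokenizers` exposes the same fields, though its defaults live on the Rust side):

    from transformers.tokenization_utils_base import AddedToken

    regular = AddedToken("<watermark>")          # special=False -> normalized defaults to True
    special = AddedToken("<eot>", special=True)  # special=True  -> normalized defaults to False
    print(regular.normalized, special.normalized)
    # The default can still be overridden explicitly.
    explicit = AddedToken("<eot>", special=True, normalized=True)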
@@ -806,7 +813,8 @@ class SpecialTokensMixin:
             A special token representing a masked token (used by masked-language modeling pretraining objectives, like
             BERT).
         additional_special_tokens (tuple or list of `str` or `tokenizers.AddedToken`, *optional*):
-            A tuple or a list of additional special tokens.
+            A tuple or a list of additional tokens, which will be marked as `special`, meaning that they will be
+            skipped when decoding if `skip_special_tokens` is set to `True`.
     """
     SPECIAL_TOKENS_ATTRIBUTES = [
@@ -845,21 +853,20 @@ class SpecialTokensMixin:
                         isinstance(t, (str, AddedToken)) for t in value
                     ), "One of the tokens is not a string or an AddedToken"
                     setattr(self, key, value)
-                elif isinstance(value, (str, AddedToken)):
+                elif isinstance(value, (str)):
+                    value = AddedToken(value, normalized=False, special=True)
+                    setattr(self, key, value)
+                elif isinstance(value, AddedToken):
                     setattr(self, key, value)
                 else:
-                    raise TypeError(f"special token {key} has to be either str or AddedToken but got: {type(value)}")
+                    raise TypeError(f"Special token {key} has to be either str or AddedToken but got: {type(value)}")
     def sanitize_special_tokens(self) -> int:
         """
-        Make sure that all the special tokens attributes of the tokenizer (`tokenizer.mask_token`,
-        `tokenizer.cls_token`, etc.) are in the vocabulary.
-        Add the missing ones to the vocabulary if needed.
-        Return:
-            `int`: The number of tokens added in the vocabulary during the operation.
+        The `sanitize_special_tokens` is now deprecated kept for backward compatibility and will be removed in
+        transformers v5.
         """
+        logger.warning_once("The `sanitize_special_tokens` will be removed in transformers v5.")
         return self.add_tokens(self.all_special_tokens_extended, special_tokens=True)
     def add_special_tokens(
@@ -870,14 +877,15 @@ class SpecialTokensMixin:
         special tokens are NOT in the vocabulary, they are added to it (indexed starting from the last index of the
         current vocabulary).
-        Note,None When adding new tokens to the vocabulary, you should make sure to also resize the token embedding
-        matrix of the model so that its embedding matrix matches the tokenizer.
+        When adding new tokens to the vocabulary, you should make sure to also resize the token embedding matrix of the
+        model so that its embedding matrix matches the tokenizer.
         In order to do that, please use the [`~PreTrainedModel.resize_token_embeddings`] method.
         Using `add_special_tokens` will ensure your special tokens can be used in several ways:
-        - Special tokens are carefully handled by the tokenizer (they are never split).
+        - Special tokens can be skipped when decoding using `skip_special_tokens = True`.
+        - Special tokens are carefully handled by the tokenizer (they are never split), similar to `AddedTokens`.
         - You can easily refer to special tokens using tokenizer class attributes like `tokenizer.cls_token`. This
           makes it easy to develop model-agnostic training and fine-tuning scripts.
@@ -893,10 +901,12 @@ class SpecialTokensMixin:
                 Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer
                 assign the index of the `unk_token` to them).
            replace_additional_special_tokens (`bool`, *optional*,, defaults to `True`):
-                If `True`, the existing list of additional special tokens will be replaced by the one specified in
-                `special_tokens_dict`. Otherwise, `self._additional_special_tokens` is updated. In the former case, the
-                tokens will NOT be removed from the tokenizer's full vocabulary - they are only being flagged as
-                non-special tokens.
+                If `True`, the existing list of additional special tokens will be replaced by the list provided in
+                `special_tokens_dict`. Otherwise, `self._additional_special_tokens` is just extended. In the former
+                case, the tokens will NOT be removed from the tokenizer's full vocabulary - they are only being flagged
+                as non-special tokens. Remember, this only affects which tokens are skipped during decoding, not the
+                `added_tokens_encoder` and `added_tokens_decoder`. This means that the previous
+                `additional_special_tokens` are still added tokens, and will not be split by the model.
         Returns:
             `int`: Number of tokens added to the vocabulary.
@@ -920,7 +930,7 @@ class SpecialTokensMixin:
         if not special_tokens_dict:
             return 0
-        added_tokens = 0
+        added_tokens = []
         for key, value in special_tokens_dict.items():
             assert key in self.SPECIAL_TOKENS_ATTRIBUTES, f"Key {key} is not a special token"
@@ -932,28 +942,32 @@ class SpecialTokensMixin:
                     isinstance(t, (str, AddedToken)) for t in value
                 ), f"Tokens {value} for key {key} should all be str or AddedToken instances"
+                to_add = set()
+                for token in value:
+                    if isinstance(token, str):
+                        # for legacy purpose we default to stripping. `test_add_tokens_tokenizer` depends on this
+                        token = AddedToken(token, normalized=False, rstrip=True, lstrip=True)
+                    if str(token) not in self.additional_special_tokens:
+                        to_add.add(token)
                 if replace_additional_special_tokens:
-                    setattr(self, key, value)
+                    setattr(self, key, list(to_add))
                 else:
-                    # This is a copy of `self._additional_special_tokens`
-                    additional_special_tokens = getattr(self, key)
-                    additional_special_tokens_set = set(additional_special_tokens)
-                    to_add = []
-                    for token in value:
-                        if str(token) not in additional_special_tokens_set and str(token) not in to_add:
-                            to_add.append(token)
-                    # update the property
-                    additional_special_tokens.extend(to_add)
-                    self.additional_special_tokens = additional_special_tokens
-                added_tokens += self.add_tokens(value, special_tokens=True)
+                    self._additional_special_tokens.extend(to_add)
+                added_tokens += to_add
             else:
-                assert isinstance(
-                    value, (str, AddedToken)
-                ), f"Token {value} for key {key} should be a str or an AddedToken instance"
-                setattr(self, key, value)
-                added_tokens += self.add_tokens([value], special_tokens=True)
+                if not isinstance(value, (str, AddedToken)):
+                    raise ValueError(f"Token {value} for key {key} should be a str or an AddedToken instance")
+                if isinstance(value, (str)):
+                    # for legacy purpose we default to stripping. `test_add_tokens_tokenizer` depends on this
+                    value = AddedToken(value, normalized=False, rstrip=True, lstrip=True)
+                if isinstance(value, AddedToken):
+                    setattr(self, key, value)
+                if value not in added_tokens:
+                    added_tokens.append(value)
+        # if we are adding tokens that were not part of the vocab, we ought to add them
+        added_tokens = self.add_tokens(added_tokens, special_tokens=True)
         return added_tokens
     def add_tokens(
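With the reworked `add_special_tokens`, plain strings are wrapped into `AddedToken(..., rstrip=True, lstrip=True, normalized=False)` and everything is funnelled through a single `add_tokens(..., special_tokens=True)` call at the end. A hedged usage sketch (checkpoint name is only an example):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=False)

    # Returns the number of tokens that actually had to be appended to the vocabulary.
    num_added = tokenizer.add_special_tokens(
        {"pad_token": "[PAD]", "additional_special_tokens": ["<user>", "<assistant>"]}
    )
    print(num_added, tokenizer.pad_token, tokenizer.additional_special_tokens)

    # With replace_additional_special_tokens=False the previous extra specials are kept;
    # either way they remain added tokens and are never split.
    tokenizer.add_special_tokens(
        {"additional_special_tokens": ["<system>"]}, replace_additional_special_tokens=False
    )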
@@ -1102,35 +1116,74 @@ class SpecialTokensMixin:
     @bos_token.setter
     def bos_token(self, value):
+        if isinstance(value, str) and value != "":
+            value = AddedToken(value, normalized=False, rstrip=True, lstrip=True, special=True)
+        elif not isinstance(value, AddedToken) and value is not None:
+            raise ValueError("Cannot set a non-string value as the BOS token")
         self._bos_token = value
     @eos_token.setter
     def eos_token(self, value):
+        if isinstance(value, str) and value != "":
+            value = AddedToken(value, normalized=False, rstrip=True, lstrip=True, special=True)
+        elif not isinstance(value, AddedToken) and value is not None:
+            raise ValueError("Cannot set a non-string value as the EOS token")
         self._eos_token = value
     @unk_token.setter
     def unk_token(self, value):
+        if isinstance(value, str) and value != "":
+            value = AddedToken(value, normalized=False, rstrip=True, lstrip=True, special=True)
+        elif not isinstance(value, AddedToken) and value is not None:
+            raise ValueError("Cannot set a non-string value as the UNK token")
         self._unk_token = value
     @sep_token.setter
     def sep_token(self, value):
+        if isinstance(value, str) and value != "":
+            value = AddedToken(value, normalized=False, rstrip=True, lstrip=True, special=True)
+        elif not isinstance(value, AddedToken) and value is not None:
+            raise ValueError("Cannot set a non-string value as the SEP token")
         self._sep_token = value
     @pad_token.setter
     def pad_token(self, value):
+        if isinstance(value, str) and value != "":
+            value = AddedToken(value, normalized=False, rstrip=True, lstrip=True, special=True)
+        elif not isinstance(value, AddedToken) and value is not None:
+            raise ValueError("Cannot set a non-string value as the PAD token")
         self._pad_token = value
     @cls_token.setter
     def cls_token(self, value):
+        if isinstance(value, str) and value != "":
+            value = AddedToken(value, normalized=False, rstrip=True, lstrip=True, special=True)
+        elif not isinstance(value, AddedToken) and value is not None:
+            raise ValueError("Cannot set a non-string value as the CLS token")
         self._cls_token = value
     @mask_token.setter
     def mask_token(self, value):
+        if isinstance(value, str) and value != "":
+            value = AddedToken(value, normalized=False, rstrip=True, lstrip=True, special=True)
+        elif not isinstance(value, AddedToken) and value is not None:
+            raise ValueError("Cannot set a non-string value as the MASK token")
         self._mask_token = value
     @additional_special_tokens.setter
     def additional_special_tokens(self, value):
-        self._additional_special_tokens = value
+        if value is None:
+            self._additional_special_tokens = value
+            return
+        if self._additional_special_tokens is None:
+            self._additional_special_tokens = []
+        # We store the `AddedToken` to allow adding tokens via `tokenizer.add_special_tokens`
+        for token in value:
+            if isinstance(token, str) and token != "":
+                token = AddedToken(token, normalized=False, rstrip=True, lstrip=True, special=True)
+            elif not isinstance(token, AddedToken):
+                raise ValueError(f"Cannot add instance of type {type(value)} to additional_special_tokens!")
+            self._additional_special_tokens.append(token)
     @property
     def bos_token_id(self) -> Optional[int]:
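Each special-token setter above now wraps bare strings into an `AddedToken(..., special=True)` and rejects anything that is neither a string nor an `AddedToken`. Note that the setters only assign the attribute; they do not add the token to the vocabulary (use `add_special_tokens` for that). Minimal sketch, with an example checkpoint:

    from transformers import AutoTokenizer
    from transformers.tokenization_utils_base import AddedToken

    tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=False)

    tokenizer.pad_token = "<pad>"                           # wrapped into AddedToken(special=True)
    tokenizer.eos_token = AddedToken("<eot>", lstrip=True)  # explicit AddedToken is kept as-is
    print(tokenizer.pad_token)

    try:
        tokenizer.bos_token = 123  # neither str nor AddedToken
    except ValueError as err:
        print(err)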
@@ -1259,13 +1312,9 @@ class SpecialTokensMixin:
         """
         set_attr = {}
         for attr in self.SPECIAL_TOKENS_ATTRIBUTES:
-            attr_value = getattr(self, "_" + attr)
+            attr_value = getattr(self, attr)
             if attr_value:
-                set_attr[attr] = (
-                    type(attr_value)(str(attr_value_sub) for attr_value_sub in attr_value)
-                    if isinstance(attr_value, (list, tuple))
-                    else str(attr_value)
-                )
+                set_attr[attr] = attr_value
         return set_attr
     @property
@@ -1285,29 +1334,34 @@ class SpecialTokensMixin:
         return set_attr
     @property
-    def all_special_tokens(self) -> List[str]:
+    def all_special_tokens_extended(self) -> List[Union[str, AddedToken]]:
         """
-        `List[str]`: All the special tokens (`'<unk>'`, `'<cls>'`, etc.) mapped to class attributes.
+        `List[Union[str, tokenizers.AddedToken]]`: All the special tokens (`'<unk>'`, `'<cls>'`, etc.), the order has
+        nothing to do with the index of each tokens. If you want to know the correct indices, check
+        `self.added_tokens_encoder`. We can't create an order anymore as the keys are `AddedTokens` and not `Strings`.
-        Convert tokens of `tokenizers.AddedToken` type to string.
+        Don't convert tokens of `tokenizers.AddedToken` type to string so they can be used to control more finely how
+        special tokens are tokenized.
         """
-        all_toks = [str(s) for s in self.all_special_tokens_extended]
-        return all_toks
+        all_tokens = []
+        seen = set()
+        for value in self.special_tokens_map_extended.values():
+            if isinstance(value, (list, tuple)):
+                tokens_to_add = [token for token in value if str(token) not in seen]
+            else:
+                tokens_to_add = [value] if str(value) not in seen else []
+            seen.update(map(str, tokens_to_add))
+            all_tokens.extend(tokens_to_add)
+        return all_tokens
     @property
-    def all_special_tokens_extended(self) -> List[Union[str, AddedToken]]:
+    def all_special_tokens(self) -> List[str]:
         """
-        `List[Union[str, tokenizers.AddedToken]]`: All the special tokens (`'<unk>'`, `'<cls>'`, etc.) mapped to class
-        attributes.
+        `List[str]`: A list of the unique special tokens (`'<unk>'`, `'<cls>'`, ..., etc.).
-        Don't convert tokens of `tokenizers.AddedToken` type to string so they can be used to control more finely how
-        special tokens are tokenized.
+        Convert tokens of `tokenizers.AddedToken` type to string.
         """
-        all_toks = []
-        set_attr = self.special_tokens_map_extended
-        for attr_value in set_attr.values():
-            all_toks = all_toks + (list(attr_value) if isinstance(attr_value, (list, tuple)) else [attr_value])
-        all_toks = list(OrderedDict.fromkeys(all_toks))
+        all_toks = [str(s) for s in self.all_special_tokens_extended]
         return all_toks
     @property
@@ -1322,7 +1376,10 @@ class SpecialTokensMixin:
 ENCODE_KWARGS_DOCSTRING = r"""
             add_special_tokens (`bool`, *optional*, defaults to `True`):
-                Whether or not to encode the sequences with the special tokens relative to their model.
+                Whether or not to add special tokens when encoding the sequences. This will use the underlying
+                `PretrainedTokenizerBase.build_inputs_with_special_tokens` function, which defines which tokens are
+                automatically added to the input ids. This is usefull if you want to add `bos` or `eos` tokens
+                automatically.
             padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `False`):
                 Activates and controls padding. Accepts the following values:
@@ -1492,9 +1549,9 @@ INIT_TOKENIZER_DOCSTRING = r"""
             A special token representing a masked token (used by masked-language modeling pretraining objectives, like
             BERT). Will be associated to `self.mask_token` and `self.mask_token_id`.
         additional_special_tokens (tuple or list of `str` or `tokenizers.AddedToken`, *optional*):
-            A tuple or a list of additional special tokens. Add them here to ensure they won't be split by the
-            tokenization process. Will be associated to `self.additional_special_tokens` and
-            `self.additional_special_tokens_ids`.
+            A tuple or a list of additional special tokens. Add them here to ensure they are skipped when decoding with
+            `skip_special_tokens` is set to True. If they are not part of the vocabulary, they will be added at the end
+            of the vocabulary.
         clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`):
             Whether or not the model should cleanup the spaces that were added when splitting the input text during the
             tokenization process.
@@ -1614,12 +1671,26 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
         """Sets processor class as an attribute."""
         self._processor_class = processor_class
+    @property
+    def added_tokens_encoder(self) -> Dict[str, int]:
+        """
+        Returns the sorted mapping from string to index. The added tokens encoder is cached for performance
+        optimisation in `self._added_tokens_encoder` for the slow tokenizers.
+        """
+        return {k.content: v for v, k in sorted(self._added_tokens_decoder.items(), key=lambda item: item[0])}
+    @property
+    def added_tokens_decoder(self) -> Dict[int, AddedToken]:
+        raise NotImplementedError()
     def __repr__(self) -> str:
+        added_tokens_decoder_rep = "\n\t".join([f"{k}: {v.__repr__()}," for k, v in self.added_tokens_decoder.items()])
         return (
             f"{self.__class__.__name__}(name_or_path='{self.name_or_path}',"
             f" vocab_size={self.vocab_size}, model_max_length={self.model_max_length}, is_fast={self.is_fast},"
             f" padding_side='{self.padding_side}', truncation_side='{self.truncation_side}',"
-            f" special_tokens={self.special_tokens_map_extended}, clean_up_tokenization_spaces={self.clean_up_tokenization_spaces})"
+            f" special_tokens={self.special_tokens_map}, clean_up_tokenization_spaces={self.clean_up_tokenization_spaces}), "
+            " added_tokens_decoder={\n\t" + added_tokens_decoder_rep + "\n}"
         )
     def __len__(self) -> int:
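`added_tokens_encoder` is now exposed on the base class as a sorted view derived from `_added_tokens_decoder`, and `repr()` additionally prints the added-token mapping, which makes serialization issues easier to debug. Quick sketch (checkpoint name is illustrative):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("t5-small", use_fast=False)
    print(tokenizer)                       # repr now ends with an added_tokens_decoder={...} section
    print(tokenizer.added_tokens_encoder)  # added-token string -> id (contents depend on the checkpoint)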
...@@ -1878,12 +1949,13 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin): ...@@ -1878,12 +1949,13 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
else: else:
# At this point pretrained_model_name_or_path is either a directory or a model identifier name # At this point pretrained_model_name_or_path is either a directory or a model identifier name
additional_files_names = { additional_files_names = {
"added_tokens_file": ADDED_TOKENS_FILE, "added_tokens_file": ADDED_TOKENS_FILE, # kept only for legacy
"special_tokens_map_file": SPECIAL_TOKENS_MAP_FILE, "special_tokens_map_file": SPECIAL_TOKENS_MAP_FILE, # kept only for legacy
"tokenizer_config_file": TOKENIZER_CONFIG_FILE, "tokenizer_config_file": TOKENIZER_CONFIG_FILE,
# tokenizer_file used to initialize a slow from a fast. Properly copy the `addedTokens` instead of adding in random orders
"tokenizer_file": FULL_TOKENIZER_FILE,
} }
vocab_files = {**cls.vocab_files_names, **additional_files_names} vocab_files = {**cls.vocab_files_names, **additional_files_names}
if "tokenizer_file" in vocab_files: if "tokenizer_file" in vocab_files:
# Try to get the tokenizer config to see if there are versioned tokenizer files. # Try to get the tokenizer config to see if there are versioned tokenizer files.
fast_tokenizer_file = FULL_TOKENIZER_FILE fast_tokenizer_file = FULL_TOKENIZER_FILE
...@@ -2019,6 +2091,8 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
# First attempt. We get tokenizer_class from tokenizer_config to check mismatch between tokenizers.
config_tokenizer_class = init_kwargs.get("tokenizer_class")
init_kwargs.pop("tokenizer_class", None)
if not has_tokenizer_file:
init_kwargs.pop("tokenizer_file", None)
saved_init_inputs = init_kwargs.pop("init_inputs", ())
if not init_inputs:
init_inputs = saved_init_inputs
...@@ -2084,19 +2158,6 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
# Update with newly provided kwargs
init_kwargs.update(kwargs)
# Convert AddedTokens serialized as dict to class instances
def convert_added_tokens(obj: Union[AddedToken, Any]):
if isinstance(obj, dict) and "__type" in obj and obj["__type"] == "AddedToken":
obj.pop("__type")
return AddedToken(**obj)
elif isinstance(obj, (list, tuple)):
return [convert_added_tokens(o) for o in obj]
elif isinstance(obj, dict):
return {k: convert_added_tokens(v) for k, v in obj.items()}
return obj
init_kwargs = convert_added_tokens(init_kwargs)
# Set max length if needed
if pretrained_model_name_or_path in cls.max_model_input_sizes:
# if we're using a pretrained model, ensure the tokenizer
...@@ -2116,16 +2177,75 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
# Merge resolved_vocab_files arguments in init_kwargs.
added_tokens_file = resolved_vocab_files.pop("added_tokens_file", None)
special_tokens_map_file = resolved_vocab_files.pop("special_tokens_map_file", None)
for args_name, file_path in resolved_vocab_files.items():
if args_name not in init_kwargs:
init_kwargs[args_name] = file_path
if slow_tokenizer is not None:
init_kwargs["__slow_tokenizer"] = slow_tokenizer
init_kwargs["name_or_path"] = pretrained_model_name_or_path
additional_special_tokens = init_kwargs.pop("additional_special_tokens", None) or []
added_tokens_decoder = {}
legacy_saved = "added_tokens_decoder" not in init_kwargs
if not legacy_saved:
for idx, token in init_kwargs["added_tokens_decoder"].items():
if isinstance(token, dict):
token = AddedToken(**token)
if isinstance(token, AddedToken):
added_tokens_decoder[int(idx)] = token
else:
raise ValueError(
f"Found a {token.__class__} in the saved `added_tokens_decoder`, should be a dictionary."
)
else:
logger.warning_once(
"Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`, "
" it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again."
" You will see the new `added_tokens_decoder` attribute that will store the relevant information."
)
# begin legacy: read the added_tokens_file and update kwargs with special_tokens_map if modified
if special_tokens_map_file is not None:
with open(special_tokens_map_file, encoding="utf-8") as special_tokens_map_handle:
special_tokens_map = json.load(special_tokens_map_handle)
for key, value in special_tokens_map.items():
if key in kwargs and kwargs[key]:
# This value has already been redefined by the kwargs
# We keep this new value and ignore the one stored in the special_tokens_map_file
continue
if isinstance(value, dict):
value = AddedToken(**value)
elif key == "additional_special_tokens" and isinstance(value, list):
for token in value:
token = AddedToken(**token) if isinstance(token, dict) else token
if token not in additional_special_tokens:
additional_special_tokens.append(token)
else:
init_kwargs[key] = value
# slow -> slow|fast, legacy: convert the `"added_tokens.json"` file to `added_tokens_decoder`.
if added_tokens_file is not None:
with open(added_tokens_file, encoding="utf-8") as added_tokens_handle:
added_tok_encoder = json.load(added_tokens_handle)
# legacy: we have to init with (rstrip=True, lstrip=True)
added_tokens_decoder = {
index: AddedToken(token, rstrip=True, lstrip=True) for token, index in added_tok_encoder.items()
}
# end legacy
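# A short sketch of the legacy conversion just above: a hypothetical `added_tokens.json`
# mapping token -> index is rebuilt as index -> AddedToken with lstrip/rstrip forced to True.
from tokenizers import AddedToken
added_tok_encoder = {"<extra_tok_0>": 32000, "<extra_tok_1>": 32001}  # made-up legacy file content
added_tokens_decoder = {index: AddedToken(token, rstrip=True, lstrip=True) for token, index in added_tok_encoder.items()}
assert sorted(added_tokens_decoder) == [32000, 32001]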
# slow -> fast, non-legacy: we need to make sure the `added_tokens_decoder` is used to add tokens if the `fast` was not properly saved!
# thus we delay adding special tokens in the init using `slow_to_fast` flag.
if added_tokens_decoder != {} and "Fast" in cls.__name__:
init_kwargs["slow_to_fast"] = True
if len(additional_special_tokens) > 0:
init_kwargs["additional_special_tokens"] = additional_special_tokens
init_kwargs["added_tokens_decoder"] = added_tokens_decoder
# convert {'__type': 'AddedToken', 'content': '<ent>', 'lstrip': False, 'normalized': True, ...} to AddedTokens
init_kwargs = cls.convert_added_tokens(init_kwargs, False)
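# A sketch of the non-legacy path handled above: `added_tokens_decoder` entries from
# `tokenizer_config.json` (serialized AddedToken fields keyed by the token id) are rebuilt into
# AddedToken instances keyed by int. The fields shown here are illustrative.
from tokenizers import AddedToken
saved = {"32000": {"content": "<new_tok>", "single_word": False, "lstrip": False, "rstrip": False, "normalized": False}}
rebuilt = {int(idx): AddedToken(**fields) for idx, fields in saved.items()}
assert rebuilt[32000].content == "<new_tok>"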
# Instantiate the tokenizer.
try:
tokenizer = cls(*init_inputs, **init_kwargs)
except OSError:
...@@ -2134,79 +2254,43 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
"Please check that the provided vocabulary is accessible and not corrupted."
)
# allows converting a fast -> slow: add the `tokenizer.json`'s `"added_tokens"` to the slow tokenizer
# if `added_tokens_decoder` not in `tokenizer_config.json` and `added_tokens.json` is `None`
tokenizer_file = resolved_vocab_files.pop("tokenizer_file", None)
if legacy_saved and "Fast" not in cls.__name__ and added_tokens_file is None and tokenizer_file is not None:
tokens_to_add_from_fast = []
with open(tokenizer_file, encoding="utf-8") as tokenizer_file_handle:
tokenizer_file_handle = json.load(tokenizer_file_handle)
added_tokens = tokenizer_file_handle.pop("added_tokens")
for serialized_tokens in added_tokens:
serialized_tokens.pop("id")
# for legacy purposes, we ignore whether or not these tokens are special.
serialized_tokens.pop("special")
tokens_to_add_from_fast.append(AddedToken(**serialized_tokens))
tokenizer.add_tokens(tokens_to_add_from_fast)
# allows converting a slow -> fast, non-legacy: if the `tokenizer.json` does not have all the added tokens
# uses the information stored in `added_tokens_decoder`. Checks after addition that we have the same ids
if init_kwargs.get("slow_to_fast", False):
tokenizer.add_tokens([token for _, token in sorted(added_tokens_decoder.items(), key=lambda x: x[0])])
warnings = ""
for index, token in sorted(added_tokens_decoder.items(), key=lambda x: x[0]):
if tokenizer.convert_tokens_to_ids(str(token)) != index:
warnings += f"\texpected id: {tokenizer.convert_tokens_to_ids(str(token))}, found: {index}, token: `{token}`,\n"
if len(warnings) > 1:
logger.warning(
f"You are converting a {slow_tokenizer.__class__.__name__} to a {cls.__name__}, but"
f" wrong indexes were found when adding the `added_tokens` from the `slow` tokenizer to the `fast`."
f" The following tokens had an unexpected id:\n{warnings}. You should try using `from_slow`."
)
# finally we add all the special_tokens to make sure everything is initialized
tokenizer.add_tokens(tokenizer.all_special_tokens_extended, special_tokens=True)
if len(added_tokens_decoder) > 0:
logger.warning_advice(
"Special tokens have been added in the vocabulary, make sure the associated word embeddings are"
" fine-tuned or trained."
)
return tokenizer
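# A sketch of the fast -> slow branch above: entries of a hypothetical `tokenizer.json`
# `"added_tokens"` list are stripped of their "id" and "special" fields and re-created as
# AddedToken objects before being passed to `tokenizer.add_tokens(...)`.
from tokenizers import AddedToken
serialized = [{"id": 32000, "content": "<new_tok>", "single_word": False, "lstrip": False, "rstrip": False, "normalized": False, "special": True}]
tokens_to_add_from_fast = []
for entry in serialized:
    entry = dict(entry)
    entry.pop("id")
    entry.pop("special")  # this legacy path ignores the special flag
    tokens_to_add_from_fast.append(AddedToken(**entry))
assert tokens_to_add_from_fast[0].content == "<new_tok>"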
@staticmethod
...@@ -2217,6 +2301,21 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
# which we will correct in Transformers v5.
return max_model_length
@classmethod
def convert_added_tokens(cls, obj: Union[AddedToken, Any], add_type_field=True):
if isinstance(obj, dict) and "__type" in obj and obj["__type"] == "AddedToken":
obj.pop("__type")
return AddedToken(**obj)
if isinstance(obj, AddedToken):
if add_type_field:
obj = obj.content
return obj
elif isinstance(obj, (list, tuple)):
return [cls.convert_added_tokens(o, add_type_field=add_type_field) for o in obj]
elif isinstance(obj, dict):
return {k: cls.convert_added_tokens(v, add_type_field=add_type_field) for k, v in obj.items()}
return obj
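# A standalone sketch of the `"__type": "AddedToken"` convention handled by `convert_added_tokens`
# above: tagged dicts are rebuilt into AddedToken objects and containers are walked recursively.
# This is a simplified stand-in for illustration, not the method itself.
from typing import Any
from tokenizers import AddedToken

def _convert(obj: Any) -> Any:
    if isinstance(obj, dict) and obj.get("__type") == "AddedToken":
        fields = {k: v for k, v in obj.items() if k != "__type"}
        return AddedToken(**fields)
    if isinstance(obj, (list, tuple)):
        return [_convert(o) for o in obj]
    if isinstance(obj, dict):
        return {k: _convert(v) for k, v in obj.items()}
    return obj

restored = _convert({"mask_token": {"__type": "AddedToken", "content": "<mask>", "lstrip": True}})
assert restored["mask_token"].lstrip is True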
def save_pretrained(
self,
save_directory: Union[str, os.PathLike],
...@@ -2295,7 +2394,7 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
# TODO: Ensure the modified attributes (those are also in the __init__ kwargs) will give identical tokenizers
# target_keys = self.init_kwargs.keys()
target_keys = ["model_max_length", "clean_up_tokenization_spaces", "additional_special_tokens"]
for k in target_keys:
if hasattr(self, k):
tokenizer_config[k] = getattr(self, k)
...@@ -2308,21 +2407,13 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
for file_id in self.vocab_files_names.keys():
tokenizer_config.pop(file_id, None)
# Sanitize AddedTokens
def convert_added_tokens(obj: Union[AddedToken, Any], add_type_field=True):
if isinstance(obj, AddedToken):
out = obj.__getstate__()
if add_type_field:
out["__type"] = "AddedToken"
return out
elif isinstance(obj, (list, tuple)):
return [convert_added_tokens(o, add_type_field=add_type_field) for o in obj]
elif isinstance(obj, dict):
return {k: convert_added_tokens(v, add_type_field=add_type_field) for k, v in obj.items()}
return obj
# add_type_field=True to allow dicts in the kwargs / differentiate from AddedToken serialization
tokenizer_config = self.convert_added_tokens(tokenizer_config, add_type_field=True)
added_tokens = {}
for key, value in self.added_tokens_decoder.items():
added_tokens[key] = value.__getstate__()
tokenizer_config["added_tokens_decoder"] = added_tokens
# Add tokenizer class to the tokenizer config to be able to reload it with from_pretrained
tokenizer_class = self.__class__.__name__
...@@ -2351,7 +2442,9 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
logger.info(f"tokenizer config file saved in {tokenizer_config_file}")
# Sanitize AddedTokens in special_tokens_map
write_dict = convert_added_tokens(self.special_tokens_map_extended, add_type_field=False)
# kept for forward compatibility, will be removed in transformers 5
write_dict = self.convert_added_tokens(self.special_tokens_map_extended, add_type_field=True)
with open(special_tokens_map_file, "w", encoding="utf-8") as f:
out_str = json.dumps(write_dict, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
f.write(out_str)
...
...@@ -96,6 +96,7 @@ class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
slow_tokenizer = kwargs.pop("__slow_tokenizer", None)
fast_tokenizer_file = kwargs.pop("tokenizer_file", None)
from_slow = kwargs.pop("from_slow", False)
slow_to_fast = kwargs.pop("slow_to_fast", False)
if from_slow and slow_tokenizer is None and self.slow_tokenizer_class is None:
raise ValueError(
...@@ -154,6 +155,10 @@ class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
# We call this after having initialized the backend tokenizer because we update it.
super().__init__(**kwargs)
# We add the additional tokens that are not part of the vocab
if not slow_to_fast:
self._add_tokens(self.all_special_tokens_extended, special_tokens=True)
@property
def is_fast(self) -> bool:
return True
...@@ -180,6 +185,16 @@ class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
def vocab(self) -> Dict[str, int]:
return self.get_vocab()
@property
def added_tokens_decoder(self) -> Dict[int, AddedToken]:
"""
Returns the added tokens in the vocabulary as a dictionary of index to AddedToken.
Returns:
`Dict[int, AddedToken]`: The added tokens, keyed by their index in the vocabulary.
"""
return self._tokenizer.get_added_tokens_decoder()
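# Example use of the property above; the checkpoint name is only illustrative and downloading it
# requires network access. Special tokens typically show up here alongside user-added tokens.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
for index, token in sorted(tok.added_tokens_decoder.items()):
    print(index, repr(token.content))  # e.g. 0 '[PAD]', 100 '[UNK]', 101 '[CLS]', ...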
def get_added_vocab(self) -> Dict[str, int]:
"""
Returns the added tokens in the vocabulary as a dictionary of token to index.
...@@ -779,6 +794,7 @@ class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
lstrip=special_token_full.lstrip,
rstrip=special_token_full.rstrip,
normalized=special_token_full.normalized,
special=True,
)
else:
kwargs[token] = special_token
...
...@@ -170,7 +170,6 @@ class TestTokenizationBart(TokenizerTesterMixin, unittest.TestCase):
tokens_r_str = tokenizer_r.convert_ids_to_tokens(tokens_r["input_ids"])
tokens_p_str = tokenizer_p.convert_ids_to_tokens(tokens_p["input_ids"])
# Rust correctly handles the space before the mask while python doesnt
self.assertSequenceEqual(tokens_p["input_ids"], [0, 250, 6, 50264, 3823, 487, 21992, 3645, 4, 2])
self.assertSequenceEqual(tokens_r["input_ids"], [0, 250, 6, 50264, 3823, 487, 21992, 3645, 4, 2])
...
...@@ -42,6 +42,10 @@ class BloomTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
kwargs.update(self.special_tokens_map)
return BloomTokenizerFast.from_pretrained(self.tmpdirname, **kwargs)
@unittest.skip("This needs a slow tokenizer. Bloom does not have one!")
def test_encode_decode_with_spaces(self):
return
def test_encodings_from_sample_data(self):
"""
Assert that the created tokens are the same as the hard-coded ones
...
...@@ -205,7 +205,9 @@ class ByT5TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokenizer.add_tokens(["bim", "bambam"])
additional_special_tokens = tokenizer.additional_special_tokens
additional_special_tokens.append("new_additional_special_token")
tokenizer.add_special_tokens(
{"additional_special_tokens": additional_special_tokens}, replace_additional_special_tokens=False
)
before_tokens = tokenizer.encode(sample_text, add_special_tokens=False)
tokenizer.save_pretrained(tmpdirname)
...
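# The `replace_additional_special_tokens=False` flag used in the test above extends the existing
# `additional_special_tokens` instead of overwriting them. A rough sketch (checkpoint name
# illustrative, requires network access):
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/byt5-small")
n_before = len(tok.additional_special_tokens)
tok.add_special_tokens({"additional_special_tokens": ["<new_special>"]}, replace_additional_special_tokens=False)
assert len(tok.additional_special_tokens) == n_before + 1  # the previously registered extra ids are kept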
...@@ -43,13 +43,19 @@ class CamembertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokenizer = CamembertTokenizer(SAMPLE_VOCAB)
tokenizer.save_pretrained(self.tmpdirname)
@unittest.skip(
"Token maps are not equal because someone set the probability of ('<unk>NOTUSED', -100), so it's never encoded for fast"
)
def test_special_tokens_map_equal(self):
return
def test_convert_token_and_id(self):
"""Test ``_convert_token_to_id`` and ``_convert_id_to_token``."""
token = "<pad>"
token_id = 1  # 1 is the offset id, but in the spm vocab it's 3
self.assertEqual(self.get_tokenizer().convert_tokens_to_ids(token), token_id)
self.assertEqual(self.get_tokenizer().convert_ids_to_tokens(token_id), token)
def test_get_vocab(self):
vocab_keys = list(self.get_tokenizer().get_vocab().keys())
...@@ -57,10 +63,10 @@ class CamembertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
self.assertEqual(vocab_keys[0], "<s>NOTUSED")
self.assertEqual(vocab_keys[1], "<pad>")
self.assertEqual(vocab_keys[-1], "<mask>")
self.assertEqual(len(vocab_keys), 1_005)
def test_vocab_size(self):
self.assertEqual(self.get_tokenizer().vocab_size, 1_000)
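# The two assertions above distinguish the base sentencepiece vocabulary (`vocab_size`) from the
# full vocab returned by `get_vocab()`, which also includes the added tokens. A rough sketch
# (checkpoint name illustrative, requires network access; exact numbers depend on the checkpoint):
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("camembert-base")
print(tok.vocab_size)             # size of the underlying model vocabulary only
print(len(tok.get_vocab()))       # base vocabulary plus the added tokens
print(len(tok.added_tokens_decoder))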
def test_rust_and_python_bpe_tokenizers(self):
tokenizer = CamembertTokenizer(SAMPLE_BPE_VOCAB)
...
...@@ -122,7 +122,9 @@ class CanineTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
# We can add a new special token for Canine as follows:
new_additional_special_token = chr(0xE007)
additional_special_tokens.append(new_additional_special_token)
tokenizer.add_special_tokens(
{"additional_special_tokens": additional_special_tokens}, replace_additional_special_tokens=False
)
before_tokens = tokenizer.encode(sample_text, add_special_tokens=False)
tokenizer.save_pretrained(tmpdirname)
...@@ -167,11 +169,7 @@ class CanineTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
with self.subTest(f"{tokenizer.__class__.__name__}"):
SPECIAL_TOKEN_1 = chr(0xE005)
SPECIAL_TOKEN_2 = chr(0xE006)
# `add_tokens` method stores special tokens only in `tokenizer.unique_no_split_tokens`. (in tokenization_utils.py)
tokenizer.add_tokens([SPECIAL_TOKEN_1], special_tokens=True)
# `add_special_tokens` method stores special tokens in `tokenizer.additional_special_tokens`,
# which also occur in `tokenizer.all_special_tokens`. (in tokenization_utils_base.py)
tokenizer.add_special_tokens({"additional_special_tokens": [SPECIAL_TOKEN_2]})
token_1 = tokenizer.tokenize(SPECIAL_TOKEN_1)
...
...@@ -65,6 +65,10 @@ class CodeLlamaTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokenizer.pad_token = tokenizer.eos_token
tokenizer.save_pretrained(self.tmpdirname)
def get_tokenizers(self, **kwargs):
kwargs.update({"pad_token": "<PAD>"})
return super().get_tokenizers(**kwargs)
def test_no_infilling_init(self):
tokenizer = CodeLlamaTokenizer(SAMPLE_VOCAB, prefix_token=None, keep_accents=True)
with self.assertRaises(ValueError):
...@@ -518,7 +522,7 @@ class LlamaIntegrationTest(unittest.TestCase):
def test_special_token_special_word(self):
# the word inform should be split as ['in', 'form']
tokenizer = CodeLlamaTokenizer.from_pretrained("codellama/CodeLlama-7b-hf", legacy=False)
tokenizer.add_tokens(["<REPR_END>"], special_tokens=False)
out1 = tokenizer.decode(
tokenizer.encode("<REPR_END>inform", add_special_tokens=False), spaces_between_special_tokens=False
)
...@@ -526,7 +530,8 @@ class LlamaIntegrationTest(unittest.TestCase):
out2 = tokenizer.decode(
tokenizer.encode("<REPR_END>inform", add_special_tokens=False), spaces_between_special_tokens=True
)
# the added prefix token should not be decoded
self.assertEqual(out2, "<REPR_END> inform")
input_ids = tokenizer.encode("<REPR_END>inform", add_special_tokens=False)
self.assertEqual(input_ids, [29871, 32016, 262, 689])  # 29871 is the spiece underline, '▁'
...
...@@ -244,8 +244,8 @@ class CodeGenTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
decode_s = tokenizer.decode(out_s.input_ids)
decode_s2 = tokenizer.batch_decode(out_s2.input_ids)
self.assertTrue(decode_s.startswith(bos_token))
self.assertTrue(all(d.startswith(bos_token) for d in decode_s2))
@slow
def test_truncation(self):
...@@ -258,6 +258,7 @@ class CodeGenTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
truncation_pattern = ["^#", re.escape("<|endoftext|>"), "^'''", '^"""', "\n\n\n"]
decoded_text = tokenizer.decode(input_ids, truncate_before_pattern=truncation_pattern)
self.assertEqual(decoded_text, expected_trucated_text)
# TODO @ArthurZ outputs of the fast tokenizer are different in this case, un-related to the PR
# tokenizer has no padding token
def test_padding_different_model_input_name(self):
...
...@@ -68,12 +68,12 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokens_target = ["▁hello", "!", "how", "▁are", "▁you", "?"]
# fmt: on
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=True)
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(tokens, tokens_target)
rust_tokenizer = DebertaV2TokenizerFast(SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=True)
rust_tokens = rust_tokenizer.convert_ids_to_tokens(rust_tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(rust_tokens, tokens_target)
...@@ -92,12 +92,12 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokens_target = ["▁", "<unk>", "▁was", "▁born", "▁in", "▁9", "2000", "▁", ",", "▁and", "▁this", "▁is", "▁fal", "s", "<unk>", "▁", ".", ]
# fmt: on
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>", split_by_punct=True)
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(tokens, tokens_target)
rust_tokenizer = DebertaV2TokenizerFast(SAMPLE_VOCAB, unk_token="<unk>", split_by_punct=True)
rust_tokens = rust_tokenizer.convert_ids_to_tokens(rust_tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(rust_tokens, tokens_target)
...@@ -108,11 +108,13 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokens_target = ["▁i", "▁was", "▁born", "▁in", "▁9", "2000", "▁", ",", "▁and", "▁this", "▁is", "▁fal", "s", "<unk>", "▁", ".", ]
# fmt: on
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=True, split_by_punct=True)
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(tokens, tokens_target)
rust_tokenizer = DebertaV2TokenizerFast(
SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=True, split_by_punct=True
)
rust_tokens = rust_tokenizer.convert_ids_to_tokens(rust_tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(rust_tokens, tokens_target)
...@@ -122,12 +124,14 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokens_target = ["▁i", "▁was", "▁born", "▁in", "▁9", "2000", ",", "▁and", "▁this", "▁is", "▁fal", "s", "<unk>", ".", ]
# fmt: on
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=True, split_by_punct=False)
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(tokens, tokens_target)
rust_tokenizer = DebertaV2TokenizerFast(
SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=True, split_by_punct=False
)
rust_tokens = rust_tokenizer.convert_ids_to_tokens(rust_tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(rust_tokens, tokens_target)
...@@ -138,12 +142,14 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokens_target = ["▁", "<unk>", "▁was", "▁born", "▁in", "▁9", "2000", "▁", ",", "▁and", "▁this", "▁is", "▁fal", "s", "<unk>", "▁", ".", ]
# fmt: on
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=False, split_by_punct=True)
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(tokens, tokens_target)
rust_tokenizer = DebertaV2TokenizerFast(
SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=False, split_by_punct=True
)
rust_tokens = rust_tokenizer.convert_ids_to_tokens(rust_tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(rust_tokens, tokens_target)
...@@ -154,12 +160,14 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokens_target = ["▁", "<unk>", "e", "<unk>", "o", "!", "how", "▁", "<unk>", "re", "▁yo", "<unk>", "?"]
# fmt: on
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=False, split_by_punct=False)
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(tokens, tokens_target)
rust_tokenizer = DebertaV2TokenizerFast(
SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=False, split_by_punct=False
)
rust_tokens = rust_tokenizer.convert_ids_to_tokens(rust_tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(rust_tokens, tokens_target)
...@@ -189,8 +197,8 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokens_target = ["▁", "T", "his", "▁is", "▁a", "▁test"]
back_tokens_target = ["▁", "<unk>", "his", "▁is", "▁a", "▁test"]
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>", keep_accents=True)
rust_tokenizer = DebertaV2TokenizerFast(SAMPLE_VOCAB, unk_token="<unk>", keep_accents=True)
ids = tokenizer.encode(sequence, add_special_tokens=False)
self.assertListEqual(ids, ids_target)
...
...@@ -243,8 +243,8 @@ class GPT2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
decode_s = tokenizer.decode(out_s.input_ids)
decode_s2 = tokenizer.batch_decode(out_s2.input_ids)
self.assertTrue(decode_s.startswith(bos_token))
self.assertTrue(all(d.startswith(bos_token) for d in decode_s2))
# tokenizer has no padding token
def test_padding_different_model_input_name(self):
...
...@@ -145,10 +145,10 @@ class GPTSw3TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokenized_chats = [tokenizer.apply_chat_template(test_chat) for test_chat in test_chats]
# fmt: off
expected_tokens = [
[2000, 1, 575, 541, 419, 530, 339, 265, 878, 708, 727, 275, 347, 541, 260, 1, 968, 263, 314, 419, 366, 354, 294, 360, 1, 575, 541, 419],
[2000, 1, 575, 541, 419, 530, 339, 265, 878, 708, 727, 275, 347, 541, 260, 1, 968, 263, 314, 419, 366, 354, 294, 360, 1, 575, 541, 419, 984, 429, 281, 264, 1261, 291, 260, 1, 575, 541, 419],
[2000, 1, 575, 541, 419, 984, 429, 281, 264, 1261, 291, 260, 1, 968, 263, 314, 419, 366, 354, 294, 360, 1, 575, 541, 419]
]
# fmt: on
for tokenized_chat, expected_tokens in zip(tokenized_chats, expected_tokens):
self.assertListEqual(tokenized_chat, expected_tokens)
...@@ -210,9 +210,9 @@ class GPTSanJapaneseTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokenized_chats = [tokenizer.apply_chat_template(test_chat) for test_chat in test_chats]
# fmt: off
expected_tokens = [
[35993, 35998, 35637, 35659, 35665, 35716, 35645, 35662, 35649, 35716, 35645, 35716, 35652, 35649, 35656, 35660, 35650, 35665, 35656, 35716, 35647, 35652, 35645, 35664, 35646, 35659, 35664, 35595, 35716, 35999, 35993, 35998, 35620, 35649, 35656, 35656, 35659, 35582, 35716, 35999],
[35993, 35998, 35637, 35659, 35665, 35716, 35645, 35662, 35649, 35716, 35645, 35716, 35652, 35649, 35656, 35660, 35650, 35665, 35656, 35716, 35647, 35652, 35645, 35664, 35646, 35659, 35664, 35595, 35716, 35999, 35993, 35998, 35620, 35649, 35656, 35656, 35659, 35582, 35716, 35999, 35993, 35998, 35626, 35653, 35647, 35649, 35716, 35664, 35659, 35716, 35657, 35649, 35649, 35664, 35716, 35669, 35659, 35665, 35595, 35716, 35999],
[35993, 35998, 35626, 35653, 35647, 35649, 35716, 35664, 35659, 35716, 35657, 35649, 35649, 35664, 35716, 35669, 35659, 35665, 35595, 35716, 35999, 35993, 35998, 35620, 35649, 35656, 35656, 35659, 35582, 35716, 35999]
]
# fmt: on
for tokenized_chat, expected_tokens in zip(tokenized_chats, expected_tokens):
...