Unverified Commit 2da88537 authored by Arthur, committed by GitHub

🚨🚨 🚨🚨 [`Tokenizer`] attempt to fix add_token issues 🚨🚨 🚨🚨 (#23909)



* fix test for bart. Order is correct now let's skip BPEs

* ouf

* styling

* fix bert....

* slow refactoring

* current updates

* massive refactoring

* update

* NICE!

* update to see where I am at

* updates

* update

* update

* revert

* updates

* updates

* start supporting legacy_save

* styling

* big update

* revert some changes

* nits

* nniiiiiice

* small fixes

* kinda fix t5 with new behaviour

* major update

* fixup

* fix copies

* today's updates

* fix byt5

* update

* update

* update

* updates

* update vocab size test

* Barthez does not need the fairseq offset ids

* super call must be after

* call super

* move all super init

* move other super init

* fixup

* nits

* more fixes

* nits

* more fixes

* nits

* more fix

* remove useless files

* ouch all of them are affected

* and more!

* small improvements

* no more sanitize token

* more changes around unique no split tokens

* partially fix more things

* keep legacy save but add warning

* so... more fixes

* updates

* guess deberta tokenizer could be nuked

* fixup

* fixup did some bad things

* nuke it if it breaks

* remove prints and pretrain fast from slow with new format.

* fixups

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fiou

* nit

* by default specials should not be normalized?

* update

* remove breakpoint

* updates

* a lot of updates

* fixup

* fixes revert some changes to match fast

* small nits

* that makes it cleaner

* fix camembert accordingly

* update

* some less breaking changes

* update

* fixup

* fix byt5 and whisper mostly

* some more fixes, canine's byte vocab

* fix gpt2

* fix most of the perceiver tests (4 left)

* fix layout lmv3

* fixup

* fix copies for gpt2 style

* make sure to only warn once

* fix perceiver and gpt2 tests

* some more backward compatibility: also read the special tokens map because some people use it

* fixup

* add else when reading

* nits

* fresh updates

* fix copies

* will this make everything faster?

* fixes

* more fixes

* update

* more fixes

* fixup

* is the source of truth right?

* sorry camembert for the troubles

* current updates

* fixup

* update led

* update

* fix regression

* fix single word

* more model specific fixes

* fix t5 tests

* fixup

* more comments

* update

* fix nllb

* rstrip removed

* small fixes

* better handle additional_special_tokens and vocab sizes

* fixing

* styling

* fix 4 / 21

* fixup

* fix nllb's tests

* some fixes

* fix t5

* fixes

* style

* fix canine tests

* damn this is nice

* nits

* m2m100 nit

* fixups

* fixes!

* fixup

* stash

* fix merge

* revert bad change

* fixup

* correct order for code Llama

* fix speecht5 post merge

* styling

* revert source of 11 fails

* small nits

* all changes in one go

* fnet hack

* fix 2 more tests

* update based on main branch of tokenizers

* fixup

* fix VITS issues

* more fixes

* fix mgp test

* fix camembert issues

* oops, camembert still has 2 failing tests

* mluke fixes

* decode fixes

* small nits

* nits

* fix llama and vits

* fix camembert

* small nits

* more fixes when initialising a fast tokenizer from a slow one, etc.

* fix one of the last test

* fix CPM tokenizer test

* fixups

* fix pop2piano

* fixup

* Change tokenizers required version

* Change tokenizers required version

* "tokenizers>=0.14,<0.15", don't forget smaller than

* fix musicgen tests and PreTrainedTokenizerFast

* fix owlvit and all

* update t5

* fix 800 red

* fix tests

* fix the fix of the fix of t5

* styling

* documentation nits

* cache _added_tokens_encoder

* fixups

* Nit

* fix red tests

* one last nit!

* make everything a lot simpler

* Now it's over 😉



* few small nits

* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* updates that work for now

* tests that should not be skipped / changed and fixed next

* fixup

* i am ashamed

* push the fix

* update

* fixups

* nits

* fix added_tokens_encoder

* fix canine test

* fix pegasus vocab

* fix transfoXL

* fixup

* whisper needs to be fixed for train new

* pegasus nits

* more pegasus fixes

* minor update

* better error message in failed test

* fix whisper failing test

* fix whisper failing test

* fix pegasus

* fixup

* fix **** pegasus

* reset things

* remove another file

* attempts to fix the strange custom encoder and offset

* nits here and there

* update

* fixup

* nit

* fix the whisper test

* nits nits

* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* updates based on review

* some small update to potentially remove

* nits

* import lru cache

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Lysandre Debut <hi@lysand.re>

* move warning to `from_pretrained`

* update tests results now that the special tokens are always added

---------
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Lysandre Debut <hi@lysand.re>
parent 835b0a05
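For orientation before the diff, here is a minimal sketch of the added-token flow this commit moves towards; the checkpoint name and token content are only illustrations, not taken from the diff, and any slow (Python) tokenizer behaves the same way.

```python
from tokenizers import AddedToken
from transformers import AutoTokenizer

# Illustrative checkpoint; use_fast=False gives the slow (Python) tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)

# Strings and AddedToken instances can both be added; an AddedToken keeps
# control over lstrip/rstrip/normalized behaviour around the token.
tokenizer.add_tokens([AddedToken("<new_tok>", lstrip=True, rstrip=False)])

# The AddedToken objects, not bare strings, are now the source of truth.
print(tokenizer.added_tokens_decoder)  # {index: AddedToken("<new_tok>", ...), ...}
print(tokenizer.get_added_vocab())     # {"<new_tok>": index, ...}
```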
......@@ -19,7 +19,7 @@ from functools import lru_cache
from typing import List, Optional, Tuple
import numpy as np
from tokenizers import pre_tokenizers, processors
from tokenizers import AddedToken, pre_tokenizers, processors
from ...tokenization_utils_base import BatchEncoding
from ...tokenization_utils_fast import PreTrainedTokenizerFast
......@@ -148,6 +148,22 @@ class WhisperTokenizerFast(PreTrainedTokenizerFast):
predict_timestamps=False,
**kwargs,
):
bos_token = (
AddedToken(bos_token, lstrip=False, rstrip=False, normalized=False, special=True)
if isinstance(bos_token, str)
else bos_token
)
eos_token = (
AddedToken(eos_token, lstrip=False, rstrip=False, normalized=False, special=True)
if isinstance(eos_token, str)
else eos_token
)
unk_token = (
AddedToken(unk_token, lstrip=False, rstrip=False, normalized=False, special=True)
if isinstance(unk_token, str)
else unk_token
)
super().__init__(
vocab_file,
merges_file,
......@@ -444,11 +460,10 @@ class WhisperTokenizerFast(PreTrainedTokenizerFast):
@property
# Copied from transformers.models.whisper.tokenization_whisper.WhisperTokenizer.prefix_tokens
def prefix_tokens(self) -> List[int]:
all_special_ids = self.all_special_ids
bos_token_id = all_special_ids[-106]
translate_token_id = all_special_ids[-6]
transcribe_token_id = all_special_ids[-5]
notimestamps_token_id = all_special_ids[-1]
bos_token_id = self.convert_tokens_to_ids("<|startoftranscript|>")
translate_token_id = self.convert_tokens_to_ids("<|translate|>")
transcribe_token_id = self.convert_tokens_to_ids("<|transcribe|>")
notimestamps_token_id = self.convert_tokens_to_ids("<|notimestamps|>")
langs = tuple(LANGUAGES.keys())
if self.language is not None:
......
......@@ -137,17 +137,6 @@ class XGLMTokenizer(PreTrainedTokenizer):
word for word in madeup_words if word not in kwargs["additional_special_tokens"]
]
super().__init__(
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
sep_token=sep_token,
cls_token=cls_token,
pad_token=pad_token,
sp_model_kwargs=self.sp_model_kwargs,
**kwargs,
)
self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
self.sp_model.Load(str(vocab_file))
self.vocab_file = vocab_file
......@@ -170,6 +159,17 @@ class XGLMTokenizer(PreTrainedTokenizer):
self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}
super().__init__(
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
sep_token=sep_token,
cls_token=cls_token,
pad_token=pad_token,
sp_model_kwargs=self.sp_model_kwargs,
**kwargs,
)
def __getstate__(self):
state = self.__dict__.copy()
state["sp_model"] = None
......
......@@ -613,20 +613,6 @@ class XLMTokenizer(PreTrainedTokenizer):
do_lowercase_and_remove_accent=True,
**kwargs,
):
super().__init__(
unk_token=unk_token,
bos_token=bos_token,
sep_token=sep_token,
pad_token=pad_token,
cls_token=cls_token,
mask_token=mask_token,
additional_special_tokens=additional_special_tokens,
lang2id=lang2id,
id2lang=id2lang,
do_lowercase_and_remove_accent=do_lowercase_and_remove_accent,
**kwargs,
)
try:
import sacremoses
except ImportError:
......@@ -660,6 +646,19 @@ class XLMTokenizer(PreTrainedTokenizer):
merges = [tuple(merge.split()[:2]) for merge in merges]
self.bpe_ranks = dict(zip(merges, range(len(merges))))
self.cache = {}
super().__init__(
unk_token=unk_token,
bos_token=bos_token,
sep_token=sep_token,
pad_token=pad_token,
cls_token=cls_token,
mask_token=mask_token,
additional_special_tokens=additional_special_tokens,
lang2id=lang2id,
id2lang=id2lang,
do_lowercase_and_remove_accent=do_lowercase_and_remove_accent,
**kwargs,
)
@property
def do_lower_case(self):
......
......@@ -145,18 +145,6 @@ class XLMProphetNetTokenizer(PreTrainedTokenizer):
) -> None:
self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
super().__init__(
bos_token=bos_token,
eos_token=eos_token,
sep_token=sep_token,
unk_token=unk_token,
pad_token=pad_token,
cls_token=cls_token,
mask_token=mask_token,
sp_model_kwargs=self.sp_model_kwargs,
**kwargs,
)
try:
import sentencepiece as spm
except ImportError:
......@@ -186,8 +174,20 @@ class XLMProphetNetTokenizer(PreTrainedTokenizer):
# The first "real" token "," has position 15 in the embedding vocab and position 3 in the spm vocab
self.fairseq_offset = 12
self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}
for k in self.fairseq_tokens_to_ids.keys():
self.unique_no_split_tokens.append(k)
# TODO ArthurZ fairseq_ids_to_tokens should be removed
super().__init__(
bos_token=bos_token,
eos_token=eos_token,
sep_token=sep_token,
unk_token=unk_token,
pad_token=pad_token,
cls_token=cls_token,
mask_token=mask_token,
sp_model_kwargs=self.sp_model_kwargs,
**kwargs,
)
@property
def can_save_slow_tokenizer(self) -> bool:
......
......@@ -152,18 +152,6 @@ class XLMRobertaTokenizer(PreTrainedTokenizer):
self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
super().__init__(
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
sep_token=sep_token,
cls_token=cls_token,
pad_token=pad_token,
mask_token=mask_token,
sp_model_kwargs=self.sp_model_kwargs,
**kwargs,
)
self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
self.sp_model.Load(str(vocab_file))
self.vocab_file = vocab_file
......@@ -183,6 +171,18 @@ class XLMRobertaTokenizer(PreTrainedTokenizer):
self.fairseq_tokens_to_ids["<mask>"] = len(self.sp_model) + self.fairseq_offset
self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}
super().__init__(
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
sep_token=sep_token,
cls_token=cls_token,
pad_token=pad_token,
mask_token=mask_token,
sp_model_kwargs=self.sp_model_kwargs,
**kwargs,
)
def __getstate__(self):
state = self.__dict__.copy()
state["sp_model"] = None
......@@ -288,6 +288,7 @@ class XLMRobertaTokenizer(PreTrainedTokenizer):
return vocab
def _tokenize(self, text: str) -> List[str]:
# TODO check if the t5/llama PR also applies here
return self.sp_model.encode(text, out_type=str)
def _convert_token_to_id(self, token):
......
......@@ -152,6 +152,14 @@ class XLNetTokenizer(PreTrainedTokenizer):
self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
self.do_lower_case = do_lower_case
self.remove_space = remove_space
self.keep_accents = keep_accents
self.vocab_file = vocab_file
self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
self.sp_model.Load(vocab_file)
super().__init__(
do_lower_case=do_lower_case,
remove_space=remove_space,
......@@ -170,14 +178,6 @@ class XLNetTokenizer(PreTrainedTokenizer):
self._pad_token_type_id = 3
self.do_lower_case = do_lower_case
self.remove_space = remove_space
self.keep_accents = keep_accents
self.vocab_file = vocab_file
self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
self.sp_model.Load(vocab_file)
@property
def vocab_size(self):
return len(self.sp_model)
......
......@@ -57,6 +57,7 @@ class Trie:
def __init__(self):
self.data = {}
self._tokens = set()
def add(self, word: str):
"""
......@@ -81,6 +82,8 @@ class Trie:
if not word:
# Prevent empty string
return
self._tokens.add(word)
ref = self.data
for char in word:
ref[char] = char in ref and ref[char] or {}
......@@ -344,17 +347,48 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
"""
def __init__(self, **kwargs):
# 1. Init the parent class
super().__init__(**kwargs)
# Added tokens - We store this for both slow and fast tokenizers
# until the serialization of Fast tokenizers is updated
self.added_tokens_encoder: Dict[str, int] = {}
self.added_tokens_decoder: Dict[int, str] = {}
self.unique_no_split_tokens: List[str] = []
self.tokens_trie = Trie()
# 2. init `_added_tokens_decoder` if child class did not
if not hasattr(self, "_added_tokens_decoder"):
self._added_tokens_decoder: Dict[int, AddedToken] = {}
# 3. if a `added_tokens_decoder` is passed, we are loading from a saved tokenizer, we overwrite
if "added_tokens_decoder" in kwargs:
# overwriting the class's added_tokens_decoder. This is the source of truth!
self._added_tokens_decoder.update(kwargs.get("added_tokens_decoder"))
self._added_tokens_encoder: Dict[str, int] = {k.content: v for v, k in self._added_tokens_decoder.items()}
# 4. If some of the special tokens are not part of the vocab, we add them, at the end.
# the order of addition is the same as self.SPECIAL_TOKENS_ATTRIBUTES following `tokenizers`
self._add_tokens(self.all_special_tokens_extended, special_tokens=True)
self._decode_use_source_tokenizer = False
@property
def added_tokens_decoder(self) -> Dict[int, AddedToken]:
"""
Returns the added tokens in the vocabulary as a dictionary of index to AddedToken.
Returns:
`Dict[int, AddedToken]`: The added tokens.
"""
return dict(sorted(self._added_tokens_decoder.items(), key=lambda item: item[0]))
@added_tokens_decoder.setter
def added_tokens_decoder(self, value: Dict[int, Union[AddedToken, str]]) -> Dict[int, AddedToken]:
# Always raise an error if string because users should define the behavior
for index, token in value.items():
if not isinstance(token, (str, AddedToken)) or not isinstance(index, int):
raise ValueError(
f"The provided `added_tokens_decoder` has an element of type {index.__class__, token.__class__}, should be a dict of {int, Union[AddedToken, str]}"
)
self._added_tokens_decoder[index] = AddedToken(token) if isinstance(token, str) else token
self._added_tokens_encoder[str(token)] = index
@property
def is_fast(self) -> bool:
return False
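A sketch of the new `added_tokens_decoder` property and setter shown above, assuming `tokenizer` is a slow tokenizer instance; the indices and token contents are placeholders.

```python
from tokenizers import AddedToken

# The setter accepts int indices mapped to str or AddedToken values and keeps
# the reverse map (content -> index) in sync; plain strings are wrapped.
tokenizer.added_tokens_decoder = {
    32000: AddedToken("<extra_0>", lstrip=False, rstrip=False, special=True),
    32001: "<extra_1>",
}

# The getter returns the mapping sorted by index.
for index, token in tokenizer.added_tokens_decoder.items():
    print(index, token.content, token.special)
```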
......@@ -368,28 +402,34 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
def get_added_vocab(self) -> Dict[str, int]:
"""
Returns the added tokens in the vocabulary as a dictionary of token to index.
Returns the added tokens in the vocabulary as a dictionary of token to index. Results might be different from
the fast call because for now we always add the tokens even if they are already in the vocabulary. This is
something we should change.
Returns:
`Dict[str, int]`: The added tokens.
"""
return self.added_tokens_encoder
return self._added_tokens_encoder
def __len__(self):
"""
Size of the full vocabulary with the added tokens.
Size of the full vocabulary with the added tokens. Counts the `keys` and not the `values` because otherwise if
there is a hole in the vocab, we will add tokens at a wrong index.
"""
return self.vocab_size + len(self.added_tokens_encoder)
return len(set(self.get_vocab().keys()))
def _add_tokens(self, new_tokens: Union[List[str], List[AddedToken]], special_tokens: bool = False) -> int:
"""
Add a list of new tokens to the tokenizer class. If the new tokens are not in the vocabulary, they are added to
it with indices starting from length of the current vocabulary.
it with indices starting from the length of the current vocabulary. Special tokens are sometimes already in the
vocab, which is why they have to be handled specifically.
Args:
new_tokens (`List[str]`or `List[tokenizers.AddedToken]`):
Token(s) to add in vocabulary. A token is only added if it's not already in the vocabulary (tested by
checking if the tokenizer assign the index of the `unk_token` to them).
Token(s) to add in vocabulary. A token is counted as added if it's not already in the vocabulary
(tested by checking whether the tokenizer assigns the index of the `unk_token` to it). If a token is part
of the vocabulary, we simply mark it as an `AddedToken`, which allows controlling the
stripping and normalization of this token. This is NOT possible in `tokenizers`.
special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not the tokens should be added as special tokens.
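A sketch of the new `__len__` contract described above, assuming `tokenizer` is a slow tokenizer; the point is that holes in the index space no longer shift where the next token is added.

```python
# len() now counts the unique keys of the full vocab (base vocab + added tokens)
# instead of vocab_size + len(added_tokens_encoder).
print(tokenizer.vocab_size)              # size of the base vocabulary only
print(len(tokenizer.get_added_vocab()))  # tokens added on top of it
print(len(tokenizer))                    # == len(set(tokenizer.get_vocab().keys()))
```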
......@@ -408,52 +448,52 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
# Note: resize_token_embeddings expects to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
model.resize_token_embeddings(len(tokenizer))
```"""
new_tokens = [str(tok) for tok in new_tokens]
tokens_to_add = []
added_tokens = 0
if new_tokens is None:
return added_tokens
current_vocab = self.get_vocab().copy()
new_idx = len(current_vocab) # only call this once, len gives the last index + 1
for token in new_tokens:
if not isinstance(token, str):
if not isinstance(token, (str, AddedToken)):
raise TypeError(f"Token {token} is not a string but a {type(token)}.")
if not special_tokens and hasattr(self, "do_lower_case") and self.do_lower_case:
token = token.lower()
if (
token != self.unk_token
and self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token)
and token not in tokens_to_add
):
tokens_to_add.append(token)
if self.verbose:
logger.info(f"Adding {token} to the vocabulary")
added_tok_encoder = {tok: len(self) + i for i, tok in enumerate(tokens_to_add)}
added_tok_decoder = {v: k for k, v in added_tok_encoder.items()}
self.added_tokens_encoder.update(added_tok_encoder)
self.added_tokens_decoder.update(added_tok_decoder)
# Make sure we don't split on any special tokens (even they were already in the vocab before e.g. for Albert)
if special_tokens:
if len(new_tokens) == 1:
_insert_one_token_to_ordered_list(self.unique_no_split_tokens, new_tokens[0])
else:
self.unique_no_split_tokens = sorted(set(self.unique_no_split_tokens).union(set(new_tokens)))
else:
# Or on the newly added tokens
if len(tokens_to_add) == 1:
_insert_one_token_to_ordered_list(self.unique_no_split_tokens, tokens_to_add[0])
if str(token) == "":
continue
if isinstance(token, str):
# for legacy AddedTokens strip left and right by default
# TODO this will be remove to have the same default behavior as rust
token = AddedToken(token, normalized=not special_tokens, rstrip=True, lstrip=True)
if special_tokens:
token.special = True
if token in self._added_tokens_decoder:
continue
if not token.special and token.normalized and hasattr(self, "do_lower_case") and self.do_lower_case:
# Normalize if requested
token.content = token.content.lower()
if token.content not in current_vocab:
token_index = new_idx + added_tokens
current_vocab[token.content] = token_index
added_tokens += 1
else:
self.unique_no_split_tokens = sorted(set(self.unique_no_split_tokens).union(set(tokens_to_add)))
self._create_trie(self.unique_no_split_tokens)
return len(tokens_to_add)
def _create_trie(self, unique_no_split_tokens):
trie = Trie()
token_index = current_vocab[token.content]
if token.special and str(token) not in self.all_special_tokens:
self._additional_special_tokens.append(token)
# the setter automatically updates the reverse map
self._added_tokens_decoder[token_index] = token
self._added_tokens_encoder[token.content] = token_index
if self.verbose:
logger.info(f"Adding {token} to the vocabulary")
self._update_trie()
return added_tokens
def _update_trie(self, unique_no_split_tokens: Optional[str] = []):
for token in self._added_tokens_decoder.values():
if token not in self.tokens_trie._tokens:
self.tokens_trie.add(token.content)
for token in unique_no_split_tokens:
if hasattr(self, "do_lower_case") and self.do_lower_case and token not in self.all_special_tokens:
trie.add(token.lower())
else:
trie.add(token)
self.tokens_trie = trie
if token not in self.tokens_trie._tokens:
self.tokens_trie.add(token)
def num_special_tokens_to_add(self, pair: bool = False) -> int:
"""
......@@ -494,10 +534,6 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
Returns:
`List[str]`: The list of tokens.
"""
# Simple mapping string => AddedToken for special tokens with specific tokenization behaviors
all_special_tokens_extended = {
str(t): t for t in self.all_special_tokens_extended if isinstance(t, AddedToken)
}
split_special_tokens = kwargs.pop("split_special_tokens", self.split_special_tokens)
text, kwargs = self.prepare_for_tokenization(text, **kwargs)
......@@ -505,27 +541,29 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
if kwargs:
logger.warning(f"Keyword arguments {kwargs} not recognized.")
# TODO: should this be in the base class?
if hasattr(self, "do_lower_case") and self.do_lower_case:
# convert non-special tokens to lowercase
escaped_special_toks = [
re.escape(s_tok) for s_tok in (self.unique_no_split_tokens + self.all_special_tokens)
escaped_special_toks = [re.escape(s_tok) for s_tok in (self.all_special_tokens)]
escaped_special_toks += [
re.escape(s_tok.content)
for s_tok in (self._added_tokens_decoder.values())
if not s_tok.special and s_tok.normalized
]
pattern = r"(" + r"|".join(escaped_special_toks) + r")|" + r"(.+?)"
text = re.sub(pattern, lambda m: m.groups()[0] or m.groups()[1].lower(), text)
# split_special_tokens: empty `no_split_token`
if split_special_tokens:
no_split_token = []
tokens = [text]
else:
no_split_token = set(self.unique_no_split_tokens)
no_split_token = set(self._added_tokens_encoder.keys()) # don't split on any of the added tokens
# "This is something<special_token_1> else"
tokens = self.tokens_trie.split(text)
# ["This is something", "<special_token_1>", " else"]
for i, token in enumerate(tokens):
if token in no_split_token:
tok_extended = all_special_tokens_extended.get(token, None)
tok_extended = self._added_tokens_decoder.get(self._added_tokens_encoder[token], None)
left = tokens[i - 1] if i > 0 else None
right = tokens[i + 1] if i < len(tokens) - 1 else None
if isinstance(tok_extended, AddedToken):
......@@ -536,12 +574,18 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
# Strip white spaces on the left
if tok_extended.lstrip and left:
tokens[i - 1] = left.rstrip() # Opposite here
if tok_extended.single_word and left and left[-1] != " ":
tokens[i - 1] += token
tokens[i] = ""
elif tok_extended.single_word and right and right[0] != " ":
tokens[i + 1] = token + tokens[i + 1]
tokens[i] = ""
else:
# We strip left and right by default
if right:
tokens[i + 1] = right.lstrip()
if left:
tokens[i - 1] = left.rstrip()
raise ValueError(
f"{tok_extended} cannot be tokenized because it was not properly added"
f" to the tokenizer. This means that it is not an `AddedToken` but a {type(tok_extended)}"
)
# ["This is something", "<special_token_1>", "else"]
tokenized_text = []
for token in tokens:
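A sketch of what the `lstrip`/`rstrip` flags do inside `tokenize`, per the branch above; `tokenizer` and the token are assumptions, and the exact output depends on the underlying model.

```python
from tokenizers import AddedToken

# lstrip=True strips the whitespace to the left of the added token,
# rstrip=False keeps the whitespace to its right.
tokenizer.add_tokens([AddedToken("<special_token_1>", lstrip=True, rstrip=False)])

# The text is split on the added token first, then each remaining chunk is
# tokenized by the underlying model-specific tokenizer.
print(tokenizer.tokenize("This is something <special_token_1> else"))
```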
......@@ -590,8 +634,8 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
if token is None:
return None
if token in self.added_tokens_encoder:
return self.added_tokens_encoder[token]
if token in self._added_tokens_encoder:
return self._added_tokens_encoder[token]
return self._convert_token_to_id(token)
def _convert_token_to_id(self, token):
......@@ -904,8 +948,8 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
`str` or `List[str]`: The decoded token(s).
"""
if isinstance(ids, int):
if ids in self.added_tokens_decoder:
return self.added_tokens_decoder[ids]
if ids in self._added_tokens_decoder:
return self._added_tokens_decoder[ids].content
else:
return self._convert_id_to_token(ids)
tokens = []
......@@ -913,8 +957,8 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
index = int(index)
if skip_special_tokens and index in self.all_special_ids:
continue
if index in self.added_tokens_decoder:
tokens.append(self.added_tokens_decoder[index])
if index in self._added_tokens_decoder:
tokens.append(self._added_tokens_decoder[index].content)
else:
tokens.append(self._convert_id_to_token(index))
return tokens
......@@ -935,19 +979,29 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
) -> str:
self._decode_use_source_tokenizer = kwargs.pop("use_source_tokenizer", False)
if spaces_between_special_tokens:
logger.warning_once(
"spaces_between_special_tokens is deprecated and will be removed in transformers v5. It was adding spaces between `added_tokens`, not special tokens, "
"and does not exist in our fast implementation. Future tokenizers will handle the decoding process on a per-model rule."
)
filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
legacy_added_tokens = set(self._added_tokens_encoder.keys()) - set(self.all_special_tokens) | {
token for token in self.additional_special_tokens if self.convert_tokens_to_ids(token) >= self.vocab_size
}
# To avoid mixing byte-level and unicode for byte-level BPT
# we need to build string separately for added tokens and byte-level tokens
# cf. https://github.com/huggingface/transformers/issues/1133
sub_texts = []
current_sub_text = []
# TODO @ArthurZ in version 5, special tokens should be handled in convert_tokens_to_string, while _convert_tokens_to_string
for token in filtered_tokens:
if skip_special_tokens and token in self.all_special_ids:
continue
if token in self.added_tokens_encoder:
if token in legacy_added_tokens:
if current_sub_text:
sub_texts.append(self.convert_tokens_to_string(current_sub_text))
string = self.convert_tokens_to_string(current_sub_text)
if len(string) > 0:
sub_texts.append(string)
current_sub_text = []
sub_texts.append(token)
else:
......
......@@ -23,10 +23,10 @@ import json
import os
import re
import warnings
from collections import OrderedDict, UserDict
from collections import UserDict
from collections.abc import Mapping, Sized
from contextlib import contextmanager
from dataclasses import dataclass, field
from dataclasses import dataclass
from functools import lru_cache
from typing import TYPE_CHECKING, Any, Dict, List, NamedTuple, Optional, Sequence, Tuple, Union
......@@ -78,18 +78,25 @@ if is_tokenizers_available():
from tokenizers import Encoding as EncodingFast
else:
@dataclass(frozen=True, eq=True)
@dataclass(frozen=False, eq=True)
class AddedToken:
"""
AddedToken represents a token to be added to a Tokenizer. An AddedToken can have special options defining the
way it should behave.
The `normalized` will default to `not special` if it is not specified, similarly to the definition in
`tokenizers`.
"""
content: str = field(default_factory=str)
single_word: bool = False
lstrip: bool = False
rstrip: bool = False
normalized: bool = True
def __init__(
self, content: str, single_word=False, lstrip=False, rstrip=False, special=False, normalized=None
):
self.content = content
self.single_word = single_word
self.lstrip = lstrip
self.rstrip = rstrip
self.special = special
self.normalized = normalized if normalized is not None else not special
def __getstate__(self):
return self.__dict__
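A sketch of the new `normalized` default described in the docstring above; this mirrors the `tokenizers` definition, so the same holds whether the class comes from `tokenizers` or from this fallback.

```python
from transformers import AddedToken

# When `normalized` is not given, it defaults to `not special`.
print(AddedToken("<ctrl>", special=True).normalized)   # False
print(AddedToken("word").normalized)                   # True
# An explicit value always wins.
print(AddedToken("<ctrl>", special=True, normalized=True).normalized)  # True
```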
......@@ -806,7 +813,8 @@ class SpecialTokensMixin:
A special token representing a masked token (used by masked-language modeling pretraining objectives, like
BERT).
additional_special_tokens (tuple or list of `str` or `tokenizers.AddedToken`, *optional*):
A tuple or a list of additional special tokens.
A tuple or a list of additional tokens, which will be marked as `special`, meaning that they will be
skipped when decoding if `skip_special_tokens` is set to `True`.
"""
SPECIAL_TOKENS_ATTRIBUTES = [
......@@ -845,21 +853,20 @@ class SpecialTokensMixin:
isinstance(t, (str, AddedToken)) for t in value
), "One of the tokens is not a string or an AddedToken"
setattr(self, key, value)
elif isinstance(value, (str, AddedToken)):
elif isinstance(value, (str)):
value = AddedToken(value, normalized=False, special=True)
setattr(self, key, value)
elif isinstance(value, AddedToken):
setattr(self, key, value)
else:
raise TypeError(f"special token {key} has to be either str or AddedToken but got: {type(value)}")
raise TypeError(f"Special token {key} has to be either str or AddedToken but got: {type(value)}")
def sanitize_special_tokens(self) -> int:
"""
Make sure that all the special tokens attributes of the tokenizer (`tokenizer.mask_token`,
`tokenizer.cls_token`, etc.) are in the vocabulary.
Add the missing ones to the vocabulary if needed.
Return:
`int`: The number of tokens added in the vocabulary during the operation.
The `sanitize_special_tokens` method is now deprecated, kept for backward compatibility, and will be removed in
transformers v5.
"""
logger.warning_once("The `sanitize_special_tokens` will be removed in transformers v5.")
return self.add_tokens(self.all_special_tokens_extended, special_tokens=True)
def add_special_tokens(
......@@ -870,14 +877,15 @@ class SpecialTokensMixin:
special tokens are NOT in the vocabulary, they are added to it (indexed starting from the last index of the
current vocabulary).
Note,None When adding new tokens to the vocabulary, you should make sure to also resize the token embedding
matrix of the model so that its embedding matrix matches the tokenizer.
When adding new tokens to the vocabulary, you should make sure to also resize the token embedding matrix of the
model so that its embedding matrix matches the tokenizer.
In order to do that, please use the [`~PreTrainedModel.resize_token_embeddings`] method.
Using `add_special_tokens` will ensure your special tokens can be used in several ways:
- Special tokens are carefully handled by the tokenizer (they are never split).
- Special tokens can be skipped when decoding using `skip_special_tokens = True`.
- Special tokens are carefully handled by the tokenizer (they are never split), similar to `AddedTokens`.
- You can easily refer to special tokens using tokenizer class attributes like `tokenizer.cls_token`. This
makes it easy to develop model-agnostic training and fine-tuning scripts.
......@@ -893,10 +901,12 @@ class SpecialTokensMixin:
Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer
assign the index of the `unk_token` to them).
replace_additional_special_tokens (`bool`, *optional*, defaults to `True`):
If `True`, the existing list of additional special tokens will be replaced by the one specified in
`special_tokens_dict`. Otherwise, `self._additional_special_tokens` is updated. In the former case, the
tokens will NOT be removed from the tokenizer's full vocabulary - they are only being flagged as
non-special tokens.
If `True`, the existing list of additional special tokens will be replaced by the list provided in
`special_tokens_dict`. Otherwise, `self._additional_special_tokens` is just extended. In the former
case, the tokens will NOT be removed from the tokenizer's full vocabulary - they are only being flagged
as non-special tokens. Remember, this only affects which tokens are skipped during decoding, not the
`added_tokens_encoder` and `added_tokens_decoder`. This means that the previous
`additional_special_tokens` are still added tokens, and will not be split by the model.
Returns:
`int`: Number of tokens added to the vocabulary.
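A sketch of the two `replace_additional_special_tokens` modes documented above, assuming `tokenizer` already has some additional special tokens; the token strings are made up (the ByT5 test change further down in this diff uses the non-replacing mode).

```python
# Extend the existing list of additional special tokens.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<new_special>"]},
    replace_additional_special_tokens=False,
)

# Replace the list (default): previously registered tokens stay in the vocab and
# in added_tokens_decoder, they are simply no longer flagged as special.
tokenizer.add_special_tokens({"additional_special_tokens": ["<only_special>"]})
print(tokenizer.additional_special_tokens)
```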
......@@ -920,7 +930,7 @@ class SpecialTokensMixin:
if not special_tokens_dict:
return 0
added_tokens = 0
added_tokens = []
for key, value in special_tokens_dict.items():
assert key in self.SPECIAL_TOKENS_ATTRIBUTES, f"Key {key} is not a special token"
......@@ -932,28 +942,32 @@ class SpecialTokensMixin:
isinstance(t, (str, AddedToken)) for t in value
), f"Tokens {value} for key {key} should all be str or AddedToken instances"
to_add = set()
for token in value:
if isinstance(token, str):
# for legacy purpose we default to stripping. `test_add_tokens_tokenizer` depends on this
token = AddedToken(token, normalized=False, rstrip=True, lstrip=True)
if str(token) not in self.additional_special_tokens:
to_add.add(token)
if replace_additional_special_tokens:
setattr(self, key, value)
setattr(self, key, list(to_add))
else:
# This is a copy of `self._additional_special_tokens`
additional_special_tokens = getattr(self, key)
additional_special_tokens_set = set(additional_special_tokens)
to_add = []
for token in value:
if str(token) not in additional_special_tokens_set and str(token) not in to_add:
to_add.append(token)
# update the property
additional_special_tokens.extend(to_add)
self.additional_special_tokens = additional_special_tokens
added_tokens += self.add_tokens(value, special_tokens=True)
self._additional_special_tokens.extend(to_add)
added_tokens += to_add
else:
assert isinstance(
value, (str, AddedToken)
), f"Token {value} for key {key} should be a str or an AddedToken instance"
setattr(self, key, value)
added_tokens += self.add_tokens([value], special_tokens=True)
if not isinstance(value, (str, AddedToken)):
raise ValueError(f"Token {value} for key {key} should be a str or an AddedToken instance")
if isinstance(value, (str)):
# for legacy purpose we default to stripping. `test_add_tokens_tokenizer` depends on this
value = AddedToken(value, normalized=False, rstrip=True, lstrip=True)
if isinstance(value, AddedToken):
setattr(self, key, value)
if value not in added_tokens:
added_tokens.append(value)
# if we are adding tokens that were not part of the vocab, we ought to add them
added_tokens = self.add_tokens(added_tokens, special_tokens=True)
return added_tokens
def add_tokens(
......@@ -1102,35 +1116,74 @@ class SpecialTokensMixin:
@bos_token.setter
def bos_token(self, value):
if isinstance(value, str) and value != "":
value = AddedToken(value, normalized=False, rstrip=True, lstrip=True, special=True)
elif not isinstance(value, AddedToken) and value is not None:
raise ValueError("Cannot set a non-string value as the BOS token")
self._bos_token = value
@eos_token.setter
def eos_token(self, value):
if isinstance(value, str) and value != "":
value = AddedToken(value, normalized=False, rstrip=True, lstrip=True, special=True)
elif not isinstance(value, AddedToken) and value is not None:
raise ValueError("Cannot set a non-string value as the EOS token")
self._eos_token = value
@unk_token.setter
def unk_token(self, value):
if isinstance(value, str) and value != "":
value = AddedToken(value, normalized=False, rstrip=True, lstrip=True, special=True)
elif not isinstance(value, AddedToken) and value is not None:
raise ValueError("Cannot set a non-string value as the UNK token")
self._unk_token = value
@sep_token.setter
def sep_token(self, value):
if isinstance(value, str) and value != "":
value = AddedToken(value, normalized=False, rstrip=True, lstrip=True, special=True)
elif not isinstance(value, AddedToken) and value is not None:
raise ValueError("Cannot set a non-string value as the SEP token")
self._sep_token = value
@pad_token.setter
def pad_token(self, value):
if isinstance(value, str) and value != "":
value = AddedToken(value, normalized=False, rstrip=True, lstrip=True, special=True)
elif not isinstance(value, AddedToken) and value is not None:
raise ValueError("Cannot set a non-string value as the PAD token")
self._pad_token = value
@cls_token.setter
def cls_token(self, value):
if isinstance(value, str) and value != "":
value = AddedToken(value, normalized=False, rstrip=True, lstrip=True, special=True)
elif not isinstance(value, AddedToken) and value is not None:
raise ValueError("Cannot set a non-string value as the CLS token")
self._cls_token = value
@mask_token.setter
def mask_token(self, value):
if isinstance(value, str) and value != "":
value = AddedToken(value, normalized=False, rstrip=True, lstrip=True, special=True)
elif not isinstance(value, AddedToken) and value is not None:
raise ValueError("Cannot set a non-string value as the MASK token")
self._mask_token = value
@additional_special_tokens.setter
def additional_special_tokens(self, value):
self._additional_special_tokens = value
if value is None:
self._additional_special_tokens = value
return
if self._additional_special_tokens is None:
self._additional_special_tokens = []
# We store the `AddedToken` to allow adding tokens via `tokenizer.add_special_tokens`
for token in value:
if isinstance(token, str) and token != "":
token = AddedToken(token, normalized=False, rstrip=True, lstrip=True, special=True)
elif not isinstance(token, AddedToken):
raise ValueError(f"Cannot add instance of type {type(value)} to additional_special_tokens!")
self._additional_special_tokens.append(token)
@property
def bos_token_id(self) -> Optional[int]:
......@@ -1259,13 +1312,9 @@ class SpecialTokensMixin:
"""
set_attr = {}
for attr in self.SPECIAL_TOKENS_ATTRIBUTES:
attr_value = getattr(self, "_" + attr)
attr_value = getattr(self, attr)
if attr_value:
set_attr[attr] = (
type(attr_value)(str(attr_value_sub) for attr_value_sub in attr_value)
if isinstance(attr_value, (list, tuple))
else str(attr_value)
)
set_attr[attr] = attr_value
return set_attr
@property
......@@ -1285,29 +1334,34 @@ class SpecialTokensMixin:
return set_attr
@property
def all_special_tokens(self) -> List[str]:
def all_special_tokens_extended(self) -> List[Union[str, AddedToken]]:
"""
`List[str]`: All the special tokens (`'<unk>'`, `'<cls>'`, etc.) mapped to class attributes.
`List[Union[str, tokenizers.AddedToken]]`: All the special tokens (`'<unk>'`, `'<cls>'`, etc.); the order has
nothing to do with the index of each token. If you want to know the correct indices, check
`self.added_tokens_encoder`. We can't create an order anymore as the keys are `AddedToken`s and not `Strings`.
Convert tokens of `tokenizers.AddedToken` type to string.
Don't convert tokens of `tokenizers.AddedToken` type to string so they can be used to control more finely how
special tokens are tokenized.
"""
all_toks = [str(s) for s in self.all_special_tokens_extended]
return all_toks
all_tokens = []
seen = set()
for value in self.special_tokens_map_extended.values():
if isinstance(value, (list, tuple)):
tokens_to_add = [token for token in value if str(token) not in seen]
else:
tokens_to_add = [value] if str(value) not in seen else []
seen.update(map(str, tokens_to_add))
all_tokens.extend(tokens_to_add)
return all_tokens
@property
def all_special_tokens_extended(self) -> List[Union[str, AddedToken]]:
def all_special_tokens(self) -> List[str]:
"""
`List[Union[str, tokenizers.AddedToken]]`: All the special tokens (`'<unk>'`, `'<cls>'`, etc.) mapped to class
attributes.
`List[str]`: A list of the unique special tokens (`'<unk>'`, `'<cls>'`, ..., etc.).
Don't convert tokens of `tokenizers.AddedToken` type to string so they can be used to control more finely how
special tokens are tokenized.
Convert tokens of `tokenizers.AddedToken` type to string.
"""
all_toks = []
set_attr = self.special_tokens_map_extended
for attr_value in set_attr.values():
all_toks = all_toks + (list(attr_value) if isinstance(attr_value, (list, tuple)) else [attr_value])
all_toks = list(OrderedDict.fromkeys(all_toks))
all_toks = [str(s) for s in self.all_special_tokens_extended]
return all_toks
@property
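A sketch contrasting the two properties after the swap above, assuming `tokenizer` stores its special tokens as `AddedToken` instances.

```python
# The extended view keeps the AddedToken objects and their flags;
# the plain view is just the str() of each one, deduplicated.
print(tokenizer.all_special_tokens_extended)  # e.g. [AddedToken("</s>", ...), ...]
print(tokenizer.all_special_tokens)           # e.g. ["</s>", "<unk>", "<pad>"]
```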
......@@ -1322,7 +1376,10 @@ class SpecialTokensMixin:
ENCODE_KWARGS_DOCSTRING = r"""
add_special_tokens (`bool`, *optional*, defaults to `True`):
Whether or not to encode the sequences with the special tokens relative to their model.
Whether or not to add special tokens when encoding the sequences. This will use the underlying
`PreTrainedTokenizerBase.build_inputs_with_special_tokens` function, which defines which tokens are
automatically added to the input ids. This is useful if you want to add `bos` or `eos` tokens
automatically.
padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `False`):
Activates and controls padding. Accepts the following values:
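A sketch of the `add_special_tokens` encode flag documented in this hunk, assuming `tokenizer` is any pretrained tokenizer; the sentence is arbitrary.

```python
# build_inputs_with_special_tokens adds the model-specific tokens
# (e.g. bos/eos or cls/sep) around the encoded ids.
with_special = tokenizer("Hello world", add_special_tokens=True)["input_ids"]
without_special = tokenizer("Hello world", add_special_tokens=False)["input_ids"]
print(len(with_special) - len(without_special))  # number of special tokens added
```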
......@@ -1492,9 +1549,9 @@ INIT_TOKENIZER_DOCSTRING = r"""
A special token representing a masked token (used by masked-language modeling pretraining objectives, like
BERT). Will be associated to `self.mask_token` and `self.mask_token_id`.
additional_special_tokens (tuple or list of `str` or `tokenizers.AddedToken`, *optional*):
A tuple or a list of additional special tokens. Add them here to ensure they won't be split by the
tokenization process. Will be associated to `self.additional_special_tokens` and
`self.additional_special_tokens_ids`.
A tuple or a list of additional special tokens. Add them here to ensure they are skipped when decoding if
`skip_special_tokens` is set to `True`. If they are not part of the vocabulary, they will be added at the end
of the vocabulary.
clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`):
Whether or not the model should cleanup the spaces that were added when splitting the input text during the
tokenization process.
......@@ -1614,12 +1671,26 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
"""Sets processor class as an attribute."""
self._processor_class = processor_class
@property
def added_tokens_encoder(self) -> Dict[str, int]:
"""
Returns the sorted mapping from string to index. The added tokens encoder is cached for performance
optimisation in `self._added_tokens_encoder` for the slow tokenizers.
"""
return {k.content: v for v, k in sorted(self._added_tokens_decoder.items(), key=lambda item: item[0])}
@property
def added_tokens_decoder(self) -> Dict[int, AddedToken]:
raise NotImplementedError()
def __repr__(self) -> str:
added_tokens_decoder_rep = "\n\t".join([f"{k}: {v.__repr__()}," for k, v in self.added_tokens_decoder.items()])
return (
f"{self.__class__.__name__}(name_or_path='{self.name_or_path}',"
f" vocab_size={self.vocab_size}, model_max_length={self.model_max_length}, is_fast={self.is_fast},"
f" padding_side='{self.padding_side}', truncation_side='{self.truncation_side}',"
f" special_tokens={self.special_tokens_map_extended}, clean_up_tokenization_spaces={self.clean_up_tokenization_spaces})"
f" special_tokens={self.special_tokens_map}, clean_up_tokenization_spaces={self.clean_up_tokenization_spaces}), "
" added_tokens_decoder={\n\t" + added_tokens_decoder_rep + "\n}"
)
def __len__(self) -> int:
......@@ -1878,12 +1949,13 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
else:
# At this point pretrained_model_name_or_path is either a directory or a model identifier name
additional_files_names = {
"added_tokens_file": ADDED_TOKENS_FILE,
"special_tokens_map_file": SPECIAL_TOKENS_MAP_FILE,
"added_tokens_file": ADDED_TOKENS_FILE, # kept only for legacy
"special_tokens_map_file": SPECIAL_TOKENS_MAP_FILE, # kept only for legacy
"tokenizer_config_file": TOKENIZER_CONFIG_FILE,
# tokenizer_file used to initialize a slow from a fast. Properly copy the `addedTokens` instead of adding in random orders
"tokenizer_file": FULL_TOKENIZER_FILE,
}
vocab_files = {**cls.vocab_files_names, **additional_files_names}
if "tokenizer_file" in vocab_files:
# Try to get the tokenizer config to see if there are versioned tokenizer files.
fast_tokenizer_file = FULL_TOKENIZER_FILE
......@@ -2019,6 +2091,8 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
# First attempt. We get tokenizer_class from tokenizer_config to check mismatch between tokenizers.
config_tokenizer_class = init_kwargs.get("tokenizer_class")
init_kwargs.pop("tokenizer_class", None)
if not has_tokenizer_file:
init_kwargs.pop("tokenizer_file", None)
saved_init_inputs = init_kwargs.pop("init_inputs", ())
if not init_inputs:
init_inputs = saved_init_inputs
......@@ -2084,19 +2158,6 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
# Update with newly provided kwargs
init_kwargs.update(kwargs)
# Convert AddedTokens serialized as dict to class instances
def convert_added_tokens(obj: Union[AddedToken, Any]):
if isinstance(obj, dict) and "__type" in obj and obj["__type"] == "AddedToken":
obj.pop("__type")
return AddedToken(**obj)
elif isinstance(obj, (list, tuple)):
return [convert_added_tokens(o) for o in obj]
elif isinstance(obj, dict):
return {k: convert_added_tokens(v) for k, v in obj.items()}
return obj
init_kwargs = convert_added_tokens(init_kwargs)
# Set max length if needed
if pretrained_model_name_or_path in cls.max_model_input_sizes:
# if we're using a pretrained model, ensure the tokenizer
......@@ -2116,16 +2177,75 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
# Merge resolved_vocab_files arguments in init_kwargs.
added_tokens_file = resolved_vocab_files.pop("added_tokens_file", None)
special_tokens_map_file = resolved_vocab_files.pop("special_tokens_map_file", None)
for args_name, file_path in resolved_vocab_files.items():
if args_name not in init_kwargs:
init_kwargs[args_name] = file_path
if slow_tokenizer is not None:
init_kwargs["__slow_tokenizer"] = slow_tokenizer
init_kwargs["name_or_path"] = pretrained_model_name_or_path
# Instantiate tokenizer.
additional_special_tokens = init_kwargs.pop("additional_special_tokens", None) or []
added_tokens_decoder = {}
legacy_saved = "added_tokens_decoder" not in init_kwargs
if not legacy_saved:
for idx, token in init_kwargs["added_tokens_decoder"].items():
if isinstance(token, dict):
token = AddedToken(**token)
if isinstance(token, AddedToken):
added_tokens_decoder[int(idx)] = token
else:
raise ValueError(
f"Found a {token.__class__} in the saved `added_tokens_decoder`, should be a dictionary."
)
else:
logger.warning_once(
"Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`, "
" it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again."
" You will see the new `added_tokens_decoder` attribute that will store the relevant information."
)
# begin legacy: read the added_tokens_file and update kwargs with special_tokens_map if modified
if special_tokens_map_file is not None:
with open(special_tokens_map_file, encoding="utf-8") as special_tokens_map_handle:
special_tokens_map = json.load(special_tokens_map_handle)
for key, value in special_tokens_map.items():
if key in kwargs and kwargs[key]:
# This value has already been redefined by the kwargs
# We keep this new value and ignore the one stored in the special_tokens_map_file
continue
if isinstance(value, dict):
value = AddedToken(**value)
elif key == "additional_special_tokens" and isinstance(value, list):
for token in value:
token = AddedToken(**token) if isinstance(token, dict) else token
if token not in additional_special_tokens:
additional_special_tokens.append(token)
else:
init_kwargs[key] = value
# slow -> slow|fast, legacy: convert the `"added_tokens.json"` file to `added_tokens_decoder`.
if added_tokens_file is not None:
with open(added_tokens_file, encoding="utf-8") as added_tokens_handle:
added_tok_encoder = json.load(added_tokens_handle)
# legacy: we have to init with (rstrip=True, lstrip=True)
added_tokens_decoder = {
index: AddedToken(token, rstrip=True, lstrip=True) for token, index in added_tok_encoder.items()
}
# end legacy
# slow -> fast, non-legacy: we need to make sure the `added_tokens_decoder` is used to add tokens if the `fast` was not properly saved!
# thus we delay adding special tokens in the init using `slow_to_fast` flag.
if added_tokens_decoder is not {} and "Fast" in cls.__name__:
init_kwargs["slow_to_fast"] = True
if len(additional_special_tokens) > 0:
init_kwargs["additional_special_tokens"] = additional_special_tokens
init_kwargs["added_tokens_decoder"] = added_tokens_decoder
# convert {'__type': 'AddedToken', 'content': '<ent>', 'lstrip': False, 'normalized': True, ...} to AddedTokens
init_kwargs = cls.convert_added_tokens(init_kwargs, False)
# Instantiate the tokenizer.
try:
tokenizer = cls(*init_inputs, **init_kwargs)
except OSError:
......@@ -2134,79 +2254,43 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
"Please check that the provided vocabulary is accessible and not corrupted."
)
# Save inputs and kwargs for saving and re-loading with ``save_pretrained``
# Removed: Now done at the base class level
# tokenizer.init_inputs = init_inputs
# tokenizer.init_kwargs = init_kwargs
# If there is a complementary special token map, load it
special_tokens_map_file = resolved_vocab_files.pop("special_tokens_map_file", None)
if special_tokens_map_file is not None:
with open(special_tokens_map_file, encoding="utf-8") as special_tokens_map_handle:
special_tokens_map = json.load(special_tokens_map_handle)
for key, value in special_tokens_map.items():
if key in kwargs and kwargs[key]:
# This value has already been redefined by the kwargs
# We keep this new value and ignore the one stored in the special_tokens_map_file
continue
if isinstance(value, dict):
value = AddedToken(**value)
elif isinstance(value, list):
value = [AddedToken(**token) if isinstance(token, dict) else token for token in value]
setattr(tokenizer, key, value)
# Add supplementary tokens.
special_tokens = tokenizer.all_special_tokens
if added_tokens_file is not None:
with open(added_tokens_file, encoding="utf-8") as added_tokens_handle:
added_tok_encoder = json.load(added_tokens_handle)
# Sort added tokens by index
added_tok_encoder_sorted = sorted(added_tok_encoder.items(), key=lambda x: x[1])
# Accumulate added tokens into batches of special/non-special tokens, because calling add_tokens() for
# individual tokens would repeatedly rebuild a trie, which can be slow.
is_last_special = None
tokens = []
for token, index in added_tok_encoder_sorted:
current_index = len(tokenizer) + len(tokens)
if has_tokenizer_file and index != current_index and tokenizer.convert_tokens_to_ids(token) != index:
# Tokenizer fast: added token needs to either be in the vocabulary with the proper index or the
# index is the current length of the tokenizer (not in vocabulary)
raise ValueError(
f"Wrong index found for {token}: should be {tokenizer.convert_tokens_to_ids(token)} but found "
f"{index}."
)
elif not has_tokenizer_file and index != current_index:
# Tokenizer slow: added token cannot already be in the vocabulary so its index needs to be the
# current length of the tokenizer.
raise ValueError(
f"Non-consecutive added token '{token}' found. "
f"Should have index {current_index} but has index {index} in saved vocabulary."
)
is_special = bool(token in special_tokens)
if is_last_special is None or is_last_special == is_special:
tokens.append(token)
else:
tokenizer.add_tokens(tokens, special_tokens=is_last_special)
tokens = [token]
is_last_special = is_special
if tokens:
tokenizer.add_tokens(tokens, special_tokens=is_last_special)
# allows converting a fast -> slow: add the `tokenizer.json`'s `"added_tokens"` to the slow tokenizer
# if `added_tokens_decoder` not in `tokenizer_config.json` and `added_tokens.json` is `None`
tokenizer_file = resolved_vocab_files.pop("tokenizer_file", None)
if legacy_saved and "Fast" not in cls.__name__ and added_tokens_file is None and tokenizer_file is not None:
tokens_to_add_from_fast = []
with open(tokenizer_file, encoding="utf-8") as tokenizer_file_handle:
tokenizer_file_handle = json.load(tokenizer_file_handle)
added_tokens = tokenizer_file_handle.pop("added_tokens")
for serialized_tokens in added_tokens:
serialized_tokens.pop("id")
# for legacy purpose, we ignore whether or not these tokens are special.
serialized_tokens.pop("special")
tokens_to_add_from_fast.append(AddedToken(**serialized_tokens))
tokenizer.add_tokens(tokens_to_add_from_fast)
# allows converting a slow -> fast, non-legacy: if the `tokenizer.json` does not have all the added tokens
# uses the information stored in `added_tokens_decoder`. Checks after addition that we have the same ids
if init_kwargs.get("slow_to_fast", False):
tokenizer.add_tokens([token for _, token in sorted(added_tokens_decoder.items(), key=lambda x: x[0])])
warnings = ""
for index, token in sorted(added_tokens_decoder.items(), key=lambda x: x[0]):
if tokenizer.convert_tokens_to_ids(str(token)) != index:
warnings += f"\texpected id: {tokenizer.convert_tokens_to_ids(str(token))}, found: {index}, token: `{token}`,\n"
if len(warnings) > 1:
logger.warn(
f"You are converting a {slow_tokenizer.__class__.__name__} to a {cls.__name__}, but"
f" wrong indexes were founds when adding the `added_tokens` from the `slow` tokenizer to the `fast`. "
f" The following tokens had unexpected id :\n{warnings}. You should try using `from_slow`."
)
# finally we add all the special_tokens to make sure everything is initialized
tokenizer.add_tokens(tokenizer.all_special_tokens_extended, special_tokens=True)
# Check all our special tokens are registered as "no split" token (we don't cut them) and are in the vocab
added_tokens = tokenizer.sanitize_special_tokens()
if added_tokens:
if len(added_tokens_decoder) > 0:
logger.warning_advice(
"Special tokens have been added in the vocabulary, make sure the associated word embeddings are"
" fine-tuned or trained."
)
return tokenizer
@staticmethod
......@@ -2217,6 +2301,21 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
# which we will correct in Transformers v5.
return max_model_length
@classmethod
def convert_added_tokens(cls, obj: Union[AddedToken, Any], add_type_field=True):
if isinstance(obj, dict) and "__type" in obj and obj["__type"] == "AddedToken":
obj.pop("__type")
return AddedToken(**obj)
if isinstance(obj, AddedToken):
if add_type_field:
obj = obj.content
return obj
elif isinstance(obj, (list, tuple)):
return [cls.convert_added_tokens(o, add_type_field=add_type_field) for o in obj]
elif isinstance(obj, dict):
return {k: cls.convert_added_tokens(v, add_type_field=add_type_field) for k, v in obj.items()}
return obj
def save_pretrained(
self,
save_directory: Union[str, os.PathLike],
......@@ -2295,7 +2394,7 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
# TODO: Ensure the modified attributes (those are also in the __init__ kwargs) will give identical tokenizers
# target_keys = self.init_kwargs.keys()
target_keys = ["model_max_length", "clean_up_tokenization_spaces"]
target_keys = ["model_max_length", "clean_up_tokenization_spaces", "additional_special_tokens"]
for k in target_keys:
if hasattr(self, k):
tokenizer_config[k] = getattr(self, k)
......@@ -2308,21 +2407,13 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
for file_id in self.vocab_files_names.keys():
tokenizer_config.pop(file_id, None)
# Sanitize AddedTokens
def convert_added_tokens(obj: Union[AddedToken, Any], add_type_field=True):
if isinstance(obj, AddedToken):
out = obj.__getstate__()
if add_type_field:
out["__type"] = "AddedToken"
return out
elif isinstance(obj, (list, tuple)):
return [convert_added_tokens(o, add_type_field=add_type_field) for o in obj]
elif isinstance(obj, dict):
return {k: convert_added_tokens(v, add_type_field=add_type_field) for k, v in obj.items()}
return obj
# add_type_field=True to allow dicts in the kwargs / differentiate from AddedToken serialization
tokenizer_config = convert_added_tokens(tokenizer_config, add_type_field=True)
tokenizer_config = self.convert_added_tokens(tokenizer_config, add_type_field=True)
added_tokens = {}
for key, value in self.added_tokens_decoder.items():
added_tokens[key] = value.__getstate__()
tokenizer_config["added_tokens_decoder"] = added_tokens
# Add tokenizer class to the tokenizer config to be able to reload it with from_pretrained
tokenizer_class = self.__class__.__name__
......@@ -2351,7 +2442,9 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
logger.info(f"tokenizer config file saved in {tokenizer_config_file}")
# Sanitize AddedTokens in special_tokens_map
write_dict = convert_added_tokens(self.special_tokens_map_extended, add_type_field=False)
# kept for forward compatibility, will be removed in transformers 5
write_dict = self.convert_added_tokens(self.special_tokens_map_extended, add_type_field=True)
with open(special_tokens_map_file, "w", encoding="utf-8") as f:
out_str = json.dumps(write_dict, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
f.write(out_str)
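A sketch of the resulting serialization round trip, assuming `tokenizer` is any tokenizer and `ckpt_dir` a scratch directory: `added_tokens_decoder` now lives in `tokenizer_config.json`, while the legacy `special_tokens_map.json` is still written for compatibility, as shown in the hunk above.

```python
import json
import os

tokenizer.add_tokens(["<new_tok>"])
tokenizer.save_pretrained("ckpt_dir")

with open(os.path.join("ckpt_dir", "tokenizer_config.json"), encoding="utf-8") as f:
    config = json.load(f)

# Each entry is the __getstate__ dict of an AddedToken, keyed by its index
# (a string key, since this is JSON).
print(config["added_tokens_decoder"])

# from_pretrained reads this field back into AddedToken instances; the legacy
# files are only consulted when the field is missing (with a warning).
reloaded = tokenizer.__class__.from_pretrained("ckpt_dir")
print(reloaded.added_tokens_decoder)
```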
......
......@@ -96,6 +96,7 @@ class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
slow_tokenizer = kwargs.pop("__slow_tokenizer", None)
fast_tokenizer_file = kwargs.pop("tokenizer_file", None)
from_slow = kwargs.pop("from_slow", False)
slow_to_fast = kwargs.pop("slow_to_fast", False)
if from_slow and slow_tokenizer is None and self.slow_tokenizer_class is None:
raise ValueError(
......@@ -154,6 +155,10 @@ class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
# We call this after having initialized the backend tokenizer because we update it.
super().__init__(**kwargs)
# We add the additional tokens that are not part of the vocab
if not slow_to_fast:
self._add_tokens(self.all_special_tokens_extended, special_tokens=True)
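A hedged sketch of the behaviour this guard adds when a fast tokenizer is instantiated directly rather than converted from a slow one; the file path and token string are placeholders.

# unless the backend was just converted from a slow tokenizer (slow_to_fast=True),
# special tokens missing from the backend vocab are registered as added tokens
tok = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",               # placeholder path to an existing backend file
    additional_special_tokens=["<extra_token>"],   # placeholder special token
)
assert "<extra_token>" in tok.get_added_vocab()    # expected under the new behaviour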
@property
def is_fast(self) -> bool:
return True
@@ -180,6 +185,16 @@ class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
def vocab(self) -> Dict[str, int]:
return self.get_vocab()
@property
def added_tokens_decoder(self) -> Dict[int, AddedToken]:
"""
Returns the added tokens in the vocabulary as a dictionary of index to AddedToken.
Returns:
`Dict[int, AddedToken]`: The added tokens, keyed by their index in the vocabulary.
"""
return self._tokenizer.get_added_tokens_decoder()
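A small hedged usage sketch; the checkpoint name is only a placeholder, and the property requires a tokenizers release that exposes `get_added_tokens_decoder`.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint, any fast tokenizer works
for index, added_token in tok.added_tokens_decoder.items():
    # each value is an AddedToken; its content is the raw token string
    print(index, added_token.content)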
def get_added_vocab(self) -> Dict[str, int]:
"""
Returns the added tokens in the vocabulary as a dictionary of token to index.
@@ -779,6 +794,7 @@ class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
lstrip=special_token_full.lstrip,
rstrip=special_token_full.rstrip,
normalized=special_token_full.normalized,
special=True,
)
else:
kwargs[token] = special_token
@@ -170,7 +170,6 @@ class TestTokenizationBart(TokenizerTesterMixin, unittest.TestCase):
tokens_r_str = tokenizer_r.convert_ids_to_tokens(tokens_r["input_ids"])
tokens_p_str = tokenizer_p.convert_ids_to_tokens(tokens_p["input_ids"])
# Rust correctly handles the space before the mask while Python doesn't
self.assertSequenceEqual(tokens_p["input_ids"], [0, 250, 6, 50264, 3823, 487, 21992, 3645, 4, 2])
self.assertSequenceEqual(tokens_r["input_ids"], [0, 250, 6, 50264, 3823, 487, 21992, 3645, 4, 2])
@@ -42,6 +42,10 @@ class BloomTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
kwargs.update(self.special_tokens_map)
return BloomTokenizerFast.from_pretrained(self.tmpdirname, **kwargs)
@unittest.skip("This needs a slow tokenizer. Bloom does not have one!")
def test_encode_decode_with_spaces(self):
return
def test_encodings_from_sample_data(self):
"""
Assert that the created tokens are the same as the hard-coded ones
@@ -205,7 +205,9 @@ class ByT5TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokenizer.add_tokens(["bim", "bambam"])
additional_special_tokens = tokenizer.additional_special_tokens
additional_special_tokens.append("new_additional_special_token")
tokenizer.add_special_tokens({"additional_special_tokens": additional_special_tokens})
tokenizer.add_special_tokens(
{"additional_special_tokens": additional_special_tokens}, replace_additional_special_tokens=False
)
before_tokens = tokenizer.encode(sample_text, add_special_tokens=False)
tokenizer.save_pretrained(tmpdirname)
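For context, a hedged sketch of what the new `replace_additional_special_tokens` flag changes; the token strings are placeholders and `tokenizer` stands for any tokenizer instance.

# with the default replace_additional_special_tokens=True the previous list is overwritten;
# with False, as used in the test above, the new tokens are appended to the existing ones
tokenizer.add_special_tokens({"additional_special_tokens": ["<tok_a>"]})
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<tok_b>"]}, replace_additional_special_tokens=False
)
# expected: both "<tok_a>" and "<tok_b>" are now in tokenizer.additional_special_tokens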
@@ -43,13 +43,19 @@ class CamembertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokenizer = CamembertTokenizer(SAMPLE_VOCAB)
tokenizer.save_pretrained(self.tmpdirname)
@unittest.skip(
"Token maps are not equal because someone set the probability of ('<unk>NOTUSED', -100), so it's never encoded for fast"
)
def test_special_tokens_map_equal(self):
return
def test_convert_token_and_id(self):
"""Test ``_convert_token_to_id`` and ``_convert_id_to_token``."""
token = "<pad>"
token_id = 1
token_id = 1 # 1 is the offset id, but in the spm vocab it's 3
self.assertEqual(self.get_tokenizer()._convert_token_to_id(token), token_id)
self.assertEqual(self.get_tokenizer()._convert_id_to_token(token_id), token)
self.assertEqual(self.get_tokenizer().convert_tokens_to_ids(token), token_id)
self.assertEqual(self.get_tokenizer().convert_ids_to_tokens(token_id), token)
def test_get_vocab(self):
vocab_keys = list(self.get_tokenizer().get_vocab().keys())
@@ -57,10 +63,10 @@ class CamembertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
self.assertEqual(vocab_keys[0], "<s>NOTUSED")
self.assertEqual(vocab_keys[1], "<pad>")
self.assertEqual(vocab_keys[-1], "<mask>")
self.assertEqual(len(vocab_keys), 1_004)
self.assertEqual(len(vocab_keys), 1_005)
def test_vocab_size(self):
self.assertEqual(self.get_tokenizer().vocab_size, 1_005)
self.assertEqual(self.get_tokenizer().vocab_size, 1_000)
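A hedged sketch of the distinction these two tests now encode; the counts are taken from the assertions above.

tokenizer = CamembertTokenizer(SAMPLE_VOCAB)
# vocab_size now reports only the underlying sentencepiece vocabulary ...
assert tokenizer.vocab_size == 1_000
# ... while get_vocab() additionally contains the added/special tokens on top of it
assert len(tokenizer.get_vocab()) == 1_005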
def test_rust_and_python_bpe_tokenizers(self):
tokenizer = CamembertTokenizer(SAMPLE_BPE_VOCAB)
@@ -122,7 +122,9 @@ class CanineTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
# We can add a new special token for Canine as follows:
new_additional_special_token = chr(0xE007)
additional_special_tokens.append(new_additional_special_token)
tokenizer.add_special_tokens({"additional_special_tokens": additional_special_tokens})
tokenizer.add_special_tokens(
{"additional_special_tokens": additional_special_tokens}, replace_additional_special_tokens=False
)
before_tokens = tokenizer.encode(sample_text, add_special_tokens=False)
tokenizer.save_pretrained(tmpdirname)
@@ -167,11 +169,7 @@ class CanineTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
with self.subTest(f"{tokenizer.__class__.__name__}"):
SPECIAL_TOKEN_1 = chr(0xE005)
SPECIAL_TOKEN_2 = chr(0xE006)
# `add_tokens` method stores special tokens only in `tokenizer.unique_no_split_tokens`. (in tokenization_utils.py)
tokenizer.add_tokens([SPECIAL_TOKEN_1], special_tokens=True)
# `add_special_tokens` method stores special tokens in `tokenizer.additional_special_tokens`,
# which also occur in `tokenizer.all_special_tokens`. (in tokenization_utils_base.py)
tokenizer.add_special_tokens({"additional_special_tokens": [SPECIAL_TOKEN_2]})
token_1 = tokenizer.tokenize(SPECIAL_TOKEN_1)
@@ -65,6 +65,10 @@ class CodeLlamaTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokenizer.pad_token = tokenizer.eos_token
tokenizer.save_pretrained(self.tmpdirname)
def get_tokenizers(self, **kwargs):
kwargs.update({"pad_token": "<PAD>"})
return super().get_tokenizers(**kwargs)
def test_no_infilling_init(self):
tokenizer = CodeLlamaTokenizer(SAMPLE_VOCAB, prefix_token=None, keep_accents=True)
with self.assertRaises(ValueError):
@@ -518,7 +522,7 @@ class LlamaIntegrationTest(unittest.TestCase):
def test_special_token_special_word(self):
# the word inform should be split as ['in', 'form']
tokenizer = CodeLlamaTokenizer.from_pretrained("codellama/CodeLlama-7b-hf", legacy=False)
tokenizer.add_tokens(["<REPR_END>"], special_tokens=True)
tokenizer.add_tokens(["<REPR_END>"], special_tokens=False)
out1 = tokenizer.decode(
tokenizer.encode("<REPR_END>inform", add_special_tokens=False), spaces_between_special_tokens=False
)
@@ -526,7 +530,8 @@ class LlamaIntegrationTest(unittest.TestCase):
out2 = tokenizer.decode(
tokenizer.encode("<REPR_END>inform", add_special_tokens=False), spaces_between_special_tokens=True
)
self.assertEqual(out2, " <REPR_END> inform")
# the '▁' prefix token (29871) added in front of the special token should not be decoded as a leading space
self.assertEqual(out2, "<REPR_END> inform")
input_ids = tokenizer.encode("<REPR_END>inform", add_special_tokens=False)
self.assertEqual(input_ids, [29871, 32016, 262, 689]) # 29871 is the spiece underline, '▁'
@@ -244,8 +244,8 @@ class CodeGenTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
decode_s = tokenizer.decode(out_s.input_ids)
decode_s2 = tokenizer.batch_decode(out_s2.input_ids)
self.assertEqual(decode_s.split()[0], bos_token)
self.assertTrue(all(d.split()[0] == bos_token for d in decode_s2))
self.assertTrue(decode_s.startswith(bos_token))
self.assertTrue(all(d.startswith(bos_token) for d in decode_s2))
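For context, a hedged illustration of why `startswith` is the safer check here, assuming the decoded string is no longer guaranteed to contain a space right after the BOS token.

decoded = "<|endoftext|>hello"              # possible decode output with no space after BOS
assert decoded.startswith("<|endoftext|>")  # the new check still passes
# decoded.split()[0] would be "<|endoftext|>hello", so the old equality check could fail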
@slow
def test_truncation(self):
@@ -258,6 +258,7 @@ class CodeGenTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
truncation_pattern = ["^#", re.escape("<|endoftext|>"), "^'''", '^"""', "\n\n\n"]
decoded_text = tokenizer.decode(input_ids, truncate_before_pattern=truncation_pattern)
self.assertEqual(decoded_text, expected_trucated_text)
# TODO @ArthurZ outputs of the fast tokenizer are different in this case, unrelated to this PR
# tokenizer has no padding token
def test_padding_different_model_input_name(self):
@@ -68,12 +68,12 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokens_target = ["▁hello", "!", "how", "▁are", "▁you", "?"]
# fmt: on
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, do_lower_case=True)
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=True)
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(tokens, tokens_target)
rust_tokenizer = DebertaV2TokenizerFast(SAMPLE_VOCAB, do_lower_case=True)
rust_tokenizer = DebertaV2TokenizerFast(SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=True)
rust_tokens = rust_tokenizer.convert_ids_to_tokens(rust_tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(rust_tokens, tokens_target)
@@ -92,12 +92,12 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokens_target = ["▁", "<unk>", "▁was", "▁born", "▁in", "▁9", "2000", "▁", ",", "▁and", "▁this", "▁is", "▁fal", "s", "<unk>", "▁", ".", ]
# fmt: on
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, split_by_punct=True)
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>", split_by_punct=True)
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(tokens, tokens_target)
rust_tokenizer = DebertaV2TokenizerFast(SAMPLE_VOCAB, split_by_punct=True)
rust_tokenizer = DebertaV2TokenizerFast(SAMPLE_VOCAB, unk_token="<unk>", split_by_punct=True)
rust_tokens = rust_tokenizer.convert_ids_to_tokens(rust_tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(rust_tokens, tokens_target)
@@ -108,11 +108,13 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokens_target = ["▁i", "▁was", "▁born", "▁in", "▁9", "2000", "▁", ",", "▁and", "▁this", "▁is", "▁fal", "s", "<unk>", "▁", ".", ]
# fmt: on
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, do_lower_case=True, split_by_punct=True)
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=True, split_by_punct=True)
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(tokens, tokens_target)
rust_tokenizer = DebertaV2TokenizerFast(SAMPLE_VOCAB, do_lower_case=True, split_by_punct=True)
rust_tokenizer = DebertaV2TokenizerFast(
SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=True, split_by_punct=True
)
rust_tokens = rust_tokenizer.convert_ids_to_tokens(rust_tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(rust_tokens, tokens_target)
@@ -122,12 +124,14 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokens_target = ["▁i", "▁was", "▁born", "▁in", "▁9", "2000", ",", "▁and", "▁this", "▁is", "▁fal", "s", "<unk>", ".", ]
# fmt: on
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, do_lower_case=True, split_by_punct=False)
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=True, split_by_punct=False)
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(tokens, tokens_target)
rust_tokenizer = DebertaV2TokenizerFast(SAMPLE_VOCAB, do_lower_case=True, split_by_punct=False)
rust_tokenizer = DebertaV2TokenizerFast(
SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=True, split_by_punct=False
)
rust_tokens = rust_tokenizer.convert_ids_to_tokens(rust_tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(rust_tokens, tokens_target)
@@ -138,12 +142,14 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokens_target = ["▁", "<unk>", "▁was", "▁born", "▁in", "▁9", "2000", "▁", ",", "▁and", "▁this", "▁is", "▁fal", "s", "<unk>", "▁", ".", ]
# fmt: on
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, do_lower_case=False, split_by_punct=True)
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=False, split_by_punct=True)
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(tokens, tokens_target)
rust_tokenizer = DebertaV2TokenizerFast(SAMPLE_VOCAB, do_lower_case=False, split_by_punct=True)
rust_tokenizer = DebertaV2TokenizerFast(
SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=False, split_by_punct=True
)
rust_tokens = rust_tokenizer.convert_ids_to_tokens(rust_tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(rust_tokens, tokens_target)
@@ -154,12 +160,14 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokens_target = ["▁", "<unk>", "e", "<unk>", "o", "!", "how", "▁", "<unk>", "re", "▁yo", "<unk>", "?"]
# fmt: on
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, do_lower_case=False, split_by_punct=False)
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=False, split_by_punct=False)
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(tokens, tokens_target)
rust_tokenizer = DebertaV2TokenizerFast(SAMPLE_VOCAB, do_lower_case=False, split_by_punct=False)
rust_tokenizer = DebertaV2TokenizerFast(
SAMPLE_VOCAB, unk_token="<unk>", do_lower_case=False, split_by_punct=False
)
rust_tokens = rust_tokenizer.convert_ids_to_tokens(rust_tokenizer.encode(sequence, add_special_tokens=False))
self.assertListEqual(rust_tokens, tokens_target)
@@ -189,8 +197,8 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokens_target = ["▁", "T", "his", "▁is", "▁a", "▁test"]
back_tokens_target = ["▁", "<unk>", "his", "▁is", "▁a", "▁test"]
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, keep_accents=True)
rust_tokenizer = DebertaV2TokenizerFast(SAMPLE_VOCAB, keep_accents=True)
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>", keep_accents=True)
rust_tokenizer = DebertaV2TokenizerFast(SAMPLE_VOCAB, unk_token="<unk>", keep_accents=True)
ids = tokenizer.encode(sequence, add_special_tokens=False)
self.assertListEqual(ids, ids_target)
@@ -243,8 +243,8 @@ class GPT2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
decode_s = tokenizer.decode(out_s.input_ids)
decode_s2 = tokenizer.batch_decode(out_s2.input_ids)
self.assertEqual(decode_s.split()[0], bos_token)
self.assertTrue(all(d.split()[0] == bos_token for d in decode_s2))
self.assertTrue(decode_s.startswith(bos_token))
self.assertTrue(all(d.startswith(bos_token) for d in decode_s2))
# tokenizer has no padding token
def test_padding_different_model_input_name(self):
@@ -145,10 +145,10 @@ class GPTSw3TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokenized_chats = [tokenizer.apply_chat_template(test_chat) for test_chat in test_chats]
# fmt: off
expected_tokens = [
[268, 63, 127, 462, 276, 294, 348, 536, 797, 275, 127, 65, 63, 263, 65, 938, 541, 419, 530, 339, 265, 878, 708, 727, 275, 347, 541, 260, 63, 263, 65, 1256, 263, 314, 419, 366, 354, 294, 360, 63, 263, 65, 938, 541, 419, ],
[268, 63, 127, 462, 276, 294, 348, 536, 797, 275, 127, 65, 63, 263, 65, 938, 541, 419, 530, 339, 265, 878, 708, 727, 275, 347, 541, 260, 63, 263, 65, 1256, 263, 314, 419, 366, 354, 294, 360, 63, 263, 65, 938, 541, 419, 984, 429, 281, 264, 1261, 291, 260, 63, 263, 65, 938, 541, 419, ],
[268, 63, 127, 462, 276, 294, 348, 536, 797, 275, 127, 65, 63, 263, 65, 938, 541, 419, 984, 429, 281, 264, 1261, 291, 260, 63, 263, 65, 1256, 263, 314, 419, 366, 354, 294, 360, 63, 263, 65, 938, 541, 419, ]
]
[2000, 1, 575, 541, 419, 530, 339, 265, 878, 708, 727, 275, 347, 541, 260, 1, 968, 263, 314, 419, 366, 354, 294, 360, 1, 575, 541, 419],
[2000, 1, 575, 541, 419, 530, 339, 265, 878, 708, 727, 275, 347, 541, 260, 1, 968, 263, 314, 419, 366, 354, 294, 360, 1, 575, 541, 419, 984, 429, 281, 264, 1261, 291, 260, 1, 575, 541, 419],
[2000, 1, 575, 541, 419, 984, 429, 281, 264, 1261, 291, 260, 1, 968, 263, 314, 419, 366, 354, 294, 360, 1, 575, 541, 419]
]
# fmt: on
for tokenized_chat, expected_tokens in zip(tokenized_chats, expected_tokens):
self.assertListEqual(tokenized_chat, expected_tokens)
@@ -210,9 +210,9 @@ class GPTSanJapaneseTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokenized_chats = [tokenizer.apply_chat_template(test_chat) for test_chat in test_chats]
# fmt: off
expected_tokens = [
[35993, 35998, 35637, 35659, 35665, 35716, 35645, 35662, 35649, 35716, 35645, 35716, 35652, 35649, 35656, 35660, 35650, 35665, 35656, 35716, 35647, 35652, 35645, 35664, 35646, 35659, 35664, 35595, 35999, 35993, 35998, 35620, 35649, 35656, 35656, 35659, 35582, 35999],
[35993, 35998, 35637, 35659, 35665, 35716, 35645, 35662, 35649, 35716, 35645, 35716, 35652, 35649, 35656, 35660, 35650, 35665, 35656, 35716, 35647, 35652, 35645, 35664, 35646, 35659, 35664, 35595, 35999, 35993, 35998, 35620, 35649, 35656, 35656, 35659, 35582, 35999, 35993, 35998, 35626, 35653, 35647, 35649, 35716, 35664, 35659, 35716, 35657, 35649, 35649, 35664, 35716, 35669, 35659, 35665, 35595, 35999],
[35993, 35998, 35626, 35653, 35647, 35649, 35716, 35664, 35659, 35716, 35657, 35649, 35649, 35664, 35716, 35669, 35659, 35665, 35595, 35999, 35993, 35998, 35620, 35649, 35656, 35656, 35659, 35582, 35999],
[35993, 35998, 35637, 35659, 35665, 35716, 35645, 35662, 35649, 35716, 35645, 35716, 35652, 35649, 35656, 35660, 35650, 35665, 35656, 35716, 35647, 35652, 35645, 35664, 35646, 35659, 35664, 35595, 35716, 35999, 35993, 35998, 35620, 35649, 35656, 35656, 35659, 35582, 35716, 35999],
[35993, 35998, 35637, 35659, 35665, 35716, 35645, 35662, 35649, 35716, 35645, 35716, 35652, 35649, 35656, 35660, 35650, 35665, 35656, 35716, 35647, 35652, 35645, 35664, 35646, 35659, 35664, 35595, 35716, 35999, 35993, 35998, 35620, 35649, 35656, 35656, 35659, 35582, 35716, 35999, 35993, 35998, 35626, 35653, 35647, 35649, 35716, 35664, 35659, 35716, 35657, 35649, 35649, 35664, 35716, 35669, 35659, 35665, 35595, 35716, 35999],
[35993, 35998, 35626, 35653, 35647, 35649, 35716, 35664, 35659, 35716, 35657, 35649, 35649, 35664, 35716, 35669, 35659, 35665, 35595, 35716, 35999, 35993, 35998, 35620, 35649, 35656, 35656, 35659, 35582, 35716, 35999]
]
# fmt: on
for tokenized_chat, expected_tokens in zip(tokenized_chats, expected_tokens):