Unverified Commit 2da88537 authored by Arthur, committed by GitHub

🚨🚨 🚨🚨 [`Tokenizer`] attempt to fix add_token issues 🚨🚨 🚨🚨 (#23909)



* fix test for bart. Order is correct now let's skip BPEs

* ouf

* styling

* fix bert....

* slow refactoring

* current updates

* massive refactoring

* update

* NICE!

* update to see where I am at

* updates

* update

* update

* revert

* updates

* updates

* start supporting legacy_save

* styling

* big update

* revert some changes

* nits

* nniiiiiice

* small fixes

* kinda fix t5 with new behaviour

* major update

* fixup

* fix copies

* today's updates

* fix byt5

* update

* update

* update

* updates

* update vocab size test

* Barthez does not need the fairseq offset ids

* super call must be after

* call super

* move all super init (the resulting pattern is sketched below, after the commit metadata)

* move other super init

* fixup

* nits

* more fixes

* nits

* more fixes

* nits

* more fix

* remove useless files

* ouch all of them are affected

* and more!

* small improvements

* no more sanitize token

* more changes around unique no split tokens

* partially fix more things

* keep legacy save but add warning

* so... more fixes

* updates

* guess deberta tokenizer could be nuked

* fixup

* fixup did some bad things

* nuke it if it breaks

* remove prints and pretrain fast from slow with new format.

* fixups

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fiou

* nit

* by default specials should not be normalized?

* update

* remove breakpoint

* updates

* a lot of updates

* fixup

* fixes: revert some changes to match fast

* small nits

* that makes it cleaner

* fix camembert accordingly

* update

* some less breaking changes

* update

* fixup

* fix byt5 and whisper mostly

* some more fixes, canine's byte vocab

* fix gpt2

* fix most of the perceiver tests (4 left)

* fix layout lmv3

* fixup

* fix copies for gpt2 style

* make sure to only warn once

* fix perceiver and gpt2 tests

* some more backward compatibility: also read the special tokens map because some people use it

* fixup

* add else when reading

* nits

* fresh updates

* fix copies

* will this make everything faster?

* fixes

* more fixes

* update

* more fixes

* fixup

* is the source of truth right?

* sorry camembert for the troubles

* current updates

* fixup

* update led

* update

* fix regression

* fix single word

* more model specific fixes

* fix t5 tests

* fixup

* more comments

* update

* fix nllb

* rstrip removed

* small fixes

* better handle additional_special_tokens and vocab sizes

* fixing

* styling

* fix 4 / 21

* fixup

* fix nllb's tests

* some fixes

* fix t5

* fixes

* style

* fix canine tests

* damn this is nice

* nits

* m2m100 nit

* fixups

* fixes!

* fixup

* stash

* fix merge

* revert bad change

* fixup

* correct order for Code Llama

* fix speecht5 post merge

* styling

* revert source of 11 fails

* small nits

* all changes in one go

* fnet hack

* fix 2 more tests

* update based on main branch of tokenizers

* fixup

* fix VITS issues

* more fixes

* fix mgp test

* fix camembert issues

* oops, camembert still has 2 failing tests

* mluke fixes

* decode fixes

* small nits

* nits

* fix llama and vits

* fix camembert

* small nits

* more fixes when initialising a fast from a slow, etc.

* fix one of the last tests

* fix CPM tokenizer test

* fixups

* fix pop2piano

* fixup

* Change tokenizers required version

* Change tokenizers required version

* "tokenizers>=0.14,<0.15", don't forget smaller than

* fix musicgen tests and PreTrainedTokenizerFast

* fix owlvit and all

* update t5

* fix 800 red

* fix tests

* fix the fix of the fix of t5

* styling

* documentation nits

* cache _added_tokens_encoder

* fixups

* Nit

* fix red tests

* one last nit!

* make everything a lot simpler

* Now it's over 😉



* few small nits

* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* updates that work for now

* tests that should not be skipped / changed and fixed next

* fixup

* i am ashamed

* push the fix

* update

* fixups

* nits

* fix added_tokens_encoder

* fix canine test

* fix pegasus vocab

* fix transfoXL

* fixup

* whisper needs to be fixed for train new

* pegasus nits

* more pegasus fixes

* minor update

* better error message in failed test

* fix whisper failing test

* fix whisper failing test

* fix pegasus

* fixup

* fix **** pegasus

* reset things

* remove another file

* attempts to fix the strange custom encoder and offset

* nits here and there

* update

* fixup

* nit

* fix the whisper test

* nits nits

* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* updates based on review

* some small update to potentially remove

* nits

* import lru cache

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Lysandre Debut <hi@lysand.re>

* move warning to `from_pretrained`

* update test results now that the special tokens are always added

---------
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Lysandre Debut <hi@lysand.re>
parent 835b0a05
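
The change repeated across nearly every file in the diff below is the same: the `super().__init__(...)` call moves from the top of each slow tokenizer's `__init__` to the end, after the vocabulary and other model-specific state have been loaded, so the base class can resolve special and added tokens against a fully built tokenizer. Below is a minimal sketch of the resulting pattern; the class name, vocab format, and token handling are hypothetical illustrations that assume the post-refactor `PreTrainedTokenizer` base class, not any specific model touched by this commit.

```python
from transformers import PreTrainedTokenizer


class MySlowTokenizer(PreTrainedTokenizer):
    """Hypothetical slow tokenizer illustrating the new init order."""

    def __init__(self, vocab_file, unk_token="<unk>", **kwargs):
        # 1. Load the model-specific state first (vocab, merges, sp_model, ...).
        with open(vocab_file, encoding="utf-8") as handle:
            self.encoder = {line.strip(): i for i, line in enumerate(handle)}
        self.decoder = {v: k for k, v in self.encoder.items()}

        # 2. Only then call the base class: it can now resolve special and
        #    added tokens against the vocabulary built above.
        super().__init__(unk_token=unk_token, **kwargs)

    @property
    def vocab_size(self):
        return len(self.encoder)

    def get_vocab(self):
        vocab = dict(self.encoder)
        vocab.update(self.added_tokens_encoder)
        return vocab

    def _tokenize(self, text):
        return text.split()

    def _convert_token_to_id(self, token):
        return self.encoder.get(token, self.encoder.get(str(self.unk_token), 0))

    def _convert_id_to_token(self, index):
        return self.decoder.get(index, str(self.unk_token))
```

Usage would be along the lines of `tok = MySlowTokenizer("vocab.txt")`, assuming a plain one-token-per-line vocab file.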
......@@ -170,16 +170,7 @@ class GPT2Tokenizer(PreTrainedTokenizer):
eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
super().__init__(
errors=errors,
unk_token=unk_token,
bos_token=bos_token,
eos_token=eos_token,
pad_token=pad_token,
add_prefix_space=add_prefix_space,
add_bos_token=add_bos_token,
**kwargs,
)
self.add_bos_token = add_bos_token
with open(vocab_file, encoding="utf-8") as vocab_handle:
......@@ -198,6 +189,17 @@ class GPT2Tokenizer(PreTrainedTokenizer):
# Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
super().__init__(
errors=errors,
unk_token=unk_token,
bos_token=bos_token,
eos_token=eos_token,
pad_token=pad_token,
add_prefix_space=add_prefix_space,
add_bos_token=add_bos_token,
**kwargs,
)
@property
def vocab_size(self):
return len(self.encoder)
......
......@@ -127,14 +127,6 @@ class GPTNeoXJapaneseTokenizer(PreTrainedTokenizer):
do_clean_text=False,
**kwargs,
):
super().__init__(
unk_token=unk_token,
pad_token=pad_token,
bos_token=bos_token,
eos_token=eos_token,
do_clean_text=do_clean_text,
**kwargs,
)
if not os.path.isfile(vocab_file):
raise ValueError(
f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
......@@ -150,6 +142,14 @@ class GPTNeoXJapaneseTokenizer(PreTrainedTokenizer):
self.subword_tokenizer = SubWordJapaneseTokenizer(
vocab=self.vocab, ids_to_tokens=self.ids_to_tokens, emoji=self.emoji
)
super().__init__(
unk_token=unk_token,
pad_token=pad_token,
bos_token=bos_token,
eos_token=eos_token,
do_clean_text=do_clean_text,
**kwargs,
)
@property
def vocab_size(self):
......
......@@ -103,7 +103,7 @@ class GPTSw3Tokenizer(PreTrainedTokenizer):
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
model_input_names = ["input_ids", "attention_mask"]
model_input_names = ["input_ids", "token_type_ids", "attention_mask"]
def __init__(
self,
......@@ -138,18 +138,6 @@ class GPTSw3Tokenizer(PreTrainedTokenizer):
pad_token = "<pad>" if pad_token is None else pad_token
bos_token = "<s>" if bos_token is None else bos_token
super().__init__(
do_lower_case=do_lower_case,
remove_space=remove_space,
keep_accents=keep_accents,
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
pad_token=pad_token,
sp_model_kwargs=self.sp_model_kwargs,
**kwargs,
)
self.do_lower_case = do_lower_case
self.remove_space = remove_space
self.keep_accents = keep_accents
......@@ -168,6 +156,18 @@ class GPTSw3Tokenizer(PreTrainedTokenizer):
f"[{''.join(map(chr, list(range(0, 9)) + list(range(11, 32)) + list(range(127, 160)) + [160, 173, 8203]))}]"
)
super().__init__(
do_lower_case=do_lower_case,
remove_space=remove_space,
keep_accents=keep_accents,
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
pad_token=pad_token,
sp_model_kwargs=self.sp_model_kwargs,
**kwargs,
)
# Copied from transformers.models.albert.tokenization_albert.AlbertTokenizer.__getstate__
def __getstate__(self):
state = self.__dict__.copy()
......
......@@ -166,15 +166,6 @@ class GPTSanJapaneseTokenizer(PreTrainedTokenizer):
do_clean_text=False,
**kwargs,
):
super().__init__(
unk_token=unk_token,
pad_token=pad_token,
bos_token=bos_token,
eos_token=eos_token,
sep_token=sep_token,
do_clean_text=do_clean_text,
**kwargs,
)
if not os.path.isfile(vocab_file):
raise ValueError(
f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
......@@ -191,6 +182,16 @@ class GPTSanJapaneseTokenizer(PreTrainedTokenizer):
vocab=self.vocab, ids_to_tokens=self.ids_to_tokens, emoji=self.emoji
)
super().__init__(
unk_token=unk_token,
pad_token=pad_token,
bos_token=bos_token,
eos_token=eos_token,
sep_token=sep_token,
do_clean_text=do_clean_text,
**kwargs,
)
@property
# Copied from tokenization_gpt_neox_japanese.GPTNeoXJapaneseTokenizer.vocab_size
def vocab_size(self):
......
......@@ -334,21 +334,6 @@ class HerbertTokenizer(PreTrainedTokenizer):
id2lang=None,
**kwargs,
):
super().__init__(
unk_token=unk_token,
bos_token=bos_token,
sep_token=sep_token,
pad_token=pad_token,
cls_token=cls_token,
mask_token=mask_token,
additional_special_tokens=additional_special_tokens,
lang2id=lang2id,
id2lang=id2lang,
do_lowercase_and_remove_accent=do_lowercase_and_remove_accent,
tokenizer_file=None,
**kwargs,
)
try:
import sacremoses
except ImportError:
......@@ -383,6 +368,21 @@ class HerbertTokenizer(PreTrainedTokenizer):
self.bpe_ranks = dict(zip(merges, range(len(merges))))
self.cache = {}
super().__init__(
unk_token=unk_token,
bos_token=bos_token,
sep_token=sep_token,
pad_token=pad_token,
cls_token=cls_token,
mask_token=mask_token,
additional_special_tokens=additional_special_tokens,
lang2id=lang2id,
id2lang=id2lang,
do_lowercase_and_remove_accent=do_lowercase_and_remove_accent,
tokenizer_file=None,
**kwargs,
)
self.bert_pre_tokenizer = BasicTokenizer(
do_lower_case=False,
never_split=self.all_special_tokens,
......
......@@ -128,16 +128,10 @@ class JukeboxTokenizer(PreTrainedTokenizer):
**kwargs,
):
unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
super().__init__(
unk_token=unk_token,
n_genres=n_genres,
version=version,
max_n_lyric_tokens=max_n_lyric_tokens,
**kwargs,
)
self.version = version
self.max_n_lyric_tokens = max_n_lyric_tokens
self.n_genres = n_genres
self._added_tokens_decoder = {0: unk_token}
with open(artists_file, encoding="utf-8") as vocab_handle:
self.artists_encoder = json.load(vocab_handle)
......@@ -157,13 +151,24 @@ class JukeboxTokenizer(PreTrainedTokenizer):
self.artists_decoder = {v: k for k, v in self.artists_encoder.items()}
self.genres_decoder = {v: k for k, v in self.genres_encoder.items()}
self.lyrics_decoder = {v: k for k, v in self.lyrics_encoder.items()}
super().__init__(
unk_token=unk_token,
n_genres=n_genres,
version=version,
max_n_lyric_tokens=max_n_lyric_tokens,
**kwargs,
)
@property
def vocab_size(self):
return len(self.artists_encoder) + len(self.genres_encoder) + len(self.lyrics_encoder)
def get_vocab(self):
return dict(self.artists_encoder, self.genres_encoder, self.lyrics_encoder)
return {
"artists_encoder": self.artists_encoder,
"genres_encoder": self.genres_encoder,
"lyrics_encoder": self.lyrics_encoder,
}
def _convert_token_to_id(self, list_artists, list_genres, list_lyrics):
"""Converts the artist, genre and lyrics tokens to their index using the vocabulary.
......
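
The Jukebox hunk above pre-populates `self._added_tokens_decoder`, the index-to-`AddedToken` mapping that this refactor appears to treat as the source of truth for added tokens. A quick illustrative look at it from the public side; the checkpoint and extra token are examples only, not part of the commit:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.add_tokens(["<my_extra_token>"])

# added_tokens_decoder maps ids to AddedToken objects; added_tokens_encoder
# is the reverse view, mapping the token strings back to their ids.
print(tok.added_tokens_decoder)
print(tok.added_tokens_encoder)
```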
......@@ -134,20 +134,6 @@ class LayoutLMTokenizer(PreTrainedTokenizer):
strip_accents=None,
**kwargs,
):
super().__init__(
do_lower_case=do_lower_case,
do_basic_tokenize=do_basic_tokenize,
never_split=never_split,
unk_token=unk_token,
sep_token=sep_token,
pad_token=pad_token,
cls_token=cls_token,
mask_token=mask_token,
tokenize_chinese_chars=tokenize_chinese_chars,
strip_accents=strip_accents,
**kwargs,
)
if not os.path.isfile(vocab_file):
raise ValueError(
f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
......@@ -163,7 +149,22 @@ class LayoutLMTokenizer(PreTrainedTokenizer):
tokenize_chinese_chars=tokenize_chinese_chars,
strip_accents=strip_accents,
)
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=str(unk_token))
super().__init__(
do_lower_case=do_lower_case,
do_basic_tokenize=do_basic_tokenize,
never_split=never_split,
unk_token=unk_token,
sep_token=sep_token,
pad_token=pad_token,
cls_token=cls_token,
mask_token=mask_token,
tokenize_chinese_chars=tokenize_chinese_chars,
strip_accents=strip_accents,
**kwargs,
)
@property
def do_lower_case(self):
......
......@@ -244,27 +244,6 @@ class LayoutLMv2Tokenizer(PreTrainedTokenizer):
additional_special_tokens: Optional[List[str]] = None,
**kwargs,
):
super().__init__(
do_lower_case=do_lower_case,
do_basic_tokenize=do_basic_tokenize,
never_split=never_split,
unk_token=unk_token,
sep_token=sep_token,
pad_token=pad_token,
cls_token=cls_token,
mask_token=mask_token,
cls_token_box=cls_token_box,
sep_token_box=sep_token_box,
pad_token_box=pad_token_box,
pad_token_label=pad_token_label,
only_label_first_subword=only_label_first_subword,
tokenize_chinese_chars=tokenize_chinese_chars,
strip_accents=strip_accents,
model_max_length=model_max_length,
additional_special_tokens=additional_special_tokens,
**kwargs,
)
if not os.path.isfile(vocab_file):
raise ValueError(
f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
......@@ -280,7 +259,7 @@ class LayoutLMv2Tokenizer(PreTrainedTokenizer):
tokenize_chinese_chars=tokenize_chinese_chars,
strip_accents=strip_accents,
)
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=str(unk_token))
# additional properties
self.cls_token_box = cls_token_box
......@@ -288,6 +267,26 @@ class LayoutLMv2Tokenizer(PreTrainedTokenizer):
self.pad_token_box = pad_token_box
self.pad_token_label = pad_token_label
self.only_label_first_subword = only_label_first_subword
super().__init__(
do_lower_case=do_lower_case,
do_basic_tokenize=do_basic_tokenize,
never_split=never_split,
unk_token=unk_token,
sep_token=sep_token,
pad_token=pad_token,
cls_token=cls_token,
mask_token=mask_token,
cls_token_box=cls_token_box,
sep_token_box=sep_token_box,
pad_token_box=pad_token_box,
pad_token_label=pad_token_label,
only_label_first_subword=only_label_first_subword,
tokenize_chinese_chars=tokenize_chinese_chars,
strip_accents=strip_accents,
model_max_length=model_max_length,
additional_special_tokens=additional_special_tokens,
**kwargs,
)
@property
def do_lower_case(self):
......
......@@ -303,24 +303,6 @@ class LayoutLMv3Tokenizer(PreTrainedTokenizer):
# Mask token behave like a normal word, i.e. include the space before it
mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
super().__init__(
errors=errors,
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
sep_token=sep_token,
cls_token=cls_token,
pad_token=pad_token,
mask_token=mask_token,
add_prefix_space=add_prefix_space,
cls_token_box=cls_token_box,
sep_token_box=sep_token_box,
pad_token_box=pad_token_box,
pad_token_label=pad_token_label,
only_label_first_subword=only_label_first_subword,
**kwargs,
)
with open(vocab_file, encoding="utf-8") as vocab_handle:
self.encoder = json.load(vocab_handle)
self.decoder = {v: k for k, v in self.encoder.items()}
......@@ -344,6 +326,24 @@ class LayoutLMv3Tokenizer(PreTrainedTokenizer):
self.pad_token_label = pad_token_label
self.only_label_first_subword = only_label_first_subword
super().__init__(
errors=errors,
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
sep_token=sep_token,
cls_token=cls_token,
pad_token=pad_token,
mask_token=mask_token,
add_prefix_space=add_prefix_space,
cls_token_box=cls_token_box,
sep_token_box=sep_token_box,
pad_token_box=pad_token_box,
pad_token_label=pad_token_label,
only_label_first_subword=only_label_first_subword,
**kwargs,
)
@property
# Copied from transformers.models.roberta.tokenization_roberta.RobertaTokenizer.vocab_size
def vocab_size(self):
......@@ -351,7 +351,9 @@ class LayoutLMv3Tokenizer(PreTrainedTokenizer):
# Copied from transformers.models.roberta.tokenization_roberta.RobertaTokenizer.get_vocab
def get_vocab(self):
return dict(self.encoder, **self.added_tokens_encoder)
vocab = dict(self.encoder).copy()
vocab.update(self.added_tokens_encoder)
return vocab
# Copied from transformers.models.roberta.tokenization_roberta.RobertaTokenizer.bpe
def bpe(self, token):
......@@ -539,7 +541,7 @@ class LayoutLMv3Tokenizer(PreTrainedTokenizer):
if (
(is_split_into_words or add_prefix_space)
and (len(text) > 0 and not text[0].isspace())
and sum([text.startswith(no_split_token) for no_split_token in self.unique_no_split_tokens]) == 0
and sum([text.startswith(no_split_token) for no_split_token in self.added_tokens_encoder]) == 0
):
text = " " + text
return (text, kwargs)
......
......@@ -254,23 +254,6 @@ class LayoutXLMTokenizer(PreTrainedTokenizer):
self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
super().__init__(
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
sep_token=sep_token,
cls_token=cls_token,
pad_token=pad_token,
mask_token=mask_token,
cls_token_box=cls_token_box,
sep_token_box=sep_token_box,
pad_token_box=pad_token_box,
pad_token_label=pad_token_label,
only_label_first_subword=only_label_first_subword,
sp_model_kwargs=self.sp_model_kwargs,
**kwargs,
)
self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
self.sp_model.Load(str(vocab_file))
self.vocab_file = vocab_file
......@@ -297,6 +280,23 @@ class LayoutXLMTokenizer(PreTrainedTokenizer):
self.pad_token_label = pad_token_label
self.only_label_first_subword = only_label_first_subword
super().__init__(
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
sep_token=sep_token,
cls_token=cls_token,
pad_token=pad_token,
mask_token=mask_token,
cls_token_box=cls_token_box,
sep_token_box=sep_token_box,
pad_token_box=pad_token_box,
pad_token_label=pad_token_label,
only_label_first_subword=only_label_first_subword,
sp_model_kwargs=self.sp_model_kwargs,
**kwargs,
)
def __getstate__(self):
state = self.__dict__.copy()
state["sp_model"] = None
......
......@@ -197,21 +197,10 @@ class LEDTokenizer(PreTrainedTokenizer):
pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
# Mask token behave like a normal word, i.e. include the space before it
# TODO seems like both slow and fast actually don't strip left and right soooooooo yeah. See `test_embeded_special_tokens`
# Also this not only will strip the spaces but any punctuation
mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
super().__init__(
errors=errors,
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
sep_token=sep_token,
cls_token=cls_token,
pad_token=pad_token,
mask_token=mask_token,
add_prefix_space=add_prefix_space,
**kwargs,
)
with open(vocab_file, encoding="utf-8") as vocab_handle:
self.encoder = json.load(vocab_handle)
self.decoder = {v: k for k, v in self.encoder.items()}
......@@ -228,6 +217,19 @@ class LEDTokenizer(PreTrainedTokenizer):
# Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
super().__init__(
errors=errors,
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
sep_token=sep_token,
cls_token=cls_token,
pad_token=pad_token,
mask_token=mask_token,
add_prefix_space=add_prefix_space,
**kwargs,
)
@property
# Copied from transformers.models.bart.tokenization_bart.BartTokenizer.vocab_size
def vocab_size(self):
......
......@@ -152,6 +152,7 @@ class LEDTokenizerFast(PreTrainedTokenizerFast):
trim_offsets=True,
**kwargs,
):
mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
super().__init__(
vocab_file,
merges_file,
......
......@@ -122,20 +122,7 @@ class LlamaTokenizer(PreTrainedTokenizer):
eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
super().__init__(
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
pad_token=pad_token,
add_bos_token=add_bos_token,
add_eos_token=add_eos_token,
sp_model_kwargs=self.sp_model_kwargs,
clean_up_tokenization_spaces=clean_up_tokenization_spaces,
use_default_system_prompt=use_default_system_prompt,
spaces_between_special_tokens=spaces_between_special_tokens,
legacy=legacy,
**kwargs,
)
if legacy is None:
logger.warning_once(
f"You are using the default legacy behaviour of the {self.__class__}. If you see this, DO NOT PANIC! This is"
......@@ -151,9 +138,23 @@ class LlamaTokenizer(PreTrainedTokenizer):
self.add_bos_token = add_bos_token
self.add_eos_token = add_eos_token
self.use_default_system_prompt = use_default_system_prompt
self.sp_model = self.get_spm_processor()
super().__init__(
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
pad_token=pad_token,
add_bos_token=add_bos_token,
add_eos_token=add_eos_token,
sp_model_kwargs=self.sp_model_kwargs,
clean_up_tokenization_spaces=clean_up_tokenization_spaces,
use_default_system_prompt=use_default_system_prompt,
spaces_between_special_tokens=spaces_between_special_tokens,
legacy=legacy,
**kwargs,
)
@property
def unk_token_length(self):
return len(self.sp_model.encode(str(self.unk_token)))
......
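
As context for the `legacy` warning surfaced in the LlamaTokenizer hunk above, here is a rough, illustrative snippet of what the flag toggles. The checkpoint name is the internal testing repo referenced later in this diff, and the exact outputs depend on the installed `transformers`/`tokenizers` versions:

```python
from transformers import LlamaTokenizer

repo = "hf-internal-testing/llama-tokenizer"
legacy_tok = LlamaTokenizer.from_pretrained(repo, legacy=True)
new_tok = LlamaTokenizer.from_pretrained(repo, legacy=False)

# legacy=True keeps the old sentencepiece behaviour, which inserts an extra
# space after special tokens; legacy=False avoids it, so text that directly
# follows a special token can tokenize differently between the two.
print(legacy_tok.tokenize("</s>Hello"))
print(new_tok.tokenize("</s>Hello"))
```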
......@@ -33,6 +33,14 @@ else:
logger = logging.get_logger(__name__)
VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model", "tokenizer_file": "tokenizer.json"}
PRETRAINED_VOCAB_FILES_MAP = {
"vocab_file": {
"hf-internal-testing/llama-tokenizer": "https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer.model",
},
"tokenizer_file": {
"hf-internal-testing/llama-tokenizer": "https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer_config.json",
},
}
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
......@@ -93,6 +101,7 @@ class LlamaTokenizerFast(PreTrainedTokenizerFast):
"""
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
slow_tokenizer_class = LlamaTokenizer
padding_side = "left"
model_input_names = ["input_ids", "attention_mask"]
......
......@@ -212,28 +212,21 @@ class LongformerTokenizer(PreTrainedTokenizer):
**kwargs,
):
bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
sep_token = AddedToken(sep_token, lstrip=False, rstrip=False) if isinstance(sep_token, str) else sep_token
cls_token = AddedToken(cls_token, lstrip=False, rstrip=False) if isinstance(cls_token, str) else cls_token
unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
# Mask token behave like a normal word, i.e. include the space before it
mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
super().__init__(
errors=errors,
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
sep_token=sep_token,
cls_token=cls_token,
pad_token=pad_token,
mask_token=mask_token,
add_prefix_space=add_prefix_space,
**kwargs,
mask_token = (
AddedToken(mask_token, lstrip=True, rstrip=False, normalized=False)
if isinstance(mask_token, str)
else mask_token
)
# these special tokens are not part of the vocab.json, let's add them in the correct order
with open(vocab_file, encoding="utf-8") as vocab_handle:
self.encoder = json.load(vocab_handle)
self.decoder = {v: k for k, v in self.encoder.items()}
......@@ -250,12 +243,27 @@ class LongformerTokenizer(PreTrainedTokenizer):
# Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
super().__init__(
errors=errors,
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
sep_token=sep_token,
cls_token=cls_token,
pad_token=pad_token,
mask_token=mask_token,
add_prefix_space=add_prefix_space,
**kwargs,
)
@property
def vocab_size(self):
return len(self.encoder)
def get_vocab(self):
return dict(self.encoder, **self.added_tokens_encoder)
vocab = dict(self.encoder).copy()
vocab.update(self.added_tokens_encoder)
return vocab
def bpe(self, token):
if token in self.cache:
......
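
Several hunks, including the Longformer one above, now build the mask token with `normalized=False`. A small illustrative example of what those `AddedToken` flags control when registering a custom special token; the token string and checkpoint here are examples only, not part of the commit:

```python
from transformers import AddedToken, AutoTokenizer

tok = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

# lstrip=True: the space to the left of the token is absorbed by it;
# rstrip=False: text to the right is left untouched;
# normalized=False: the raw, un-normalized text is matched, so the token
# survives even when the tokenizer normalizes its input.
extra_mask = AddedToken("<extra_mask>", lstrip=True, rstrip=False, normalized=False)
tok.add_tokens([extra_mask], special_tokens=True)

print(tok.tokenize("hello <extra_mask> world"))
```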
......@@ -192,6 +192,11 @@ class LongformerTokenizerFast(PreTrainedTokenizerFast):
trim_offsets=True,
**kwargs,
):
mask_token = (
AddedToken(mask_token, lstrip=True, rstrip=False, normalized=False)
if isinstance(mask_token, str)
else mask_token
)
super().__init__(
vocab_file,
merges_file,
......
......@@ -326,28 +326,6 @@ class LukeTokenizer(PreTrainedTokenizer):
# Mask token behave like a normal word, i.e. include the space before it
mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
super().__init__(
errors=errors,
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
sep_token=sep_token,
cls_token=cls_token,
pad_token=pad_token,
mask_token=mask_token,
add_prefix_space=add_prefix_space,
task=task,
max_entity_length=32,
max_mention_length=30,
entity_token_1="<ent>",
entity_token_2="<ent2>",
entity_unk_token=entity_unk_token,
entity_pad_token=entity_pad_token,
entity_mask_token=entity_mask_token,
entity_mask2_token=entity_mask2_token,
**kwargs,
)
with open(vocab_file, encoding="utf-8") as vocab_handle:
self.encoder = json.load(vocab_handle)
self.decoder = {v: k for k, v in self.encoder.items()}
......@@ -407,6 +385,28 @@ class LukeTokenizer(PreTrainedTokenizer):
self.max_mention_length = max_mention_length
super().__init__(
errors=errors,
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
sep_token=sep_token,
cls_token=cls_token,
pad_token=pad_token,
mask_token=mask_token,
add_prefix_space=add_prefix_space,
task=task,
max_entity_length=32,
max_mention_length=30,
entity_token_1="<ent>",
entity_token_2="<ent2>",
entity_unk_token=entity_unk_token,
entity_pad_token=entity_pad_token,
entity_mask_token=entity_mask_token,
entity_mask2_token=entity_mask2_token,
**kwargs,
)
@property
# Copied from transformers.models.roberta.tokenization_roberta.RobertaTokenizer.vocab_size with Roberta->Luke, RoBERTa->LUKE
def vocab_size(self):
......@@ -414,7 +414,9 @@ class LukeTokenizer(PreTrainedTokenizer):
# Copied from transformers.models.roberta.tokenization_roberta.RobertaTokenizer.get_vocab with Roberta->Luke, RoBERTa->LUKE
def get_vocab(self):
return dict(self.encoder, **self.added_tokens_encoder)
vocab = dict(self.encoder).copy()
vocab.update(self.added_tokens_encoder)
return vocab
# Copied from transformers.models.roberta.tokenization_roberta.RobertaTokenizer.bpe with Roberta->Luke, RoBERTa->LUKE
def bpe(self, token):
......
......@@ -126,20 +126,6 @@ class LxmertTokenizer(PreTrainedTokenizer):
strip_accents=None,
**kwargs,
):
super().__init__(
do_lower_case=do_lower_case,
do_basic_tokenize=do_basic_tokenize,
never_split=never_split,
unk_token=unk_token,
sep_token=sep_token,
pad_token=pad_token,
cls_token=cls_token,
mask_token=mask_token,
tokenize_chinese_chars=tokenize_chinese_chars,
strip_accents=strip_accents,
**kwargs,
)
if not os.path.isfile(vocab_file):
raise ValueError(
f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
......@@ -155,7 +141,22 @@ class LxmertTokenizer(PreTrainedTokenizer):
tokenize_chinese_chars=tokenize_chinese_chars,
strip_accents=strip_accents,
)
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=str(unk_token))
super().__init__(
do_lower_case=do_lower_case,
do_basic_tokenize=do_basic_tokenize,
never_split=never_split,
unk_token=unk_token,
sep_token=sep_token,
pad_token=pad_token,
cls_token=cls_token,
mask_token=mask_token,
tokenize_chinese_chars=tokenize_chinese_chars,
strip_accents=strip_accents,
**kwargs,
)
@property
def do_lower_case(self):
......
......@@ -150,26 +150,11 @@ class M2M100Tokenizer(PreTrainedTokenizer):
fairseq_language_code = FAIRSEQ_LANGUAGE_CODES[language_codes]
self.lang_code_to_token = {lang_code: f"__{lang_code}__" for lang_code in fairseq_language_code}
kwargs["additional_special_tokens"] = kwargs.get("additional_special_tokens", [])
kwargs["additional_special_tokens"] += [
self.get_lang_token(lang_code)
for lang_code in fairseq_language_code
if self.get_lang_token(lang_code) not in kwargs["additional_special_tokens"]
]
super().__init__(
src_lang=src_lang,
tgt_lang=tgt_lang,
bos_token=bos_token,
eos_token=eos_token,
sep_token=sep_token,
unk_token=unk_token,
pad_token=pad_token,
language_codes=language_codes,
sp_model_kwargs=self.sp_model_kwargs,
num_madeup_words=num_madeup_words,
**kwargs,
)
additional_special_tokens = kwargs.pop("additional_special_tokens", [])
for lang_code in fairseq_language_code:
token = self.get_lang_token(lang_code)
if token not in additional_special_tokens and lang_code not in str(token) not in self.added_tokens_encoder:
additional_special_tokens.append(token)
self.vocab_file = vocab_file
self.encoder = load_json(vocab_file)
......@@ -188,13 +173,33 @@ class M2M100Tokenizer(PreTrainedTokenizer):
self._src_lang = src_lang if src_lang is not None else "en"
self.tgt_lang = tgt_lang
self.cur_lang_id = self.get_lang_id(self._src_lang)
self.set_src_lang_special_tokens(self._src_lang)
self.num_madeup_words = num_madeup_words
super().__init__(
src_lang=src_lang,
tgt_lang=tgt_lang,
bos_token=bos_token,
eos_token=eos_token,
sep_token=sep_token,
unk_token=unk_token,
pad_token=pad_token,
language_codes=language_codes,
sp_model_kwargs=self.sp_model_kwargs,
additional_special_tokens=additional_special_tokens,
num_madeup_words=num_madeup_words,
**kwargs,
)
self.set_src_lang_special_tokens(self._src_lang)
@property
def vocab_size(self) -> int:
return len(self.encoder) + len(self.lang_token_to_id)
return len(self.encoder)
def get_vocab(self) -> Dict:
vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
vocab.update(self.added_tokens_encoder)
return vocab
@property
def src_lang(self) -> str:
......@@ -290,11 +295,6 @@ class M2M100Tokenizer(PreTrainedTokenizer):
# We don't expect to process pairs, but leave the pair logic for API consistency
return self.prefix_tokens + token_ids_0 + token_ids_1 + self.suffix_tokens
def get_vocab(self) -> Dict:
vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
vocab.update(self.added_tokens_encoder)
return vocab
def __getstate__(self) -> Dict:
state = self.__dict__.copy()
state["sp_model"] = None
......
......@@ -144,26 +144,13 @@ class MarianTokenizer(PreTrainedTokenizer):
) -> None:
self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
super().__init__(
# bos_token=bos_token, unused. Start decoding with config.decoder_start_token_id
source_lang=source_lang,
target_lang=target_lang,
unk_token=unk_token,
eos_token=eos_token,
pad_token=pad_token,
model_max_length=model_max_length,
sp_model_kwargs=self.sp_model_kwargs,
target_vocab_file=target_vocab_file,
separate_vocabs=separate_vocabs,
**kwargs,
)
assert Path(source_spm).exists(), f"cannot find spm source {source_spm}"
self.separate_vocabs = separate_vocabs
self.encoder = load_json(vocab)
if self.unk_token not in self.encoder:
raise KeyError("<unk> token must be in vocab")
assert self.pad_token in self.encoder
if unk_token not in self.encoder:
raise KeyError("<unk> token must be in the vocab")
assert pad_token in self.encoder
if separate_vocabs:
self.target_encoder = load_json(target_vocab_file)
......@@ -187,6 +174,20 @@ class MarianTokenizer(PreTrainedTokenizer):
self._setup_normalizer()
super().__init__(
# bos_token=bos_token, unused. Start decoding with config.decoder_start_token_id
source_lang=source_lang,
target_lang=target_lang,
unk_token=unk_token,
eos_token=eos_token,
pad_token=pad_token,
model_max_length=model_max_length,
sp_model_kwargs=self.sp_model_kwargs,
target_vocab_file=target_vocab_file,
separate_vocabs=separate_vocabs,
**kwargs,
)
def _setup_normalizer(self):
try:
from sacremoses import MosesPunctNormalizer
......