"git@developer.sourcefind.cn:chenpangpang/transformers.git" did not exist on "48101cf8d127bbf22d751c7df118a6ce357e2e27"
Unverified Commit 2da88537 authored by Arthur, committed by GitHub

🚨🚨 🚨🚨 [`Tokenizer`] attempt to fix add_token issues 🚨🚨 🚨🚨 (#23909)



* fix test for bart. Order is correct now let's skip BPEs

* ouf

* styling

* fix bert....

* slow refactoring

* current updates

* massive refactoring

* update

* NICE!

* update to see where I am at

* updates

* update

* update

* revert

* updates

* updates

* start supporting legacy_save

* styling

* big update

* revert some changes

* nits

* nniiiiiice

* small fixes

* kinda fix t5 with new behaviour

* major update

* fixup

* fix copies

* today's updates

* fix byt5

* update

* update

* update

* updates

* update vocab size test

* Barthez does not need the fairseq offset ids

* super call must be after

* call super

* move all super init

* move other super init

* fixup

* nits

* more fixes

* nits

* more fixes

* nits

* more fix

* remove useless files

* ouch all of them are affected

* and more!

* small improvements

* no more sanitize token

* more changes around unique no split tokens

* partially fix more things

* keep legacy save but add warning

* so... more fixes

* updates

* guess deberta tokenizer could be nuked

* fixup

* fixup did some bad things

* nuke it if it breaks

* remove prints and pretrain fast from slow with new format.

* fixups

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fiou

* nit

* by default specials should not be normalized?

* update

* remove breakpoint

* updates

* a lot of updates

* fixup

* fixes revert some changes to match fast

* small nits

* that makes it cleaner

* fix camembert accordingly

* update

* some less breaking changes

* update

* fixup

* fix byt5 and whisper mostly

* some more fixes, canine's byte vocab

* fix gpt2

* fix most of the perceiver tests (4 left)

* fix layout lmv3

* fixup

* fix copies for gpt2 style

* make sure to only warn once

* fix perceiver and gpt2 tests

* some more backward compatibility: also read the special tokens map because some people still use it

* fixup

* add else when reading

* nits

* fresh updates

* fix copies

* will this make everything faster?

* fixes

* more fixes

* update

* more fixes

* fixup

* is the source of truth right?

* sorry camembert for the troubles

* current updates

* fixup

* update led

* update

* fix regression

* fix single word

* more model specific fixes

* fix t5 tests

* fixup

* more comments

* update

* fix nllb

* rstrip removed

* small fixes

* better handle additional_special_tokens and vocab sizes

* fixing

* styling

* fix 4 / 21

* fixup

* fix nllb's tests

* some fixes

* fix t5

* fixes

* style

* fix canine tests

* damn this is nice

* nits

* m2m100 nit

* fixups

* fixes!

* fixup

* stash

* fix merge

* revert bad change

* fixup

* correct order for code Llama

* fix speecht5 post merge

* styling

* revert source of 11 fails

* small nits

* all changes in one go

* fnet hack

* fix 2 more tests

* update based on main branch of tokenizers

* fixup

* fix VITS issues

* more fixes

* fix mgp test

* fix camembert issues

* oups camembert still has 2 failing tests

* mluke fixes

* decode fixes

* small nits

* nits

* fix llama and vits

* fix camembert

* small nits

* more fixes when initialising a fast tokenizer from a slow one, etc.

* fix one of the last test

* fix CPM tokenizer test

* fixups

* fix pop2piano

* fixup

* Change the required `tokenizers` version

* Change the required `tokenizers` version

* "tokenizers>=0.14,<0.15", don't forget smaller than

* fix musicgen tests and PreTrainedTokenizerFast

* fix owlvit and all

* update t5

* fix 800 red

* fix tests

* fix the fix of the fix of t5

* styling

* documentation nits

* cache _added_tokens_encoder

* fixups

* Nit

* fix red tests

* one last nit!

* make everything a lot simpler

* Now it's over 😉



* few small nits

* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* updates that work for now

* tests that should not be skipped / changed and fixed next

* fixup

* i am ashamed

* push the fix

* update

* fixups

* nits

* fix added_tokens_encoder

* fix canine test

* fix pegasus vocab

* fix transfoXL

* fixup

* whisper needs to be fixed for train new

* pegasus nits

* more pegasus fixes

* minor update

* better error message in failed test

* fix whisper failing test

* fix whisper failing test

* fix pegasus

* fixup

* fix **** pegasus

* reset things

* remove another file

* attempts to fix the strange custom encoder and offset

* nits here and there

* update

* fixup

* nit

* fix the whisper test

* nits nits

* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* updates based on review

* some small update to potentially remove

* nits

* import lru_cache

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Lysandre Debut <hi@lysand.re>

* move warning to `from_pretrained`

* update tests results now that the special tokens are always added

---------
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Lysandre Debut <hi@lysand.re>
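
The net effect of all of the above, for users of slow tokenizers, is that special tokens are registered through `added_tokens_decoder` and are always added to the vocabulary. A hedged usage sketch (any public checkpoint works; the exact contents printed depend on the model and on having this version of the library installed):

```python
# Hedged sketch: inspect how a slow tokenizer exposes its special/added tokens
# after this change. Requires network access to download the checkpoint.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("camembert-base", use_fast=False)

# index -> AddedToken mapping, e.g. {0: AddedToken("<s>NOTUSED"), 1: AddedToken("<pad>"), ...}
print(tok.added_tokens_decoder)

# user-added tokens go through the same machinery
tok.add_tokens(["<new_token>"])
print(tok.convert_tokens_to_ids("<new_token>"))
```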
parent 835b0a05
@@ -166,4 +166,4 @@ tags
 .DS_Store

 # ruff
 .ruff_cache
\ No newline at end of file
...
@@ -172,7 +172,7 @@ _deps = [
     "tf2onnx",
     "timeout-decorator",
     "timm",
-    "tokenizers>=0.11.1,!=0.11.3,<0.14",
+    "tokenizers>=0.14,<0.15",
     "torch>=1.10,!=1.12.0",
     "torchaudio",
     "torchvision",
...
@@ -78,7 +78,7 @@ deps = {
     "tf2onnx": "tf2onnx",
     "timeout-decorator": "timeout-decorator",
     "timm": "timm",
-    "tokenizers": "tokenizers>=0.11.1,!=0.11.3,<0.14",
+    "tokenizers": "tokenizers>=0.14,<0.15",
    "torch": "torch>=1.10,!=1.12.0",
    "torchaudio": "torchaudio",
    "torchvision": "torchvision",
...
@@ -159,6 +159,14 @@ class AlbertTokenizer(PreTrainedTokenizer):
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs

+        self.do_lower_case = do_lower_case
+        self.remove_space = remove_space
+        self.keep_accents = keep_accents
+        self.vocab_file = vocab_file
+
+        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+        self.sp_model.Load(vocab_file)
+
         super().__init__(
             do_lower_case=do_lower_case,
             remove_space=remove_space,
@@ -174,14 +182,6 @@ class AlbertTokenizer(PreTrainedTokenizer):
             **kwargs,
         )

-        self.do_lower_case = do_lower_case
-        self.remove_space = remove_space
-        self.keep_accents = keep_accents
-        self.vocab_file = vocab_file
-
-        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
-        self.sp_model.Load(vocab_file)
-
     @property
     def vocab_size(self) -> int:
         return len(self.sp_model)
@@ -228,6 +228,8 @@ class AlbertTokenizer(PreTrainedTokenizer):
         new_pieces = []
         for piece in pieces:
             if len(piece) > 1 and piece[-1] == str(",") and piece[-2].isdigit():
+                # Logic to handle special cases see https://github.com/google-research/bert/blob/master/README.md#tokenization
+                # `9,9` -> ['▁9', ',', '9'] instead of [`_9,`, '9']
                 cur_pieces = self.sp_model.EncodeAsPieces(piece[:-1].replace(SPIECE_UNDERLINE, ""))
                 if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:
                     if len(cur_pieces[0]) == 1:
...
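
The AlbertTokenizer hunks above show the pattern applied to nearly every slow tokenizer in this PR: the underlying vocabulary (here the SentencePiece model) is loaded *before* `super().__init__()` is called, because the base class now needs working `get_vocab`/`_convert_token_to_id` while it registers special and added tokens. A minimal sketch of a custom slow tokenizer written against that assumption; the class, vocab, and expected output are made up for illustration:

```python
# Illustrative only: a toy slow tokenizer following the new init order
# (vocab first, then super().__init__()), assuming the post-PR behaviour
# of PreTrainedTokenizer described in this commit message.
from transformers import PreTrainedTokenizer


class ToyTokenizer(PreTrainedTokenizer):
    def __init__(self, unk_token="<unk>", **kwargs):
        # 1. Build the vocabulary *before* calling the parent constructor,
        #    since the parent may call the methods below while adding tokens.
        self.encoder = {"<unk>": 0, "hello": 1, "world": 2}
        self.decoder = {i: t for t, i in self.encoder.items()}
        # 2. Only now hand the special tokens to the base class.
        super().__init__(unk_token=unk_token, **kwargs)

    @property
    def vocab_size(self):
        return len(self.encoder)

    def get_vocab(self):
        return dict(self.encoder, **self.added_tokens_encoder)

    def _tokenize(self, text):
        return text.split()

    def _convert_token_to_id(self, token):
        return self.encoder.get(token, self.encoder["<unk>"])

    def _convert_id_to_token(self, index):
        return self.decoder.get(index, "<unk>")


tok = ToyTokenizer()
print(tok("hello world")["input_ids"])  # expected here: [1, 2]
```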
@@ -204,21 +204,10 @@ class BartTokenizer(PreTrainedTokenizer):
         pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token

         # Mask token behave like a normal word, i.e. include the space before it
+        # TODO seems like both slow and fast actually don't strip left and right soooooooo yeah. See `test_embeded_special_tokens`
+        # Also this not only will strip the spaces but any punctuation
         mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token

-        super().__init__(
-            errors=errors,
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            add_prefix_space=add_prefix_space,
-            **kwargs,
-        )
-
         with open(vocab_file, encoding="utf-8") as vocab_handle:
             self.encoder = json.load(vocab_handle)
         self.decoder = {v: k for k, v in self.encoder.items()}
@@ -235,6 +224,19 @@ class BartTokenizer(PreTrainedTokenizer):
         # Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
         self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

+        super().__init__(
+            errors=errors,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            add_prefix_space=add_prefix_space,
+            **kwargs,
+        )
+
     @property
     def vocab_size(self):
         return len(self.encoder)
...
@@ -170,6 +170,7 @@ class BartTokenizerFast(PreTrainedTokenizerFast):
         trim_offsets=True,
         **kwargs,
     ):
+        mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
         super().__init__(
             vocab_file,
             merges_file,
...
@@ -47,6 +47,8 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
 SPIECE_UNDERLINE = "▁"


+# TODO this class is useless. This is the most standard sentencpiece model. Let's find which one is closest and nuke this.
 class BarthezTokenizer(PreTrainedTokenizer):
     """
@@ -141,6 +143,9 @@ class BarthezTokenizer(PreTrainedTokenizer):
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs

+        self.vocab_file = vocab_file
+        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+        self.sp_model.Load(str(vocab_file))
         super().__init__(
             bos_token=bos_token,
             eos_token=eos_token,
@@ -153,15 +158,6 @@ class BarthezTokenizer(PreTrainedTokenizer):
             **kwargs,
         )

-        self.vocab_file = vocab_file
-        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
-        self.sp_model.Load(str(vocab_file))
-
-        self.fairseq_tokens_to_ids = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
-        self.fairseq_tokens_to_ids["<mask>"] = len(self.sp_model) - 1
-        self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}
-
     def build_inputs_with_special_tokens(
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
@@ -251,16 +247,10 @@ class BarthezTokenizer(PreTrainedTokenizer):
     def _convert_token_to_id(self, token):
         """Converts a token (str) in an id using the vocab."""
-        if token in self.fairseq_tokens_to_ids:
-            return self.fairseq_tokens_to_ids[token]
-        spm_id = self.sp_model.PieceToId(token)
-        return spm_id if spm_id else self.unk_token_id
+        return self.sp_model.PieceToId(token)

     def _convert_id_to_token(self, index):
         """Converts an index (integer) in a token (str) using the vocab."""
-        if index in self.fairseq_ids_to_tokens:
-            return self.fairseq_ids_to_tokens[index]
         return self.sp_model.IdToPiece(index)

     def convert_tokens_to_string(self, tokens):
...
@@ -139,18 +139,6 @@ class BartphoTokenizer(PreTrainedTokenizer):
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs

-        super().__init__(
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            sp_model_kwargs=self.sp_model_kwargs,
-            **kwargs,
-        )
-
         self.vocab_file = vocab_file
         self.monolingual_vocab_file = monolingual_vocab_file
         self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
@@ -174,6 +162,18 @@ class BartphoTokenizer(PreTrainedTokenizer):

         self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}

+        super().__init__(
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            sp_model_kwargs=self.sp_model_kwargs,
+            **kwargs,
+        )
+
     def __getstate__(self):
         state = self.__dict__.copy()
         state["sp_model"] = None
...
@@ -196,20 +196,6 @@ class BertTokenizer(PreTrainedTokenizer):
         strip_accents=None,
         **kwargs,
     ):
-        super().__init__(
-            do_lower_case=do_lower_case,
-            do_basic_tokenize=do_basic_tokenize,
-            never_split=never_split,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            tokenize_chinese_chars=tokenize_chinese_chars,
-            strip_accents=strip_accents,
-            **kwargs,
-        )
-
         if not os.path.isfile(vocab_file):
             raise ValueError(
                 f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
@@ -225,7 +211,22 @@ class BertTokenizer(PreTrainedTokenizer):
                 tokenize_chinese_chars=tokenize_chinese_chars,
                 strip_accents=strip_accents,
             )
-        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
+
+        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=str(unk_token))
+
+        super().__init__(
+            do_lower_case=do_lower_case,
+            do_basic_tokenize=do_basic_tokenize,
+            never_split=never_split,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            mask_token=mask_token,
+            tokenize_chinese_chars=tokenize_chinese_chars,
+            strip_accents=strip_accents,
+            **kwargs,
+        )

     @property
     def do_lower_case(self):
...
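
In the BertTokenizer hunk above, `WordpieceTokenizer` now receives `str(unk_token)` instead of `self.unk_token`: at that point `super().__init__()` has not run yet, so `self.unk_token` is not populated, and the argument may be an `AddedToken` rather than a plain string. A tiny illustration of why the `str()` call is safe (the printed output comments are indicative only):

```python
# Illustrative: an AddedToken stringifies to its content, so passing
# str(unk_token) hands downstream helpers a plain string even before the
# base tokenizer class has set attributes like self.unk_token.
from tokenizers import AddedToken

unk_token = AddedToken("[UNK]", lstrip=False, rstrip=False)
print(type(unk_token))  # e.g. <class 'tokenizers.AddedToken'>
print(str(unk_token))   # "[UNK]" -- what WordpieceTokenizer expects
```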
@@ -96,6 +96,11 @@ class BertGenerationTokenizer(PreTrainedTokenizer):
     ) -> None:
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs

+        self.vocab_file = vocab_file
+
+        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+        self.sp_model.Load(vocab_file)
+
         # Add extra_ids to the special token list
         super().__init__(
             bos_token=bos_token,
@@ -107,11 +112,6 @@ class BertGenerationTokenizer(PreTrainedTokenizer):
             **kwargs,
         )

-        self.vocab_file = vocab_file
-
-        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
-        self.sp_model.Load(vocab_file)
-
     @property
     def vocab_size(self):
         return self.sp_model.get_piece_size()
...
@@ -160,25 +160,6 @@ class BertJapaneseTokenizer(PreTrainedTokenizer):
         jumanpp_kwargs=None,
         **kwargs,
     ):
-        super().__init__(
-            spm_file=spm_file,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            do_lower_case=do_lower_case,
-            do_word_tokenize=do_word_tokenize,
-            do_subword_tokenize=do_subword_tokenize,
-            word_tokenizer_type=word_tokenizer_type,
-            subword_tokenizer_type=subword_tokenizer_type,
-            never_split=never_split,
-            mecab_kwargs=mecab_kwargs,
-            sudachi_kwargs=sudachi_kwargs,
-            jumanpp_kwargs=jumanpp_kwargs,
-            **kwargs,
-        )
-
         if subword_tokenizer_type == "sentencepiece":
             if not os.path.isfile(spm_file):
                 raise ValueError(
@@ -226,13 +207,31 @@ class BertJapaneseTokenizer(PreTrainedTokenizer):
         self.subword_tokenizer_type = subword_tokenizer_type
         if do_subword_tokenize:
             if subword_tokenizer_type == "wordpiece":
-                self.subword_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
+                self.subword_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=str(unk_token))
             elif subword_tokenizer_type == "character":
-                self.subword_tokenizer = CharacterTokenizer(vocab=self.vocab, unk_token=self.unk_token)
+                self.subword_tokenizer = CharacterTokenizer(vocab=self.vocab, unk_token=str(unk_token))
             elif subword_tokenizer_type == "sentencepiece":
-                self.subword_tokenizer = SentencepieceTokenizer(vocab=self.spm_file, unk_token=self.unk_token)
+                self.subword_tokenizer = SentencepieceTokenizer(vocab=self.spm_file, unk_token=str(unk_token))
             else:
                 raise ValueError(f"Invalid subword_tokenizer_type '{subword_tokenizer_type}' is specified.")
+
+        super().__init__(
+            spm_file=spm_file,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            mask_token=mask_token,
+            do_lower_case=do_lower_case,
+            do_word_tokenize=do_word_tokenize,
+            do_subword_tokenize=do_subword_tokenize,
+            word_tokenizer_type=word_tokenizer_type,
+            subword_tokenizer_type=subword_tokenizer_type,
+            never_split=never_split,
+            mecab_kwargs=mecab_kwargs,
+            sudachi_kwargs=sudachi_kwargs,
+            jumanpp_kwargs=jumanpp_kwargs,
+            **kwargs,
+        )

     @property
     def do_lower_case(self):
...
...@@ -134,18 +134,6 @@ class BertweetTokenizer(PreTrainedTokenizer): ...@@ -134,18 +134,6 @@ class BertweetTokenizer(PreTrainedTokenizer):
mask_token="<mask>", mask_token="<mask>",
**kwargs, **kwargs,
): ):
super().__init__(
normalization=normalization,
bos_token=bos_token,
eos_token=eos_token,
sep_token=sep_token,
cls_token=cls_token,
unk_token=unk_token,
pad_token=pad_token,
mask_token=mask_token,
**kwargs,
)
try: try:
from emoji import demojize from emoji import demojize
...@@ -161,10 +149,10 @@ class BertweetTokenizer(PreTrainedTokenizer): ...@@ -161,10 +149,10 @@ class BertweetTokenizer(PreTrainedTokenizer):
self.merges_file = merges_file self.merges_file = merges_file
self.encoder = {} self.encoder = {}
self.encoder[self.bos_token] = 0 self.encoder[bos_token] = 0
self.encoder[self.pad_token] = 1 self.encoder[pad_token] = 1
self.encoder[self.eos_token] = 2 self.encoder[eos_token] = 2
self.encoder[self.unk_token] = 3 self.encoder[unk_token] = 3
self.add_from_file(vocab_file) self.add_from_file(vocab_file)
...@@ -178,9 +166,20 @@ class BertweetTokenizer(PreTrainedTokenizer): ...@@ -178,9 +166,20 @@ class BertweetTokenizer(PreTrainedTokenizer):
self.normalization = normalization self.normalization = normalization
self.tweetPreprocessor = TweetTokenizer() self.tweetPreprocessor = TweetTokenizer()
self.special_puncts = {"’": "'", "…": "..."} self.special_puncts = {"’": "'", "…": "..."}
super().__init__(
normalization=normalization,
bos_token=bos_token,
eos_token=eos_token,
sep_token=sep_token,
cls_token=cls_token,
unk_token=unk_token,
pad_token=pad_token,
mask_token=mask_token,
**kwargs,
)
def build_inputs_with_special_tokens( def build_inputs_with_special_tokens(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]: ) -> List[int]:
......
@@ -127,6 +127,11 @@ class BigBirdTokenizer(PreTrainedTokenizer):
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs

+        self.vocab_file = vocab_file
+
+        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+        self.sp_model.Load(vocab_file)
+
         super().__init__(
             bos_token=bos_token,
             eos_token=eos_token,
@@ -139,11 +144,6 @@ class BigBirdTokenizer(PreTrainedTokenizer):
             **kwargs,
         )

-        self.vocab_file = vocab_file
-
-        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
-        self.sp_model.Load(vocab_file)
-
     @property
     def vocab_size(self):
         return self.sp_model.get_piece_size()
...
@@ -112,15 +112,6 @@ class BioGptTokenizer(PreTrainedTokenizer):
         pad_token="<pad>",
         **kwargs,
     ):
-        super().__init__(
-            bos_token=bos_token,
-            eos_token=eos_token,
-            sep_token=sep_token,
-            unk_token=unk_token,
-            pad_token=pad_token,
-            **kwargs,
-        )
-
         try:
             import sacremoses
         except ImportError:
@@ -145,6 +136,15 @@ class BioGptTokenizer(PreTrainedTokenizer):
         self.bpe_ranks = dict(zip(merges, range(len(merges))))
         self.cache = {}

+        super().__init__(
+            bos_token=bos_token,
+            eos_token=eos_token,
+            sep_token=sep_token,
+            unk_token=unk_token,
+            pad_token=pad_token,
+            **kwargs,
+        )
+
     @property
     def vocab_size(self):
         """Returns vocab size"""
...
@@ -187,28 +187,21 @@ class BlenderbotTokenizer(PreTrainedTokenizer):
         **kwargs,
     ):
         bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
-        pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
         eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
-        unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
         sep_token = AddedToken(sep_token, lstrip=False, rstrip=False) if isinstance(sep_token, str) else sep_token
         cls_token = AddedToken(cls_token, lstrip=False, rstrip=False) if isinstance(cls_token, str) else cls_token
+        unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
+        pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token

         # Mask token behave like a normal word, i.e. include the space before it
-        mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
-
-        super().__init__(
-            errors=errors,
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            add_prefix_space=add_prefix_space,
-            **kwargs,
-        )
+        mask_token = (
+            AddedToken(mask_token, lstrip=True, rstrip=False, normalized=False)
+            if isinstance(mask_token, str)
+            else mask_token
+        )

+        # these special tokens are not part of the vocab.json, let's add them in the correct order
         with open(vocab_file, encoding="utf-8") as vocab_handle:
             self.encoder = json.load(vocab_handle)
         self.decoder = {v: k for k, v in self.encoder.items()}
@@ -225,6 +218,19 @@ class BlenderbotTokenizer(PreTrainedTokenizer):
         # Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
         self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

+        super().__init__(
+            errors=errors,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            add_prefix_space=add_prefix_space,
+            **kwargs,
+        )
+
     @property
     # Copied from transformers.models.roberta.tokenization_roberta.RobertaTokenizer.vocab_size with Roberta->Blenderbot, RoBERTa->Blenderbot
     def vocab_size(self):
@@ -232,7 +238,9 @@ class BlenderbotTokenizer(PreTrainedTokenizer):

     # Copied from transformers.models.roberta.tokenization_roberta.RobertaTokenizer.get_vocab with Roberta->Blenderbot, RoBERTa->Blenderbot
     def get_vocab(self):
-        return dict(self.encoder, **self.added_tokens_encoder)
+        vocab = dict(self.encoder).copy()
+        vocab.update(self.added_tokens_encoder)
+        return vocab

     # Copied from transformers.models.roberta.tokenization_roberta.RobertaTokenizer.bpe with Roberta->Blenderbot, RoBERTa->Blenderbot
     def bpe(self, token):
...
@@ -149,6 +149,11 @@ class BlenderbotTokenizerFast(PreTrainedTokenizerFast):
         trim_offsets=True,
         **kwargs,
     ):
+        mask_token = (
+            AddedToken(mask_token, lstrip=True, rstrip=False, normalized=False)
+            if isinstance(mask_token, str)
+            else mask_token
+        )
         super().__init__(
             vocab_file,
             merges_file,
...
@@ -106,8 +106,6 @@ class BlenderbotSmallTokenizer(PreTrainedTokenizer):
         pad_token="__null__",
         **kwargs,
     ):
-        super().__init__(unk_token=unk_token, bos_token=bos_token, eos_token=eos_token, pad_token=pad_token, **kwargs)
-
         with open(vocab_file, encoding="utf-8") as vocab_handle:
             self.encoder = json.load(vocab_handle)
         self.decoder = {v: k for k, v in self.encoder.items()}
@@ -116,6 +114,7 @@ class BlenderbotSmallTokenizer(PreTrainedTokenizer):
         merges = [tuple(merge.split()) for merge in merges]
         self.bpe_ranks = dict(zip(merges, range(len(merges))))
         self.cache = {}
+        super().__init__(unk_token=unk_token, bos_token=bos_token, eos_token=eos_token, pad_token=pad_token, **kwargs)

     @property
     def vocab_size(self) -> int:
...
@@ -16,7 +16,7 @@
 import warnings
-from typing import Dict, List, Optional, Tuple
+from typing import List, Optional, Tuple

 from ...tokenization_utils import AddedToken, PreTrainedTokenizer
 from ...utils import logging
@@ -72,7 +72,7 @@ class ByT5Tokenizer(PreTrainedTokenizer):
         # Add extra_ids to the special token list
         if extra_ids > 0 and additional_special_tokens is None:
             additional_special_tokens = [f"<extra_id_{i}>" for i in range(extra_ids)]
-        elif extra_ids > 0 and additional_special_tokens is not None:
+        elif extra_ids > 0 and additional_special_tokens is not None and len(additional_special_tokens) > 0:
             # Check that we have the right number of extra_id special tokens
             extra_tokens = len(set(filter(lambda x: bool("extra_id" in str(x)), additional_special_tokens)))
             if extra_tokens != extra_ids:
@@ -82,38 +82,31 @@ class ByT5Tokenizer(PreTrainedTokenizer):
                     " extra_ids tokens"
                 )

-        pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
-        eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
-        unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
+        pad_token = AddedToken(pad_token, lstrip=True, rstrip=True) if isinstance(pad_token, str) else pad_token
+        # we force left and right stripping for backward compatibility. The byt5tests depend on this.
+        eos_token = AddedToken(eos_token, lstrip=True, rstrip=True) if isinstance(eos_token, str) else eos_token
+        unk_token = AddedToken(unk_token, lstrip=True, rstrip=True) if isinstance(unk_token, str) else unk_token
+        # unk token needs to be in the vocab with correct index
+        self._added_tokens_decoder = {0: pad_token, 1: eos_token, 2: unk_token}
+        self.offset = len(self._added_tokens_decoder)
+        self._utf_vocab_size = 2**8  # utf is 8 bits

         super().__init__(
             eos_token=eos_token,
             unk_token=unk_token,
             pad_token=pad_token,
-            extra_ids=extra_ids,
-            additional_special_tokens=additional_special_tokens,
+            extra_ids=0,
+            additional_special_tokens=additional_special_tokens,  # TODO extra ids are not used :sweatywmile:
             **kwargs,
         )

-        self._extra_ids = extra_ids
-        self._utf_vocab_size = 2**8  # utf is 8 bits
-
-        # define special tokens dict
-        self.special_tokens_encoder: Dict[int, str] = {
-            self.pad_token: 0,
-            self.eos_token: 1,
-            self.unk_token: 2,
-        }
-        self._num_special_tokens = len(self.special_tokens_encoder)
-        n = len(additional_special_tokens)
-        for i, token in enumerate(additional_special_tokens):
-            self.special_tokens_encoder[token] = self.vocab_size + i - n
-        self.special_tokens_decoder: Dict[str, int] = {v: k for k, v in self.special_tokens_encoder.items()}
-
     @property
     def vocab_size(self):
-        return self._utf_vocab_size + self._num_special_tokens + self._extra_ids
+        return self._utf_vocab_size
+
+    def get_vocab(self):
+        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
+        vocab.update(self.added_tokens_encoder)
+        return vocab

     def get_special_tokens_mask(
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
@@ -209,34 +202,25 @@ class ByT5Tokenizer(PreTrainedTokenizer):
     def _convert_token_to_id(self, token):
         """Converts a token (str) in an id using the vocab."""
-        if token in self.special_tokens_encoder:
-            token_id = self.special_tokens_encoder[token]
-        elif token in self.added_tokens_encoder:
-            token_id = self.added_tokens_encoder[token]
-        elif len(token) != 1:
-            token_id = self.unk_token_id
+        if len(token) != 1:
+            token_id = None
         else:
-            token_id = ord(token) + self._num_special_tokens
+            token_id = ord(token) + self.offset
         return token_id

     def _convert_id_to_token(self, index):
         """Converts an index (integer) in a token (str) using the vocab."""
-        if index in self.special_tokens_decoder:
-            token = self.special_tokens_decoder[index]
-        else:
-            token = chr(index - self._num_special_tokens)
+        token = chr(index - self.offset)
         return token

     def convert_tokens_to_string(self, tokens):
         """Converts a sequence of tokens (string) in a single string."""
         bstring = b""
         for token in tokens:
-            if token in self.special_tokens_decoder:
-                tok_string = self.special_tokens_decoder[token].encode("utf-8")
-            elif token in self.added_tokens_decoder:
-                tok_string = self.special_tokens_decoder[token].encode("utf-8")
-            elif token in self.special_tokens_encoder:
-                tok_string = token.encode("utf-8")
+            if token in self.added_tokens_decoder:
+                tok_string = self.added_tokens_decoder[token].encode("utf-8")
             elif token in self.added_tokens_encoder:
                 tok_string = token.encode("utf-8")
             else:
...
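
After the ByT5 rewrite above, the three special tokens live in `_added_tokens_decoder` (ids 0-2), `vocab_size` is just the 256 byte values, and a single byte-character maps to `ord(char) + offset`. A small standalone rehearsal of that arithmetic (the `offset = 3` value is read off the diff; nothing here imports the library):

```python
# Worked example of the ByT5 id scheme after this change:
# ids 0..2 are pad/eos/unk, byte tokens start at offset 3.
offset = 3                  # len({0: pad, 1: eos, 2: unk})
utf_vocab_size = 2**8       # 256 byte values, the reported vocab_size


def byte_token_to_id(token: str) -> int:
    # single byte-characters are encoded as their ordinal plus the offset
    assert len(token) == 1
    return ord(token) + offset


def id_to_byte_token(index: int) -> str:
    return chr(index - offset)


print(byte_token_to_id("A"))  # 65 + 3 = 68
print(id_to_byte_token(68))   # "A"
print(utf_vocab_size)         # 256
```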
@@ -136,6 +136,29 @@ class CamembertTokenizer(PreTrainedTokenizer):
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs

+        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+        self.sp_model.Load(str(vocab_file))
+        self.vocab_file = vocab_file
+
+        # HACK: These tokens were added by the author for an obscure reason as they were already part of the
+        # sentencepiece vocabulary (this is the case for <s> and </s> and <unk>).
+        # In this case it is recommended to properly set the tokens by hand.
+        self._added_tokens_decoder = {
+            0: AddedToken("<s>NOTUSED"),
+            1: AddedToken(pad_token),
+            2: AddedToken("</s>NOTUSED"),
+            3: AddedToken(unk_token),
+            4: AddedToken("<unk>NOTUSED"),
+        }
+
+        self.fairseq_offset = 4  # 3 tokens are newly added, but the offset starts from 4
+
+        # legacy: camemebert is a particular case were we have to make sure `"<unk>NOTUSED"` is here
+        if "added_tokens_decoder" in kwargs:
+            # this is the only class that requires this unfortunately.....
+            # the reason is that the fast version has a whole.
+            kwargs["added_tokens_decoder"].update(self._added_tokens_decoder)
+
         super().__init__(
             bos_token=bos_token,
             eos_token=eos_token,
@@ -148,15 +171,83 @@ class CamembertTokenizer(PreTrainedTokenizer):
             sp_model_kwargs=self.sp_model_kwargs,
             **kwargs,
         )

-        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
-        self.sp_model.Load(str(vocab_file))
-        self.vocab_file = vocab_file
-
-        # HACK: These tokens were added by fairseq but don't seem to be actually used when duplicated in the actual
-        # sentencepiece vocabulary (this is the case for <s> and </s>
-        self.fairseq_tokens_to_ids = {"<s>NOTUSED": 0, "<pad>": 1, "</s>NOTUSED": 2, "<unk>": 3}
-        self.fairseq_offset = len(self.fairseq_tokens_to_ids)
-        self.fairseq_tokens_to_ids["<mask>"] = len(self.sp_model) + len(self.fairseq_tokens_to_ids)
-        self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}
-
+    @property
+    def vocab_size(self):
+        # The length of the vocabulary without added tokens is len(self.sp_model) but the added tokens are added at the beginning.
+        return len(self.sp_model)
+
+    def get_vocab(self):
+        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size + self.fairseq_offset)}
+        vocab.update(self.added_tokens_encoder)
+        return vocab
+
+    def _tokenize(self, text: str) -> List[str]:
+        return self.sp_model.encode(text, out_type=str)
+
+    def _convert_token_to_id(self, token):
+        """Converts a token (str) in an id using the vocab."""
+        # specifi to camembert, both 3 and 4 point to the unk token.
+        if self.sp_model.PieceToId(token) == 0:
+            # Convert sentence piece unk token to fairseq unk token index
+            return self.unk_token_id
+        return self.fairseq_offset + self.sp_model.PieceToId(token)
+
+    def _convert_id_to_token(self, index):
+        """Converts an index (integer) in a token (str) using the vocab."""
+        return self.sp_model.IdToPiece(index - self.fairseq_offset)
+
+    def convert_tokens_to_string(self, tokens):
+        """Converts a sequence of tokens (string) in a single string."""
+        # TODO decode outputs do not match between fast and slow
+        current_sub_tokens = []
+        out_string = ""
+        prev_is_special = False
+        for token in tokens:
+            # make sure that special tokens are not decoded using sentencepiece model
+            if token in self.all_special_tokens:
+                if not prev_is_special:
+                    out_string += " "
+                out_string += self.sp_model.decode(current_sub_tokens) + token
+                prev_is_special = True
+                current_sub_tokens = []
+            else:
+                current_sub_tokens.append(token)
+                prev_is_special = False
+        out_string += self.sp_model.decode(current_sub_tokens)
+        return out_string.strip()
+
+    def __getstate__(self):
+        state = self.__dict__.copy()
+        state["sp_model"] = None
+        return state
+
+    def __setstate__(self, d):
+        self.__dict__ = d
+
+        # for backward compatibility
+        if not hasattr(self, "sp_model_kwargs"):
+            self.sp_model_kwargs = {}
+
+        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+        self.sp_model.Load(self.vocab_file)
+
+    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
+        if not os.path.isdir(save_directory):
+            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
+            return
+        out_vocab_file = os.path.join(
+            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
+        )
+
+        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
+            copyfile(self.vocab_file, out_vocab_file)
+        elif not os.path.isfile(self.vocab_file):
+            with open(out_vocab_file, "wb") as fi:
+                content_spiece_model = self.sp_model.serialized_model_proto()
+                fi.write(content_spiece_model)
+
+        return (out_vocab_file,)

     def build_inputs_with_special_tokens(
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
@@ -233,81 +324,3 @@ class CamembertTokenizer(PreTrainedTokenizer):
         if token_ids_1 is None:
             return len(cls + token_ids_0 + sep) * [0]
         return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]
-
-    @property
-    def vocab_size(self):
-        return len(self.fairseq_tokens_to_ids) + len(self.sp_model)
-
-    def get_vocab(self):
-        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
-        vocab.update(self.added_tokens_encoder)
-        return vocab
-
-    def _tokenize(self, text: str) -> List[str]:
-        return self.sp_model.encode(text, out_type=str)
-
-    def _convert_token_to_id(self, token):
-        """Converts a token (str) in an id using the vocab."""
-        if token in self.fairseq_tokens_to_ids:
-            return self.fairseq_tokens_to_ids[token]
-        elif self.sp_model.PieceToId(token) == 0:
-            # Convert sentence piece unk token to fairseq unk token index
-            return self.unk_token_id
-        return self.fairseq_offset + self.sp_model.PieceToId(token)
-
-    def _convert_id_to_token(self, index):
-        """Converts an index (integer) in a token (str) using the vocab."""
-        if index in self.fairseq_ids_to_tokens:
-            return self.fairseq_ids_to_tokens[index]
-        return self.sp_model.IdToPiece(index - self.fairseq_offset)
-
-    def convert_tokens_to_string(self, tokens):
-        """Converts a sequence of tokens (string) in a single string."""
-        current_sub_tokens = []
-        out_string = ""
-        prev_is_special = False
-        for token in tokens:
-            # make sure that special tokens are not decoded using sentencepiece model
-            if token in self.all_special_tokens:
-                if not prev_is_special:
-                    out_string += " "
-                out_string += self.sp_model.decode(current_sub_tokens) + token
-                prev_is_special = True
-                current_sub_tokens = []
-            else:
-                current_sub_tokens.append(token)
-                prev_is_special = False
-        out_string += self.sp_model.decode(current_sub_tokens)
-        return out_string.strip()
-
-    def __getstate__(self):
-        state = self.__dict__.copy()
-        state["sp_model"] = None
-        return state
-
-    def __setstate__(self, d):
-        self.__dict__ = d
-
-        # for backward compatibility
-        if not hasattr(self, "sp_model_kwargs"):
-            self.sp_model_kwargs = {}
-
-        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
-        self.sp_model.Load(self.vocab_file)
-
-    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
-        if not os.path.isdir(save_directory):
-            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
-            return
-        out_vocab_file = os.path.join(
-            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
-        )
-
-        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
-            copyfile(self.vocab_file, out_vocab_file)
-        elif not os.path.isfile(self.vocab_file):
-            with open(out_vocab_file, "wb") as fi:
-                content_spiece_model = self.sp_model.serialized_model_proto()
-                fi.write(content_spiece_model)
-
-        return (out_vocab_file,)
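
The CamembertTokenizer rewrite above hard-codes ids 0-4 in `_added_tokens_decoder` (including the `NOTUSED` placeholders) and keeps `fairseq_offset = 4`, so SentencePiece ids are shifted by 4 and the SentencePiece unk piece (id 0) is redirected to the tokenizer's unk id. A standalone sketch of that index arithmetic; the piece ids below are invented for illustration rather than read from a real SentencePiece model:

```python
# Illustrative re-implementation of the Camembert id mapping after this PR.
# `sp_piece_id` stands in for self.sp_model.PieceToId(token).
fairseq_offset = 4  # ids 0-4 are reserved: <s>NOTUSED, <pad>, </s>NOTUSED, <unk>, <unk>NOTUSED
unk_token_id = 3    # position of <unk> in _added_tokens_decoder


def token_id_from_piece_id(sp_piece_id: int) -> int:
    if sp_piece_id == 0:
        # the SentencePiece unk piece is redirected to the tokenizer's unk id
        return unk_token_id
    return sp_piece_id + fairseq_offset


print(token_id_from_piece_id(0))   # 3  -> unk
print(token_id_from_piece_id(10))  # 14 -> regular piece shifted by the offset
```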
@@ -33,7 +33,6 @@ UNICODE_VOCAB_SIZE = 1114112
 # Below: Constants defining canonical codepoints for special, pseudo-characters.
 # Copied from https://github.com/google-research/language/blob/master/language/canine/special_codepoints.py
 PAD = 0
-
 CLS = 0xE000
 SEP = 0xE001
 BOS = 0xE002
@@ -97,18 +96,6 @@ class CanineTokenizer(PreTrainedTokenizer):
         # Mask token behave like a normal word, i.e. include the space before it
         mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token

-        super().__init__(
-            bos_token=bos_token,
-            eos_token=eos_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            add_prefix_space=add_prefix_space,
-            model_max_length=model_max_length,
-            **kwargs,
-        )
-
         # Creates a mapping for looking up the IDs of special symbols.
         self._special_codepoints: Dict[str, int] = {}
         for codepoint, name in SPECIAL_CODEPOINTS.items():
@@ -122,10 +109,27 @@ class CanineTokenizer(PreTrainedTokenizer):
         self._unicode_vocab_size = UNICODE_VOCAB_SIZE
         self._num_special_tokens = len(self._special_codepoints)

+        super().__init__(
+            bos_token=bos_token,
+            eos_token=eos_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            add_prefix_space=add_prefix_space,
+            model_max_length=model_max_length,
+            **kwargs,
+        )
+
     @property
     def vocab_size(self) -> int:
         return self._unicode_vocab_size

+    def get_vocab(self):
+        vocab = {chr(i): i for i in range(self.vocab_size)}
+        vocab.update(self.added_tokens_encoder)
+        return vocab
+
     def _tokenize(self, text: str) -> List[str]:
         """Tokenize a string (i.e. perform character splitting)."""
         return list(text)
...