"vscode:/vscode.git/clone" did not exist on "7bac51837baf56a3c067d9d0db9faa1ead526ae3"
Unverified commit 2da88537, authored by Arthur and committed by GitHub

🚨🚨 🚨🚨 [`Tokenizer`] attempt to fix add_token issues 🚨🚨 🚨🚨 (#23909)



* fix test for bart. Order is correct now let's skip BPEs

* phew

* styling

* fix bert....

* slow refactoring

* current updates

* massive refactoring

* update

* NICE!

* update to see where I am at

* updates

* update

* update

* revert

* updates

* updates

* start supporting legacy_save

* styling

* big update

* revert some changes

* nits

* nniiiiiice

* small fixes

* kinda fix t5 with new behaviour

* major update

* fixup

* fix copies

* today's updates

* fix byt5

* update

* update

* update

* updates

* update vocab size test

* Barthez does not need the fairseq offset ids

* super call must be after

* call super

* move all super init

* move other super init

* fixup

* nits

* more fixes

* nits

* more fixes

* nits

* more fix

* remove useless files

* ouch all of them are affected

* and more!

* small improvements

* no more sanitize token

* more changes around unique no split tokens

* partially fix more things

* keep legacy save but add warning

* so... more fixes

* updates

* guess deberta tokenizer could be nuked

* fixup

* fixup did some bad things

* nuke it if it breaks

* remove prints and pretrain fast from slow with new format.

* fixups

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* phew

* nit

* by default specials should not be normalized?

* update

* remove breakpoint

* updates

* a lot of updates

* fixup

* fixes revert some changes to match fast

* small nits

* that makes it cleaner

* fix camembert accordingly

* update

* some less breaking changes

* update

* fixup

* fix byt5 and whisper mostly

* some more fixes, canine's byte vocab

* fix gpt2

* fix most of the perceiver tests (4 left)

* fix layout lmv3

* fixup

* fix copies for gpt2 style

* make sure to only warn once

* fix perceiver and gpt2 tests

* some more backward compatibility: also read the special tokens map because some people use it

* fixup

* add else when reading

* nits

* fresh updates

* fix copies

* will this make everything faster?

* fixes

* more fixes

* update

* more fixes

* fixup

* is the source of truth right?

* sorry camembert for the troubles

* current updates

* fixup

* update led

* update

* fix regression

* fix single word

* more model specific fixes

* fix t5 tests

* fixup

* more comments

* update

* fix nllb

* rstrip removed

* small fixes

* better handle additional_special_tokens and vocab sizes

* fixing

* styling

* fix 4 / 21

* fixup

* fix nllb's tests

* some fixes

* fix t5

* fixes

* style

* fix canine tests

* damn this is nice

* nits

* m2m100 nit

* fixups

* fixes!

* fixup

* stash

* fix merge

* revert bad change

* fixup

* correct order for Code Llama

* fix speecht5 post merge

* styling

* revert source of 11 fails

* small nits

* all changes in one go

* fnet hack

* fix 2 more tests

* update based on main branch of tokenizers

* fixup

* fix VITS issues

* more fixes

* fix mgp test

* fix camembert issues

* oops, camembert still has 2 failing tests

* mluke fixes

* decode fixes

* small nits

* nits

* fix llama and vits

* fix camembert

* small nits

* more fixes when initialising a fast tokenizer from a slow one, etc.

* fix one of the last test

* fix CPM tokenizer test

* fixups

* fix pop2piano

* fixup

* Change tokenizers required version

* Change tokenizers required version

* "tokenizers>=0.14,<0.15", don't forget smaller than

* fix musicgen tests and PreTrainedTokenizerFast

* fix owlvit and all

* update t5

* fix 800 red

* fix tests

* fix the fix of the fix of t5

* styling

* documentation nits

* cache _added_tokens_encoder

* fixups

* Nit

* fix red tests

* one last nit!

* make everything a lot simpler

* Now it's over 😉



* few small nits

* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* updates that work for now

* tests that should not be skipped / changed and fixed next

* fixup

* i am ashamed

* push the fix

* update

* fixups

* nits

* fix added_tokens_encoder

* fix canine test

* fix pegasus vocab

* fix transfoXL

* fixup

* whisper needs to be fixed for train new

* pegasus nits

* more pegasus fixes

* minor update

* better error message in failed test

* fix whisper failing test

* fix whisper failing test

* fix pegasus

* fixup

* fix **** pegasus

* reset things

* remove another file

* attempts to fix the strange custom encoder and offset

* nits here and there

* update

* fixup

* nit

* fix the whisper test

* nits nits

* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* updates based on review

* some small update to potentially remove

* nits

* import lru cache

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Lysandre Debut <hi@lysand.re>

* move warning to `from_pretrained`

* update tests results now that the special tokens are always added

---------
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Lysandre Debut <hi@lysand.re>
parent 835b0a05
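The recurring change in the diffs below is where `super().__init__(...)` is called: each slow tokenizer now builds its vocabulary state first and calls the base constructor last, so that special/added tokens can be resolved against an already-populated vocab. A minimal sketch of that ordering constraint, with hypothetical class names (not the real `PreTrainedTokenizer` implementation):

```python
# Hypothetical sketch of the ordering constraint, not the actual transformers base class.
class BaseTokenizer:
    def __init__(self, special_tokens=()):
        # The base class may immediately look tokens up in the subclass vocab,
        # so the subclass vocab must already exist at this point.
        self.special_ids = [self.token_to_id(t) for t in special_tokens]

    def token_to_id(self, token):
        raise NotImplementedError


class MyTokenizer(BaseTokenizer):
    def __init__(self, vocab, special_tokens=()):
        self.vocab = dict(vocab)          # 1) build subclass state first...
        super().__init__(special_tokens)  # 2) ...then let the base class resolve tokens against it

    def token_to_id(self, token):
        return self.vocab.get(token, -1)


tok = MyTokenizer({"<pad>": 0, "<eos>": 1}, special_tokens=["<pad>", "<eos>"])
assert tok.special_ids == [0, 1]
```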
@@ -232,27 +232,6 @@ class MarkupLMTokenizer(PreTrainedTokenizer):
         # Mask token behave like a normal word, i.e. include the space before it
         mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
-        super().__init__(
-            vocab_file=vocab_file,
-            merges_file=merges_file,
-            tags_dict=tags_dict,
-            errors=errors,
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            add_prefix_space=add_prefix_space,
-            max_depth=max_depth,
-            max_width=max_width,
-            pad_width=pad_width,
-            pad_token_label=pad_token_label,
-            only_label_first_subword=only_label_first_subword,
-            **kwargs,
-        )
         with open(vocab_file, encoding="utf-8") as vocab_handle:
             self.encoder = json.load(vocab_handle)
@@ -279,6 +258,28 @@ class MarkupLMTokenizer(PreTrainedTokenizer):
         self.pad_tag_id = self.unk_tag_id + 1
         self.pad_xpath_tags_seq = [self.pad_tag_id] * self.max_depth
         self.pad_xpath_subs_seq = [self.pad_width] * self.max_depth
+        super().__init__(
+            vocab_file=vocab_file,
+            merges_file=merges_file,
+            tags_dict=tags_dict,
+            errors=errors,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            add_prefix_space=add_prefix_space,
+            max_depth=max_depth,
+            max_width=max_width,
+            pad_width=pad_width,
+            pad_token_label=pad_token_label,
+            only_label_first_subword=only_label_first_subword,
+            **kwargs,
+        )
         self.pad_token_label = pad_token_label
         self.only_label_first_subword = only_label_first_subword
@@ -312,7 +313,9 @@ class MarkupLMTokenizer(PreTrainedTokenizer):
         return len(self.encoder)
     def get_vocab(self):
-        return dict(self.encoder, **self.added_tokens_encoder)
+        vocab = self.encoder.copy()
+        vocab.update(self.added_tokens_encoder)
+        return vocab
     def bpe(self, token):
         if token in self.cache:
......
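The `get_vocab` change above replaces the `dict(self.encoder, **self.added_tokens_encoder)` one-liner with an explicit copy-then-update; the same pattern recurs in several files below. A minimal standalone sketch of the pattern, with plain dicts standing in for the tokenizer's real state (the token names and ids here are made up):

```python
# Illustrative only: plain dicts stand in for the tokenizer's real state.
base_vocab = {"<s>": 0, "</s>": 2, "hello": 31414}   # e.g. self.encoder
added_tokens_encoder = {"<new_tok>": 50265}          # tokens registered after init


def get_vocab(encoder, added_tokens):
    vocab = encoder.copy()          # never mutate the base vocabulary
    vocab.update(added_tokens)      # overlay added tokens (they win on collisions)
    return vocab


assert get_vocab(base_vocab, added_tokens_encoder)["<new_tok>"] == 50265
```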
@@ -26,6 +26,7 @@ from tokenizers import pre_tokenizers, processors
 from ...file_utils import PaddingStrategy, TensorType, add_end_docstrings
 from ...tokenization_utils_base import (
     ENCODE_KWARGS_DOCSTRING,
+    AddedToken,
     BatchEncoding,
     EncodedInput,
     PreTokenizedInput,
@@ -182,6 +183,16 @@ class MarkupLMTokenizerFast(PreTrainedTokenizerFast):
         trim_offsets=False,
         **kwargs,
     ):
+        bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
+        eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
+        sep_token = AddedToken(sep_token, lstrip=False, rstrip=False) if isinstance(sep_token, str) else sep_token
+        cls_token = AddedToken(cls_token, lstrip=False, rstrip=False) if isinstance(cls_token, str) else cls_token
+        unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
+        pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
+        # Mask token behave like a normal word, i.e. include the space before it
+        mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
         super().__init__(
             vocab_file=vocab_file,
             merges_file=merges_file,
......
@@ -101,22 +101,6 @@ class MBartTokenizer(PreTrainedTokenizer):
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
-        super().__init__(
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            tokenizer_file=None,
-            src_lang=src_lang,
-            tgt_lang=tgt_lang,
-            additional_special_tokens=additional_special_tokens,
-            sp_model_kwargs=self.sp_model_kwargs,
-            **kwargs,
-        )
         self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
         self.sp_model.Load(str(vocab_file))
         self.vocab_file = vocab_file
@@ -142,12 +126,28 @@ class MBartTokenizer(PreTrainedTokenizer):
         self.fairseq_tokens_to_ids.update(self.lang_code_to_id)
         self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}
-        self._additional_special_tokens = list(self.lang_code_to_id.keys())
+        _additional_special_tokens = list(self.lang_code_to_id.keys())
         if additional_special_tokens is not None:
             # Only add those special tokens if they are not already there.
-            self._additional_special_tokens.extend(
-                [t for t in additional_special_tokens if t not in self._additional_special_tokens]
+            _additional_special_tokens.extend(
+                [t for t in additional_special_tokens if t not in _additional_special_tokens]
             )
+        super().__init__(
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            tokenizer_file=None,
+            src_lang=src_lang,
+            tgt_lang=tgt_lang,
+            additional_special_tokens=_additional_special_tokens,
+            sp_model_kwargs=self.sp_model_kwargs,
+            **kwargs,
+        )
         self._src_lang = src_lang if src_lang is not None else "en_XX"
......
@@ -112,6 +112,14 @@ class MBartTokenizerFast(PreTrainedTokenizerFast):
         # Mask token behave like a normal word, i.e. include the space before it
         mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
+        _additional_special_tokens = FAIRSEQ_LANGUAGE_CODES.copy()
+        if additional_special_tokens is not None:
+            # Only add those special tokens if they are not already there.
+            _additional_special_tokens.extend(
+                [t for t in additional_special_tokens if t not in _additional_special_tokens]
+            )
         super().__init__(
             vocab_file=vocab_file,
             tokenizer_file=tokenizer_file,
@@ -124,21 +132,11 @@ class MBartTokenizerFast(PreTrainedTokenizerFast):
             mask_token=mask_token,
             src_lang=src_lang,
             tgt_lang=tgt_lang,
-            additional_special_tokens=additional_special_tokens,
+            additional_special_tokens=_additional_special_tokens,
             **kwargs,
         )
         self.vocab_file = vocab_file
-        _additional_special_tokens = FAIRSEQ_LANGUAGE_CODES.copy()
-
-        if additional_special_tokens is not None:
-            # Only add those special tokens if they are not already there.
-            _additional_special_tokens.extend(
-                [t for t in additional_special_tokens if t not in _additional_special_tokens]
-            )
-
-        self.add_special_tokens({"additional_special_tokens": _additional_special_tokens})
-
         self.lang_code_to_id = {
             lang_code: self.convert_tokens_to_ids(lang_code) for lang_code in FAIRSEQ_LANGUAGE_CODES
         }
......
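In the fast MBart tokenizer above (and the NLLB one further down), the language codes are merged with any user-supplied `additional_special_tokens` before the single `super().__init__` call, instead of calling `add_special_tokens` afterwards. The deduplicating merge itself is plain list filtering; a small standalone sketch follows (the language-code list is a made-up subset, not the real FAIRSEQ constant):

```python
# The language-code list below is a made-up subset, not the real FAIRSEQ constant.
fairseq_language_codes = ["en_XX", "fr_XX", "ro_RO"]


def merge_special_tokens(language_codes, additional_special_tokens=None):
    merged = language_codes.copy()
    if additional_special_tokens is not None:
        # Only add those special tokens if they are not already there.
        merged.extend(t for t in additional_special_tokens if t not in merged)
    return merged


assert merge_special_tokens(fairseq_language_codes, ["<mask2>", "en_XX"]) == [
    "en_XX", "fr_XX", "ro_RO", "<mask2>"
]
```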
@@ -137,19 +137,6 @@ class MBart50Tokenizer(PreTrainedTokenizer):
             code for code in FAIRSEQ_LANGUAGE_CODES if code not in kwargs["additional_special_tokens"]
         ]
-        super().__init__(
-            src_lang=src_lang,
-            tgt_lang=tgt_lang,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            sp_model_kwargs=self.sp_model_kwargs,
-            **kwargs,
-        )
         self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
         self.sp_model.Load(str(vocab_file))
         self.vocab_file = vocab_file
@@ -176,6 +163,19 @@ class MBart50Tokenizer(PreTrainedTokenizer):
         self.fairseq_tokens_to_ids.update(self.lang_code_to_id)
         self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}
+        super().__init__(
+            src_lang=src_lang,
+            tgt_lang=tgt_lang,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            sp_model_kwargs=self.sp_model_kwargs,
+            **kwargs,
+        )
         self._src_lang = src_lang if src_lang is not None else "en_XX"
         self.cur_lang_code_id = self.lang_code_to_id[self._src_lang]
         self.tgt_lang = tgt_lang
......
@@ -62,6 +62,9 @@ class MgpstrTokenizer(PreTrainedTokenizer):
     max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
     def __init__(self, vocab_file, unk_token="[GO]", bos_token="[GO]", eos_token="[s]", pad_token="[GO]", **kwargs):
+        with open(vocab_file, encoding="utf-8") as vocab_handle:
+            self.vocab = json.load(vocab_handle)
+        self.decoder = {v: k for k, v in self.vocab.items()}
         super().__init__(
             unk_token=unk_token,
             bos_token=bos_token,
@@ -70,16 +73,14 @@ class MgpstrTokenizer(PreTrainedTokenizer):
             **kwargs,
         )
-        with open(vocab_file, encoding="utf-8") as vocab_handle:
-            self.vocab = json.load(vocab_handle)
-        self.decoder = {v: k for k, v in self.vocab.items()}
     @property
     def vocab_size(self):
         return len(self.vocab)
     def get_vocab(self):
-        return dict(self.vocab, **self.added_tokens_encoder)
+        vocab = dict(self.vocab).copy()
+        vocab.update(self.added_tokens_encoder)
+        return vocab
     def _tokenize(self, text):
         """Tokenize a string."""
......
@@ -272,32 +272,11 @@ class MLukeTokenizer(PreTrainedTokenizer):
             if isinstance(entity_token_2, str)
             else entity_token_2
         )
-        kwargs["additional_special_tokens"] = kwargs.get("additional_special_tokens", [])
-        kwargs["additional_special_tokens"] += [entity_token_1, entity_token_2]
+        additional_special_tokens = kwargs.pop("additional_special_tokens", [])
+        additional_special_tokens += [entity_token_1, entity_token_2]
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
-        super().__init__(
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            sp_model_kwargs=self.sp_model_kwargs,
-            task=task,
-            max_entity_length=max_entity_length,
-            max_mention_length=max_mention_length,
-            entity_token_1=entity_token_1,
-            entity_token_2=entity_token_2,
-            entity_unk_token=entity_unk_token,
-            entity_pad_token=entity_pad_token,
-            entity_mask_token=entity_mask_token,
-            entity_mask2_token=entity_mask2_token,
-            **kwargs,
-        )
         self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
         self.sp_model.Load(str(vocab_file))
         self.vocab_file = vocab_file
@@ -345,6 +324,65 @@ class MLukeTokenizer(PreTrainedTokenizer):
         self.max_mention_length = max_mention_length
+        super().__init__(
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            sp_model_kwargs=self.sp_model_kwargs,
+            task=task,
+            max_entity_length=max_entity_length,
+            max_mention_length=max_mention_length,
+            entity_token_1=entity_token_1,
+            entity_token_2=entity_token_2,
+            entity_unk_token=entity_unk_token,
+            entity_pad_token=entity_pad_token,
+            entity_mask_token=entity_mask_token,
+            entity_mask2_token=entity_mask2_token,
+            additional_special_tokens=additional_special_tokens,
+            **kwargs,
+        )
+    @property
+    # Copied from transformers.models.xlm_roberta.tokenization_xlm_roberta.XLMRobertaTokenizer.vocab_size
+    def vocab_size(self):
+        return len(self.sp_model) + self.fairseq_offset + 1  # Add the <mask> token
+    # Copied from transformers.models.xlm_roberta.tokenization_xlm_roberta.XLMRobertaTokenizer.get_vocab
+    def get_vocab(self):
+        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
+        vocab.update(self.added_tokens_encoder)
+        return vocab
+    # Copied from transformers.models.xlm_roberta.tokenization_xlm_roberta.XLMRobertaTokenizer._tokenize
+    def _tokenize(self, text: str) -> List[str]:
+        # TODO check if the t5/llama PR also applies here
+        return self.sp_model.encode(text, out_type=str)
+    # Copied from transformers.models.xlm_roberta.tokenization_xlm_roberta.XLMRobertaTokenizer._convert_token_to_id
+    def _convert_token_to_id(self, token):
+        """Converts a token (str) in an id using the vocab."""
+        if token in self.fairseq_tokens_to_ids:
+            return self.fairseq_tokens_to_ids[token]
+        spm_id = self.sp_model.PieceToId(token)
+        # Need to return unknown token if the SP model returned 0
+        return spm_id + self.fairseq_offset if spm_id else self.unk_token_id
+    def _convert_id_to_token(self, index):
+        """Converts an index (integer) in a token (str) using the vocab."""
+        if index in self.fairseq_ids_to_tokens:
+            return self.fairseq_ids_to_tokens[index]
+        return self.sp_model.IdToPiece(index - self.fairseq_offset)
+    def convert_tokens_to_string(self, tokens):
+        """Converts a sequence of tokens (strings for sub-words) in a single string."""
+        out_string = "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()
+        return out_string
     def __getstate__(self):
         state = self.__dict__.copy()
         state["sp_model"] = None
@@ -1591,39 +1629,3 @@ class MLukeTokenizer(PreTrainedTokenizer):
         if token_ids_1 is None:
             return len(cls + token_ids_0 + sep) * [0]
         return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]
-    @property
-    # Copied from transformers.models.xlm_roberta.tokenization_xlm_roberta.XLMRobertaTokenizer.vocab_size
-    def vocab_size(self):
-        return len(self.sp_model) + self.fairseq_offset + 1  # Add the <mask> token
-    # Copied from transformers.models.xlm_roberta.tokenization_xlm_roberta.XLMRobertaTokenizer.get_vocab
-    def get_vocab(self):
-        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
-        vocab.update(self.added_tokens_encoder)
-        return vocab
-    # Copied from transformers.models.xlm_roberta.tokenization_xlm_roberta.XLMRobertaTokenizer._tokenize
-    def _tokenize(self, text: str) -> List[str]:
-        return self.sp_model.encode(text, out_type=str)
-    # Copied from transformers.models.xlm_roberta.tokenization_xlm_roberta.XLMRobertaTokenizer._convert_token_to_id
-    def _convert_token_to_id(self, token):
-        """Converts a token (str) in an id using the vocab."""
-        if token in self.fairseq_tokens_to_ids:
-            return self.fairseq_tokens_to_ids[token]
-        spm_id = self.sp_model.PieceToId(token)
-        # Need to return unknown token if the SP model returned 0
-        return spm_id + self.fairseq_offset if spm_id else self.unk_token_id
-    def _convert_id_to_token(self, index):
-        """Converts an index (integer) in a token (str) using the vocab."""
-        if index in self.fairseq_ids_to_tokens:
-            return self.fairseq_ids_to_tokens[index]
-        return self.sp_model.IdToPiece(index - self.fairseq_offset)
-    def convert_tokens_to_string(self, tokens):
-        """Converts a sequence of tokens (strings for sub-words) in a single string."""
-        out_string = "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()
-        return out_string
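The MLuke change above stops mutating `kwargs["additional_special_tokens"]` in place and instead pops the list, extends it with the entity tokens, and hands it to `super().__init__` explicitly. A minimal sketch of that kwargs-handling pattern, with an illustrative function rather than the real MLukeTokenizer signature:

```python
# Illustrative kwargs handling, not the actual MLukeTokenizer signature.
def init_tokenizer(entity_token_1="<ent>", entity_token_2="<ent2>", **kwargs):
    # pop() instead of get(): the key must not also be forwarded through **kwargs,
    # otherwise super().__init__ would receive it twice.
    additional_special_tokens = kwargs.pop("additional_special_tokens", [])
    additional_special_tokens += [entity_token_1, entity_token_2]
    return dict(additional_special_tokens=additional_special_tokens, **kwargs)


cfg = init_tokenizer(additional_special_tokens=["<custom>"], model_max_length=512)
assert cfg["additional_special_tokens"] == ["<custom>", "<ent>", "<ent2>"]
```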
@@ -124,20 +124,6 @@ class MobileBertTokenizer(PreTrainedTokenizer):
         strip_accents=None,
         **kwargs,
     ):
-        super().__init__(
-            do_lower_case=do_lower_case,
-            do_basic_tokenize=do_basic_tokenize,
-            never_split=never_split,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            tokenize_chinese_chars=tokenize_chinese_chars,
-            strip_accents=strip_accents,
-            **kwargs,
-        )
         if not os.path.isfile(vocab_file):
             raise ValueError(
                 f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
@@ -153,7 +139,22 @@ class MobileBertTokenizer(PreTrainedTokenizer):
                 tokenize_chinese_chars=tokenize_chinese_chars,
                 strip_accents=strip_accents,
             )
-        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
+        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=str(unk_token))
+        super().__init__(
+            do_lower_case=do_lower_case,
+            do_basic_tokenize=do_basic_tokenize,
+            never_split=never_split,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            mask_token=mask_token,
+            tokenize_chinese_chars=tokenize_chinese_chars,
+            strip_accents=strip_accents,
+            **kwargs,
+        )
     @property
     def do_lower_case(self):
......
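Above (and again in MPNet just below), `WordpieceTokenizer` now receives `str(unk_token)` rather than `self.unk_token`: since `super().__init__` only runs afterwards, `self.unk_token` does not exist yet, and the constructor argument may be an `AddedToken` rather than a plain string. A tiny sketch of why the explicit `str()` matters, using a stand-in class instead of the real `tokenizers.AddedToken`:

```python
# FakeAddedToken stands in for tokenizers.AddedToken, just to show the str() coercion.
class FakeAddedToken:
    def __init__(self, content):
        self.content = content

    def __str__(self):
        return self.content


unk_token = FakeAddedToken("[UNK]")
vocab = {"[UNK]": 100, "hello": 7}

# str(unk_token) guarantees the lookup key is the plain string the vocab was built with.
assert vocab[str(unk_token)] == 100
```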
@@ -157,22 +157,6 @@ class MPNetTokenizer(PreTrainedTokenizer):
         # Mask token behave like a normal word, i.e. include the space before it
         mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
-        super().__init__(
-            do_lower_case=do_lower_case,
-            do_basic_tokenize=do_basic_tokenize,
-            never_split=never_split,
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            tokenize_chinese_chars=tokenize_chinese_chars,
-            strip_accents=strip_accents,
-            **kwargs,
-        )
         if not os.path.isfile(vocab_file):
             raise ValueError(
                 f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
@@ -188,7 +172,23 @@ class MPNetTokenizer(PreTrainedTokenizer):
                 tokenize_chinese_chars=tokenize_chinese_chars,
                 strip_accents=strip_accents,
             )
-        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
+        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=str(unk_token))
+        super().__init__(
+            do_lower_case=do_lower_case,
+            do_basic_tokenize=do_basic_tokenize,
+            never_split=never_split,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            tokenize_chinese_chars=tokenize_chinese_chars,
+            strip_accents=strip_accents,
+            **kwargs,
+        )
     @property
     def do_lower_case(self):
@@ -199,7 +199,9 @@ class MPNetTokenizer(PreTrainedTokenizer):
         return len(self.vocab)
     def get_vocab(self):
-        return dict(self.vocab, **self.added_tokens_encoder)
+        vocab = self.vocab.copy()
+        vocab.update(self.added_tokens_encoder)
+        return vocab
     def _tokenize(self, text):
         split_tokens = []
......
@@ -126,6 +126,16 @@ class MPNetTokenizerFast(PreTrainedTokenizerFast):
         strip_accents=None,
         **kwargs,
     ):
+        bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
+        eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
+        sep_token = AddedToken(sep_token, lstrip=False, rstrip=False) if isinstance(sep_token, str) else sep_token
+        cls_token = AddedToken(cls_token, lstrip=False, rstrip=False) if isinstance(cls_token, str) else cls_token
+        unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
+        pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
+        # Mask token behave like a normal word, i.e. include the space before it
+        mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
         super().__init__(
             vocab_file,
             tokenizer_file=tokenizer_file,
......
@@ -193,19 +193,6 @@ class MvpTokenizer(PreTrainedTokenizer):
         # Mask token behave like a normal word, i.e. include the space before it
         mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
-        super().__init__(
-            errors=errors,
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            add_prefix_space=add_prefix_space,
-            **kwargs,
-        )
         with open(vocab_file, encoding="utf-8") as vocab_handle:
             self.encoder = json.load(vocab_handle)
         self.decoder = {v: k for k, v in self.encoder.items()}
@@ -222,12 +209,27 @@ class MvpTokenizer(PreTrainedTokenizer):
         # Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
         self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
+        super().__init__(
+            errors=errors,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            add_prefix_space=add_prefix_space,
+            **kwargs,
+        )
     @property
     def vocab_size(self):
         return len(self.encoder)
     def get_vocab(self):
-        return dict(self.encoder, **self.added_tokens_encoder)
+        vocab = self.encoder.copy()
+        vocab.update(self.added_tokens_encoder)
+        return vocab
     def bpe(self, token):
         if token in self.cache:
......
@@ -153,6 +153,15 @@ class MvpTokenizerFast(PreTrainedTokenizerFast):
         trim_offsets=True,
         **kwargs,
     ):
+        bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
+        eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
+        sep_token = AddedToken(sep_token, lstrip=False, rstrip=False) if isinstance(sep_token, str) else sep_token
+        cls_token = AddedToken(cls_token, lstrip=False, rstrip=False) if isinstance(cls_token, str) else cls_token
+        unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
+        pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
+        # Mask token behave like a normal word, i.e. include the space before it
+        mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
         super().__init__(
             vocab_file,
             merges_file,
......
@@ -149,23 +149,6 @@ class NllbTokenizer(PreTrainedTokenizer):
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
         self.legacy_behaviour = legacy_behaviour
-        super().__init__(
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            tokenizer_file=tokenizer_file,
-            src_lang=src_lang,
-            tgt_lang=tgt_lang,
-            additional_special_tokens=additional_special_tokens,
-            sp_model_kwargs=self.sp_model_kwargs,
-            legacy_behaviour=legacy_behaviour,
-            **kwargs,
-        )
         self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
         self.sp_model.Load(str(vocab_file))
         self.vocab_file = vocab_file
@@ -190,16 +173,35 @@ class NllbTokenizer(PreTrainedTokenizer):
         self.fairseq_tokens_to_ids.update(self.lang_code_to_id)
         self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}
-        self._additional_special_tokens = list(self.lang_code_to_id.keys())
-        self._src_lang = src_lang if src_lang is not None else "eng_Latn"
-        self.cur_lang_code_id = self.lang_code_to_id[self._src_lang]
+        _additional_special_tokens = list(self.lang_code_to_id.keys())
         if additional_special_tokens is not None:
             # Only add those special tokens if they are not already there.
-            self._additional_special_tokens.extend(
-                [t for t in additional_special_tokens if t not in self._additional_special_tokens]
+            _additional_special_tokens.extend(
+                [t for t in additional_special_tokens if t not in _additional_special_tokens]
             )
+        super().__init__(
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            tokenizer_file=tokenizer_file,
+            src_lang=src_lang,
+            tgt_lang=tgt_lang,
+            additional_special_tokens=_additional_special_tokens,
+            sp_model_kwargs=self.sp_model_kwargs,
+            legacy_behaviour=legacy_behaviour,
+            **kwargs,
+        )
+        self._src_lang = src_lang if src_lang is not None else "eng_Latn"
+        self.cur_lang_code_id = self.lang_code_to_id[self._src_lang]
         self.tgt_lang = tgt_lang
         self.set_src_lang_special_tokens(self._src_lang)
......
@@ -157,6 +157,15 @@ class NllbTokenizerFast(PreTrainedTokenizerFast):
         # Mask token behave like a normal word, i.e. include the space before it
         mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
         self.legacy_behaviour = legacy_behaviour
+        _additional_special_tokens = FAIRSEQ_LANGUAGE_CODES.copy()
+        if additional_special_tokens is not None:
+            # Only add those special tokens if they are not already there.
+            _additional_special_tokens.extend(
+                [t for t in additional_special_tokens if t not in _additional_special_tokens]
+            )
         super().__init__(
             vocab_file=vocab_file,
             tokenizer_file=tokenizer_file,
@@ -169,22 +178,13 @@ class NllbTokenizerFast(PreTrainedTokenizerFast):
             mask_token=mask_token,
             src_lang=src_lang,
             tgt_lang=tgt_lang,
-            additional_special_tokens=additional_special_tokens,
+            additional_special_tokens=_additional_special_tokens,
             legacy_behaviour=legacy_behaviour,
             **kwargs,
         )
         self.vocab_file = vocab_file
-        _additional_special_tokens = FAIRSEQ_LANGUAGE_CODES.copy()
-
-        if additional_special_tokens is not None:
-            # Only add those special tokens if they are not already there.
-            _additional_special_tokens.extend(
-                [t for t in additional_special_tokens if t not in _additional_special_tokens]
-            )
-
-        self.add_special_tokens({"additional_special_tokens": _additional_special_tokens})
-
         self.lang_code_to_id = {
             lang_code: self.convert_tokens_to_ids(lang_code) for lang_code in FAIRSEQ_LANGUAGE_CODES
         }
......
@@ -269,8 +269,6 @@ class OpenAIGPTTokenizer(PreTrainedTokenizer):
     model_input_names = ["input_ids", "attention_mask"]
     def __init__(self, vocab_file, merges_file, unk_token="<unk>", **kwargs):
-        super().__init__(unk_token=unk_token, **kwargs)
         try:
             import ftfy
             from spacy.lang.en import English
@@ -292,6 +290,8 @@ class OpenAIGPTTokenizer(PreTrainedTokenizer):
         self.bpe_ranks = dict(zip(merges, range(len(merges))))
         self.cache = {}
+        super().__init__(unk_token=unk_token, **kwargs)
     @property
     def do_lower_case(self):
         return True
......
@@ -18,7 +18,7 @@ from typing import Any, Dict, List, Optional, Tuple
 import sentencepiece as spm
-from ...tokenization_utils import PreTrainedTokenizer
+from ...tokenization_utils import AddedToken, PreTrainedTokenizer
 from ...utils import logging
@@ -38,6 +38,7 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
 logger = logging.get_logger(__name__)
+# TODO ArthurZ refactor this to only use the added_tokens_encoder
 class PegasusTokenizer(PreTrainedTokenizer):
     r"""
     Construct a PEGASUS tokenizer. Based on [SentencePiece](https://github.com/google/sentencepiece).
@@ -95,8 +96,6 @@ class PegasusTokenizer(PreTrainedTokenizer):
         - `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for
           BPE-dropout.
     """
-    vocab_files_names = VOCAB_FILES_NAMES
     vocab_files_names = VOCAB_FILES_NAMES
     pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
     max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
@@ -122,7 +121,6 @@ class PegasusTokenizer(PreTrainedTokenizer):
                 f"additional_special_tokens should be of type {type(list)}, but is"
                 f" {type(additional_special_tokens)}"
             )
         additional_special_tokens_extended = (
             ([mask_token_sent] + additional_special_tokens)
             if mask_token_sent not in additional_special_tokens and mask_token_sent is not None
@@ -140,10 +138,27 @@ class PegasusTokenizer(PreTrainedTokenizer):
             )
             additional_special_tokens = additional_special_tokens_extended
         else:
-            additional_special_tokens_extended = []
             additional_special_tokens = [mask_token_sent] if mask_token_sent is not None else []
             additional_special_tokens += [f"<unk_{i}>" for i in range(2, self.offset)]
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
+        self.mask_token_sent = mask_token_sent
+        self.vocab_file = vocab_file
+        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+        self.sp_model.Load(vocab_file)
+        self._added_tokens_decoder = {
+            0: AddedToken(str(pad_token), lstrip=True, rstrip=True),
+            1: AddedToken(str(eos_token), lstrip=True, rstrip=True),
+        }
+        if self.mask_token_sent is not None:
+            self._added_tokens_decoder[2] = AddedToken(mask_token_sent)
+            self._added_tokens_decoder[3] = AddedToken(str(mask_token))
+        for i in range(1, self.offset - 1):
+            self._added_tokens_decoder[len(self._added_tokens_decoder)] = AddedToken(f"<unk_{i}>")
         super().__init__(
             eos_token=eos_token,
@@ -156,31 +171,6 @@ class PegasusTokenizer(PreTrainedTokenizer):
             sp_model_kwargs=self.sp_model_kwargs,
             **kwargs,
         )
-        self.mask_token_sent = mask_token_sent
-        self.vocab_file = vocab_file
-        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
-        self.sp_model.Load(vocab_file)
-        # add special tokens to encoder dict
-        self.encoder: Dict[int, str] = {
-            0: self.pad_token,
-            1: self.eos_token,
-        }
-        if self.mask_token_sent is not None:
-            self.encoder.update(
-                {
-                    2: self.mask_token_sent,
-                    3: self.mask_token,
-                }
-            )
-        if self.offset > 0:
-            # entries 2-104 are only used for pretraining and called <mask_1>, <mask_2>, unk_2, ...unk_102
-            # mask_token_sent is already added to list -> so start at 1
-            self.encoder.update({i + 3: additional_special_tokens[i] for i in range(1, self.offset - 1)})
-        self.decoder: Dict[str, int] = {v: k for k, v in self.encoder.items()}
     @property
     def vocab_size(self) -> int:
@@ -212,20 +202,13 @@ class PegasusTokenizer(PreTrainedTokenizer):
     def _convert_token_to_id(self, token: str) -> int:
         """Converts a token (str) to an id using the vocab."""
-        if token in self.decoder:
-            return self.decoder[token]
-        elif token in self.added_tokens_decoder:
-            return self.added_tokens_decoder[token]
         sp_id = self.sp_model.piece_to_id(token)
         return sp_id + self.offset
     def _convert_id_to_token(self, index: int) -> str:
         """Converts an index (integer) to a token (str) using the vocab."""
-        if index in self.encoder:
-            return self.encoder[index]
-        elif index in self.added_tokens_encoder:
-            return self.added_tokens_encoder[index]
-        else:
+        if index < self.offset:
+            return self.sp_model.IdToPiece(index)
         token = self.sp_model.IdToPiece(index - self.offset)
         return token
......
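The Pegasus rewrite above drops the hand-maintained `encoder`/`decoder` dicts and instead pre-registers the reserved ids (pad, eos, the sentence-mask tokens, and the `<unk_i>` placeholders) in `_added_tokens_decoder`, leaving conversion to handle only the SentencePiece range shifted by `offset`. A small sketch of the id layout, with a fake SentencePiece model and an illustrative (partial) reserved-token table:

```python
# FakeSPM stands in for the SentencePiece model; OFFSET and the table are illustrative.
OFFSET = 103  # ids below OFFSET are reserved for pad/eos/mask tokens and <unk_i> placeholders

added_tokens_decoder = {0: "<pad>", 1: "</s>", 2: "<mask_2>", 3: "<mask_1>"}


class FakeSPM:
    pieces = ["<unk>", "▁the", "▁quick"]

    def IdToPiece(self, idx):
        return self.pieces[idx]


sp_model = FakeSPM()


def id_to_token(index):
    if index in added_tokens_decoder:              # reserved ids: added-tokens table
        return added_tokens_decoder[index]
    return sp_model.IdToPiece(index - OFFSET)      # everything else shifts back into SP space


assert id_to_token(1) == "</s>"
assert id_to_token(OFFSET + 1) == "▁the"
```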
@@ -75,6 +75,18 @@ class PerceiverTokenizer(PreTrainedTokenizer):
         cls_token = AddedToken(cls_token, lstrip=False, rstrip=False) if isinstance(cls_token, str) else cls_token
         sep_token = AddedToken(sep_token, lstrip=False, rstrip=False) if isinstance(sep_token, str) else sep_token
+        self._utf_vocab_size = 2**8  # utf is 8 bits
+        # Since these tokens are not part of the vocabulary, we manually add them
+        self._added_tokens_decoder: Dict[str, int] = {
+            0: pad_token,
+            1: bos_token,
+            2: eos_token,
+            3: mask_token,
+            4: cls_token,
+            5: sep_token,
+        }
+        self._num_special_tokens = len(self._added_tokens_decoder)
         super().__init__(
             pad_token=pad_token,
             bos_token=bos_token,
@@ -86,31 +98,17 @@ class PerceiverTokenizer(PreTrainedTokenizer):
             **kwargs,
         )
-        self._utf_vocab_size = 2**8  # utf is 8 bits
-        # define special tokens dict
-        self.special_tokens_encoder: Dict[str, int] = {
-            self.pad_token: 0,
-            self.bos_token: 1,
-            self.eos_token: 2,
-            self.mask_token: 3,
-            self.cls_token: 4,
-            self.sep_token: 5,
-        }
-        self._num_special_tokens = len(self.special_tokens_encoder)
-        self.special_tokens_decoder: Dict[int, str] = {v: k for k, v in self.special_tokens_encoder.items()}
     def get_vocab(self) -> Dict[str, int]:
-        vocab = self.special_tokens_encoder.copy()
-        vocab.update(self.added_tokens_encoder)
+        vocab = {}
         for i in range(self._utf_vocab_size):
             token = chr(i)
-            vocab[token] = i + len(self.special_tokens_encoder)
+            vocab[token] = i + self._num_special_tokens
+        vocab.update(self.added_tokens_encoder)
         return vocab
     @property
     def vocab_size(self):
-        return self._utf_vocab_size + self._num_special_tokens
+        return self._utf_vocab_size
     def get_special_tokens_mask(
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
@@ -171,11 +169,7 @@ class PerceiverTokenizer(PreTrainedTokenizer):
     def _convert_token_to_id(self, token):
         """Converts a token (str) in an id using the vocab."""
-        if token in self.special_tokens_encoder:
-            token_id = self.special_tokens_encoder[token]
-        elif token in self.added_tokens_encoder:
-            token_id = self.added_tokens_encoder[token]
-        elif len(token) != 1:
+        if len(token) != 1:
             token_id = self.unk_token_id
         else:
             token_id = ord(token) + self._num_special_tokens
@@ -183,26 +177,16 @@ class PerceiverTokenizer(PreTrainedTokenizer):
     def _convert_id_to_token(self, index):
         """Converts an index (integer) in a token (str) using the vocab."""
-        if index in self.special_tokens_decoder:
-            token = self.special_tokens_decoder[index]
-        elif index in self.added_tokens_decoder:
-            token = self.added_tokens_decoder[index]
-        else:
         token = chr(index - self._num_special_tokens)
         return token
+    # TODO @ArthurZ refactor this as well....
     def convert_tokens_to_string(self, tokens):
         """Converts a sequence of tokens (string) in a single string."""
         bstring = b""
         for token in tokens:
-            if token in self.special_tokens_decoder:
-                tok_string = self.special_tokens_decoder[token].encode("utf-8")
-            elif token in self.added_tokens_decoder:
-                tok_string = self.special_tokens_decoder[token].encode("utf-8")
-            elif token in self.special_tokens_encoder:
-                tok_string = token.encode("utf-8")
-            elif token in self.added_tokens_encoder:
-                tok_string = token.encode("utf-8")
+            if token in self.added_tokens_encoder:
+                tok_string = str(token).encode("utf-8")
             else:
                 tok_string = bytes([ord(token)])
             bstring += tok_string
......
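With the change above, the Perceiver tokenizer becomes a pure byte-level vocab: the six special tokens live only in `_added_tokens_decoder`, `vocab_size` reports just the 256 byte values, and a regular character maps to `ord(char) + num_special_tokens`. A compact standalone sketch of that mapping (not the real class; the unknown-token id is an assumption here):

```python
# Standalone sketch of the byte-level mapping, not the real PerceiverTokenizer.
NUM_SPECIAL_TOKENS = 6  # [PAD], [BOS], [EOS], [MASK], [CLS], [SEP] live in _added_tokens_decoder
UNK_TOKEN_ID = 3        # placeholder id, an assumption for this sketch


def token_to_id(token):
    if len(token) != 1:
        return UNK_TOKEN_ID                     # multi-character strings are unknown here
    return ord(token) + NUM_SPECIAL_TOKENS      # bytes are shifted past the reserved ids


def id_to_token(index):
    return chr(index - NUM_SPECIAL_TOKENS)


assert id_to_token(token_to_id("a")) == "a"
```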
@@ -131,25 +131,14 @@ class PhobertTokenizer(PreTrainedTokenizer):
         mask_token="<mask>",
         **kwargs,
     ):
-        super().__init__(
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            **kwargs,
-        )
         self.vocab_file = vocab_file
         self.merges_file = merges_file
         self.encoder = {}
-        self.encoder[self.bos_token] = 0
-        self.encoder[self.pad_token] = 1
-        self.encoder[self.eos_token] = 2
-        self.encoder[self.unk_token] = 3
+        self.encoder[bos_token] = 0
+        self.encoder[pad_token] = 1
+        self.encoder[eos_token] = 2
+        self.encoder[unk_token] = 3
         self.add_from_file(vocab_file)
@@ -158,9 +147,21 @@ class PhobertTokenizer(PreTrainedTokenizer):
         with open(merges_file, encoding="utf-8") as merges_handle:
             merges = merges_handle.read().split("\n")[:-1]
         merges = [tuple(merge.split()[:-1]) for merge in merges]
         self.bpe_ranks = dict(zip(merges, range(len(merges))))
         self.cache = {}
+        super().__init__(
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            **kwargs,
+        )
     def build_inputs_with_special_tokens(
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
......
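PhoBERT above now seeds its encoder with the raw constructor arguments (`bos_token`, `pad_token`, ...) instead of `self.bos_token` and friends, again because those properties only exist once `super().__init__` has run. A toy sketch of that seeding step, with an illustrative helper rather than the real method:

```python
# Toy version of the PhoBERT-style seeding: the special-token ids are fixed from the
# constructor arguments before any base-class setup runs, so no self.*_token is needed.
def build_initial_encoder(bos_token="<s>", pad_token="<pad>", eos_token="</s>", unk_token="<unk>"):
    encoder = {}
    encoder[bos_token] = 0
    encoder[pad_token] = 1
    encoder[eos_token] = 2
    encoder[unk_token] = 3
    return encoder


assert build_initial_encoder()["</s>"] == 2
```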
@@ -195,23 +195,6 @@ class PLBartTokenizer(PreTrainedTokenizer):
         mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
-        super().__init__(
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            language_codes=language_codes,
-            tokenizer_file=tokenizer_file,
-            src_lang=src_lang,
-            tgt_lang=tgt_lang,
-            additional_special_tokens=additional_special_tokens,
-            sp_model_kwargs=self.sp_model_kwargs,
-            **kwargs,
-        )
         src_lang = self._convert_lang_code_special_format(src_lang)
         tgt_lang = self._convert_lang_code_special_format(tgt_lang)
@@ -245,12 +228,12 @@ class PLBartTokenizer(PreTrainedTokenizer):
         self.fairseq_tokens_to_ids.update(self.lang_code_to_id)
         self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}
-        self._additional_special_tokens = list(self.lang_code_to_id.keys())
+        _additional_special_tokens = list(self.lang_code_to_id.keys())
         if additional_special_tokens is not None:
             # Only add those special tokens if they are not already there.
-            self._additional_special_tokens.extend(
-                [t for t in additional_special_tokens if t not in self._additional_special_tokens]
+            _additional_special_tokens.extend(
+                [t for t in additional_special_tokens if t not in _additional_special_tokens]
             )
         if self.language_codes == "base":
@@ -262,6 +245,23 @@ class PLBartTokenizer(PreTrainedTokenizer):
         self._src_lang = src_lang if src_lang is not None else "__en_XX__"
         self.cur_lang_code_id = self.lang_code_to_id[self._src_lang]
+        super().__init__(
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            language_codes=language_codes,
+            tokenizer_file=tokenizer_file,
+            src_lang=src_lang,
+            tgt_lang=tgt_lang,
+            additional_special_tokens=_additional_special_tokens,
+            sp_model_kwargs=self.sp_model_kwargs,
+            **kwargs,
+        )
         self.tgt_lang = tgt_lang
         self.set_src_lang_special_tokens(self._src_lang)
......
@@ -101,14 +101,6 @@ class Pop2PianoTokenizer(PreTrainedTokenizer):
         pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
         bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
-        super().__init__(
-            unk_token=unk_token,
-            eos_token=eos_token,
-            pad_token=pad_token,
-            bos_token=bos_token,
-            **kwargs,
-        )
         self.default_velocity = default_velocity
         self.num_bars = num_bars
@@ -119,6 +111,14 @@ class Pop2PianoTokenizer(PreTrainedTokenizer):
         # create mappings for encoder
         self.decoder = {v: k for k, v in self.encoder.items()}
+        super().__init__(
+            unk_token=unk_token,
+            eos_token=eos_token,
+            pad_token=pad_token,
+            bos_token=bos_token,
+            **kwargs,
+        )
     @property
     def vocab_size(self):
         """Returns the vocabulary size of the tokenizer."""
......