"examples/vscode:/vscode.git/clone" did not exist on "076052f12eef6c2c64be85ca9c89054167cc1f24"
Unverified Commit 2da88537, authored by Arthur and committed by GitHub

🚨🚨 🚨🚨 [`Tokenizer`] attempt to fix add_token issues 🚨🚨 🚨🚨 (#23909)



* fix test for bart. Order is correct now let's skip BPEs

* phew

* styling

* fix bert....

* slow refactoring

* current updates

* massive refactoring

* update

* NICE!

* update to see where I am at

* updates

* update

* update

* revert

* updates

* updates

* start supporting legacy_save

* styling

* big update

* revert some changes

* nits

* nniiiiiice

* small fixes

* kinda fix t5 with new behaviour

* major update

* fixup

* fix copies

* today's updates

* fix byt5

* update

* update

* update

* updates

* update vocab size test

* Barthez does not need the fairseq offset ids

* super call must be after

* call super

* move all super init

* move other super init

* fixup

* nits

* more fixes

* nits

* more fixes

* nits

* more fix

* remove useless files

* ouch all of them are affected

* and more!

* small improvements

* no more sanitize token

* more changes around unique no split tokens

* partially fix more things

* keep legacy save but add warning

* so... more fixes

* updates

* guess deberta tokenizer could be nuked

* fixup

* fixup did some bad things

* nuke it if it breaks

* remove prints and pretrain fast from slow with new format.

* fixups

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* phew

* nit

* by default specials should not be normalized?

* update

* remove breakpoint

* updates

* a lot of updates

* fixup

* fixes revert some changes to match fast

* small nits

* that makes it cleaner

* fix camembert accordingly

* update

* some less breaking changes

* update

* fixup

* fix byt5 and whisper mostly

* some more fixes, canine's byte vocab

* fix gpt2

* fix most of the perceiver tests (4 left)

* fix layout lmv3

* fixup

* fix copies for gpt2 style

* make sure to only warn once

* fix perceiver and gpt2 tests

* some more backward compatibility: also read special tokens map because some people use it...

* fixup

* add else when reading

* nits

* fresh updates

* fix copies

* will this make everything faster?

* fixes

* more fixes

* update

* more fixes

* fixup

* is the source of truth right?

* sorry camembert for the troubles

* current updates

* fixup

* update led

* update

* fix regression

* fix single word

* more model specific fixes

* fix t5 tests

* fixup

* more comments

* update

* fix nllb

* rstrip removed

* small fixes

* better handle additional_special_tokens and vocab sizes

* fixing

* styling

* fix 4 / 21

* fixup

* fix nllb's tests

* some fixes

* fix t5

* fixes

* style

* fix canine tests

* damn this is nice

* nits

* m2m100 nit

* fixups

* fixes!

* fixup

* stash

* fix merge

* revert bad change

* fixup

* correct order for code Llama

* fix speecht5 post merge

* styling

* revert source of 11 fails

* small nits

* all changes in one go

* fnet hack

* fix 2 more tests

* update based on main branch of tokenizers

* fixup

* fix VITS issues

* more fixes

* fix mgp test

* fix camembert issues

* oops camembert still has 2 failing tests

* mluke fixes

* decode fixes

* small nits

* nits

* fix llama and vits

* fix camembert

* small nits

* more fixes when initialising a fast tokenizer from a slow one, etc.

* fix one of the last test

* fix CPM tokenizer test

* fixups

* fix pop2piano

* fixup

* Change tokenizers required version

* Change tokenizers required version

* "tokenizers>=0.14,<0.15", don't forget smaller than

* fix musicgen tests and PreTrainedTokenizerFast

* fix owlvit and all

* update t5

* fix 800 red

* fix tests

* fix the fix of the fix of t5

* styling

* documentation nits

* cache _added_tokens_encoder

* fixups

* Nit

* fix red tests

* one last nit!

* make everything a lot simpler

* Now it's over 😉



* few small nits

* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* updates that work for now

* tests that should not be skipped / changed and fixed next

* fixup

* i am ashamed

* push the fix

* update

* fixups

* nits

* fix added_tokens_encoder

* fix canine test

* fix pegasus vocab

* fix transfoXL

* fixup

* whisper needs to be fixed for train new

* pegasus nits

* more pegasus fixes

* minor update

* better error message in failed test

* fix whisper failing test

* fix whisper failing test

* fix pegasus

* fixup

* fix **** pegasus

* reset things

* remove another file

* attempts to fix the strange custom encoder and offset

* nits here and there

* update

* fixup

* nit

* fix the whisper test

* nits nits

* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* updates based on review

* some small update to potentially remove

* nits

* import lru cache

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Lysandre Debut <hi@lysand.re>

* move warning to `from_pretrained`

* update test results now that the special tokens are always added

---------
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Lysandre Debut <hi@lysand.re>
parent 835b0a05
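The change repeated across nearly every slow tokenizer in the diff below is an ordering one: each `__init__` now sets up its vocabulary and internal state first and calls `super().__init__()` last, so that `PreTrainedTokenizer` can resolve special and added tokens against a fully built vocab. A minimal sketch of the pattern, using a hypothetical `ToyTokenizer` that is not part of this PR:

```python
# A minimal sketch of the new init ordering, assuming a hypothetical ToyTokenizer
# (not a class from this PR): state and vocab are built *before* super().__init__(),
# so the base class can resolve special/added tokens against a complete vocab.
from transformers import PreTrainedTokenizer


class ToyTokenizer(PreTrainedTokenizer):
    def __init__(self, vocab_file, unk_token="<unk>", **kwargs):
        # 1. Build everything the base class may need (vocab, merges, sp_model, ...).
        with open(vocab_file, encoding="utf-8") as vocab_handle:
            self.encoder = {line.strip(): idx for idx, line in enumerate(vocab_handle)}
        self.decoder = {idx: token for token, idx in self.encoder.items()}

        # 2. Only then hand over to PreTrainedTokenizer, which registers the
        #    special tokens and any added tokens on top of the existing vocab.
        super().__init__(unk_token=unk_token, **kwargs)

    @property
    def vocab_size(self):
        return len(self.encoder)

    def get_vocab(self):
        return dict(self.encoder, **self.added_tokens_encoder)

    def _tokenize(self, text):
        return text.split()

    def _convert_token_to_id(self, token):
        return self.encoder.get(token, self.encoder.get(str(self.unk_token)))

    def _convert_id_to_token(self, index):
        return self.decoder.get(index, str(self.unk_token))
```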
@@ -312,16 +312,6 @@ class CLIPTokenizer(PreTrainedTokenizer):
         bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
         eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
         unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
-        super().__init__(
-            errors=errors,
-            unk_token=unk_token,
-            bos_token=bos_token,
-            eos_token=eos_token,
-            pad_token=pad_token,
-            **kwargs,
-        )
         try:
             import ftfy
@@ -348,6 +338,15 @@ class CLIPTokenizer(PreTrainedTokenizer):
             re.IGNORECASE,
         )
+        super().__init__(
+            errors=errors,
+            unk_token=unk_token,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            pad_token=pad_token,
+            **kwargs,
+        )
     @property
     def vocab_size(self):
         return len(self.encoder)
...
@@ -151,6 +151,17 @@ class CodeLlamaTokenizer(PreTrainedTokenizer):
         for token in [prefix_token, middle_token, suffix_token, eot_token]:
             additional_special_tokens += [token] if token is not None else []
+        self.vocab_file = vocab_file
+        self.add_bos_token = add_bos_token
+        self.add_eos_token = add_eos_token
+        self._prefix_token = prefix_token
+        self._middle_token = middle_token
+        self._suffix_token = suffix_token
+        self._eot_token = eot_token
+        self.fill_token = fill_token
+        self.suffix_first = suffix_first
+        self.sp_model = self.get_spm_processor()
         super().__init__(
             bos_token=bos_token,
             eos_token=eos_token,
@@ -169,16 +180,6 @@ class CodeLlamaTokenizer(PreTrainedTokenizer):
             use_default_system_prompt=use_default_system_prompt,
             **kwargs,
         )
-        self.vocab_file = vocab_file
-        self.add_bos_token = add_bos_token
-        self.add_eos_token = add_eos_token
-        self._prefix_token = prefix_token
-        self._middle_token = middle_token
-        self._suffix_token = suffix_token
-        self._eot_token = eot_token
-        self.fill_token = fill_token
-        self.suffix_first = suffix_first
-        self.sp_model = self.get_spm_processor()
     @property
     def unk_token_length(self):
...
@@ -167,16 +167,6 @@ class CodeGenTokenizer(PreTrainedTokenizer):
         eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
         unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
         pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
-        super().__init__(
-            errors=errors,
-            unk_token=unk_token,
-            bos_token=bos_token,
-            eos_token=eos_token,
-            pad_token=pad_token,
-            add_prefix_space=add_prefix_space,
-            add_bos_token=add_bos_token,
-            **kwargs,
-        )
         self.add_bos_token = add_bos_token
         with open(vocab_file, encoding="utf-8") as vocab_handle:
@@ -194,6 +184,16 @@ class CodeGenTokenizer(PreTrainedTokenizer):
         # Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
         self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
+        super().__init__(
+            errors=errors,
+            unk_token=unk_token,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            pad_token=pad_token,
+            add_prefix_space=add_prefix_space,
+            add_bos_token=add_bos_token,
+            **kwargs,
+        )
     @property
     def vocab_size(self):
...
@@ -135,20 +135,6 @@ class ConvBertTokenizer(PreTrainedTokenizer):
         strip_accents=None,
         **kwargs,
     ):
-        super().__init__(
-            do_lower_case=do_lower_case,
-            do_basic_tokenize=do_basic_tokenize,
-            never_split=never_split,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            tokenize_chinese_chars=tokenize_chinese_chars,
-            strip_accents=strip_accents,
-            **kwargs,
-        )
         if not os.path.isfile(vocab_file):
             raise ValueError(
                 f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
@@ -164,7 +150,22 @@ class ConvBertTokenizer(PreTrainedTokenizer):
                 tokenize_chinese_chars=tokenize_chinese_chars,
                 strip_accents=strip_accents,
             )
-        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
+        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=str(unk_token))
+        super().__init__(
+            do_lower_case=do_lower_case,
+            do_basic_tokenize=do_basic_tokenize,
+            never_split=never_split,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            mask_token=mask_token,
+            tokenize_chinese_chars=tokenize_chinese_chars,
+            strip_accents=strip_accents,
+            **kwargs,
+        )
     @property
     def do_lower_case(self):
...
@@ -38,6 +38,9 @@ PRETRAINED_VOCAB_FILES_MAP = {
 class CpmTokenizer(PreTrainedTokenizer):
     """Runs pre-tokenization with Jieba segmentation tool. It is used in CPM models."""
+    vocab_files_names = VOCAB_FILES_NAMES
+    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
     def __init__(
         self,
         vocab_file,
@@ -121,24 +124,6 @@ class CpmTokenizer(PreTrainedTokenizer):
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
-        super().__init__(
-            do_lower_case=do_lower_case,
-            remove_space=remove_space,
-            keep_accents=keep_accents,
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            additional_special_tokens=additional_special_tokens,
-            sp_model_kwargs=self.sp_model_kwargs,
-            **kwargs,
-        )
-        self._pad_token_type_id = 3
         self.do_lower_case = do_lower_case
         self.remove_space = remove_space
         self.keep_accents = keep_accents
@@ -157,6 +142,24 @@ class CpmTokenizer(PreTrainedTokenizer):
         self.jieba = jieba
         self.translator = str.maketrans(" \n", "\u2582\u2583")
+        super().__init__(
+            do_lower_case=do_lower_case,
+            remove_space=remove_space,
+            keep_accents=keep_accents,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            mask_token=mask_token,
+            additional_special_tokens=additional_special_tokens,
+            sp_model_kwargs=self.sp_model_kwargs,
+            **kwargs,
+        )
+        self._pad_token_type_id = 3
     @property
     # Copied from transformers.models.xlnet.tokenization_xlnet.XLNetTokenizer.vocab_size
     def vocab_size(self):
...
@@ -131,18 +131,6 @@ class CpmAntTokenizer(PreTrainedTokenizer):
         **kwargs,
     ):
         requires_backends(self, ["jieba"])
-        super().__init__(
-            bod_token=bod_token,
-            eod_token=eod_token,
-            bos_token=bos_token,
-            eos_token=eos_token,
-            pad_token=pad_token,
-            unk_token=unk_token,
-            line_token=line_token,
-            space_token=space_token,
-            padding_side=padding_side,
-            **kwargs,
-        )
         self.bod_token = bod_token
         self.eod_token = eod_token
         self.encoder = load_vocab(vocab_file)
@@ -155,7 +143,20 @@ class CpmAntTokenizer(PreTrainedTokenizer):
         self.encoder = collections.OrderedDict(sorted(self.encoder.items(), key=lambda x: x[1]))
         self.decoder = {v: k for k, v in self.encoder.items()}
-        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.encoder, unk_token=self.unk_token)
+        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.encoder, unk_token=unk_token)
+        super().__init__(
+            bod_token=bod_token,
+            eod_token=eod_token,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            pad_token=pad_token,
+            unk_token=unk_token,
+            line_token=line_token,
+            space_token=space_token,
+            padding_side=padding_side,
+            **kwargs,
+        )
     @property
     def bod_token_id(self):
...
@@ -139,8 +139,6 @@ class CTRLTokenizer(PreTrainedTokenizer):
     control_codes = CONTROL_CODES
     def __init__(self, vocab_file, merges_file, unk_token="<unk>", **kwargs):
-        super().__init__(unk_token=unk_token, **kwargs)
         with open(vocab_file, encoding="utf-8") as vocab_handle:
             self.encoder = json.load(vocab_handle)
         self.decoder = {v: k for k, v in self.encoder.items()}
@@ -149,6 +147,7 @@ class CTRLTokenizer(PreTrainedTokenizer):
         merges = [tuple(merge.split()) for merge in merges]
         self.bpe_ranks = dict(zip(merges, range(len(merges))))
         self.cache = {}
+        super().__init__(unk_token=unk_token, **kwargs)
     @property
     def vocab_size(self):
...
@@ -201,20 +201,6 @@ class DebertaTokenizer(PreTrainedTokenizer):
         # Mask token behave like a normal word, i.e. include the space before it
         mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
-        super().__init__(
-            errors=errors,
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            add_prefix_space=add_prefix_space,
-            add_bos_token=add_bos_token,
-            **kwargs,
-        )
         self.add_bos_token = add_bos_token
         with open(vocab_file, encoding="utf-8") as vocab_handle:
@@ -233,6 +219,20 @@ class DebertaTokenizer(PreTrainedTokenizer):
         # Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
         self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
+        super().__init__(
+            errors=errors,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            add_prefix_space=add_prefix_space,
+            add_bos_token=add_bos_token,
+            **kwargs,
+        )
     @property
     # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.vocab_size
     def vocab_size(self):
...
@@ -20,9 +20,12 @@ from typing import Any, Dict, List, Optional, Tuple
 import sentencepiece as sp
-from ...tokenization_utils import PreTrainedTokenizer
+from ...tokenization_utils import AddedToken, PreTrainedTokenizer
+from ...utils import logging
+logger = logging.get_logger(__name__)
 PRETRAINED_VOCAB_FILES_MAP = {
     "vocab_file": {
         "microsoft/deberta-v2-xlarge": "https://huggingface.co/microsoft/deberta-v2-xlarge/resolve/main/spm.model",
@@ -124,6 +127,18 @@ class DebertaV2Tokenizer(PreTrainedTokenizer):
     ) -> None:
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
+        if not os.path.isfile(vocab_file):
+            raise ValueError(
+                f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
+                " model use `tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`"
+            )
+        self.do_lower_case = do_lower_case
+        self.split_by_punct = split_by_punct
+        self.vocab_file = vocab_file
+        self._tokenizer = SPMTokenizer(
+            vocab_file, None, split_by_punct=split_by_punct, sp_model_kwargs=self.sp_model_kwargs
+        )
+        unk_token = AddedToken(unk_token, normalized=True, lstrip=False, rstrip=False)
         super().__init__(
             do_lower_case=do_lower_case,
             bos_token=bos_token,
@@ -137,18 +152,7 @@ class DebertaV2Tokenizer(PreTrainedTokenizer):
             sp_model_kwargs=self.sp_model_kwargs,
             **kwargs,
         )
-        if not os.path.isfile(vocab_file):
-            raise ValueError(
-                f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
-                " model use `tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`"
-            )
-        self.do_lower_case = do_lower_case
-        self.split_by_punct = split_by_punct
-        self.vocab_file = vocab_file
-        self._tokenizer = SPMTokenizer(
-            vocab_file, self.all_special_tokens, split_by_punct=split_by_punct, sp_model_kwargs=self.sp_model_kwargs
-        )
+        self._tokenizer.special_tokens = self.all_special_tokens
     @property
     def vocab_size(self):
@@ -374,6 +378,7 @@ class SPMTokenizer:
         text = "".join(words[word_start:word_end])
         return text
+    # TODO add a deprecation cycle as this can have different behaviour from our API
     def add_special_token(self, token):
         if token not in self.special_tokens:
             self.special_tokens.append(token)
@@ -383,6 +388,9 @@ class SPMTokenizer:
         return self.id(token)
     def part_of_whole_word(self, token, is_bos=False):
+        logger.warning_once(
+            "The `DebertaTokenizer.part_of_whole_word` method is deprecated and will be removed in `transformers==4.35`"
+        )
         if is_bos:
             return True
         if (
@@ -413,6 +421,9 @@ class SPMTokenizer:
         return self.ids_to_tokens[id]
     def id(self, sym):
+        logger.warning_once(
+            "The `DebertaTokenizer.id` method is deprecated and will be removed in `transformers==4.35`"
+        )
         return self.vocab[sym] if sym in self.vocab else 1
     def _encode_as_pieces(self, text):
@@ -460,17 +471,6 @@ class SPMTokenizer:
         return words
-    def _run_strip_accents(self, text):
-        """Strips accents from a piece of text."""
-        text = unicodedata.normalize("NFD", text)
-        output = []
-        for char in text:
-            cat = unicodedata.category(char)
-            if cat == "Mn":
-                continue
-            output.append(char)
-        return "".join(output)
     def _run_split_on_punc(self, text):
         """Splits punctuation on a piece of text."""
         chars = list(text)
...
@@ -132,20 +132,6 @@ class RetriBertTokenizer(PreTrainedTokenizer):
         strip_accents=None,
         **kwargs,
     ):
-        super().__init__(
-            do_lower_case=do_lower_case,
-            do_basic_tokenize=do_basic_tokenize,
-            never_split=never_split,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            tokenize_chinese_chars=tokenize_chinese_chars,
-            strip_accents=strip_accents,
-            **kwargs,
-        )
         if not os.path.isfile(vocab_file):
             raise ValueError(
                 f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
@@ -161,7 +147,22 @@ class RetriBertTokenizer(PreTrainedTokenizer):
                 tokenize_chinese_chars=tokenize_chinese_chars,
                 strip_accents=strip_accents,
             )
-        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
+        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=str(unk_token))
+        super().__init__(
+            do_lower_case=do_lower_case,
+            do_basic_tokenize=do_basic_tokenize,
+            never_split=never_split,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            mask_token=mask_token,
+            tokenize_chinese_chars=tokenize_chinese_chars,
+            strip_accents=strip_accents,
+            **kwargs,
+        )
     @property
     # Copied from transformers.models.bert.tokenization_bert.BertTokenizer.do_lower_case
...
@@ -296,23 +296,6 @@ class TapexTokenizer(PreTrainedTokenizer):
         # Mask token behave like a normal word, i.e. include the space before it
         mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
-        super().__init__(
-            vocab_file=vocab_file,
-            merges_file=merges_file,
-            do_lower_case=do_lower_case,
-            errors=errors,
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            add_prefix_space=add_prefix_space,
-            max_cell_length=max_cell_length,
-            **kwargs,
-        )
         with open(vocab_file, encoding="utf-8") as vocab_handle:
             self.encoder = json.load(vocab_handle)
         self.decoder = {v: k for k, v in self.encoder.items()}
@@ -331,6 +314,24 @@ class TapexTokenizer(PreTrainedTokenizer):
         self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
         # additional properties
+        super().__init__(
+            vocab_file=vocab_file,
+            merges_file=merges_file,
+            do_lower_case=do_lower_case,
+            errors=errors,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            add_prefix_space=add_prefix_space,
+            max_cell_length=max_cell_length,
+            **kwargs,
+        )
         self.max_cell_length = max_cell_length
         self.table_linearize = IndexedRowTableLinearize()
...
@@ -149,20 +149,6 @@ class DistilBertTokenizer(PreTrainedTokenizer):
         strip_accents=None,
         **kwargs,
     ):
-        super().__init__(
-            do_lower_case=do_lower_case,
-            do_basic_tokenize=do_basic_tokenize,
-            never_split=never_split,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            tokenize_chinese_chars=tokenize_chinese_chars,
-            strip_accents=strip_accents,
-            **kwargs,
-        )
         if not os.path.isfile(vocab_file):
             raise ValueError(
                 f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
@@ -178,7 +164,21 @@ class DistilBertTokenizer(PreTrainedTokenizer):
                 tokenize_chinese_chars=tokenize_chinese_chars,
                 strip_accents=strip_accents,
             )
-        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
+        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=str(unk_token))
+        super().__init__(
+            do_lower_case=do_lower_case,
+            do_basic_tokenize=do_basic_tokenize,
+            never_split=never_split,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            mask_token=mask_token,
+            tokenize_chinese_chars=tokenize_chinese_chars,
+            strip_accents=strip_accents,
+            **kwargs,
+        )
     @property
     # Copied from transformers.models.bert.tokenization_bert.BertTokenizer.do_lower_case
...
@@ -152,20 +152,6 @@ class ElectraTokenizer(PreTrainedTokenizer):
         strip_accents=None,
         **kwargs,
     ):
-        super().__init__(
-            do_lower_case=do_lower_case,
-            do_basic_tokenize=do_basic_tokenize,
-            never_split=never_split,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            tokenize_chinese_chars=tokenize_chinese_chars,
-            strip_accents=strip_accents,
-            **kwargs,
-        )
         if not os.path.isfile(vocab_file):
             raise ValueError(
                 f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
@@ -181,7 +167,22 @@ class ElectraTokenizer(PreTrainedTokenizer):
                 tokenize_chinese_chars=tokenize_chinese_chars,
                 strip_accents=strip_accents,
             )
-        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
+        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=str(unk_token))
+        super().__init__(
+            do_lower_case=do_lower_case,
+            do_basic_tokenize=do_basic_tokenize,
+            never_split=never_split,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            mask_token=mask_token,
+            tokenize_chinese_chars=tokenize_chinese_chars,
+            strip_accents=strip_accents,
+            **kwargs,
+        )
     @property
     def do_lower_case(self):
...
@@ -112,6 +112,19 @@ class ErnieMTokenizer(PreTrainedTokenizer):
         # is included in the raw text, there should be a match in a non-normalized sentence.
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
+        self.do_lower_case = do_lower_case
+        self.sentencepiece_model_ckpt = sentencepiece_model_ckpt
+        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+        self.sp_model.Load(sentencepiece_model_ckpt)
+        # to mimic paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer functioning
+        if vocab_file is not None:
+            self.vocab = self.load_vocab(filepath=vocab_file)
+        else:
+            self.vocab = {self.sp_model.id_to_piece(id): id for id in range(self.sp_model.get_piece_size())}
+        self.reverse_vocab = {v: k for k, v in self.vocab.items()}
         super().__init__(
             do_lower_case=do_lower_case,
             unk_token=unk_token,
@@ -124,17 +137,6 @@ class ErnieMTokenizer(PreTrainedTokenizer):
             sp_model_kwargs=self.sp_model_kwargs,
             **kwargs,
         )
-        self.do_lower_case = do_lower_case
-        self.sentencepiece_model_ckpt = sentencepiece_model_ckpt
-        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
-        self.sp_model.Load(sentencepiece_model_ckpt)
-        # to mimic paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer functioning
-        if vocab_file is not None:
-            self.vocab = self.load_vocab(filepath=vocab_file)
-        else:
-            self.vocab = {self.sp_model.id_to_piece(id): id for id in range(self.sp_model.get_piece_size())}
-        self.reverse_vocab = {v: k for k, v in self.vocab.items()}
     def get_offset_mapping(self, text):
         if text is None:
...
@@ -64,17 +64,23 @@ class EsmTokenizer(PreTrainedTokenizer):
         eos_token="<eos>",
         **kwargs,
     ):
-        super().__init__(**kwargs)
         self.all_tokens = load_vocab_file(vocab_file)
         self._id_to_token = dict(enumerate(self.all_tokens))
         self._token_to_id = {tok: ind for ind, tok in enumerate(self.all_tokens)}
-        self.unk_token = unk_token
-        self.cls_token = cls_token
-        self.pad_token = pad_token
-        self.mask_token = mask_token
-        self.eos_token = eos_token
+        super().__init__(
+            unk_token=unk_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            eos_token=eos_token,
+            **kwargs,
+        )
+        # TODO, all the tokens are added? But they are also part of the vocab... bit strange.
+        # none of them are special, but they all need special splitting.
         self.unique_no_split_tokens = self.all_tokens
-        self._create_trie(self.unique_no_split_tokens)
+        self._update_trie(self.unique_no_split_tokens)
     def _convert_id_to_token(self, index: int) -> str:
         return self._id_to_token.get(index, self.unk_token)
...
@@ -258,19 +258,6 @@ class FlaubertTokenizer(PreTrainedTokenizer):
         self.do_lowercase = do_lowercase
-        super().__init__(
-            unk_token=unk_token,
-            bos_token=bos_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            additional_special_tokens=additional_special_tokens,
-            lang2id=lang2id,
-            id2lang=id2lang,
-            **kwargs,
-        )
         try:
             import sacremoses
         except ImportError:
@@ -303,6 +290,19 @@ class FlaubertTokenizer(PreTrainedTokenizer):
         self.bpe_ranks = dict(zip(merges, range(len(merges))))
         self.cache = {}
+        super().__init__(
+            unk_token=unk_token,
+            bos_token=bos_token,
+            sep_token=sep_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            mask_token=mask_token,
+            additional_special_tokens=additional_special_tokens,
+            lang2id=lang2id,
+            id2lang=id2lang,
+            **kwargs,
+        )
     @property
     # Copied from transformers.models.xlm.tokenization_xlm.XLMTokenizer.do_lower_case
     def do_lower_case(self):
...
@@ -15,7 +15,6 @@
 """ Tokenization classes for FNet model."""
 import os
-import re
 import unicodedata
 from shutil import copyfile
 from typing import Any, Dict, List, Optional, Tuple
@@ -117,14 +116,19 @@ class FNetTokenizer(PreTrainedTokenizer):
     ) -> None:
         # Mask token behave like a normal word, i.e. include the space before it and
         # is included in the raw text, there should be a match in a non-normalized sentence.
-        mask_token = (
-            AddedToken(mask_token, lstrip=True, rstrip=False, normalized=False)
-            if isinstance(mask_token, str)
-            else mask_token
-        )
+        mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
+        cls_token = AddedToken(cls_token, lstrip=False, rstrip=False) if isinstance(cls_token, str) else cls_token
+        sep_token = AddedToken(sep_token, lstrip=False, rstrip=False) if isinstance(sep_token, str) else sep_token
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
+        self.do_lower_case = do_lower_case
+        self.remove_space = remove_space
+        self.keep_accents = keep_accents
+        self.vocab_file = vocab_file
+        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+        self.sp_model.Load(vocab_file)
         super().__init__(
             do_lower_case=do_lower_case,
             remove_space=remove_space,
@@ -138,14 +142,6 @@ class FNetTokenizer(PreTrainedTokenizer):
             **kwargs,
         )
-        self.do_lower_case = do_lower_case
-        self.remove_space = remove_space
-        self.keep_accents = keep_accents
-        self.vocab_file = vocab_file
-        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
-        self.sp_model.Load(vocab_file)
     @property
     def vocab_size(self):
         return len(self.sp_model)
@@ -237,48 +233,21 @@ class FNetTokenizer(PreTrainedTokenizer):
         token_ids: List[int],
         skip_special_tokens: bool = False,
         clean_up_tokenization_spaces: bool = None,
-        spaces_between_special_tokens: bool = True,
+        spaces_between_special_tokens: bool = False,
         **kwargs,
     ) -> str:
-        self._decode_use_source_tokenizer = kwargs.pop("use_source_tokenizer", False)
-        filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
-        # To avoid mixing byte-level and unicode for byte-level BPT
-        # we need to build string separately for added tokens and byte-level tokens
-        # cf. https://github.com/huggingface/transformers/issues/1133
-        sub_texts = []
-        current_sub_text = []
-        for token in filtered_tokens:
-            if skip_special_tokens and token in self.all_special_ids:
-                continue
-            if token in self.added_tokens_encoder:
-                if current_sub_text:
-                    sub_texts.append(self.convert_tokens_to_string(current_sub_text))
-                    current_sub_text = []
-                sub_texts.append(token)
-            else:
-                current_sub_text.append(token)
-        if current_sub_text:
-            sub_texts.append(self.convert_tokens_to_string(current_sub_text))
+        text = super()._decode(
+            token_ids=token_ids,
+            skip_special_tokens=skip_special_tokens,
+            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
+            spaces_between_special_tokens=spaces_between_special_tokens,
+            **kwargs,
+        )
         # Mimic the behavior of the Rust tokenizer:
         # No space after <unk>
-        if spaces_between_special_tokens:
-            text = re.sub(r"(<unk>) ", r"\1", " ".join(sub_texts))
-        else:
-            text = "".join(sub_texts)
-        clean_up_tokenization_spaces = (
-            clean_up_tokenization_spaces
-            if clean_up_tokenization_spaces is not None
-            else self.clean_up_tokenization_spaces
-        )
-        if clean_up_tokenization_spaces:
-            clean_text = self.clean_up_tokenization(text)
-            return clean_text
-        else:
-            return text
+        if not spaces_between_special_tokens:
+            text = text.replace("<unk> ", "<unk>")
+        return text
     def build_inputs_with_special_tokens(
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
...
@@ -108,11 +108,9 @@ class FNetTokenizerFast(PreTrainedTokenizerFast):
     ):
         # Mask token behave like a normal word, i.e. include the space before it and
         # is included in the raw text, there should be a match in a non-normalized sentence.
-        mask_token = (
-            AddedToken(mask_token, lstrip=True, rstrip=False, normalized=False)
-            if isinstance(mask_token, str)
-            else mask_token
-        )
+        mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
+        cls_token = AddedToken(cls_token, lstrip=False, rstrip=False) if isinstance(cls_token, str) else cls_token
+        sep_token = AddedToken(sep_token, lstrip=False, rstrip=False) if isinstance(sep_token, str) else sep_token
         super().__init__(
             vocab_file,
...
@@ -197,19 +197,6 @@ class FSMTTokenizer(PreTrainedTokenizer):
         pad_token="<pad>",
         **kwargs,
     ):
-        super().__init__(
-            langs=langs,
-            src_vocab_file=src_vocab_file,
-            tgt_vocab_file=tgt_vocab_file,
-            merges_file=merges_file,
-            do_lower_case=do_lower_case,
-            unk_token=unk_token,
-            bos_token=bos_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            **kwargs,
-        )
         try:
             import sacremoses
         except ImportError:
@@ -250,6 +237,18 @@ class FSMTTokenizer(PreTrainedTokenizer):
         merges = [tuple(merge.split()[:2]) for merge in merges]
         self.bpe_ranks = dict(zip(merges, range(len(merges))))
         self.cache = {}
+        super().__init__(
+            langs=langs,
+            src_vocab_file=src_vocab_file,
+            tgt_vocab_file=tgt_vocab_file,
+            merges_file=merges_file,
+            do_lower_case=do_lower_case,
+            unk_token=unk_token,
+            bos_token=bos_token,
+            sep_token=sep_token,
+            pad_token=pad_token,
+            **kwargs,
+        )
     # hack override
     def get_vocab(self) -> Dict[str, int]:
...
@@ -157,22 +157,6 @@ class FunnelTokenizer(PreTrainedTokenizer):
         strip_accents=None,
         **kwargs,
     ):
-        super().__init__(
-            do_lower_case=do_lower_case,
-            do_basic_tokenize=do_basic_tokenize,
-            never_split=never_split,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            bos_token=bos_token,
-            eos_token=eos_token,
-            tokenize_chinese_chars=tokenize_chinese_chars,
-            strip_accents=strip_accents,
-            **kwargs,
-        )
         if not os.path.isfile(vocab_file):
             raise ValueError(
                 f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
@@ -188,7 +172,23 @@ class FunnelTokenizer(PreTrainedTokenizer):
                 tokenize_chinese_chars=tokenize_chinese_chars,
                 strip_accents=strip_accents,
             )
-        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
+        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=str(unk_token))
+        super().__init__(
+            do_lower_case=do_lower_case,
+            do_basic_tokenize=do_basic_tokenize,
+            never_split=never_split,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            mask_token=mask_token,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            tokenize_chinese_chars=tokenize_chinese_chars,
+            strip_accents=strip_accents,
+            **kwargs,
+        )
     @property
     # Copied from transformers.models.bert.tokenization_bert.BertTokenizer.do_lower_case
...
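A second recurring detail in the hunks above: the WordPiece-based tokenizers now build their `WordpieceTokenizer` with `str(unk_token)` instead of `self.unk_token`, since `self.unk_token` only exists once `super().__init__()` has run, and the incoming `unk_token` may be an `AddedToken` rather than a plain string. A small illustration (not part of the diff):

```python
# Illustration only: AddedToken carries tokenization flags (lstrip/rstrip/...),
# while WordpieceTokenizer expects the raw string content, hence str(unk_token).
from transformers.tokenization_utils import AddedToken

unk_token = AddedToken("[UNK]", lstrip=False, rstrip=False)
print(str(unk_token))  # -> "[UNK]", the plain string handed to WordpieceTokenizer
```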