Unverified commit 96ab75b8 authored by Funtowicz Morgan, committed by GitHub

Tokenizers v3.0.0 (#3185)



* Renamed num_added_tokens to num_special_tokens_to_add
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
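
A minimal sketch of the renamed helper; the checkpoint name is illustrative and the counts in the comments are what a BERT-style post-processor is expected to add:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Number of special tokens added around a single sequence, e.g. [CLS] ... [SEP]
print(tokenizer.num_special_tokens_to_add())
# Number of special tokens added around a pair, e.g. [CLS] ... [SEP] ... [SEP]
print(tokenizer.num_special_tokens_to_add(pair=True))
```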

* Cherry-pick: Partially fix space-only input without special tokens added to the output (#3091)
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
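
A short sketch of the new property, assuming a standard BERT checkpoint is available:

```python
from transformers import BertTokenizer, BertTokenizerFast

slow = BertTokenizer.from_pretrained("bert-base-uncased")
fast = BertTokenizerFast.from_pretrained("bert-base-uncased")

assert not slow.is_fast  # pure-Python implementation
assert fast.is_fast      # backed by the Rust tokenizers library
```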

* Make fast tokenizers unittests work on Windows.

* Entirely refactored unittest for tokenizers fast.

* Remove ABC class for CommonFastTokenizerTest

* Added embeded_special_tokens tests from allenai @dirkgr

* Make embeded_special_tokens tests from allenai more generic

* Uniformize vocab_size as a property for both Fast and normal tokenizers

* Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)

* Ensure providing None input raises the same ValueError as the Python tokenizer, with tests.

* Fix invalid input for assert_padding when testing batch_encode_plus

* Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.
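
A hedged sketch of the new call-site parameter (the checkpoint name is illustrative):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# add_special_tokens is now chosen per call rather than fixed at construction time.
with_special = tokenizer.encode("Hello world", add_special_tokens=True)
without_special = tokenizer.encode("Hello world", add_special_tokens=False)

# The length difference should match the number of special tokens for a single sequence.
assert len(with_special) - len(without_special) == tokenizer.num_special_tokens_to_add()
```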

* Ensure tokenize() correctly forwards add_special_tokens to Rust.

* Add None checking on top of encode / encode_batch for TransfoXLTokenizerFast.
Avoid stripping on None values.

* Unittests ensure tokenize() also throws a ValueError if provided None.

* Added add_special_tokens unittest for all supported models.

* Style

* Make sure TransfoXL tests run only if PyTorch is available.

* Split up tokenizers tests for each model type.

* Fix invalid unittest with new tokenizers API.

* Filter out Roberta openai detector models from unittests.

* Introduce BatchEncoding on fast tokenizers path.

This new structure exposes all the mappings retrieved from Rust.
It also keeps the current behavior with model forward.
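
A rough sketch of what the new return type exposes; the offset mapping is only available on the fast path, and the checkpoint name is illustrative:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoding = tokenizer.encode_plus("Hello world", return_offsets_mapping=True)

# Dict-style access is preserved, so passing the outputs to a model forward keeps working.
print(encoding["input_ids"])
# Rust-side mappings are surfaced as extra entries, e.g. (start, end) character
# spans for each token; Encoding-style accessors are exposed on the object as well.
print(encoding["offset_mapping"])
```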

* Introduce BatchEncoding on slow tokenizers path.

Backward compatibility.

* Improve error message on BatchEncoding for slow path

* Make add_prefix_space True by default on Roberta fast to match the Python tokenizer in the majority of cases.

* Style and format.

* Added typing on all methods for PretrainedTokenizerFast

* Style and format

* Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.

* Style and format

* encode_plus now supports pretokenized inputs.

* Remove user warning about add_special_tokens when working on pretokenized inputs.

* Always go through the post processor.

* Added support for pretokenized input pairs on encode_plus

* Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.
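
A sketch of feeding pretokenized input with the new flag; the word list and checkpoint are illustrative:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

words = ["Hello", "world", "!"]  # input already split into words
encoding = tokenizer.encode_plus(words, is_pretokenized=True, add_special_tokens=True)
print(encoding["input_ids"])
```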

* Added pretokenized inputs support on batch_encode_plus

* Update BatchEncoding method names to match Encoding.

* Bump setup.py tokenizers dependency to 0.7.0rc1

* Remove unused parameters in BertTokenizerFast

* Make sure Roberta returns token_type_ids for unittests.

* Added missing typings

* Update add_tokens prototype to match tokenizers side and allow AddedToken
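
A sketch of the widened prototype, mirroring how the Rust side models added tokens (the token string is a placeholder):

```python
from tokenizers import AddedToken
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

# Plain strings still work; AddedToken additionally carries per-token options
# such as left/right stripping behaviour.
tokenizer.add_tokens([AddedToken("<new_tok>", lstrip=True)])
```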

* Bumping tokenizers to 0.7.0rc2

* Added documentation for BatchEncoding

* Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.

* Added higher-level typing for tokenize / encode_plus / batch_encode_plus.

* Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.

* Fix text-classification pipeline using the wrong tokenizer

* Make pipelines work with BatchEncoding.

* Turn off add_special_tokens on tokenize by default.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove add_prefix_space from tokenize call in unittest.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Style and quality
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Correct message for batch_encode_plus None input exception.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix invalid list comprehension for offset_mapping overriding content on every iteration.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* TransfoXL uses Strip normalizer.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bump tokenizers dependency to 0.7.0rc3
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Support AddedTokens for special_tokens and use left stripping on mask for Roberta.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
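
A rough illustration of the intended effect; the exact token strings depend on the checkpoint and are not asserted here:

```python
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

# With lstrip=True on <mask>, the space preceding the mask is absorbed by the
# mask token itself instead of producing an extra encoded space (see #2778).
ids = tokenizer.encode("Paris is the <mask> of France.")
print(tokenizer.convert_ids_to_tokens(ids))
```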

* SpecialTokensMixin can use slots for faster access to underlying attributes.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove update_special_tokens from fast tokenizers.

* Ensure TransfoXL unittests are run only when torch is available.

* Style.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Style

* Style 🙏🙏



* Remove slots on SpecialTokensMixin; needs a deeper dive into the pickle protocol.

* Remove Roberta warning on __init__.

* Move documentation to Google style.
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
parent e52d1258
@@ -96,7 +96,7 @@ setup(
     packages=find_packages("src"),
     install_requires=[
         "numpy",
-        "tokenizers == 0.5.2",
+        "tokenizers == 0.7.0rc3",
         # dataclasses for Python versions that don't have it
         "dataclasses;python_version<'3.7'",
         # accessing files from S3 directly
@@ -459,7 +459,7 @@ class Pipeline(_ScikitCompat):
         )

         # Filter out features not available on specific models
-        inputs = self.inputs_for_model(inputs)
+        # inputs = self.inputs_for_model(inputs)

         return inputs
@@ -480,7 +480,7 @@ class Pipeline(_ScikitCompat):
         with self.device_placement():
             if self.framework == "tf":
                 # TODO trace model
-                predictions = self.model(inputs, training=False)[0]
+                predictions = self.model(inputs.data, training=False)[0]
             else:
                 with torch.no_grad():
                     inputs = self.ensure_tensor_on_device(**inputs)
@@ -778,7 +778,7 @@ class NerPipeline(Pipeline):
             # Forward
             if self.framework == "tf":
-                entities = self.model(tokens)[0][0].numpy()
+                entities = self.model(tokens.data)[0][0].numpy()
                 input_ids = tokens["input_ids"].numpy()[0]
             else:
                 with torch.no_grad():
@@ -1399,7 +1399,7 @@ SUPPORTED_TASKS = {
                 "tf": "distilbert-base-uncased-finetuned-sst-2-english",
             },
             "config": "distilbert-base-uncased-finetuned-sst-2-english",
-            "tokenizer": "distilbert-base-uncased",
+            "tokenizer": "distilbert-base-cased",
         },
     },
     "ner": {
@@ -592,8 +592,6 @@ class BertTokenizerFast(PreTrainedTokenizerFast):
         self,
         vocab_file,
         do_lower_case=True,
-        do_basic_tokenize=True,
-        never_split=None,
         unk_token="[UNK]",
         sep_token="[SEP]",
         pad_token="[PAD]",
@@ -601,7 +599,6 @@ class BertTokenizerFast(PreTrainedTokenizerFast):
         mask_token="[MASK]",
         clean_text=True,
         tokenize_chinese_chars=True,
-        add_special_tokens=True,
         strip_accents=True,
         wordpieces_prefix="##",
         **kwargs
@@ -609,7 +606,6 @@ class BertTokenizerFast(PreTrainedTokenizerFast):
         super().__init__(
             BertWordPieceTokenizer(
                 vocab_file=vocab_file,
-                add_special_tokens=add_special_tokens,
                 unk_token=unk_token,
                 sep_token=sep_token,
                 cls_token=cls_token,
@@ -18,9 +18,11 @@
 import logging
 from typing import List, Optional

+from tokenizers import AddedToken
 from tokenizers.processors import RobertaProcessing

 from .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast
+from .tokenization_utils import PreTrainedTokenizer

 logger = logging.getLogger(__name__)
@@ -259,7 +261,7 @@ class RobertaTokenizerFast(GPT2TokenizerFast):
         unk_token="<unk>",
         pad_token="<pad>",
         mask_token="<mask>",
-        add_prefix_space=False,
+        add_prefix_space=True,
         **kwargs
     ):
         kwargs.setdefault("pad_token", pad_token)
@@ -281,16 +283,24 @@ class RobertaTokenizerFast(GPT2TokenizerFast):
             (sep_token, self.sep_token_id), (cls_token, self.cls_token_id)
         )

-        self.tokenizer.add_special_tokens([kwargs["mask_token"]])
-
         # As we override the post_processor post super.__init__ the computed num_added_tokens is wrong in super().
         # We need to recompute max_len according to the newly register post_processor to get real values.
-        self.max_len_single_sentence = self.max_len - self.num_added_tokens(False)  # take into account special tokens
-        self.max_len_sentences_pair = self.max_len - self.num_added_tokens(True)  # take into account special tokens
+        self.max_len_single_sentence = self.max_len - self.num_special_tokens_to_add(
+            False
+        )  # take into account special tokens
+        self.max_len_sentences_pair = self.max_len - self.num_special_tokens_to_add(
+            True
+        )  # take into account special tokens

-        logger.warning(
-            "RobertaTokenizerFast has an issue when working on mask language modeling "
-            "where it introduces an extra encoded space before the mask token."
-            "See https://github.com/huggingface/transformers/pull/2778 for more information."
-        )
+    @PreTrainedTokenizer.mask_token.setter
+    def mask_token(self, value):
+        if not isinstance(value, AddedToken):
+            value = AddedToken(value, lstrip=True)
+
+        self._mask_token = str(value)
+        self.tokenizer.add_special_tokens([value])

     def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
         output = [self.bos_token_id] + token_ids_0 + [self.eos_token_id]
@@ -24,13 +24,13 @@ import os
 import pickle
 import re
 from collections import Counter, OrderedDict
-from typing import List, Optional, Tuple, Union
+from typing import Optional

 import numpy as np

-from tokenizers import Encoding, Tokenizer
+from tokenizers import Tokenizer
 from tokenizers.implementations import BaseTokenizer
 from tokenizers.models import WordLevel
-from tokenizers.normalizers import Lowercase, Sequence, unicode_normalizer_from_str
+from tokenizers.normalizers import Lowercase, Sequence, Strip, unicode_normalizer_from_str
 from tokenizers.pre_tokenizers import CharDelimiterSplit, WhitespaceSplit
 from tokenizers.processors import BertProcessing
@@ -381,6 +381,9 @@ class _TransfoXLDelimiterLookupTokenizer(BaseTokenizer):
         if lowercase:
             normalizer += [Lowercase()]

+        # Strip normalizer at the end
+        normalizer += [Strip(left=True, right=True)]
+
         if len(normalizer) > 0:
             tokenizer.normalizer = Sequence(normalizer) if len(normalizer) > 1 else normalizer[0]
@@ -404,14 +407,6 @@ class _TransfoXLDelimiterLookupTokenizer(BaseTokenizer):
         super().__init__(tokenizer, parameters)

-    def encode_batch(self, sequences: List[Union[str, Tuple[str, str]]]) -> List[Encoding]:
-        return super().encode_batch(
-            [seq.strip() if isinstance(seq, str) else (seq[0].strip(), seq[1].strip()) for seq in sequences]
-        )
-
-    def encode(self, sequence: str, pair: Optional[str] = None) -> Encoding:
-        return super().encode(sequence.strip(), pair.strip() if pair else pair)
-

 class TransfoXLTokenizerFast(PreTrainedTokenizerFast):
@@ -64,7 +64,7 @@ TF_TEXT_CLASSIF_FINETUNED_MODELS = {

 TEXT_CLASSIF_FINETUNED_MODELS = {
     (
-        "bert-base-uncased",
+        "distilbert-base-cased",
         "distilbert-base-uncased-finetuned-sst-2-english",
         "distilbert-base-uncased-finetuned-sst-2-english",
     )
@@ -82,7 +82,7 @@ class BertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
             return

         tokenizer = self.get_tokenizer()
-        rust_tokenizer = self.get_rust_tokenizer(add_special_tokens=False)
+        rust_tokenizer = self.get_rust_tokenizer()

         sequence = "UNwant\u00E9d,running"
@@ -91,7 +91,7 @@ class BertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
         self.assertListEqual(tokens, rust_tokens)

         ids = tokenizer.encode(sequence, add_special_tokens=False)
-        rust_ids = rust_tokenizer.encode(sequence)
+        rust_ids = rust_tokenizer.encode(sequence, add_special_tokens=False)
         self.assertListEqual(ids, rust_ids)

         rust_tokenizer = self.get_rust_tokenizer()
@@ -282,7 +282,7 @@ class TokenizerTesterMixin:
             # Method is implemented (e.g. not GPT-2)
             if len(attached_sequences) != 2:
-                self.assertEqual(tokenizer.num_added_tokens(pair=True), len(attached_sequences) - len(sequences))
+                self.assertEqual(tokenizer.num_special_tokens_to_add(pair=True), len(attached_sequences) - len(sequences))

     def test_maximum_encoding_length_single_input(self):
         tokenizer = self.get_tokenizer()
@@ -291,7 +291,7 @@ class TokenizerTesterMixin:
         stride = 2

         sequence = tokenizer.encode(seq_0, add_special_tokens=False)
-        num_added_tokens = tokenizer.num_added_tokens()
+        num_added_tokens = tokenizer.num_special_tokens_to_add()
         total_length = len(sequence) + num_added_tokens
         information = tokenizer.encode_plus(
             seq_0,
@@ -94,7 +94,7 @@ class GPT2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
             return

         tokenizer = self.get_tokenizer()
-        rust_tokenizer = self.get_rust_tokenizer(add_special_tokens=False, add_prefix_space=True)
+        rust_tokenizer = self.get_rust_tokenizer(add_prefix_space=True)

         sequence = "lower newer"
@@ -105,7 +105,7 @@ class GPT2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
         # Testing conversion to ids without special tokens
         ids = tokenizer.encode(sequence, add_special_tokens=False, add_prefix_space=True)
-        rust_ids = rust_tokenizer.encode(sequence)
+        rust_ids = rust_tokenizer.encode(sequence, add_special_tokens=False)
         self.assertListEqual(ids, rust_ids)

         # Testing conversion to ids with special tokens