Unverified Commit 827d6d6e authored by Thomas Wolf, committed by GitHub

Cleanup fast tokenizers integration (#3706)



* First pass on utility classes and python tokenizers

* finishing cleanup pass

* style and quality

* Fix tests

* Updating following @mfuntowicz comment

* style and quality

* Fix Roberta

* fix batch_size/seq_length inBatchEncoding

* add alignement methods + tests

* Fix OpenAI and Transfo-XL tokenizers

* adding trim_offsets=True default for GPT2 et RoBERTa

* style and quality

* fix tests

* add_prefix_space in roberta

* bump up tokenizers to rc7

* style

* unfortunately tensorfow does like these - removing shape/seq_len for now

* Update src/transformers/tokenization_utils.py
Co-Authored-By: Stefan Schweter <stefan@schweter.it>

* Adding doc and docstrings

* making flake8 happy
Co-authored-by: Stefan Schweter <stefan@schweter.it>
parent 60a42ef1
Tokenizer
----------------------------------------------------
A tokenizer is in charge of preparing the inputs for a model. The library comprises tokenizers for all the models. Most of the tokenizers are available in two flavors: a full python implementation and a "Fast" implementation based on the Rust library `tokenizers`. The "Fast" implementations allow (1) a significant speed-up, in particular when doing batched tokenization, and (2) additional methods to map between the original string (characters and words) and the token space (e.g. getting the index of the token comprising a given character or the span of characters corresponding to a given token). Currently no "Fast" implementation is available for the SentencePiece-based tokenizers (for the T5, ALBERT, CamemBERT, XLMRoBERTa and XLNet models).

The base classes ``PreTrainedTokenizer`` and ``PreTrainedTokenizerFast`` implement the common methods for encoding string inputs in model inputs (see below) and for instantiating/saving python and "Fast" tokenizers either from a local file or directory or from a pretrained tokenizer provided by the library (downloaded from HuggingFace's AWS S3 repository).

``PreTrainedTokenizer`` and ``PreTrainedTokenizerFast`` thus implement the main methods for using all the tokenizers:

- tokenizing (splitting strings into sub-word token strings), converting token strings to ids and back, and encoding/decoding (i.e. tokenizing and converting to integers),
- adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece...),
- managing special tokens such as the mask and beginning-of-sentence tokens: adding them, assigning them to attributes in the tokenizer for easy access, and making sure they are not split during tokenization.

``BatchEncoding`` holds the output of the tokenizer's encoding methods (``encode_plus`` and ``batch_encode_plus``) and is derived from a Python dictionary. When the tokenizer is a pure python tokenizer, this class behaves just like a standard python dictionary and holds the various model inputs computed by these methods (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e. backed by HuggingFace's `tokenizers` library), this class additionally provides several advanced alignment methods which can be used to map between the original string (characters and words) and the token space (e.g. getting the index of the token comprising a given character or the span of characters corresponding to a given token).
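As a minimal sketch of these alignment methods, assuming the ``bert-base-cased`` checkpoint and the ``char_to_token``/``token_to_chars`` helper names exposed on ``BatchEncoding`` (adapt to the tokenizer you actually use)::

    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
    encoding = tokenizer.encode_plus("Hello world!")

    # Plain dictionary access works for python and "Fast" tokenizers alike
    print(encoding["input_ids"])

    # Alignment helpers are only available on "Fast" tokenizers
    print(encoding.char_to_token(6))   # index of the token covering character 6 ("w")
    print(encoding.token_to_chars(1))  # span of characters covered by token 1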
``PreTrainedTokenizer``
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PreTrainedTokenizer
:members:
``PreTrainedTokenizerFast``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PreTrainedTokenizerFast
:members:
``BatchEncoding``
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BatchEncoding
:members:
``SpecialTokensMixin``
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.SpecialTokensMixin
:members:
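For a quick illustration of the special-token handling this mixin provides, a sketch assuming the ``gpt2`` checkpoint (any tokenizer works the same way)::

    from transformers import GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

    # Register a padding token and an extra special token in one call
    num_added = tokenizer.add_special_tokens(
        {"pad_token": "<pad>", "additional_special_tokens": ["<speaker1>"]}
    )
    print(num_added)               # number of tokens actually added to the vocabulary
    print(tokenizer.pad_token_id)  # the new token is reachable through its attribute
    print(tokenizer.encode("<speaker1> hi"))  # special tokens are not split during tokenization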
@@ -52,6 +52,13 @@ BertTokenizer
create_token_type_ids_from_sequences, save_vocabulary
BertTokenizerFast
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BertTokenizerFast
:members:
BertModel
~~~~~~~~~~~~~~~~~~~~
...
@@ -44,6 +44,13 @@ DistilBertTokenizer
:members:
DistilBertTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.DistilBertTokenizerFast
:members:
DistilBertModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
@@ -61,6 +61,13 @@ ElectraTokenizer
:members:
ElectraTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ElectraTokenizerFast
:members:
ElectraModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
@@ -53,6 +53,13 @@ OpenAIGPTTokenizer
:members: save_vocabulary
OpenAIGPTTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.OpenAIGPTTokenizerFast
:members:
OpenAIGPTModel
~~~~~~~~~~~~~~~~~~~~~~~~~
...
@@ -51,6 +51,13 @@ GPT2Tokenizer
:members: save_vocabulary
GPT2TokenizerFast
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.GPT2TokenizerFast
:members:
GPT2Model
~~~~~~~~~~~~~~~~~~~~~
...
@@ -46,6 +46,13 @@ RobertaTokenizer
create_token_type_ids_from_sequences, save_vocabulary
RobertaTokenizerFast
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.RobertaTokenizerFast
:members: build_inputs_with_special_tokens
RobertaModel
~~~~~~~~~~~~~~~~~~~~
...
@@ -47,6 +47,13 @@ TransfoXLTokenizer
:members: save_vocabulary
TransfoXLTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TransfoXLTokenizerFast
:members:
TransfoXLModel
~~~~~~~~~~~~~~~~~~~~~~~~~~
...
@@ -67,7 +67,7 @@ class TextDataset(Dataset):
def __init__(self, tokenizer: PreTrainedTokenizer, args, file_path: str, block_size=512):
assert os.path.isfile(file_path)
-block_size = block_size - (tokenizer.max_len - tokenizer.max_len_single_sentence)
block_size = block_size - tokenizer.num_special_tokens_to_add(pair=False)
directory, filename = os.path.split(file_path)
cached_features_file = os.path.join(
...
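The replaced arithmetic above is what ``num_special_tokens_to_add`` now encapsulates. A small sketch of the intent; the ``bert-base-uncased`` checkpoint and the toy text are assumptions for illustration only::

    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

    block_size = 512
    # For BERT this returns 2 ([CLS] and [SEP]), so each block leaves room for
    # the tokens that build_inputs_with_special_tokens will add back.
    block_size -= tokenizer.num_special_tokens_to_add(pair=False)

    ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("a very long training document"))
    blocks = [
        tokenizer.build_inputs_with_special_tokens(ids[i : i + block_size])
        for i in range(0, len(ids), block_size)
    ]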
@@ -96,7 +96,7 @@ setup(
packages=find_packages("src"),
install_requires=[
"numpy",
-"tokenizers == 0.7.0rc5",
"tokenizers == 0.7.0rc7",
# dataclasses for Python versions that don't have it
"dataclasses;python_version<'3.7'",
# accessing files from S3 directly
...
@@ -137,9 +137,6 @@ class AlbertTokenizer(PreTrainedTokenizer):
**kwargs,
)
-self.max_len_single_sentence = self.max_len - 2  # take into account special tokens
-self.max_len_sentences_pair = self.max_len - 3  # take into account special tokens
try:
import sentencepiece as spm
except ImportError:
...
@@ -182,8 +182,6 @@ class BertTokenizer(PreTrainedTokenizer):
mask_token=mask_token,
**kwargs,
)
-self.max_len_single_sentence = self.max_len - 2  # take into account special tokens
-self.max_len_sentences_pair = self.max_len - 3  # take into account special tokens
if not os.path.isfile(vocab_file):
raise ValueError(
@@ -583,6 +581,48 @@ def _is_punctuation(char):
class BertTokenizerFast(PreTrainedTokenizerFast):
r"""
Constructs a "Fast" BERT tokenizer (backed by HuggingFace's `tokenizers` library).
Bert tokenization is based on WordPiece.
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the methods. Users
should refer to the superclass for more information regarding methods.
Args:
vocab_file (:obj:`string`):
File containing the vocabulary.
do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to lowercase the input when tokenizing.
unk_token (:obj:`string`, `optional`, defaults to "[UNK]"):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
sep_token (:obj:`string`, `optional`, defaults to "[SEP]"):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences
for sequence classification or for a text and a question for question answering.
It is also used as the last token of a sequence built with special tokens.
pad_token (:obj:`string`, `optional`, defaults to "[PAD]"):
The token used for padding, for example when batching sequences of different lengths.
cls_token (:obj:`string`, `optional`, defaults to "[CLS]"):
The classifier token which is used when doing sequence classification (classification of the whole
sequence instead of per-token classification). It is the first token of the sequence when built with
special tokens.
mask_token (:obj:`string`, `optional`, defaults to "[MASK]"):
The token used for masking values. This is the token used when training this model with masked language
modeling. This is the token which the model will try to predict.
tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to tokenize Chinese characters.
This should likely be deactivated for Japanese:
see: https://github.com/huggingface/transformers/issues/328
clean_text (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to clean the text before tokenization by removing any control characters and
replacing all whitespaces by the classic one.
"""
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
...
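A short, hedged example of the offsets this "Fast" tokenizer can return (the ``bert-base-uncased`` checkpoint name is assumed for illustration)::

    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    text = "Tokenizers are great"
    enc = tokenizer.encode_plus(text, return_offsets_mapping=True)

    # Each (start, end) pair indexes back into the original string;
    # special tokens such as [CLS] and [SEP] get the empty span (0, 0).
    for token_id, (start, end) in zip(enc["input_ids"], enc["offset_mapping"]):
        print(tokenizer.convert_ids_to_tokens(token_id), repr(text[start:end]))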
@@ -119,8 +119,6 @@ class BertJapaneseTokenizer(BertTokenizer):
**kwargs,
)
# ^^ We call the grandparent's init, not the parent's.
-self.max_len_single_sentence = self.max_len - 2  # take into account special tokens
-self.max_len_sentences_pair = self.max_len - 3  # take into account special tokens
if not os.path.isfile(vocab_file):
raise ValueError(
...
@@ -129,8 +129,6 @@ class CamembertTokenizer(PreTrainedTokenizer):
additional_special_tokens=additional_special_tokens,
**kwargs,
)
-self.max_len_single_sentence = self.max_len - 2  # take into account special tokens
-self.max_len_sentences_pair = self.max_len - 4  # take into account special tokens
self.sp_model = spm.SentencePieceProcessor()
self.sp_model.Load(str(vocab_file))
self.vocab_file = vocab_file
...
@@ -140,12 +140,6 @@ class CTRLTokenizer(PreTrainedTokenizer):
def __init__(self, vocab_file, merges_file, unk_token="<unk>", **kwargs):
super().__init__(unk_token=unk_token, **kwargs)
-self.max_len_single_sentence = self.max_len  # no default special tokens - you can update this value if you add special tokens
-self.max_len_sentences_pair = self.max_len  # no default special tokens - you can update this value if you add special tokens
with open(vocab_file, encoding="utf-8") as vocab_handle:
self.encoder = json.load(vocab_handle)
...
@@ -57,8 +57,9 @@ PRETRAINED_INIT_CONFIGURATION = {
class DistilBertTokenizer(BertTokenizer):
r"""
Constructs a DistilBertTokenizer.
-:class:`~transformers.DistilBertTokenizer is identical to :class:`~transformers.BertTokenizer` and runs end-to-end
:class:`~transformers.DistilBertTokenizer` is identical to :class:`~transformers.BertTokenizer` and runs end-to-end
tokenization: punctuation splitting + wordpiece.
Refer to superclass :class:`~transformers.BertTokenizer` for usage examples and documentation concerning
@@ -73,6 +74,16 @@ class DistilBertTokenizer(BertTokenizer):
class DistilBertTokenizerFast(BertTokenizerFast):
r"""
Constructs a "Fast" DistilBertTokenizer (backed by HuggingFace's `tokenizers` library).
:class:`~transformers.DistilBertTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and runs end-to-end
tokenization: punctuation splitting + wordpiece.
Refer to superclass :class:`~transformers.BertTokenizerFast` for usage examples and documentation concerning
parameters.
"""
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
...
@@ -67,7 +67,8 @@ class ElectraTokenizer(BertTokenizer):
class ElectraTokenizerFast(BertTokenizerFast):
r"""
-Constructs an Electra Fast tokenizer.
Constructs a "Fast" Electra tokenizer (backed by HuggingFace's `tokenizers` library).
:class:`~transformers.ElectraTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and runs end-to-end
tokenization: punctuation splitting + wordpiece.
...
@@ -147,12 +147,6 @@ class GPT2Tokenizer(PreTrainedTokenizer):
**kwargs
):
super().__init__(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token, **kwargs)
-self.max_len_single_sentence = self.max_len  # no default special tokens - you can update this value if you add special tokens
-self.max_len_sentences_pair = self.max_len  # no default special tokens - you can update this value if you add special tokens
with open(vocab_file, encoding="utf-8") as vocab_handle:
self.encoder = json.load(vocab_handle)
@@ -284,6 +278,47 @@ class GPT2Tokenizer(PreTrainedTokenizer):
class GPT2TokenizerFast(PreTrainedTokenizerFast):
"""
Constructs a "Fast" GPT-2 BPE tokenizer (backed by HuggingFace's `tokenizers` library).
Peculiarities:
- Byte-level Byte-Pair-Encoding
- Requires a space to start the input string => the encoding methods should be called with the
``add_prefix_space`` flag set to ``True``.
Otherwise, this tokenizer's ``encode`` and ``decode`` methods will not preserve
the absence of a space at the beginning of a string:
::
tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the methods. Users
should refer to the superclass for more information regarding methods.
Args:
vocab_file (:obj:`str`):
Path to the vocabulary file.
merges_file (:obj:`str`):
Path to the merges file.
errors (:obj:`str`, `optional`, defaults to "replace"):
Paradigm to follow when decoding bytes to UTF-8. See `bytes.decode
<https://docs.python.org/3/library/stdtypes.html#bytes.decode>`__ for more information.
unk_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
bos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):
The beginning of sequence token.
eos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):
The end of sequence token.
add_prefix_space (:obj:`bool`, `optional`, defaults to `False`):
Whether to add a leading space to the first word.
This lets the leading word be treated just like any other word.
(The GPT2 tokenizer detects the beginning of words by the preceding space.)
trim_offsets (:obj:`bool`, `optional`, defaults to `True`):
Whether the post processing step should trim offsets to avoid including whitespaces.
"""
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
@@ -296,10 +331,16 @@ class GPT2TokenizerFast(PreTrainedTokenizerFast):
bos_token="<|endoftext|>",
eos_token="<|endoftext|>",
add_prefix_space=False,
trim_offsets=True,
**kwargs
):
super().__init__(
-ByteLevelBPETokenizer(vocab_file=vocab_file, merges_file=merges_file, add_prefix_space=add_prefix_space),
ByteLevelBPETokenizer(
vocab_file=vocab_file,
merges_file=merges_file,
add_prefix_space=add_prefix_space,
trim_offsets=trim_offsets,
),
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
...
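A hedged sketch of the ``add_prefix_space`` behaviour the docstring above describes (the ``gpt2`` checkpoint name is an assumption for illustration)::

    from transformers import GPT2TokenizerFast

    default_tok = GPT2TokenizerFast.from_pretrained("gpt2")
    prefix_tok = GPT2TokenizerFast.from_pretrained("gpt2", add_prefix_space=True)

    # Byte-level BPE marks word boundaries with a leading space, so a word at the
    # very start of the string is encoded differently unless a prefix space is added.
    print(default_tok.encode("Hello world"))
    print(prefix_tok.encode("Hello world"))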
@@ -19,15 +19,8 @@ import json
import logging
import os
import re
-from typing import List, Optional, Union
-from tokenizers import Tokenizer
from tokenizers import CharBPETokenizer
-from tokenizers.decoders import BPEDecoder
-from tokenizers.implementations import BaseTokenizer
-from tokenizers.models import BPE
-from tokenizers.normalizers import BertNormalizer, Sequence, unicode_normalizer_from_str
-from tokenizers.pre_tokenizers import BertPreTokenizer
-from tokenizers.trainers import BpeTrainer
from .tokenization_bert import BasicTokenizer
from .tokenization_utils import PreTrainedTokenizer, PreTrainedTokenizerFast
@@ -106,13 +99,6 @@ class OpenAIGPTTokenizer(PreTrainedTokenizer):
def __init__(self, vocab_file, merges_file, unk_token="<unk>", **kwargs):
super().__init__(unk_token=unk_token, **kwargs)
-self.max_len_single_sentence = self.max_len  # no default special tokens - you can update this value if you add special tokens
-self.max_len_sentences_pair = self.max_len  # no default special tokens - you can update this value if you add special tokens
try:
import ftfy
from spacy.lang.en import English
@@ -249,83 +235,28 @@ class OpenAIGPTTokenizer(PreTrainedTokenizer):
return vocab_file, merge_file
-class _OpenAIGPTCharBPETokenizer(BaseTokenizer):
-"""
-OpenAI character-level BPE Tokenizer
-"""
-def __init__(
-self,
-vocab_file: Optional[str] = None,
-merges_file: Optional[str] = None,
-unk_token: Optional[str] = "<unk>",
-suffix: Optional[str] = "</w>",
-dropout: Optional[float] = None,
-unicode_normalizer: Optional[str] = None,
-):
-if vocab_file is not None and merges_file is not None:
-tokenizer = Tokenizer(
-BPE(vocab_file, merges_file, dropout=dropout, unk_token=unk_token, end_of_word_suffix=suffix)
-)
-else:
-tokenizer = Tokenizer(BPE())
-# Check for Unicode normalization first (before everything else)
-normalizers = []
-if unicode_normalizer:
-normalizers += [unicode_normalizer_from_str(unicode_normalizer)]
-# OpenAI normalization is the same as Bert
-normalizers += [BertNormalizer()]
-# Create the normalizer structure
-if len(normalizers) > 0:
-if len(normalizers) > 1:
-tokenizer.normalizer = Sequence(normalizers)
-else:
-tokenizer.normalizer = normalizers[0]
-tokenizer.pre_tokenizer = BertPreTokenizer()
-tokenizer.decoder = BPEDecoder(suffix=suffix)
-parameters = {
-"model": "BPE",
-"unk_token": unk_token,
-"suffix": suffix,
-"dropout": dropout,
-}
-super().__init__(tokenizer, parameters)
-def train(
-self,
-files: Union[str, List[str]],
-vocab_size: int = 30000,
-min_frequency: int = 2,
-special_tokens: List[str] = ["<unk>"],
-limit_alphabet: int = 1000,
-initial_alphabet: List[str] = [],
-suffix: Optional[str] = "</w>",
-show_progress: bool = True,
-):
-""" Train the model using the given files """
-trainer = BpeTrainer(
-vocab_size=vocab_size,
-min_frequency=min_frequency,
-special_tokens=special_tokens,
-limit_alphabet=limit_alphabet,
-initial_alphabet=initial_alphabet,
-end_of_word_suffix=suffix,
-show_progress=show_progress,
-)
-if isinstance(files, str):
-files = [files]
-self._tokenizer.train(trainer, files)
class OpenAIGPTTokenizerFast(PreTrainedTokenizerFast):
"""
Construct a "Fast" BPE tokenizer for OpenAI GPT (backed by HuggingFace's `tokenizers` library).
Peculiarities:
- lower case all inputs
- uses SpaCy tokenizer and ftfy for pre-BPE tokenization if they are installed, falling back to BERT's BasicTokenizer if not.
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the methods. Users
should refer to the superclass for more information regarding methods.
Args:
vocab_file (:obj:`str`):
Path to the vocabulary file.
merges_file (:obj:`str`):
Path to the merges file.
unk_token (:obj:`string`, `optional`, defaults to "<unk>"):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
"""
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
@@ -333,5 +264,6 @@ class OpenAIGPTTokenizerFast(PreTrainedTokenizerFast):
def __init__(self, vocab_file, merges_file, unk_token="<unk>", **kwargs):
kwargs.setdefault("unk_token", unk_token)
super().__init__(
-_OpenAIGPTCharBPETokenizer(vocab_file=vocab_file, merges_file=merges_file, unk_token=unk_token), **kwargs
CharBPETokenizer(vocab_file=vocab_file, merges_file=merges_file, unk_token=unk_token, lowercase=True),
**kwargs,
)
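The hand-rolled ``_OpenAIGPTCharBPETokenizer`` removed above is replaced by the ``tokenizers`` library's own ``CharBPETokenizer`` configured with ``lowercase=True``. A hedged sketch of that backend used directly; the vocab/merges paths are placeholders, not files shipped with the PR::

    from tokenizers import CharBPETokenizer

    # Placeholder file names; point these at the vocab.json / merges.txt of an OpenAI GPT checkpoint.
    backend = CharBPETokenizer(
        vocab_file="vocab.json",
        merges_file="merges.txt",
        unk_token="<unk>",
        lowercase=True,  # OpenAI GPT lower-cases all inputs, as the BertNormalizer did before
    )
    print(backend.encode("Hello world").tokens)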
@@ -150,8 +150,6 @@ class RobertaTokenizer(GPT2Tokenizer):
mask_token=mask_token,
**kwargs,
)
-self.max_len_single_sentence = self.max_len - 2  # take into account special tokens
-self.max_len_sentences_pair = self.max_len - 4  # take into account special tokens
def build_inputs_with_special_tokens(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
@@ -244,6 +242,47 @@ class RobertaTokenizer(GPT2Tokenizer):
class RobertaTokenizerFast(GPT2TokenizerFast):
"""
Constructs a "Fast" RoBERTa BPE tokenizer (backed by HuggingFace's `tokenizers` library).
Peculiarities:
- Byte-level Byte-Pair-Encoding
- Requires a space to start the input string => the encoding methods should be called with the
``add_prefix_space`` flag set to ``True``.
Otherwise, this tokenizer's ``encode`` and ``decode`` methods will not preserve
the absence of a space at the beginning of a string:
::
tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the methods. Users
should refer to the superclass for more information regarding methods.
Args:
vocab_file (:obj:`str`):
Path to the vocabulary file.
merges_file (:obj:`str`):
Path to the merges file.
errors (:obj:`str`, `optional`, defaults to "replace"):
Paradigm to follow when decoding bytes to UTF-8. See `bytes.decode
<https://docs.python.org/3/library/stdtypes.html#bytes.decode>`__ for more information.
unk_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
bos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):
The beginning of sequence token.
eos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):
The end of sequence token.
add_prefix_space (:obj:`bool`, `optional`, defaults to `False`):
Whether to add a leading space to the first word.
This lets the leading word be treated just like any other word.
(The GPT2 tokenizer detects the beginning of words by the preceding space.)
trim_offsets (:obj:`bool`, `optional`, defaults to `True`):
Whether the post processing step should trim offsets to avoid including whitespaces.
"""
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
@@ -262,6 +301,7 @@ class RobertaTokenizerFast(GPT2TokenizerFast):
pad_token="<pad>",
mask_token="<mask>",
add_prefix_space=True,
trim_offsets=True,
**kwargs
):
kwargs.setdefault("pad_token", pad_token)
@@ -276,23 +316,18 @@ class RobertaTokenizerFast(GPT2TokenizerFast):
bos_token=bos_token,
eos_token=eos_token,
add_prefix_space=add_prefix_space,
trim_offsets=trim_offsets,
**kwargs,
)
-self.tokenizer._tokenizer.post_processor = RobertaProcessing(
-(sep_token, self.sep_token_id), (cls_token, self.cls_token_id)
self.backend_tokenizer._tokenizer.post_processor = RobertaProcessing(
sep=(sep_token, self.sep_token_id),
cls=(cls_token, self.cls_token_id),
add_prefix_space=add_prefix_space,
trim_offsets=trim_offsets,
)
-self.tokenizer.add_special_tokens([kwargs["mask_token"]])
self.backend_tokenizer.add_special_tokens([kwargs["mask_token"]])
-# As we override the post_processor post super.__init__ the computed num_added_tokens is wrong in super().
-# We need to recompute max_len according to the newly registered post_processor to get real values.
-self.max_len_single_sentence = self.max_len - self.num_special_tokens_to_add(False)  # take into account special tokens
-self.max_len_sentences_pair = self.max_len - self.num_special_tokens_to_add(True)  # take into account special tokens
@PreTrainedTokenizer.mask_token.setter
def mask_token(self, value):
@@ -300,7 +335,7 @@ class RobertaTokenizerFast(GPT2TokenizerFast):
value = AddedToken(value, lstrip=True)
self._mask_token = str(value)
-self.tokenizer.add_special_tokens([value])
self._maybe_update_backend([value])
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
output = [self.bos_token_id] + token_ids_0 + [self.eos_token_id]
...
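A hedged sketch of the post-processing wired up above, assuming the ``roberta-base`` checkpoint: ``<s>``/``</s>`` are added automatically and offsets are trimmed of leading whitespace::

    from transformers import RobertaTokenizerFast

    tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
    enc = tokenizer.encode_plus("Hello world", return_offsets_mapping=True)

    print(enc["input_ids"][0] == tokenizer.bos_token_id)   # True: <s> prepended by the post-processor
    print(enc["input_ids"][-1] == tokenizer.eos_token_id)  # True: </s> appended by the post-processor
    print(enc["offset_mapping"])  # (0, 0) for special tokens; trim_offsets drops leading spaces from word spans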