Unverified commit ba8c4d0a, authored by Thomas Wolf and committed by GitHub

[Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659)

* splitting fast and slow tokenizers [WIP]

* [WIP] splitting sentencepiece and tokenizers dependencies

* update dummy objects

* add name_or_path to models and tokenizers

* prefix added to file names

* prefix

* styling + quality

* splitting all the tokenizer files - sorting sentencepiece-based ones

* update tokenizers version to 0.9.0

* remove hard dependency on sentencepiece 🎉

* and removed hard dependency on tokenizers 🎉



* update conversion script

* update missing models

* fixing tests

* move test_tokenization_fast to main tokenization tests - fix bugs

* bump up tokenizers

* fix bert_generation

* update and fix several tokenizers

* keep sentencepiece in deps for now

* fix funnel and deberta tests

* fix fsmt

* fix marian tests

* fix layoutlm

* fix squeezebert and gpt2

* fix T5 tokenization

* fix xlnet tests

* style

* fix mbart

* bump up tokenizers to 0.9.2

* fix model tests

* fix tf models

* fix seq2seq examples

* fix tests without sentencepiece

* fix slow => fast conversion without sentencepiece

* update auto and bert generation tests

* fix mbart tests

* fix auto and common test without tokenizers

* fix tests without tokenizers

* clean up and lighten tests when tokenizers + sentencepiece are both off

* style, quality, and test fixes

* add sentencepiece to doc/examples reqs

* leave sentencepiece on for now

* style/quality, split herbert, and fix pegasus

* WIP Herbert fast

* add sample_text_no_unicode and fix herbert tokenization

* skip FSMT example test for now

* fix style

* fix fsmt in example tests

* update following Lysandre and Sylvain's comments

* Update src/transformers/testing_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/testing_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent c65863ce
@@ -198,7 +198,7 @@ jobs:
  - v0.3-build_doc-{{ checksum "setup.py" }}
  - v0.3-{{ checksum "setup.py" }}
  - run: pip install --upgrade pip
- - run: pip install .[tf,torch,docs]
+ - run: pip install .[tf,torch,sentencepiece,docs]
  - save_cache:
      key: v0.3-build_doc-{{ checksum "setup.py" }}
      paths:
@@ -219,7 +219,7 @@ jobs:
  keys:
  - v0.3-deploy_doc-{{ checksum "setup.py" }}
  - v0.3-{{ checksum "setup.py" }}
- - run: pip install .[tf,torch,docs]
+ - run: pip install .[tf,torch,sentencepiece,docs]
  - save_cache:
      key: v0.3-deploy_doc-{{ checksum "setup.py" }}
      paths:
......
@@ -30,8 +30,7 @@ jobs:
run: |
  pip install --upgrade pip
  pip install torch
- pip install numpy filelock protobuf requests tqdm regex sentencepiece sacremoses packaging
+ pip install numpy filelock protobuf requests tqdm regex sentencepiece sacremoses tokenizers packaging
- pip install tokenizers==0.9.0.rc2
- name: Torch hub list
run: |
......
@@ -9,7 +9,8 @@ __pycache__/
*.so
# tests and logs
- tests/fixtures
+ tests/fixtures/*
+ !tests/fixtures/sample_text_no_unicode.txt
logs/
lightning_logs/
lang_code_data/
......
@@ -758,8 +758,8 @@ Here is an example of using the pipelines to do summarization. It leverages a Ba
... If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.
... """
- Because the summarization pipeline depends on the ``PretrainedModel.generate()`` method, we can override the default arguments
+ Because the summarization pipeline depends on the ``PreTrainedModel.generate()`` method, we can override the default arguments
- of ``PretrainedModel.generate()`` directly in the pipeline for ``max_length`` and ``min_length`` as shown below.
+ of ``PreTrainedModel.generate()`` directly in the pipeline for ``max_length`` and ``min_length`` as shown below.
This outputs the following summary:
.. code-block::
@@ -772,7 +772,7 @@ Here is an example of doing summarization using a model and a tokenizer. The pro
1. Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder model, such as ``Bart`` or ``T5``.
2. Define the article that should be summarized.
3. Add the T5 specific prefix "summarize: ".
- 4. Use the ``PretrainedModel.generate()`` method to generate the summary.
+ 4. Use the ``PreTrainedModel.generate()`` method to generate the summary.
In this example we use Google`s T5 model. Even though it was pre-trained only on a multi-task mixed dataset (including CNN / Daily Mail), it yields very good results.
@@ -819,15 +819,15 @@ translation results.
>>> print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))
[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]
- Because the translation pipeline depends on the ``PretrainedModel.generate()`` method, we can override the default arguments
+ Because the translation pipeline depends on the ``PreTrainedModel.generate()`` method, we can override the default arguments
- of ``PretrainedModel.generate()`` directly in the pipeline as is shown for ``max_length`` above.
+ of ``PreTrainedModel.generate()`` directly in the pipeline as is shown for ``max_length`` above.
Here is an example of doing translation using a model and a tokenizer. The process is the following:
1. Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder model, such as ``Bart`` or ``T5``.
2. Define the article that should be summarizaed.
3. Add the T5 specific prefix "translate English to German: "
- 4. Use the ``PretrainedModel.generate()`` method to perform the translation.
+ 4. Use the ``PreTrainedModel.generate()`` method to perform the translation.
.. code-block::
......
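For readers following the docs hunk above, here is a minimal, illustrative sketch of what it describes: any `generate()` argument such as `max_length` or `min_length` can be overridden directly in the pipeline call. The default summarization checkpoint and the shortened article are my own assumptions, not part of the diff.

```python
from transformers import pipeline

# Default summarization pipeline; generate() defaults (max_length, min_length, ...)
# can be overridden per call, as the documentation above states.
summarizer = pipeline("summarization")

ARTICLE = "New York (CNN) When Liana Barrientos was 23 years old, she got married in Westchester County."
print(summarizer(ARTICLE, max_length=60, min_length=10, do_sample=False))
```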
@@ -17,3 +17,4 @@ datasets
fire
pytest
conllu
+ sentencepiece != 0.1.92
@@ -92,12 +92,13 @@ extras["onnxruntime"] = ["onnxruntime>=1.4.0", "onnxruntime-tools>=1.4.2"]
extras["serving"] = ["pydantic", "uvicorn", "fastapi", "starlette"]
extras["all"] = extras["serving"] + ["tensorflow", "torch"]
+ extras["sentencepiece"] = ["sentencepiece!=0.1.92"]
extras["retrieval"] = ["faiss-cpu", "datasets"]
extras["testing"] = ["pytest", "pytest-xdist", "timeout-decorator", "parameterized", "psutil"] + extras["retrieval"]
# sphinx-rtd-theme==0.5.0 introduced big changes in the style.
extras["docs"] = ["recommonmark", "sphinx", "sphinx-markdown-tables", "sphinx-rtd-theme==0.4.3", "sphinx-copybutton"]
extras["quality"] = ["black >= 20.8b1", "isort >= 5.5.4", "flake8 >= 3.8.3"]
- extras["dev"] = extras["testing"] + extras["quality"] + extras["ja"] + ["scikit-learn", "tensorflow", "torch"]
+ extras["dev"] = extras["testing"] + extras["quality"] + extras["ja"] + ["scikit-learn", "tensorflow", "torch", "sentencepiece!=0.1.92"]
setup(
name="transformers",
@@ -114,7 +115,7 @@ setup(
packages=find_packages("src"),
install_requires=[
    "numpy",
-     "tokenizers == 0.9.0.rc2",
+     "tokenizers == 0.9.2",
    # dataclasses for Python versions that don't have it
    "dataclasses;python_version<'3.7'",
    # utilities from PyPA to e.g. compare versions
......
@@ -92,6 +92,7 @@ from .file_utils import (
    MODEL_CARD_NAME,
    PYTORCH_PRETRAINED_BERT_CACHE,
    PYTORCH_TRANSFORMERS_CACHE,
+     SPIECE_UNDERLINE,
    TF2_WEIGHTS_NAME,
    TF_WEIGHTS_NAME,
    TRANSFORMERS_CACHE,
@@ -104,8 +105,10 @@ from .file_utils import (
    is_faiss_available,
    is_psutil_available,
    is_py3nvml_available,
+     is_sentencepiece_available,
    is_sklearn_available,
    is_tf_available,
+     is_tokenizers_available,
    is_torch_available,
    is_torch_tpu_available,
)
@@ -152,49 +155,41 @@ from .pipelines import (
from .retrieval_rag import RagRetriever
# Tokenizers
- from .tokenization_albert import AlbertTokenizer, AlbertTokenizerFast
from .tokenization_auto import TOKENIZER_MAPPING, AutoTokenizer
- from .tokenization_bart import BartTokenizer, BartTokenizerFast
+ from .tokenization_bart import BartTokenizer
- from .tokenization_bert import BasicTokenizer, BertTokenizer, BertTokenizerFast, WordpieceTokenizer
+ from .tokenization_bert import BasicTokenizer, BertTokenizer, WordpieceTokenizer
- from .tokenization_bert_generation import BertGenerationTokenizer
from .tokenization_bert_japanese import BertJapaneseTokenizer, CharacterTokenizer, MecabTokenizer
from .tokenization_bertweet import BertweetTokenizer
from .tokenization_blenderbot import BlenderbotSmallTokenizer, BlenderbotTokenizer
- from .tokenization_camembert import CamembertTokenizer, CamembertTokenizerFast
from .tokenization_ctrl import CTRLTokenizer
from .tokenization_deberta import DebertaTokenizer
- from .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFast
+ from .tokenization_distilbert import DistilBertTokenizer
from .tokenization_dpr import (
    DPRContextEncoderTokenizer,
-     DPRContextEncoderTokenizerFast,
    DPRQuestionEncoderTokenizer,
-     DPRQuestionEncoderTokenizerFast,
+     DPRReaderOutput,
    DPRReaderTokenizer,
-     DPRReaderTokenizerFast,
)
- from .tokenization_electra import ElectraTokenizer, ElectraTokenizerFast
+ from .tokenization_electra import ElectraTokenizer
from .tokenization_flaubert import FlaubertTokenizer
from .tokenization_fsmt import FSMTTokenizer
- from .tokenization_funnel import FunnelTokenizer, FunnelTokenizerFast
+ from .tokenization_funnel import FunnelTokenizer
- from .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast
+ from .tokenization_gpt2 import GPT2Tokenizer
- from .tokenization_herbert import HerbertTokenizer, HerbertTokenizerFast
+ from .tokenization_herbert import HerbertTokenizer
- from .tokenization_layoutlm import LayoutLMTokenizer, LayoutLMTokenizerFast
+ from .tokenization_layoutlm import LayoutLMTokenizer
- from .tokenization_longformer import LongformerTokenizer, LongformerTokenizerFast
+ from .tokenization_longformer import LongformerTokenizer
- from .tokenization_lxmert import LxmertTokenizer, LxmertTokenizerFast
+ from .tokenization_lxmert import LxmertTokenizer
- from .tokenization_mbart import MBartTokenizer, MBartTokenizerFast
- from .tokenization_mobilebert import MobileBertTokenizer, MobileBertTokenizerFast
- from .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast
- from .tokenization_pegasus import PegasusTokenizer, PegasusTokenizerFast
+ from .tokenization_mobilebert import MobileBertTokenizer
+ from .tokenization_openai import OpenAIGPTTokenizer
from .tokenization_phobert import PhobertTokenizer
from .tokenization_rag import RagTokenizer
- from .tokenization_reformer import ReformerTokenizer, ReformerTokenizerFast
- from .tokenization_retribert import RetriBertTokenizer, RetriBertTokenizerFast
- from .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast
- from .tokenization_squeezebert import SqueezeBertTokenizer, SqueezeBertTokenizerFast
- from .tokenization_t5 import T5Tokenizer, T5TokenizerFast
+ from .tokenization_retribert import RetriBertTokenizer
+ from .tokenization_roberta import RobertaTokenizer
+ from .tokenization_squeezebert import SqueezeBertTokenizer
from .tokenization_transfo_xl import TransfoXLCorpus, TransfoXLTokenizer
from .tokenization_utils import PreTrainedTokenizer
from .tokenization_utils_base import (
+     AddedToken,
    BatchEncoding,
    CharSpan,
    PreTrainedTokenizerBase,
@@ -202,10 +197,59 @@ from .tokenization_utils_base import (
    TensorType,
    TokenSpan,
)
- from .tokenization_utils_fast import PreTrainedTokenizerFast
from .tokenization_xlm import XLMTokenizer
- from .tokenization_xlm_roberta import XLMRobertaTokenizer, XLMRobertaTokenizerFast
- from .tokenization_xlnet import SPIECE_UNDERLINE, XLNetTokenizer, XLNetTokenizerFast
+ if is_sentencepiece_available():
+     from .tokenization_albert import AlbertTokenizer
+     from .tokenization_bert_generation import BertGenerationTokenizer
+     from .tokenization_camembert import CamembertTokenizer
+     from .tokenization_marian import MarianTokenizer
+     from .tokenization_mbart import MBartTokenizer
+     from .tokenization_pegasus import PegasusTokenizer
+     from .tokenization_reformer import ReformerTokenizer
+     from .tokenization_t5 import T5Tokenizer
+     from .tokenization_xlm_roberta import XLMRobertaTokenizer
+     from .tokenization_xlnet import XLNetTokenizer
+ else:
+     from .utils.dummy_sentencepiece_objects import *
+ if is_tokenizers_available():
+     from .tokenization_albert_fast import AlbertTokenizerFast
+     from .tokenization_bart_fast import BartTokenizerFast
+     from .tokenization_bert_fast import BertTokenizerFast
+     from .tokenization_camembert_fast import CamembertTokenizerFast
+     from .tokenization_distilbert_fast import DistilBertTokenizerFast
+     from .tokenization_dpr_fast import (
+         DPRContextEncoderTokenizerFast,
+         DPRQuestionEncoderTokenizerFast,
+         DPRReaderTokenizerFast,
+     )
+     from .tokenization_electra_fast import ElectraTokenizerFast
+     from .tokenization_funnel_fast import FunnelTokenizerFast
+     from .tokenization_gpt2_fast import GPT2TokenizerFast
+     from .tokenization_herbert_fast import HerbertTokenizerFast
+     from .tokenization_layoutlm_fast import LayoutLMTokenizerFast
+     from .tokenization_longformer_fast import LongformerTokenizerFast
+     from .tokenization_lxmert_fast import LxmertTokenizerFast
+     from .tokenization_mbart_fast import MBartTokenizerFast
+     from .tokenization_mobilebert_fast import MobileBertTokenizerFast
+     from .tokenization_openai_fast import OpenAIGPTTokenizerFast
+     from .tokenization_pegasus_fast import PegasusTokenizerFast
+     from .tokenization_reformer_fast import ReformerTokenizerFast
+     from .tokenization_retribert_fast import RetriBertTokenizerFast
+     from .tokenization_roberta_fast import RobertaTokenizerFast
+     from .tokenization_squeezebert_fast import SqueezeBertTokenizerFast
+     from .tokenization_t5_fast import T5TokenizerFast
+     from .tokenization_utils_fast import PreTrainedTokenizerFast
+     from .tokenization_xlm_roberta_fast import XLMRobertaTokenizerFast
+     from .tokenization_xlnet_fast import XLNetTokenizerFast
+     if is_sentencepiece_available():
+         from .convert_slow_tokenizer import SLOW_TO_FAST_CONVERTERS, convert_slow_tokenizer
+ else:
+     from .utils.dummy_tokenizers_objects import *
# Trainer
from .trainer_callback import (
@@ -539,7 +583,6 @@ if is_torch_available():
    get_linear_schedule_with_warmup,
    get_polynomial_decay_schedule_with_warmup,
)
- from .tokenization_marian import MarianTokenizer
# Trainer
from .trainer import Trainer
......
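A small usage sketch of the new import layout (my own illustration, not part of the diff): the slow SentencePiece-backed and fast Rust-backed tokenizers now sit behind availability checks, and user code can apply the same guards.

```python
from transformers import is_sentencepiece_available, is_tokenizers_available

# The real classes are only usable when the corresponding backend is installed;
# otherwise a dummy placeholder is exported that raises on use.
if is_sentencepiece_available():
    from transformers import AlbertTokenizer        # slow, SentencePiece-backed
if is_tokenizers_available():
    from transformers import AlbertTokenizerFast    # fast, backed by huggingface/tokenizers

print("sentencepiece:", is_sentencepiece_available(), "| tokenizers:", is_tokenizers_available())
```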
@@ -266,7 +266,7 @@ class AutoConfig:
our S3, e.g., ``dbmdz/bert-base-german-cased``.
- A path to a `directory` containing a configuration file saved using the
:meth:`~transformers.PretrainedConfig.save_pretrained` method, or the
- :meth:`~transformers.PretrainedModel.save_pretrained` method, e.g., ``./my_model_directory/``.
+ :meth:`~transformers.PreTrainedModel.save_pretrained` method, e.g., ``./my_model_directory/``.
- A path or url to a saved configuration JSON `file`, e.g.,
``./my_model_directory/configuration.json``.
cache_dir (:obj:`str`, `optional`):
......
@@ -43,6 +43,9 @@ class PretrainedConfig(object):
recreate the correct object in :class:`~transformers.AutoConfig`.
Args:
+     name_or_path (:obj:`str`, `optional`, defaults to :obj:`""`):
+         Store the string that was passed to :func:`~transformers.PreTrainedModel.from_pretrained` or :func:`~transformers.TFPreTrainedModel.from_pretrained`
+         as ``pretrained_model_name_or_path`` if the configuration was created with such a method.
    output_hidden_states (:obj:`bool`, `optional`, defaults to :obj:`False`):
        Whether or not the model should return all hidden-states.
    output_attentions (:obj:`bool`, `optional`, defaults to :obj:`False`):
@@ -206,6 +209,9 @@ class PretrainedConfig(object):
# TPU arguments
self.xla_device = kwargs.pop("xla_device", None)
+ # Name or path to the pretrained checkpoint
+ self._name_or_path = str(kwargs.pop("name_or_path", ""))
# Additional attributes without default values
for key, value in kwargs.items():
    try:
@@ -214,6 +220,14 @@ class PretrainedConfig(object):
logger.error("Can't set {} with value {} for {}".format(key, value, self))
raise err
+ @property
+ def name_or_path(self) -> str:
+     return self._name_or_path
+ @name_or_path.setter
+ def name_or_path(self, value):
+     self._name_or_path = str(value)  # Make sure that name_or_path is a string (for JSON encoding)
@property
def use_return_dict(self) -> bool:
"""
......
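A quick sketch of the `name_or_path` attribute introduced above (illustrative only): it can be passed as a keyword argument, the setter coerces it to `str`, and `from_pretrained` fills it in automatically (see the modeling changes further down).

```python
from transformers import PretrainedConfig

config = PretrainedConfig(name_or_path="./my_model_directory")
print(config.name_or_path)        # "./my_model_directory"

config.name_or_path = 42          # the setter stringifies the value for JSON serialization
print(repr(config.name_or_path))  # "'42'"
```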
@@ -20,13 +20,14 @@
from typing import Dict, List, Tuple
- from sentencepiece import SentencePieceProcessor
from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers, processors
from tokenizers.models import BPE, Unigram, WordPiece
# from transformers.tokenization_openai import OpenAIGPTTokenizer
from transformers.utils import sentencepiece_model_pb2 as model
+ from .file_utils import requires_sentencepiece
class SentencePieceExtractor:
"""
@@ -35,7 +36,9 @@ class SentencePieceExtractor:
"""
    def __init__(self, model: str):
-         # Get SentencePiece
+         requires_sentencepiece(self)
+         from sentencepiece import SentencePieceProcessor
        self.sp = SentencePieceProcessor()
        self.sp.Load(model)
@@ -568,11 +571,10 @@ class T5Converter(SpmConverter):
)
- CONVERTERS = {
+ SLOW_TO_FAST_CONVERTERS = {
    "AlbertTokenizer": AlbertConverter,
-     "BertTokenizer": BertConverter,
-     "BertGenerationTokenizer": BertGenerationConverter,
    "BartTokenizer": RobertaConverter,
+     "BertTokenizer": BertConverter,
    "CamembertTokenizer": CamembertConverter,
    "DistilBertTokenizer": BertConverter,
    "DPRReaderTokenizer": BertConverter,
@@ -582,12 +584,17 @@ CONVERTERS = {
    "FunnelTokenizer": FunnelConverter,
    "GPT2Tokenizer": GPT2Converter,
    "HerbertTokenizer": HerbertConverter,
+     "LayoutLMTokenizer": BertConverter,
+     "LongformerTokenizer": RobertaConverter,
    "LxmertTokenizer": BertConverter,
    "MBartTokenizer": MBartConverter,
+     "MobileBertTokenizer": BertConverter,
    "OpenAIGPTTokenizer": OpenAIGPTConverter,
    "PegasusTokenizer": PegasusConverter,
    "ReformerTokenizer": ReformerConverter,
+     "RetriBertTokenizer": BertConverter,
    "RobertaTokenizer": RobertaConverter,
+     "SqueezeBertTokenizer": BertConverter,
    "T5Tokenizer": T5Converter,
    "XLMRobertaTokenizer": XLMRobertaConverter,
    "XLNetTokenizer": XLNetConverter,
@@ -595,5 +602,26 @@ CONVERTERS = {
def convert_slow_tokenizer(transformer_tokenizer) -> Tokenizer:
-     converter_class = CONVERTERS[transformer_tokenizer.__class__.__name__]
+     """Utilities to convert a slow tokenizer instance in a fast tokenizer instance.
+     Args:
+         transformer_tokenizer (:class:`~transformers.tokenization_utils_base.PreTrainedTokenizer`):
+             Instance of a slow tokenizer to convert in the backend tokenizer for
+             :class:`~transformers.tokenization_utils_base.PreTrainedTokenizerFast`.
+     Return:
+         A instance of :class:`~tokenizers.Tokenizer` to be used as the backend tokenizer of a
+         :class:`~transformers.tokenization_utils_base.PreTrainedTokenizerFast`
+     """
+     tokenizer_class_name = transformer_tokenizer.__class__.__name__
+     if tokenizer_class_name not in SLOW_TO_FAST_CONVERTERS:
+         raise ValueError(
+             f"An instance of tokenizer class {tokenizer_class_name} cannot be converted in a Fast tokenizer instance. "
+             f"No converter was found. Currently available slow->fast convertors: {list(SLOW_TO_FAST_CONVERTERS.keys())}"
+         )
+     converter_class = SLOW_TO_FAST_CONVERTERS[tokenizer_class_name]
    return converter_class(transformer_tokenizer).converted()
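A hedged usage sketch for the renamed `SLOW_TO_FAST_CONVERTERS` mapping and `convert_slow_tokenizer()` helper above (assumes the `tokenizers` library is installed; the output path is my own choice):

```python
from transformers import BertTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

slow = BertTokenizer.from_pretrained("bert-base-uncased")
backend = convert_slow_tokenizer(slow)   # returns a tokenizers.Tokenizer instance
backend.save("bert-base-uncased-tokenizer.json")
```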
# coding=utf-8
# Copyright 2018 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Convert slow tokenizers checkpoints in fast (serialization format of the `tokenizers` library) """
import argparse
import os
import transformers
from transformers.convert_slow_tokenizer import SLOW_TO_FAST_CONVERTERS
from transformers.utils import logging
logging.set_verbosity_info()
logger = logging.get_logger(__name__)
TOKENIZER_CLASSES = {name: getattr(transformers, name + "Fast") for name in SLOW_TO_FAST_CONVERTERS}
def convert_slow_checkpoint_to_fast(tokenizer_name, checkpoint_name, dump_path, force_download):
if tokenizer_name is not None and tokenizer_name not in TOKENIZER_CLASSES:
raise ValueError("Unrecognized tokenizer name, should be one of {}.".format(list(TOKENIZER_CLASSES.keys())))
if tokenizer_name is None:
tokenizer_names = TOKENIZER_CLASSES
else:
tokenizer_names = {tokenizer_name: getattr(transformers, tokenizer_name + "Fast")}
logger.info(f"Loading tokenizer classes: {tokenizer_names}")
for tokenizer_name in tokenizer_names:
tokenizer_class = TOKENIZER_CLASSES[tokenizer_name]
add_prefix = True
if checkpoint_name is None:
checkpoint_names = list(tokenizer_class.max_model_input_sizes.keys())
else:
checkpoint_names = [checkpoint_name]
logger.info(f"For tokenizer {tokenizer_class.__class__.__name__} loading checkpoints: {checkpoint_names}")
for checkpoint in checkpoint_names:
logger.info(f"Loading {tokenizer_class.__class__.__name__} {checkpoint}")
# Load tokenizer
tokenizer = tokenizer_class.from_pretrained(checkpoint, force_download=force_download)
# Save fast tokenizer
logger.info(
"Save fast tokenizer to {} with prefix {} add_prefix {}".format(dump_path, checkpoint, add_prefix)
)
# For organization names we create sub-directories
if "/" in checkpoint:
checkpoint_directory, checkpoint_prefix_name = checkpoint.split("/")
dump_path_full = os.path.join(dump_path, checkpoint_directory)
elif add_prefix:
checkpoint_prefix_name = checkpoint
dump_path_full = dump_path
else:
checkpoint_prefix_name = None
dump_path_full = dump_path
logger.info(
"=> {} with prefix {}, add_prefix {}".format(dump_path_full, checkpoint_prefix_name, add_prefix)
)
file_path = list(tokenizer.pretrained_vocab_files_map.values())[0][checkpoint]
next_char = file_path.split(checkpoint)[-1][0]
if next_char == "/":
dump_path_full = os.path.join(dump_path_full, checkpoint_prefix_name)
checkpoint_prefix_name = None
logger.info(
"=> {} with prefix {}, add_prefix {}".format(dump_path_full, checkpoint_prefix_name, add_prefix)
)
file_names = tokenizer.save_pretrained(
dump_path_full, legacy_format=False, filename_prefix=checkpoint_prefix_name
)
logger.info("=> File names {}".format(file_names))
for file_name in file_names:
if not file_name.endswith("tokenizer.json"):
os.remove(file_name)
logger.info("=> removing {}".format(file_name))
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--dump_path", default=None, type=str, required=True, help="Path to output generated fast tokenizer files."
)
parser.add_argument(
"--tokenizer_name",
default=None,
type=str,
help="Optional tokenizer type selected in the list of {}. If not given, will download and convert all the checkpoints from AWS.".format(
list(TOKENIZER_CLASSES.keys())
),
)
parser.add_argument(
"--checkpoint_name",
default=None,
type=str,
help="Optional checkpoint name. If not given, will download and convert the canonical checkpoints from AWS.",
)
parser.add_argument(
"--force_download",
action="store_true",
help="Re-dowload checkpoints.",
)
args = parser.parse_args()
convert_slow_checkpoint_to_fast(args.tokenizer_name, args.checkpoint_name, args.dump_path, args.force_download)
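An illustrative call of the script's entry point shown above. The module path is an assumption on my part (the file name is not visible in this extract), so treat this purely as a sketch:

```python
# Hypothetical import path for the new conversion script above.
from transformers.convert_slow_tokenizers_checkpoints_to_fast import convert_slow_checkpoint_to_fast

convert_slow_checkpoint_to_fast(
    tokenizer_name="BertTokenizer",        # key into TOKENIZER_CLASSES
    checkpoint_name="bert-base-uncased",   # a canonical checkpoint for that tokenizer
    dump_path="./converted_fast_tokenizers",
    force_download=False,
)
```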
@@ -4,9 +4,7 @@ from typing import Any, Callable, Dict, List, NewType, Optional, Tuple, Union
import torch
from torch.nn.utils.rnn import pad_sequence
- from ..tokenization_utils import PreTrainedTokenizer
- from ..tokenization_utils_base import BatchEncoding, PaddingStrategy
- from ..tokenization_utils_fast import PreTrainedTokenizerFast
+ from ..tokenization_utils_base import BatchEncoding, PaddingStrategy, PreTrainedTokenizerBase
InputDataClass = NewType("InputDataClass", Any)
@@ -94,7 +92,7 @@ class DataCollatorWithPadding:
>= 7.5 (Volta).
"""
- tokenizer: Union[PreTrainedTokenizer, PreTrainedTokenizerFast]
+ tokenizer: PreTrainedTokenizerBase
padding: Union[bool, str, PaddingStrategy] = True
max_length: Optional[int] = None
pad_to_multiple_of: Optional[int] = None
@@ -124,7 +122,7 @@ class DataCollatorForLanguageModeling:
- preprocesses batches for masked language modeling
"""
- tokenizer: PreTrainedTokenizer
+ tokenizer: PreTrainedTokenizerBase
mlm: bool = True
mlm_probability: float = 0.15
@@ -274,7 +272,7 @@ class DataCollatorForPermutationLanguageModeling:
- preprocesses batches for permutation language modeling with procedures specific to XLNet
"""
- tokenizer: PreTrainedTokenizer
+ tokenizer: PreTrainedTokenizerBase
plm_probability: float = 1 / 6
max_span_length: int = 5  # maximum length of a span of masked tokens
@@ -406,7 +404,7 @@ class DataCollatorForNextSentencePrediction:
- preprocesses batches for masked language modeling
"""
- tokenizer: PreTrainedTokenizer
+ tokenizer: PreTrainedTokenizerBase
mlm: bool = True
block_size: int = 512
short_seq_probability: float = 0.1
......
@@ -9,10 +9,7 @@ from torch.utils.data.dataset import Dataset
from filelock import FileLock
- from ...tokenization_bart import BartTokenizer, BartTokenizerFast
- from ...tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast
- from ...tokenization_utils import PreTrainedTokenizer
- from ...tokenization_xlm_roberta import XLMRobertaTokenizer
+ from ...tokenization_utils_base import PreTrainedTokenizerBase
from ...utils import logging
from ..processors.glue import glue_convert_examples_to_features, glue_output_modes, glue_processors
from ..processors.utils import InputFeatures
@@ -69,7 +66,7 @@ class GlueDataset(Dataset):
def __init__(
    self,
    args: GlueDataTrainingArguments,
-     tokenizer: PreTrainedTokenizer,
+     tokenizer: PreTrainedTokenizerBase,
    limit_length: Optional[int] = None,
    mode: Union[str, Split] = Split.train,
    cache_dir: Optional[str] = None,
@@ -93,12 +90,12 @@ class GlueDataset(Dataset):
    ),
)
label_list = self.processor.get_labels()
- if args.task_name in ["mnli", "mnli-mm"] and tokenizer.__class__ in (
+ if args.task_name in ["mnli", "mnli-mm"] and tokenizer.__class__.__name__ in (
-     RobertaTokenizer,
+     "RobertaTokenizer",
-     RobertaTokenizerFast,
+     "RobertaTokenizerFast",
-     XLMRobertaTokenizer,
+     "XLMRobertaTokenizer",
-     BartTokenizer,
+     "BartTokenizer",
-     BartTokenizerFast,
+     "BartTokenizerFast",
):
# HACK(label indices are swapped in RoBERTa pretrained model)
label_list[1], label_list[2] = label_list[2], label_list[1]
......
@@ -157,6 +157,24 @@ except (AttributeError, ImportError, KeyError):
_in_notebook = False
+ try:
+     import sentencepiece  # noqa: F401
+     _sentencepiece_available = True
+ except ImportError:
+     _sentencepiece_available = False
+ try:
+     import tokenizers  # noqa: F401
+     _tokenizers_available = True
+ except ImportError:
+     _tokenizers_available = False
default_cache_path = os.path.join(torch_cache_home, "transformers")
@@ -170,6 +188,8 @@ TF_WEIGHTS_NAME = "model.ckpt"
CONFIG_NAME = "config.json"
MODEL_CARD_NAME = "modelcard.json"
+ SENTENCEPIECE_UNDERLINE = "▁"
+ SPIECE_UNDERLINE = SENTENCEPIECE_UNDERLINE  # Kept for backward compatibility
MULTIPLE_CHOICE_DUMMY_INPUTS = [
[[0, 1, 0, 1], [1, 0, 0, 1]]
@@ -217,6 +237,18 @@ def is_faiss_available():
return _faiss_available
+ def is_sklearn_available():
+     return _has_sklearn
+ def is_sentencepiece_available():
+     return _sentencepiece_available
+ def is_tokenizers_available():
+     return _tokenizers_available
def is_in_notebook():
return _in_notebook
@@ -234,10 +266,6 @@ def torch_only_method(fn):
return wrapper
- def is_sklearn_available():
-     return _has_sklearn
DATASETS_IMPORT_ERROR = """
{0} requires the 🤗 Datasets library but it was not found in your enviromnent. You can install it with:
```
@@ -255,6 +283,25 @@ that python file if that's the case.
"""
+ TOKENIZERS_IMPORT_ERROR = """
+ {0} requires the 🤗 Tokenizers library but it was not found in your enviromnent. You can install it with:
+ ```
+ pip install tokenizers
+ ```
+ In a notebook or a colab, you can install it by executing a cell with
+ ```
+ !pip install tokenizers
+ ```
+ """
+ SENTENCEPIECE_IMPORT_ERROR = """
+ {0} requires the SentencePiece library but it was not found in your enviromnent. Checkout the instructions on the
+ installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones
+ that match your enviromnent.
+ """
FAISS_IMPORT_ERROR = """
{0} requires the faiss library but it was not found in your enviromnent. Checkout the instructions on the
installation page of its repo: https://github.com/facebookresearch/faiss/blob/master/INSTALL.md and follow the ones
@@ -316,6 +363,18 @@ def requires_tf(obj):
raise ImportError(TENSORFLOW_IMPORT_ERROR.format(name))
+ def requires_tokenizers(obj):
+     name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__
+     if not is_tokenizers_available():
+         raise ImportError(TOKENIZERS_IMPORT_ERROR.format(name))
+ def requires_sentencepiece(obj):
+     name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__
+     if not is_sentencepiece_available():
+         raise ImportError(SENTENCEPIECE_IMPORT_ERROR.format(name))
def add_start_docstrings(*docstr):
def docstring_decorator(fn):
fn.__doc__ = "".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else "")
......
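A minimal sketch (my own, not from the diff) of how the new availability checks and `requires_*` guards are meant to be used by downstream code:

```python
from transformers.file_utils import is_sentencepiece_available, requires_sentencepiece


class MySpmBackedTokenizer:
    def __init__(self):
        # Raises ImportError with the SENTENCEPIECE_IMPORT_ERROR message when the
        # sentencepiece package is missing, instead of failing at import time.
        requires_sentencepiece(self)
        import sentencepiece as spm

        self.sp_model = spm.SentencePieceProcessor()


if is_sentencepiece_available():
    tok = MySpmBackedTokenizer()
```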
@@ -346,8 +346,9 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin, TFGenerationMixin):
        self.__class__.__name__, self.__class__.__name__
    )
)
- # Save config in model
+ # Save config and origin of the pretrained weights if given in model
self.config = config
+ self.name_or_path = config.name_or_path
def get_input_embeddings(self) -> tf.keras.layers.Layer:
"""
@@ -690,6 +691,8 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin, TFGenerationMixin):
else:
    resolved_archive_file = None
+ config.name_or_path = pretrained_model_name_or_path
# Instantiate model.
model = cls(config, *model_args, **model_kwargs)
......
@@ -432,8 +432,9 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin):
        self.__class__.__name__, self.__class__.__name__
    )
)
- # Save config in model
+ # Save config and origin of the pretrained weights if given in model
self.config = config
+ self.name_or_path = config.name_or_path
@property
def base_model(self) -> nn.Module:
@@ -933,6 +934,8 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin):
else:
    resolved_archive_file = None
+ config.name_or_path = pretrained_model_name_or_path
# Instantiate model.
model = cls(config, *model_args, **model_kwargs)
......
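Putting the configuration and modeling changes together, a hedged sketch of what `name_or_path` now records (requires downloading the checkpoint):

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
# from_pretrained stores the identifier on the config, and __init__ mirrors it on the model.
print(model.config.name_or_path)  # "bert-base-uncased"
print(model.name_or_path)         # "bert-base-uncased"
```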
@@ -10,7 +10,15 @@ from distutils.util import strtobool
from io import StringIO
from pathlib import Path
- from .file_utils import _datasets_available, _faiss_available, _tf_available, _torch_available, _torch_tpu_available
+ from .file_utils import (
+     _datasets_available,
+     _faiss_available,
+     _sentencepiece_available,
+     _tf_available,
+     _tokenizers_available,
+     _torch_available,
+     _torch_tpu_available,
+ )
SMALL_MODEL_IDENTIFIER = "julien-c/bert-xsmall-dummy"
@@ -107,6 +115,32 @@ def require_tf(test_case):
return test_case
+ def require_sentencepiece(test_case):
+     """
+     Decorator marking a test that requires SentencePiece.
+     These tests are skipped when SentencePiece isn't installed.
+     """
+     if not _sentencepiece_available:
+         return unittest.skip("test requires SentencePiece")(test_case)
+     else:
+         return test_case
+ def require_tokenizers(test_case):
+     """
+     Decorator marking a test that requires 🤗 Tokenizers.
+     These tests are skipped when 🤗 Tokenizers isn't installed.
+     """
+     if not _tokenizers_available:
+         return unittest.skip("test requires tokenizers")(test_case)
+     else:
+         return test_case
def require_multigpu(test_case):
"""
Decorator marking a test that requires a multi-GPU setup (in PyTorch).
......
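A sketch of how the new test decorators are applied (mirrors the pattern used in the test suite; the test class itself is made up for illustration):

```python
import unittest

from transformers.testing_utils import require_sentencepiece, require_tokenizers


@require_sentencepiece
@require_tokenizers
class DummyAlbertTokenizationTest(unittest.TestCase):
    """Skipped automatically when either backend is missing."""

    def test_placeholder(self):
        self.assertTrue(True)
```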
@@ -18,10 +18,11 @@
import os
import unicodedata
from shutil import copyfile
- from typing import List, Optional
+ from typing import List, Optional, Tuple
+ import sentencepiece as spm
from .tokenization_utils import PreTrainedTokenizer
- from .tokenization_utils_fast import PreTrainedTokenizerFast
from .utils import logging
@@ -138,15 +139,6 @@ class AlbertTokenizer(PreTrainedTokenizer):
    **kwargs,
)
- try:
-     import sentencepiece as spm
- except ImportError:
-     logger.warning(
-         "You need to install SentencePiece to use AlbertTokenizer: https://github.com/google/sentencepiece"
-         "pip install sentencepiece"
-     )
-     raise
self.do_lower_case = do_lower_case
self.remove_space = remove_space
self.keep_accents = keep_accents
@@ -171,14 +163,6 @@ class AlbertTokenizer(PreTrainedTokenizer):
def __setstate__(self, d):
    self.__dict__ = d
- try:
-     import sentencepiece as spm
- except ImportError:
-     logger.warning(
-         "You need to install SentencePiece to use AlbertTokenizer: https://github.com/google/sentencepiece"
-         "pip install sentencepiece"
-     )
-     raise
self.sp_model = spm.SentencePieceProcessor()
self.sp_model.Load(self.vocab_file)
@@ -321,225 +305,14 @@ class AlbertTokenizer(PreTrainedTokenizer):
    return len(cls + token_ids_0 + sep) * [0]
return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]
- def save_vocabulary(self, save_directory):
+ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
"""
Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.
Args:
save_directory (:obj:`str`):
The directory in which to save the vocabulary.
Returns:
:obj:`Tuple(str)`: Paths to the files saved.
"""
if not os.path.isdir(save_directory):
    logger.error("Vocabulary path ({}) should be a directory".format(save_directory))
    return
- out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES["vocab_file"])
+ out_vocab_file = os.path.join(
+     save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
copyfile(self.vocab_file, out_vocab_file)
return (out_vocab_file,)
class AlbertTokenizerFast(PreTrainedTokenizerFast):
"""
Construct a "fast" ALBERT tokenizer (backed by HuggingFace's `tokenizers` library). Based on
`SentencePiece <https://github.com/google/sentencepiece>`__.
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the main
methods. Users should refer to this superclass for more information regarding those methods.
Args:
vocab_file (:obj:`str`):
`SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a `.spm` extension) that
contains the vocabulary necessary to instantiate a tokenizer.
do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to lowercase the input when tokenizing.
remove_space (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to strip the text when tokenizing (removing excess spaces before and after the string).
keep_accents (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to keep accents when tokenizing.
bos_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
.. note::
When building a sequence using special tokens, this is not the token that is used for the beginning
of sequence. The token used is the :obj:`cls_token`.
eos_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
The end of sequence token.
.. note::
When building a sequence using special tokens, this is not the token that is used for the end
of sequence. The token used is the :obj:`sep_token`.
unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences
for sequence classification or for a text and a question for question answering.
It is also used as the last token of a sequence built with special tokens.
pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
The classifier token which is used when doing sequence classification (classification of the whole
sequence instead of per-token classification). It is the first token of the sequence when built with
special tokens.
mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`):
The token used for masking values. This is the token used when training this model with masked language
modeling. This is the token which the model will try to predict.
Attributes:
sp_model (:obj:`SentencePieceProcessor`):
The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).
"""
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
slow_tokenizer_class = AlbertTokenizer
def __init__(
self,
vocab_file,
do_lower_case=True,
remove_space=True,
keep_accents=False,
bos_token="[CLS]",
eos_token="[SEP]",
unk_token="<unk>",
sep_token="[SEP]",
pad_token="<pad>",
cls_token="[CLS]",
mask_token="[MASK]",
**kwargs
):
super().__init__(
vocab_file,
do_lower_case=do_lower_case,
remove_space=remove_space,
keep_accents=keep_accents,
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
sep_token=sep_token,
pad_token=pad_token,
cls_token=cls_token,
mask_token=mask_token,
**kwargs,
) )
self.do_lower_case = do_lower_case
self.remove_space = remove_space
self.keep_accents = keep_accents
self.vocab_file = vocab_file
def build_inputs_with_special_tokens(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Build model inputs from a sequence or a pair of sequence for sequence classification tasks
by concatenating and adding special tokens.
An ALBERT sequence has the following format:
- single sequence: ``[CLS] X [SEP]``
- pair of sequences: ``[CLS] A [SEP] B [SEP]``
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
if token_ids_1 is None:
return cls + token_ids_0 + sep
return cls + token_ids_0 + sep + token_ids_1 + sep
def get_special_tokens_mask(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
"""
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer ``prepare_for_model`` method.
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the token list is already formatted with special tokens for the model.
Returns:
:obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
if already_has_special_tokens:
if token_ids_1 is not None:
raise ValueError(
"You should not supply a second sequence if the provided sequence of "
"ids is already formatted with special tokens for the model."
)
return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))
if token_ids_1 is not None:
return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
return [1] + ([0] * len(token_ids_0)) + [1]
def create_token_type_ids_from_sequences(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Create a mask from the two sequences passed to be used in a sequence-pair classification task.
An ALBERT sequence pair mask has the following format:
::
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |
If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
sequence(s).
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
if token_ids_1 is None:
return len(cls + token_ids_0 + sep) * [0]
return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]
def save_vocabulary(self, save_directory):
"""
Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.
Args:
save_directory (:obj:`str`):
The directory in which to save the vocabulary.
Returns:
:obj:`Tuple(str)`: Paths to the files saved.
"""
if not os.path.isdir(save_directory):
logger.error("Vocabulary path ({}) should be a directory".format(save_directory))
return
out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES["vocab_file"])
if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
    copyfile(self.vocab_file, out_vocab_file)
......
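A short sketch of the new `filename_prefix` argument on `save_vocabulary` (assumes `sentencepiece` is installed; the paths are my own choices, and the target directory must already exist):

```python
import os

from transformers import AlbertTokenizer

tok = AlbertTokenizer.from_pretrained("albert-base-v2")
os.makedirs("./albert_vocab", exist_ok=True)
# The SentencePiece model is copied to "./albert_vocab/albert-base-v2-spiece.model".
paths = tok.save_vocabulary("./albert_vocab", filename_prefix="albert-base-v2")
print(paths)
```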
# coding=utf-8
# Copyright 2018 Google AI, Google Brain and the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Tokenization classes for ALBERT model."""
import os
from shutil import copyfile
from typing import List, Optional, Tuple
from .file_utils import is_sentencepiece_available
from .tokenization_utils_fast import PreTrainedTokenizerFast
from .utils import logging
if is_sentencepiece_available():
from .tokenization_albert import AlbertTokenizer
else:
AlbertTokenizer = None
logger = logging.get_logger(__name__)
VOCAB_FILES_NAMES = {"vocab_file": "spiece.model", "tokenizer_file": "tokenizer.json"}
PRETRAINED_VOCAB_FILES_MAP = {
"vocab_file": {
"albert-base-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v1-spiece.model",
"albert-large-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v1-spiece.model",
"albert-xlarge-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v1-spiece.model",
"albert-xxlarge-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v1-spiece.model",
"albert-base-v2": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v2-spiece.model",
"albert-large-v2": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v2-spiece.model",
"albert-xlarge-v2": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v2-spiece.model",
"albert-xxlarge-v2": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v2-spiece.model",
},
"tokenizer_file": {
"albert-base-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v1-tokenizer.json",
"albert-large-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v1-tokenizer.json",
"albert-xlarge-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v1-tokenizer.json",
"albert-xxlarge-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v1-tokenizer.json",
"albert-base-v2": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v2-tokenizer.json",
"albert-large-v2": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v2-tokenizer.json",
"albert-xlarge-v2": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v2-tokenizer.json",
"albert-xxlarge-v2": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v2-tokenizer.json",
},
}
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
"albert-base-v1": 512,
"albert-large-v1": 512,
"albert-xlarge-v1": 512,
"albert-xxlarge-v1": 512,
"albert-base-v2": 512,
"albert-large-v2": 512,
"albert-xlarge-v2": 512,
"albert-xxlarge-v2": 512,
}
SPIECE_UNDERLINE = "▁"
class AlbertTokenizerFast(PreTrainedTokenizerFast):
"""
Construct a "fast" ALBERT tokenizer (backed by HuggingFace's `tokenizers` library). Based on
`SentencePiece <https://github.com/google/sentencepiece>`__.
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the main
methods. Users should refer to this superclass for more information regarding those methods.
Args:
vocab_file (:obj:`str`):
`SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a `.spm` extension) that
contains the vocabulary necessary to instantiate a tokenizer.
do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to lowercase the input when tokenizing.
remove_space (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to strip the text when tokenizing (removing excess spaces before and after the string).
keep_accents (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to keep accents when tokenizing.
bos_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
.. note::
When building a sequence using special tokens, this is not the token that is used for the beginning
of sequence. The token used is the :obj:`cls_token`.
eos_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
The end of sequence token.
.. note::
When building a sequence using special tokens, this is not the token that is used for the end
of sequence. The token used is the :obj:`sep_token`.
unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences
for sequence classification or for a text and a question for question answering.
It is also used as the last token of a sequence built with special tokens.
pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
The classifier token which is used when doing sequence classification (classification of the whole
sequence instead of per-token classification). It is the first token of the sequence when built with
special tokens.
mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`):
The token used for masking values. This is the token used when training this model with masked language
modeling. This is the token which the model will try to predict.
Attributes:
sp_model (:obj:`SentencePieceProcessor`):
The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).
"""
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
slow_tokenizer_class = AlbertTokenizer
def __init__(
self,
vocab_file,
tokenizer_file=None,
do_lower_case=True,
remove_space=True,
keep_accents=False,
bos_token="[CLS]",
eos_token="[SEP]",
unk_token="<unk>",
sep_token="[SEP]",
pad_token="<pad>",
cls_token="[CLS]",
mask_token="[MASK]",
**kwargs
):
super().__init__(
vocab_file,
tokenizer_file=tokenizer_file,
do_lower_case=do_lower_case,
remove_space=remove_space,
keep_accents=keep_accents,
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
sep_token=sep_token,
pad_token=pad_token,
cls_token=cls_token,
mask_token=mask_token,
**kwargs,
)
self.do_lower_case = do_lower_case
self.remove_space = remove_space
self.keep_accents = keep_accents
self.vocab_file = vocab_file
def build_inputs_with_special_tokens(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Build model inputs from a sequence or a pair of sequence for sequence classification tasks
by concatenating and adding special tokens.
An ALBERT sequence has the following format:
- single sequence: ``[CLS] X [SEP]``
- pair of sequences: ``[CLS] A [SEP] B [SEP]``
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
if token_ids_1 is None:
return cls + token_ids_0 + sep
return cls + token_ids_0 + sep + token_ids_1 + sep
def get_special_tokens_mask(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
"""
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer ``prepare_for_model`` method.
Args:
token_ids_0 (:obj:`List[int]`):
List of ids.
token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
Set to :obj:`True` if the token list is already formatted with special tokens for the model.
Returns:
:obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
if already_has_special_tokens:
if token_ids_1 is not None:
raise ValueError(
"You should not supply a second sequence if the provided sequence of "
"ids is already formatted with special tokens for the model."
)
return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))
if token_ids_1 is not None:
return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
return [1] + ([0] * len(token_ids_0)) + [1]
def create_token_type_ids_from_sequences(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
An ALBERT sequence pair mask has the following format:
::
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |
If :obj:`token_ids_1` is :obj:`None`, only the first portion of the mask (0s) is returned.
Args:
token_ids_0 (:obj:`List[int]`):
List of ids.
token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
sequence(s).
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
if token_ids_1 is None:
return len(cls + token_ids_0 + sep) * [0]
return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]
def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
if not os.path.isdir(save_directory):
logger.error("Vocabulary path ({}) should be a directory".format(save_directory))
return
out_vocab_file = os.path.join(
save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
)
if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
copyfile(self.vocab_file, out_vocab_file)
return (out_vocab_file,)
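As a quick sanity check of the special-token helpers defined above, a minimal sketch (not part of this commit, assuming the `tokenizers` backend is installed and the `albert-base-v2` files can be fetched):

# Hedged example exercising AlbertTokenizerFast as defined above.
from transformers import AlbertTokenizerFast

tokenizer = AlbertTokenizerFast.from_pretrained("albert-base-v2")

ids_a = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Hello world"))
ids_b = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("How are you?"))

# [CLS] A [SEP] B [SEP]
pair_ids = tokenizer.build_inputs_with_special_tokens(ids_a, ids_b)
# 0s over "[CLS] A [SEP]", 1s over "B [SEP]"
token_type_ids = tokenizer.create_token_type_ids_from_sequences(ids_a, ids_b)
# 1 for every [CLS]/[SEP] position, 0 for regular tokens
special_mask = tokenizer.get_special_tokens_mask(pair_ids, already_has_special_tokens=True)

assert len(pair_ids) == len(token_type_ids) == len(special_mask)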
...
@@ -56,45 +56,108 @@ from .configuration_auto import (
replace_list_option_in_docstrings,
)
from .configuration_utils import PretrainedConfig
-from .tokenization_albert import AlbertTokenizer, AlbertTokenizerFast
-from .tokenization_bart import BartTokenizer, BartTokenizerFast
-from .tokenization_bert import BertTokenizer, BertTokenizerFast
-from .tokenization_bert_generation import BertGenerationTokenizer
from .file_utils import is_sentencepiece_available, is_tokenizers_available
from .tokenization_bart import BartTokenizer
from .tokenization_bert import BertTokenizer
from .tokenization_bert_japanese import BertJapaneseTokenizer
from .tokenization_bertweet import BertweetTokenizer
from .tokenization_blenderbot import BlenderbotSmallTokenizer
-from .tokenization_camembert import CamembertTokenizer, CamembertTokenizerFast
from .tokenization_ctrl import CTRLTokenizer
from .tokenization_deberta import DebertaTokenizer
-from .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFast
-from .tokenization_dpr import DPRQuestionEncoderTokenizer, DPRQuestionEncoderTokenizerFast
-from .tokenization_electra import ElectraTokenizer, ElectraTokenizerFast
from .tokenization_distilbert import DistilBertTokenizer
from .tokenization_dpr import DPRQuestionEncoderTokenizer
from .tokenization_electra import ElectraTokenizer
from .tokenization_flaubert import FlaubertTokenizer
from .tokenization_fsmt import FSMTTokenizer
-from .tokenization_funnel import FunnelTokenizer, FunnelTokenizerFast
-from .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast
-from .tokenization_layoutlm import LayoutLMTokenizer, LayoutLMTokenizerFast
-from .tokenization_longformer import LongformerTokenizer, LongformerTokenizerFast
-from .tokenization_lxmert import LxmertTokenizer, LxmertTokenizerFast
-from .tokenization_marian import MarianTokenizer
-from .tokenization_mbart import MBartTokenizer, MBartTokenizerFast
-from .tokenization_mobilebert import MobileBertTokenizer, MobileBertTokenizerFast
-from .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast
-from .tokenization_pegasus import PegasusTokenizer, PegasusTokenizerFast
from .tokenization_funnel import FunnelTokenizer
from .tokenization_gpt2 import GPT2Tokenizer
from .tokenization_layoutlm import LayoutLMTokenizer
from .tokenization_longformer import LongformerTokenizer
from .tokenization_lxmert import LxmertTokenizer
from .tokenization_mobilebert import MobileBertTokenizer
from .tokenization_openai import OpenAIGPTTokenizer
from .tokenization_phobert import PhobertTokenizer
from .tokenization_rag import RagTokenizer
-from .tokenization_reformer import ReformerTokenizer, ReformerTokenizerFast
-from .tokenization_retribert import RetriBertTokenizer, RetriBertTokenizerFast
-from .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast
-from .tokenization_squeezebert import SqueezeBertTokenizer, SqueezeBertTokenizerFast
-from .tokenization_t5 import T5Tokenizer, T5TokenizerFast
from .tokenization_retribert import RetriBertTokenizer
from .tokenization_roberta import RobertaTokenizer
from .tokenization_squeezebert import SqueezeBertTokenizer
from .tokenization_transfo_xl import TransfoXLTokenizer
from .tokenization_xlm import XLMTokenizer
-from .tokenization_xlm_roberta import XLMRobertaTokenizer, XLMRobertaTokenizerFast
-from .tokenization_xlnet import XLNetTokenizer, XLNetTokenizerFast
from .utils import logging
if is_sentencepiece_available():
from .tokenization_albert import AlbertTokenizer
from .tokenization_bert_generation import BertGenerationTokenizer
from .tokenization_camembert import CamembertTokenizer
from .tokenization_marian import MarianTokenizer
from .tokenization_mbart import MBartTokenizer
from .tokenization_pegasus import PegasusTokenizer
from .tokenization_reformer import ReformerTokenizer
from .tokenization_t5 import T5Tokenizer
from .tokenization_xlm_roberta import XLMRobertaTokenizer
from .tokenization_xlnet import XLNetTokenizer
else:
AlbertTokenizer = None
BertGenerationTokenizer = None
CamembertTokenizer = None
MarianTokenizer = None
MBartTokenizer = None
PegasusTokenizer = None
ReformerTokenizer = None
T5Tokenizer = None
XLMRobertaTokenizer = None
XLNetTokenizer = None
if is_tokenizers_available():
from .tokenization_albert_fast import AlbertTokenizerFast
from .tokenization_bart_fast import BartTokenizerFast
from .tokenization_bert_fast import BertTokenizerFast
from .tokenization_camembert_fast import CamembertTokenizerFast
from .tokenization_distilbert_fast import DistilBertTokenizerFast
from .tokenization_dpr_fast import DPRQuestionEncoderTokenizerFast
from .tokenization_electra_fast import ElectraTokenizerFast
from .tokenization_funnel_fast import FunnelTokenizerFast
from .tokenization_gpt2_fast import GPT2TokenizerFast
from .tokenization_layoutlm_fast import LayoutLMTokenizerFast
from .tokenization_longformer_fast import LongformerTokenizerFast
from .tokenization_lxmert_fast import LxmertTokenizerFast
from .tokenization_mbart_fast import MBartTokenizerFast
from .tokenization_mobilebert_fast import MobileBertTokenizerFast
from .tokenization_openai_fast import OpenAIGPTTokenizerFast
from .tokenization_pegasus_fast import PegasusTokenizerFast
from .tokenization_reformer_fast import ReformerTokenizerFast
from .tokenization_retribert_fast import RetriBertTokenizerFast
from .tokenization_roberta_fast import RobertaTokenizerFast
from .tokenization_squeezebert_fast import SqueezeBertTokenizerFast
from .tokenization_t5_fast import T5TokenizerFast
from .tokenization_xlm_roberta_fast import XLMRobertaTokenizerFast
from .tokenization_xlnet_fast import XLNetTokenizerFast
else:
AlbertTokenizerFast = None
BartTokenizerFast = None
BertTokenizerFast = None
CamembertTokenizerFast = None
DistilBertTokenizerFast = None
DPRQuestionEncoderTokenizerFast = None
ElectraTokenizerFast = None
FunnelTokenizerFast = None
GPT2TokenizerFast = None
LayoutLMTokenizerFast = None
LongformerTokenizerFast = None
LxmertTokenizerFast = None
MBartTokenizerFast = None
MobileBertTokenizerFast = None
OpenAIGPTTokenizerFast = None
PegasusTokenizerFast = None
ReformerTokenizerFast = None
RetriBertTokenizerFast = None
RobertaTokenizerFast = None
SqueezeBertTokenizerFast = None
T5TokenizerFast = None
XLMRobertaTokenizerFast = None
XLNetTokenizerFast = None
logger = logging.get_logger(__name__)
...
@@ -111,7 +174,7 @@ TOKENIZER_MAPPING = OrderedDict(
(XLMRobertaConfig, (XLMRobertaTokenizer, XLMRobertaTokenizerFast)),
(MarianConfig, (MarianTokenizer, None)),
(BlenderbotConfig, (BlenderbotSmallTokenizer, None)),
-(LongformerConfig, (LongformerTokenizer, None)),
(LongformerConfig, (LongformerTokenizer, LongformerTokenizerFast)),
(BartConfig, (BartTokenizer, BartTokenizerFast)),
(LongformerConfig, (LongformerTokenizer, LongformerTokenizerFast)),
(RobertaConfig, (BertweetTokenizer, None)),
...
@@ -139,7 +202,11 @@ TOKENIZER_MAPPING = OrderedDict(
]
)
-SLOW_TOKENIZER_MAPPING = {k: v[0] for k, v in TOKENIZER_MAPPING.items()}
SLOW_TOKENIZER_MAPPING = {
k: (v[0] if v[0] is not None else v[1])
for k, v in TOKENIZER_MAPPING.items()
if (v[0] is not None or v[1] is not None)
}
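For illustration, a self-contained sketch of what the rewritten comprehension produces; the string values below are hypothetical stand-ins for the real config and tokenizer classes:

# Hypothetical placeholder mapping: in the library the keys are config classes
# and the values are (slow, fast) tokenizer classes, with None for a backend
# that is not installed.
TOKENIZER_MAPPING = {
    "AlbertConfig": (None, "AlbertTokenizerFast"),          # sentencepiece missing
    "BertConfig": ("BertTokenizer", "BertTokenizerFast"),   # both backends available
    "SomeConfig": (None, None),                             # neither backend available
}
SLOW_TOKENIZER_MAPPING = {
    k: (v[0] if v[0] is not None else v[1])
    for k, v in TOKENIZER_MAPPING.items()
    if (v[0] is not None or v[1] is not None)
}
# -> {'AlbertConfig': 'AlbertTokenizerFast', 'BertConfig': 'BertTokenizer'}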
class AutoTokenizer:
...
@@ -254,7 +321,7 @@ class AutoTokenizer:
if type(config) in TOKENIZER_MAPPING.keys():
tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
-if tokenizer_class_fast and use_fast:
if tokenizer_class_fast and (use_fast or tokenizer_class_py is None):
return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
else:
return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
......
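Taken together, the user-facing effect of the change can be sketched as follows (a hedged example, assuming the `albert-base-v2` files can be downloaded; `is_sentencepiece_available` and `is_tokenizers_available` are the helpers imported in the diff above):

# Sketch of the behaviour introduced above, not an authoritative test.
from transformers import AutoTokenizer
from transformers.file_utils import is_sentencepiece_available, is_tokenizers_available

print("sentencepiece installed:", is_sentencepiece_available())
print("tokenizers installed:", is_tokenizers_available())

# With `tokenizers` installed, use_fast=True returns AlbertTokenizerFast.
# If sentencepiece is missing, the slow entry in TOKENIZER_MAPPING is None and
# the new `tokenizer_class_py is None` branch still hands back the fast class
# instead of raising.
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2", use_fast=True)
print(type(tokenizer).__name__)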