Unverified commit ba8c4d0a, authored by Thomas Wolf and committed by GitHub

[Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659)

* splitting fast and slow tokenizers [WIP]

* [WIP] splitting sentencepiece and tokenizers dependencies

* update dummy objects

* add name_or_path to models and tokenizers

* prefix added to file names

* prefix

* styling + quality

* splitting all the tokenizer files - sorting sentencepiece-based ones

* update tokenizers version to 0.9.0

* remove hard dependency on sentencepiece 🎉

* and removed hard dependency on tokenizers 🎉



* update conversion script

* update missing models

* fixing tests

* move test_tokenization_fast to main tokenization tests - fix bugs

* bump up tokenizers

* fix bert_generation

* update and fix several tokenizers

* keep sentencepiece in deps for now

* fix funnel and deberta tests

* fix fsmt

* fix marian tests

* fix layoutlm

* fix squeezebert and gpt2

* fix T5 tokenization

* fix xlnet tests

* style

* fix mbart

* bump up tokenizers to 0.9.2

* fix model tests

* fix tf models

* fix seq2seq examples

* fix tests without sentencepiece

* fix slow => fast conversion without sentencepiece

* update auto and bert generation tests

* fix mbart tests

* fix auto and common test without tokenizers

* fix tests without tokenizers

* clean up and lighten tests when tokenizers + sentencepiece are both off

* style, quality, and test fixes

* add sentencepiece to doc/examples reqs

* leave sentencepiece on for now

* style/quality, split herbert, and fix pegasus

* WIP Herbert fast

* add sample_text_no_unicode and fix herbert tokenization

* skip FSMT example test for now

* fix style

* fix fsmt in example tests

* update following Lysandre and Sylvain's comments

* Update src/transformers/testing_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/testing_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent c65863ce
@@ -198,7 +198,7 @@ jobs:
  - v0.3-build_doc-{{ checksum "setup.py" }}
  - v0.3-{{ checksum "setup.py" }}
  - run: pip install --upgrade pip
- - run: pip install .[tf,torch,docs]
+ - run: pip install .[tf,torch,sentencepiece,docs]
  - save_cache:
      key: v0.3-build_doc-{{ checksum "setup.py" }}
      paths:
@@ -219,7 +219,7 @@ jobs:
  keys:
  - v0.3-deploy_doc-{{ checksum "setup.py" }}
  - v0.3-{{ checksum "setup.py" }}
- - run: pip install .[tf,torch,docs]
+ - run: pip install .[tf,torch,sentencepiece,docs]
  - save_cache:
      key: v0.3-deploy_doc-{{ checksum "setup.py" }}
      paths:
......
@@ -30,8 +30,7 @@ jobs:
run: |
  pip install --upgrade pip
  pip install torch
- pip install numpy filelock protobuf requests tqdm regex sentencepiece sacremoses packaging
+ pip install numpy filelock protobuf requests tqdm regex sentencepiece sacremoses tokenizers packaging
- pip install tokenizers==0.9.0.rc2
- name: Torch hub list
run: |
......
@@ -9,7 +9,8 @@ __pycache__/
*.so
# tests and logs
- tests/fixtures
+ tests/fixtures/*
+ !tests/fixtures/sample_text_no_unicode.txt
logs/
lightning_logs/
lang_code_data/
......
@@ -758,8 +758,8 @@ Here is an example of using the pipelines to do summarization. It leverages a Ba
... If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.
... """
- Because the summarization pipeline depends on the ``PretrainedModel.generate()`` method, we can override the default arguments
+ Because the summarization pipeline depends on the ``PreTrainedModel.generate()`` method, we can override the default arguments
- of ``PretrainedModel.generate()`` directly in the pipeline for ``max_length`` and ``min_length`` as shown below.
+ of ``PreTrainedModel.generate()`` directly in the pipeline for ``max_length`` and ``min_length`` as shown below.
This outputs the following summary:
.. code-block::
@@ -772,7 +772,7 @@ Here is an example of doing summarization using a model and a tokenizer. The pro
1. Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder model, such as ``Bart`` or ``T5``.
2. Define the article that should be summarized.
3. Add the T5 specific prefix "summarize: ".
- 4. Use the ``PretrainedModel.generate()`` method to generate the summary.
+ 4. Use the ``PreTrainedModel.generate()`` method to generate the summary.
In this example we use Google`s T5 model. Even though it was pre-trained only on a multi-task mixed dataset (including CNN / Daily Mail), it yields very good results.
@@ -819,15 +819,15 @@ translation results.
>>> print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))
[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]
- Because the translation pipeline depends on the ``PretrainedModel.generate()`` method, we can override the default arguments
+ Because the translation pipeline depends on the ``PreTrainedModel.generate()`` method, we can override the default arguments
- of ``PretrainedModel.generate()`` directly in the pipeline as is shown for ``max_length`` above.
+ of ``PreTrainedModel.generate()`` directly in the pipeline as is shown for ``max_length`` above.
Here is an example of doing translation using a model and a tokenizer. The process is the following:
1. Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder model, such as ``Bart`` or ``T5``.
2. Define the article that should be summarizaed.
3. Add the T5 specific prefix "translate English to German: "
- 4. Use the ``PretrainedModel.generate()`` method to perform the translation.
+ 4. Use the ``PreTrainedModel.generate()`` method to perform the translation.
.. code-block::
......
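For readers following the docs hunk above, here is a minimal, illustrative sketch of what it describes: any `generate()` argument such as `max_length` or `min_length` can be overridden directly in the pipeline call. The default summarization checkpoint and the shortened article are my own assumptions, not part of the diff.

```python
from transformers import pipeline

# Default summarization pipeline; generate() defaults (max_length, min_length, ...)
# can be overridden per call, as the documentation above states.
summarizer = pipeline("summarization")

ARTICLE = "New York (CNN) When Liana Barrientos was 23 years old, she got married in Westchester County."
print(summarizer(ARTICLE, max_length=60, min_length=10, do_sample=False))
```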
@@ -17,3 +17,4 @@ datasets
fire
pytest
conllu
+ sentencepiece != 0.1.92
@@ -92,12 +92,13 @@ extras["onnxruntime"] = ["onnxruntime>=1.4.0", "onnxruntime-tools>=1.4.2"]
extras["serving"] = ["pydantic", "uvicorn", "fastapi", "starlette"]
extras["all"] = extras["serving"] + ["tensorflow", "torch"]
+ extras["sentencepiece"] = ["sentencepiece!=0.1.92"]
extras["retrieval"] = ["faiss-cpu", "datasets"]
extras["testing"] = ["pytest", "pytest-xdist", "timeout-decorator", "parameterized", "psutil"] + extras["retrieval"]
# sphinx-rtd-theme==0.5.0 introduced big changes in the style.
extras["docs"] = ["recommonmark", "sphinx", "sphinx-markdown-tables", "sphinx-rtd-theme==0.4.3", "sphinx-copybutton"]
extras["quality"] = ["black >= 20.8b1", "isort >= 5.5.4", "flake8 >= 3.8.3"]
- extras["dev"] = extras["testing"] + extras["quality"] + extras["ja"] + ["scikit-learn", "tensorflow", "torch"]
+ extras["dev"] = extras["testing"] + extras["quality"] + extras["ja"] + ["scikit-learn", "tensorflow", "torch", "sentencepiece!=0.1.92"]
setup(
name="transformers",
@@ -114,7 +115,7 @@ setup(
packages=find_packages("src"),
install_requires=[
    "numpy",
-     "tokenizers == 0.9.0.rc2",
+     "tokenizers == 0.9.2",
    # dataclasses for Python versions that don't have it
    "dataclasses;python_version<'3.7'",
    # utilities from PyPA to e.g. compare versions
......
@@ -92,6 +92,7 @@ from .file_utils import (
    MODEL_CARD_NAME,
    PYTORCH_PRETRAINED_BERT_CACHE,
    PYTORCH_TRANSFORMERS_CACHE,
+     SPIECE_UNDERLINE,
    TF2_WEIGHTS_NAME,
    TF_WEIGHTS_NAME,
    TRANSFORMERS_CACHE,
@@ -104,8 +105,10 @@ from .file_utils import (
    is_faiss_available,
    is_psutil_available,
    is_py3nvml_available,
+     is_sentencepiece_available,
    is_sklearn_available,
    is_tf_available,
+     is_tokenizers_available,
    is_torch_available,
    is_torch_tpu_available,
)
@@ -152,49 +155,41 @@ from .pipelines import (
from .retrieval_rag import RagRetriever
# Tokenizers
- from .tokenization_albert import AlbertTokenizer, AlbertTokenizerFast
from .tokenization_auto import TOKENIZER_MAPPING, AutoTokenizer
- from .tokenization_bart import BartTokenizer, BartTokenizerFast
+ from .tokenization_bart import BartTokenizer
- from .tokenization_bert import BasicTokenizer, BertTokenizer, BertTokenizerFast, WordpieceTokenizer
+ from .tokenization_bert import BasicTokenizer, BertTokenizer, WordpieceTokenizer
- from .tokenization_bert_generation import BertGenerationTokenizer
from .tokenization_bert_japanese import BertJapaneseTokenizer, CharacterTokenizer, MecabTokenizer
from .tokenization_bertweet import BertweetTokenizer
from .tokenization_blenderbot import BlenderbotSmallTokenizer, BlenderbotTokenizer
- from .tokenization_camembert import CamembertTokenizer, CamembertTokenizerFast
from .tokenization_ctrl import CTRLTokenizer
from .tokenization_deberta import DebertaTokenizer
- from .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFast
+ from .tokenization_distilbert import DistilBertTokenizer
from .tokenization_dpr import (
    DPRContextEncoderTokenizer,
-     DPRContextEncoderTokenizerFast,
    DPRQuestionEncoderTokenizer,
-     DPRQuestionEncoderTokenizerFast,
+     DPRReaderOutput,
    DPRReaderTokenizer,
-     DPRReaderTokenizerFast,
)
- from .tokenization_electra import ElectraTokenizer, ElectraTokenizerFast
+ from .tokenization_electra import ElectraTokenizer
from .tokenization_flaubert import FlaubertTokenizer
from .tokenization_fsmt import FSMTTokenizer
- from .tokenization_funnel import FunnelTokenizer, FunnelTokenizerFast
+ from .tokenization_funnel import FunnelTokenizer
- from .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast
+ from .tokenization_gpt2 import GPT2Tokenizer
- from .tokenization_herbert import HerbertTokenizer, HerbertTokenizerFast
+ from .tokenization_herbert import HerbertTokenizer
- from .tokenization_layoutlm import LayoutLMTokenizer, LayoutLMTokenizerFast
+ from .tokenization_layoutlm import LayoutLMTokenizer
- from .tokenization_longformer import LongformerTokenizer, LongformerTokenizerFast
+ from .tokenization_longformer import LongformerTokenizer
- from .tokenization_lxmert import LxmertTokenizer, LxmertTokenizerFast
+ from .tokenization_lxmert import LxmertTokenizer
- from .tokenization_mbart import MBartTokenizer, MBartTokenizerFast
- from .tokenization_mobilebert import MobileBertTokenizer, MobileBertTokenizerFast
- from .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast
- from .tokenization_pegasus import PegasusTokenizer, PegasusTokenizerFast
+ from .tokenization_mobilebert import MobileBertTokenizer
+ from .tokenization_openai import OpenAIGPTTokenizer
from .tokenization_phobert import PhobertTokenizer
from .tokenization_rag import RagTokenizer
- from .tokenization_reformer import ReformerTokenizer, ReformerTokenizerFast
- from .tokenization_retribert import RetriBertTokenizer, RetriBertTokenizerFast
- from .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast
- from .tokenization_squeezebert import SqueezeBertTokenizer, SqueezeBertTokenizerFast
- from .tokenization_t5 import T5Tokenizer, T5TokenizerFast
+ from .tokenization_retribert import RetriBertTokenizer
+ from .tokenization_roberta import RobertaTokenizer
+ from .tokenization_squeezebert import SqueezeBertTokenizer
from .tokenization_transfo_xl import TransfoXLCorpus, TransfoXLTokenizer
from .tokenization_utils import PreTrainedTokenizer
from .tokenization_utils_base import (
+     AddedToken,
    BatchEncoding,
    CharSpan,
    PreTrainedTokenizerBase,
@@ -202,10 +197,59 @@ from .tokenization_utils_base import (
    TensorType,
    TokenSpan,
)
- from .tokenization_utils_fast import PreTrainedTokenizerFast
from .tokenization_xlm import XLMTokenizer
- from .tokenization_xlm_roberta import XLMRobertaTokenizer, XLMRobertaTokenizerFast
- from .tokenization_xlnet import SPIECE_UNDERLINE, XLNetTokenizer, XLNetTokenizerFast
+ if is_sentencepiece_available():
+     from .tokenization_albert import AlbertTokenizer
+     from .tokenization_bert_generation import BertGenerationTokenizer
+     from .tokenization_camembert import CamembertTokenizer
+     from .tokenization_marian import MarianTokenizer
+     from .tokenization_mbart import MBartTokenizer
+     from .tokenization_pegasus import PegasusTokenizer
+     from .tokenization_reformer import ReformerTokenizer
+     from .tokenization_t5 import T5Tokenizer
+     from .tokenization_xlm_roberta import XLMRobertaTokenizer
+     from .tokenization_xlnet import XLNetTokenizer
+ else:
+     from .utils.dummy_sentencepiece_objects import *
+ if is_tokenizers_available():
+     from .tokenization_albert_fast import AlbertTokenizerFast
+     from .tokenization_bart_fast import BartTokenizerFast
+     from .tokenization_bert_fast import BertTokenizerFast
+     from .tokenization_camembert_fast import CamembertTokenizerFast
+     from .tokenization_distilbert_fast import DistilBertTokenizerFast
+     from .tokenization_dpr_fast import (
+         DPRContextEncoderTokenizerFast,
+         DPRQuestionEncoderTokenizerFast,
+         DPRReaderTokenizerFast,
+     )
+     from .tokenization_electra_fast import ElectraTokenizerFast
+     from .tokenization_funnel_fast import FunnelTokenizerFast
+     from .tokenization_gpt2_fast import GPT2TokenizerFast
+     from .tokenization_herbert_fast import HerbertTokenizerFast
+     from .tokenization_layoutlm_fast import LayoutLMTokenizerFast
+     from .tokenization_longformer_fast import LongformerTokenizerFast
+     from .tokenization_lxmert_fast import LxmertTokenizerFast
+     from .tokenization_mbart_fast import MBartTokenizerFast
+     from .tokenization_mobilebert_fast import MobileBertTokenizerFast
+     from .tokenization_openai_fast import OpenAIGPTTokenizerFast
+     from .tokenization_pegasus_fast import PegasusTokenizerFast
+     from .tokenization_reformer_fast import ReformerTokenizerFast
+     from .tokenization_retribert_fast import RetriBertTokenizerFast
+     from .tokenization_roberta_fast import RobertaTokenizerFast
+     from .tokenization_squeezebert_fast import SqueezeBertTokenizerFast
+     from .tokenization_t5_fast import T5TokenizerFast
+     from .tokenization_utils_fast import PreTrainedTokenizerFast
+     from .tokenization_xlm_roberta_fast import XLMRobertaTokenizerFast
+     from .tokenization_xlnet_fast import XLNetTokenizerFast
+     if is_sentencepiece_available():
+         from .convert_slow_tokenizer import SLOW_TO_FAST_CONVERTERS, convert_slow_tokenizer
+ else:
+     from .utils.dummy_tokenizers_objects import *
# Trainer
from .trainer_callback import (
@@ -539,7 +583,6 @@ if is_torch_available():
    get_linear_schedule_with_warmup,
    get_polynomial_decay_schedule_with_warmup,
)
- from .tokenization_marian import MarianTokenizer
# Trainer
from .trainer import Trainer
......
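A small usage sketch of the new import layout (my own illustration, not part of the diff): the slow SentencePiece-backed and fast Rust-backed tokenizers now sit behind availability checks, and user code can apply the same guards.

```python
from transformers import is_sentencepiece_available, is_tokenizers_available

# The real classes are only usable when the corresponding backend is installed;
# otherwise a dummy placeholder is exported that raises on use.
if is_sentencepiece_available():
    from transformers import AlbertTokenizer        # slow, SentencePiece-backed
if is_tokenizers_available():
    from transformers import AlbertTokenizerFast    # fast, backed by huggingface/tokenizers

print("sentencepiece:", is_sentencepiece_available(), "| tokenizers:", is_tokenizers_available())
```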
@@ -266,7 +266,7 @@ class AutoConfig:
our S3, e.g., ``dbmdz/bert-base-german-cased``.
- A path to a `directory` containing a configuration file saved using the
:meth:`~transformers.PretrainedConfig.save_pretrained` method, or the
- :meth:`~transformers.PretrainedModel.save_pretrained` method, e.g., ``./my_model_directory/``.
+ :meth:`~transformers.PreTrainedModel.save_pretrained` method, e.g., ``./my_model_directory/``.
- A path or url to a saved configuration JSON `file`, e.g.,
``./my_model_directory/configuration.json``.
cache_dir (:obj:`str`, `optional`):
......
@@ -43,6 +43,9 @@ class PretrainedConfig(object):
recreate the correct object in :class:`~transformers.AutoConfig`.
Args:
+     name_or_path (:obj:`str`, `optional`, defaults to :obj:`""`):
+         Store the string that was passed to :func:`~transformers.PreTrainedModel.from_pretrained` or :func:`~transformers.TFPreTrainedModel.from_pretrained`
+         as ``pretrained_model_name_or_path`` if the configuration was created with such a method.
    output_hidden_states (:obj:`bool`, `optional`, defaults to :obj:`False`):
        Whether or not the model should return all hidden-states.
    output_attentions (:obj:`bool`, `optional`, defaults to :obj:`False`):
@@ -206,6 +209,9 @@ class PretrainedConfig(object):
# TPU arguments
self.xla_device = kwargs.pop("xla_device", None)
+ # Name or path to the pretrained checkpoint
+ self._name_or_path = str(kwargs.pop("name_or_path", ""))
# Additional attributes without default values
for key, value in kwargs.items():
    try:
@@ -214,6 +220,14 @@ class PretrainedConfig(object):
logger.error("Can't set {} with value {} for {}".format(key, value, self))
raise err
+ @property
+ def name_or_path(self) -> str:
+     return self._name_or_path
+ @name_or_path.setter
+ def name_or_path(self, value):
+     self._name_or_path = str(value)  # Make sure that name_or_path is a string (for JSON encoding)
@property
def use_return_dict(self) -> bool:
"""
......
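A quick sketch of the `name_or_path` attribute introduced above (illustrative only): it can be passed as a keyword argument, the setter coerces it to `str`, and `from_pretrained` fills it in automatically (see the modeling changes further down).

```python
from transformers import PretrainedConfig

config = PretrainedConfig(name_or_path="./my_model_directory")
print(config.name_or_path)        # "./my_model_directory"

config.name_or_path = 42          # the setter stringifies the value for JSON serialization
print(repr(config.name_or_path))  # "'42'"
```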
@@ -20,13 +20,14 @@
from typing import Dict, List, Tuple
- from sentencepiece import SentencePieceProcessor
from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers, processors
from tokenizers.models import BPE, Unigram, WordPiece
# from transformers.tokenization_openai import OpenAIGPTTokenizer
from transformers.utils import sentencepiece_model_pb2 as model
+ from .file_utils import requires_sentencepiece
class SentencePieceExtractor:
"""
@@ -35,7 +36,9 @@ class SentencePieceExtractor:
"""
    def __init__(self, model: str):
-         # Get SentencePiece
+         requires_sentencepiece(self)
+         from sentencepiece import SentencePieceProcessor
        self.sp = SentencePieceProcessor()
        self.sp.Load(model)
@@ -568,11 +571,10 @@ class T5Converter(SpmConverter):
)
- CONVERTERS = {
+ SLOW_TO_FAST_CONVERTERS = {
    "AlbertTokenizer": AlbertConverter,
-     "BertTokenizer": BertConverter,
-     "BertGenerationTokenizer": BertGenerationConverter,
    "BartTokenizer": RobertaConverter,
+     "BertTokenizer": BertConverter,
    "CamembertTokenizer": CamembertConverter,
    "DistilBertTokenizer": BertConverter,
    "DPRReaderTokenizer": BertConverter,
@@ -582,12 +584,17 @@ CONVERTERS = {
    "FunnelTokenizer": FunnelConverter,
    "GPT2Tokenizer": GPT2Converter,
    "HerbertTokenizer": HerbertConverter,
+     "LayoutLMTokenizer": BertConverter,
+     "LongformerTokenizer": RobertaConverter,
    "LxmertTokenizer": BertConverter,
    "MBartTokenizer": MBartConverter,
+     "MobileBertTokenizer": BertConverter,
    "OpenAIGPTTokenizer": OpenAIGPTConverter,
    "PegasusTokenizer": PegasusConverter,
    "ReformerTokenizer": ReformerConverter,
+     "RetriBertTokenizer": BertConverter,
    "RobertaTokenizer": RobertaConverter,
+     "SqueezeBertTokenizer": BertConverter,
    "T5Tokenizer": T5Converter,
    "XLMRobertaTokenizer": XLMRobertaConverter,
    "XLNetTokenizer": XLNetConverter,
@@ -595,5 +602,26 @@ CONVERTERS = {
def convert_slow_tokenizer(transformer_tokenizer) -> Tokenizer:
-     converter_class = CONVERTERS[transformer_tokenizer.__class__.__name__]
+     """Utilities to convert a slow tokenizer instance in a fast tokenizer instance.
+     Args:
+         transformer_tokenizer (:class:`~transformers.tokenization_utils_base.PreTrainedTokenizer`):
+             Instance of a slow tokenizer to convert in the backend tokenizer for
+             :class:`~transformers.tokenization_utils_base.PreTrainedTokenizerFast`.
+     Return:
+         A instance of :class:`~tokenizers.Tokenizer` to be used as the backend tokenizer of a
+         :class:`~transformers.tokenization_utils_base.PreTrainedTokenizerFast`
+     """
+     tokenizer_class_name = transformer_tokenizer.__class__.__name__
+     if tokenizer_class_name not in SLOW_TO_FAST_CONVERTERS:
+         raise ValueError(
+             f"An instance of tokenizer class {tokenizer_class_name} cannot be converted in a Fast tokenizer instance. "
+             f"No converter was found. Currently available slow->fast convertors: {list(SLOW_TO_FAST_CONVERTERS.keys())}"
+         )
+     converter_class = SLOW_TO_FAST_CONVERTERS[tokenizer_class_name]
    return converter_class(transformer_tokenizer).converted()
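A hedged usage sketch for the renamed `SLOW_TO_FAST_CONVERTERS` mapping and `convert_slow_tokenizer()` helper above (assumes the `tokenizers` library is installed; the output path is my own choice):

```python
from transformers import BertTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

slow = BertTokenizer.from_pretrained("bert-base-uncased")
backend = convert_slow_tokenizer(slow)   # returns a tokenizers.Tokenizer instance
backend.save("bert-base-uncased-tokenizer.json")
```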
# coding=utf-8
# Copyright 2018 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Convert slow tokenizers checkpoints in fast (serialization format of the `tokenizers` library) """
import argparse
import os
import transformers
from transformers.convert_slow_tokenizer import SLOW_TO_FAST_CONVERTERS
from transformers.utils import logging
logging.set_verbosity_info()
logger = logging.get_logger(__name__)
TOKENIZER_CLASSES = {name: getattr(transformers, name + "Fast") for name in SLOW_TO_FAST_CONVERTERS}
def convert_slow_checkpoint_to_fast(tokenizer_name, checkpoint_name, dump_path, force_download):
if tokenizer_name is not None and tokenizer_name not in TOKENIZER_CLASSES:
raise ValueError("Unrecognized tokenizer name, should be one of {}.".format(list(TOKENIZER_CLASSES.keys())))
if tokenizer_name is None:
tokenizer_names = TOKENIZER_CLASSES
else:
tokenizer_names = {tokenizer_name: getattr(transformers, tokenizer_name + "Fast")}
logger.info(f"Loading tokenizer classes: {tokenizer_names}")
for tokenizer_name in tokenizer_names:
tokenizer_class = TOKENIZER_CLASSES[tokenizer_name]
add_prefix = True
if checkpoint_name is None:
checkpoint_names = list(tokenizer_class.max_model_input_sizes.keys())
else:
checkpoint_names = [checkpoint_name]
logger.info(f"For tokenizer {tokenizer_class.__class__.__name__} loading checkpoints: {checkpoint_names}")
for checkpoint in checkpoint_names:
logger.info(f"Loading {tokenizer_class.__class__.__name__} {checkpoint}")
# Load tokenizer
tokenizer = tokenizer_class.from_pretrained(checkpoint, force_download=force_download)
# Save fast tokenizer
logger.info(
"Save fast tokenizer to {} with prefix {} add_prefix {}".format(dump_path, checkpoint, add_prefix)
)
# For organization names we create sub-directories
if "/" in checkpoint:
checkpoint_directory, checkpoint_prefix_name = checkpoint.split("/")
dump_path_full = os.path.join(dump_path, checkpoint_directory)
elif add_prefix:
checkpoint_prefix_name = checkpoint
dump_path_full = dump_path
else:
checkpoint_prefix_name = None
dump_path_full = dump_path
logger.info(
"=> {} with prefix {}, add_prefix {}".format(dump_path_full, checkpoint_prefix_name, add_prefix)
)
file_path = list(tokenizer.pretrained_vocab_files_map.values())[0][checkpoint]
next_char = file_path.split(checkpoint)[-1][0]
if next_char == "/":
dump_path_full = os.path.join(dump_path_full, checkpoint_prefix_name)
checkpoint_prefix_name = None
logger.info(
"=> {} with prefix {}, add_prefix {}".format(dump_path_full, checkpoint_prefix_name, add_prefix)
)
file_names = tokenizer.save_pretrained(
dump_path_full, legacy_format=False, filename_prefix=checkpoint_prefix_name
)
logger.info("=> File names {}".format(file_names))
for file_name in file_names:
if not file_name.endswith("tokenizer.json"):
os.remove(file_name)
logger.info("=> removing {}".format(file_name))
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--dump_path", default=None, type=str, required=True, help="Path to output generated fast tokenizer files."
)
parser.add_argument(
"--tokenizer_name",
default=None,
type=str,
help="Optional tokenizer type selected in the list of {}. If not given, will download and convert all the checkpoints from AWS.".format(
list(TOKENIZER_CLASSES.keys())
),
)
parser.add_argument(
"--checkpoint_name",
default=None,
type=str,
help="Optional checkpoint name. If not given, will download and convert the canonical checkpoints from AWS.",
)
parser.add_argument(
"--force_download",
action="store_true",
help="Re-dowload checkpoints.",
)
args = parser.parse_args()
convert_slow_checkpoint_to_fast(args.tokenizer_name, args.checkpoint_name, args.dump_path, args.force_download)
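An illustrative call of the script's entry point shown above. The module path is an assumption on my part (the file name is not visible in this extract), so treat this purely as a sketch:

```python
# Hypothetical import path for the new conversion script above.
from transformers.convert_slow_tokenizers_checkpoints_to_fast import convert_slow_checkpoint_to_fast

convert_slow_checkpoint_to_fast(
    tokenizer_name="BertTokenizer",        # key into TOKENIZER_CLASSES
    checkpoint_name="bert-base-uncased",   # a canonical checkpoint for that tokenizer
    dump_path="./converted_fast_tokenizers",
    force_download=False,
)
```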
@@ -4,9 +4,7 @@ from typing import Any, Callable, Dict, List, NewType, Optional, Tuple, Union
import torch
from torch.nn.utils.rnn import pad_sequence
- from ..tokenization_utils import PreTrainedTokenizer
- from ..tokenization_utils_base import BatchEncoding, PaddingStrategy
- from ..tokenization_utils_fast import PreTrainedTokenizerFast
+ from ..tokenization_utils_base import BatchEncoding, PaddingStrategy, PreTrainedTokenizerBase
InputDataClass = NewType("InputDataClass", Any)
@@ -94,7 +92,7 @@ class DataCollatorWithPadding:
>= 7.5 (Volta).
"""
- tokenizer: Union[PreTrainedTokenizer, PreTrainedTokenizerFast]
+ tokenizer: PreTrainedTokenizerBase
padding: Union[bool, str, PaddingStrategy] = True
max_length: Optional[int] = None
pad_to_multiple_of: Optional[int] = None
@@ -124,7 +122,7 @@ class DataCollatorForLanguageModeling:
- preprocesses batches for masked language modeling
"""
- tokenizer: PreTrainedTokenizer
+ tokenizer: PreTrainedTokenizerBase
mlm: bool = True
mlm_probability: float = 0.15
@@ -274,7 +272,7 @@ class DataCollatorForPermutationLanguageModeling:
- preprocesses batches for permutation language modeling with procedures specific to XLNet
"""
- tokenizer: PreTrainedTokenizer
+ tokenizer: PreTrainedTokenizerBase
plm_probability: float = 1 / 6
max_span_length: int = 5  # maximum length of a span of masked tokens
@@ -406,7 +404,7 @@ class DataCollatorForNextSentencePrediction:
- preprocesses batches for masked language modeling
"""
- tokenizer: PreTrainedTokenizer
+ tokenizer: PreTrainedTokenizerBase
mlm: bool = True
block_size: int = 512
short_seq_probability: float = 0.1
......
@@ -9,10 +9,7 @@ from torch.utils.data.dataset import Dataset
from filelock import FileLock
- from ...tokenization_bart import BartTokenizer, BartTokenizerFast
- from ...tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast
- from ...tokenization_utils import PreTrainedTokenizer
- from ...tokenization_xlm_roberta import XLMRobertaTokenizer
+ from ...tokenization_utils_base import PreTrainedTokenizerBase
from ...utils import logging
from ..processors.glue import glue_convert_examples_to_features, glue_output_modes, glue_processors
from ..processors.utils import InputFeatures
@@ -69,7 +66,7 @@ class GlueDataset(Dataset):
def __init__(
    self,
    args: GlueDataTrainingArguments,
-     tokenizer: PreTrainedTokenizer,
+     tokenizer: PreTrainedTokenizerBase,
    limit_length: Optional[int] = None,
    mode: Union[str, Split] = Split.train,
    cache_dir: Optional[str] = None,
@@ -93,12 +90,12 @@ class GlueDataset(Dataset):
    ),
)
label_list = self.processor.get_labels()
- if args.task_name in ["mnli", "mnli-mm"] and tokenizer.__class__ in (
+ if args.task_name in ["mnli", "mnli-mm"] and tokenizer.__class__.__name__ in (
-     RobertaTokenizer,
+     "RobertaTokenizer",
-     RobertaTokenizerFast,
+     "RobertaTokenizerFast",
-     XLMRobertaTokenizer,
+     "XLMRobertaTokenizer",
-     BartTokenizer,
+     "BartTokenizer",
-     BartTokenizerFast,
+     "BartTokenizerFast",
):
# HACK(label indices are swapped in RoBERTa pretrained model)
label_list[1], label_list[2] = label_list[2], label_list[1]
......
@@ -157,6 +157,24 @@ except (AttributeError, ImportError, KeyError):
_in_notebook = False
+ try:
+     import sentencepiece  # noqa: F401
+     _sentencepiece_available = True
+ except ImportError:
+     _sentencepiece_available = False
+ try:
+     import tokenizers  # noqa: F401
+     _tokenizers_available = True
+ except ImportError:
+     _tokenizers_available = False
default_cache_path = os.path.join(torch_cache_home, "transformers")
@@ -170,6 +188,8 @@ TF_WEIGHTS_NAME = "model.ckpt"
CONFIG_NAME = "config.json"
MODEL_CARD_NAME = "modelcard.json"
+ SENTENCEPIECE_UNDERLINE = "▁"
+ SPIECE_UNDERLINE = SENTENCEPIECE_UNDERLINE  # Kept for backward compatibility
MULTIPLE_CHOICE_DUMMY_INPUTS = [
[[0, 1, 0, 1], [1, 0, 0, 1]]
@@ -217,6 +237,18 @@ def is_faiss_available():
return _faiss_available
+ def is_sklearn_available():
+     return _has_sklearn
+ def is_sentencepiece_available():
+     return _sentencepiece_available
+ def is_tokenizers_available():
+     return _tokenizers_available
def is_in_notebook():
return _in_notebook
@@ -234,10 +266,6 @@ def torch_only_method(fn):
return wrapper
- def is_sklearn_available():
-     return _has_sklearn
DATASETS_IMPORT_ERROR = """
{0} requires the 🤗 Datasets library but it was not found in your enviromnent. You can install it with:
```
@@ -255,6 +283,25 @@ that python file if that's the case.
"""
+ TOKENIZERS_IMPORT_ERROR = """
+ {0} requires the 🤗 Tokenizers library but it was not found in your enviromnent. You can install it with:
+ ```
+ pip install tokenizers
+ ```
+ In a notebook or a colab, you can install it by executing a cell with
+ ```
+ !pip install tokenizers
+ ```
+ """
+ SENTENCEPIECE_IMPORT_ERROR = """
+ {0} requires the SentencePiece library but it was not found in your enviromnent. Checkout the instructions on the
+ installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones
+ that match your enviromnent.
+ """
FAISS_IMPORT_ERROR = """
{0} requires the faiss library but it was not found in your enviromnent. Checkout the instructions on the
installation page of its repo: https://github.com/facebookresearch/faiss/blob/master/INSTALL.md and follow the ones
@@ -316,6 +363,18 @@ def requires_tf(obj):
raise ImportError(TENSORFLOW_IMPORT_ERROR.format(name))
+ def requires_tokenizers(obj):
+     name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__
+     if not is_tokenizers_available():
+         raise ImportError(TOKENIZERS_IMPORT_ERROR.format(name))
+ def requires_sentencepiece(obj):
+     name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__
+     if not is_sentencepiece_available():
+         raise ImportError(SENTENCEPIECE_IMPORT_ERROR.format(name))
def add_start_docstrings(*docstr):
def docstring_decorator(fn):
fn.__doc__ = "".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else "")
......
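A minimal sketch (my own, not from the diff) of how the new availability checks and `requires_*` guards are meant to be used by downstream code:

```python
from transformers.file_utils import is_sentencepiece_available, requires_sentencepiece


class MySpmBackedTokenizer:
    def __init__(self):
        # Raises ImportError with the SENTENCEPIECE_IMPORT_ERROR message when the
        # sentencepiece package is missing, instead of failing at import time.
        requires_sentencepiece(self)
        import sentencepiece as spm

        self.sp_model = spm.SentencePieceProcessor()


if is_sentencepiece_available():
    tok = MySpmBackedTokenizer()
```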
@@ -346,8 +346,9 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin, TFGenerationMixin):
        self.__class__.__name__, self.__class__.__name__
    )
)
- # Save config in model
+ # Save config and origin of the pretrained weights if given in model
self.config = config
+ self.name_or_path = config.name_or_path
def get_input_embeddings(self) -> tf.keras.layers.Layer:
"""
@@ -690,6 +691,8 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin, TFGenerationMixin):
else:
    resolved_archive_file = None
+ config.name_or_path = pretrained_model_name_or_path
# Instantiate model.
model = cls(config, *model_args, **model_kwargs)
......
@@ -432,8 +432,9 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin):
        self.__class__.__name__, self.__class__.__name__
    )
)
- # Save config in model
+ # Save config and origin of the pretrained weights if given in model
self.config = config
+ self.name_or_path = config.name_or_path
@property
def base_model(self) -> nn.Module:
@@ -933,6 +934,8 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin):
else:
    resolved_archive_file = None
+ config.name_or_path = pretrained_model_name_or_path
# Instantiate model.
model = cls(config, *model_args, **model_kwargs)
......
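Putting the configuration and modeling changes together, a hedged sketch of what `name_or_path` now records (requires downloading the checkpoint):

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
# from_pretrained stores the identifier on the config, and __init__ mirrors it on the model.
print(model.config.name_or_path)  # "bert-base-uncased"
print(model.name_or_path)         # "bert-base-uncased"
```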
@@ -10,7 +10,15 @@ from distutils.util import strtobool
from io import StringIO
from pathlib import Path
- from .file_utils import _datasets_available, _faiss_available, _tf_available, _torch_available, _torch_tpu_available
+ from .file_utils import (
+     _datasets_available,
+     _faiss_available,
+     _sentencepiece_available,
+     _tf_available,
+     _tokenizers_available,
+     _torch_available,
+     _torch_tpu_available,
+ )
SMALL_MODEL_IDENTIFIER = "julien-c/bert-xsmall-dummy"
@@ -107,6 +115,32 @@ def require_tf(test_case):
return test_case
+ def require_sentencepiece(test_case):
+     """
+     Decorator marking a test that requires SentencePiece.
+     These tests are skipped when SentencePiece isn't installed.
+     """
+     if not _sentencepiece_available:
+         return unittest.skip("test requires SentencePiece")(test_case)
+     else:
+         return test_case
+ def require_tokenizers(test_case):
+     """
+     Decorator marking a test that requires 🤗 Tokenizers.
+     These tests are skipped when 🤗 Tokenizers isn't installed.
+     """
+     if not _tokenizers_available:
+         return unittest.skip("test requires tokenizers")(test_case)
+     else:
+         return test_case
def require_multigpu(test_case):
"""
Decorator marking a test that requires a multi-GPU setup (in PyTorch).
......
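A sketch of how the new test decorators are applied (mirrors the pattern used in the test suite; the test class itself is made up for illustration):

```python
import unittest

from transformers.testing_utils import require_sentencepiece, require_tokenizers


@require_sentencepiece
@require_tokenizers
class DummyAlbertTokenizationTest(unittest.TestCase):
    """Skipped automatically when either backend is missing."""

    def test_placeholder(self):
        self.assertTrue(True)
```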
@@ -18,10 +18,11 @@
import os
import unicodedata
from shutil import copyfile
- from typing import List, Optional
+ from typing import List, Optional, Tuple
+ import sentencepiece as spm
from .tokenization_utils import PreTrainedTokenizer
- from .tokenization_utils_fast import PreTrainedTokenizerFast
from .utils import logging
@@ -138,15 +139,6 @@ class AlbertTokenizer(PreTrainedTokenizer):
    **kwargs,
)
- try:
-     import sentencepiece as spm
- except ImportError:
-     logger.warning(
-         "You need to install SentencePiece to use AlbertTokenizer: https://github.com/google/sentencepiece"
-         "pip install sentencepiece"
-     )
-     raise
self.do_lower_case = do_lower_case
self.remove_space = remove_space
self.keep_accents = keep_accents
@@ -171,14 +163,6 @@ class AlbertTokenizer(PreTrainedTokenizer):
def __setstate__(self, d):
    self.__dict__ = d
- try:
-     import sentencepiece as spm
- except ImportError:
-     logger.warning(
-         "You need to install SentencePiece to use AlbertTokenizer: https://github.com/google/sentencepiece"
-         "pip install sentencepiece"
-     )
-     raise
self.sp_model = spm.SentencePieceProcessor()
self.sp_model.Load(self.vocab_file)
@@ -321,225 +305,14 @@ class AlbertTokenizer(PreTrainedTokenizer):
    return len(cls + token_ids_0 + sep) * [0]
return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]
- def save_vocabulary(self, save_directory):
+ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
"""
Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.
Args:
save_directory (:obj:`str`):
The directory in which to save the vocabulary.
Returns:
:obj:`Tuple(str)`: Paths to the files saved.
"""
if not os.path.isdir(save_directory):
    logger.error("Vocabulary path ({}) should be a directory".format(save_directory))
    return
- out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES["vocab_file"])
+ out_vocab_file = os.path.join(
+     save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
copyfile(self.vocab_file, out_vocab_file)
return (out_vocab_file,)
class AlbertTokenizerFast(PreTrainedTokenizerFast):
"""
Construct a "fast" ALBERT tokenizer (backed by HuggingFace's `tokenizers` library). Based on
`SentencePiece <https://github.com/google/sentencepiece>`__.
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the main
methods. Users should refer to this superclass for more information regarding those methods.
Args:
vocab_file (:obj:`str`):
`SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a `.spm` extension) that
contains the vocabulary necessary to instantiate a tokenizer.
do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to lowercase the input when tokenizing.
remove_space (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to strip the text when tokenizing (removing excess spaces before and after the string).
keep_accents (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to keep accents when tokenizing.
bos_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
.. note::
When building a sequence using special tokens, this is not the token that is used for the beginning
of sequence. The token used is the :obj:`cls_token`.
eos_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
The end of sequence token.
.. note::
When building a sequence using special tokens, this is not the token that is used for the end
of sequence. The token used is the :obj:`sep_token`.
unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences
for sequence classification or for a text and a question for question answering.
It is also used as the last token of a sequence built with special tokens.
pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
The classifier token which is used when doing sequence classification (classification of the whole
sequence instead of per-token classification). It is the first token of the sequence when built with
special tokens.
mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`):
The token used for masking values. This is the token used when training this model with masked language
modeling. This is the token which the model will try to predict.
Attributes:
sp_model (:obj:`SentencePieceProcessor`):
The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).
"""
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
slow_tokenizer_class = AlbertTokenizer
def __init__(
self,
vocab_file,
do_lower_case=True,
remove_space=True,
keep_accents=False,
bos_token="[CLS]",
eos_token="[SEP]",
unk_token="<unk>",
sep_token="[SEP]",
pad_token="<pad>",
cls_token="[CLS]",
mask_token="[MASK]",
**kwargs
):
super().__init__(
vocab_file,
do_lower_case=do_lower_case,
remove_space=remove_space,
keep_accents=keep_accents,
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
sep_token=sep_token,
pad_token=pad_token,
cls_token=cls_token,
mask_token=mask_token,
**kwargs,
) )
self.do_lower_case = do_lower_case
self.remove_space = remove_space
self.keep_accents = keep_accents
self.vocab_file = vocab_file
def build_inputs_with_special_tokens(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Build model inputs from a sequence or a pair of sequence for sequence classification tasks
by concatenating and adding special tokens.
An ALBERT sequence has the following format:
- single sequence: ``[CLS] X [SEP]``
- pair of sequences: ``[CLS] A [SEP] B [SEP]``
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
if token_ids_1 is None:
return cls + token_ids_0 + sep
return cls + token_ids_0 + sep + token_ids_1 + sep
def get_special_tokens_mask(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
"""
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer ``prepare_for_model`` method.
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the token list is already formatted with special tokens for the model.
Returns:
:obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
if already_has_special_tokens:
if token_ids_1 is not None:
raise ValueError(
"You should not supply a second sequence if the provided sequence of "
"ids is already formatted with special tokens for the model."
)
return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))
if token_ids_1 is not None:
return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
return [1] + ([0] * len(token_ids_0)) + [1]
def create_token_type_ids_from_sequences(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Create a mask from the two sequences passed to be used in a sequence-pair classification task.
An ALBERT sequence pair mask has the following format:
::
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |
If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
sequence(s).
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
if token_ids_1 is None:
return len(cls + token_ids_0 + sep) * [0]
return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]
def save_vocabulary(self, save_directory):
"""
Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.
Args:
save_directory (:obj:`str`):
The directory in which to save the vocabulary.
Returns:
:obj:`Tuple(str)`: Paths to the files saved.
"""
if not os.path.isdir(save_directory):
logger.error("Vocabulary path ({}) should be a directory".format(save_directory))
return
out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES["vocab_file"])
if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
    copyfile(self.vocab_file, out_vocab_file)
......
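A short sketch of the new `filename_prefix` argument on `save_vocabulary` (assumes `sentencepiece` is installed; the paths are my own choices, and the target directory must already exist):

```python
import os

from transformers import AlbertTokenizer

tok = AlbertTokenizer.from_pretrained("albert-base-v2")
os.makedirs("./albert_vocab", exist_ok=True)
# The SentencePiece model is copied to "./albert_vocab/albert-base-v2-spiece.model".
paths = tok.save_vocabulary("./albert_vocab", filename_prefix="albert-base-v2")
print(paths)
```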
# coding=utf-8
# Copyright 2018 Google AI, Google Brain and the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Tokenization classes for ALBERT model."""
import os
from shutil import copyfile
from typing import List, Optional, Tuple
from .file_utils import is_sentencepiece_available
from .tokenization_utils_fast import PreTrainedTokenizerFast
from .utils import logging
if is_sentencepiece_available():
from .tokenization_albert import AlbertTokenizer
else:
AlbertTokenizer = None
logger = logging.get_logger(__name__)
VOCAB_FILES_NAMES = {"vocab_file": "spiece.model", "tokenizer_file": "tokenizer.json"}
PRETRAINED_VOCAB_FILES_MAP = {
"vocab_file": {
"albert-base-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v1-spiece.model",
"albert-large-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v1-spiece.model",
"albert-xlarge-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v1-spiece.model",
"albert-xxlarge-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v1-spiece.model",
"albert-base-v2": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v2-spiece.model",
"albert-large-v2": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v2-spiece.model",
"albert-xlarge-v2": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v2-spiece.model",
"albert-xxlarge-v2": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v2-spiece.model",
},
"tokenizer_file": {
"albert-base-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v1-tokenizer.json",
"albert-large-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v1-tokenizer.json",
"albert-xlarge-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v1-tokenizer.json",
"albert-xxlarge-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v1-tokenizer.json",
"albert-base-v2": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v2-tokenizer.json",
"albert-large-v2": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v2-tokenizer.json",
"albert-xlarge-v2": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v2-tokenizer.json",
"albert-xxlarge-v2": "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v2-tokenizer.json",
},
}
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
"albert-base-v1": 512,
"albert-large-v1": 512,
"albert-xlarge-v1": 512,
"albert-xxlarge-v1": 512,
"albert-base-v2": 512,
"albert-large-v2": 512,
"albert-xlarge-v2": 512,
"albert-xxlarge-v2": 512,
}
SPIECE_UNDERLINE = "▁"
class AlbertTokenizerFast(PreTrainedTokenizerFast):
"""
Construct a "fast" ALBERT tokenizer (backed by HuggingFace's `tokenizers` library). Based on
`SentencePiece <https://github.com/google/sentencepiece>`__.
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the main
methods. Users should refer to this superclass for more information regarding those methods.
Args:
vocab_file (:obj:`str`):
`SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a `.spm` extension) that
contains the vocabulary necessary to instantiate a tokenizer.
do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to lowercase the input when tokenizing.
remove_space (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to strip the text when tokenizing (removing excess spaces before and after the string).
keep_accents (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to keep accents when tokenizing.
bos_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
.. note::
When building a sequence using special tokens, this is not the token that is used for the beginning
of sequence. The token used is the :obj:`cls_token`.
eos_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
The end of sequence token.
.. note::
When building a sequence using special tokens, this is not the token that is used for the end
of sequence. The token used is the :obj:`sep_token`.
unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences
for sequence classification or for a text and a question for question answering.
It is also used as the last token of a sequence built with special tokens.
pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
The classifier token which is used when doing sequence classification (classification of the whole
sequence instead of per-token classification). It is the first token of the sequence when built with
special tokens.
mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`):
The token used for masking values. This is the token used when training this model with masked language
modeling. This is the token which the model will try to predict.
Attributes:
sp_model (:obj:`SentencePieceProcessor`):
The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).
"""
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
slow_tokenizer_class = AlbertTokenizer
def __init__(
self,
vocab_file,
tokenizer_file=None,
do_lower_case=True,
remove_space=True,
keep_accents=False,
bos_token="[CLS]",
eos_token="[SEP]",
unk_token="<unk>",
sep_token="[SEP]",
pad_token="<pad>",
cls_token="[CLS]",
mask_token="[MASK]",
**kwargs
):
super().__init__(
vocab_file,
tokenizer_file=tokenizer_file,
do_lower_case=do_lower_case,
remove_space=remove_space,
keep_accents=keep_accents,
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
sep_token=sep_token,
pad_token=pad_token,
cls_token=cls_token,
mask_token=mask_token,
**kwargs,
)
self.do_lower_case = do_lower_case
self.remove_space = remove_space
self.keep_accents = keep_accents
self.vocab_file = vocab_file
def build_inputs_with_special_tokens(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Build model inputs from a sequence or a pair of sequence for sequence classification tasks
by concatenating and adding special tokens.
An ALBERT sequence has the following format:
- single sequence: ``[CLS] X [SEP]``
- pair of sequences: ``[CLS] A [SEP] B [SEP]``
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
if token_ids_1 is None:
return cls + token_ids_0 + sep
return cls + token_ids_0 + sep + token_ids_1 + sep
def get_special_tokens_mask(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
"""
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer ``prepare_for_model`` method.
Args:
token_ids_0 (:obj:`List[int]`):
List of ids.
token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
Set to :obj:`True` if the token list is already formatted with special tokens for the model.
Returns:
:obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
if already_has_special_tokens:
if token_ids_1 is not None:
raise ValueError(
"You should not supply a second sequence if the provided sequence of "
"ids is already formatted with special tokens for the model."
)
return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))
if token_ids_1 is not None:
return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
return [1] + ([0] * len(token_ids_0)) + [1]
def create_token_type_ids_from_sequences(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
An ALBERT sequence pair mask has the following format:
::
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |
If :obj:`token_ids_1` is :obj:`None`, only the first portion of the mask (0s) is returned.
Args:
token_ids_0 (:obj:`List[int]`):
List of ids.
token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
sequence(s).
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
if token_ids_1 is None:
return len(cls + token_ids_0 + sep) * [0]
return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]
def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
if not os.path.isdir(save_directory):
logger.error("Vocabulary path ({}) should be a directory".format(save_directory))
return
out_vocab_file = os.path.join(
save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
)
if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
copyfile(self.vocab_file, out_vocab_file)
return (out_vocab_file,)
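As a quick sanity check of the special-token helpers defined above, a minimal sketch (not part of this commit, assuming the `tokenizers` backend is installed and the `albert-base-v2` files can be fetched):

# Hedged example exercising AlbertTokenizerFast as defined above.
from transformers import AlbertTokenizerFast

tokenizer = AlbertTokenizerFast.from_pretrained("albert-base-v2")

ids_a = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Hello world"))
ids_b = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("How are you?"))

# [CLS] A [SEP] B [SEP]
pair_ids = tokenizer.build_inputs_with_special_tokens(ids_a, ids_b)
# 0s over "[CLS] A [SEP]", 1s over "B [SEP]"
token_type_ids = tokenizer.create_token_type_ids_from_sequences(ids_a, ids_b)
# 1 for every [CLS]/[SEP] position, 0 for regular tokens
special_mask = tokenizer.get_special_tokens_mask(pair_ids, already_has_special_tokens=True)

assert len(pair_ids) == len(token_type_ids) == len(special_mask)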
...
@@ -56,45 +56,108 @@ from .configuration_auto import (
replace_list_option_in_docstrings,
)
from .configuration_utils import PretrainedConfig
-from .tokenization_albert import AlbertTokenizer, AlbertTokenizerFast
-from .tokenization_bart import BartTokenizer, BartTokenizerFast
-from .tokenization_bert import BertTokenizer, BertTokenizerFast
-from .tokenization_bert_generation import BertGenerationTokenizer
from .file_utils import is_sentencepiece_available, is_tokenizers_available
from .tokenization_bart import BartTokenizer
from .tokenization_bert import BertTokenizer
from .tokenization_bert_japanese import BertJapaneseTokenizer
from .tokenization_bertweet import BertweetTokenizer
from .tokenization_blenderbot import BlenderbotSmallTokenizer
-from .tokenization_camembert import CamembertTokenizer, CamembertTokenizerFast
from .tokenization_ctrl import CTRLTokenizer
from .tokenization_deberta import DebertaTokenizer
-from .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFast
-from .tokenization_dpr import DPRQuestionEncoderTokenizer, DPRQuestionEncoderTokenizerFast
-from .tokenization_electra import ElectraTokenizer, ElectraTokenizerFast
from .tokenization_distilbert import DistilBertTokenizer
from .tokenization_dpr import DPRQuestionEncoderTokenizer
from .tokenization_electra import ElectraTokenizer
from .tokenization_flaubert import FlaubertTokenizer
from .tokenization_fsmt import FSMTTokenizer
-from .tokenization_funnel import FunnelTokenizer, FunnelTokenizerFast
-from .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast
-from .tokenization_layoutlm import LayoutLMTokenizer, LayoutLMTokenizerFast
-from .tokenization_longformer import LongformerTokenizer, LongformerTokenizerFast
-from .tokenization_lxmert import LxmertTokenizer, LxmertTokenizerFast
-from .tokenization_marian import MarianTokenizer
-from .tokenization_mbart import MBartTokenizer, MBartTokenizerFast
-from .tokenization_mobilebert import MobileBertTokenizer, MobileBertTokenizerFast
-from .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast
-from .tokenization_pegasus import PegasusTokenizer, PegasusTokenizerFast
from .tokenization_funnel import FunnelTokenizer
from .tokenization_gpt2 import GPT2Tokenizer
from .tokenization_layoutlm import LayoutLMTokenizer
from .tokenization_longformer import LongformerTokenizer
from .tokenization_lxmert import LxmertTokenizer
from .tokenization_mobilebert import MobileBertTokenizer
from .tokenization_openai import OpenAIGPTTokenizer
from .tokenization_phobert import PhobertTokenizer
from .tokenization_rag import RagTokenizer
-from .tokenization_reformer import ReformerTokenizer, ReformerTokenizerFast
-from .tokenization_retribert import RetriBertTokenizer, RetriBertTokenizerFast
-from .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast
-from .tokenization_squeezebert import SqueezeBertTokenizer, SqueezeBertTokenizerFast
-from .tokenization_t5 import T5Tokenizer, T5TokenizerFast
from .tokenization_retribert import RetriBertTokenizer
from .tokenization_roberta import RobertaTokenizer
from .tokenization_squeezebert import SqueezeBertTokenizer
from .tokenization_transfo_xl import TransfoXLTokenizer
from .tokenization_xlm import XLMTokenizer
-from .tokenization_xlm_roberta import XLMRobertaTokenizer, XLMRobertaTokenizerFast
-from .tokenization_xlnet import XLNetTokenizer, XLNetTokenizerFast
from .utils import logging
if is_sentencepiece_available():
from .tokenization_albert import AlbertTokenizer
from .tokenization_bert_generation import BertGenerationTokenizer
from .tokenization_camembert import CamembertTokenizer
from .tokenization_marian import MarianTokenizer
from .tokenization_mbart import MBartTokenizer
from .tokenization_pegasus import PegasusTokenizer
from .tokenization_reformer import ReformerTokenizer
from .tokenization_t5 import T5Tokenizer
from .tokenization_xlm_roberta import XLMRobertaTokenizer
from .tokenization_xlnet import XLNetTokenizer
else:
AlbertTokenizer = None
BertGenerationTokenizer = None
CamembertTokenizer = None
MarianTokenizer = None
MBartTokenizer = None
PegasusTokenizer = None
ReformerTokenizer = None
T5Tokenizer = None
XLMRobertaTokenizer = None
XLNetTokenizer = None
if is_tokenizers_available():
from .tokenization_albert_fast import AlbertTokenizerFast
from .tokenization_bart_fast import BartTokenizerFast
from .tokenization_bert_fast import BertTokenizerFast
from .tokenization_camembert_fast import CamembertTokenizerFast
from .tokenization_distilbert_fast import DistilBertTokenizerFast
from .tokenization_dpr_fast import DPRQuestionEncoderTokenizerFast
from .tokenization_electra_fast import ElectraTokenizerFast
from .tokenization_funnel_fast import FunnelTokenizerFast
from .tokenization_gpt2_fast import GPT2TokenizerFast
from .tokenization_layoutlm_fast import LayoutLMTokenizerFast
from .tokenization_longformer_fast import LongformerTokenizerFast
from .tokenization_lxmert_fast import LxmertTokenizerFast
from .tokenization_mbart_fast import MBartTokenizerFast
from .tokenization_mobilebert_fast import MobileBertTokenizerFast
from .tokenization_openai_fast import OpenAIGPTTokenizerFast
from .tokenization_pegasus_fast import PegasusTokenizerFast
from .tokenization_reformer_fast import ReformerTokenizerFast
from .tokenization_retribert_fast import RetriBertTokenizerFast
from .tokenization_roberta_fast import RobertaTokenizerFast
from .tokenization_squeezebert_fast import SqueezeBertTokenizerFast
from .tokenization_t5_fast import T5TokenizerFast
from .tokenization_xlm_roberta_fast import XLMRobertaTokenizerFast
from .tokenization_xlnet_fast import XLNetTokenizerFast
else:
AlbertTokenizerFast = None
BartTokenizerFast = None
BertTokenizerFast = None
CamembertTokenizerFast = None
DistilBertTokenizerFast = None
DPRQuestionEncoderTokenizerFast = None
ElectraTokenizerFast = None
FunnelTokenizerFast = None
GPT2TokenizerFast = None
LayoutLMTokenizerFast = None
LongformerTokenizerFast = None
LxmertTokenizerFast = None
MBartTokenizerFast = None
MobileBertTokenizerFast = None
OpenAIGPTTokenizerFast = None
PegasusTokenizerFast = None
ReformerTokenizerFast = None
RetriBertTokenizerFast = None
RobertaTokenizerFast = None
SqueezeBertTokenizerFast = None
T5TokenizerFast = None
XLMRobertaTokenizerFast = None
XLNetTokenizerFast = None
logger = logging.get_logger(__name__)
...
@@ -111,7 +174,7 @@ TOKENIZER_MAPPING = OrderedDict(
(XLMRobertaConfig, (XLMRobertaTokenizer, XLMRobertaTokenizerFast)),
(MarianConfig, (MarianTokenizer, None)),
(BlenderbotConfig, (BlenderbotSmallTokenizer, None)),
-(LongformerConfig, (LongformerTokenizer, None)),
(LongformerConfig, (LongformerTokenizer, LongformerTokenizerFast)),
(BartConfig, (BartTokenizer, BartTokenizerFast)),
(LongformerConfig, (LongformerTokenizer, LongformerTokenizerFast)),
(RobertaConfig, (BertweetTokenizer, None)),
...
@@ -139,7 +202,11 @@ TOKENIZER_MAPPING = OrderedDict(
]
)
-SLOW_TOKENIZER_MAPPING = {k: v[0] for k, v in TOKENIZER_MAPPING.items()}
SLOW_TOKENIZER_MAPPING = {
k: (v[0] if v[0] is not None else v[1])
for k, v in TOKENIZER_MAPPING.items()
if (v[0] is not None or v[1] is not None)
}
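For illustration, a self-contained sketch of what the rewritten comprehension produces; the string values below are hypothetical stand-ins for the real config and tokenizer classes:

# Hypothetical placeholder mapping: in the library the keys are config classes
# and the values are (slow, fast) tokenizer classes, with None for a backend
# that is not installed.
TOKENIZER_MAPPING = {
    "AlbertConfig": (None, "AlbertTokenizerFast"),          # sentencepiece missing
    "BertConfig": ("BertTokenizer", "BertTokenizerFast"),   # both backends available
    "SomeConfig": (None, None),                             # neither backend available
}
SLOW_TOKENIZER_MAPPING = {
    k: (v[0] if v[0] is not None else v[1])
    for k, v in TOKENIZER_MAPPING.items()
    if (v[0] is not None or v[1] is not None)
}
# -> {'AlbertConfig': 'AlbertTokenizerFast', 'BertConfig': 'BertTokenizer'}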
class AutoTokenizer:
...
@@ -254,7 +321,7 @@ class AutoTokenizer:
if type(config) in TOKENIZER_MAPPING.keys():
tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
-if tokenizer_class_fast and use_fast:
if tokenizer_class_fast and (use_fast or tokenizer_class_py is None):
return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
else:
return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
......
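Taken together, the user-facing effect of the change can be sketched as follows (a hedged example, assuming the `albert-base-v2` files can be downloaded; `is_sentencepiece_available` and `is_tokenizers_available` are the helpers imported in the diff above):

# Sketch of the behaviour introduced above, not an authoritative test.
from transformers import AutoTokenizer
from transformers.file_utils import is_sentencepiece_available, is_tokenizers_available

print("sentencepiece installed:", is_sentencepiece_available())
print("tokenizers installed:", is_tokenizers_available())

# With `tokenizers` installed, use_fast=True returns AlbertTokenizerFast.
# If sentencepiece is missing, the slow entry in TOKENIZER_MAPPING is None and
# the new `tokenizer_class_py is None` branch still hands back the fast class
# instead of raising.
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2", use_fast=True)
print(type(tokenizer).__name__)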