Unverified Commit 08f534d2 authored by Sylvain Gugger, committed by GitHub

Doc styling (#8067)

* Important files

* Styling them all

* Revert "Styling them all"

This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.

* Styling them for realsies

* Fix syntax error

* Fix benchmark_utils

* More fixes

* Fix modeling auto and script

* Remove new line

* Fixes

* More fixes

* Fix more files

* Style

* Add FSMT

* More fixes

* More fixes

* More fixes

* More fixes

* Fixes

* More fixes

* More fixes

* Last fixes

* Make sphinx happy
parent 04a17f85
@@ -322,6 +322,7 @@ jobs:
- run: black --check examples templates tests src utils
- run: isort --check-only examples templates tests src utils
- run: flake8 examples templates tests src utils
- run: python utils/style_doc.py src/transformers docs/source --max_len 119 --check_only
- run: python utils/check_copies.py
- run: python utils/check_dummies.py
- run: python utils/check_repo.py
...
@@ -15,6 +15,7 @@ modified_only_fixup:
black $(modified_py_files); \
isort $(modified_py_files); \
flake8 $(modified_py_files); \
python utils/style_doc.py $(modified_py_files) --max_len 119; \
else \
echo "No library .py files were modified"; \
fi

@@ -31,6 +32,7 @@ quality:
black --check $(check_dirs)
isort --check-only $(check_dirs)
flake8 $(check_dirs)
python utils/style_doc.py src/transformers docs/source --max_len 119 --check_only
${MAKE} extra_quality_checks
# Format source code automatically and check if there are any problems left that need manual fixing
@@ -38,6 +40,7 @@ quality:
style:
black $(check_dirs)
isort $(check_dirs)
python utils/style_doc.py src/transformers docs/source --max_len 119

# Super fast fix and check target that only works on relevant modified files since the branch was made
...
@@ -3,21 +3,27 @@ Benchmarks

Let's take a look at how 🤗 Transformer models can be benchmarked, best practices, and already available benchmarks.

A notebook explaining in more detail how to benchmark 🤗 Transformer models can be found `here
<https://github.com/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb>`__.

How to benchmark 🤗 Transformer models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` allow you to
flexibly benchmark 🤗 Transformer models, measuring the `peak memory usage` and `required time` for both `inference`
and `training`.

.. note::

    Here, `inference` is defined as a single forward pass, and `training` is defined as a single forward pass and
    backward pass.
The benchmark classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` expect an
object of type :class:`~transformers.PyTorchBenchmarkArguments` and
:class:`~transformers.TensorFlowBenchmarkArguments`, respectively, for instantiation.
:class:`~transformers.PyTorchBenchmarkArguments` and :class:`~transformers.TensorFlowBenchmarkArguments` are data
classes that contain all relevant configurations for their corresponding benchmark class. The following example shows
how a BERT model of type `bert-base-cased` can be benchmarked.

.. code-block::

@@ -34,11 +40,15 @@ In the following example, it is shown how a BERT model of type `bert-base-cased`

>>> benchmark = TensorFlowBenchmark(args)
Here, three arguments are given to the benchmark argument data classes, namely ``models``, ``batch_sizes``, and
``sequence_lengths``. The argument ``models`` is required and expects a :obj:`list` of model identifiers from the
`model hub <https://huggingface.co/models>`__. The :obj:`list` arguments ``batch_sizes`` and ``sequence_lengths``
define the size of the ``input_ids`` on which the model is benchmarked. There are many more parameters that can be
configured via the benchmark argument data classes. For more detail on these, one can either directly consult the files
``src/transformers/benchmark/benchmark_args_utils.py``, ``src/transformers/benchmark/benchmark_args.py`` (for PyTorch)
and ``src/transformers/benchmark/benchmark_args_tf.py`` (for TensorFlow). Alternatively, running the following shell
commands from the repository root will print out a descriptive list of all configurable parameters for PyTorch and
TensorFlow respectively.

.. code-block:: bash

@@ -145,14 +155,17 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r

- gpu_performance_state: 2
- use_tpu: False
By default, the `time` and the `required memory` for `inference` are benchmarked. In the example output above, the
first two sections show the result corresponding to `inference time` and `inference memory`. In addition, all relevant
information about the computing environment, `e.g.` the GPU type, the system, the library versions, etc., is printed
out in the third section under `ENVIRONMENT INFORMATION`. This information can optionally be saved in a `.csv` file by
adding the argument :obj:`save_to_csv=True` to :class:`~transformers.PyTorchBenchmarkArguments` and
:class:`~transformers.TensorFlowBenchmarkArguments` respectively. In this case, every section is saved in a separate
`.csv` file. The path to each `.csv` file can optionally be defined via the argument data classes.
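
As a minimal sketch of such a configuration (the CSV-path argument names below are assumptions; consult
``src/transformers/benchmark/benchmark_args_utils.py`` for the authoritative field names):

.. code-block:: python

    >>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

    >>> # the CSV-related argument names are illustrative; check benchmark_args_utils.py for the real fields
    >>> args = PyTorchBenchmarkArguments(
    ...     models=["bert-base-cased"],
    ...     batch_sizes=[8],
    ...     sequence_lengths=[32, 128],
    ...     save_to_csv=True,
    ...     inference_time_csv_file="inference_time.csv",
    ...     inference_memory_csv_file="inference_memory.csv",
    ...     env_info_csv_file="env_info.csv",
    ... )
    >>> benchmark = PyTorchBenchmark(args)
    >>> results = benchmark.run()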
Instead of benchmarking pre-trained models via their model identifier, `e.g.` `bert-base-uncased`, the user can
alternatively benchmark an arbitrary configuration of any available model class. In this case, a :obj:`list` of
configurations must be inserted with the benchmark args as follows.

.. code-block::

@@ -295,8 +308,9 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar

- use_tpu: False
Again, `inference time` and `required memory` for `inference` are measured, but this time for customized configurations
of the :obj:`BertModel` class. This feature can be especially helpful when deciding for which configuration the model
should be trained.

Benchmark best practices

@@ -305,18 +319,27 @@ Benchmark best practices

This section lists a couple of best practices one should be aware of when benchmarking a model.
- Currently, only single device benchmarking is supported. When benchmarking on GPU, it is recommended that the user
  specifies on which device the code should be run by setting the ``CUDA_VISIBLE_DEVICES`` environment variable in the
  shell, `e.g.` ``export CUDA_VISIBLE_DEVICES=0``, before running the code (see the sketch after this list).
- The option :obj:`no_multi_processing` should only be set to :obj:`True` for testing and debugging. To ensure accurate
  memory measurement it is recommended to run each memory benchmark in a separate process by making sure
  :obj:`no_multi_processing` is set to :obj:`True`.
- One should always state the environment information when sharing the results of a model benchmark. Results can vary
  heavily between different GPU devices, library versions, etc., so benchmark results on their own are not very useful
  for the community.
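
As a minimal sketch of the first recommendation (the chosen device index and argument values are arbitrary):

.. code-block:: python

    import os

    # pin the benchmark to a single GPU before anything initializes CUDA
    # (equivalent to running ``export CUDA_VISIBLE_DEVICES=0`` in the shell)
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"

    from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

    args = PyTorchBenchmarkArguments(models=["bert-base-cased"], batch_sizes=[8], sequence_lengths=[128])
    results = PyTorchBenchmark(args).run()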
Sharing your benchmark
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Previously, all available core models (10 at the time) were benchmarked for `inference time`, across many different
settings: using PyTorch, with and without TorchScript, using TensorFlow, with and without XLA. All of those tests were
done across CPUs (except for TensorFlow XLA) and GPUs.
The approach is detailed in the `following blogpost
<https://medium.com/huggingface/benchmarking-transformers-pytorch-and-tensorflow-e2917fb891c2>`__ and the results are
available `here
<https://docs.google.com/spreadsheets/d/1sryqufw2D0XlUH4sq3e9Wnxu5EAQkaohzrJbd5HdQ_w/edit?usp=sharing>`__.

With the new `benchmark` tools, it is easier than ever to share your benchmark results with the community `here
<https://github.com/huggingface/transformers/blob/master/examples/benchmarking/README.md>`__.

BERTology
-----------------------------------------------------------------------------------------------------------------------
There is a growing field of study concerned with investigating the inner workings of large-scale transformers like BERT
(that some call "BERTology"). Some good examples of this field are:

* BERT Rediscovers the Classical NLP Pipeline by Ian Tenney, Dipanjan Das, Ellie Pavlick:
  https://arxiv.org/abs/1905.05950
* Are Sixteen Heads Really Better than One? by Paul Michel, Omer Levy, Graham Neubig: https://arxiv.org/abs/1905.10650
* What Does BERT Look At? An Analysis of BERT's Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D.
  Manning: https://arxiv.org/abs/1906.04341
In order to help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models to
help people access the inner representations, mainly adapted from the great work of Paul Michel
(https://arxiv.org/abs/1905.10650); a short usage sketch follows the list below:
* accessing all the hidden-states of BERT/GPT/GPT-2,
* accessing all the attention weights for each head of BERT/GPT/GPT-2,
* retrieving head output values and gradients to be able to compute head importance scores and prune heads as explained
  in https://arxiv.org/abs/1905.10650.
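
A minimal sketch of these features (the model and input here are arbitrary choices for illustration):

.. code-block:: python

    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained(
        "bert-base-uncased", output_hidden_states=True, output_attentions=True, return_dict=True
    )

    inputs = tokenizer("BERTology is fun", return_tensors="pt")
    outputs = model(**inputs)

    # one hidden-state tensor per layer (plus the embeddings) and one attention tensor per layer
    print(len(outputs.hidden_states), outputs.hidden_states[-1].shape)
    print(len(outputs.attentions), outputs.attentions[-1].shape)

    # heads can also be pruned, e.g. remove heads 0 and 2 of layer 0
    model.prune_heads({0: [0, 2]})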
To help you understand and use these features, we have added a specific example script: `bertology.py
<https://github.com/huggingface/transformers/blob/master/examples/bertology/run_bertology.py>`_ which extracts
information and prunes a model pre-trained on GLUE.
Converting TensorFlow Checkpoints
=======================================================================================================================

A command-line interface is provided to convert original Bert/GPT/GPT-2/Transformer-XL/XLNet/XLM checkpoints into
models that can be loaded using the ``from_pretrained`` methods of the library.

.. note::

    Since 2.3.0 the conversion script is part of the transformers CLI (**transformers-cli**), available in any
    transformers >= 2.3.0 installation.

    The documentation below reflects the **transformers-cli convert** command format.

BERT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google
<https://github.com/google-research/bert#pre-trained-models>`_\ ) into a PyTorch save file by using the
`convert_bert_original_tf_checkpoint_to_pytorch.py
<https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_
script.

This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated
configuration file (\ ``bert_config.json``\ ), creates a PyTorch model for this configuration, loads the weights
from the TensorFlow checkpoint into the PyTorch model and saves the resulting model in a standard PyTorch save file
that can be imported using ``torch.load()`` (see examples in `run_bert_extract_features.py
<https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_extract_features.py>`_\ ,
`run_bert_classifier.py
<https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and
`run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\
).

You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow
checkpoint (the three files starting with ``bert_model.ckpt``\ ) but be sure to keep the configuration file (\
``bert_config.json``\ ) and the vocabulary file (\ ``vocab.txt``\ ) as these are needed for the PyTorch model too.

To run this specific conversion script you will need to have TensorFlow and PyTorch installed (\ ``pip install
tensorflow``\ ). The rest of the repository only requires PyTorch.
Here is an example of the conversion process for a pre-trained ``BERT-Base Uncased`` model:

@@ -31,14 +47,20 @@ Here is an example of the conversion process for a pre-trained ``BERT-Base Uncas

--config $BERT_BASE_DIR/bert_config.json \
--pytorch_dump_output $BERT_BASE_DIR/pytorch_model.bin

You can download Google's pre-trained models for the conversion `here
<https://github.com/google-research/bert#pre-trained-models>`__.

ALBERT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Convert TensorFlow model checkpoints of ALBERT to PyTorch using the
`convert_albert_original_tf_checkpoint_to_pytorch.py
<https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_
script.

The CLI takes as input a TensorFlow checkpoint (three files starting with ``model.ckpt-best``\ ) and the accompanying
configuration file (\ ``albert_config.json``\ ), then creates and saves a PyTorch model. To run this conversion you
will need to have TensorFlow and PyTorch installed.

Here is an example of the conversion process for the pre-trained ``ALBERT Base`` model:

@@ -51,12 +73,15 @@ Here is an example of the conversion process for the pre-trained ``ALBERT Base``

--config $ALBERT_BASE_DIR/albert_config.json \
--pytorch_dump_output $ALBERT_BASE_DIR/pytorch_model.bin

You can download Google's pre-trained models for the conversion `here
<https://github.com/google-research/albert#pre-trained-models>`__.

OpenAI GPT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained OpenAI GPT model, assuming that your NumPy checkpoint
was saved in the same format as the OpenAI pretrained model (see `here
<https://github.com/openai/finetune-transformer-lm>`__\ ):

.. code-block:: shell

@@ -72,7 +97,8 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT model,

OpenAI GPT-2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model (see `here
<https://github.com/openai/gpt-2>`__\ ):

.. code-block:: shell

@@ -87,7 +113,8 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT-2 mode

Transformer-XL
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here is an example of the conversion process for a pre-trained Transformer-XL model (see `here
<https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models>`__\ ):

.. code-block:: shell

...
@@ -3,15 +3,15 @@ Fine-tuning with custom datasets

.. note::

    The datasets used in this tutorial are available and can be more easily accessed using the `🤗 NLP library
    <https://github.com/huggingface/nlp>`_. We do not use this library to access the datasets here since this tutorial
    is meant to illustrate how to work with your own data. A brief introduction can be found at the end of the tutorial
    in the section ":ref:`nlplib`".
This tutorial will take you through several examples of using 🤗 Transformers models with your own datasets. The guide
shows one of many valid workflows for using these models and is meant to be illustrative rather than definitive. We
show examples of reading in several data formats, preprocessing the data for several types of tasks, and then preparing
the data into PyTorch/TensorFlow ``Dataset`` objects which can easily be used either with
:class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` or with native PyTorch/TensorFlow.

We include several examples, each of which demonstrates a different type of common downstream task:

@@ -28,13 +28,13 @@ Sequence Classification with IMDb Reviews

.. note::

    This dataset can be explored in the Hugging Face model hub (`IMDb <https://huggingface.co/datasets/imdb>`_), and
    can be alternatively downloaded with the 🤗 NLP library with ``load_dataset("imdb")``.
In this example, we'll show how to download, tokenize, and train a model on the IMDb reviews dataset. This task takes
the text of a review and requires the model to predict whether the sentiment of the review is positive or negative.
Let's start by downloading the dataset from the `Large Movie Review Dataset
<http://ai.stanford.edu/~amaas/data/sentiment/>`_ webpage.

.. code-block:: bash

@@ -62,9 +62,8 @@ read this in.

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')
We now have a train and test dataset, but let's also create a validation set which we can use for evaluation and tuning
without tainting our test set results. Sklearn has a convenient utility for creating such splits:

.. code-block:: python

@@ -80,8 +79,8 @@ pre-trained DistilBert, so let's use the DistilBert tokenizer.

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
Now we can simply pass our texts to the tokenizer. We'll pass ``truncation=True`` and ``padding=True``, which will
ensure that all of our sequences are padded to the same length and are truncated to be no longer than the model's
maximum input length. This will allow us to feed batches of sequences into the model at the same time.

.. code-block:: python

@@ -90,9 +89,9 @@ input length. This will allow us to feed batches of sequences into the model at

test_encodings = tokenizer(test_texts, truncation=True, padding=True)
Now, let's turn our labels and encodings into a Dataset object. In PyTorch, this is done by subclassing a
``torch.utils.data.Dataset`` object and implementing ``__len__`` and ``__getitem__``. In TensorFlow, we pass our input
encodings and labels to the ``from_tensor_slices`` constructor method. We put the data in this format so that the data
can be easily batched such that each key in the batch encoding corresponds to a named parameter of the
:meth:`~transformers.DistilBertForSequenceClassification.forward` method of the model we will train.

.. code-block:: python

@@ -133,17 +132,17 @@ easily batched such that each key in the batch encoding corresponds to a named p

))
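
As a hedged sketch of the PyTorch side described above (the class name ``IMDbDataset`` is illustrative; it only assumes
the ``train_encodings`` and ``train_labels`` created earlier):

.. code-block:: python

    import torch

    class IMDbDataset(torch.utils.data.Dataset):
        def __init__(self, encodings, labels):
            self.encodings = encodings
            self.labels = labels

        def __getitem__(self, idx):
            # wrap every encoded field of one example in a tensor, keyed like the model's forward() arguments
            item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
            item['labels'] = torch.tensor(self.labels[idx])
            return item

        def __len__(self):
            return len(self.labels)

    train_dataset = IMDbDataset(train_encodings, train_labels)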
Now that our datasets are ready, we can fine-tune a model either with the 🤗
:class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` or with native PyTorch/TensorFlow. See :doc:`training
<training>`.

.. _ft_trainer:

Fine-tuning with Trainer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The steps above prepared the datasets in the way that the trainer expects. Now all we need to do is create a model to
fine-tune, define the :class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments` and
instantiate a :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`.

.. code-block:: python

@@ -248,15 +247,15 @@ Token Classification with W-NUT Emerging Entities

.. note::

    This dataset can be explored in the Hugging Face model hub (`WNUT-17 <https://huggingface.co/datasets/wnut_17>`_),
    and can be alternatively downloaded with the 🤗 NLP library with ``load_dataset("wnut_17")``.
Next we will look at token classification. Rather than classifying an entire sequence, this task classifies token by
token. We'll demonstrate how to do this with `Named Entity Recognition
<http://nlpprogress.com/english/named_entity_recognition.html>`_, which involves identifying tokens which correspond to
a predefined set of "entities". Specifically, we'll use the `W-NUT Emerging and Rare entities
<http://noisy-text.github.io/2017/emerging-rare-entities.html>`_ corpus. The data is given as a collection of
pre-tokenized documents where each token is assigned a tag.

Let's start by downloading the data.

@@ -264,10 +263,10 @@ Let's start by downloading the data.

wget http://noisy-text.github.io/2017/files/wnut17train.conll
In this case, we'll just download the train set, which is a single text file. Each line of the file contains either (1)
a word and tag separated by a tab, or (2) a blank line indicating the end of a document. Let's write a function to read
this in. We'll take in the file path and return ``token_docs`` which is a list of lists of token strings, and
``token_tags`` which is a list of lists of tag strings.

.. code-block:: python

@@ -303,8 +302,8 @@ Just to see what this data looks like, let's take a look at a segment of the fir

['for', 'two', 'weeks', '.', 'Empire', 'State', 'Building']
['O', 'O', 'O', 'O', 'B-location', 'I-location', 'I-location']
``location`` is an entity type, ``B-`` indicates the beginning of an entity, and ``I-`` indicates consecutive positions
of the same entity ("Empire State Building" is considered one entity). ``O`` indicates the token does not correspond to
any entity.

Now that we've read the data in, let's create a train/validation split:

@@ -314,8 +313,8 @@ Now that we've read the data in, let's create a train/validation split:

from sklearn.model_selection import train_test_split
train_texts, val_texts, train_tags, val_tags = train_test_split(texts, tags, test_size=.2)
Next, let's create encodings for our tokens and tags. For the tags, we can start by just creating a simple mapping
which we'll use in a moment:

.. code-block:: python

@@ -323,11 +322,11 @@ which we'll use in a moment:

tag2id = {tag: id for id, tag in enumerate(unique_tags)}
id2tag = {id: tag for tag, id in tag2id.items()}
To encode the tokens, we'll use a pre-trained DistilBert tokenizer. We can tell the tokenizer that we're dealing with
ready-split tokens rather than full sentence strings by passing ``is_split_into_words=True``. We'll also pass
``padding=True`` and ``truncation=True`` to pad the sequences to be the same length. Lastly, we can tell the model to
return information about the tokens which are split by the wordpiece tokenization process, which we will need in a
moment.

.. code-block:: python

@@ -339,24 +338,24 @@ a moment.

Great, so now our tokens are nicely encoded in the format that they need to be in to feed them into our DistilBert
model below.
Now we arrive at a common obstacle with using pre-trained models for token-level classification: many of the tokens in
the W-NUT corpus are not in DistilBert's vocabulary. Bert and many models like it use a method called WordPiece
Tokenization, meaning that single words are split into multiple tokens such that each token is likely to be in the
vocabulary. For example, DistilBert's tokenizer would split the Twitter handle ``@huggingface`` into the tokens ``['@',
'hugging', '##face']``. This is a problem for us because we have exactly one tag per token. If the tokenizer splits a
token into multiple sub-tokens, then we will end up with a mismatch between our tokens and our labels.

One way to handle this is to only train on the tag labels for the first subtoken of a split token. We can do this in 🤗
Transformers by setting the labels we wish to ignore to ``-100``. In the example above, if the label for
``@HuggingFace`` is ``3`` (indexing ``B-corporation``), we would set the labels of ``['@', 'hugging', '##face']`` to
``[3, -100, -100]``.
Let's write a function to do this. This is where we will use the ``offset_mapping`` from the tokenizer as mentioned
above. For each sub-token returned by the tokenizer, the offset mapping gives us a tuple indicating the sub-token's
start position and end position relative to the original token it was split from. That means that if the first position
in the tuple is anything other than ``0``, we will set its corresponding label to ``-100``. While we're at it, we can
also set labels to ``-100`` if the second position of the offset mapping is ``0``, since this means it must be a
special token like ``[PAD]`` or ``[CLS]``.
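
A minimal sketch of such a function could look like the following (the name ``encode_tags`` and the use of NumPy are
illustrative; it assumes the ``tag2id`` mapping and encodings created above with ``return_offsets_mapping=True``):

.. code-block:: python

    import numpy as np

    def encode_tags(tags, encodings):
        labels = [[tag2id[tag] for tag in doc] for doc in tags]
        encoded_labels = []
        for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
            # start with -100 everywhere (ignored by the loss), then fill in the first sub-token of each word
            doc_enc_labels = np.ones(len(doc_offset), dtype=int) * -100
            arr_offset = np.array(doc_offset)
            # a first sub-token starts at character 0 of its word and has a non-zero end offset
            doc_enc_labels[(arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)] = doc_labels
            encoded_labels.append(doc_enc_labels.tolist())
        return encoded_labels

    train_labels = encode_tags(train_tags, train_encodings)
    val_labels = encode_tags(val_tags, val_encodings)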
.. note::

@@ -447,8 +446,9 @@ Question Answering with SQuAD 2.0

.. note::

    This dataset can be explored in the Hugging Face model hub (`SQuAD V2
    <https://huggingface.co/datasets/squad_v2>`_), and can be alternatively downloaded with the 🤗 NLP library with
    ``load_dataset("squad_v2")``.
Question answering comes in many forms. In this example, we'll look at the particular type of extractive QA that
involves answering a question about a passage by highlighting the segment of the passage that answers the question.

@@ -464,8 +464,8 @@ We will start by downloading the data:

wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json -O squad/dev-v2.0.json
Each split is in a structured json file with a number of questions and answers for each passage (or context). We'll
take this apart into parallel lists of contexts, questions, and answers (note that the contexts here are repeated since
there are multiple questions per context):

.. code-block:: python

@@ -495,13 +495,13 @@ since there are multiple questions per context):

train_contexts, train_questions, train_answers = read_squad('squad/train-v2.0.json')
val_contexts, val_questions, val_answers = read_squad('squad/dev-v2.0.json')
The contexts and questions are just strings. The answers are dicts containing the subsequence of the passage with the
correct answer as well as an integer indicating the character at which the answer begins. In order to train a model on
this data we need (1) the tokenized context/question pairs, and (2) integers indicating at which *token* positions the
answer begins and ends.

First, let's get the *character* position at which the answer ends in the passage (we are given the starting position).
Sometimes SQuAD answers are off by one or two characters, so we will also adjust for that.

.. code-block:: python

@@ -524,9 +524,9 @@ position). Sometimes SQuAD answers are off by one or two characters, so we will

add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)
Now ``train_answers`` and ``val_answers`` include the character end positions and the corrected start positions. Next,
let's tokenize our context/question pairs. 🤗 Tokenizers can accept parallel lists of sequences and encode them together
as sequence pairs.

.. code-block:: python

@@ -536,8 +536,8 @@ them together as sequence pairs.

train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True)
Next we need to convert our character start/end positions to token start/end positions. When using 🤗 Fast Tokenizers,
we can use the built-in :func:`~transformers.BatchEncoding.char_to_token` method.

.. code-block:: python

@@ -557,9 +557,9 @@ Tokenizers, we can use the built in :func:`~transformers.BatchEncoding.char_to_t

add_token_positions(train_encodings, train_answers)
add_token_positions(val_encodings, val_answers)
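
For reference, a hedged sketch of what a helper like ``add_token_positions`` could do; the dictionary keys
``answer_start``/``answer_end`` and the fallback to the maximum model length for truncated answers are assumptions,
and the elided code above may differ:

.. code-block:: python

    def add_token_positions(encodings, answers):
        start_positions = []
        end_positions = []
        for i in range(len(answers)):
            # convert character indices to token indices with the fast tokenizer's offset information
            start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
            end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))
            # if the answer was truncated out of the window, fall back to the maximum length
            if start_positions[-1] is None:
                start_positions[-1] = tokenizer.model_max_length
            if end_positions[-1] is None:
                end_positions[-1] = tokenizer.model_max_length
        encodings.update({'start_positions': start_positions, 'end_positions': end_positions})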
Our data is ready. Let's just put it in a PyTorch/TensorFlow dataset so that we can easily use it for training. In
PyTorch, we define a custom ``Dataset`` class. In TensorFlow, we pass a tuple of ``(inputs_dict, labels_dict)`` to the
``from_tensor_slices`` method.

.. code-block:: python

@@ -668,12 +668,11 @@ Additional Resources

Using the 🤗 NLP Datasets & Metrics library
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This tutorial demonstrates how to read in datasets from various raw text formats and prepare them for training with 🤗
Transformers so that you can do the same thing with your own custom datasets. However, we recommend users use the `🤗
NLP library <https://github.com/huggingface/nlp>`_ for working with the 150+ datasets included in the `hub
<https://huggingface.co/datasets>`_, including the three datasets used in this tutorial. As a very brief overview, we
will show how to use the NLP library to download and prepare the IMDb dataset from the first example, :ref:`seq_imdb`.

Start by downloading the dataset:

@@ -689,8 +688,8 @@ Each dataset has multiple columns corresponding to different features. Let's see

>>> print(train.column_names)
['label', 'text']
Great. Now let's tokenize the text. We can do this using the ``map`` method. We'll also rename the ``label`` column to
``labels`` to match the model's input arguments.

.. code-block:: python

@@ -711,5 +710,5 @@ dataset elements.

>>> {key: val.shape for key, val in train[0].items()}
{'labels': TensorShape([]), 'input_ids': TensorShape([512]), 'attention_mask': TensorShape([512])}
We now have a fully-prepared dataset. Check out `the 🤗 NLP docs <https://huggingface.co/nlp/processing.html>`_ for a
more thorough introduction.
\ No newline at end of file
@@ -57,8 +57,8 @@ The tokenizer takes care of splitting the sequence into tokens available in the

>>> tokenized_sequence = tokenizer.tokenize(sequence)

The tokens are either words or subwords. Here for instance, "VRAM" wasn't in the model vocabulary, so it's been split
into "V", "RA" and "M". To indicate those tokens are not separate words but parts of the same word, a double-hash
prefix is added for "RA" and "M":
.. code-block:: .. code-block::
...@@ -66,8 +66,8 @@ added for "RA" and "M": ...@@ -66,8 +66,8 @@ added for "RA" and "M":
['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M'] ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding
the sentence to the tokenizer, which leverages the Rust implementation of the sentence to the tokenizer, which leverages the Rust implementation of `huggingface/tokenizers
`huggingface/tokenizers <https://github.com/huggingface/tokenizers>`__ for peak performance. <https://github.com/huggingface/tokenizers>`__ for peak performance.
.. code-block:: .. code-block::
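>>> # Sketch of the direct call: the tokenizer returns a dictionary of model inputs
>>> inputs = tokenizer(sequence)
>>> encoded_sequence = inputs["input_ids"]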
...@@ -105,8 +105,8 @@ because this is the way a :class:`~transformers.BertModel` is going to expect it ...@@ -105,8 +105,8 @@ because this is the way a :class:`~transformers.BertModel` is going to expect it
Attention mask Attention mask
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The attention mask is an optional argument used when batching sequences together. This argument indicates to the The attention mask is an optional argument used when batching sequences together. This argument indicates to the model
model which tokens should be attended to, and which should not. which tokens should be attended to, and which should not.
For example, consider these two sequences: For example, consider these two sequences:
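A sketch of how such a pair can be encoded and padded to the same length (the sentences are illustrative):

.. code-block::

>>> sequence_a = "This is a short sequence."
>>> sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."
>>> # padding=True pads the shorter sequence up to the length of the longer one
>>> padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)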
...@@ -145,10 +145,10 @@ We can see that 0s have been added on the right of the first sentence to make it ...@@ -145,10 +145,10 @@ We can see that 0s have been added on the right of the first sentence to make it
>>> padded_sequences["input_ids"] >>> padded_sequences["input_ids"]
[[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]] [[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]
This can then be converted into a tensor in PyTorch or TensorFlow. The attention mask is a binary tensor indicating This can then be converted into a tensor in PyTorch or TensorFlow. The attention mask is a binary tensor indicating the
the position of the padded indices so that the model does not attend to them. For the position of the padded indices so that the model does not attend to them. For the :class:`~transformers.BertTokenizer`,
:class:`~transformers.BertTokenizer`, :obj:`1` indicates a value that should be attended to, while :obj:`0` indicates :obj:`1` indicates a value that should be attended to, while :obj:`0` indicates a padded value. This attention mask is
a padded value. This attention mask is in the dictionary returned by the tokenizer under the key "attention_mask": in the dictionary returned by the tokenizer under the key "attention_mask":
.. code-block:: .. code-block::
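>>> # One 1 per real token and one 0 per padding token
>>> padded_sequences["attention_mask"]
[[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]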
...@@ -161,15 +161,16 @@ Token Type IDs ...@@ -161,15 +161,16 @@ Token Type IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some models' purpose is to do sequence classification or question answering. These require two different sequences to Some models' purpose is to do sequence classification or question answering. These require two different sequences to
be joined in a single "input_ids" entry, which usually is performed with the help of special tokens, such as the classifier (``[CLS]``) and separator (``[SEP]``) be joined in a single "input_ids" entry, which usually is performed with the help of special tokens, such as the
tokens. For example, the BERT model builds its two sequence input as such: classifier (``[CLS]``) and separator (``[SEP]``) tokens. For example, the BERT model builds its two sequence input as
such:
.. code-block:: .. code-block::
>>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP] >>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]
We can use our tokenizer to automatically generate such a sentence by passing the two sequences to ``tokenizer`` as two arguments (and We can use our tokenizer to automatically generate such a sentence by passing the two sequences to ``tokenizer`` as two
not a list, like before) like this: arguments (and not a list, like before) like this:
.. code-block:: .. code-block::
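>>> # Sketch: pass the two sequences as two separate arguments (not as a list)
>>> sequence_a = "HuggingFace is based in NYC"
>>> sequence_b = "Where is HuggingFace based?"
>>> encoded_dict = tokenizer(sequence_a, sequence_b)
>>> decoded = tokenizer.decode(encoded_dict["input_ids"])
>>> print(decoded)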
...@@ -189,8 +190,8 @@ which will return: ...@@ -189,8 +190,8 @@ which will return:
[CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP] [CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
This is enough for some models to understand where one sequence ends and where another begins. However, other models, This is enough for some models to understand where one sequence ends and where another begins. However, other models,
such as BERT, also deploy token type IDs (also called segment IDs). They are represented as a binary such as BERT, also deploy token type IDs (also called segment IDs). They are represented as a binary mask identifying
mask identifying the two types of sequence in the model. the two types of sequence in the model.
The tokenizer returns this mask as the "token_type_ids" entry: The tokenizer returns this mask as the "token_type_ids" entry:
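Continuing the sketch above, it can be read straight from the tokenizer's output:

.. code-block::

>>> # for BERT-like models, 0 marks tokens of the first sequence and 1 marks tokens of the second one
>>> encoded_dict["token_type_ids"]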
...@@ -209,14 +210,15 @@ Some models, like :class:`~transformers.XLNetModel` use an additional token repr ...@@ -209,14 +210,15 @@ Some models, like :class:`~transformers.XLNetModel` use an additional token repr
Position IDs Position IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Contrary to RNNs that have the position of each token embedded within them, Contrary to RNNs that have the position of each token embedded within them, transformers are unaware of the position of
transformers are unaware of the position of each token. Therefore, the position IDs (``position_ids``) are used by the model to identify each token's position in the list of tokens. each token. Therefore, the position IDs (``position_ids``) are used by the model to identify each token's position in
the list of tokens.
They are an optional parameter. If no ``position_ids`` is passed to the model, the IDs are automatically created as absolute They are an optional parameter. If no ``position_ids`` is passed to the model, the IDs are automatically created as
positional embeddings. absolute positional embeddings.
Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models use
use other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings. other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.
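If needed, position IDs can also be built and passed explicitly. A minimal sketch, assuming ``model`` and ``input_ids`` (a tensor of shape ``(batch_size, seq_length)``) are already defined and the model accepts ``position_ids``, as BERT does:

.. code-block::

>>> import torch

>>> # one position per token, with a batch dimension added by unsqueeze
>>> position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)
>>> outputs = model(input_ids, position_ids=position_ids)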
.. _labels: .. _labels:
...@@ -224,43 +226,41 @@ Labels ...@@ -224,43 +226,41 @@ Labels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The labels are an optional argument which can be passed in order for the model to compute the loss itself. These labels The labels are an optional argument which can be passed in order for the model to compute the loss itself. These labels
should be the expected prediction of the model: it will use the standard loss in order to compute the loss between should be the expected prediction of the model: it will use the standard loss in order to compute the loss between its
its predictions and the expected value (the label). predictions and the expected value (the label).
These labels are different according to the model head, for example: These labels are different according to the model head, for example:
- For sequence classification models (e.g., :class:`~transformers.BertForSequenceClassification`), the model expects - For sequence classification models (e.g., :class:`~transformers.BertForSequenceClassification`), the model expects a
a tensor of dimension :obj:`(batch_size)` with each value of the batch corresponding to the expected label of the tensor of dimension :obj:`(batch_size)` with each value of the batch corresponding to the expected label of the
entire sequence. entire sequence.
- For token classification models (e.g., :class:`~transformers.BertForTokenClassification`), the model expects - For token classification models (e.g., :class:`~transformers.BertForTokenClassification`), the model expects a tensor
a tensor of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each individual
individual token. token.
- For masked language modeling (e.g., :class:`~transformers.BertForMaskedLM`), the model expects - For masked language modeling (e.g., :class:`~transformers.BertForMaskedLM`), the model expects a tensor of dimension
a tensor of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each individual token: the
individual token: the labels being the token ID for the masked token, and values to be ignored for the rest (usually labels being the token ID for the masked token, and values to be ignored for the rest (usually -100).
-100).
- For sequence to sequence tasks (e.g., :class:`~transformers.BartForConditionalGeneration`, - For sequence to sequence tasks (e.g., :class:`~transformers.BartForConditionalGeneration`,
:class:`~transformers.MBartForConditionalGeneration`), the model expects a tensor of dimension :class:`~transformers.MBartForConditionalGeneration`), the model expects a tensor of dimension :obj:`(batch_size,
:obj:`(batch_size, tgt_seq_length)` with each value corresponding to the target sequences associated with each tgt_seq_length)` with each value corresponding to the target sequences associated with each input sequence. During
input sequence. During training, both `BART` and `T5` will make the appropriate `decoder_input_ids` and decoder training, both `BART` and `T5` will make the appropriate `decoder_input_ids` and decoder attention masks internally.
attention masks internally. They usually do not need to be supplied. This does not apply to models leveraging the They usually do not need to be supplied. This does not apply to models leveraging the Encoder-Decoder framework. See
Encoder-Decoder framework. the documentation of each model for more information on each specific model's labels.
See the documentation of each model for more information on each specific model's labels.
The base models (e.g., :class:`~transformers.BertModel`) do not accept labels, as these are the base transformer models, The base models (e.g., :class:`~transformers.BertModel`) do not accept labels, as these are the base transformer
simply outputting features. models, simply outputting features.
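For instance, a hedged sketch of the sequence classification case, where passing the labels makes the model return the loss directly (checkpoint name and label value are illustrative):

.. code-block::

>>> import torch
>>> from transformers import BertForSequenceClassification, BertTokenizer

>>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
>>> model = BertForSequenceClassification.from_pretrained("bert-base-cased")
>>> inputs = tokenizer("HuggingFace is based in NYC", return_tensors="pt")
>>> labels = torch.tensor([1])  # one expected label for the single sequence in the batch
>>> outputs = model(**inputs, labels=labels)
>>> loss = outputs[0]  # the first element is the loss when labels are provided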
.. _decoder-input-ids: .. _decoder-input-ids:
Decoder input IDs Decoder input IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder. This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder. These
These inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually built in a
built in a way specific to each model. way specific to each model.
Most encoder-decoder models (BART, T5) create their :obj:`decoder_input_ids` on their own from the :obj:`labels`. Most encoder-decoder models (BART, T5) create their :obj:`decoder_input_ids` on their own from the :obj:`labels`. In
In such models, passing the :obj:`labels` is the preferred way to handle training. such models, passing the :obj:`labels` is the preferred way to handle training.
Please check each model's docs to see how they handle these input IDs for sequence to sequence training. Please check each model's docs to see how they handle these input IDs for sequence to sequence training.
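As a sketch (checkpoint name and sentences are illustrative), training a T5 model only requires the ``labels``; the ``decoder_input_ids`` are then built internally:

.. code-block::

>>> from transformers import T5ForConditionalGeneration, T5Tokenizer

>>> tokenizer = T5Tokenizer.from_pretrained("t5-small")
>>> model = T5ForConditionalGeneration.from_pretrained("t5-small")
>>> input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")["input_ids"]
>>> labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt")["input_ids"]
>>> # no decoder_input_ids are passed: the model shifts the labels to create them
>>> loss = model(input_ids=input_ids, labels=labels)[0]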
...@@ -270,17 +270,17 @@ Feed Forward Chunking ...@@ -270,17 +270,17 @@ Feed Forward Chunking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In each residual attention block in transformers the self-attention layer is usually followed by 2 feed forward layers. In each residual attention block in transformers the self-attention layer is usually followed by 2 feed forward layers.
The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (e.g., The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (e.g., for
for ``bert-base-uncased``). ``bert-base-uncased``).
For an input of size ``[batch_size, sequence_length]``, the memory required to store the intermediate feed forward For an input of size ``[batch_size, sequence_length]``, the memory required to store the intermediate feed forward
embeddings ``[batch_size, sequence_length, config.intermediate_size]`` can account for a large fraction of the memory embeddings ``[batch_size, sequence_length, config.intermediate_size]`` can account for a large fraction of the memory
use. The authors of `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_ noticed that since the use. The authors of `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_ noticed that since the
computation is independent of the ``sequence_length`` dimension, it is mathematically equivalent to compute the output computation is independent of the ``sequence_length`` dimension, it is mathematically equivalent to compute the output
embeddings of both feed forward layers ``[batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n`` embeddings of both feed forward layers ``[batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n``
individually and concat them afterward to ``[batch_size, sequence_length, config.hidden_size]`` with individually and concat them afterward to ``[batch_size, sequence_length, config.hidden_size]`` with ``n =
``n = sequence_length``, which trades increased computation time against reduced memory use, but yields a sequence_length``, which trades increased computation time against reduced memory use, but yields a mathematically
mathematically **equivalent** result. **equivalent** result.
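Conceptually, the chunked computation looks like the following plain PyTorch sketch (an illustration of the equivalence, not the library's implementation):

.. code-block::

>>> import torch

>>> def chunked_feed_forward(hidden_states, feed_forward, chunk_size):
...     # split the sequence dimension into chunks, apply the feed forward layers to each
...     # chunk separately, then concatenate the results along the sequence dimension again
...     chunks = hidden_states.split(chunk_size, dim=1)
...     return torch.cat([feed_forward(chunk) for chunk in chunks], dim=1)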
For models employing the function :func:`~.transformers.apply_chunking_to_forward`, the ``chunk_size`` defines the For models employing the function :func:`~.transformers.apply_chunking_to_forward`, the ``chunk_size`` defines the
number of output embeddings that are computed in parallel and thus defines the trade-off between memory and time number of output embeddings that are computed in parallel and thus defines the trade-off between memory and time
......
...@@ -47,6 +47,7 @@ The documentation is organized in five parts: ...@@ -47,6 +47,7 @@ The documentation is organized in five parts:
- **RESEARCH** focuses on tutorials that have less to do with how to use the library but more about general research in - **RESEARCH** focuses on tutorials that have less to do with how to use the library but more about general research in
transformers models transformers models
- The last three sections contain the documentation of each public class and function, grouped in: - The last three sections contain the documentation of each public class and function, grouped in:
- **MAIN CLASSES** for the main classes exposing the important APIs of the library. - **MAIN CLASSES** for the main classes exposing the important APIs of the library.
- **MODELS** for the classes and functions related to each model implemented in the library. - **MODELS** for the classes and functions related to each model implemented in the library.
- **INTERNAL HELPERS** for the classes and functions we use internally. - **INTERNAL HELPERS** for the classes and functions we use internally.
......
...@@ -25,6 +25,7 @@ SpecialTokensMixin ...@@ -25,6 +25,7 @@ SpecialTokensMixin
Enums and namedtuples Enums and namedtuples
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.tokenization_utils_base.ExplicitEnum .. autoclass:: transformers.tokenization_utils_base.ExplicitEnum
.. autoclass:: transformers.tokenization_utils_base.PaddingStrategy .. autoclass:: transformers.tokenization_utils_base.PaddingStrategy
......
Pipelines Pipelines
----------------------------------------------------------------------------------------------------------------------- -----------------------------------------------------------------------------------------------------------------------
The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of
of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity
Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. See the Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. See the
:doc:`task summary <../task_summary>` for examples of use. :doc:`task summary <../task_summary>` for examples of use.
...@@ -26,8 +26,8 @@ There are two categories of pipeline abstractions to be aware about: ...@@ -26,8 +26,8 @@ There are two categories of pipeline abstractions to be aware about:
The pipeline abstraction The pipeline abstraction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated as any The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated as any other
other pipeline but requires an additional argument which is the `task`. pipeline but requires an additional argument which is the `task`.
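A minimal sketch of its usage (the task below is just one of the supported task strings):

.. code-block::

>>> from transformers import pipeline

>>> # instantiate a pipeline for a given task and call it on some text
>>> classifier = pipeline("sentiment-analysis")
>>> classifier("We are very happy to show you the 🤗 Transformers library.")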
.. autofunction:: transformers.pipeline .. autofunction:: transformers.pipeline
......
...@@ -8,8 +8,8 @@ Processors ...@@ -8,8 +8,8 @@ Processors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All processors follow the same architecture which is that of the All processors follow the same architecture which is that of the
:class:`~transformers.data.processors.utils.DataProcessor`. The processor returns a list :class:`~transformers.data.processors.utils.DataProcessor`. The processor returns a list of
of :class:`~transformers.data.processors.utils.InputExample`. These :class:`~transformers.data.processors.utils.InputExample`. These
:class:`~transformers.data.processors.utils.InputExample` can be converted to :class:`~transformers.data.processors.utils.InputExample` can be converted to
:class:`~transformers.data.processors.utils.InputFeatures` in order to be fed to the model. :class:`~transformers.data.processors.utils.InputFeatures` in order to be fed to the model.
...@@ -28,14 +28,16 @@ of :class:`~transformers.data.processors.utils.InputExample`. These ...@@ -28,14 +28,16 @@ of :class:`~transformers.data.processors.utils.InputExample`. These
GLUE GLUE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`General Language Understanding Evaluation (GLUE) <https://gluebenchmark.com/>`__ is a benchmark that evaluates `General Language Understanding Evaluation (GLUE) <https://gluebenchmark.com/>`__ is a benchmark that evaluates the
the performance of models across a diverse set of existing NLU tasks. It was released together with the paper performance of models across a diverse set of existing NLU tasks. It was released together with the paper `GLUE: A
`GLUE: A multi-task benchmark and analysis platform for natural language understanding <https://openreview.net/pdf?id=rJ4km2R5t7>`__ multi-task benchmark and analysis platform for natural language understanding
<https://openreview.net/pdf?id=rJ4km2R5t7>`__
This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB,
CoLA, SST2, STSB, QQP, QNLI, RTE and WNLI. QQP, QNLI, RTE and WNLI.
Those processors are: Those processors are:
- :class:`~transformers.data.processors.utils.MrpcProcessor` - :class:`~transformers.data.processors.utils.MrpcProcessor`
- :class:`~transformers.data.processors.utils.MnliProcessor` - :class:`~transformers.data.processors.utils.MnliProcessor`
- :class:`~transformers.data.processors.utils.MnliMismatchedProcessor` - :class:`~transformers.data.processors.utils.MnliMismatchedProcessor`
...@@ -54,36 +56,39 @@ Additionally, the following method can be used to load values from a data file ...@@ -54,36 +56,39 @@ Additionally, the following method can be used to load values from a data file
Example usage Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
An example using these processors is given in the `run_glue.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_glue.py>`__ script. An example using these processors is given in the `run_glue.py
<https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_glue.py>`__ script.
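In code, the typical flow is roughly the following sketch (data directory and hyper-parameters are illustrative):

.. code-block::

>>> from transformers import BertTokenizer, glue_convert_examples_to_features
>>> from transformers.data.processors.glue import MrpcProcessor

>>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
>>> processor = MrpcProcessor()
>>> # read InputExamples from the (illustrative) data directory, then turn them into InputFeatures
>>> examples = processor.get_train_examples("path/to/MRPC")
>>> features = glue_convert_examples_to_features(examples, tokenizer, max_length=128, task="mrpc")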
XNLI XNLI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`The Cross-Lingual NLI Corpus (XNLI) <https://www.nyu.edu/projects/bowman/xnli/>`__ is a benchmark that evaluates `The Cross-Lingual NLI Corpus (XNLI) <https://www.nyu.edu/projects/bowman/xnli/>`__ is a benchmark that evaluates the
the quality of cross-lingual text representations. quality of cross-lingual text representations. XNLI is a crowd-sourced dataset based on `MultiNLI
XNLI is a crowd-sourced dataset based on `MultiNLI <http://www.nyu.edu/projects/bowman/multinli/>`__: pairs of text are labeled with textual entailment <http://www.nyu.edu/projects/bowman/multinli/>`__: pairs of text are labeled with textual entailment annotations for 15
annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili). different languages (including both high-resource languages such as English and low-resource languages such as Swahili).
It was released together with the paper It was released together with the paper `XNLI: Evaluating Cross-lingual Sentence Representations
`XNLI: Evaluating Cross-lingual Sentence Representations <https://arxiv.org/abs/1809.05053>`__ <https://arxiv.org/abs/1809.05053>`__
This library hosts the processor to load the XNLI data: This library hosts the processor to load the XNLI data:
- :class:`~transformers.data.processors.utils.XnliProcessor` - :class:`~transformers.data.processors.utils.XnliProcessor`
Please note that since the gold labels are available on the test set, evaluation is performed on the test set. Please note that since the gold labels are available on the test set, evaluation is performed on the test set.
An example using these processors is given in the An example using these processors is given in the `run_xnli.py
`run_xnli.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_xnli.py>`__ script. <https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_xnli.py>`__ script.
SQuAD SQuAD
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`The Stanford Question Answering Dataset (SQuAD) <https://rajpurkar.github.io/SQuAD-explorer//>`__ is a benchmark that evaluates `The Stanford Question Answering Dataset (SQuAD) <https://rajpurkar.github.io/SQuAD-explorer//>`__ is a benchmark that
the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version
`SQuAD: 100,000+ Questions for Machine Comprehension of Text <https://arxiv.org/abs/1606.05250>`__. The second version (v2.0) was released alongside (v1.1) was released together with the paper `SQuAD: 100,000+ Questions for Machine Comprehension of Text
the paper `Know What You Don't Know: Unanswerable Questions for SQuAD <https://arxiv.org/abs/1806.03822>`__. <https://arxiv.org/abs/1606.05250>`__. The second version (v2.0) was released alongside the paper `Know What You Don't
Know: Unanswerable Questions for SQuAD <https://arxiv.org/abs/1806.03822>`__.
This library hosts a processor for each of the two versions: This library hosts a processor for each of the two versions:
...@@ -91,6 +96,7 @@ Processors ...@@ -91,6 +96,7 @@ Processors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Those processors are: Those processors are:
- :class:`~transformers.data.processors.utils.SquadV1Processor` - :class:`~transformers.data.processors.utils.SquadV1Processor`
- :class:`~transformers.data.processors.utils.SquadV2Processor` - :class:`~transformers.data.processors.utils.SquadV2Processor`
...@@ -99,17 +105,18 @@ They both inherit from the abstract class :class:`~transformers.data.processors. ...@@ -99,17 +105,18 @@ They both inherit from the abstract class :class:`~transformers.data.processors.
.. autoclass:: transformers.data.processors.squad.SquadProcessor .. autoclass:: transformers.data.processors.squad.SquadProcessor
:members: :members:
Additionally, the following method can be used to convert SQuAD examples into :class:`~transformers.data.processors.utils.SquadFeatures` Additionally, the following method can be used to convert SQuAD examples into
that can be used as model inputs. :class:`~transformers.data.processors.utils.SquadFeatures` that can be used as model inputs.
.. automethod:: transformers.data.processors.squad.squad_convert_examples_to_features .. automethod:: transformers.data.processors.squad.squad_convert_examples_to_features
These processors as well as the aforementioned method can be used with files containing the data as well as with the These processors as well as the aforementioned method can be used with files containing the data as well as with the
Examples are given below. `tensorflow_datasets` package. Examples are given below.
Example usage Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example using the processors as well as the conversion method using data files: Here is an example using the processors as well as the conversion method using data files:
.. code-block:: .. code-block::
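>>> # Sketch of the file-based flow (data directory and hyper-parameters are illustrative)
>>> from transformers import BertTokenizer
>>> from transformers.data.processors.squad import SquadV2Processor, squad_convert_examples_to_features

>>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
>>> processor = SquadV2Processor()
>>> examples = processor.get_dev_examples("path/to/squad")
>>> features = squad_convert_examples_to_features(
...     examples=examples,
...     tokenizer=tokenizer,
...     max_seq_length=384,
...     doc_stride=128,
...     max_query_length=64,
...     is_training=False,
... )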
...@@ -149,5 +156,5 @@ Using `tensorflow_datasets` is as easy as using a data file: ...@@ -149,5 +156,5 @@ Using `tensorflow_datasets` is as easy as using a data file:
) )
Another example using these processors is given in the Another example using these processors is given in the `run_squad.py
`run_squad.py <https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py>`__ script. <https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py>`__ script.
...@@ -29,11 +29,12 @@ methods for using all the tokenizers: ...@@ -29,11 +29,12 @@ methods for using all the tokenizers:
:class:`~transformers.BatchEncoding` holds the output of the tokenizer's encoding methods (``__call__``, :class:`~transformers.BatchEncoding` holds the output of the tokenizer's encoding methods (``__call__``,
``encode_plus`` and ``batch_encode_plus``) and is derived from a Python dictionary. When the tokenizer is a pure python ``encode_plus`` and ``batch_encode_plus``) and is derived from a Python dictionary. When the tokenizer is a pure python
tokenizer, this class behaves just like a standard python dictionary and holds the various model inputs computed by these tokenizer, this class behaves just like a standard python dictionary and holds the various model inputs computed by
methods (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e., backed by HuggingFace these methods (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e., backed by
`tokenizers library <https://github.com/huggingface/tokenizers>`__), this class provides in addition several advanced HuggingFace `tokenizers library <https://github.com/huggingface/tokenizers>`__), this class provides in addition
alignment methods which can be used to map between the original string (character and words) and the token space (e.g., several advanced alignment methods which can be used to map between the original string (character and words) and the
getting the index of the token comprising a given character or the span of characters corresponding to a given token). token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding
to a given token).
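For instance, with a fast tokenizer the alignment methods can be used along these lines (a sketch; the sentence is illustrative):

.. code-block::

>>> from transformers import BertTokenizerFast

>>> tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
>>> encoding = tokenizer("HuggingFace is based in NYC")
>>> encoding.tokens()          # the tokens produced for the input string
>>> encoding.char_to_token(5)  # index of the token containing the character at position 5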
PreTrainedTokenizer PreTrainedTokenizer
......
...@@ -19,14 +19,14 @@ downstream tasks. However, at some point further model increases become harder d ...@@ -19,14 +19,14 @@ downstream tasks. However, at some point further model increases become harder d
longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction
techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows
that our proposed methods lead to models that scale much better compared to the original BERT. We also use a that our proposed methods lead to models that scale much better compared to the original BERT. We also use a
self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks
tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and
RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.* SQuAD benchmarks while having fewer parameters compared to BERT-large.*
Tips: Tips:
- ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on - ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather
the right rather than the left. than the left.
- ALBERT uses repeating layers which results in a small memory footprint, however the computational cost remains - ALBERT uses repeating layers which results in a small memory footprint, however the computational cost remains
similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same
number of (repeating) layers. number of (repeating) layers.
......
...@@ -2,9 +2,8 @@ AutoClasses ...@@ -2,9 +2,8 @@ AutoClasses
----------------------------------------------------------------------------------------------------------------------- -----------------------------------------------------------------------------------------------------------------------
In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you
are supplying to the :obj:`from_pretrained()` method. are supplying to the :obj:`from_pretrained()` method. AutoClasses are here to do this job for you so that you
AutoClasses are here to do this job for you so that you automatically retrieve the relevant model given the name/path automatically retrieve the relevant model given the name/path to the pretrained weights/config/vocabulary.
to the pretrained weights/config/vocabulary.
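A short sketch (the checkpoint name is illustrative):

.. code-block::

>>> from transformers import AutoConfig, AutoModel, AutoTokenizer

>>> # the architecture-specific classes are picked from the checkpoint name
>>> config = AutoConfig.from_pretrained("bert-base-cased")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> model = AutoModel.from_pretrained("bert-base-cased")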
Instantiating one of :class:`~transformers.AutoConfig`, :class:`~transformers.AutoModel`, and Instantiating one of :class:`~transformers.AutoConfig`, :class:`~transformers.AutoModel`, and
:class:`~transformers.AutoTokenizer` will directly create a class of the relevant architecture. For instance :class:`~transformers.AutoTokenizer` will directly create a class of the relevant architecture. For instance
......