Unverified Commit 08f534d2 authored by Sylvain Gugger's avatar Sylvain Gugger Committed by GitHub
Browse files

Doc styling (#8067)

* Important files

* Styling them all

* Revert "Styling them all"

This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.

* Syling them for realsies

* Fix syntax error

* Fix benchmark_utils

* More fixes

* Fix modeling auto and script

* Remove new line

* Fixes

* More fixes

* Fix more files

* Style

* Add FSMT

* More fixes

* More fixes

* More fixes

* More fixes

* Fixes

* More fixes

* More fixes

* Last fixes

* Make sphinx happy
parent 04a17f85
...@@ -322,6 +322,7 @@ jobs: ...@@ -322,6 +322,7 @@ jobs:
- run: black --check examples templates tests src utils - run: black --check examples templates tests src utils
- run: isort --check-only examples templates tests src utils - run: isort --check-only examples templates tests src utils
- run: flake8 examples templates tests src utils - run: flake8 examples templates tests src utils
- run: python utils/style_doc.py src/transformers docs/source --max_len 119 --check_only
- run: python utils/check_copies.py - run: python utils/check_copies.py
- run: python utils/check_dummies.py - run: python utils/check_dummies.py
- run: python utils/check_repo.py - run: python utils/check_repo.py
......
...@@ -15,6 +15,7 @@ modified_only_fixup: ...@@ -15,6 +15,7 @@ modified_only_fixup:
black $(modified_py_files); \ black $(modified_py_files); \
isort $(modified_py_files); \ isort $(modified_py_files); \
flake8 $(modified_py_files); \ flake8 $(modified_py_files); \
python utils/style_doc.py $(modified_py_files) --max_len 119; \
else \ else \
echo "No library .py files were modified"; \ echo "No library .py files were modified"; \
fi fi
...@@ -31,6 +32,7 @@ quality: ...@@ -31,6 +32,7 @@ quality:
black --check $(check_dirs) black --check $(check_dirs)
isort --check-only $(check_dirs) isort --check-only $(check_dirs)
flake8 $(check_dirs) flake8 $(check_dirs)
python utils/style_doc.py src/transformers docs/source --max_len 119 --check_only
${MAKE} extra_quality_checks ${MAKE} extra_quality_checks
# Format source code automatically and check is there are any problems left that need manual fixing # Format source code automatically and check is there are any problems left that need manual fixing
...@@ -38,6 +40,7 @@ quality: ...@@ -38,6 +40,7 @@ quality:
style: style:
black $(check_dirs) black $(check_dirs)
isort $(check_dirs) isort $(check_dirs)
python utils/style_doc.py src/transformers docs/source --max_len 119
# Super fast fix and check target that only works on relevant modified files since the branch was made # Super fast fix and check target that only works on relevant modified files since the branch was made
......
...@@ -180,7 +180,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. ...@@ -180,7 +180,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1. **[Longformer](https://huggingface.co/transformers/model_doc/longformer.html)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan. 1. **[Longformer](https://huggingface.co/transformers/model_doc/longformer.html)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
1. **[LXMERT](https://huggingface.co/transformers/model_doc/lxmert.html)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal. 1. **[LXMERT](https://huggingface.co/transformers/model_doc/lxmert.html)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal.
1. **[MarianMT](https://huggingface.co/transformers/model_doc/marian.html)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team. 1. **[MarianMT](https://huggingface.co/transformers/model_doc/marian.html)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
1. **[MBart](https://huggingface.co/transformers/model_doc/mbart.html)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. 1. **[MBart](https://huggingface.co/transformers/model_doc/mbart.html)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
1. **[Pegasus](https://huggingface.co/transformers/model_doc/pegasus.html)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777)> by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu. 1. **[Pegasus](https://huggingface.co/transformers/model_doc/pegasus.html)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777)> by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[ProphetNet](https://huggingface.co/transformers/model_doc/prophetnet.html)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. 1. **[ProphetNet](https://huggingface.co/transformers/model_doc/prophetnet.html)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya. 1. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
......
...@@ -3,21 +3,27 @@ Benchmarks ...@@ -3,21 +3,27 @@ Benchmarks
Let's take a look at how 🤗 Transformer models can be benchmarked, best practices, and already available benchmarks. Let's take a look at how 🤗 Transformer models can be benchmarked, best practices, and already available benchmarks.
A notebook explaining in more detail how to benchmark 🤗 Transformer models can be found `here <https://github.com/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb>`__. A notebook explaining in more detail how to benchmark 🤗 Transformer models can be found `here
<https://github.com/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb>`__.
How to benchmark 🤗 Transformer models How to benchmark 🤗 Transformer models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` allow to flexibly benchmark 🤗 Transformer models. The classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` allow to flexibly
The benchmark classes allow us to measure the `peak memory usage` and `required time` for both benchmark 🤗 Transformer models. The benchmark classes allow us to measure the `peak memory usage` and `required time`
`inference` and `training`. for both `inference` and `training`.
.. note:: .. note::
Hereby, `inference` is defined by a single forward pass, and `training` is defined by a single forward pass and backward pass. Hereby, `inference` is defined by a single forward pass, and `training` is defined by a single forward pass and
backward pass.
The benchmark classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` expect an object of type :class:`~transformers.PyTorchBenchmarkArguments` and :class:`~transformers.TensorFlowBenchmarkArguments`, respectively, for instantiation. :class:`~transformers.PyTorchBenchmarkArguments` and :class:`~transformers.TensorFlowBenchmarkArguments` are data classes and contain all relevant configurations for their corresponding benchmark class. The benchmark classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` expect an
In the following example, it is shown how a BERT model of type `bert-base-cased` can be benchmarked. object of type :class:`~transformers.PyTorchBenchmarkArguments` and
:class:`~transformers.TensorFlowBenchmarkArguments`, respectively, for instantiation.
:class:`~transformers.PyTorchBenchmarkArguments` and :class:`~transformers.TensorFlowBenchmarkArguments` are data
classes and contain all relevant configurations for their corresponding benchmark class. In the following example, it
is shown how a BERT model of type `bert-base-cased` can be benchmarked.
.. code-block:: .. code-block::
...@@ -34,11 +40,15 @@ In the following example, it is shown how a BERT model of type `bert-base-cased` ...@@ -34,11 +40,15 @@ In the following example, it is shown how a BERT model of type `bert-base-cased`
>>> benchmark = TensorFlowBenchmark(args) >>> benchmark = TensorFlowBenchmark(args)
Here, three arguments are given to the benchmark argument data classes, namely ``models``, ``batch_sizes``, and ``sequence_lengths``. The argument ``models`` is required and expects a :obj:`list` of model identifiers from the `model hub <https://huggingface.co/models>`__ Here, three arguments are given to the benchmark argument data classes, namely ``models``, ``batch_sizes``, and
The :obj:`list` arguments ``batch_sizes`` and ``sequence_lengths`` define the size of the ``input_ids`` on which the model is benchmarked. ``sequence_lengths``. The argument ``models`` is required and expects a :obj:`list` of model identifiers from the
There are many more parameters that can be configured via the benchmark argument data classes. For more detail on these one can either directly consult the files `model hub <https://huggingface.co/models>`__ The :obj:`list` arguments ``batch_sizes`` and ``sequence_lengths`` define
``src/transformers/benchmark/benchmark_args_utils.py``, ``src/transformers/benchmark/benchmark_args.py`` (for PyTorch) and ``src/transformers/benchmark/benchmark_args_tf.py`` (for Tensorflow). the size of the ``input_ids`` on which the model is benchmarked. There are many more parameters that can be configured
Alternatively, running the following shell commands from root will print out a descriptive list of all configurable parameters for PyTorch and Tensorflow respectively. via the benchmark argument data classes. For more detail on these one can either directly consult the files
``src/transformers/benchmark/benchmark_args_utils.py``, ``src/transformers/benchmark/benchmark_args.py`` (for PyTorch)
and ``src/transformers/benchmark/benchmark_args_tf.py`` (for Tensorflow). Alternatively, running the following shell
commands from root will print out a descriptive list of all configurable parameters for PyTorch and Tensorflow
respectively.
.. code-block:: bash .. code-block:: bash
...@@ -65,7 +75,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r ...@@ -65,7 +75,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
bert-base-uncased 8 128 0.018 bert-base-uncased 8 128 0.018
bert-base-uncased 8 512 0.088 bert-base-uncased 8 512 0.088
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ==================== ==================== INFERENCE - MEMORY - RESULT ====================
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB Model Name Batch Size Seq Length Memory in MB
...@@ -75,7 +85,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r ...@@ -75,7 +85,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
bert-base-uncased 8 128 1307 bert-base-uncased 8 128 1307
bert-base-uncased 8 512 1539 bert-base-uncased 8 512 1539
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ==================== ==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0 - transformers_version: 2.11.0
- framework: PyTorch - framework: PyTorch
...@@ -98,7 +108,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r ...@@ -98,7 +108,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
- gpu_power_watts: 280.0 - gpu_power_watts: 280.0
- gpu_performance_state: 2 - gpu_performance_state: 2
- use_tpu: False - use_tpu: False
>>> ## TENSORFLOW CODE >>> ## TENSORFLOW CODE
>>> results = benchmark.run() >>> results = benchmark.run()
>>> print(results) >>> print(results)
...@@ -111,7 +121,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r ...@@ -111,7 +121,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
bert-base-uncased 8 128 0.022 bert-base-uncased 8 128 0.022
bert-base-uncased 8 512 0.105 bert-base-uncased 8 512 0.105
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ==================== ==================== INFERENCE - MEMORY - RESULT ====================
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB Model Name Batch Size Seq Length Memory in MB
...@@ -121,7 +131,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r ...@@ -121,7 +131,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
bert-base-uncased 8 128 1330 bert-base-uncased 8 128 1330
bert-base-uncased 8 512 1770 bert-base-uncased 8 512 1770
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ==================== ==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0 - transformers_version: 2.11.0
- framework: Tensorflow - framework: Tensorflow
...@@ -145,14 +155,17 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r ...@@ -145,14 +155,17 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
- gpu_performance_state: 2 - gpu_performance_state: 2
- use_tpu: False - use_tpu: False
By default, the `time` and the `required memory` for `inference` are benchmarked. By default, the `time` and the `required memory` for `inference` are benchmarked. In the example output above the first
In the example output above the first two sections show the result corresponding to `inference time` and `inference memory`. two sections show the result corresponding to `inference time` and `inference memory`. In addition, all relevant
In addition, all relevant information about the computing environment, `e.g.` the GPU type, the system, the library versions, etc... are printed out in the third section under `ENVIRONMENT INFORMATION`. information about the computing environment, `e.g.` the GPU type, the system, the library versions, etc... are printed
This information can optionally be saved in a `.csv` file when adding the argument :obj:`save_to_csv=True` to :class:`~transformers.PyTorchBenchmarkArguments` and :class:`~transformers.TensorFlowBenchmarkArguments` respectively. out in the third section under `ENVIRONMENT INFORMATION`. This information can optionally be saved in a `.csv` file
In this case, every section is saved in a separate `.csv` file. The path to each `.csv` file can optionally be defined via the argument data classes. when adding the argument :obj:`save_to_csv=True` to :class:`~transformers.PyTorchBenchmarkArguments` and
:class:`~transformers.TensorFlowBenchmarkArguments` respectively. In this case, every section is saved in a separate
`.csv` file. The path to each `.csv` file can optionally be defined via the argument data classes.
Instead of benchmarking pre-trained models via their model identifier, `e.g.` `bert-base-uncased`, the user can alternatively benchmark an arbitrary configuration of any available model class. Instead of benchmarking pre-trained models via their model identifier, `e.g.` `bert-base-uncased`, the user can
In this case, a :obj:`list` of configurations must be inserted with the benchmark args as follows. alternatively benchmark an arbitrary configuration of any available model class. In this case, a :obj:`list` of
configurations must be inserted with the benchmark args as follows.
.. code-block:: .. code-block::
...@@ -183,7 +196,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar ...@@ -183,7 +196,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar
bert-6-lay 8 128 0.009 bert-6-lay 8 128 0.009
bert-6-lay 8 512 0.044 bert-6-lay 8 512 0.044
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ==================== ==================== INFERENCE - MEMORY - RESULT ====================
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB Model Name Batch Size Seq Length Memory in MB
...@@ -201,7 +214,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar ...@@ -201,7 +214,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar
bert-6-lay 8 128 1127 bert-6-lay 8 128 1127
bert-6-lay 8 512 1359 bert-6-lay 8 512 1359
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ==================== ==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0 - transformers_version: 2.11.0
- framework: PyTorch - framework: PyTorch
...@@ -252,7 +265,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar ...@@ -252,7 +265,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar
bert-6-lay 8 128 0.0011 bert-6-lay 8 128 0.0011
bert-6-lay 8 512 0.074 bert-6-lay 8 512 0.074
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ==================== ==================== INFERENCE - MEMORY - RESULT ====================
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB Model Name Batch Size Seq Length Memory in MB
...@@ -270,7 +283,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar ...@@ -270,7 +283,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar
bert-6-lay 8 128 1330 bert-6-lay 8 128 1330
bert-6-lay 8 512 1540 bert-6-lay 8 512 1540
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ==================== ==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0 - transformers_version: 2.11.0
- framework: Tensorflow - framework: Tensorflow
...@@ -295,8 +308,9 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar ...@@ -295,8 +308,9 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar
- use_tpu: False - use_tpu: False
Again, `inference time` and `required memory` for `inference` are measured, but this time for customized configurations of the :obj:`BertModel` class. This feature can especially be helpful when Again, `inference time` and `required memory` for `inference` are measured, but this time for customized configurations
deciding for which configuration the model should be trained. of the :obj:`BertModel` class. This feature can especially be helpful when deciding for which configuration the model
should be trained.
Benchmark best practices Benchmark best practices
...@@ -304,19 +318,28 @@ Benchmark best practices ...@@ -304,19 +318,28 @@ Benchmark best practices
This section lists a couple of best practices one should be aware of when benchmarking a model. This section lists a couple of best practices one should be aware of when benchmarking a model.
- Currently, only single device benchmarking is supported. When benchmarking on GPU, it is recommended that the user - Currently, only single device benchmarking is supported. When benchmarking on GPU, it is recommended that the user
specifies on which device the code should be run by setting the ``CUDA_VISIBLE_DEVICES`` environment variable in the shell, `e.g.` ``export CUDA_VISIBLE_DEVICES=0`` before running the code. specifies on which device the code should be run by setting the ``CUDA_VISIBLE_DEVICES`` environment variable in the
- The option :obj:`no_multi_processing` should only be set to :obj:`True` for testing and debugging. To ensure accurate memory measurement it is recommended to run each memory benchmark in a separate process by making sure :obj:`no_multi_processing` is set to :obj:`True`. shell, `e.g.` ``export CUDA_VISIBLE_DEVICES=0`` before running the code.
- One should always state the environment information when sharing the results of a model benchmark. Results can vary heavily between different GPU devices, library versions, etc., so that benchmark results on their own are not very useful for the community. - The option :obj:`no_multi_processing` should only be set to :obj:`True` for testing and debugging. To ensure accurate
memory measurement it is recommended to run each memory benchmark in a separate process by making sure
:obj:`no_multi_processing` is set to :obj:`True`.
- One should always state the environment information when sharing the results of a model benchmark. Results can vary
heavily between different GPU devices, library versions, etc., so that benchmark results on their own are not very
useful for the community.
Sharing your benchmark Sharing your benchmark
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Previously all available core models (10 at the time) have been benchmarked for `inference time`, across many different settings: using PyTorch, with Previously all available core models (10 at the time) have been benchmarked for `inference time`, across many different
and without TorchScript, using TensorFlow, with and without XLA. All of those tests were done across CPUs (except for settings: using PyTorch, with and without TorchScript, using TensorFlow, with and without XLA. All of those tests were
TensorFlow XLA) and GPUs. done across CPUs (except for TensorFlow XLA) and GPUs.
The approach is detailed in the `following blogpost <https://medium.com/huggingface/benchmarking-transformers-pytorch-and-tensorflow-e2917fb891c2>`__ and the results are available `here <https://docs.google.com/spreadsheets/d/1sryqufw2D0XlUH4sq3e9Wnxu5EAQkaohzrJbd5HdQ_w/edit?usp=sharing>`__. The approach is detailed in the `following blogpost
<https://medium.com/huggingface/benchmarking-transformers-pytorch-and-tensorflow-e2917fb891c2>`__ and the results are
available `here
<https://docs.google.com/spreadsheets/d/1sryqufw2D0XlUH4sq3e9Wnxu5EAQkaohzrJbd5HdQ_w/edit?usp=sharing>`__.
With the new `benchmark` tools, it is easier than ever to share your benchmark results with the community `here <https://github.com/huggingface/transformers/blob/master/examples/benchmarking/README.md>`__. With the new `benchmark` tools, it is easier than ever to share your benchmark results with the community `here
<https://github.com/huggingface/transformers/blob/master/examples/benchmarking/README.md>`__.
BERTology BERTology
----------------------------------------------------------------------------------------------------------------------- -----------------------------------------------------------------------------------------------------------------------
There is a growing field of study concerned with investigating the inner working of large-scale transformers like BERT (that some call "BERTology"). Some good examples of this field are: There is a growing field of study concerned with investigating the inner working of large-scale transformers like BERT
(that some call "BERTology"). Some good examples of this field are:
* BERT Rediscovers the Classical NLP Pipeline by Ian Tenney, Dipanjan Das, Ellie Pavlick: https://arxiv.org/abs/1905.05950 * BERT Rediscovers the Classical NLP Pipeline by Ian Tenney, Dipanjan Das, Ellie Pavlick:
https://arxiv.org/abs/1905.05950
* Are Sixteen Heads Really Better than One? by Paul Michel, Omer Levy, Graham Neubig: https://arxiv.org/abs/1905.10650 * Are Sixteen Heads Really Better than One? by Paul Michel, Omer Levy, Graham Neubig: https://arxiv.org/abs/1905.10650
* What Does BERT Look At? An Analysis of BERT's Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning: https://arxiv.org/abs/1906.04341 * What Does BERT Look At? An Analysis of BERT's Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D.
Manning: https://arxiv.org/abs/1906.04341
In order to help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models to help people access the inner representations, mainly adapted from the great work of Paul Michel (https://arxiv.org/abs/1905.10650): In order to help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models to
help people access the inner representations, mainly adapted from the great work of Paul Michel
(https://arxiv.org/abs/1905.10650):
* accessing all the hidden-states of BERT/GPT/GPT-2, * accessing all the hidden-states of BERT/GPT/GPT-2,
* accessing all the attention weights for each head of BERT/GPT/GPT-2, * accessing all the attention weights for each head of BERT/GPT/GPT-2,
* retrieving heads output values and gradients to be able to compute head importance score and prune head as explained in https://arxiv.org/abs/1905.10650. * retrieving heads output values and gradients to be able to compute head importance score and prune head as explained
in https://arxiv.org/abs/1905.10650.
To help you understand and use these features, we have added a specific example script: `bertology.py <https://github.com/huggingface/transformers/blob/master/examples/bertology/run_bertology.py>`_ while extract information and prune a model pre-trained on GLUE. To help you understand and use these features, we have added a specific example script: `bertology.py
<https://github.com/huggingface/transformers/blob/master/examples/bertology/run_bertology.py>`_ while extract
information and prune a model pre-trained on GLUE.
Converting Tensorflow Checkpoints Converting Tensorflow Checkpoints
======================================================================================================================= =======================================================================================================================
A command-line interface is provided to convert original Bert/GPT/GPT-2/Transformer-XL/XLNet/XLM checkpoints in models than be loaded using the ``from_pretrained`` methods of the library. A command-line interface is provided to convert original Bert/GPT/GPT-2/Transformer-XL/XLNet/XLM checkpoints in models
than be loaded using the ``from_pretrained`` methods of the library.
.. note:: .. note::
Since 2.3.0 the conversion script is now part of the transformers CLI (**transformers-cli**) Since 2.3.0 the conversion script is now part of the transformers CLI (**transformers-cli**) available in any
available in any transformers >= 2.3.0 installation. transformers >= 2.3.0 installation.
The documentation below reflects the **transformers-cli convert** command format. The documentation below reflects the **transformers-cli convert** command format.
BERT BERT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google <https://github.com/google-research/bert#pre-trained-models>`_\ ) in a PyTorch save file by using the `convert_bert_original_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_ script. You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google
<https://github.com/google-research/bert#pre-trained-models>`_\ ) in a PyTorch save file by using the
This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated configuration file (\ ``bert_config.json``\ ), and creates a PyTorch model for this configuration, loads the weights from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that can be imported using ``torch.load()`` (see examples in `run_bert_extract_features.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_extract_features.py>`_\ , `run_bert_classifier.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and `run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\ ). `convert_bert_original_tf_checkpoint_to_pytorch.py
<https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_
You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow checkpoint (the three files starting with ``bert_model.ckpt``\ ) but be sure to keep the configuration file (\ ``bert_config.json``\ ) and the vocabulary file (\ ``vocab.txt``\ ) as these are needed for the PyTorch model too. script.
To run this specific conversion script you will need to have TensorFlow and PyTorch installed (\ ``pip install tensorflow``\ ). The rest of the repository only requires PyTorch. This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated
configuration file (\ ``bert_config.json``\ ), and creates a PyTorch model for this configuration, loads the weights
from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that
can be imported using ``torch.load()`` (see examples in `run_bert_extract_features.py
<https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_extract_features.py>`_\ ,
`run_bert_classifier.py
<https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and
`run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\
).
You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow
checkpoint (the three files starting with ``bert_model.ckpt``\ ) but be sure to keep the configuration file (\
``bert_config.json``\ ) and the vocabulary file (\ ``vocab.txt``\ ) as these are needed for the PyTorch model too.
To run this specific conversion script you will need to have TensorFlow and PyTorch installed (\ ``pip install
tensorflow``\ ). The rest of the repository only requires PyTorch.
Here is an example of the conversion process for a pre-trained ``BERT-Base Uncased`` model: Here is an example of the conversion process for a pre-trained ``BERT-Base Uncased`` model:
...@@ -31,14 +47,20 @@ Here is an example of the conversion process for a pre-trained ``BERT-Base Uncas ...@@ -31,14 +47,20 @@ Here is an example of the conversion process for a pre-trained ``BERT-Base Uncas
--config $BERT_BASE_DIR/bert_config.json \ --config $BERT_BASE_DIR/bert_config.json \
--pytorch_dump_output $BERT_BASE_DIR/pytorch_model.bin --pytorch_dump_output $BERT_BASE_DIR/pytorch_model.bin
You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/bert#pre-trained-models>`__. You can download Google's pre-trained models for the conversion `here
<https://github.com/google-research/bert#pre-trained-models>`__.
ALBERT ALBERT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Convert TensorFlow model checkpoints of ALBERT to PyTorch using the `convert_albert_original_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_ script. Convert TensorFlow model checkpoints of ALBERT to PyTorch using the
`convert_albert_original_tf_checkpoint_to_pytorch.py
<https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_
script.
The CLI takes as input a TensorFlow checkpoint (three files starting with ``model.ckpt-best``\ ) and the accompanying configuration file (\ ``albert_config.json``\ ), then creates and saves a PyTorch model. To run this conversion you will need to have TensorFlow and PyTorch installed. The CLI takes as input a TensorFlow checkpoint (three files starting with ``model.ckpt-best``\ ) and the accompanying
configuration file (\ ``albert_config.json``\ ), then creates and saves a PyTorch model. To run this conversion you
will need to have TensorFlow and PyTorch installed.
Here is an example of the conversion process for the pre-trained ``ALBERT Base`` model: Here is an example of the conversion process for the pre-trained ``ALBERT Base`` model:
...@@ -51,12 +73,15 @@ Here is an example of the conversion process for the pre-trained ``ALBERT Base`` ...@@ -51,12 +73,15 @@ Here is an example of the conversion process for the pre-trained ``ALBERT Base``
--config $ALBERT_BASE_DIR/albert_config.json \ --config $ALBERT_BASE_DIR/albert_config.json \
--pytorch_dump_output $ALBERT_BASE_DIR/pytorch_model.bin --pytorch_dump_output $ALBERT_BASE_DIR/pytorch_model.bin
You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/albert#pre-trained-models>`__. You can download Google's pre-trained models for the conversion `here
<https://github.com/google-research/albert#pre-trained-models>`__.
OpenAI GPT OpenAI GPT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained OpenAI GPT model, assuming that your NumPy checkpoint save as the same format than OpenAI pretrained model (see `here <https://github.com/openai/finetune-transformer-lm>`__\ ) Here is an example of the conversion process for a pre-trained OpenAI GPT model, assuming that your NumPy checkpoint
save as the same format than OpenAI pretrained model (see `here <https://github.com/openai/finetune-transformer-lm>`__\
)
.. code-block:: shell .. code-block:: shell
...@@ -72,7 +97,8 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT model, ...@@ -72,7 +97,8 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT model,
OpenAI GPT-2 OpenAI GPT-2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model (see `here <https://github.com/openai/gpt-2>`__\ ) Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model (see `here
<https://github.com/openai/gpt-2>`__\ )
.. code-block:: shell .. code-block:: shell
...@@ -87,7 +113,8 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT-2 mode ...@@ -87,7 +113,8 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT-2 mode
Transformer-XL Transformer-XL
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained Transformer-XL model (see `here <https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models>`__\ ) Here is an example of the conversion process for a pre-trained Transformer-XL model (see `here
<https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models>`__\ )
.. code-block:: shell .. code-block:: shell
...@@ -130,4 +157,4 @@ Here is an example of the conversion process for a pre-trained XLM model: ...@@ -130,4 +157,4 @@ Here is an example of the conversion process for a pre-trained XLM model:
--tf_checkpoint $XLM_CHECKPOINT_PATH \ --tf_checkpoint $XLM_CHECKPOINT_PATH \
--pytorch_dump_output $PYTORCH_DUMP_OUTPUT --pytorch_dump_output $PYTORCH_DUMP_OUTPUT
[--config XML_CONFIG] \ [--config XML_CONFIG] \
[--finetuning_task_name XML_FINETUNED_TASK] [--finetuning_task_name XML_FINETUNED_TASK]
\ No newline at end of file
This diff is collapsed.
...@@ -57,8 +57,8 @@ The tokenizer takes care of splitting the sequence into tokens available in the ...@@ -57,8 +57,8 @@ The tokenizer takes care of splitting the sequence into tokens available in the
>>> tokenized_sequence = tokenizer.tokenize(sequence) >>> tokenized_sequence = tokenizer.tokenize(sequence)
The tokens are either words or subwords. Here for instance, "VRAM" wasn't in the model vocabulary, so it's been split The tokens are either words or subwords. Here for instance, "VRAM" wasn't in the model vocabulary, so it's been split
in "V", "RA" and "M". To indicate those tokens are not separate words but parts of the same word, a double-hash prefix is in "V", "RA" and "M". To indicate those tokens are not separate words but parts of the same word, a double-hash prefix
added for "RA" and "M": is added for "RA" and "M":
.. code-block:: .. code-block::
...@@ -66,8 +66,8 @@ added for "RA" and "M": ...@@ -66,8 +66,8 @@ added for "RA" and "M":
['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M'] ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding
the sentence to the tokenizer, which leverages the Rust implementation of the sentence to the tokenizer, which leverages the Rust implementation of `huggingface/tokenizers
`huggingface/tokenizers <https://github.com/huggingface/tokenizers>`__ for peak performance. <https://github.com/huggingface/tokenizers>`__ for peak performance.
.. code-block:: .. code-block::
...@@ -105,8 +105,8 @@ because this is the way a :class:`~transformers.BertModel` is going to expect it ...@@ -105,8 +105,8 @@ because this is the way a :class:`~transformers.BertModel` is going to expect it
Attention mask Attention mask
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The attention mask is an optional argument used when batching sequences together. This argument indicates to the The attention mask is an optional argument used when batching sequences together. This argument indicates to the model
model which tokens should be attended to, and which should not. which tokens should be attended to, and which should not.
For example, consider these two sequences: For example, consider these two sequences:
...@@ -145,10 +145,10 @@ We can see that 0s have been added on the right of the first sentence to make it ...@@ -145,10 +145,10 @@ We can see that 0s have been added on the right of the first sentence to make it
>>> padded_sequences["input_ids"] >>> padded_sequences["input_ids"]
[[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]] [[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]
This can then be converted into a tensor in PyTorch or TensorFlow. The attention mask is a binary tensor indicating This can then be converted into a tensor in PyTorch or TensorFlow. The attention mask is a binary tensor indicating the
the position of the padded indices so that the model does not attend to them. For the position of the padded indices so that the model does not attend to them. For the :class:`~transformers.BertTokenizer`,
:class:`~transformers.BertTokenizer`, :obj:`1` indicates a value that should be attended to, while :obj:`0` indicates :obj:`1` indicates a value that should be attended to, while :obj:`0` indicates a padded value. This attention mask is
a padded value. This attention mask is in the dictionary returned by the tokenizer under the key "attention_mask": in the dictionary returned by the tokenizer under the key "attention_mask":
.. code-block:: .. code-block::
...@@ -161,15 +161,16 @@ Token Type IDs ...@@ -161,15 +161,16 @@ Token Type IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some models' purpose is to do sequence classification or question answering. These require two different sequences to Some models' purpose is to do sequence classification or question answering. These require two different sequences to
be joined in a single "input_ids" entry, which usually is performed with the help of special tokens, such as the classifier (``[CLS]``) and separator (``[SEP]``) be joined in a single "input_ids" entry, which usually is performed with the help of special tokens, such as the
tokens. For example, the BERT model builds its two sequence input as such: classifier (``[CLS]``) and separator (``[SEP]``) tokens. For example, the BERT model builds its two sequence input as
such:
.. code-block:: .. code-block::
>>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP] >>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]
We can use our tokenizer to automatically generate such a sentence by passing the two sequences to ``tokenizer`` as two arguments (and We can use our tokenizer to automatically generate such a sentence by passing the two sequences to ``tokenizer`` as two
not a list, like before) like this: arguments (and not a list, like before) like this:
.. code-block:: .. code-block::
...@@ -189,8 +190,8 @@ which will return: ...@@ -189,8 +190,8 @@ which will return:
[CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP] [CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
This is enough for some models to understand where one sequence ends and where another begins. However, other models, This is enough for some models to understand where one sequence ends and where another begins. However, other models,
such as BERT, also deploy token type IDs (also called segment IDs). They are represented as a binary such as BERT, also deploy token type IDs (also called segment IDs). They are represented as a binary mask identifying
mask identifying the two types of sequence in the model. the two types of sequence in the model.
The tokenizer returns this mask as the "token_type_ids" entry: The tokenizer returns this mask as the "token_type_ids" entry:
...@@ -209,14 +210,15 @@ Some models, like :class:`~transformers.XLNetModel` use an additional token repr ...@@ -209,14 +210,15 @@ Some models, like :class:`~transformers.XLNetModel` use an additional token repr
Position IDs Position IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Contrary to RNNs that have the position of each token embedded within them, Contrary to RNNs that have the position of each token embedded within them, transformers are unaware of the position of
transformers are unaware of the position of each token. Therefore, the position IDs (``position_ids``) are used by the model to identify each token's position in the list of tokens. each token. Therefore, the position IDs (``position_ids``) are used by the model to identify each token's position in
the list of tokens.
They are an optional parameter. If no ``position_ids`` is passed to the model, the IDs are automatically created as absolute They are an optional parameter. If no ``position_ids`` is passed to the model, the IDs are automatically created as
positional embeddings. absolute positional embeddings.
Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models use
use other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings. other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.
.. _labels: .. _labels:
...@@ -224,43 +226,41 @@ Labels ...@@ -224,43 +226,41 @@ Labels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The labels are an optional argument which can be passed in order for the model to compute the loss itself. These labels The labels are an optional argument which can be passed in order for the model to compute the loss itself. These labels
should be the expected prediction of the model: it will use the standard loss in order to compute the loss between should be the expected prediction of the model: it will use the standard loss in order to compute the loss between its
its predictions and the expected value (the label). predictions and the expected value (the label).
These labels are different according to the model head, for example: These labels are different according to the model head, for example:
- For sequence classification models (e.g., :class:`~transformers.BertForSequenceClassification`), the model expects - For sequence classification models (e.g., :class:`~transformers.BertForSequenceClassification`), the model expects a
a tensor of dimension :obj:`(batch_size)` with each value of the batch corresponding to the expected label of the tensor of dimension :obj:`(batch_size)` with each value of the batch corresponding to the expected label of the
entire sequence. entire sequence.
- For token classification models (e.g., :class:`~transformers.BertForTokenClassification`), the model expects - For token classification models (e.g., :class:`~transformers.BertForTokenClassification`), the model expects a tensor
a tensor of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each individual
individual token. token.
- For masked language modeling (e.g., :class:`~transformers.BertForMaskedLM`), the model expects - For masked language modeling (e.g., :class:`~transformers.BertForMaskedLM`), the model expects a tensor of dimension
a tensor of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each individual token: the
individual token: the labels being the token ID for the masked token, and values to be ignored for the rest (usually labels being the token ID for the masked token, and values to be ignored for the rest (usually -100).
-100).
- For sequence to sequence tasks,(e.g., :class:`~transformers.BartForConditionalGeneration`, - For sequence to sequence tasks,(e.g., :class:`~transformers.BartForConditionalGeneration`,
:class:`~transformers.MBartForConditionalGeneration`), the model expects a tensor of dimension :class:`~transformers.MBartForConditionalGeneration`), the model expects a tensor of dimension :obj:`(batch_size,
:obj:`(batch_size, tgt_seq_length)` with each value corresponding to the target sequences associated with each tgt_seq_length)` with each value corresponding to the target sequences associated with each input sequence. During
input sequence. During training, both `BART` and `T5` will make the appropriate `decoder_input_ids` and decoder training, both `BART` and `T5` will make the appropriate `decoder_input_ids` and decoder attention masks internally.
attention masks internally. They usually do not need to be supplied. This does not apply to models leveraging the They usually do not need to be supplied. This does not apply to models leveraging the Encoder-Decoder framework. See
Encoder-Decoder framework. the documentation of each model for more information on each specific model's labels.
See the documentation of each model for more information on each specific model's labels.
The base models (e.g., :class:`~transformers.BertModel`) do not accept labels, as these are the base transformer models, The base models (e.g., :class:`~transformers.BertModel`) do not accept labels, as these are the base transformer
simply outputting features. models, simply outputting features.
.. _decoder-input-ids: .. _decoder-input-ids:
Decoder input IDs Decoder input IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder. This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder. These
These inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually built in a
built in a way specific to each model. way specific to each model.
Most encoder-decoder models (BART, T5) create their :obj:`decoder_input_ids` on their own from the :obj:`labels`. Most encoder-decoder models (BART, T5) create their :obj:`decoder_input_ids` on their own from the :obj:`labels`. In
In such models, passing the :obj:`labels` is the preferred way to handle training. such models, passing the :obj:`labels` is the preferred way to handle training.
Please check each model's docs to see how they handle these input IDs for sequence to sequence training. Please check each model's docs to see how they handle these input IDs for sequence to sequence training.
...@@ -270,18 +270,18 @@ Feed Forward Chunking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In each residual attention block in transformers the self-attention layer is usually followed by 2 feed forward layers.
The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (e.g., for
``bert-base-uncased``).

For an input of size ``[batch_size, sequence_length]``, the memory required to store the intermediate feed forward
embeddings ``[batch_size, sequence_length, config.intermediate_size]`` can account for a large fraction of the memory
use. The authors of `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_ noticed that since the
computation is independent of the ``sequence_length`` dimension, it is mathematically equivalent to compute the output
embeddings of both feed forward layers ``[batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n``
individually and concat them afterward to ``[batch_size, sequence_length, config.hidden_size]`` with ``n =
sequence_length``, which trades increased computation time against reduced memory use, but yields a mathematically
**equivalent** result.

For models employing the function :func:`~.transformers.apply_chunking_to_forward`, the ``chunk_size`` defines the
number of output embeddings that are computed in parallel and thus defines the trade-off between memory and time
complexity. If ``chunk_size`` is set to 0, no feed forward chunking is done.
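A sketch of how a chunked feed forward block could look, assuming the ``forward_fn, chunk_size, chunk_dim,
*input_tensors`` argument order of ``apply_chunking_to_forward`` and its location in ``transformers.modeling_utils``;
the module, sizes and chunk size below are made up for illustration:

.. code-block::

    import torch
    from torch import nn

    from transformers.modeling_utils import apply_chunking_to_forward


    class ChunkedFeedForward(nn.Module):
        # Hypothetical block, not a class from the library
        def __init__(self, hidden_size=768, intermediate_size=3072, chunk_size=64):
            super().__init__()
            self.dense_in = nn.Linear(hidden_size, intermediate_size)
            self.dense_out = nn.Linear(intermediate_size, hidden_size)
            self.chunk_size = chunk_size  # 0 would disable chunking
            self.seq_len_dim = 1          # chunk over the sequence_length dimension

        def feed_forward(self, hidden_states):
            return self.dense_out(torch.relu(self.dense_in(hidden_states)))

        def forward(self, hidden_states):
            # Only chunk_size output embeddings are computed at a time, trading
            # compute time for a smaller intermediate memory footprint.
            return apply_chunking_to_forward(
                self.feed_forward, self.chunk_size, self.seq_len_dim, hidden_states
            )


    block = ChunkedFeedForward()
    out = block(torch.randn(2, 128, 768))  # same values as without chunking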
...@@ -47,6 +47,7 @@ The documentation is organized in five parts:

- **RESEARCH** focuses on tutorials that have less to do with how to use the library but more about general research in
  transformers models
- The three last sections contain the documentation of each public class and function, grouped in:

  - **MAIN CLASSES** for the main classes exposing the important APIs of the library.
  - **MODELS** for the classes and functions related to each model implemented in the library.
  - **INTERNAL HELPERS** for the classes and functions we use internally.
...@@ -122,7 +123,7 @@ conversion utilities for the following models:

20. :doc:`MarianMT <model_doc/marian>` Machine translation models trained using `OPUS <http://opus.nlpl.eu/>`__ data by
    Jörg Tiedemann. The `Marian Framework <https://marian-nmt.github.io/>`__ is being developed by the Microsoft
    Translator Team.
21. :doc:`MBart <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Denoising Pre-training for
    Neural Machine Translation <https://arxiv.org/abs/2001.08210>`__ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li,
    Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
22. :doc:`Pegasus <model_doc/pegasus>` (from Google) released with the paper `PEGASUS: Pre-training with Extracted
......
...@@ -85,4 +85,4 @@ TensorFlow Helper Functions

.. autofunction:: transformers.modeling_tf_utils.keras_serializable

.. autofunction:: transformers.modeling_tf_utils.shape_list
\ No newline at end of file
...@@ -25,6 +25,7 @@ SpecialTokensMixin

Enums and namedtuples
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.tokenization_utils_base.ExplicitEnum

.. autoclass:: transformers.tokenization_utils_base.PaddingStrategy
......
...@@ -24,4 +24,4 @@ Distributed Evaluation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.trainer_pt_utils.DistributedTensorGatherer
    :members:
\ No newline at end of file
...@@ -17,7 +17,7 @@ You can also use the environment variable ``TRANSFORMERS_VERBOSITY`` to override
to one of the following: ``debug``, ``info``, ``warning``, ``error``, ``critical``. For example:

.. code-block:: bash

    TRANSFORMERS_VERBOSITY=error ./myprogram.py
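The verbosity can also be changed directly from Python code; a small sketch using the ``set_verbosity_error`` helper:

.. code-block::

    import transformers

    # Equivalent to TRANSFORMERS_VERBOSITY=error for the current process
    transformers.logging.set_verbosity_error()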
All the methods of this logging module are documented below, the main ones are

...@@ -55,4 +55,4 @@ Other functions

.. autofunction:: transformers.logging.enable_explicit_format

.. autofunction:: transformers.logging.reset_format
\ No newline at end of file
...@@ -52,4 +52,4 @@ Generative models
    :members:

.. autoclass:: transformers.generation_tf_utils.TFGenerationMixin
    :members:
\ No newline at end of file
Pipelines
-----------------------------------------------------------------------------------------------------------------------

The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of
the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity
Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. See the
:doc:`task summary <../task_summary>` for examples of use.

...@@ -26,8 +26,8 @@ There are two categories of pipeline abstractions to be aware about:

The pipeline abstraction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated as any other
pipeline but requires an additional argument which is the `task`.
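For example, a short sketch using the ``sentiment-analysis`` task (the default model for that task is downloaded on
first use):

.. code-block::

    from transformers import pipeline

    # The task name selects a default model and tokenizer for sentiment analysis
    nlp = pipeline("sentiment-analysis")
    print(nlp("We are very happy to show you the 🤗 Transformers library."))
    # e.g. [{'label': 'POSITIVE', 'score': ...}]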
.. autofunction:: transformers.pipeline
......
...@@ -8,8 +8,8 @@ Processors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All processors follow the same architecture which is that of the
:class:`~transformers.data.processors.utils.DataProcessor`. The processor returns a list of
:class:`~transformers.data.processors.utils.InputExample`. These
:class:`~transformers.data.processors.utils.InputExample` can be converted to
:class:`~transformers.data.processors.utils.InputFeatures` in order to be fed to the model.
...@@ -28,14 +28,16 @@ of :class:`~transformers.data.processors.utils.InputExample`. These

GLUE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`General Language Understanding Evaluation (GLUE) <https://gluebenchmark.com/>`__ is a benchmark that evaluates the
performance of models across a diverse set of existing NLU tasks. It was released together with the paper `GLUE: A
multi-task benchmark and analysis platform for natural language understanding
<https://openreview.net/pdf?id=rJ4km2R5t7>`__.

This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB,
QQP, QNLI, RTE and WNLI.

Those processors are:

- :class:`~transformers.data.processors.utils.MrpcProcessor`
- :class:`~transformers.data.processors.utils.MnliProcessor`
- :class:`~transformers.data.processors.utils.MnliMismatchedProcessor`

...@@ -46,7 +48,7 @@ Those processors are:

- :class:`~transformers.data.processors.utils.RteProcessor`
- :class:`~transformers.data.processors.utils.WnliProcessor`

Additionally, the following method can be used to load values from a data file and convert them to a list of
:class:`~transformers.data.processors.utils.InputExample`.

.. automethod:: transformers.data.processors.glue.glue_convert_examples_to_features
...@@ -54,36 +56,39 @@ Additionally, the following method can be used to load values from a data file

Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

An example using these processors is given in the `run_glue.py
<https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_glue.py>`__ script.
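A minimal sketch of the same flow (the data directory below is a hypothetical local path to the MRPC data):

.. code-block::

    from transformers import BertTokenizer, glue_convert_examples_to_features
    from transformers.data.processors.glue import MrpcProcessor

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    processor = MrpcProcessor()

    # "./glue_data/MRPC" is a placeholder: point it to a local copy of the MRPC data
    examples = processor.get_dev_examples("./glue_data/MRPC")
    features = glue_convert_examples_to_features(examples, tokenizer, max_length=128, task="mrpc")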
XNLI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`The Cross-Lingual NLI Corpus (XNLI) <https://www.nyu.edu/projects/bowman/xnli/>`__ is a benchmark that evaluates the
quality of cross-lingual text representations. XNLI is a crowd-sourced dataset based on `MultiNLI
<http://www.nyu.edu/projects/bowman/multinli/>`__: pairs of text are labeled with textual entailment annotations for 15
different languages (including both high-resource languages such as English and low-resource languages such as
Swahili). It was released together with the paper `XNLI: Evaluating Cross-lingual Sentence Representations
<https://arxiv.org/abs/1809.05053>`__.

This library hosts the processor to load the XNLI data:

- :class:`~transformers.data.processors.utils.XnliProcessor`

Please note that since the gold labels are available on the test set, evaluation is performed on the test set.

An example using these processors is given in the `run_xnli.py
<https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_xnli.py>`__ script.
SQuAD
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`The Stanford Question Answering Dataset (SQuAD) <https://rajpurkar.github.io/SQuAD-explorer//>`__ is a benchmark that
evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version
(v1.1) was released together with the paper `SQuAD: 100,000+ Questions for Machine Comprehension of Text
<https://arxiv.org/abs/1606.05250>`__. The second version (v2.0) was released alongside the paper `Know What You Don't
Know: Unanswerable Questions for SQuAD <https://arxiv.org/abs/1806.03822>`__.

This library hosts a processor for each of the two versions:
...@@ -91,6 +96,7 @@ Processors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Those processors are:

- :class:`~transformers.data.processors.utils.SquadV1Processor`
- :class:`~transformers.data.processors.utils.SquadV2Processor`
...@@ -99,17 +105,18 @@ They both inherit from the abstract class :class:`~transformers.data.processors.

.. autoclass:: transformers.data.processors.squad.SquadProcessor
    :members:

Additionally, the following method can be used to convert SQuAD examples into
:class:`~transformers.data.processors.utils.SquadFeatures` that can be used as model inputs.

.. automethod:: transformers.data.processors.squad.squad_convert_examples_to_features

These processors as well as the aforementioned method can be used with files containing the data as well as with the
`tensorflow_datasets` package. Examples are given below.
Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here is an example using the processors as well as the conversion method using data files:

.. code-block::

...@@ -149,5 +156,5 @@ Using `tensorflow_datasets` is as easy as using a data file:

    )
Another example using these processors is given in the `run_squad.py
<https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py>`__ script.
...@@ -29,11 +29,12 @@ methods for using all the tokenizers:

:class:`~transformers.BatchEncoding` holds the output of the tokenizer's encoding methods (``__call__``,
``encode_plus`` and ``batch_encode_plus``) and is derived from a Python dictionary. When the tokenizer is a pure python
tokenizer, this class behaves just like a standard python dictionary and holds the various model inputs computed by
these methods (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e., backed by
HuggingFace `tokenizers library <https://github.com/huggingface/tokenizers>`__), this class provides in addition
several advanced alignment methods which can be used to map between the original string (character and words) and the
token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding
to a given token).
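A small sketch of these alignment methods with a fast tokenizer (``char_to_token`` maps a character position in the
original string to a token index, ``token_to_chars`` goes the other way):

.. code-block::

    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
    encoding = tokenizer("Transformers are great!")

    print(encoding.tokens())           # tokens produced by the fast tokenizer
    print(encoding.char_to_token(0))   # index of the token containing the first character
    print(encoding.token_to_chars(1))  # span of characters covered by the second token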
PreTrainedTokenizer
......
...@@ -4,7 +4,7 @@ Trainer

The :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` classes provide an API for feature-complete
training in most standard use cases. It's used in most of the :doc:`example scripts <../examples>`.

Before instantiating your :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`, create a
:class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments` to access all the points of
customization during training.
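A bare-bones sketch of that flow in PyTorch (the tiny dataset below is made up only to keep the example
self-contained):

.. code-block::

    import torch
    from torch.utils.data import Dataset

    from transformers import BertForSequenceClassification, BertTokenizer, Trainer, TrainingArguments


    class ToyDataset(Dataset):
        # Hypothetical two-example dataset of encoded sentences with labels
        def __init__(self, tokenizer):
            self.encodings = tokenizer(["I love this.", "I hate this."], truncation=True, padding=True)
            self.labels = [1, 0]

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, idx):
            item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
            item["labels"] = torch.tensor(self.labels[idx])
            return item


    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    model = BertForSequenceClassification.from_pretrained("bert-base-cased")

    training_args = TrainingArguments(
        output_dir="./results",             # placeholder directory for checkpoints and logs
        num_train_epochs=1,
        per_device_train_batch_size=2,
        logging_steps=10,
    )

    trainer = Trainer(model=model, args=training_args, train_dataset=ToyDataset(tokenizer))
    trainer.train()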
......
...@@ -19,14 +19,14 @@ downstream tasks. However, at some point further model increases become harder d
longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction
techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows
that our proposed methods lead to models that scale much better compared to the original BERT. We also use a
self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks
with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and
SQuAD benchmarks while having fewer parameters compared to BERT-large.*

Tips:

- ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather
  than the left.
- ALBERT uses repeating layers which results in a small memory footprint, however the computational cost remains
  similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same
  number of (repeating) layers.
......
...@@ -2,9 +2,8 @@ AutoClasses
-----------------------------------------------------------------------------------------------------------------------

In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you
are supplying to the :obj:`from_pretrained()` method. AutoClasses are here to do this job for you so that you
automatically retrieve the relevant model given the name/path to the pretrained weights/config/vocabulary.
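A short sketch of what this looks like in practice (using the ``bert-base-cased`` checkpoint):

.. code-block::

    from transformers import AutoConfig, AutoModel, AutoTokenizer

    # The architecture (BERT here) is inferred from the checkpoint name
    config = AutoConfig.from_pretrained("bert-base-cased")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModel.from_pretrained("bert-base-cased")

    print(type(model).__name__)  # BertModel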
Instantiating one of :class:`~transformers.AutoConfig`, :class:`~transformers.AutoModel`, and
:class:`~transformers.AutoTokenizer` will directly create a class of the relevant architecture. For instance
......