"tests/vision_encoder_decoder/__init__.py" did not exist on "0bab55d5d52e4d538888980d05d73acc6da6274a"
Unverified Commit cf36f4d7 authored by Sylvain Gugger's avatar Sylvain Gugger Committed by GitHub
Browse files

Convert tutorials (#14665)

* Convert a few docs

* And another

* Last tutorials

* New syntax for colab links

<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Benchmarks
[[open-in-colab]]
Let's take a look at how 🤗 Transformer models can be benchmarked, the best practices to follow, and the benchmarks that are already available.
A notebook explaining in more detail how to benchmark 🤗 Transformer models can be found [here](https://github.com/huggingface/transformers/tree/master/notebooks/05-benchmark.ipynb).
## How to benchmark 🤗 Transformer models
The classes [`PyTorchBenchmark`] and [`TensorFlowBenchmark`] make it possible to flexibly benchmark 🤗 Transformer models. The benchmark classes allow us to measure the _peak memory usage_ and _required time_ for both _inference_ and _training_.
<Tip>
Hereby, _inference_ is defined by a single forward pass, and _training_ is defined by a single forward pass and
backward pass.
</Tip>
The benchmark classes [`PyTorchBenchmark`] and [`TensorFlowBenchmark`] expect an object of type [`PyTorchBenchmarkArguments`] and
[`TensorFlowBenchmarkArguments`], respectively, for instantiation. [`PyTorchBenchmarkArguments`] and [`TensorFlowBenchmarkArguments`] are data classes that contain all relevant configurations for their corresponding benchmark class. The following example shows how a BERT model of type _bert-base-uncased_ can be benchmarked.
```py
>>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments
>>> args = PyTorchBenchmarkArguments(models=["bert-base-uncased"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512])
>>> benchmark = PyTorchBenchmark(args)
===PT-TF-SPLIT===
>>> from transformers import TensorFlowBenchmark, TensorFlowBenchmarkArguments
>>> args = TensorFlowBenchmarkArguments(models=["bert-base-uncased"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512])
>>> benchmark = TensorFlowBenchmark(args)
```
Here, three arguments are given to the benchmark argument data classes, namely `models`, `batch_sizes`, and
`sequence_lengths`. The argument `models` is required and expects a `list` of model identifiers from the
[model hub](https://huggingface.co/models). The `list` arguments `batch_sizes` and `sequence_lengths` define
the size of the `input_ids` on which the model is benchmarked. There are many more parameters that can be configured
via the benchmark argument data classes. For more detail on these, one can directly consult the files
`src/transformers/benchmark/benchmark_args_utils.py`, `src/transformers/benchmark/benchmark_args.py` (for PyTorch)
and `src/transformers/benchmark/benchmark_args_tf.py` (for TensorFlow). Alternatively, running the following shell
commands from the root directory will print out a descriptive list of all configurable parameters for PyTorch and
TensorFlow, respectively.
```bash
python examples/pytorch/benchmarking/run_benchmark.py --help
===PT-TF-SPLIT===
python examples/tensorflow/benchmarking/run_benchmark_tf.py --help
```
An instantiated benchmark object can then simply be run by calling `benchmark.run()`.
```py
>>> results = benchmark.run()
>>> print(results)
==================== INFERENCE - SPEED - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Time in s
--------------------------------------------------------------------------------
bert-base-uncased 8 8 0.006
bert-base-uncased 8 32 0.006
bert-base-uncased 8 128 0.018
bert-base-uncased 8 512 0.088
--------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB
--------------------------------------------------------------------------------
bert-base-uncased 8 8 1227
bert-base-uncased 8 32 1281
bert-base-uncased 8 128 1307
bert-base-uncased 8 512 1539
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: PyTorch
- use_torchscript: False
- framework_version: 1.4.0
- python_version: 3.6.10
- system: Linux
- cpu: x86_64
- architecture: 64bit
- date: 2020-06-29
- time: 08:58:43.371351
- fp16: False
- use_multiprocessing: True
- only_pretrain_model: False
- cpu_ram_mb: 32088
- use_gpu: True
- num_gpus: 1
- gpu: TITAN RTX
- gpu_ram_mb: 24217
- gpu_power_watts: 280.0
- gpu_performance_state: 2
- use_tpu: False
===PT-TF-SPLIT===
>>> results = benchmark.run()
>>> print(results)
==================== INFERENCE - SPEED - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Time in s
--------------------------------------------------------------------------------
bert-base-uncased 8 8 0.005
bert-base-uncased 8 32 0.008
bert-base-uncased 8 128 0.022
bert-base-uncased 8 512 0.105
--------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB
--------------------------------------------------------------------------------
bert-base-uncased 8 8 1330
bert-base-uncased 8 32 1330
bert-base-uncased 8 128 1330
bert-base-uncased 8 512 1770
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: Tensorflow
- use_xla: False
- framework_version: 2.2.0
- python_version: 3.6.10
- system: Linux
- cpu: x86_64
- architecture: 64bit
- date: 2020-06-29
- time: 09:26:35.617317
- fp16: False
- use_multiprocessing: True
- only_pretrain_model: False
- cpu_ram_mb: 32088
- use_gpu: True
- num_gpus: 1
- gpu: TITAN RTX
- gpu_ram_mb: 24217
- gpu_power_watts: 280.0
- gpu_performance_state: 2
- use_tpu: False
```
By default, the _time_ and the _required memory_ for _inference_ are benchmarked. In the example output above, the first
two sections show the results corresponding to _inference time_ and _inference memory_. In addition, all relevant
information about the computing environment, _e.g._ the GPU type, the system, the library versions, etc., is printed
out in the third section under _ENVIRONMENT INFORMATION_. This information can optionally be saved in a _.csv_ file
by adding the argument `save_to_csv=True` to [`PyTorchBenchmarkArguments`] and
[`TensorFlowBenchmarkArguments`], respectively. In this case, every section is saved in a separate
_.csv_ file. The path to each _.csv_ file can optionally be defined via the argument data classes.
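For example, a run that writes each section to its own file could look like the following sketch. The exact keyword names for the file paths, e.g. `inference_time_csv_file` and `env_info_csv_file`, should be double-checked against `benchmark_args_utils.py` or the `--help` output above; they are assumptions here.
```py
>>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

>>> args = PyTorchBenchmarkArguments(
...     models=["bert-base-uncased"],
...     batch_sizes=[8],
...     sequence_lengths=[8, 32, 128, 512],
...     save_to_csv=True,
...     inference_time_csv_file="inference_time.csv",  # assumed keyword names, see benchmark_args_utils.py
...     inference_memory_csv_file="inference_memory.csv",
...     env_info_csv_file="env_info.csv",
... )
>>> results = PyTorchBenchmark(args).run()  # also writes the .csv files configured above
```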
Instead of benchmarking pre-trained models via their model identifier, _e.g._ `bert-base-uncased`, the user can
alternatively benchmark an arbitrary configuration of any available model class. In this case, a `list` of
configurations must be passed to the benchmark class together with the benchmark args, as follows.
```py
>>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments, BertConfig
>>> args = PyTorchBenchmarkArguments(models=["bert-base", "bert-384-hid", "bert-6-lay"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512])
>>> config_base = BertConfig()
>>> config_384_hid = BertConfig(hidden_size=384)
>>> config_6_lay = BertConfig(num_hidden_layers=6)
>>> benchmark = PyTorchBenchmark(args, configs=[config_base, config_384_hid, config_6_lay])
>>> benchmark.run()
==================== INFERENCE - SPEED - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Time in s
--------------------------------------------------------------------------------
bert-base 8 8 0.006
bert-base 8 32 0.006
bert-base 8 128 0.018
bert-base 8 512 0.088
bert-384-hid 8 8 0.006
bert-384-hid 8 32 0.006
bert-384-hid 8 128 0.011
bert-384-hid 8 512 0.054
bert-6-lay 8 8 0.003
bert-6-lay 8 32 0.004
bert-6-lay 8 128 0.009
bert-6-lay 8 512 0.044
--------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB
--------------------------------------------------------------------------------
bert-base 8 8 1277
bert-base 8 32 1281
bert-base 8 128 1307
bert-base 8 512 1539
bert-384-hid 8 8 1005
bert-384-hid 8 32 1027
bert-384-hid 8 128 1035
bert-384-hid 8 512 1255
bert-6-lay 8 8 1097
bert-6-lay 8 32 1101
bert-6-lay 8 128 1127
bert-6-lay 8 512 1359
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: PyTorch
- use_torchscript: False
- framework_version: 1.4.0
- python_version: 3.6.10
- system: Linux
- cpu: x86_64
- architecture: 64bit
- date: 2020-06-29
- time: 09:35:25.143267
- fp16: False
- use_multiprocessing: True
- only_pretrain_model: False
- cpu_ram_mb: 32088
- use_gpu: True
- num_gpus: 1
- gpu: TITAN RTX
- gpu_ram_mb: 24217
- gpu_power_watts: 280.0
- gpu_performance_state: 2
- use_tpu: False
===PT-TF-SPLIT===
>>> from transformers import TensorFlowBenchmark, TensorFlowBenchmarkArguments, BertConfig
>>> args = TensorFlowBenchmarkArguments(models=["bert-base", "bert-384-hid", "bert-6-lay"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512])
>>> config_base = BertConfig()
>>> config_384_hid = BertConfig(hidden_size=384)
>>> config_6_lay = BertConfig(num_hidden_layers=6)
>>> benchmark = TensorFlowBenchmark(args, configs=[config_base, config_384_hid, config_6_lay])
>>> benchmark.run()
==================== INFERENCE - SPEED - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Time in s
--------------------------------------------------------------------------------
bert-base 8 8 0.005
bert-base 8 32 0.008
bert-base 8 128 0.022
bert-base 8 512 0.106
bert-384-hid 8 8 0.005
bert-384-hid 8 32 0.007
bert-384-hid 8 128 0.018
bert-384-hid 8 512 0.064
bert-6-lay 8 8 0.002
bert-6-lay 8 32 0.003
bert-6-lay 8 128 0.0011
bert-6-lay 8 512 0.074
--------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB
--------------------------------------------------------------------------------
bert-base 8 8 1330
bert-base 8 32 1330
bert-base 8 128 1330
bert-base 8 512 1770
bert-384-hid 8 8 1330
bert-384-hid 8 32 1330
bert-384-hid 8 128 1330
bert-384-hid 8 512 1540
bert-6-lay 8 8 1330
bert-6-lay 8 32 1330
bert-6-lay 8 128 1330
bert-6-lay 8 512 1540
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: Tensorflow
- use_xla: False
- framework_version: 2.2.0
- python_version: 3.6.10
- system: Linux
- cpu: x86_64
- architecture: 64bit
- date: 2020-06-29
- time: 09:38:15.487125
- fp16: False
- use_multiprocessing: True
- only_pretrain_model: False
- cpu_ram_mb: 32088
- use_gpu: True
- num_gpus: 1
- gpu: TITAN RTX
- gpu_ram_mb: 24217
- gpu_power_watts: 280.0
- gpu_performance_state: 2
- use_tpu: False
```
Again, _inference time_ and _required memory_ for _inference_ are measured, but this time for customized configurations
of the `BertModel` class. This feature can be especially helpful when deciding which configuration the model
should be trained with.
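Because the benchmark classes can also measure _training_ (a forward and a backward pass), one could additionally benchmark training for such a custom configuration along the lines of the sketch below. The `training=True` flag is an assumption based on the fields in `benchmark_args_utils.py` and should be verified against the `--help` output.
```py
>>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments, BertConfig

>>> args = PyTorchBenchmarkArguments(
...     models=["bert-6-lay"],
...     batch_sizes=[8],
...     sequence_lengths=[8, 32, 128],
...     training=True,  # assumed flag: also benchmark a forward + backward pass
... )
>>> benchmark = PyTorchBenchmark(args, configs=[BertConfig(num_hidden_layers=6)])
>>> results = benchmark.run()
```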
## Benchmark best practices
This section lists a couple of best practices one should be aware of when benchmarking a model.
- Currently, only single device benchmarking is supported. When benchmarking on GPU, it is recommended that the user
specifies on which device the code should be run by setting the `CUDA_VISIBLE_DEVICES` environment variable in the
shell, _e.g._ `export CUDA_VISIBLE_DEVICES=0`, before running the code (a small example is sketched after this list).
- The option `no_multi_processing` should only be set to `True` for testing and debugging. To ensure accurate
memory measurement, it is recommended to run each memory benchmark in a separate process by making sure
`no_multi_processing` is set to `False`.
- One should always state the environment information when sharing the results of a model benchmark. Results can vary
heavily between different GPU devices, library versions, etc., so that benchmark results on their own are not very
useful for the community.
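For instance, pinning the benchmark script to the first GPU from the shell could look like the following; the CLI flag names are assumed to mirror the fields of the argument data classes and should be confirmed via the `--help` commands shown earlier.
```bash
export CUDA_VISIBLE_DEVICES=0
python examples/pytorch/benchmarking/run_benchmark.py \
    --models bert-base-uncased \
    --batch_sizes 8 \
    --sequence_lengths 8 32 128 512
```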
## Sharing your benchmark
Previously, all available core models (10 at the time) were benchmarked for _inference time_, across many different
settings: using PyTorch, with and without TorchScript, using TensorFlow, with and without XLA. All of those tests were
done across CPUs (except for TensorFlow XLA) and GPUs.
The approach is detailed in the [following blogpost](https://medium.com/huggingface/benchmarking-transformers-pytorch-and-tensorflow-e2917fb891c2) and the results are
available [here](https://docs.google.com/spreadsheets/d/1sryqufw2D0XlUH4sq3e9Wnxu5EAQkaohzrJbd5HdQ_w/edit?usp=sharing).
With the new _benchmark_ tools, it is easier than ever to share your benchmark results with the community:
- [PyTorch Benchmarking Results](https://github.com/huggingface/transformers/tree/master/examples/pytorch/benchmarking/README.md).
- [TensorFlow Benchmarking Results](https://github.com/huggingface/transformers/tree/master/examples/tensorflow/benchmarking/README.md).
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Multi-lingual models
[[open-in-colab]]
Most of the models available in this library are mono-lingual models (English, Chinese and German). A few multi-lingual
models are available and have a different mechanism than mono-lingual models. This page details the usage of these
models.
## XLM
XLM has a total of 10 different checkpoints, only one of which is mono-lingual. The 9 remaining model checkpoints can
be split in two categories: the checkpoints that make use of language embeddings, and those that don't.
### XLM & Language Embeddings
This section concerns the following checkpoints:
- `xlm-mlm-ende-1024` (Masked language modeling, English-German)
- `xlm-mlm-enfr-1024` (Masked language modeling, English-French)
- `xlm-mlm-enro-1024` (Masked language modeling, English-Romanian)
- `xlm-mlm-xnli15-1024` (Masked language modeling, XNLI languages)
- `xlm-mlm-tlm-xnli15-1024` (Masked language modeling + Translation, XNLI languages)
- `xlm-clm-enfr-1024` (Causal language modeling, English-French)
- `xlm-clm-ende-1024` (Causal language modeling, English-German)
These checkpoints require language embeddings that will specify the language used at inference time. These language
embeddings are represented as a tensor that is of the same shape as the input ids passed to the model. The values in
these tensors depend on the language used and are identifiable using the `lang2id` and `id2lang` attributes from
the tokenizer.
Here is an example using the `xlm-clm-enfr-1024` checkpoint (Causal language modeling, English-French):
```py
>>> import torch
>>> from transformers import XLMTokenizer, XLMWithLMHeadModel

>>> tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
>>> model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")
```
The different languages this model/tokenizer handles, as well as the ids of these languages are visible using the
`lang2id` attribute:

```py
>>> print(tokenizer.lang2id)
{'en': 0, 'fr': 1}
```
These ids should be used when passing a language parameter during a model pass. Let's define our inputs:

```py
>>> input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")])  # batch size of 1
```
We should now define the language embedding by using the previously defined language id. We want to create a tensor
filled with the appropriate language ids, of the same size as `input_ids`. For English, the id is 0:

```py
>>> language_id = tokenizer.lang2id['en']  # 0
>>> langs = torch.tensor([language_id] * input_ids.shape[1])  # torch.tensor([0, 0, 0, ..., 0])

>>> # We reshape it to be of size (batch_size, sequence_length)
>>> langs = langs.view(1, -1)  # is now of shape [1, sequence_length] (we have a batch size of 1)
```
You can then feed it all as input to your model:

```py
>>> outputs = model(input_ids, langs=langs)
```

The example [run_generation.py](https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-generation/run_generation.py) can generate text
using the CLM checkpoints from XLM, using the language embeddings.
### XLM without Language Embeddings
This section concerns the following checkpoints:
- `xlm-mlm-17-1280` (Masked language modeling, 17 languages)
- `xlm-mlm-100-1280` (Masked language modeling, 100 languages)
These checkpoints do not require language embeddings at inference time. These models are used to produce generic
sentence representations, unlike the previously-mentioned XLM checkpoints.
## BERT
BERT has two checkpoints that can be used for multi-lingual tasks:
- `bert-base-multilingual-uncased` (Masked language modeling + Next sentence prediction, 102 languages)
- `bert-base-multilingual-cased` (Masked language modeling + Next sentence prediction, 104 languages)
These checkpoints do not require language embeddings at inference time. They should identify the language used in the
context and infer accordingly.
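As an illustration, such a checkpoint can be loaded and used like any mono-lingual model, with no `langs` tensor involved (a minimal sketch using the Auto classes):

```py
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
>>> model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

>>> inputs = tokenizer("Paris est la capitale de la France.", return_tensors="pt")
>>> outputs = model(**inputs)  # no language embeddings are passed; the model infers the language from context
```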
## XLM-RoBERTa
XLM-RoBERTa was trained on 2.5TB of newly created clean CommonCrawl data in 100 languages. It provides strong gains
over previously released multi-lingual models like mBERT or XLM on downstream tasks like classification, sequence
labeling and question answering.
Two XLM-RoBERTa checkpoints can be used for multi-lingual tasks:
- `xlm-roberta-base` (Masked language modeling, 100 languages)
- `xlm-roberta-large` (Masked language modeling, 100 languages)
## mLUKE
mLUKE is based on XLM-RoBERTa and further trained on Wikipedia articles in 24 languages with a masked language modeling
objective as well as a masked entity prediction objective.
The model can be used in the same way as other models solely based on word-piece inputs, but it can also be used with
entity representations to achieve further performance gains on entity-related tasks such as relation extraction,
named entity recognition and question answering (see [LUKE](model_doc/luke)).
Currently, one mLUKE checkpoint is available:
- `studio-ousia/mluke-base` (Masked language modeling + Masked entity prediction, 100 languages)
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Perplexity of fixed-length models
[[open-in-colab]]
Perplexity (PPL) is one of the most common metrics for evaluating language models. Before diving in, we should note
that the metric applies specifically to classical language models (sometimes called autoregressive or causal language
models) and is not well defined for masked language models like BERT (see [summary of the models](model_summary)).
Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized
sequence \\(X = (x_0, x_1, \dots, x_t)\\), then the perplexity of \\(X\\) is,

$$\text{PPL}(X) = \exp \left\{ {-\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{<i}) } \right\}$$

where \\(\log p_\theta (x_i|x_{<i})\\) is the log-likelihood of the ith token conditioned on the preceding tokens
\\(x_{<i}\\) according to our model. Intuitively, it can be thought of as an evaluation of the model's ability to
predict uniformly among the set of specified tokens in a corpus. Importantly, this means that the tokenization
procedure has a direct impact on a model's perplexity, which should always be taken into consideration when comparing
different models.
This is also equivalent to the exponentiation of the cross-entropy between the data and model predictions. For more
intuition about perplexity and its relationship to Bits Per Character (BPC) and data compression, check out this
[fantastic blog post on The Gradient](https://thegradient.pub/understanding-evaluation-metrics-for-language-models/).
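As a minimal illustration of that equivalence, the perplexity of a causal language model can be recovered directly from its loss, since that loss is the average negative log-likelihood per token (the numeric value below is purely hypothetical):

```python
import torch

# a hypothetical cross-entropy loss reported by a causal language model
# (i.e. the average negative log-likelihood per token)
loss = torch.tensor(2.9)
ppl = torch.exp(loss)  # perplexity is the exponentiated cross-entropy
```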
## Calculating PPL with fixed-length models
If we weren't limited by a model's context size, we would evaluate the model's perplexity by autoregressively
factorizing a sequence and conditioning on the entire preceding subsequence at each step, as shown below.

<img width="600" alt="Full decomposition of a sequence with unlimited context length" src="/imgs/ppl_full.gif"/>
When working with approximate models, however, we typically have a constraint on the number of tokens the model can
process. The largest version of [GPT-2](model_doc/gpt2), for example, has a fixed length of 1024 tokens, so we
cannot calculate \\(p_\theta(x_t|x_{<t})\\) directly when \\(t\\) is greater than 1024.
Instead, the sequence is typically broken into subsequences equal to the model's maximum input size. If a model's max
input size is \\(k\\), we then approximate the likelihood of a token \\(x_t\\) by conditioning only on the
\\(k-1\\) tokens that precede it rather than the entire context. When evaluating the model's perplexity of a
sequence, a tempting but suboptimal approach is to break the sequence into disjoint chunks and add up the decomposed
log-likelihoods of each segment independently.

<img width="600" alt="Suboptimal PPL not taking advantage of full available context" src="/imgs/ppl_chunked.gif"/>
This is quick to compute since the perplexity of each segment can be computed in one forward pass, but serves as a poor
approximation of the fully-factorized perplexity and will typically yield a higher (worse) PPL because the model will
have less context at most of the prediction steps.
Instead, the PPL of fixed-length models should be evaluated with a sliding-window strategy. This involves repeatedly
sliding the context window so that the model has more context when making each prediction.

<img width="600" alt="Sliding window PPL taking advantage of all available context" src="/imgs/ppl_sliding.gif"/>
This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more
favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good
practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by
1 token at a time. This allows computation to proceed much faster while still giving the model a large context to make
predictions at each step.

## Example: Calculating perplexity with GPT-2 in 🤗 Transformers
Let's demonstrate this process with GPT-2.
```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = 'cuda'
model_id = 'gpt2-large'
model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)
```
We'll load in the WikiText-2 dataset and evaluate the perplexity using a few different sliding-window strategies. Since
this dataset is small and we're just doing one forward pass over the set, we can just load and encode the entire
dataset in memory.

```python
from datasets import load_dataset

test = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
encodings = tokenizer('\n\n'.join(test['text']), return_tensors='pt')
```

With 🤗 Transformers, we can simply pass the `input_ids` as the `labels` to our model, and the average negative
log-likelihood for each token is returned as the loss. With our sliding window approach, however, there is overlap in
the tokens we pass to the model at each iteration. We don't want the log-likelihood for the tokens we're just treating
as context to be included in our loss, so we can set these targets to `-100` so that they are ignored. The following
is an example of how we could do this with a stride of `512`. This means that the model will have at least 512 tokens
for context when calculating the conditional likelihood of any one token (provided there are 512 preceding tokens
available to condition on).
```python
import torch
from tqdm import tqdm

max_length = model.config.n_positions
stride = 512

nlls = []
for i in tqdm(range(0, encodings.input_ids.size(1), stride)):
    begin_loc = max(i + stride - max_length, 0)
    end_loc = min(i + stride, encodings.input_ids.size(1))
    trg_len = end_loc - i  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # ignore the purely-contextual tokens in the loss

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)
        # the loss is the average negative log-likelihood per scored token,
        # so multiply by trg_len to recover the summed negative log-likelihood
        neg_log_likelihood = outputs[0] * trg_len

    nlls.append(neg_log_likelihood)

ppl = torch.exp(torch.stack(nlls).sum() / end_loc)
```
Running this with the stride length equal to the max input length is equivalent to the suboptimal, non-sliding-window
strategy we discussed above. The smaller the stride, the more context the model will have in making each prediction,
and the better the reported perplexity will typically be.
When we run the above with `stride = 1024`, i.e. no overlap, the resulting PPL is `19.64`, which is about the same
as the `19.93` reported in the GPT-2 paper. By using `stride = 512` and thereby employing our striding window
strategy, this jumps down to `16.53`. This is not only a more favorable score, but is calculated in a way that is
closer to the true autoregressive decomposition of a sequence likelihood.
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Preprocessing data
[[open-in-colab]]
In this tutorial, we'll explore how to preprocess your data using 🤗 Transformers. The main tool for this is what we
call a [tokenizer](main_classes/tokenizer). You can build one using the tokenizer class associated with the model
you would like to use, or directly with the [`AutoTokenizer`] class.
As we saw in the [quick tour](quicktour), the tokenizer will first split a given text into words (or parts of
words, punctuation symbols, etc.), usually called _tokens_. Then it will convert those _tokens_ into numbers, to be able
to build a tensor out of them and feed them to the model. It will also add any additional inputs the model might expect
to work properly.
<Tip>
If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer: it will split
the text you give it into tokens the same way it did for the pretraining corpus, and it will use the same token-to-index
correspondence (which we usually call a _vocab_) as during pretraining.
</Tip>
To automatically download the vocab used when pretraining or fine-tuning a given model, you can use the
[`AutoTokenizer.from_pretrained`] method:
```py
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
```
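Under the hood, the tokenizer performs the two steps described above: splitting the text into tokens, then mapping each token to an id. If you are curious, you can run these steps separately with the `tokenize` and `convert_tokens_to_ids` methods; this is only an illustration (the example sentence is ours), and calling the tokenizer object directly, as shown in the next section, remains the recommended way since it also adds all the extra inputs the model expects.

```py
# Illustration only: the two steps performed by the tokenizer, done by hand
tokens = tokenizer.tokenize("Hello, I'm a single sentence!")  # split into (sub)word tokens
ids = tokenizer.convert_tokens_to_ids(tokens)  # map each token to its index in the vocab
# Note: unlike calling `tokenizer(...)` directly, this does not add special tokens,
# the attention mask, or any other model-specific inputs.
```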
## Base use
<Youtube id="Yffk5aydLzg"/>
A [`PreTrainedTokenizer`] has many methods, but the only one you need to remember for preprocessing
is its `__call__`: you just need to feed your sentence to your tokenizer object.
```py
>>> encoded_input = tokenizer("Hello, I'm a single sentence!")
>>> print(encoded_input)
{'input_ids': [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
This returns a dictionary mapping strings to lists of ints. The [input_ids](glossary#input-ids) are the indices corresponding
to each token in our sentence. We will see below what the [attention_mask](glossary#attention-mask) is used for and
in [the next section](#preprocessing-pairs-of-sentences) the goal of [token_type_ids](glossary#token-type-ids).
The tokenizer can decode a list of token ids back into a proper sentence:
```py
>>> tokenizer.decode(encoded_input["input_ids"])
"[CLS] Hello, I'm a single sentence! [SEP]"
```
As you can see, the tokenizer automatically added some special tokens that the model expects. Not all models need
special tokens; for instance, if we had used _gpt2-medium_ instead of _bert-base-cased_ to create our tokenizer, we
would have seen a decoded sentence identical to the original one here. You can disable this behavior (which is only advised if you
have added those special tokens yourself) by passing `add_special_tokens=False`.
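As a minimal sketch of the effect (reusing the tokenizer and sentence from above), you can compare the encoding with and without special tokens; for a BERT-like tokenizer the difference is the `[CLS]` and `[SEP]` tokens:

```py
with_special = tokenizer("Hello, I'm a single sentence!")
without_special = tokenizer("Hello, I'm a single sentence!", add_special_tokens=False)
# For bert-base-cased the first encoding has two extra ids: [CLS] at the start and [SEP] at the end
print(len(with_special["input_ids"]) - len(without_special["input_ids"]))
```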
If you have several sentences you want to process, you can do this efficiently by sending them as a list to the
tokenizer:
```py
>>> batch_sentences = ["Hello I'm a single sentence",
... "And another sentence",
... "And the very very last one"]
>>> encoded_inputs = tokenizer(batch_sentences)
>>> print(encoded_inputs)
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
[101, 1262, 1330, 5650, 102],
[101, 1262, 1103, 1304, 1304, 1314, 1141, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1]]}
```
We get back a dictionary once again, this time with values being lists of lists of ints.
If the purpose of sending several sentences at a time to the tokenizer is to build a batch to feed the model, you will
probably want:
- To pad each sentence to the maximum length there is in your batch.
- To truncate each sentence to the maximum length the model can accept (if applicable).
- To return tensors.
You can do all of this by using the following options when feeding your list of sentences to the tokenizer:
```py
>>> batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
>>> print(batch)
{'input_ids': tensor([[ 101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
[ 101, 1262, 1330, 5650, 102, 0, 0, 0, 0],
[ 101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 0]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 0]])}
===PT-TF-SPLIT===
>>> batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
>>> print(batch)
{'input_ids': tf.Tensor([[ 101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
[ 101, 1262, 1330, 5650, 102, 0, 0, 0, 0],
[ 101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 0]]),
'token_type_ids': tf.Tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0]]),
'attention_mask': tf.Tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 0]])}
```
It returns a dictionary with string keys and tensor values. We can now see what the [attention_mask](glossary#attention-mask) is all about: it points out which tokens the model should pay attention to and which ones
it should not (because they represent padding in this case).
Note that if your model does not have a maximum length associated with it, the command above will throw a warning. You
can safely ignore it. You can also pass `verbose=False` to stop the tokenizer from emitting those kinds of warnings.
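For instance, a minimal sketch reusing `batch_sentences` from above:

```py
# Same call as before, but with tokenizer warnings silenced
batch = tokenizer(
    batch_sentences,
    padding=True,
    truncation=True,
    return_tensors="pt",
    verbose=False,  # suppress tokenizer warnings such as the missing-max-length one
)
```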
<a id='sentence-pairs'></a>
## Preprocessing pairs of sentences
<Youtube id="0u3ioSwev3s"/>
Sometimes you need to feed a pair of sentences to your model, for instance to classify whether two sentences in
a pair are similar, or for question-answering models, which take a context and a question. For BERT models, the input
is then represented like this: `[CLS] Sequence A [SEP] Sequence B [SEP]`
You can encode a pair of sentences in the format expected by your model by supplying the two sentences as two arguments
(not a list since a list of two sentences will be interpreted as a batch of two single sentences, as we saw before).
This will once again return a dictionary mapping strings to lists of ints:
```py
>>> encoded_input = tokenizer("How old are you?", "I'm 6 years old")
>>> print(encoded_input)
{'input_ids': [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
This shows us what the [token_type_ids](glossary#token-type-ids) are for: they indicate to the model which part of
the inputs corresponds to the first sentence and which part corresponds to the second sentence. Note that
_token_type_ids_ are not required or handled by all models. By default, a tokenizer will only return the inputs that
its associated model expects. You can force the return (or the non-return) of any of those special arguments by using
`return_token_type_ids` or `return_attention_mask`.
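For example, here is a small sketch that disables `token_type_ids` in the output even though a BERT tokenizer would normally return them (the variable name is ours):

```py
pair_without_token_types = tokenizer(
    "How old are you?",
    "I'm 6 years old",
    return_token_type_ids=False,  # ask the tokenizer not to return token_type_ids
)
# The result only contains input_ids and attention_mask
print(pair_without_token_types.keys())
```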
If we decode the token ids we obtained, we will see that the special tokens have been properly added.
```py
>>> tokenizer.decode(encoded_input["input_ids"])
"[CLS] How old are you? [SEP] I'm 6 years old [SEP]"
```
If you have a list of pairs of sequences you want to process, you should feed them as two lists to your tokenizer: the
list of first sentences and the list of second sentences:
```py
>>> batch_sentences = ["Hello I'm a single sentence",
... "And another sentence",
... "And the very very last one"]
>>> batch_of_second_sentences = ["I'm a sentence that goes with the first sentence",
... "And I should be encoded with the second sentence",
... "And I go with the very last one"]
>>> encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences)
>>> print(encoded_inputs)
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102],
[101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102],
[101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
```
As we can see, it returns a dictionary where each value is a list of lists of ints.
To double-check what is fed to the model, we can decode each list in _input_ids_ one by one:
```py
>>> for ids in encoded_inputs["input_ids"]:
...     print(tokenizer.decode(ids))
[CLS] Hello I'm a single sentence [SEP] I'm a sentence that goes with the first sentence [SEP]
[CLS] And another sentence [SEP] And I should be encoded with the second sentence [SEP]
[CLS] And the very very last one [SEP] And I go with the very last one [SEP]
```
Once again, you can automatically pad your inputs to the maximum sentence length in the batch, truncate to the maximum
length the model can accept and return tensors directly with the following:
```py
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="pt")
===PT-TF-SPLIT===
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="tf")
```
## Everything you always wanted to know about padding and truncation
We have seen the commands that will work for most cases (pad your batch to the length of the maximum sentence and
truncate to the maximum length the model can accept). However, the API supports more strategies if you need them. The
three arguments you need to know for this are `padding`, `truncation` and `max_length`.
- `padding` controls the padding. It can be a boolean or a string which should be:
  - `True` or `'longest'` to pad to the longest sequence in the batch (doing no padding if you only provide a single sequence).
  - `'max_length'` to pad to a length specified by the `max_length` argument or the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). If you only provide a single sequence, padding will still be applied to it.
  - `False` or `'do_not_pad'` to not pad the sequences. As we have seen before, this is the default behavior.
- `truncation` controls the truncation. It can be a boolean or a string which should be:
  - `True` or `'longest_first'` to truncate to a maximum length specified by the `max_length` argument or the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). This will truncate token by token, removing a token from the longest sequence in the pair until the proper length is reached.
  - `'only_second'` to truncate to a maximum length specified by the `max_length` argument or the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). This will only truncate the second sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
  - `'only_first'` to truncate to a maximum length specified by the `max_length` argument or the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). This will only truncate the first sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
  - `False` or `'do_not_truncate'` to not truncate the sequences. As we have seen before, this is the default behavior.
- `max_length` controls the length used for padding/truncation. It can be an integer or `None`, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to `max_length` is deactivated.
Here is a table summarizing the recommended way to set up padding and truncation. If you use pairs of input sequences in
any of the following examples, you can replace `truncation=True` by a `STRATEGY` selected in
`['only_first', 'only_second', 'longest_first']`, i.e. `truncation='only_second'` or `truncation='longest_first'`, to control how both sequences in the pair are truncated as detailed before.
| Truncation | Padding | Instruction |
|--------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------------|
| no truncation | no padding | `tokenizer(batch_sentences)` |
| | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True)` or |
| | | `tokenizer(batch_sentences, padding='longest')` |
| | padding to max model input length | `tokenizer(batch_sentences, padding='max_length')` |
| | padding to specific length | `tokenizer(batch_sentences, padding='max_length', max_length=42)` |
| truncation to max model input length | no padding | `tokenizer(batch_sentences, truncation=True)` or |
| | | `tokenizer(batch_sentences, truncation=STRATEGY)` |
| | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True)` or |
| | | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY)` |
| | padding to max model input length | `tokenizer(batch_sentences, padding='max_length', truncation=True)` or |
| | | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY)` |
| | padding to specific length | Not possible |
| truncation to specific length | no padding | `tokenizer(batch_sentences, truncation=True, max_length=42)` or |
| | | `tokenizer(batch_sentences, truncation=STRATEGY, max_length=42)` |
| | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True, max_length=42)` or |
| | | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42)` |
| | padding to max model input length | Not possible |
| | padding to specific length | `tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)` or |
| | | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42)` |
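As a concrete illustration of the last row of the table, here is a sketch that pads every pair to a fixed length and, when a pair is too long, removes tokens from the second sentence only (the value `42` is arbitrary, and `batch_sentences`/`batch_of_second_sentences` are the lists defined above):

```py
batch = tokenizer(
    batch_sentences,
    batch_of_second_sentences,
    padding="max_length",      # pad everything to exactly max_length
    truncation="only_second",  # if too long, truncate the second sentence of each pair
    max_length=42,
    return_tensors="pt",
)
```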
## Pre-tokenized inputs
The tokenizer also accepts pre-tokenized inputs. This is particularly useful when you want to compute labels and extract
predictions in [named entity recognition (NER)](https://en.wikipedia.org/wiki/Named-entity_recognition) or
[part-of-speech tagging (POS tagging)](https://en.wikipedia.org/wiki/Part-of-speech_tagging).
<Tip warning={true}>
Pre-tokenized does not mean your inputs are already tokenized (you wouldn't need to pass them through the tokenizer
if that was the case) but just split into words (which is often the first step in subword tokenization algorithms
like BPE).
</Tip>
If you want to use pre-tokenized inputs, just set `is_split_into_words=True` when passing your inputs to the
tokenizer. For instance, we have:
```py
>>> encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
>>> print(encoded_input)
{'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
Note that the tokenizer still adds the ids of special tokens (if applicable) unless you pass
`add_special_tokens=False`.
This works exactly as before for batches of sentences or batches of pairs of sentences. You can encode a batch of sentences
like this:
```py
batch_sentences = [["Hello", "I'm", "a", "single", "sentence"],
["And", "another", "sentence"],
["And", "the", "very", "very", "last", "one"]]
encoded_inputs = tokenizer(batch_sentences, is_split_into_words=True)
```
or a batch of pairs of sentences like this:
```py
batch_of_second_sentences = [["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
["And", "I", "go", "with", "the", "very", "last", "one"]]
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True)
```
And you can add padding and truncation, as well as directly return tensors, like before:
```py
batch = tokenizer(batch_sentences,
batch_of_second_sentences,
is_split_into_words=True,
padding=True,
truncation=True,
return_tensors="pt")
===PT-TF-SPLIT===
batch = tokenizer(batch_sentences,
batch_of_second_sentences,
is_split_into_words=True,
padding=True,
truncation=True,
return_tensors="tf")
```