"tests/vision_encoder_decoder/__init__.py" did not exist on "0bab55d5d52e4d538888980d05d73acc6da6274a"
Unverified Commit cf36f4d7 authored by Sylvain Gugger's avatar Sylvain Gugger Committed by GitHub
Browse files

Convert tutorials (#14665)

* Convert a few docs

* And another

* Last tutorials

* New syntax for colab links

<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Benchmarks
[[open-in-colab]]
Let's take a look at how 🤗 Transformer models can be benchmarked, the best practices to follow, and the benchmarks that are already available.
A notebook explaining in more detail how to benchmark 🤗 Transformer models can be found [here](https://github.com/huggingface/transformers/tree/master/notebooks/05-benchmark.ipynb).
## How to benchmark 🤗 Transformer models
The classes [`PyTorchBenchmark`] and [`TensorFlowBenchmark`] make it possible to flexibly benchmark 🤗 Transformer models. The benchmark classes allow us to measure the _peak memory usage_ and _required time_ for both _inference_ and _training_.
<Tip>
Hereby, _inference_ is defined by a single forward pass, and _training_ is defined by a single forward pass and
backward pass.
</Tip>
The benchmark classes [`PyTorchBenchmark`] and [`TensorFlowBenchmark`] expect an object of type [`PyTorchBenchmarkArguments`] and
[`TensorFlowBenchmarkArguments`], respectively, for instantiation. [`PyTorchBenchmarkArguments`] and [`TensorFlowBenchmarkArguments`] are data classes that contain all relevant configurations for their corresponding benchmark class. The following example shows how a BERT model of type _bert-base-uncased_ can be benchmarked.
```py
>>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments
>>> args = PyTorchBenchmarkArguments(models=["bert-base-uncased"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512])
>>> benchmark = PyTorchBenchmark(args)
===PT-TF-SPLIT===
>>> from transformers import TensorFlowBenchmark, TensorFlowBenchmarkArguments
>>> args = TensorFlowBenchmarkArguments(models=["bert-base-uncased"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512])
>>> benchmark = TensorFlowBenchmark(args)
```
Here, three arguments are given to the benchmark argument data classes, namely `models`, `batch_sizes`, and
`sequence_lengths`. The argument `models` is required and expects a `list` of model identifiers from the
[model hub](https://huggingface.co/models). The `list` arguments `batch_sizes` and `sequence_lengths` define
the size of the `input_ids` on which the model is benchmarked. There are many more parameters that can be configured
via the benchmark argument data classes. For more detail on these, one can directly consult the files
`src/transformers/benchmark/benchmark_args_utils.py`, `src/transformers/benchmark/benchmark_args.py` (for PyTorch)
and `src/transformers/benchmark/benchmark_args_tf.py` (for TensorFlow). Alternatively, running the following shell
commands from the root directory will print out a descriptive list of all configurable parameters for PyTorch and
TensorFlow, respectively.
```bash
python examples/pytorch/benchmarking/run_benchmark.py --help
===PT-TF-SPLIT===
python examples/tensorflow/benchmarking/run_benchmark_tf.py --help
```
An instantiated benchmark object can then simply be run by calling `benchmark.run()`.
```py
>>> results = benchmark.run()
>>> print(results)
==================== INFERENCE - SPEED - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Time in s
--------------------------------------------------------------------------------
bert-base-uncased 8 8 0.006
bert-base-uncased 8 32 0.006
bert-base-uncased 8 128 0.018
bert-base-uncased 8 512 0.088
--------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB
--------------------------------------------------------------------------------
bert-base-uncased 8 8 1227
bert-base-uncased 8 32 1281
bert-base-uncased 8 128 1307
bert-base-uncased 8 512 1539
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: PyTorch
- use_torchscript: False
- framework_version: 1.4.0
- python_version: 3.6.10
- system: Linux
- cpu: x86_64
- architecture: 64bit
- date: 2020-06-29
- time: 08:58:43.371351
- fp16: False
- use_multiprocessing: True
- only_pretrain_model: False
- cpu_ram_mb: 32088
- use_gpu: True
- num_gpus: 1
- gpu: TITAN RTX
- gpu_ram_mb: 24217
- gpu_power_watts: 280.0
- gpu_performance_state: 2
- use_tpu: False
===PT-TF-SPLIT===
>>> results = benchmark.run()
>>> print(results)
==================== INFERENCE - SPEED - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Time in s
--------------------------------------------------------------------------------
bert-base-uncased 8 8 0.005
bert-base-uncased 8 32 0.008
bert-base-uncased 8 128 0.022
bert-base-uncased 8 512 0.105
--------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB
--------------------------------------------------------------------------------
bert-base-uncased 8 8 1330
bert-base-uncased 8 32 1330
bert-base-uncased 8 128 1330
bert-base-uncased 8 512 1770
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: Tensorflow
- use_xla: False
- framework_version: 2.2.0
- python_version: 3.6.10
- system: Linux
- cpu: x86_64
- architecture: 64bit
- date: 2020-06-29
- time: 09:26:35.617317
- fp16: False
- use_multiprocessing: True
- only_pretrain_model: False
- cpu_ram_mb: 32088
- use_gpu: True
- num_gpus: 1
- gpu: TITAN RTX
- gpu_ram_mb: 24217
- gpu_power_watts: 280.0
- gpu_performance_state: 2
- use_tpu: False
```
By default, the _time_ and the _required memory_ for _inference_ are benchmarked. In the example output above, the first
two sections show the results corresponding to _inference time_ and _inference memory_. In addition, all relevant
information about the computing environment, _e.g._ the GPU type, the system, the library versions, etc., is printed
out in the third section under _ENVIRONMENT INFORMATION_. This information can optionally be saved in a _.csv_ file
by adding the argument `save_to_csv=True` to [`PyTorchBenchmarkArguments`] and
[`TensorFlowBenchmarkArguments`], respectively. In this case, every section is saved in a separate
_.csv_ file. The path to each _.csv_ file can optionally be defined via the argument data classes.
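For example, a run that writes each section to its own file could look like the following sketch. The exact keyword names for the file paths, e.g. `inference_time_csv_file` and `env_info_csv_file`, should be double-checked against `benchmark_args_utils.py` or the `--help` output above; they are assumptions here.
```py
>>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

>>> args = PyTorchBenchmarkArguments(
...     models=["bert-base-uncased"],
...     batch_sizes=[8],
...     sequence_lengths=[8, 32, 128, 512],
...     save_to_csv=True,
...     inference_time_csv_file="inference_time.csv",  # assumed keyword names, see benchmark_args_utils.py
...     inference_memory_csv_file="inference_memory.csv",
...     env_info_csv_file="env_info.csv",
... )
>>> results = PyTorchBenchmark(args).run()  # also writes the .csv files configured above
```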
Instead of benchmarking pre-trained models via their model identifier, _e.g._ `bert-base-uncased`, the user can
alternatively benchmark an arbitrary configuration of any available model class. In this case, a `list` of
configurations must be passed to the benchmark class together with the benchmark args, as follows.
```py
>>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments, BertConfig
>>> args = PyTorchBenchmarkArguments(models=["bert-base", "bert-384-hid", "bert-6-lay"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512])
>>> config_base = BertConfig()
>>> config_384_hid = BertConfig(hidden_size=384)
>>> config_6_lay = BertConfig(num_hidden_layers=6)
>>> benchmark = PyTorchBenchmark(args, configs=[config_base, config_384_hid, config_6_lay])
>>> benchmark.run()
==================== INFERENCE - SPEED - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Time in s
--------------------------------------------------------------------------------
bert-base 8 8 0.006
bert-base 8 32 0.006
bert-base 8 128 0.018
bert-base 8 512 0.088
bert-384-hid 8 8 0.006
bert-384-hid 8 32 0.006
bert-384-hid 8 128 0.011
bert-384-hid 8 512 0.054
bert-6-lay 8 8 0.003
bert-6-lay 8 32 0.004
bert-6-lay 8 128 0.009
bert-6-lay 8 512 0.044
--------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB
--------------------------------------------------------------------------------
bert-base 8 8 1277
bert-base 8 32 1281
bert-base 8 128 1307
bert-base 8 512 1539
bert-384-hid 8 8 1005
bert-384-hid 8 32 1027
bert-384-hid 8 128 1035
bert-384-hid 8 512 1255
bert-6-lay 8 8 1097
bert-6-lay 8 32 1101
bert-6-lay 8 128 1127
bert-6-lay 8 512 1359
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: PyTorch
- use_torchscript: False
- framework_version: 1.4.0
- python_version: 3.6.10
- system: Linux
- cpu: x86_64
- architecture: 64bit
- date: 2020-06-29
- time: 09:35:25.143267
- fp16: False
- use_multiprocessing: True
- only_pretrain_model: False
- cpu_ram_mb: 32088
- use_gpu: True
- num_gpus: 1
- gpu: TITAN RTX
- gpu_ram_mb: 24217
- gpu_power_watts: 280.0
- gpu_performance_state: 2
- use_tpu: False
===PT-TF-SPLIT===
>>> from transformers import TensorFlowBenchmark, TensorFlowBenchmarkArguments, BertConfig
>>> args = TensorFlowBenchmarkArguments(models=["bert-base", "bert-384-hid", "bert-6-lay"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512])
>>> config_base = BertConfig()
>>> config_384_hid = BertConfig(hidden_size=384)
>>> config_6_lay = BertConfig(num_hidden_layers=6)
>>> benchmark = TensorFlowBenchmark(args, configs=[config_base, config_384_hid, config_6_lay])
>>> benchmark.run()
==================== INFERENCE - SPEED - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Time in s
--------------------------------------------------------------------------------
bert-base 8 8 0.005
bert-base 8 32 0.008
bert-base 8 128 0.022
bert-base 8 512 0.106
bert-384-hid 8 8 0.005
bert-384-hid 8 32 0.007
bert-384-hid 8 128 0.018
bert-384-hid 8 512 0.064
bert-6-lay 8 8 0.002
bert-6-lay 8 32 0.003
bert-6-lay 8 128 0.0011
bert-6-lay 8 512 0.074
--------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB
--------------------------------------------------------------------------------
bert-base 8 8 1330
bert-base 8 32 1330
bert-base 8 128 1330
bert-base 8 512 1770
bert-384-hid 8 8 1330
bert-384-hid 8 32 1330
bert-384-hid 8 128 1330
bert-384-hid 8 512 1540
bert-6-lay 8 8 1330
bert-6-lay 8 32 1330
bert-6-lay 8 128 1330
bert-6-lay 8 512 1540
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: Tensorflow
- use_xla: False
- framework_version: 2.2.0
- python_version: 3.6.10
- system: Linux
- cpu: x86_64
- architecture: 64bit
- date: 2020-06-29
- time: 09:38:15.487125
- fp16: False
- use_multiprocessing: True
- only_pretrain_model: False
- cpu_ram_mb: 32088
- use_gpu: True
- num_gpus: 1
- gpu: TITAN RTX
- gpu_ram_mb: 24217
- gpu_power_watts: 280.0
- gpu_performance_state: 2
- use_tpu: False
```
Again, _inference time_ and _required memory_ for _inference_ are measured, but this time for customized configurations
of the `BertModel` class. This feature can be especially helpful when deciding which configuration the model
should be trained with.
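Because the benchmark classes can also measure _training_ (a forward and a backward pass), one could additionally benchmark training for such a custom configuration along the lines of the sketch below. The `training=True` flag is an assumption based on the fields in `benchmark_args_utils.py` and should be verified against the `--help` output.
```py
>>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments, BertConfig

>>> args = PyTorchBenchmarkArguments(
...     models=["bert-6-lay"],
...     batch_sizes=[8],
...     sequence_lengths=[8, 32, 128],
...     training=True,  # assumed flag: also benchmark a forward + backward pass
... )
>>> benchmark = PyTorchBenchmark(args, configs=[BertConfig(num_hidden_layers=6)])
>>> results = benchmark.run()
```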
## Benchmark best practices
This section lists a couple of best practices one should be aware of when benchmarking a model.
- Currently, only single device benchmarking is supported. When benchmarking on GPU, it is recommended that the user
specifies on which device the code should be run by setting the `CUDA_VISIBLE_DEVICES` environment variable in the
shell, _e.g._ `export CUDA_VISIBLE_DEVICES=0`, before running the code (a small example is sketched after this list).
- The option `no_multi_processing` should only be set to `True` for testing and debugging. To ensure accurate
memory measurement, it is recommended to run each memory benchmark in a separate process by making sure
`no_multi_processing` is set to `False`.
- One should always state the environment information when sharing the results of a model benchmark. Results can vary
heavily between different GPU devices, library versions, etc., so that benchmark results on their own are not very
useful for the community.
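For instance, pinning the benchmark script to the first GPU from the shell could look like the following; the CLI flag names are assumed to mirror the fields of the argument data classes and should be confirmed via the `--help` commands shown earlier.
```bash
export CUDA_VISIBLE_DEVICES=0
python examples/pytorch/benchmarking/run_benchmark.py \
    --models bert-base-uncased \
    --batch_sizes 8 \
    --sequence_lengths 8 32 128 512
```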
## Sharing your benchmark
Previously, all available core models (10 at the time) were benchmarked for _inference time_, across many different
settings: using PyTorch, with and without TorchScript, using TensorFlow, with and without XLA. All of those tests were
done across CPUs (except for TensorFlow XLA) and GPUs.
The approach is detailed in the [following blogpost](https://medium.com/huggingface/benchmarking-transformers-pytorch-and-tensorflow-e2917fb891c2) and the results are
available [here](https://docs.google.com/spreadsheets/d/1sryqufw2D0XlUH4sq3e9Wnxu5EAQkaohzrJbd5HdQ_w/edit?usp=sharing).
With the new _benchmark_ tools, it is easier than ever to share your benchmark results with the community:
- [PyTorch Benchmarking Results](https://github.com/huggingface/transformers/tree/master/examples/pytorch/benchmarking/README.md).
- [TensorFlow Benchmarking Results](https://github.com/huggingface/transformers/tree/master/examples/tensorflow/benchmarking/README.md).
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Multi-lingual models
[[open-in-colab]]
Most of the models available in this library are mono-lingual models (English, Chinese and German). A few multi-lingual
models are available and have a different mechanism than mono-lingual models. This page details the usage of these
models.
## XLM
XLM has a total of 10 different checkpoints, only one of which is mono-lingual. The 9 remaining model checkpoints can
be split in two categories: the checkpoints that make use of language embeddings, and those that don't.
### XLM & Language Embeddings
This section concerns the following checkpoints:
- `xlm-mlm-ende-1024` (Masked language modeling, English-German)
- `xlm-mlm-enfr-1024` (Masked language modeling, English-French)
- `xlm-mlm-enro-1024` (Masked language modeling, English-Romanian)
- `xlm-mlm-xnli15-1024` (Masked language modeling, XNLI languages)
- `xlm-mlm-tlm-xnli15-1024` (Masked language modeling + Translation, XNLI languages)
- `xlm-clm-enfr-1024` (Causal language modeling, English-French)
- `xlm-clm-ende-1024` (Causal language modeling, English-German)
These checkpoints require language embeddings that will specify the language used at inference time. These language
embeddings are represented as a tensor that is of the same shape as the input ids passed to the model. The values in
these tensors depend on the language used and are identifiable using the `lang2id` and `id2lang` attributes from
the tokenizer.
Here is an example using the `xlm-clm-enfr-1024` checkpoint (Causal language modeling, English-French):
```py
>>> import torch
>>> from transformers import XLMTokenizer, XLMWithLMHeadModel

>>> tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
>>> model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")
```
The different languages this model/tokenizer handles, as well as the ids of these languages are visible using the
`lang2id` attribute:

```py
>>> print(tokenizer.lang2id)
{'en': 0, 'fr': 1}
```
These ids should be used when passing a language parameter during a model pass. Let's define our inputs:

```py
>>> input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")])  # batch size of 1
```
We should now define the language embedding by using the previously defined language id. We want to create a tensor
filled with the appropriate language ids, of the same size as `input_ids`. For English, the id is 0:

```py
>>> language_id = tokenizer.lang2id['en']  # 0
>>> langs = torch.tensor([language_id] * input_ids.shape[1])  # torch.tensor([0, 0, 0, ..., 0])

>>> # We reshape it to be of size (batch_size, sequence_length)
>>> langs = langs.view(1, -1)  # is now of shape [1, sequence_length] (we have a batch size of 1)
```
You can then feed it all as input to your model:

```py
>>> outputs = model(input_ids, langs=langs)
```

The example [run_generation.py](https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-generation/run_generation.py) can generate text
using the CLM checkpoints from XLM, using the language embeddings.
### XLM without Language Embeddings
This section concerns the following checkpoints:
- `xlm-mlm-17-1280` (Masked language modeling, 17 languages)
- `xlm-mlm-100-1280` (Masked language modeling, 100 languages)
These checkpoints do not require language embeddings at inference time. These models are used to produce generic
sentence representations, unlike the previously-mentioned XLM checkpoints.
## BERT
BERT has two checkpoints that can be used for multi-lingual tasks:
- `bert-base-multilingual-uncased` (Masked language modeling + Next sentence prediction, 102 languages)
- `bert-base-multilingual-cased` (Masked language modeling + Next sentence prediction, 104 languages)
These checkpoints do not require language embeddings at inference time. They should identify the language used in the
context and infer accordingly.
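As an illustration, such a checkpoint can be loaded and used like any mono-lingual model, with no `langs` tensor involved (a minimal sketch using the Auto classes):

```py
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
>>> model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

>>> inputs = tokenizer("Paris est la capitale de la France.", return_tensors="pt")
>>> outputs = model(**inputs)  # no language embeddings are passed; the model infers the language from context
```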
## XLM-RoBERTa
XLM-RoBERTa was trained on 2.5TB of newly created clean CommonCrawl data in 100 languages. It provides strong gains
over previously released multi-lingual models like mBERT or XLM on downstream tasks like classification, sequence
labeling and question answering.
Two XLM-RoBERTa checkpoints can be used for multi-lingual tasks:
- `xlm-roberta-base` (Masked language modeling, 100 languages)
- `xlm-roberta-large` (Masked language modeling, 100 languages)
## mLUKE
mLUKE is based on XLM-RoBERTa and further trained on Wikipedia articles in 24 languages with a masked language modeling
objective as well as a masked entity prediction objective.
The model can be used in the same way as other models solely based on word-piece inputs, but it can also be used with
entity representations to achieve further performance gains on entity-related tasks such as relation extraction,
named entity recognition and question answering (see [LUKE](model_doc/luke)).
Currently, one mLUKE checkpoint is available:
- `studio-ousia/mluke-base` (Masked language modeling + Masked entity prediction, 100 languages)
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Perplexity of fixed-length models
[[open-in-colab]]
Perplexity (PPL) is one of the most common metrics for evaluating language models. Before diving in, we should note
that the metric applies specifically to classical language models (sometimes called autoregressive or causal language
models) and is not well defined for masked language models like BERT (see [summary of the models](model_summary)).
Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized
sequence \\(X = (x_0, x_1, \dots, x_t)\\), then the perplexity of \\(X\\) is,

$$\text{PPL}(X) = \exp \left\{ {-\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{<i}) } \right\}$$

where \\(\log p_\theta (x_i|x_{<i})\\) is the log-likelihood of the ith token conditioned on the preceding tokens
\\(x_{<i}\\) according to our model. Intuitively, it can be thought of as an evaluation of the model's ability to
predict uniformly among the set of specified tokens in a corpus. Importantly, this means that the tokenization
procedure has a direct impact on a model's perplexity, which should always be taken into consideration when comparing
different models.
This is also equivalent to the exponentiation of the cross-entropy between the data and model predictions. For more
intuition about perplexity and its relationship to Bits Per Character (BPC) and data compression, check out this
[fantastic blog post on The Gradient](https://thegradient.pub/understanding-evaluation-metrics-for-language-models/).
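As a minimal illustration of that equivalence, the perplexity of a causal language model can be recovered directly from its loss, since that loss is the average negative log-likelihood per token (the numeric value below is purely hypothetical):

```python
import torch

# a hypothetical cross-entropy loss reported by a causal language model
# (i.e. the average negative log-likelihood per token)
loss = torch.tensor(2.9)
ppl = torch.exp(loss)  # perplexity is the exponentiated cross-entropy
```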
## Calculating PPL with fixed-length models
If we weren't limited by a model's context size, we would evaluate the model's perplexity by autoregressively
factorizing a sequence and conditioning on the entire preceding subsequence at each step, as shown below.

<img width="600" alt="Full decomposition of a sequence with unlimited context length" src="/imgs/ppl_full.gif"/>
When working with approximate models, however, we typically have a constraint on the number of tokens the model can
process. The largest version of [GPT-2](model_doc/gpt2), for example, has a fixed length of 1024 tokens, so we
cannot calculate \\(p_\theta(x_t|x_{<t})\\) directly when \\(t\\) is greater than 1024.
Instead, the sequence is typically broken into subsequences equal to the model's maximum input size. If a model's max
input size is \\(k\\), we then approximate the likelihood of a token \\(x_t\\) by conditioning only on the
\\(k-1\\) tokens that precede it rather than the entire context. When evaluating the model's perplexity of a
sequence, a tempting but suboptimal approach is to break the sequence into disjoint chunks and add up the decomposed
log-likelihoods of each segment independently.

<img width="600" alt="Suboptimal PPL not taking advantage of full available context" src="/imgs/ppl_chunked.gif"/>
This is quick to compute since the perplexity of each segment can be computed in one forward pass, but serves as a poor
approximation of the fully-factorized perplexity and will typically yield a higher (worse) PPL because the model will
have less context at most of the prediction steps.
Instead, the PPL of fixed-length models should be evaluated with a sliding-window strategy. This involves repeatedly
sliding the context window so that the model has more context when making each prediction.

<img width="600" alt="Sliding window PPL taking advantage of all available context" src="/imgs/ppl_sliding.gif"/>
This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more
favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good
practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by
1 token at a time. This allows computation to proceed much faster while still giving the model a large context to make
predictions at each step.

## Example: Calculating perplexity with GPT-2 in 🤗 Transformers
Let's demonstrate this process with GPT-2.
```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = 'cuda'
model_id = 'gpt2-large'
model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)
```
We'll load in the WikiText-2 dataset and evaluate the perplexity using a few different sliding-window strategies. Since
this dataset is small and we're just doing one forward pass over the set, we can just load and encode the entire
dataset in memory.

```python
from datasets import load_dataset

test = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
encodings = tokenizer('\n\n'.join(test['text']), return_tensors='pt')
```

With 🤗 Transformers, we can simply pass the `input_ids` as the `labels` to our model, and the average negative
log-likelihood for each token is returned as the loss. With our sliding window approach, however, there is overlap in
the tokens we pass to the model at each iteration. We don't want the log-likelihood for the tokens we're just treating
as context to be included in our loss, so we can set these targets to `-100` so that they are ignored. The following
is an example of how we could do this with a stride of `512`. This means that the model will have at least 512 tokens
for context when calculating the conditional likelihood of any one token (provided there are 512 preceding tokens
available to condition on).
```python
import torch
from tqdm import tqdm

max_length = model.config.n_positions
stride = 512

nlls = []
for i in tqdm(range(0, encodings.input_ids.size(1), stride)):
    begin_loc = max(i + stride - max_length, 0)
    end_loc = min(i + stride, encodings.input_ids.size(1))
    trg_len = end_loc - i  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # ignore the purely-contextual tokens in the loss

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)
        # the loss is the average negative log-likelihood per scored token,
        # so multiply by trg_len to recover the summed negative log-likelihood
        neg_log_likelihood = outputs[0] * trg_len

    nlls.append(neg_log_likelihood)

ppl = torch.exp(torch.stack(nlls).sum() / end_loc)
```
Running this with the stride length equal to the max input length is equivalent to the suboptimal, non-sliding-window
strategy we discussed above. The smaller the stride, the more context the model will have in making each prediction,
and the better the reported perplexity will typically be.
When we run the above with `stride = 1024`, i.e. no overlap, the resulting PPL is `19.64`, which is about the same
as the `19.93` reported in the GPT-2 paper. By using `stride = 512` and thereby employing our striding window
strategy, this jumps down to `16.53`. This is not only a more favorable score, but is calculated in a way that is
closer to the true autoregressive decomposition of a sequence likelihood.
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Preprocessing data
[[open-in-colab]]
In this tutorial, we'll explore how to preprocess your data using 🤗 Transformers. The main tool for this is what we
call a [tokenizer](main_classes/tokenizer). You can build one using the tokenizer class associated with the model
you would like to use, or directly with the [`AutoTokenizer`] class.
As we saw in the [quick tour](quicktour), the tokenizer will first split a given text into words (or parts of
words, punctuation symbols, etc.), usually called _tokens_. Then it will convert those _tokens_ into numbers, to be able
to build a tensor out of them and feed them to the model. It will also add any additional inputs the model might expect
to work properly.
<Tip>
If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer: it will split
the text you give it into tokens the same way it did for the pretraining corpus, and it will use the same token-to-index
correspondence (which we usually call a _vocab_) as during pretraining.
</Tip>
To automatically download the vocab used when pretraining or fine-tuning a given model, you can use the
[`AutoTokenizer.from_pretrained`] method:
```py
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
```
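Under the hood, the tokenizer performs the two steps described above: splitting the text into tokens, then mapping each token to an id. If you are curious, you can run these steps separately with the `tokenize` and `convert_tokens_to_ids` methods; this is only an illustration (the example sentence is ours), and calling the tokenizer object directly, as shown in the next section, remains the recommended way since it also adds all the extra inputs the model expects.

```py
# Illustration only: the two steps performed by the tokenizer, done by hand
tokens = tokenizer.tokenize("Hello, I'm a single sentence!")  # split into (sub)word tokens
ids = tokenizer.convert_tokens_to_ids(tokens)  # map each token to its index in the vocab
# Note: unlike calling `tokenizer(...)` directly, this does not add special tokens,
# the attention mask, or any other model-specific inputs.
```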
## Base use
<Youtube id="Yffk5aydLzg"/>
A [`PreTrainedTokenizer`] has many methods, but the only one you need to remember for preprocessing
is its `__call__`: you just need to feed your sentence to your tokenizer object.
```py
>>> encoded_input = tokenizer("Hello, I'm a single sentence!")
>>> print(encoded_input)
{'input_ids': [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
This returns a dictionary mapping strings to lists of ints. The [input_ids](glossary#input-ids) are the indices corresponding
to each token in our sentence. We will see below what the [attention_mask](glossary#attention-mask) is used for and
in [the next section](#preprocessing-pairs-of-sentences) the goal of [token_type_ids](glossary#token-type-ids).
The tokenizer can decode a list of token ids back into a proper sentence:
```py
>>> tokenizer.decode(encoded_input["input_ids"])
"[CLS] Hello, I'm a single sentence! [SEP]"
```
As you can see, the tokenizer automatically added some special tokens that the model expects. Not all models need
special tokens; for instance, if we had used _gpt2-medium_ instead of _bert-base-cased_ to create our tokenizer, we
would have seen a decoded sentence identical to the original one here. You can disable this behavior (which is only advised if you
have added those special tokens yourself) by passing `add_special_tokens=False`.
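As a minimal sketch of the effect (reusing the tokenizer and sentence from above), you can compare the encoding with and without special tokens; for a BERT-like tokenizer the difference is the `[CLS]` and `[SEP]` tokens:

```py
with_special = tokenizer("Hello, I'm a single sentence!")
without_special = tokenizer("Hello, I'm a single sentence!", add_special_tokens=False)
# For bert-base-cased the first encoding has two extra ids: [CLS] at the start and [SEP] at the end
print(len(with_special["input_ids"]) - len(without_special["input_ids"]))
```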
If you have several sentences you want to process, you can do this efficiently by sending them as a list to the
tokenizer:
```py
>>> batch_sentences = ["Hello I'm a single sentence",
... "And another sentence",
... "And the very very last one"]
>>> encoded_inputs = tokenizer(batch_sentences)
>>> print(encoded_inputs)
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
[101, 1262, 1330, 5650, 102],
[101, 1262, 1103, 1304, 1304, 1314, 1141, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1]]}
```
We get back a dictionary once again, this time with values being lists of lists of ints.
If the purpose of sending several sentences at a time to the tokenizer is to build a batch to feed the model, you will
probably want:
- To pad each sentence to the maximum length there is in your batch.
- To truncate each sentence to the maximum length the model can accept (if applicable).
- To return tensors.
You can do all of this by using the following options when feeding your list of sentences to the tokenizer:
```py
>>> batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
>>> print(batch)
{'input_ids': tensor([[ 101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
[ 101, 1262, 1330, 5650, 102, 0, 0, 0, 0],
[ 101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 0]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 0]])}
===PT-TF-SPLIT===
>>> batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
>>> print(batch)
{'input_ids': tf.Tensor([[ 101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
[ 101, 1262, 1330, 5650, 102, 0, 0, 0, 0],
[ 101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 0]]),
'token_type_ids': tf.Tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0]]),
'attention_mask': tf.Tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 0]])}
```
It returns a dictionary with string keys and tensor values. We can now see what the [attention_mask](glossary#attention-mask) is all about: it points out which tokens the model should pay attention to and which ones
it should not (because they represent padding in this case).
Note that if your model does not have a maximum length associated with it, the command above will throw a warning. You
can safely ignore it. You can also pass `verbose=False` to stop the tokenizer from emitting those kinds of warnings.
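For instance, a minimal sketch reusing `batch_sentences` from above:

```py
# Same call as before, but with tokenizer warnings silenced
batch = tokenizer(
    batch_sentences,
    padding=True,
    truncation=True,
    return_tensors="pt",
    verbose=False,  # suppress tokenizer warnings such as the missing-max-length one
)
```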
<a id='sentence-pairs'></a>
## Preprocessing pairs of sentences
<Youtube id="0u3ioSwev3s"/>
Sometimes you need to feed a pair of sentences to your model, for instance to classify whether two sentences in
a pair are similar, or for question-answering models, which take a context and a question. For BERT models, the input
is then represented like this: `[CLS] Sequence A [SEP] Sequence B [SEP]`
You can encode a pair of sentences in the format expected by your model by supplying the two sentences as two arguments
(not a list since a list of two sentences will be interpreted as a batch of two single sentences, as we saw before).
This will once again return a dictionary mapping strings to lists of ints:
```py
>>> encoded_input = tokenizer("How old are you?", "I'm 6 years old")
>>> print(encoded_input)
{'input_ids': [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
This shows us what the [token_type_ids](glossary#token-type-ids) are for: they indicate to the model which part of
the inputs corresponds to the first sentence and which part corresponds to the second sentence. Note that
_token_type_ids_ are not required or handled by all models. By default, a tokenizer will only return the inputs that
its associated model expects. You can force the return (or the non-return) of any of those special arguments by using
`return_token_type_ids` or `return_attention_mask`.
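For example, here is a small sketch that disables `token_type_ids` in the output even though a BERT tokenizer would normally return them (the variable name is ours):

```py
pair_without_token_types = tokenizer(
    "How old are you?",
    "I'm 6 years old",
    return_token_type_ids=False,  # ask the tokenizer not to return token_type_ids
)
# The result only contains input_ids and attention_mask
print(pair_without_token_types.keys())
```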
If we decode the token ids we obtained, we will see that the special tokens have been properly added.
```py
>>> tokenizer.decode(encoded_input["input_ids"])
"[CLS] How old are you? [SEP] I'm 6 years old [SEP]"
```
If you have a list of pairs of sequences you want to process, you should feed them as two lists to your tokenizer: the
list of first sentences and the list of second sentences:
```py
>>> batch_sentences = ["Hello I'm a single sentence",
... "And another sentence",
... "And the very very last one"]
>>> batch_of_second_sentences = ["I'm a sentence that goes with the first sentence",
... "And I should be encoded with the second sentence",
... "And I go with the very last one"]
>>> encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences)
>>> print(encoded_inputs)
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102],
[101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102],
[101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
```
As we can see, it returns a dictionary where each value is a list of lists of ints.
To double-check what is fed to the model, we can decode each list in _input_ids_ one by one:
```py
>>> for ids in encoded_inputs["input_ids"]:
...     print(tokenizer.decode(ids))
[CLS] Hello I'm a single sentence [SEP] I'm a sentence that goes with the first sentence [SEP]
[CLS] And another sentence [SEP] And I should be encoded with the second sentence [SEP]
[CLS] And the very very last one [SEP] And I go with the very last one [SEP]
```
Once again, you can automatically pad your inputs to the maximum sentence length in the batch, truncate to the maximum
length the model can accept and return tensors directly with the following:
```py
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="pt")
===PT-TF-SPLIT===
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="tf")
```
## Everything you always wanted to know about padding and truncation
We have seen the commands that will work for most cases (pad your batch to the length of the maximum sentence and
truncate to the maximum length the model can accept). However, the API supports more strategies if you need them. The
three arguments you need to know for this are `padding`, `truncation` and `max_length`.
- `padding` controls the padding. It can be a boolean or a string which should be:
  - `True` or `'longest'` to pad to the longest sequence in the batch (doing no padding if you only provide a single sequence).
  - `'max_length'` to pad to a length specified by the `max_length` argument or the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). If you only provide a single sequence, padding will still be applied to it.
  - `False` or `'do_not_pad'` to not pad the sequences. As we have seen before, this is the default behavior.
- `truncation` controls the truncation. It can be a boolean or a string which should be:
  - `True` or `'longest_first'` to truncate to a maximum length specified by the `max_length` argument or the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). This will truncate token by token, removing a token from the longest sequence in the pair until the proper length is reached.
  - `'only_second'` to truncate to a maximum length specified by the `max_length` argument or the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). This will only truncate the second sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
  - `'only_first'` to truncate to a maximum length specified by the `max_length` argument or the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). This will only truncate the first sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
  - `False` or `'do_not_truncate'` to not truncate the sequences. As we have seen before, this is the default behavior.
- `max_length` controls the length used for padding/truncation. It can be an integer or `None`, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to `max_length` is deactivated.
Here is a table summarizing the recommended way to set up padding and truncation. If you use pairs of input sequences in
any of the following examples, you can replace `truncation=True` by a `STRATEGY` selected in
`['only_first', 'only_second', 'longest_first']`, i.e. `truncation='only_second'` or `truncation='longest_first'`, to control how both sequences in the pair are truncated as detailed before.
| Truncation | Padding | Instruction |
|--------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------------|
| no truncation | no padding | `tokenizer(batch_sentences)` |
| | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True)` or |
| | | `tokenizer(batch_sentences, padding='longest')` |
| | padding to max model input length | `tokenizer(batch_sentences, padding='max_length')` |
| | padding to specific length | `tokenizer(batch_sentences, padding='max_length', max_length=42)` |
| truncation to max model input length | no padding | `tokenizer(batch_sentences, truncation=True)` or |
| | | `tokenizer(batch_sentences, truncation=STRATEGY)` |
| | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True)` or |
| | | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY)` |
| | padding to max model input length | `tokenizer(batch_sentences, padding='max_length', truncation=True)` or |
| | | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY)` |
| | padding to specific length | Not possible |
| truncation to specific length | no padding | `tokenizer(batch_sentences, truncation=True, max_length=42)` or |
| | | `tokenizer(batch_sentences, truncation=STRATEGY, max_length=42)` |
| | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True, max_length=42)` or |
| | | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42)` |
| | padding to max model input length | Not possible |
| | padding to specific length | `tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)` or |
| | | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42)` |
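As a concrete illustration of the last row of the table, here is a sketch that pads every pair to a fixed length and, when a pair is too long, removes tokens from the second sentence only (the value `42` is arbitrary, and `batch_sentences`/`batch_of_second_sentences` are the lists defined above):

```py
batch = tokenizer(
    batch_sentences,
    batch_of_second_sentences,
    padding="max_length",      # pad everything to exactly max_length
    truncation="only_second",  # if too long, truncate the second sentence of each pair
    max_length=42,
    return_tensors="pt",
)
```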
## Pre-tokenized inputs
The tokenizer also accepts pre-tokenized inputs. This is particularly useful when you want to compute labels and extract
predictions in [named entity recognition (NER)](https://en.wikipedia.org/wiki/Named-entity_recognition) or
[part-of-speech tagging (POS tagging)](https://en.wikipedia.org/wiki/Part-of-speech_tagging).
<Tip warning={true}>
Pre-tokenized does not mean your inputs are already tokenized (you wouldn't need to pass them through the tokenizer
if that was the case) but just split into words (which is often the first step in subword tokenization algorithms
like BPE).
</Tip>
If you want to use pre-tokenized inputs, just set `is_split_into_words=True` when passing your inputs to the
tokenizer. For instance, we have:
```py
>>> encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
>>> print(encoded_input)
{'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
Note that the tokenizer still adds the ids of special tokens (if applicable) unless you pass
`add_special_tokens=False`.
This works exactly as before for batches of sentences or batches of pairs of sentences. You can encode a batch of sentences
like this:
```py
batch_sentences = [["Hello", "I'm", "a", "single", "sentence"],
["And", "another", "sentence"],
["And", "the", "very", "very", "last", "one"]]
encoded_inputs = tokenizer(batch_sentences, is_split_into_words=True)
```
or a batch of pairs of sentences like this:
```py
batch_of_second_sentences = [["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
["And", "I", "go", "with", "the", "very", "last", "one"]]
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True)
```
And you can add padding and truncation, as well as directly return tensors, like before:
```py
batch = tokenizer(batch_sentences,
batch_of_second_sentences,
is_split_into_words=True,
padding=True,
truncation=True,
return_tensors="pt")
===PT-TF-SPLIT===
batch = tokenizer(batch_sentences,
batch_of_second_sentences,
is_split_into_words=True,
padding=True,
truncation=True,
return_tensors="tf")
```