Unverified Commit 08f534d2 authored by Sylvain Gugger's avatar Sylvain Gugger Committed by GitHub
Browse files

Doc styling (#8067)

* Important files

* Styling them all

* Revert "Styling them all"

This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.

* Syling them for realsies

* Fix syntax error

* Fix benchmark_utils

* More fixes

* Fix modeling auto and script

* Remove new line

* Fixes

* More fixes

* Fix more files

* Style

* Add FSMT

* More fixes

* More fixes

* More fixes

* More fixes

* Fixes

* More fixes

* More fixes

* Last fixes

* Make sphinx happy
parent 04a17f85
...@@ -322,6 +322,7 @@ jobs: ...@@ -322,6 +322,7 @@ jobs:
- run: black --check examples templates tests src utils - run: black --check examples templates tests src utils
- run: isort --check-only examples templates tests src utils - run: isort --check-only examples templates tests src utils
- run: flake8 examples templates tests src utils - run: flake8 examples templates tests src utils
- run: python utils/style_doc.py src/transformers docs/source --max_len 119 --check_only
- run: python utils/check_copies.py - run: python utils/check_copies.py
- run: python utils/check_dummies.py - run: python utils/check_dummies.py
- run: python utils/check_repo.py - run: python utils/check_repo.py
......
...@@ -15,6 +15,7 @@ modified_only_fixup: ...@@ -15,6 +15,7 @@ modified_only_fixup:
black $(modified_py_files); \ black $(modified_py_files); \
isort $(modified_py_files); \ isort $(modified_py_files); \
flake8 $(modified_py_files); \ flake8 $(modified_py_files); \
python utils/style_doc.py $(modified_py_files) --max_len 119; \
else \ else \
echo "No library .py files were modified"; \ echo "No library .py files were modified"; \
fi fi
...@@ -31,6 +32,7 @@ quality: ...@@ -31,6 +32,7 @@ quality:
black --check $(check_dirs) black --check $(check_dirs)
isort --check-only $(check_dirs) isort --check-only $(check_dirs)
flake8 $(check_dirs) flake8 $(check_dirs)
python utils/style_doc.py src/transformers docs/source --max_len 119 --check_only
${MAKE} extra_quality_checks ${MAKE} extra_quality_checks
# Format source code automatically and check is there are any problems left that need manual fixing # Format source code automatically and check is there are any problems left that need manual fixing
...@@ -38,6 +40,7 @@ quality: ...@@ -38,6 +40,7 @@ quality:
style: style:
black $(check_dirs) black $(check_dirs)
isort $(check_dirs) isort $(check_dirs)
python utils/style_doc.py src/transformers docs/source --max_len 119
# Super fast fix and check target that only works on relevant modified files since the branch was made # Super fast fix and check target that only works on relevant modified files since the branch was made
......
...@@ -180,7 +180,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. ...@@ -180,7 +180,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1. **[Longformer](https://huggingface.co/transformers/model_doc/longformer.html)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan. 1. **[Longformer](https://huggingface.co/transformers/model_doc/longformer.html)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
1. **[LXMERT](https://huggingface.co/transformers/model_doc/lxmert.html)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal. 1. **[LXMERT](https://huggingface.co/transformers/model_doc/lxmert.html)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal.
1. **[MarianMT](https://huggingface.co/transformers/model_doc/marian.html)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team. 1. **[MarianMT](https://huggingface.co/transformers/model_doc/marian.html)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
1. **[MBart](https://huggingface.co/transformers/model_doc/mbart.html)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. 1. **[MBart](https://huggingface.co/transformers/model_doc/mbart.html)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
1. **[Pegasus](https://huggingface.co/transformers/model_doc/pegasus.html)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777)> by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu. 1. **[Pegasus](https://huggingface.co/transformers/model_doc/pegasus.html)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777)> by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[ProphetNet](https://huggingface.co/transformers/model_doc/prophetnet.html)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. 1. **[ProphetNet](https://huggingface.co/transformers/model_doc/prophetnet.html)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya. 1. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
......
...@@ -3,21 +3,27 @@ Benchmarks ...@@ -3,21 +3,27 @@ Benchmarks
Let's take a look at how 🤗 Transformer models can be benchmarked, best practices, and already available benchmarks. Let's take a look at how 🤗 Transformer models can be benchmarked, best practices, and already available benchmarks.
A notebook explaining in more detail how to benchmark 🤗 Transformer models can be found `here <https://github.com/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb>`__. A notebook explaining in more detail how to benchmark 🤗 Transformer models can be found `here
<https://github.com/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb>`__.
How to benchmark 🤗 Transformer models How to benchmark 🤗 Transformer models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` allow to flexibly benchmark 🤗 Transformer models. The classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` allow to flexibly
The benchmark classes allow us to measure the `peak memory usage` and `required time` for both benchmark 🤗 Transformer models. The benchmark classes allow us to measure the `peak memory usage` and `required time`
`inference` and `training`. for both `inference` and `training`.
.. note:: .. note::
Hereby, `inference` is defined by a single forward pass, and `training` is defined by a single forward pass and backward pass. Hereby, `inference` is defined by a single forward pass, and `training` is defined by a single forward pass and
backward pass.
The benchmark classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` expect an object of type :class:`~transformers.PyTorchBenchmarkArguments` and :class:`~transformers.TensorFlowBenchmarkArguments`, respectively, for instantiation. :class:`~transformers.PyTorchBenchmarkArguments` and :class:`~transformers.TensorFlowBenchmarkArguments` are data classes and contain all relevant configurations for their corresponding benchmark class. The benchmark classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` expect an
In the following example, it is shown how a BERT model of type `bert-base-cased` can be benchmarked. object of type :class:`~transformers.PyTorchBenchmarkArguments` and
:class:`~transformers.TensorFlowBenchmarkArguments`, respectively, for instantiation.
:class:`~transformers.PyTorchBenchmarkArguments` and :class:`~transformers.TensorFlowBenchmarkArguments` are data
classes and contain all relevant configurations for their corresponding benchmark class. In the following example, it
is shown how a BERT model of type `bert-base-cased` can be benchmarked.
.. code-block:: .. code-block::
...@@ -34,11 +40,15 @@ In the following example, it is shown how a BERT model of type `bert-base-cased` ...@@ -34,11 +40,15 @@ In the following example, it is shown how a BERT model of type `bert-base-cased`
>>> benchmark = TensorFlowBenchmark(args) >>> benchmark = TensorFlowBenchmark(args)
Here, three arguments are given to the benchmark argument data classes, namely ``models``, ``batch_sizes``, and ``sequence_lengths``. The argument ``models`` is required and expects a :obj:`list` of model identifiers from the `model hub <https://huggingface.co/models>`__ Here, three arguments are given to the benchmark argument data classes, namely ``models``, ``batch_sizes``, and
The :obj:`list` arguments ``batch_sizes`` and ``sequence_lengths`` define the size of the ``input_ids`` on which the model is benchmarked. ``sequence_lengths``. The argument ``models`` is required and expects a :obj:`list` of model identifiers from the
There are many more parameters that can be configured via the benchmark argument data classes. For more detail on these one can either directly consult the files `model hub <https://huggingface.co/models>`__ The :obj:`list` arguments ``batch_sizes`` and ``sequence_lengths`` define
``src/transformers/benchmark/benchmark_args_utils.py``, ``src/transformers/benchmark/benchmark_args.py`` (for PyTorch) and ``src/transformers/benchmark/benchmark_args_tf.py`` (for Tensorflow). the size of the ``input_ids`` on which the model is benchmarked. There are many more parameters that can be configured
Alternatively, running the following shell commands from root will print out a descriptive list of all configurable parameters for PyTorch and Tensorflow respectively. via the benchmark argument data classes. For more detail on these one can either directly consult the files
``src/transformers/benchmark/benchmark_args_utils.py``, ``src/transformers/benchmark/benchmark_args.py`` (for PyTorch)
and ``src/transformers/benchmark/benchmark_args_tf.py`` (for Tensorflow). Alternatively, running the following shell
commands from root will print out a descriptive list of all configurable parameters for PyTorch and Tensorflow
respectively.
.. code-block:: bash .. code-block:: bash
...@@ -65,7 +75,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r ...@@ -65,7 +75,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
bert-base-uncased 8 128 0.018 bert-base-uncased 8 128 0.018
bert-base-uncased 8 512 0.088 bert-base-uncased 8 512 0.088
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ==================== ==================== INFERENCE - MEMORY - RESULT ====================
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB Model Name Batch Size Seq Length Memory in MB
...@@ -75,7 +85,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r ...@@ -75,7 +85,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
bert-base-uncased 8 128 1307 bert-base-uncased 8 128 1307
bert-base-uncased 8 512 1539 bert-base-uncased 8 512 1539
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ==================== ==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0 - transformers_version: 2.11.0
- framework: PyTorch - framework: PyTorch
...@@ -98,7 +108,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r ...@@ -98,7 +108,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
- gpu_power_watts: 280.0 - gpu_power_watts: 280.0
- gpu_performance_state: 2 - gpu_performance_state: 2
- use_tpu: False - use_tpu: False
>>> ## TENSORFLOW CODE >>> ## TENSORFLOW CODE
>>> results = benchmark.run() >>> results = benchmark.run()
>>> print(results) >>> print(results)
...@@ -111,7 +121,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r ...@@ -111,7 +121,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
bert-base-uncased 8 128 0.022 bert-base-uncased 8 128 0.022
bert-base-uncased 8 512 0.105 bert-base-uncased 8 512 0.105
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ==================== ==================== INFERENCE - MEMORY - RESULT ====================
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB Model Name Batch Size Seq Length Memory in MB
...@@ -121,7 +131,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r ...@@ -121,7 +131,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
bert-base-uncased 8 128 1330 bert-base-uncased 8 128 1330
bert-base-uncased 8 512 1770 bert-base-uncased 8 512 1770
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ==================== ==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0 - transformers_version: 2.11.0
- framework: Tensorflow - framework: Tensorflow
...@@ -145,14 +155,17 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r ...@@ -145,14 +155,17 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
- gpu_performance_state: 2 - gpu_performance_state: 2
- use_tpu: False - use_tpu: False
By default, the `time` and the `required memory` for `inference` are benchmarked. By default, the `time` and the `required memory` for `inference` are benchmarked. In the example output above the first
In the example output above the first two sections show the result corresponding to `inference time` and `inference memory`. two sections show the result corresponding to `inference time` and `inference memory`. In addition, all relevant
In addition, all relevant information about the computing environment, `e.g.` the GPU type, the system, the library versions, etc... are printed out in the third section under `ENVIRONMENT INFORMATION`. information about the computing environment, `e.g.` the GPU type, the system, the library versions, etc... are printed
This information can optionally be saved in a `.csv` file when adding the argument :obj:`save_to_csv=True` to :class:`~transformers.PyTorchBenchmarkArguments` and :class:`~transformers.TensorFlowBenchmarkArguments` respectively. out in the third section under `ENVIRONMENT INFORMATION`. This information can optionally be saved in a `.csv` file
In this case, every section is saved in a separate `.csv` file. The path to each `.csv` file can optionally be defined via the argument data classes. when adding the argument :obj:`save_to_csv=True` to :class:`~transformers.PyTorchBenchmarkArguments` and
:class:`~transformers.TensorFlowBenchmarkArguments` respectively. In this case, every section is saved in a separate
`.csv` file. The path to each `.csv` file can optionally be defined via the argument data classes.
Instead of benchmarking pre-trained models via their model identifier, `e.g.` `bert-base-uncased`, the user can alternatively benchmark an arbitrary configuration of any available model class. Instead of benchmarking pre-trained models via their model identifier, `e.g.` `bert-base-uncased`, the user can
In this case, a :obj:`list` of configurations must be inserted with the benchmark args as follows. alternatively benchmark an arbitrary configuration of any available model class. In this case, a :obj:`list` of
configurations must be inserted with the benchmark args as follows.
.. code-block:: .. code-block::
...@@ -183,7 +196,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar ...@@ -183,7 +196,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar
bert-6-lay 8 128 0.009 bert-6-lay 8 128 0.009
bert-6-lay 8 512 0.044 bert-6-lay 8 512 0.044
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ==================== ==================== INFERENCE - MEMORY - RESULT ====================
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB Model Name Batch Size Seq Length Memory in MB
...@@ -201,7 +214,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar ...@@ -201,7 +214,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar
bert-6-lay 8 128 1127 bert-6-lay 8 128 1127
bert-6-lay 8 512 1359 bert-6-lay 8 512 1359
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ==================== ==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0 - transformers_version: 2.11.0
- framework: PyTorch - framework: PyTorch
...@@ -252,7 +265,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar ...@@ -252,7 +265,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar
bert-6-lay 8 128 0.0011 bert-6-lay 8 128 0.0011
bert-6-lay 8 512 0.074 bert-6-lay 8 512 0.074
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ==================== ==================== INFERENCE - MEMORY - RESULT ====================
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB Model Name Batch Size Seq Length Memory in MB
...@@ -270,7 +283,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar ...@@ -270,7 +283,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar
bert-6-lay 8 128 1330 bert-6-lay 8 128 1330
bert-6-lay 8 512 1540 bert-6-lay 8 512 1540
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ==================== ==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0 - transformers_version: 2.11.0
- framework: Tensorflow - framework: Tensorflow
...@@ -295,8 +308,9 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar ...@@ -295,8 +308,9 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar
- use_tpu: False - use_tpu: False
Again, `inference time` and `required memory` for `inference` are measured, but this time for customized configurations of the :obj:`BertModel` class. This feature can especially be helpful when Again, `inference time` and `required memory` for `inference` are measured, but this time for customized configurations
deciding for which configuration the model should be trained. of the :obj:`BertModel` class. This feature can especially be helpful when deciding for which configuration the model
should be trained.
Benchmark best practices Benchmark best practices
...@@ -304,19 +318,28 @@ Benchmark best practices ...@@ -304,19 +318,28 @@ Benchmark best practices
This section lists a couple of best practices one should be aware of when benchmarking a model. This section lists a couple of best practices one should be aware of when benchmarking a model.
- Currently, only single device benchmarking is supported. When benchmarking on GPU, it is recommended that the user - Currently, only single device benchmarking is supported. When benchmarking on GPU, it is recommended that the user
specifies on which device the code should be run by setting the ``CUDA_VISIBLE_DEVICES`` environment variable in the shell, `e.g.` ``export CUDA_VISIBLE_DEVICES=0`` before running the code. specifies on which device the code should be run by setting the ``CUDA_VISIBLE_DEVICES`` environment variable in the
- The option :obj:`no_multi_processing` should only be set to :obj:`True` for testing and debugging. To ensure accurate memory measurement it is recommended to run each memory benchmark in a separate process by making sure :obj:`no_multi_processing` is set to :obj:`True`. shell, `e.g.` ``export CUDA_VISIBLE_DEVICES=0`` before running the code.
- One should always state the environment information when sharing the results of a model benchmark. Results can vary heavily between different GPU devices, library versions, etc., so that benchmark results on their own are not very useful for the community. - The option :obj:`no_multi_processing` should only be set to :obj:`True` for testing and debugging. To ensure accurate
memory measurement it is recommended to run each memory benchmark in a separate process by making sure
:obj:`no_multi_processing` is set to :obj:`True`.
- One should always state the environment information when sharing the results of a model benchmark. Results can vary
heavily between different GPU devices, library versions, etc., so that benchmark results on their own are not very
useful for the community.
Sharing your benchmark Sharing your benchmark
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Previously all available core models (10 at the time) have been benchmarked for `inference time`, across many different settings: using PyTorch, with Previously all available core models (10 at the time) have been benchmarked for `inference time`, across many different
and without TorchScript, using TensorFlow, with and without XLA. All of those tests were done across CPUs (except for settings: using PyTorch, with and without TorchScript, using TensorFlow, with and without XLA. All of those tests were
TensorFlow XLA) and GPUs. done across CPUs (except for TensorFlow XLA) and GPUs.
The approach is detailed in the `following blogpost <https://medium.com/huggingface/benchmarking-transformers-pytorch-and-tensorflow-e2917fb891c2>`__ and the results are available `here <https://docs.google.com/spreadsheets/d/1sryqufw2D0XlUH4sq3e9Wnxu5EAQkaohzrJbd5HdQ_w/edit?usp=sharing>`__. The approach is detailed in the `following blogpost
<https://medium.com/huggingface/benchmarking-transformers-pytorch-and-tensorflow-e2917fb891c2>`__ and the results are
available `here
<https://docs.google.com/spreadsheets/d/1sryqufw2D0XlUH4sq3e9Wnxu5EAQkaohzrJbd5HdQ_w/edit?usp=sharing>`__.
With the new `benchmark` tools, it is easier than ever to share your benchmark results with the community `here <https://github.com/huggingface/transformers/blob/master/examples/benchmarking/README.md>`__. With the new `benchmark` tools, it is easier than ever to share your benchmark results with the community `here
<https://github.com/huggingface/transformers/blob/master/examples/benchmarking/README.md>`__.
BERTology BERTology
----------------------------------------------------------------------------------------------------------------------- -----------------------------------------------------------------------------------------------------------------------
There is a growing field of study concerned with investigating the inner working of large-scale transformers like BERT (that some call "BERTology"). Some good examples of this field are: There is a growing field of study concerned with investigating the inner working of large-scale transformers like BERT
(that some call "BERTology"). Some good examples of this field are:
* BERT Rediscovers the Classical NLP Pipeline by Ian Tenney, Dipanjan Das, Ellie Pavlick: https://arxiv.org/abs/1905.05950 * BERT Rediscovers the Classical NLP Pipeline by Ian Tenney, Dipanjan Das, Ellie Pavlick:
https://arxiv.org/abs/1905.05950
* Are Sixteen Heads Really Better than One? by Paul Michel, Omer Levy, Graham Neubig: https://arxiv.org/abs/1905.10650 * Are Sixteen Heads Really Better than One? by Paul Michel, Omer Levy, Graham Neubig: https://arxiv.org/abs/1905.10650
* What Does BERT Look At? An Analysis of BERT's Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning: https://arxiv.org/abs/1906.04341 * What Does BERT Look At? An Analysis of BERT's Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D.
Manning: https://arxiv.org/abs/1906.04341
In order to help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models to help people access the inner representations, mainly adapted from the great work of Paul Michel (https://arxiv.org/abs/1905.10650): In order to help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models to
help people access the inner representations, mainly adapted from the great work of Paul Michel
(https://arxiv.org/abs/1905.10650):
* accessing all the hidden-states of BERT/GPT/GPT-2, * accessing all the hidden-states of BERT/GPT/GPT-2,
* accessing all the attention weights for each head of BERT/GPT/GPT-2, * accessing all the attention weights for each head of BERT/GPT/GPT-2,
* retrieving heads output values and gradients to be able to compute head importance score and prune head as explained in https://arxiv.org/abs/1905.10650. * retrieving heads output values and gradients to be able to compute head importance score and prune head as explained
in https://arxiv.org/abs/1905.10650.
To help you understand and use these features, we have added a specific example script: `bertology.py <https://github.com/huggingface/transformers/blob/master/examples/bertology/run_bertology.py>`_ while extract information and prune a model pre-trained on GLUE. To help you understand and use these features, we have added a specific example script: `bertology.py
<https://github.com/huggingface/transformers/blob/master/examples/bertology/run_bertology.py>`_ while extract
information and prune a model pre-trained on GLUE.
Converting Tensorflow Checkpoints Converting Tensorflow Checkpoints
======================================================================================================================= =======================================================================================================================
A command-line interface is provided to convert original Bert/GPT/GPT-2/Transformer-XL/XLNet/XLM checkpoints in models than be loaded using the ``from_pretrained`` methods of the library. A command-line interface is provided to convert original Bert/GPT/GPT-2/Transformer-XL/XLNet/XLM checkpoints in models
than be loaded using the ``from_pretrained`` methods of the library.
.. note:: .. note::
Since 2.3.0 the conversion script is now part of the transformers CLI (**transformers-cli**) Since 2.3.0 the conversion script is now part of the transformers CLI (**transformers-cli**) available in any
available in any transformers >= 2.3.0 installation. transformers >= 2.3.0 installation.
The documentation below reflects the **transformers-cli convert** command format. The documentation below reflects the **transformers-cli convert** command format.
BERT BERT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google <https://github.com/google-research/bert#pre-trained-models>`_\ ) in a PyTorch save file by using the `convert_bert_original_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_ script. You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google
<https://github.com/google-research/bert#pre-trained-models>`_\ ) in a PyTorch save file by using the
This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated configuration file (\ ``bert_config.json``\ ), and creates a PyTorch model for this configuration, loads the weights from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that can be imported using ``torch.load()`` (see examples in `run_bert_extract_features.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_extract_features.py>`_\ , `run_bert_classifier.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and `run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\ ). `convert_bert_original_tf_checkpoint_to_pytorch.py
<https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_
You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow checkpoint (the three files starting with ``bert_model.ckpt``\ ) but be sure to keep the configuration file (\ ``bert_config.json``\ ) and the vocabulary file (\ ``vocab.txt``\ ) as these are needed for the PyTorch model too. script.
To run this specific conversion script you will need to have TensorFlow and PyTorch installed (\ ``pip install tensorflow``\ ). The rest of the repository only requires PyTorch. This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated
configuration file (\ ``bert_config.json``\ ), and creates a PyTorch model for this configuration, loads the weights
from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that
can be imported using ``torch.load()`` (see examples in `run_bert_extract_features.py
<https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_extract_features.py>`_\ ,
`run_bert_classifier.py
<https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and
`run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\
).
You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow
checkpoint (the three files starting with ``bert_model.ckpt``\ ) but be sure to keep the configuration file (\
``bert_config.json``\ ) and the vocabulary file (\ ``vocab.txt``\ ) as these are needed for the PyTorch model too.
To run this specific conversion script you will need to have TensorFlow and PyTorch installed (\ ``pip install
tensorflow``\ ). The rest of the repository only requires PyTorch.
Here is an example of the conversion process for a pre-trained ``BERT-Base Uncased`` model: Here is an example of the conversion process for a pre-trained ``BERT-Base Uncased`` model:
...@@ -31,14 +47,20 @@ Here is an example of the conversion process for a pre-trained ``BERT-Base Uncas ...@@ -31,14 +47,20 @@ Here is an example of the conversion process for a pre-trained ``BERT-Base Uncas
--config $BERT_BASE_DIR/bert_config.json \ --config $BERT_BASE_DIR/bert_config.json \
--pytorch_dump_output $BERT_BASE_DIR/pytorch_model.bin --pytorch_dump_output $BERT_BASE_DIR/pytorch_model.bin
You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/bert#pre-trained-models>`__. You can download Google's pre-trained models for the conversion `here
<https://github.com/google-research/bert#pre-trained-models>`__.
ALBERT ALBERT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Convert TensorFlow model checkpoints of ALBERT to PyTorch using the `convert_albert_original_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_ script. Convert TensorFlow model checkpoints of ALBERT to PyTorch using the
`convert_albert_original_tf_checkpoint_to_pytorch.py
<https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_
script.
The CLI takes as input a TensorFlow checkpoint (three files starting with ``model.ckpt-best``\ ) and the accompanying configuration file (\ ``albert_config.json``\ ), then creates and saves a PyTorch model. To run this conversion you will need to have TensorFlow and PyTorch installed. The CLI takes as input a TensorFlow checkpoint (three files starting with ``model.ckpt-best``\ ) and the accompanying
configuration file (\ ``albert_config.json``\ ), then creates and saves a PyTorch model. To run this conversion you
will need to have TensorFlow and PyTorch installed.
Here is an example of the conversion process for the pre-trained ``ALBERT Base`` model: Here is an example of the conversion process for the pre-trained ``ALBERT Base`` model:
...@@ -51,12 +73,15 @@ Here is an example of the conversion process for the pre-trained ``ALBERT Base`` ...@@ -51,12 +73,15 @@ Here is an example of the conversion process for the pre-trained ``ALBERT Base``
--config $ALBERT_BASE_DIR/albert_config.json \ --config $ALBERT_BASE_DIR/albert_config.json \
--pytorch_dump_output $ALBERT_BASE_DIR/pytorch_model.bin --pytorch_dump_output $ALBERT_BASE_DIR/pytorch_model.bin
You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/albert#pre-trained-models>`__. You can download Google's pre-trained models for the conversion `here
<https://github.com/google-research/albert#pre-trained-models>`__.
OpenAI GPT OpenAI GPT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained OpenAI GPT model, assuming that your NumPy checkpoint save as the same format than OpenAI pretrained model (see `here <https://github.com/openai/finetune-transformer-lm>`__\ ) Here is an example of the conversion process for a pre-trained OpenAI GPT model, assuming that your NumPy checkpoint
save as the same format than OpenAI pretrained model (see `here <https://github.com/openai/finetune-transformer-lm>`__\
)
.. code-block:: shell .. code-block:: shell
...@@ -72,7 +97,8 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT model, ...@@ -72,7 +97,8 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT model,
OpenAI GPT-2 OpenAI GPT-2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model (see `here <https://github.com/openai/gpt-2>`__\ ) Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model (see `here
<https://github.com/openai/gpt-2>`__\ )
.. code-block:: shell .. code-block:: shell
...@@ -87,7 +113,8 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT-2 mode ...@@ -87,7 +113,8 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT-2 mode
Transformer-XL Transformer-XL
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained Transformer-XL model (see `here <https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models>`__\ ) Here is an example of the conversion process for a pre-trained Transformer-XL model (see `here
<https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models>`__\ )
.. code-block:: shell .. code-block:: shell
...@@ -130,4 +157,4 @@ Here is an example of the conversion process for a pre-trained XLM model: ...@@ -130,4 +157,4 @@ Here is an example of the conversion process for a pre-trained XLM model:
--tf_checkpoint $XLM_CHECKPOINT_PATH \ --tf_checkpoint $XLM_CHECKPOINT_PATH \
--pytorch_dump_output $PYTORCH_DUMP_OUTPUT --pytorch_dump_output $PYTORCH_DUMP_OUTPUT
[--config XML_CONFIG] \ [--config XML_CONFIG] \
[--finetuning_task_name XML_FINETUNED_TASK] [--finetuning_task_name XML_FINETUNED_TASK]
\ No newline at end of file
This diff is collapsed.
...@@ -57,8 +57,8 @@ The tokenizer takes care of splitting the sequence into tokens available in the ...@@ -57,8 +57,8 @@ The tokenizer takes care of splitting the sequence into tokens available in the
>>> tokenized_sequence = tokenizer.tokenize(sequence) >>> tokenized_sequence = tokenizer.tokenize(sequence)
The tokens are either words or subwords. Here for instance, "VRAM" wasn't in the model vocabulary, so it's been split The tokens are either words or subwords. Here for instance, "VRAM" wasn't in the model vocabulary, so it's been split
in "V", "RA" and "M". To indicate those tokens are not separate words but parts of the same word, a double-hash prefix is in "V", "RA" and "M". To indicate those tokens are not separate words but parts of the same word, a double-hash prefix
added for "RA" and "M": is added for "RA" and "M":
.. code-block:: .. code-block::
...@@ -66,8 +66,8 @@ added for "RA" and "M": ...@@ -66,8 +66,8 @@ added for "RA" and "M":
['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M'] ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding
the sentence to the tokenizer, which leverages the Rust implementation of the sentence to the tokenizer, which leverages the Rust implementation of `huggingface/tokenizers
`huggingface/tokenizers <https://github.com/huggingface/tokenizers>`__ for peak performance. <https://github.com/huggingface/tokenizers>`__ for peak performance.
.. code-block:: .. code-block::
...@@ -105,8 +105,8 @@ because this is the way a :class:`~transformers.BertModel` is going to expect it ...@@ -105,8 +105,8 @@ because this is the way a :class:`~transformers.BertModel` is going to expect it
Attention mask Attention mask
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The attention mask is an optional argument used when batching sequences together. This argument indicates to the The attention mask is an optional argument used when batching sequences together. This argument indicates to the model
model which tokens should be attended to, and which should not. which tokens should be attended to, and which should not.
For example, consider these two sequences: For example, consider these two sequences:
...@@ -145,10 +145,10 @@ We can see that 0s have been added on the right of the first sentence to make it ...@@ -145,10 +145,10 @@ We can see that 0s have been added on the right of the first sentence to make it
>>> padded_sequences["input_ids"] >>> padded_sequences["input_ids"]
[[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]] [[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]
This can then be converted into a tensor in PyTorch or TensorFlow. The attention mask is a binary tensor indicating This can then be converted into a tensor in PyTorch or TensorFlow. The attention mask is a binary tensor indicating the
the position of the padded indices so that the model does not attend to them. For the position of the padded indices so that the model does not attend to them. For the :class:`~transformers.BertTokenizer`,
:class:`~transformers.BertTokenizer`, :obj:`1` indicates a value that should be attended to, while :obj:`0` indicates :obj:`1` indicates a value that should be attended to, while :obj:`0` indicates a padded value. This attention mask is
a padded value. This attention mask is in the dictionary returned by the tokenizer under the key "attention_mask": in the dictionary returned by the tokenizer under the key "attention_mask":
.. code-block:: .. code-block::
...@@ -161,15 +161,16 @@ Token Type IDs ...@@ -161,15 +161,16 @@ Token Type IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some models' purpose is to do sequence classification or question answering. These require two different sequences to Some models' purpose is to do sequence classification or question answering. These require two different sequences to
be joined in a single "input_ids" entry, which usually is performed with the help of special tokens, such as the classifier (``[CLS]``) and separator (``[SEP]``) be joined in a single "input_ids" entry, which usually is performed with the help of special tokens, such as the
tokens. For example, the BERT model builds its two sequence input as such: classifier (``[CLS]``) and separator (``[SEP]``) tokens. For example, the BERT model builds its two sequence input as
such:
.. code-block:: .. code-block::
>>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP] >>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]
We can use our tokenizer to automatically generate such a sentence by passing the two sequences to ``tokenizer`` as two arguments (and We can use our tokenizer to automatically generate such a sentence by passing the two sequences to ``tokenizer`` as two
not a list, like before) like this: arguments (and not a list, like before) like this:
.. code-block:: .. code-block::
...@@ -189,8 +190,8 @@ which will return: ...@@ -189,8 +190,8 @@ which will return:
[CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP] [CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
This is enough for some models to understand where one sequence ends and where another begins. However, other models, This is enough for some models to understand where one sequence ends and where another begins. However, other models,
such as BERT, also deploy token type IDs (also called segment IDs). They are represented as a binary such as BERT, also deploy token type IDs (also called segment IDs). They are represented as a binary mask identifying
mask identifying the two types of sequence in the model. the two types of sequence in the model.
The tokenizer returns this mask as the "token_type_ids" entry: The tokenizer returns this mask as the "token_type_ids" entry:
...@@ -209,14 +210,15 @@ Some models, like :class:`~transformers.XLNetModel` use an additional token repr ...@@ -209,14 +210,15 @@ Some models, like :class:`~transformers.XLNetModel` use an additional token repr
Position IDs Position IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Contrary to RNNs that have the position of each token embedded within them, Contrary to RNNs that have the position of each token embedded within them, transformers are unaware of the position of
transformers are unaware of the position of each token. Therefore, the position IDs (``position_ids``) are used by the model to identify each token's position in the list of tokens. each token. Therefore, the position IDs (``position_ids``) are used by the model to identify each token's position in
the list of tokens.
They are an optional parameter. If no ``position_ids`` is passed to the model, the IDs are automatically created as absolute They are an optional parameter. If no ``position_ids`` is passed to the model, the IDs are automatically created as
positional embeddings. absolute positional embeddings.
Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models use
use other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings. other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.
.. _labels: .. _labels:
...@@ -224,43 +226,41 @@ Labels ...@@ -224,43 +226,41 @@ Labels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The labels are an optional argument which can be passed in order for the model to compute the loss itself. These labels The labels are an optional argument which can be passed in order for the model to compute the loss itself. These labels
should be the expected prediction of the model: it will use the standard loss in order to compute the loss between should be the expected prediction of the model: it will use the standard loss in order to compute the loss between its
its predictions and the expected value (the label). predictions and the expected value (the label).
These labels are different according to the model head, for example: These labels are different according to the model head, for example:
- For sequence classification models (e.g., :class:`~transformers.BertForSequenceClassification`), the model expects - For sequence classification models (e.g., :class:`~transformers.BertForSequenceClassification`), the model expects a
a tensor of dimension :obj:`(batch_size)` with each value of the batch corresponding to the expected label of the tensor of dimension :obj:`(batch_size)` with each value of the batch corresponding to the expected label of the
entire sequence. entire sequence.
- For token classification models (e.g., :class:`~transformers.BertForTokenClassification`), the model expects - For token classification models (e.g., :class:`~transformers.BertForTokenClassification`), the model expects a tensor
a tensor of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each individual
individual token. token.
- For masked language modeling (e.g., :class:`~transformers.BertForMaskedLM`), the model expects - For masked language modeling (e.g., :class:`~transformers.BertForMaskedLM`), the model expects a tensor of dimension
a tensor of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each individual token: the
individual token: the labels being the token ID for the masked token, and values to be ignored for the rest (usually labels being the token ID for the masked token, and values to be ignored for the rest (usually -100).
-100).
- For sequence to sequence tasks,(e.g., :class:`~transformers.BartForConditionalGeneration`, - For sequence to sequence tasks,(e.g., :class:`~transformers.BartForConditionalGeneration`,
:class:`~transformers.MBartForConditionalGeneration`), the model expects a tensor of dimension :class:`~transformers.MBartForConditionalGeneration`), the model expects a tensor of dimension :obj:`(batch_size,
:obj:`(batch_size, tgt_seq_length)` with each value corresponding to the target sequences associated with each tgt_seq_length)` with each value corresponding to the target sequences associated with each input sequence. During
input sequence. During training, both `BART` and `T5` will make the appropriate `decoder_input_ids` and decoder training, both `BART` and `T5` will make the appropriate `decoder_input_ids` and decoder attention masks internally.
attention masks internally. They usually do not need to be supplied. This does not apply to models leveraging the They usually do not need to be supplied. This does not apply to models leveraging the Encoder-Decoder framework. See
Encoder-Decoder framework. the documentation of each model for more information on each specific model's labels.
See the documentation of each model for more information on each specific model's labels.
The base models (e.g., :class:`~transformers.BertModel`) do not accept labels, as these are the base transformer models, The base models (e.g., :class:`~transformers.BertModel`) do not accept labels, as these are the base transformer
simply outputting features. models, simply outputting features.
.. _decoder-input-ids: .. _decoder-input-ids:
Decoder input IDs Decoder input IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder. This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder. These
These inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually built in a
built in a way specific to each model. way specific to each model.
Most encoder-decoder models (BART, T5) create their :obj:`decoder_input_ids` on their own from the :obj:`labels`. Most encoder-decoder models (BART, T5) create their :obj:`decoder_input_ids` on their own from the :obj:`labels`. In
In such models, passing the :obj:`labels` is the preferred way to handle training. such models, passing the :obj:`labels` is the preferred way to handle training.
Please check each model's docs to see how they handle these input IDs for sequence to sequence training. Please check each model's docs to see how they handle these input IDs for sequence to sequence training.
...@@ -270,18 +270,18 @@ Feed Forward Chunking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In each residual attention block in transformers the self-attention layer is usually followed by 2 feed forward layers.
The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (e.g., for
``bert-base-uncased``).

For an input of size ``[batch_size, sequence_length]``, the memory required to store the intermediate feed forward
embeddings ``[batch_size, sequence_length, config.intermediate_size]`` can account for a large fraction of the memory
use. The authors of `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_ noticed that since the
computation is independent of the ``sequence_length`` dimension, it is mathematically equivalent to compute the output
embeddings of both feed forward layers ``[batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n``
individually and concat them afterward to ``[batch_size, sequence_length, config.hidden_size]`` with ``n =
sequence_length``, which trades increased computation time against reduced memory use, but yields a mathematically
**equivalent** result.

For models employing the function :func:`~.transformers.apply_chunking_to_forward`, the ``chunk_size`` defines the
number of output embeddings that are computed in parallel and thus defines the trade-off between memory and time
complexity. If ``chunk_size`` is set to 0, no feed forward chunking is done.
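A sketch of how a chunked feed forward block could look, assuming the ``forward_fn, chunk_size, chunk_dim,
*input_tensors`` argument order of ``apply_chunking_to_forward`` and its location in ``transformers.modeling_utils``;
the module, sizes and chunk size below are made up for illustration:

.. code-block::

    import torch
    from torch import nn

    from transformers.modeling_utils import apply_chunking_to_forward


    class ChunkedFeedForward(nn.Module):
        # Hypothetical block, not a class from the library
        def __init__(self, hidden_size=768, intermediate_size=3072, chunk_size=64):
            super().__init__()
            self.dense_in = nn.Linear(hidden_size, intermediate_size)
            self.dense_out = nn.Linear(intermediate_size, hidden_size)
            self.chunk_size = chunk_size  # 0 would disable chunking
            self.seq_len_dim = 1          # chunk over the sequence_length dimension

        def feed_forward(self, hidden_states):
            return self.dense_out(torch.relu(self.dense_in(hidden_states)))

        def forward(self, hidden_states):
            # Only chunk_size output embeddings are computed at a time, trading
            # compute time for a smaller intermediate memory footprint.
            return apply_chunking_to_forward(
                self.feed_forward, self.chunk_size, self.seq_len_dim, hidden_states
            )


    block = ChunkedFeedForward()
    out = block(torch.randn(2, 128, 768))  # same values as without chunking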
...@@ -47,6 +47,7 @@ The documentation is organized in five parts:

- **RESEARCH** focuses on tutorials that have less to do with how to use the library but more about general research in
  transformers models
- The three last sections contain the documentation of each public class and function, grouped in:

  - **MAIN CLASSES** for the main classes exposing the important APIs of the library.
  - **MODELS** for the classes and functions related to each model implemented in the library.
  - **INTERNAL HELPERS** for the classes and functions we use internally.
...@@ -122,7 +123,7 @@ conversion utilities for the following models:

20. :doc:`MarianMT <model_doc/marian>` Machine translation models trained using `OPUS <http://opus.nlpl.eu/>`__ data by
    Jörg Tiedemann. The `Marian Framework <https://marian-nmt.github.io/>`__ is being developed by the Microsoft
    Translator Team.
21. :doc:`MBart <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Denoising Pre-training for
    Neural Machine Translation <https://arxiv.org/abs/2001.08210>`__ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li,
    Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
22. :doc:`Pegasus <model_doc/pegasus>` (from Google) released with the paper `PEGASUS: Pre-training with Extracted
......
...@@ -85,4 +85,4 @@ TensorFlow Helper Functions

.. autofunction:: transformers.modeling_tf_utils.keras_serializable

.. autofunction:: transformers.modeling_tf_utils.shape_list
\ No newline at end of file
...@@ -25,6 +25,7 @@ SpecialTokensMixin

Enums and namedtuples
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.tokenization_utils_base.ExplicitEnum

.. autoclass:: transformers.tokenization_utils_base.PaddingStrategy
......
...@@ -24,4 +24,4 @@ Distributed Evaluation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.trainer_pt_utils.DistributedTensorGatherer
    :members:
\ No newline at end of file
...@@ -17,7 +17,7 @@ You can also use the environment variable ``TRANSFORMERS_VERBOSITY`` to override
to one of the following: ``debug``, ``info``, ``warning``, ``error``, ``critical``. For example:

.. code-block:: bash

    TRANSFORMERS_VERBOSITY=error ./myprogram.py
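The verbosity can also be changed directly from Python code; a small sketch using the ``set_verbosity_error`` helper:

.. code-block::

    import transformers

    # Equivalent to TRANSFORMERS_VERBOSITY=error for the current process
    transformers.logging.set_verbosity_error()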
All the methods of this logging module are documented below, the main ones are

...@@ -55,4 +55,4 @@ Other functions

.. autofunction:: transformers.logging.enable_explicit_format

.. autofunction:: transformers.logging.reset_format
\ No newline at end of file
...@@ -52,4 +52,4 @@ Generative models
    :members:

.. autoclass:: transformers.generation_tf_utils.TFGenerationMixin
    :members:
\ No newline at end of file
Pipelines
-----------------------------------------------------------------------------------------------------------------------

The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of
the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity
Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. See the
:doc:`task summary <../task_summary>` for examples of use.

...@@ -26,8 +26,8 @@ There are two categories of pipeline abstractions to be aware about:

The pipeline abstraction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated as any other
pipeline but requires an additional argument which is the `task`.
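For example, a short sketch using the ``sentiment-analysis`` task (the default model for that task is downloaded on
first use):

.. code-block::

    from transformers import pipeline

    # The task name selects a default model and tokenizer for sentiment analysis
    nlp = pipeline("sentiment-analysis")
    print(nlp("We are very happy to show you the 🤗 Transformers library."))
    # e.g. [{'label': 'POSITIVE', 'score': ...}]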
.. autofunction:: transformers.pipeline
......
...@@ -8,8 +8,8 @@ Processors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All processors follow the same architecture which is that of the
:class:`~transformers.data.processors.utils.DataProcessor`. The processor returns a list of
:class:`~transformers.data.processors.utils.InputExample`. These
:class:`~transformers.data.processors.utils.InputExample` can be converted to
:class:`~transformers.data.processors.utils.InputFeatures` in order to be fed to the model.
...@@ -28,14 +28,16 @@ of :class:`~transformers.data.processors.utils.InputExample`. These

GLUE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`General Language Understanding Evaluation (GLUE) <https://gluebenchmark.com/>`__ is a benchmark that evaluates the
performance of models across a diverse set of existing NLU tasks. It was released together with the paper `GLUE: A
multi-task benchmark and analysis platform for natural language understanding
<https://openreview.net/pdf?id=rJ4km2R5t7>`__.

This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB,
QQP, QNLI, RTE and WNLI.

Those processors are:

- :class:`~transformers.data.processors.utils.MrpcProcessor`
- :class:`~transformers.data.processors.utils.MnliProcessor`
- :class:`~transformers.data.processors.utils.MnliMismatchedProcessor`

...@@ -46,7 +48,7 @@ Those processors are:

- :class:`~transformers.data.processors.utils.RteProcessor`
- :class:`~transformers.data.processors.utils.WnliProcessor`

Additionally, the following method can be used to load values from a data file and convert them to a list of
:class:`~transformers.data.processors.utils.InputExample`.

.. automethod:: transformers.data.processors.glue.glue_convert_examples_to_features
...@@ -54,36 +56,39 @@ Additionally, the following method can be used to load values from a data file

Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

An example using these processors is given in the `run_glue.py
<https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_glue.py>`__ script.
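A minimal sketch of the same flow (the data directory below is a hypothetical local path to the MRPC data):

.. code-block::

    from transformers import BertTokenizer, glue_convert_examples_to_features
    from transformers.data.processors.glue import MrpcProcessor

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    processor = MrpcProcessor()

    # "./glue_data/MRPC" is a placeholder: point it to a local copy of the MRPC data
    examples = processor.get_dev_examples("./glue_data/MRPC")
    features = glue_convert_examples_to_features(examples, tokenizer, max_length=128, task="mrpc")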
XNLI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`The Cross-Lingual NLI Corpus (XNLI) <https://www.nyu.edu/projects/bowman/xnli/>`__ is a benchmark that evaluates the
quality of cross-lingual text representations. XNLI is a crowd-sourced dataset based on `MultiNLI
<http://www.nyu.edu/projects/bowman/multinli/>`__: pairs of text are labeled with textual entailment annotations for 15
different languages (including both high-resource languages such as English and low-resource languages such as
Swahili). It was released together with the paper `XNLI: Evaluating Cross-lingual Sentence Representations
<https://arxiv.org/abs/1809.05053>`__.

This library hosts the processor to load the XNLI data:

- :class:`~transformers.data.processors.utils.XnliProcessor`

Please note that since the gold labels are available on the test set, evaluation is performed on the test set.

An example using these processors is given in the `run_xnli.py
<https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_xnli.py>`__ script.
SQuAD
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`The Stanford Question Answering Dataset (SQuAD) <https://rajpurkar.github.io/SQuAD-explorer//>`__ is a benchmark that
evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version
(v1.1) was released together with the paper `SQuAD: 100,000+ Questions for Machine Comprehension of Text
<https://arxiv.org/abs/1606.05250>`__. The second version (v2.0) was released alongside the paper `Know What You Don't
Know: Unanswerable Questions for SQuAD <https://arxiv.org/abs/1806.03822>`__.

This library hosts a processor for each of the two versions:
...@@ -91,6 +96,7 @@ Processors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Those processors are:

- :class:`~transformers.data.processors.utils.SquadV1Processor`
- :class:`~transformers.data.processors.utils.SquadV2Processor`
...@@ -99,17 +105,18 @@ They both inherit from the abstract class :class:`~transformers.data.processors.

.. autoclass:: transformers.data.processors.squad.SquadProcessor
    :members:

Additionally, the following method can be used to convert SQuAD examples into
:class:`~transformers.data.processors.utils.SquadFeatures` that can be used as model inputs.

.. automethod:: transformers.data.processors.squad.squad_convert_examples_to_features

These processors as well as the aforementioned method can be used with files containing the data as well as with the
`tensorflow_datasets` package. Examples are given below.
Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here is an example using the processors as well as the conversion method using data files:

.. code-block::

...@@ -149,5 +156,5 @@ Using `tensorflow_datasets` is as easy as using a data file:

    )
Another example using these processors is given in the `run_squad.py
<https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py>`__ script.
...@@ -29,11 +29,12 @@ methods for using all the tokenizers:

:class:`~transformers.BatchEncoding` holds the output of the tokenizer's encoding methods (``__call__``,
``encode_plus`` and ``batch_encode_plus``) and is derived from a Python dictionary. When the tokenizer is a pure python
tokenizer, this class behaves just like a standard python dictionary and holds the various model inputs computed by
these methods (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e., backed by
HuggingFace `tokenizers library <https://github.com/huggingface/tokenizers>`__), this class provides in addition
several advanced alignment methods which can be used to map between the original string (character and words) and the
token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding
to a given token).
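A small sketch of these alignment methods with a fast tokenizer (``char_to_token`` maps a character position in the
original string to a token index, ``token_to_chars`` goes the other way):

.. code-block::

    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
    encoding = tokenizer("Transformers are great!")

    print(encoding.tokens())           # tokens produced by the fast tokenizer
    print(encoding.char_to_token(0))   # index of the token containing the first character
    print(encoding.token_to_chars(1))  # span of characters covered by the second token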
PreTrainedTokenizer
......
...@@ -4,7 +4,7 @@ Trainer

The :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` classes provide an API for feature-complete
training in most standard use cases. It's used in most of the :doc:`example scripts <../examples>`.

Before instantiating your :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`, create a
:class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments` to access all the points of
customization during training.
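A bare-bones sketch of that flow in PyTorch (the tiny dataset below is made up only to keep the example
self-contained):

.. code-block::

    import torch
    from torch.utils.data import Dataset

    from transformers import BertForSequenceClassification, BertTokenizer, Trainer, TrainingArguments


    class ToyDataset(Dataset):
        # Hypothetical two-example dataset of encoded sentences with labels
        def __init__(self, tokenizer):
            self.encodings = tokenizer(["I love this.", "I hate this."], truncation=True, padding=True)
            self.labels = [1, 0]

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, idx):
            item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
            item["labels"] = torch.tensor(self.labels[idx])
            return item


    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    model = BertForSequenceClassification.from_pretrained("bert-base-cased")

    training_args = TrainingArguments(
        output_dir="./results",             # placeholder directory for checkpoints and logs
        num_train_epochs=1,
        per_device_train_batch_size=2,
        logging_steps=10,
    )

    trainer = Trainer(model=model, args=training_args, train_dataset=ToyDataset(tokenizer))
    trainer.train()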
......
...@@ -19,14 +19,14 @@ downstream tasks. However, at some point further model increases become harder d
longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction
techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows
that our proposed methods lead to models that scale much better compared to the original BERT. We also use a
self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks
with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and
SQuAD benchmarks while having fewer parameters compared to BERT-large.*

Tips:

- ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather
  than the left.
- ALBERT uses repeating layers which results in a small memory footprint, however the computational cost remains
  similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same
  number of (repeating) layers.
......
...@@ -2,9 +2,8 @@ AutoClasses
-----------------------------------------------------------------------------------------------------------------------

In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you
are supplying to the :obj:`from_pretrained()` method. AutoClasses are here to do this job for you so that you
automatically retrieve the relevant model given the name/path to the pretrained weights/config/vocabulary.
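A short sketch of what this looks like in practice (using the ``bert-base-cased`` checkpoint):

.. code-block::

    from transformers import AutoConfig, AutoModel, AutoTokenizer

    # The architecture (BERT here) is inferred from the checkpoint name
    config = AutoConfig.from_pretrained("bert-base-cased")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModel.from_pretrained("bert-base-cased")

    print(type(model).__name__)  # BertModel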
Instantiating one of :class:`~transformers.AutoConfig`, :class:`~transformers.AutoModel`, and
:class:`~transformers.AutoTokenizer` will directly create a class of the relevant architecture. For instance
......