Unverified Commit 08f534d2 authored by Sylvain Gugger, committed by GitHub

Doc styling (#8067)

* Important files

* Styling them all

* Revert "Styling them all"

This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.

* Styling them for realsies

* Fix syntax error

* Fix benchmark_utils

* More fixes

* Fix modeling auto and script

* Remove new line

* Fixes

* More fixes

* Fix more files

* Style

* Add FSMT

* More fixes

* More fixes

* More fixes

* More fixes

* Fixes

* More fixes

* More fixes

* Last fixes

* Make sphinx happy
parent 04a17f85
@@ -322,6 +322,7 @@ jobs:
- run: black --check examples templates tests src utils
- run: isort --check-only examples templates tests src utils
- run: flake8 examples templates tests src utils
- run: python utils/style_doc.py src/transformers docs/source --max_len 119 --check_only
- run: python utils/check_copies.py
- run: python utils/check_dummies.py
- run: python utils/check_repo.py
...
@@ -15,6 +15,7 @@ modified_only_fixup:
black $(modified_py_files); \
isort $(modified_py_files); \
flake8 $(modified_py_files); \
python utils/style_doc.py $(modified_py_files) --max_len 119; \
else \
echo "No library .py files were modified"; \
fi

@@ -31,6 +32,7 @@ quality:
black --check $(check_dirs)
isort --check-only $(check_dirs)
flake8 $(check_dirs)
python utils/style_doc.py src/transformers docs/source --max_len 119 --check_only
${MAKE} extra_quality_checks
# Format source code automatically and check if there are any problems left that need manual fixing
@@ -38,6 +40,7 @@ quality:
style:
black $(check_dirs)
isort $(check_dirs)
python utils/style_doc.py src/transformers docs/source --max_len 119

# Super fast fix and check target that only works on relevant modified files since the branch was made
...
@@ -3,21 +3,27 @@ Benchmarks

Let's take a look at how 🤗 Transformer models can be benchmarked, best practices, and already available benchmarks.

A notebook explaining in more detail how to benchmark 🤗 Transformer models can be found `here
<https://github.com/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb>`__.

How to benchmark 🤗 Transformer models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` allow you to
flexibly benchmark 🤗 Transformer models, measuring the `peak memory usage` and `required time` for both `inference`
and `training`.

.. note::

    Here, `inference` is defined as a single forward pass, and `training` is defined as a single forward pass and
    backward pass.
The benchmark classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` expect an
object of type :class:`~transformers.PyTorchBenchmarkArguments` and
:class:`~transformers.TensorFlowBenchmarkArguments`, respectively, for instantiation.
:class:`~transformers.PyTorchBenchmarkArguments` and :class:`~transformers.TensorFlowBenchmarkArguments` are data
classes that contain all relevant configurations for their corresponding benchmark class. The following example shows
how a BERT model of type `bert-base-cased` can be benchmarked.

.. code-block::

@@ -34,11 +40,15 @@ In the following example, it is shown how a BERT model of type `bert-base-cased`

>>> benchmark = TensorFlowBenchmark(args)
Here, three arguments are given to the benchmark argument data classes, namely ``models``, ``batch_sizes``, and
``sequence_lengths``. The argument ``models`` is required and expects a :obj:`list` of model identifiers from the
`model hub <https://huggingface.co/models>`__. The :obj:`list` arguments ``batch_sizes`` and ``sequence_lengths``
define the size of the ``input_ids`` on which the model is benchmarked. There are many more parameters that can be
configured via the benchmark argument data classes. For more detail on these, one can either directly consult the files
``src/transformers/benchmark/benchmark_args_utils.py``, ``src/transformers/benchmark/benchmark_args.py`` (for PyTorch)
and ``src/transformers/benchmark/benchmark_args_tf.py`` (for TensorFlow). Alternatively, running the following shell
commands from the repository root will print out a descriptive list of all configurable parameters for PyTorch and
TensorFlow respectively.

.. code-block:: bash

@@ -145,14 +155,17 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r

- gpu_performance_state: 2
- use_tpu: False
By default, the `time` and the `required memory` for `inference` are benchmarked. In the example output above, the
first two sections show the result corresponding to `inference time` and `inference memory`. In addition, all relevant
information about the computing environment, `e.g.` the GPU type, the system, the library versions, etc., is printed
out in the third section under `ENVIRONMENT INFORMATION`. This information can optionally be saved in a `.csv` file by
adding the argument :obj:`save_to_csv=True` to :class:`~transformers.PyTorchBenchmarkArguments` and
:class:`~transformers.TensorFlowBenchmarkArguments` respectively. In this case, every section is saved in a separate
`.csv` file. The path to each `.csv` file can optionally be defined via the argument data classes.
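
As a minimal sketch of such a configuration (the CSV-path argument names below are assumptions; consult
``src/transformers/benchmark/benchmark_args_utils.py`` for the authoritative field names):

.. code-block:: python

    >>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

    >>> # the CSV-related argument names are illustrative; check benchmark_args_utils.py for the real fields
    >>> args = PyTorchBenchmarkArguments(
    ...     models=["bert-base-cased"],
    ...     batch_sizes=[8],
    ...     sequence_lengths=[32, 128],
    ...     save_to_csv=True,
    ...     inference_time_csv_file="inference_time.csv",
    ...     inference_memory_csv_file="inference_memory.csv",
    ...     env_info_csv_file="env_info.csv",
    ... )
    >>> benchmark = PyTorchBenchmark(args)
    >>> results = benchmark.run()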
Instead of benchmarking pre-trained models via their model identifier, `e.g.` `bert-base-uncased`, the user can
alternatively benchmark an arbitrary configuration of any available model class. In this case, a :obj:`list` of
configurations must be inserted with the benchmark args as follows.

.. code-block::

@@ -295,8 +308,9 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar

- use_tpu: False
Again, `inference time` and `required memory` for `inference` are measured, but this time for customized configurations
of the :obj:`BertModel` class. This feature can be especially helpful when deciding for which configuration the model
should be trained.

Benchmark best practices

@@ -305,18 +319,27 @@ Benchmark best practices

This section lists a couple of best practices one should be aware of when benchmarking a model.
- Currently, only single device benchmarking is supported. When benchmarking on GPU, it is recommended that the user
  specifies on which device the code should be run by setting the ``CUDA_VISIBLE_DEVICES`` environment variable in the
  shell, `e.g.` ``export CUDA_VISIBLE_DEVICES=0``, before running the code (see the sketch after this list).
- The option :obj:`no_multi_processing` should only be set to :obj:`True` for testing and debugging. To ensure accurate
  memory measurement it is recommended to run each memory benchmark in a separate process by making sure
  :obj:`no_multi_processing` is set to :obj:`True`.
- One should always state the environment information when sharing the results of a model benchmark. Results can vary
  heavily between different GPU devices, library versions, etc., so benchmark results on their own are not very useful
  for the community.
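
As a minimal sketch of the first recommendation (the chosen device index and argument values are arbitrary):

.. code-block:: python

    import os

    # pin the benchmark to a single GPU before anything initializes CUDA
    # (equivalent to running ``export CUDA_VISIBLE_DEVICES=0`` in the shell)
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"

    from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

    args = PyTorchBenchmarkArguments(models=["bert-base-cased"], batch_sizes=[8], sequence_lengths=[128])
    results = PyTorchBenchmark(args).run()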
Sharing your benchmark
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Previously, all available core models (10 at the time) were benchmarked for `inference time`, across many different
settings: using PyTorch, with and without TorchScript, using TensorFlow, with and without XLA. All of those tests were
done across CPUs (except for TensorFlow XLA) and GPUs.
The approach is detailed in the `following blogpost
<https://medium.com/huggingface/benchmarking-transformers-pytorch-and-tensorflow-e2917fb891c2>`__ and the results are
available `here
<https://docs.google.com/spreadsheets/d/1sryqufw2D0XlUH4sq3e9Wnxu5EAQkaohzrJbd5HdQ_w/edit?usp=sharing>`__.

With the new `benchmark` tools, it is easier than ever to share your benchmark results with the community `here
<https://github.com/huggingface/transformers/blob/master/examples/benchmarking/README.md>`__.

BERTology
-----------------------------------------------------------------------------------------------------------------------
There is a growing field of study concerned with investigating the inner workings of large-scale transformers like BERT
(that some call "BERTology"). Some good examples of this field are:

* BERT Rediscovers the Classical NLP Pipeline by Ian Tenney, Dipanjan Das, Ellie Pavlick:
  https://arxiv.org/abs/1905.05950
* Are Sixteen Heads Really Better than One? by Paul Michel, Omer Levy, Graham Neubig: https://arxiv.org/abs/1905.10650
* What Does BERT Look At? An Analysis of BERT's Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D.
  Manning: https://arxiv.org/abs/1906.04341
In order to help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models to
help people access the inner representations, mainly adapted from the great work of Paul Michel
(https://arxiv.org/abs/1905.10650); a short usage sketch follows the list below:
* accessing all the hidden-states of BERT/GPT/GPT-2,
* accessing all the attention weights for each head of BERT/GPT/GPT-2,
* retrieving head output values and gradients to be able to compute head importance scores and prune heads as explained
  in https://arxiv.org/abs/1905.10650.
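
A minimal sketch of these features (the model and input here are arbitrary choices for illustration):

.. code-block:: python

    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained(
        "bert-base-uncased", output_hidden_states=True, output_attentions=True, return_dict=True
    )

    inputs = tokenizer("BERTology is fun", return_tensors="pt")
    outputs = model(**inputs)

    # one hidden-state tensor per layer (plus the embeddings) and one attention tensor per layer
    print(len(outputs.hidden_states), outputs.hidden_states[-1].shape)
    print(len(outputs.attentions), outputs.attentions[-1].shape)

    # heads can also be pruned, e.g. remove heads 0 and 2 of layer 0
    model.prune_heads({0: [0, 2]})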
To help you understand and use these features, we have added a specific example script: `bertology.py
<https://github.com/huggingface/transformers/blob/master/examples/bertology/run_bertology.py>`_ which extracts
information and prunes a model pre-trained on GLUE.
Converting TensorFlow Checkpoints
=======================================================================================================================

A command-line interface is provided to convert original Bert/GPT/GPT-2/Transformer-XL/XLNet/XLM checkpoints into
models that can be loaded using the ``from_pretrained`` methods of the library.

.. note::

    Since 2.3.0 the conversion script is part of the transformers CLI (**transformers-cli**), available in any
    transformers >= 2.3.0 installation.

    The documentation below reflects the **transformers-cli convert** command format.

BERT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google
<https://github.com/google-research/bert#pre-trained-models>`_\ ) into a PyTorch save file by using the
`convert_bert_original_tf_checkpoint_to_pytorch.py
<https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_
script.

This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated
configuration file (\ ``bert_config.json``\ ), creates a PyTorch model for this configuration, loads the weights
from the TensorFlow checkpoint into the PyTorch model and saves the resulting model in a standard PyTorch save file
that can be imported using ``torch.load()`` (see examples in `run_bert_extract_features.py
<https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_extract_features.py>`_\ ,
`run_bert_classifier.py
<https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and
`run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\
).

You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow
checkpoint (the three files starting with ``bert_model.ckpt``\ ) but be sure to keep the configuration file (\
``bert_config.json``\ ) and the vocabulary file (\ ``vocab.txt``\ ) as these are needed for the PyTorch model too.

To run this specific conversion script you will need to have TensorFlow and PyTorch installed (\ ``pip install
tensorflow``\ ). The rest of the repository only requires PyTorch.
Here is an example of the conversion process for a pre-trained ``BERT-Base Uncased`` model:

@@ -31,14 +47,20 @@ Here is an example of the conversion process for a pre-trained ``BERT-Base Uncas

--config $BERT_BASE_DIR/bert_config.json \
--pytorch_dump_output $BERT_BASE_DIR/pytorch_model.bin

You can download Google's pre-trained models for the conversion `here
<https://github.com/google-research/bert#pre-trained-models>`__.

ALBERT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Convert TensorFlow model checkpoints of ALBERT to PyTorch using the
`convert_albert_original_tf_checkpoint_to_pytorch.py
<https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_
script.

The CLI takes as input a TensorFlow checkpoint (three files starting with ``model.ckpt-best``\ ) and the accompanying
configuration file (\ ``albert_config.json``\ ), then creates and saves a PyTorch model. To run this conversion you
will need to have TensorFlow and PyTorch installed.

Here is an example of the conversion process for the pre-trained ``ALBERT Base`` model:

@@ -51,12 +73,15 @@ Here is an example of the conversion process for the pre-trained ``ALBERT Base``

--config $ALBERT_BASE_DIR/albert_config.json \
--pytorch_dump_output $ALBERT_BASE_DIR/pytorch_model.bin

You can download Google's pre-trained models for the conversion `here
<https://github.com/google-research/albert#pre-trained-models>`__.

OpenAI GPT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained OpenAI GPT model, assuming that your NumPy checkpoint
was saved in the same format as the OpenAI pretrained model (see `here
<https://github.com/openai/finetune-transformer-lm>`__\ ):

.. code-block:: shell

@@ -72,7 +97,8 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT model,

OpenAI GPT-2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model (see `here
<https://github.com/openai/gpt-2>`__\ ):

.. code-block:: shell

@@ -87,7 +113,8 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT-2 mode

Transformer-XL
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here is an example of the conversion process for a pre-trained Transformer-XL model (see `here
<https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models>`__\ ):

.. code-block:: shell

...
@@ -3,15 +3,15 @@ Fine-tuning with custom datasets

.. note::

    The datasets used in this tutorial are available and can be more easily accessed using the `🤗 NLP library
    <https://github.com/huggingface/nlp>`_. We do not use this library to access the datasets here since this tutorial
    is meant to illustrate how to work with your own data. A brief introduction can be found at the end of the tutorial
    in the section ":ref:`nlplib`".
This tutorial will take you through several examples of using 🤗 Transformers models with your own datasets. The guide
shows one of many valid workflows for using these models and is meant to be illustrative rather than definitive. We
show examples of reading in several data formats, preprocessing the data for several types of tasks, and then preparing
the data into PyTorch/TensorFlow ``Dataset`` objects which can easily be used either with
:class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` or with native PyTorch/TensorFlow.

We include several examples, each of which demonstrates a different type of common downstream task:

@@ -28,13 +28,13 @@ Sequence Classification with IMDb Reviews

.. note::

    This dataset can be explored in the Hugging Face model hub (`IMDb <https://huggingface.co/datasets/imdb>`_), and
    can be alternatively downloaded with the 🤗 NLP library with ``load_dataset("imdb")``.
In this example, we'll show how to download, tokenize, and train a model on the IMDb reviews dataset. This task takes
the text of a review and requires the model to predict whether the sentiment of the review is positive or negative.
Let's start by downloading the dataset from the `Large Movie Review Dataset
<http://ai.stanford.edu/~amaas/data/sentiment/>`_ webpage.

.. code-block:: bash

@@ -62,9 +62,8 @@ read this in.

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')
We now have a train and test dataset, but let's also create a validation set which we can use for evaluation and tuning
without tainting our test set results. Sklearn has a convenient utility for creating such splits:

.. code-block:: python

@@ -80,8 +79,8 @@ pre-trained DistilBert, so let's use the DistilBert tokenizer.

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
Now we can simply pass our texts to the tokenizer. We'll pass ``truncation=True`` and ``padding=True``, which will
ensure that all of our sequences are padded to the same length and are truncated to be no longer than the model's
maximum input length. This will allow us to feed batches of sequences into the model at the same time.

.. code-block:: python

@@ -90,9 +89,9 @@ input length. This will allow us to feed batches of sequences into the model at

test_encodings = tokenizer(test_texts, truncation=True, padding=True)
Now, let's turn our labels and encodings into a Dataset object. In PyTorch, this is done by subclassing a
``torch.utils.data.Dataset`` object and implementing ``__len__`` and ``__getitem__``. In TensorFlow, we pass our input
encodings and labels to the ``from_tensor_slices`` constructor method. We put the data in this format so that the data
can be easily batched such that each key in the batch encoding corresponds to a named parameter of the
:meth:`~transformers.DistilBertForSequenceClassification.forward` method of the model we will train.

.. code-block:: python

@@ -133,17 +132,17 @@ easily batched such that each key in the batch encoding corresponds to a named p

))
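
As a hedged sketch of the PyTorch side described above (the class name ``IMDbDataset`` is illustrative; it only assumes
the ``train_encodings`` and ``train_labels`` created earlier):

.. code-block:: python

    import torch

    class IMDbDataset(torch.utils.data.Dataset):
        def __init__(self, encodings, labels):
            self.encodings = encodings
            self.labels = labels

        def __getitem__(self, idx):
            # wrap every encoded field of one example in a tensor, keyed like the model's forward() arguments
            item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
            item['labels'] = torch.tensor(self.labels[idx])
            return item

        def __len__(self):
            return len(self.labels)

    train_dataset = IMDbDataset(train_encodings, train_labels)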
Now that our datasets are ready, we can fine-tune a model either with the 🤗
:class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` or with native PyTorch/TensorFlow. See :doc:`training
<training>`.

.. _ft_trainer:

Fine-tuning with Trainer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The steps above prepared the datasets in the way that the trainer expects. Now all we need to do is create a model to
fine-tune, define the :class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments` and
instantiate a :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`.

.. code-block:: python

@@ -248,15 +247,15 @@ Token Classification with W-NUT Emerging Entities

.. note::

    This dataset can be explored in the Hugging Face model hub (`WNUT-17 <https://huggingface.co/datasets/wnut_17>`_),
    and can be alternatively downloaded with the 🤗 NLP library with ``load_dataset("wnut_17")``.
Next we will look at token classification. Rather than classifying an entire sequence, this task classifies token by
token. We'll demonstrate how to do this with `Named Entity Recognition
<http://nlpprogress.com/english/named_entity_recognition.html>`_, which involves identifying tokens which correspond to
a predefined set of "entities". Specifically, we'll use the `W-NUT Emerging and Rare entities
<http://noisy-text.github.io/2017/emerging-rare-entities.html>`_ corpus. The data is given as a collection of
pre-tokenized documents where each token is assigned a tag.

Let's start by downloading the data.

@@ -264,10 +263,10 @@ Let's start by downloading the data.

wget http://noisy-text.github.io/2017/files/wnut17train.conll
In this case, we'll just download the train set, which is a single text file. Each line of the file contains either (1)
a word and tag separated by a tab, or (2) a blank line indicating the end of a document. Let's write a function to read
this in. We'll take in the file path and return ``token_docs`` which is a list of lists of token strings, and
``token_tags`` which is a list of lists of tag strings.

.. code-block:: python

@@ -303,8 +302,8 @@ Just to see what this data looks like, let's take a look at a segment of the fir

['for', 'two', 'weeks', '.', 'Empire', 'State', 'Building']
['O', 'O', 'O', 'O', 'B-location', 'I-location', 'I-location']
``location`` is an entity type, ``B-`` indicates the beginning of an entity, and ``I-`` indicates consecutive positions
of the same entity ("Empire State Building" is considered one entity). ``O`` indicates the token does not correspond to
any entity.

Now that we've read the data in, let's create a train/validation split:

@@ -314,8 +313,8 @@ Now that we've read the data in, let's create a train/validation split:

from sklearn.model_selection import train_test_split
train_texts, val_texts, train_tags, val_tags = train_test_split(texts, tags, test_size=.2)
Next, let's create encodings for our tokens and tags. For the tags, we can start by just creating a simple mapping
which we'll use in a moment:

.. code-block:: python

@@ -323,11 +322,11 @@ which we'll use in a moment:

tag2id = {tag: id for id, tag in enumerate(unique_tags)}
id2tag = {id: tag for tag, id in tag2id.items()}
To encode the tokens, we'll use a pre-trained DistilBert tokenizer. We can tell the tokenizer that we're dealing with
ready-split tokens rather than full sentence strings by passing ``is_split_into_words=True``. We'll also pass
``padding=True`` and ``truncation=True`` to pad the sequences to be the same length. Lastly, we can tell the model to
return information about the tokens which are split by the wordpiece tokenization process, which we will need in a
moment.

.. code-block:: python

@@ -339,24 +338,24 @@ a moment.

Great, so now our tokens are nicely encoded in the format that they need to be in to feed them into our DistilBert
model below.
Now we arrive at a common obstacle with using pre-trained models for token-level classification: many of the tokens in
the W-NUT corpus are not in DistilBert's vocabulary. Bert and many models like it use a method called WordPiece
Tokenization, meaning that single words are split into multiple tokens such that each token is likely to be in the
vocabulary. For example, DistilBert's tokenizer would split the Twitter handle ``@huggingface`` into the tokens ``['@',
'hugging', '##face']``. This is a problem for us because we have exactly one tag per token. If the tokenizer splits a
token into multiple sub-tokens, then we will end up with a mismatch between our tokens and our labels.

One way to handle this is to only train on the tag labels for the first subtoken of a split token. We can do this in 🤗
Transformers by setting the labels we wish to ignore to ``-100``. In the example above, if the label for
``@HuggingFace`` is ``3`` (indexing ``B-corporation``), we would set the labels of ``['@', 'hugging', '##face']`` to
``[3, -100, -100]``.
Let's write a function to do this. This is where we will use the ``offset_mapping`` from the tokenizer as mentioned
above. For each sub-token returned by the tokenizer, the offset mapping gives us a tuple indicating the sub-token's
start position and end position relative to the original token it was split from. That means that if the first position
in the tuple is anything other than ``0``, we will set its corresponding label to ``-100``. While we're at it, we can
also set labels to ``-100`` if the second position of the offset mapping is ``0``, since this means it must be a
special token like ``[PAD]`` or ``[CLS]``.
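
A minimal sketch of such a function could look like the following (the name ``encode_tags`` and the use of NumPy are
illustrative; it assumes the ``tag2id`` mapping and encodings created above with ``return_offsets_mapping=True``):

.. code-block:: python

    import numpy as np

    def encode_tags(tags, encodings):
        labels = [[tag2id[tag] for tag in doc] for doc in tags]
        encoded_labels = []
        for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
            # start with -100 everywhere (ignored by the loss), then fill in the first sub-token of each word
            doc_enc_labels = np.ones(len(doc_offset), dtype=int) * -100
            arr_offset = np.array(doc_offset)
            # a first sub-token starts at character 0 of its word and has a non-zero end offset
            doc_enc_labels[(arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)] = doc_labels
            encoded_labels.append(doc_enc_labels.tolist())
        return encoded_labels

    train_labels = encode_tags(train_tags, train_encodings)
    val_labels = encode_tags(val_tags, val_encodings)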
.. note::

@@ -447,8 +446,9 @@ Question Answering with SQuAD 2.0

.. note::

    This dataset can be explored in the Hugging Face model hub (`SQuAD V2
    <https://huggingface.co/datasets/squad_v2>`_), and can be alternatively downloaded with the 🤗 NLP library with
    ``load_dataset("squad_v2")``.
Question answering comes in many forms. In this example, we'll look at the particular type of extractive QA that
involves answering a question about a passage by highlighting the segment of the passage that answers the question.

@@ -464,8 +464,8 @@ We will start by downloading the data:

wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json -O squad/dev-v2.0.json
Each split is in a structured json file with a number of questions and answers for each passage (or context). We'll
take this apart into parallel lists of contexts, questions, and answers (note that the contexts here are repeated since
there are multiple questions per context):

.. code-block:: python

@@ -495,13 +495,13 @@ since there are multiple questions per context):

train_contexts, train_questions, train_answers = read_squad('squad/train-v2.0.json')
val_contexts, val_questions, val_answers = read_squad('squad/dev-v2.0.json')
The contexts and questions are just strings. The answers are dicts containing the subsequence of the passage with the
correct answer as well as an integer indicating the character at which the answer begins. In order to train a model on
this data we need (1) the tokenized context/question pairs, and (2) integers indicating at which *token* positions the
answer begins and ends.

First, let's get the *character* position at which the answer ends in the passage (we are given the starting position).
Sometimes SQuAD answers are off by one or two characters, so we will also adjust for that.

.. code-block:: python

@@ -524,9 +524,9 @@ position). Sometimes SQuAD answers are off by one or two characters, so we will

add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)
Now ``train_answers`` and ``val_answers`` include the character end positions and the corrected start positions. Next,
let's tokenize our context/question pairs. 🤗 Tokenizers can accept parallel lists of sequences and encode them together
as sequence pairs.

.. code-block:: python

@@ -536,8 +536,8 @@ them together as sequence pairs.

train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True)
Next we need to convert our character start/end positions to token start/end positions. When using 🤗 Fast Tokenizers,
we can use the built-in :func:`~transformers.BatchEncoding.char_to_token` method.

.. code-block:: python

@@ -557,9 +557,9 @@ Tokenizers, we can use the built in :func:`~transformers.BatchEncoding.char_to_t

add_token_positions(train_encodings, train_answers)
add_token_positions(val_encodings, val_answers)
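
For reference, a hedged sketch of what a helper like ``add_token_positions`` could do; the dictionary keys
``answer_start``/``answer_end`` and the fallback to the maximum model length for truncated answers are assumptions,
and the elided code above may differ:

.. code-block:: python

    def add_token_positions(encodings, answers):
        start_positions = []
        end_positions = []
        for i in range(len(answers)):
            # convert character indices to token indices with the fast tokenizer's offset information
            start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
            end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))
            # if the answer was truncated out of the window, fall back to the maximum length
            if start_positions[-1] is None:
                start_positions[-1] = tokenizer.model_max_length
            if end_positions[-1] is None:
                end_positions[-1] = tokenizer.model_max_length
        encodings.update({'start_positions': start_positions, 'end_positions': end_positions})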
Our data is ready. Let's just put it in a PyTorch/TensorFlow dataset so that we can easily use it for training. In
PyTorch, we define a custom ``Dataset`` class. In TensorFlow, we pass a tuple of ``(inputs_dict, labels_dict)`` to the
``from_tensor_slices`` method.

.. code-block:: python

@@ -668,12 +668,11 @@ Additional Resources

Using the 🤗 NLP Datasets & Metrics library
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This tutorial demonstrates how to read in datasets from various raw text formats and prepare them for training with 🤗
Transformers so that you can do the same thing with your own custom datasets. However, we recommend users use the `🤗
NLP library <https://github.com/huggingface/nlp>`_ for working with the 150+ datasets included in the `hub
<https://huggingface.co/datasets>`_, including the three datasets used in this tutorial. As a very brief overview, we
will show how to use the NLP library to download and prepare the IMDb dataset from the first example, :ref:`seq_imdb`.

Start by downloading the dataset:

@@ -689,8 +688,8 @@ Each dataset has multiple columns corresponding to different features. Let's see

>>> print(train.column_names)
['label', 'text']
Great. Now let's tokenize the text. We can do this using the ``map`` method. We'll also rename the ``label`` column to
``labels`` to match the model's input arguments.

.. code-block:: python

@@ -711,5 +710,5 @@ dataset elements.

>>> {key: val.shape for key, val in train[0].items()}
{'labels': TensorShape([]), 'input_ids': TensorShape([512]), 'attention_mask': TensorShape([512])}
We now have a fully-prepared dataset. Check out `the 🤗 NLP docs <https://huggingface.co/nlp/processing.html>`_ for a
more thorough introduction.
\ No newline at end of file
@@ -57,8 +57,8 @@ The tokenizer takes care of splitting the sequence into tokens available in the

>>> tokenized_sequence = tokenizer.tokenize(sequence)

The tokens are either words or subwords. Here for instance, "VRAM" wasn't in the model vocabulary, so it's been split
into "V", "RA" and "M". To indicate those tokens are not separate words but parts of the same word, a double-hash
prefix is added for "RA" and "M":
.. code-block:: .. code-block::
...@@ -66,8 +66,8 @@ added for "RA" and "M": ...@@ -66,8 +66,8 @@ added for "RA" and "M":
['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M'] ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding
the sentence to the tokenizer, which leverages the Rust implementation of the sentence to the tokenizer, which leverages the Rust implementation of `huggingface/tokenizers
`huggingface/tokenizers <https://github.com/huggingface/tokenizers>`__ for peak performance. <https://github.com/huggingface/tokenizers>`__ for peak performance.
.. code-block:: .. code-block::
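>>> # Sketch of the direct call: the tokenizer returns a dictionary of model inputs
>>> inputs = tokenizer(sequence)
>>> encoded_sequence = inputs["input_ids"]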
...@@ -105,8 +105,8 @@ because this is the way a :class:`~transformers.BertModel` is going to expect it ...@@ -105,8 +105,8 @@ because this is the way a :class:`~transformers.BertModel` is going to expect it
Attention mask Attention mask
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The attention mask is an optional argument used when batching sequences together. This argument indicates to the The attention mask is an optional argument used when batching sequences together. This argument indicates to the model
model which tokens should be attended to, and which should not. which tokens should be attended to, and which should not.
For example, consider these two sequences: For example, consider these two sequences:
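A sketch of how such a pair can be encoded and padded to the same length (the sentences are illustrative):

.. code-block::

>>> sequence_a = "This is a short sequence."
>>> sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."
>>> # padding=True pads the shorter sequence up to the length of the longer one
>>> padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)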
...@@ -145,10 +145,10 @@ We can see that 0s have been added on the right of the first sentence to make it ...@@ -145,10 +145,10 @@ We can see that 0s have been added on the right of the first sentence to make it
>>> padded_sequences["input_ids"] >>> padded_sequences["input_ids"]
[[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]] [[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]
This can then be converted into a tensor in PyTorch or TensorFlow. The attention mask is a binary tensor indicating This can then be converted into a tensor in PyTorch or TensorFlow. The attention mask is a binary tensor indicating the
the position of the padded indices so that the model does not attend to them. For the position of the padded indices so that the model does not attend to them. For the :class:`~transformers.BertTokenizer`,
:class:`~transformers.BertTokenizer`, :obj:`1` indicates a value that should be attended to, while :obj:`0` indicates :obj:`1` indicates a value that should be attended to, while :obj:`0` indicates a padded value. This attention mask is
a padded value. This attention mask is in the dictionary returned by the tokenizer under the key "attention_mask": in the dictionary returned by the tokenizer under the key "attention_mask":
.. code-block:: .. code-block::
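>>> # One 1 per real token and one 0 per padding token
>>> padded_sequences["attention_mask"]
[[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]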
...@@ -161,15 +161,16 @@ Token Type IDs ...@@ -161,15 +161,16 @@ Token Type IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some models' purpose is to do sequence classification or question answering. These require two different sequences to Some models' purpose is to do sequence classification or question answering. These require two different sequences to
be joined in a single "input_ids" entry, which usually is performed with the help of special tokens, such as the classifier (``[CLS]``) and separator (``[SEP]``) be joined in a single "input_ids" entry, which usually is performed with the help of special tokens, such as the
tokens. For example, the BERT model builds its two sequence input as such: classifier (``[CLS]``) and separator (``[SEP]``) tokens. For example, the BERT model builds its two sequence input as
such:
.. code-block:: .. code-block::
>>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP] >>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]
We can use our tokenizer to automatically generate such a sentence by passing the two sequences to ``tokenizer`` as two arguments (and We can use our tokenizer to automatically generate such a sentence by passing the two sequences to ``tokenizer`` as two
not a list, like before) like this: arguments (and not a list, like before) like this:
.. code-block:: .. code-block::
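>>> # Sketch: pass the two sequences as two separate arguments (not as a list)
>>> sequence_a = "HuggingFace is based in NYC"
>>> sequence_b = "Where is HuggingFace based?"
>>> encoded_dict = tokenizer(sequence_a, sequence_b)
>>> decoded = tokenizer.decode(encoded_dict["input_ids"])
>>> print(decoded)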
...@@ -189,8 +190,8 @@ which will return: ...@@ -189,8 +190,8 @@ which will return:
[CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP] [CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
This is enough for some models to understand where one sequence ends and where another begins. However, other models, This is enough for some models to understand where one sequence ends and where another begins. However, other models,
such as BERT, also deploy token type IDs (also called segment IDs). They are represented as a binary such as BERT, also deploy token type IDs (also called segment IDs). They are represented as a binary mask identifying
mask identifying the two types of sequence in the model. the two types of sequence in the model.
The tokenizer returns this mask as the "token_type_ids" entry: The tokenizer returns this mask as the "token_type_ids" entry:
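Continuing the sketch above, it can be read straight from the tokenizer's output:

.. code-block::

>>> # for BERT-like models, 0 marks tokens of the first sequence and 1 marks tokens of the second one
>>> encoded_dict["token_type_ids"]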
...@@ -209,14 +210,15 @@ Some models, like :class:`~transformers.XLNetModel` use an additional token repr ...@@ -209,14 +210,15 @@ Some models, like :class:`~transformers.XLNetModel` use an additional token repr
Position IDs Position IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Contrary to RNNs that have the position of each token embedded within them, Contrary to RNNs that have the position of each token embedded within them, transformers are unaware of the position of
transformers are unaware of the position of each token. Therefore, the position IDs (``position_ids``) are used by the model to identify each token's position in the list of tokens. each token. Therefore, the position IDs (``position_ids``) are used by the model to identify each token's position in
the list of tokens.
They are an optional parameter. If no ``position_ids`` is passed to the model, the IDs are automatically created as absolute They are an optional parameter. If no ``position_ids`` is passed to the model, the IDs are automatically created as
positional embeddings. absolute positional embeddings.
Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models use
use other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings. other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.
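If needed, position IDs can also be built and passed explicitly. A minimal sketch, assuming ``model`` and ``input_ids`` (a tensor of shape ``(batch_size, seq_length)``) are already defined and the model accepts ``position_ids``, as BERT does:

.. code-block::

>>> import torch

>>> # one position per token, with a batch dimension added by unsqueeze
>>> position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)
>>> outputs = model(input_ids, position_ids=position_ids)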
.. _labels: .. _labels:
...@@ -224,43 +226,41 @@ Labels ...@@ -224,43 +226,41 @@ Labels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The labels are an optional argument which can be passed in order for the model to compute the loss itself. These labels The labels are an optional argument which can be passed in order for the model to compute the loss itself. These labels
should be the expected prediction of the model: it will use the standard loss in order to compute the loss between should be the expected prediction of the model: it will use the standard loss in order to compute the loss between its
its predictions and the expected value (the label). predictions and the expected value (the label).
These labels are different according to the model head, for example: These labels are different according to the model head, for example:
- For sequence classification models (e.g., :class:`~transformers.BertForSequenceClassification`), the model expects - For sequence classification models (e.g., :class:`~transformers.BertForSequenceClassification`), the model expects a
a tensor of dimension :obj:`(batch_size)` with each value of the batch corresponding to the expected label of the tensor of dimension :obj:`(batch_size)` with each value of the batch corresponding to the expected label of the
entire sequence. entire sequence.
- For token classification models (e.g., :class:`~transformers.BertForTokenClassification`), the model expects - For token classification models (e.g., :class:`~transformers.BertForTokenClassification`), the model expects a tensor
a tensor of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each individual
individual token. token.
- For masked language modeling (e.g., :class:`~transformers.BertForMaskedLM`), the model expects - For masked language modeling (e.g., :class:`~transformers.BertForMaskedLM`), the model expects a tensor of dimension
a tensor of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each individual token: the
individual token: the labels being the token ID for the masked token, and values to be ignored for the rest (usually labels being the token ID for the masked token, and values to be ignored for the rest (usually -100).
-100).
- For sequence to sequence tasks (e.g., :class:`~transformers.BartForConditionalGeneration`, - For sequence to sequence tasks (e.g., :class:`~transformers.BartForConditionalGeneration`,
:class:`~transformers.MBartForConditionalGeneration`), the model expects a tensor of dimension :class:`~transformers.MBartForConditionalGeneration`), the model expects a tensor of dimension :obj:`(batch_size,
:obj:`(batch_size, tgt_seq_length)` with each value corresponding to the target sequences associated with each tgt_seq_length)` with each value corresponding to the target sequences associated with each input sequence. During
input sequence. During training, both `BART` and `T5` will make the appropriate `decoder_input_ids` and decoder training, both `BART` and `T5` will make the appropriate `decoder_input_ids` and decoder attention masks internally.
attention masks internally. They usually do not need to be supplied. This does not apply to models leveraging the They usually do not need to be supplied. This does not apply to models leveraging the Encoder-Decoder framework. See
Encoder-Decoder framework. the documentation of each model for more information on each specific model's labels.
See the documentation of each model for more information on each specific model's labels.
The base models (e.g., :class:`~transformers.BertModel`) do not accept labels, as these are the base transformer models, The base models (e.g., :class:`~transformers.BertModel`) do not accept labels, as these are the base transformer
simply outputting features. models, simply outputting features.
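For instance, a hedged sketch of the sequence classification case, where passing the labels makes the model return the loss directly (checkpoint name and label value are illustrative):

.. code-block::

>>> import torch
>>> from transformers import BertForSequenceClassification, BertTokenizer

>>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
>>> model = BertForSequenceClassification.from_pretrained("bert-base-cased")
>>> inputs = tokenizer("HuggingFace is based in NYC", return_tensors="pt")
>>> labels = torch.tensor([1])  # one expected label for the single sequence in the batch
>>> outputs = model(**inputs, labels=labels)
>>> loss = outputs[0]  # the first element is the loss when labels are provided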
.. _decoder-input-ids: .. _decoder-input-ids:
Decoder input IDs Decoder input IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder. This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder. These
These inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually built in a
built in a way specific to each model. way specific to each model.
Most encoder-decoder models (BART, T5) create their :obj:`decoder_input_ids` on their own from the :obj:`labels`. Most encoder-decoder models (BART, T5) create their :obj:`decoder_input_ids` on their own from the :obj:`labels`. In
In such models, passing the :obj:`labels` is the preferred way to handle training. such models, passing the :obj:`labels` is the preferred way to handle training.
Please check each model's docs to see how they handle these input IDs for sequence to sequence training. Please check each model's docs to see how they handle these input IDs for sequence to sequence training.
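As a sketch (checkpoint name and sentences are illustrative), training a T5 model only requires the ``labels``; the ``decoder_input_ids`` are then built internally:

.. code-block::

>>> from transformers import T5ForConditionalGeneration, T5Tokenizer

>>> tokenizer = T5Tokenizer.from_pretrained("t5-small")
>>> model = T5ForConditionalGeneration.from_pretrained("t5-small")
>>> input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")["input_ids"]
>>> labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt")["input_ids"]
>>> # no decoder_input_ids are passed: the model shifts the labels to create them
>>> loss = model(input_ids=input_ids, labels=labels)[0]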
...@@ -270,17 +270,17 @@ Feed Forward Chunking ...@@ -270,17 +270,17 @@ Feed Forward Chunking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In each residual attention block in transformers the self-attention layer is usually followed by 2 feed forward layers. In each residual attention block in transformers the self-attention layer is usually followed by 2 feed forward layers.
The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (e.g., The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (e.g., for
for ``bert-base-uncased``). ``bert-base-uncased``).
For an input of size ``[batch_size, sequence_length]``, the memory required to store the intermediate feed forward For an input of size ``[batch_size, sequence_length]``, the memory required to store the intermediate feed forward
embeddings ``[batch_size, sequence_length, config.intermediate_size]`` can account for a large fraction of the memory embeddings ``[batch_size, sequence_length, config.intermediate_size]`` can account for a large fraction of the memory
use. The authors of `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_ noticed that since the use. The authors of `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_ noticed that since the
computation is independent of the ``sequence_length`` dimension, it is mathematically equivalent to compute the output computation is independent of the ``sequence_length`` dimension, it is mathematically equivalent to compute the output
embeddings of both feed forward layers ``[batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n`` embeddings of both feed forward layers ``[batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n``
individually and concat them afterward to ``[batch_size, sequence_length, config.hidden_size]`` with individually and concat them afterward to ``[batch_size, sequence_length, config.hidden_size]`` with ``n =
``n = sequence_length``, which trades increased computation time against reduced memory use, but yields a sequence_length``, which trades increased computation time against reduced memory use, but yields a mathematically
mathematically **equivalent** result. **equivalent** result.
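Conceptually, the chunked computation looks like the following plain PyTorch sketch (an illustration of the equivalence, not the library's implementation):

.. code-block::

>>> import torch

>>> def chunked_feed_forward(hidden_states, feed_forward, chunk_size):
...     # split the sequence dimension into chunks, apply the feed forward layers to each
...     # chunk separately, then concatenate the results along the sequence dimension again
...     chunks = hidden_states.split(chunk_size, dim=1)
...     return torch.cat([feed_forward(chunk) for chunk in chunks], dim=1)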
For models employing the function :func:`~.transformers.apply_chunking_to_forward`, the ``chunk_size`` defines the For models employing the function :func:`~.transformers.apply_chunking_to_forward`, the ``chunk_size`` defines the
number of output embeddings that are computed in parallel and thus defines the trade-off between memory and time number of output embeddings that are computed in parallel and thus defines the trade-off between memory and time
......
...@@ -47,6 +47,7 @@ The documentation is organized in five parts: ...@@ -47,6 +47,7 @@ The documentation is organized in five parts:
- **RESEARCH** focuses on tutorials that have less to do with how to use the library but more about general research in - **RESEARCH** focuses on tutorials that have less to do with how to use the library but more about general research in
transformers models transformers models
- The last three sections contain the documentation of each public class and function, grouped in: - The last three sections contain the documentation of each public class and function, grouped in:
- **MAIN CLASSES** for the main classes exposing the important APIs of the library. - **MAIN CLASSES** for the main classes exposing the important APIs of the library.
- **MODELS** for the classes and functions related to each model implemented in the library. - **MODELS** for the classes and functions related to each model implemented in the library.
- **INTERNAL HELPERS** for the classes and functions we use internally. - **INTERNAL HELPERS** for the classes and functions we use internally.
......
...@@ -25,6 +25,7 @@ SpecialTokensMixin ...@@ -25,6 +25,7 @@ SpecialTokensMixin
Enums and namedtuples Enums and namedtuples
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.tokenization_utils_base.ExplicitEnum .. autoclass:: transformers.tokenization_utils_base.ExplicitEnum
.. autoclass:: transformers.tokenization_utils_base.PaddingStrategy .. autoclass:: transformers.tokenization_utils_base.PaddingStrategy
......
Pipelines Pipelines
----------------------------------------------------------------------------------------------------------------------- -----------------------------------------------------------------------------------------------------------------------
The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of
of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity
Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. See the Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. See the
:doc:`task summary <../task_summary>` for examples of use. :doc:`task summary <../task_summary>` for examples of use.
...@@ -26,8 +26,8 @@ There are two categories of pipeline abstractions to be aware about: ...@@ -26,8 +26,8 @@ There are two categories of pipeline abstractions to be aware about:
The pipeline abstraction The pipeline abstraction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated as any The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated as any other
other pipeline but requires an additional argument which is the `task`. pipeline but requires an additional argument which is the `task`.
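A minimal sketch of its usage (the task below is just one of the supported task strings):

.. code-block::

>>> from transformers import pipeline

>>> # instantiate a pipeline for a given task and call it on some text
>>> classifier = pipeline("sentiment-analysis")
>>> classifier("We are very happy to show you the 🤗 Transformers library.")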
.. autofunction:: transformers.pipeline .. autofunction:: transformers.pipeline
......
...@@ -8,8 +8,8 @@ Processors ...@@ -8,8 +8,8 @@ Processors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All processors follow the same architecture which is that of the All processors follow the same architecture which is that of the
:class:`~transformers.data.processors.utils.DataProcessor`. The processor returns a list :class:`~transformers.data.processors.utils.DataProcessor`. The processor returns a list of
of :class:`~transformers.data.processors.utils.InputExample`. These :class:`~transformers.data.processors.utils.InputExample`. These
:class:`~transformers.data.processors.utils.InputExample` can be converted to :class:`~transformers.data.processors.utils.InputExample` can be converted to
:class:`~transformers.data.processors.utils.InputFeatures` in order to be fed to the model. :class:`~transformers.data.processors.utils.InputFeatures` in order to be fed to the model.
...@@ -28,14 +28,16 @@ of :class:`~transformers.data.processors.utils.InputExample`. These ...@@ -28,14 +28,16 @@ of :class:`~transformers.data.processors.utils.InputExample`. These
GLUE GLUE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`General Language Understanding Evaluation (GLUE) <https://gluebenchmark.com/>`__ is a benchmark that evaluates `General Language Understanding Evaluation (GLUE) <https://gluebenchmark.com/>`__ is a benchmark that evaluates the
the performance of models across a diverse set of existing NLU tasks. It was released together with the paper performance of models across a diverse set of existing NLU tasks. It was released together with the paper `GLUE: A
`GLUE: A multi-task benchmark and analysis platform for natural language understanding <https://openreview.net/pdf?id=rJ4km2R5t7>`__ multi-task benchmark and analysis platform for natural language understanding
<https://openreview.net/pdf?id=rJ4km2R5t7>`__
This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB,
CoLA, SST2, STSB, QQP, QNLI, RTE and WNLI. QQP, QNLI, RTE and WNLI.
Those processors are: Those processors are:
- :class:`~transformers.data.processors.utils.MrpcProcessor` - :class:`~transformers.data.processors.utils.MrpcProcessor`
- :class:`~transformers.data.processors.utils.MnliProcessor` - :class:`~transformers.data.processors.utils.MnliProcessor`
- :class:`~transformers.data.processors.utils.MnliMismatchedProcessor` - :class:`~transformers.data.processors.utils.MnliMismatchedProcessor`
...@@ -54,36 +56,39 @@ Additionally, the following method can be used to load values from a data file ...@@ -54,36 +56,39 @@ Additionally, the following method can be used to load values from a data file
Example usage Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
An example using these processors is given in the `run_glue.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_glue.py>`__ script. An example using these processors is given in the `run_glue.py
<https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_glue.py>`__ script.
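In code, the typical flow is roughly the following sketch (data directory and hyper-parameters are illustrative):

.. code-block::

>>> from transformers import BertTokenizer, glue_convert_examples_to_features
>>> from transformers.data.processors.glue import MrpcProcessor

>>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
>>> processor = MrpcProcessor()
>>> # read InputExamples from the (illustrative) data directory, then turn them into InputFeatures
>>> examples = processor.get_train_examples("path/to/MRPC")
>>> features = glue_convert_examples_to_features(examples, tokenizer, max_length=128, task="mrpc")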
XNLI XNLI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`The Cross-Lingual NLI Corpus (XNLI) <https://www.nyu.edu/projects/bowman/xnli/>`__ is a benchmark that evaluates `The Cross-Lingual NLI Corpus (XNLI) <https://www.nyu.edu/projects/bowman/xnli/>`__ is a benchmark that evaluates the
the quality of cross-lingual text representations. quality of cross-lingual text representations. XNLI is a crowd-sourced dataset based on `MultiNLI
XNLI is a crowd-sourced dataset based on `MultiNLI <http://www.nyu.edu/projects/bowman/multinli/>`__: pairs of text are labeled with textual entailment <http://www.nyu.edu/projects/bowman/multinli/>`__: pairs of text are labeled with textual entailment annotations for 15
annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili). different languages (including both high-resource languages such as English and low-resource languages such as Swahili).
It was released together with the paper It was released together with the paper `XNLI: Evaluating Cross-lingual Sentence Representations
`XNLI: Evaluating Cross-lingual Sentence Representations <https://arxiv.org/abs/1809.05053>`__ <https://arxiv.org/abs/1809.05053>`__
This library hosts the processor to load the XNLI data: This library hosts the processor to load the XNLI data:
- :class:`~transformers.data.processors.utils.XnliProcessor` - :class:`~transformers.data.processors.utils.XnliProcessor`
Please note that since the gold labels are available on the test set, evaluation is performed on the test set. Please note that since the gold labels are available on the test set, evaluation is performed on the test set.
An example using these processors is given in the An example using these processors is given in the `run_xnli.py
`run_xnli.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_xnli.py>`__ script. <https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_xnli.py>`__ script.
SQuAD SQuAD
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`The Stanford Question Answering Dataset (SQuAD) <https://rajpurkar.github.io/SQuAD-explorer//>`__ is a benchmark that evaluates `The Stanford Question Answering Dataset (SQuAD) <https://rajpurkar.github.io/SQuAD-explorer//>`__ is a benchmark that
the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version
`SQuAD: 100,000+ Questions for Machine Comprehension of Text <https://arxiv.org/abs/1606.05250>`__. The second version (v2.0) was released alongside (v1.1) was released together with the paper `SQuAD: 100,000+ Questions for Machine Comprehension of Text
the paper `Know What You Don't Know: Unanswerable Questions for SQuAD <https://arxiv.org/abs/1806.03822>`__. <https://arxiv.org/abs/1606.05250>`__. The second version (v2.0) was released alongside the paper `Know What You Don't
Know: Unanswerable Questions for SQuAD <https://arxiv.org/abs/1806.03822>`__.
This library hosts a processor for each of the two versions: This library hosts a processor for each of the two versions:
...@@ -91,6 +96,7 @@ Processors ...@@ -91,6 +96,7 @@ Processors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Those processors are: Those processors are:
- :class:`~transformers.data.processors.utils.SquadV1Processor` - :class:`~transformers.data.processors.utils.SquadV1Processor`
- :class:`~transformers.data.processors.utils.SquadV2Processor` - :class:`~transformers.data.processors.utils.SquadV2Processor`
...@@ -99,17 +105,18 @@ They both inherit from the abstract class :class:`~transformers.data.processors. ...@@ -99,17 +105,18 @@ They both inherit from the abstract class :class:`~transformers.data.processors.
.. autoclass:: transformers.data.processors.squad.SquadProcessor .. autoclass:: transformers.data.processors.squad.SquadProcessor
:members: :members:
Additionally, the following method can be used to convert SQuAD examples into :class:`~transformers.data.processors.utils.SquadFeatures` Additionally, the following method can be used to convert SQuAD examples into
that can be used as model inputs. :class:`~transformers.data.processors.utils.SquadFeatures` that can be used as model inputs.
.. automethod:: transformers.data.processors.squad.squad_convert_examples_to_features .. automethod:: transformers.data.processors.squad.squad_convert_examples_to_features
These processors as well as the aforementioned method can be used with files containing the data as well as with the These processors as well as the aforementioned method can be used with files containing the data as well as with the
Examples are given below. `tensorflow_datasets` package. Examples are given below.
Example usage Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example using the processors as well as the conversion method using data files: Here is an example using the processors as well as the conversion method using data files:
.. code-block:: .. code-block::
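>>> # Sketch of the file-based flow (data directory and hyper-parameters are illustrative)
>>> from transformers import BertTokenizer
>>> from transformers.data.processors.squad import SquadV2Processor, squad_convert_examples_to_features

>>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
>>> processor = SquadV2Processor()
>>> examples = processor.get_dev_examples("path/to/squad")
>>> features = squad_convert_examples_to_features(
...     examples=examples,
...     tokenizer=tokenizer,
...     max_seq_length=384,
...     doc_stride=128,
...     max_query_length=64,
...     is_training=False,
... )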
...@@ -149,5 +156,5 @@ Using `tensorflow_datasets` is as easy as using a data file: ...@@ -149,5 +156,5 @@ Using `tensorflow_datasets` is as easy as using a data file:
) )
Another example using these processors is given in the Another example using these processors is given in the `run_squad.py
`run_squad.py <https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py>`__ script. <https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py>`__ script.
...@@ -29,11 +29,12 @@ methods for using all the tokenizers: ...@@ -29,11 +29,12 @@ methods for using all the tokenizers:
:class:`~transformers.BatchEncoding` holds the output of the tokenizer's encoding methods (``__call__``, :class:`~transformers.BatchEncoding` holds the output of the tokenizer's encoding methods (``__call__``,
``encode_plus`` and ``batch_encode_plus``) and is derived from a Python dictionary. When the tokenizer is a pure python ``encode_plus`` and ``batch_encode_plus``) and is derived from a Python dictionary. When the tokenizer is a pure python
tokenizer, this class behaves just like a standard python dictionary and holds the various model inputs computed by these tokenizer, this class behaves just like a standard python dictionary and holds the various model inputs computed by
methods (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e., backed by HuggingFace these methods (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e., backed by
`tokenizers library <https://github.com/huggingface/tokenizers>`__), this class provides in addition several advanced HuggingFace `tokenizers library <https://github.com/huggingface/tokenizers>`__), this class provides in addition
alignment methods which can be used to map between the original string (character and words) and the token space (e.g., several advanced alignment methods which can be used to map between the original string (character and words) and the
getting the index of the token comprising a given character or the span of characters corresponding to a given token). token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding
to a given token).
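For instance, with a fast tokenizer the alignment methods can be used along these lines (a sketch; the sentence is illustrative):

.. code-block::

>>> from transformers import BertTokenizerFast

>>> tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
>>> encoding = tokenizer("HuggingFace is based in NYC")
>>> encoding.tokens()          # the tokens produced for the input string
>>> encoding.char_to_token(5)  # index of the token containing the character at position 5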
PreTrainedTokenizer PreTrainedTokenizer
......
...@@ -19,14 +19,14 @@ downstream tasks. However, at some point further model increases become harder d ...@@ -19,14 +19,14 @@ downstream tasks. However, at some point further model increases become harder d
longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction
techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows
that our proposed methods lead to models that scale much better compared to the original BERT. We also use a that our proposed methods lead to models that scale much better compared to the original BERT. We also use a
self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks
tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and
RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.* SQuAD benchmarks while having fewer parameters compared to BERT-large.*
Tips: Tips:
- ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on - ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather
the right rather than the left. than the left.
- ALBERT uses repeating layers which results in a small memory footprint, however the computational cost remains - ALBERT uses repeating layers which results in a small memory footprint, however the computational cost remains
similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same
number of (repeating) layers. number of (repeating) layers.
......
...@@ -2,9 +2,8 @@ AutoClasses ...@@ -2,9 +2,8 @@ AutoClasses
----------------------------------------------------------------------------------------------------------------------- -----------------------------------------------------------------------------------------------------------------------
In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you
are supplying to the :obj:`from_pretrained()` method. are supplying to the :obj:`from_pretrained()` method. AutoClasses are here to do this job for you so that you
AutoClasses are here to do this job for you so that you automatically retrieve the relevant model given the name/path automatically retrieve the relevant model given the name/path to the pretrained weights/config/vocabulary.
to the pretrained weights/config/vocabulary.
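A short sketch (the checkpoint name is illustrative):

.. code-block::

>>> from transformers import AutoConfig, AutoModel, AutoTokenizer

>>> # the architecture-specific classes are picked from the checkpoint name
>>> config = AutoConfig.from_pretrained("bert-base-cased")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> model = AutoModel.from_pretrained("bert-base-cased")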
Instantiating one of :class:`~transformers.AutoConfig`, :class:`~transformers.AutoModel`, and Instantiating one of :class:`~transformers.AutoConfig`, :class:`~transformers.AutoModel`, and
:class:`~transformers.AutoTokenizer` will directly create a class of the relevant architecture. For instance :class:`~transformers.AutoTokenizer` will directly create a class of the relevant architecture. For instance
......