Unverified Commit 08f534d2 authored by Sylvain Gugger's avatar Sylvain Gugger Committed by GitHub
Browse files

Doc styling (#8067)

* Important files

* Styling them all

* Revert "Styling them all"

This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.

* Syling them for realsies

* Fix syntax error

* Fix benchmark_utils

* More fixes

* Fix modeling auto and script

* Remove new line

* Fixes

* More fixes

* Fix more files

* Style

* Add FSMT

* More fixes

* More fixes

* More fixes

* More fixes

* Fixes

* More fixes

* More fixes

* Last fixes

* Make sphinx happy
parent 04a17f85
......@@ -322,6 +322,7 @@ jobs:
- run: black --check examples templates tests src utils
- run: isort --check-only examples templates tests src utils
- run: flake8 examples templates tests src utils
- run: python utils/style_doc.py src/transformers docs/source --max_len 119 --check_only
- run: python utils/check_copies.py
- run: python utils/check_dummies.py
- run: python utils/check_repo.py
......
......@@ -15,6 +15,7 @@ modified_only_fixup:
black $(modified_py_files); \
isort $(modified_py_files); \
flake8 $(modified_py_files); \
python utils/style_doc.py $(modified_py_files) --max_len 119; \
else \
echo "No library .py files were modified"; \
fi
......@@ -31,6 +32,7 @@ quality:
black --check $(check_dirs)
isort --check-only $(check_dirs)
flake8 $(check_dirs)
python utils/style_doc.py src/transformers docs/source --max_len 119 --check_only
${MAKE} extra_quality_checks
# Format source code automatically and check is there are any problems left that need manual fixing
......@@ -38,6 +40,7 @@ quality:
style:
black $(check_dirs)
isort $(check_dirs)
python utils/style_doc.py src/transformers docs/source --max_len 119
# Super fast fix and check target that only works on relevant modified files since the branch was made
......
......@@ -3,21 +3,27 @@ Benchmarks
Let's take a look at how 🤗 Transformer models can be benchmarked, best practices, and already available benchmarks.
A notebook explaining in more detail how to benchmark 🤗 Transformer models can be found `here <https://github.com/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb>`__.
A notebook explaining in more detail how to benchmark 🤗 Transformer models can be found `here
<https://github.com/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb>`__.
How to benchmark 🤗 Transformer models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` allow to flexibly benchmark 🤗 Transformer models.
The benchmark classes allow us to measure the `peak memory usage` and `required time` for both
`inference` and `training`.
The classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` allow to flexibly
benchmark 🤗 Transformer models. The benchmark classes allow us to measure the `peak memory usage` and `required time`
for both `inference` and `training`.
.. note::
Hereby, `inference` is defined by a single forward pass, and `training` is defined by a single forward pass and backward pass.
Hereby, `inference` is defined by a single forward pass, and `training` is defined by a single forward pass and
backward pass.
The benchmark classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` expect an object of type :class:`~transformers.PyTorchBenchmarkArguments` and :class:`~transformers.TensorFlowBenchmarkArguments`, respectively, for instantiation. :class:`~transformers.PyTorchBenchmarkArguments` and :class:`~transformers.TensorFlowBenchmarkArguments` are data classes and contain all relevant configurations for their corresponding benchmark class.
In the following example, it is shown how a BERT model of type `bert-base-cased` can be benchmarked.
The benchmark classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` expect an
object of type :class:`~transformers.PyTorchBenchmarkArguments` and
:class:`~transformers.TensorFlowBenchmarkArguments`, respectively, for instantiation.
:class:`~transformers.PyTorchBenchmarkArguments` and :class:`~transformers.TensorFlowBenchmarkArguments` are data
classes and contain all relevant configurations for their corresponding benchmark class. In the following example, it
is shown how a BERT model of type `bert-base-cased` can be benchmarked.
.. code-block::
......@@ -34,11 +40,15 @@ In the following example, it is shown how a BERT model of type `bert-base-cased`
>>> benchmark = TensorFlowBenchmark(args)
Here, three arguments are given to the benchmark argument data classes, namely ``models``, ``batch_sizes``, and ``sequence_lengths``. The argument ``models`` is required and expects a :obj:`list` of model identifiers from the `model hub <https://huggingface.co/models>`__
The :obj:`list` arguments ``batch_sizes`` and ``sequence_lengths`` define the size of the ``input_ids`` on which the model is benchmarked.
There are many more parameters that can be configured via the benchmark argument data classes. For more detail on these one can either directly consult the files
``src/transformers/benchmark/benchmark_args_utils.py``, ``src/transformers/benchmark/benchmark_args.py`` (for PyTorch) and ``src/transformers/benchmark/benchmark_args_tf.py`` (for Tensorflow).
Alternatively, running the following shell commands from root will print out a descriptive list of all configurable parameters for PyTorch and Tensorflow respectively.
Here, three arguments are given to the benchmark argument data classes, namely ``models``, ``batch_sizes``, and
``sequence_lengths``. The argument ``models`` is required and expects a :obj:`list` of model identifiers from the
`model hub <https://huggingface.co/models>`__ The :obj:`list` arguments ``batch_sizes`` and ``sequence_lengths`` define
the size of the ``input_ids`` on which the model is benchmarked. There are many more parameters that can be configured
via the benchmark argument data classes. For more detail on these one can either directly consult the files
``src/transformers/benchmark/benchmark_args_utils.py``, ``src/transformers/benchmark/benchmark_args.py`` (for PyTorch)
and ``src/transformers/benchmark/benchmark_args_tf.py`` (for Tensorflow). Alternatively, running the following shell
commands from root will print out a descriptive list of all configurable parameters for PyTorch and Tensorflow
respectively.
.. code-block:: bash
......@@ -145,14 +155,17 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
- gpu_performance_state: 2
- use_tpu: False
By default, the `time` and the `required memory` for `inference` are benchmarked.
In the example output above the first two sections show the result corresponding to `inference time` and `inference memory`.
In addition, all relevant information about the computing environment, `e.g.` the GPU type, the system, the library versions, etc... are printed out in the third section under `ENVIRONMENT INFORMATION`.
This information can optionally be saved in a `.csv` file when adding the argument :obj:`save_to_csv=True` to :class:`~transformers.PyTorchBenchmarkArguments` and :class:`~transformers.TensorFlowBenchmarkArguments` respectively.
In this case, every section is saved in a separate `.csv` file. The path to each `.csv` file can optionally be defined via the argument data classes.
By default, the `time` and the `required memory` for `inference` are benchmarked. In the example output above the first
two sections show the result corresponding to `inference time` and `inference memory`. In addition, all relevant
information about the computing environment, `e.g.` the GPU type, the system, the library versions, etc... are printed
out in the third section under `ENVIRONMENT INFORMATION`. This information can optionally be saved in a `.csv` file
when adding the argument :obj:`save_to_csv=True` to :class:`~transformers.PyTorchBenchmarkArguments` and
:class:`~transformers.TensorFlowBenchmarkArguments` respectively. In this case, every section is saved in a separate
`.csv` file. The path to each `.csv` file can optionally be defined via the argument data classes.
Instead of benchmarking pre-trained models via their model identifier, `e.g.` `bert-base-uncased`, the user can alternatively benchmark an arbitrary configuration of any available model class.
In this case, a :obj:`list` of configurations must be inserted with the benchmark args as follows.
Instead of benchmarking pre-trained models via their model identifier, `e.g.` `bert-base-uncased`, the user can
alternatively benchmark an arbitrary configuration of any available model class. In this case, a :obj:`list` of
configurations must be inserted with the benchmark args as follows.
.. code-block::
......@@ -295,8 +308,9 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar
- use_tpu: False
Again, `inference time` and `required memory` for `inference` are measured, but this time for customized configurations of the :obj:`BertModel` class. This feature can especially be helpful when
deciding for which configuration the model should be trained.
Again, `inference time` and `required memory` for `inference` are measured, but this time for customized configurations
of the :obj:`BertModel` class. This feature can especially be helpful when deciding for which configuration the model
should be trained.
Benchmark best practices
......@@ -305,18 +319,27 @@ Benchmark best practices
This section lists a couple of best practices one should be aware of when benchmarking a model.
- Currently, only single device benchmarking is supported. When benchmarking on GPU, it is recommended that the user
specifies on which device the code should be run by setting the ``CUDA_VISIBLE_DEVICES`` environment variable in the shell, `e.g.` ``export CUDA_VISIBLE_DEVICES=0`` before running the code.
- The option :obj:`no_multi_processing` should only be set to :obj:`True` for testing and debugging. To ensure accurate memory measurement it is recommended to run each memory benchmark in a separate process by making sure :obj:`no_multi_processing` is set to :obj:`True`.
- One should always state the environment information when sharing the results of a model benchmark. Results can vary heavily between different GPU devices, library versions, etc., so that benchmark results on their own are not very useful for the community.
specifies on which device the code should be run by setting the ``CUDA_VISIBLE_DEVICES`` environment variable in the
shell, `e.g.` ``export CUDA_VISIBLE_DEVICES=0`` before running the code.
- The option :obj:`no_multi_processing` should only be set to :obj:`True` for testing and debugging. To ensure accurate
memory measurement it is recommended to run each memory benchmark in a separate process by making sure
:obj:`no_multi_processing` is set to :obj:`True`.
- One should always state the environment information when sharing the results of a model benchmark. Results can vary
heavily between different GPU devices, library versions, etc., so that benchmark results on their own are not very
useful for the community.
Sharing your benchmark
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Previously all available core models (10 at the time) have been benchmarked for `inference time`, across many different settings: using PyTorch, with
and without TorchScript, using TensorFlow, with and without XLA. All of those tests were done across CPUs (except for
TensorFlow XLA) and GPUs.
Previously all available core models (10 at the time) have been benchmarked for `inference time`, across many different
settings: using PyTorch, with and without TorchScript, using TensorFlow, with and without XLA. All of those tests were
done across CPUs (except for TensorFlow XLA) and GPUs.
The approach is detailed in the `following blogpost <https://medium.com/huggingface/benchmarking-transformers-pytorch-and-tensorflow-e2917fb891c2>`__ and the results are available `here <https://docs.google.com/spreadsheets/d/1sryqufw2D0XlUH4sq3e9Wnxu5EAQkaohzrJbd5HdQ_w/edit?usp=sharing>`__.
The approach is detailed in the `following blogpost
<https://medium.com/huggingface/benchmarking-transformers-pytorch-and-tensorflow-e2917fb891c2>`__ and the results are
available `here
<https://docs.google.com/spreadsheets/d/1sryqufw2D0XlUH4sq3e9Wnxu5EAQkaohzrJbd5HdQ_w/edit?usp=sharing>`__.
With the new `benchmark` tools, it is easier than ever to share your benchmark results with the community `here <https://github.com/huggingface/transformers/blob/master/examples/benchmarking/README.md>`__.
With the new `benchmark` tools, it is easier than ever to share your benchmark results with the community `here
<https://github.com/huggingface/transformers/blob/master/examples/benchmarking/README.md>`__.
BERTology
-----------------------------------------------------------------------------------------------------------------------
There is a growing field of study concerned with investigating the inner working of large-scale transformers like BERT (that some call "BERTology"). Some good examples of this field are:
There is a growing field of study concerned with investigating the inner working of large-scale transformers like BERT
(that some call "BERTology"). Some good examples of this field are:
* BERT Rediscovers the Classical NLP Pipeline by Ian Tenney, Dipanjan Das, Ellie Pavlick: https://arxiv.org/abs/1905.05950
* BERT Rediscovers the Classical NLP Pipeline by Ian Tenney, Dipanjan Das, Ellie Pavlick:
https://arxiv.org/abs/1905.05950
* Are Sixteen Heads Really Better than One? by Paul Michel, Omer Levy, Graham Neubig: https://arxiv.org/abs/1905.10650
* What Does BERT Look At? An Analysis of BERT's Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning: https://arxiv.org/abs/1906.04341
* What Does BERT Look At? An Analysis of BERT's Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D.
Manning: https://arxiv.org/abs/1906.04341
In order to help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models to help people access the inner representations, mainly adapted from the great work of Paul Michel (https://arxiv.org/abs/1905.10650):
In order to help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models to
help people access the inner representations, mainly adapted from the great work of Paul Michel
(https://arxiv.org/abs/1905.10650):
* accessing all the hidden-states of BERT/GPT/GPT-2,
* accessing all the attention weights for each head of BERT/GPT/GPT-2,
* retrieving heads output values and gradients to be able to compute head importance score and prune head as explained in https://arxiv.org/abs/1905.10650.
* retrieving heads output values and gradients to be able to compute head importance score and prune head as explained
in https://arxiv.org/abs/1905.10650.
To help you understand and use these features, we have added a specific example script: `bertology.py <https://github.com/huggingface/transformers/blob/master/examples/bertology/run_bertology.py>`_ while extract information and prune a model pre-trained on GLUE.
To help you understand and use these features, we have added a specific example script: `bertology.py
<https://github.com/huggingface/transformers/blob/master/examples/bertology/run_bertology.py>`_ while extract
information and prune a model pre-trained on GLUE.
Converting Tensorflow Checkpoints
=======================================================================================================================
A command-line interface is provided to convert original Bert/GPT/GPT-2/Transformer-XL/XLNet/XLM checkpoints in models than be loaded using the ``from_pretrained`` methods of the library.
A command-line interface is provided to convert original Bert/GPT/GPT-2/Transformer-XL/XLNet/XLM checkpoints in models
than be loaded using the ``from_pretrained`` methods of the library.
.. note::
Since 2.3.0 the conversion script is now part of the transformers CLI (**transformers-cli**)
available in any transformers >= 2.3.0 installation.
Since 2.3.0 the conversion script is now part of the transformers CLI (**transformers-cli**) available in any
transformers >= 2.3.0 installation.
The documentation below reflects the **transformers-cli convert** command format.
BERT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google <https://github.com/google-research/bert#pre-trained-models>`_\ ) in a PyTorch save file by using the `convert_bert_original_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_ script.
This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated configuration file (\ ``bert_config.json``\ ), and creates a PyTorch model for this configuration, loads the weights from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that can be imported using ``torch.load()`` (see examples in `run_bert_extract_features.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_extract_features.py>`_\ , `run_bert_classifier.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and `run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\ ).
You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow checkpoint (the three files starting with ``bert_model.ckpt``\ ) but be sure to keep the configuration file (\ ``bert_config.json``\ ) and the vocabulary file (\ ``vocab.txt``\ ) as these are needed for the PyTorch model too.
To run this specific conversion script you will need to have TensorFlow and PyTorch installed (\ ``pip install tensorflow``\ ). The rest of the repository only requires PyTorch.
You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google
<https://github.com/google-research/bert#pre-trained-models>`_\ ) in a PyTorch save file by using the
`convert_bert_original_tf_checkpoint_to_pytorch.py
<https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_
script.
This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated
configuration file (\ ``bert_config.json``\ ), and creates a PyTorch model for this configuration, loads the weights
from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that
can be imported using ``torch.load()`` (see examples in `run_bert_extract_features.py
<https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_extract_features.py>`_\ ,
`run_bert_classifier.py
<https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and
`run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\
).
You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow
checkpoint (the three files starting with ``bert_model.ckpt``\ ) but be sure to keep the configuration file (\
``bert_config.json``\ ) and the vocabulary file (\ ``vocab.txt``\ ) as these are needed for the PyTorch model too.
To run this specific conversion script you will need to have TensorFlow and PyTorch installed (\ ``pip install
tensorflow``\ ). The rest of the repository only requires PyTorch.
Here is an example of the conversion process for a pre-trained ``BERT-Base Uncased`` model:
......@@ -31,14 +47,20 @@ Here is an example of the conversion process for a pre-trained ``BERT-Base Uncas
--config $BERT_BASE_DIR/bert_config.json \
--pytorch_dump_output $BERT_BASE_DIR/pytorch_model.bin
You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/bert#pre-trained-models>`__.
You can download Google's pre-trained models for the conversion `here
<https://github.com/google-research/bert#pre-trained-models>`__.
ALBERT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Convert TensorFlow model checkpoints of ALBERT to PyTorch using the `convert_albert_original_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_ script.
Convert TensorFlow model checkpoints of ALBERT to PyTorch using the
`convert_albert_original_tf_checkpoint_to_pytorch.py
<https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_
script.
The CLI takes as input a TensorFlow checkpoint (three files starting with ``model.ckpt-best``\ ) and the accompanying configuration file (\ ``albert_config.json``\ ), then creates and saves a PyTorch model. To run this conversion you will need to have TensorFlow and PyTorch installed.
The CLI takes as input a TensorFlow checkpoint (three files starting with ``model.ckpt-best``\ ) and the accompanying
configuration file (\ ``albert_config.json``\ ), then creates and saves a PyTorch model. To run this conversion you
will need to have TensorFlow and PyTorch installed.
Here is an example of the conversion process for the pre-trained ``ALBERT Base`` model:
......@@ -51,12 +73,15 @@ Here is an example of the conversion process for the pre-trained ``ALBERT Base``
--config $ALBERT_BASE_DIR/albert_config.json \
--pytorch_dump_output $ALBERT_BASE_DIR/pytorch_model.bin
You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/albert#pre-trained-models>`__.
You can download Google's pre-trained models for the conversion `here
<https://github.com/google-research/albert#pre-trained-models>`__.
OpenAI GPT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained OpenAI GPT model, assuming that your NumPy checkpoint save as the same format than OpenAI pretrained model (see `here <https://github.com/openai/finetune-transformer-lm>`__\ )
Here is an example of the conversion process for a pre-trained OpenAI GPT model, assuming that your NumPy checkpoint
save as the same format than OpenAI pretrained model (see `here <https://github.com/openai/finetune-transformer-lm>`__\
)
.. code-block:: shell
......@@ -72,7 +97,8 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT model,
OpenAI GPT-2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model (see `here <https://github.com/openai/gpt-2>`__\ )
Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model (see `here
<https://github.com/openai/gpt-2>`__\ )
.. code-block:: shell
......@@ -87,7 +113,8 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT-2 mode
Transformer-XL
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained Transformer-XL model (see `here <https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models>`__\ )
Here is an example of the conversion process for a pre-trained Transformer-XL model (see `here
<https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models>`__\ )
.. code-block:: shell
......
......@@ -3,15 +3,15 @@ Fine-tuning with custom datasets
.. note::
The datasets used in this tutorial are available and can be more easily accessed using the
`🤗 NLP library <https://github.com/huggingface/nlp>`_. We do not use this library to access the datasets here
since this tutorial meant to illustrate how to work with your own data. A brief of introduction can be found
at the end of the tutorial in the section ":ref:`nlplib`".
This tutorial will take you through several examples of using 🤗 Transformers models with your own datasets. The
guide shows one of many valid workflows for using these models and is meant to be illustrative rather than
definitive. We show examples of reading in several data formats, preprocessing the data for several types of tasks,
and then preparing the data into PyTorch/TensorFlow ``Dataset`` objects which can easily be used either with
The datasets used in this tutorial are available and can be more easily accessed using the `🤗 NLP library
<https://github.com/huggingface/nlp>`_. We do not use this library to access the datasets here since this tutorial
meant to illustrate how to work with your own data. A brief of introduction can be found at the end of the tutorial
in the section ":ref:`nlplib`".
This tutorial will take you through several examples of using 🤗 Transformers models with your own datasets. The guide
shows one of many valid workflows for using these models and is meant to be illustrative rather than definitive. We
show examples of reading in several data formats, preprocessing the data for several types of tasks, and then preparing
the data into PyTorch/TensorFlow ``Dataset`` objects which can easily be used either with
:class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` or with native PyTorch/TensorFlow.
We include several examples, each of which demonstrates a different type of common downstream task:
......@@ -28,13 +28,13 @@ Sequence Classification with IMDb Reviews
.. note::
This dataset can be explored in the Hugging Face model hub (`IMDb <https://huggingface.co/datasets/imdb>`_), and can
be alternatively downloaded with the 🤗 NLP library with ``load_dataset("imdb")``.
This dataset can be explored in the Hugging Face model hub (`IMDb <https://huggingface.co/datasets/imdb>`_), and
can be alternatively downloaded with the 🤗 NLP library with ``load_dataset("imdb")``.
In this example, we'll show how to download, tokenize, and train a model on the IMDb reviews dataset. This task
takes the text of a review and requires the model to predict whether the sentiment of the review is positive or
negative. Let's start by downloading the dataset from the
`Large Movie Review Dataset <http://ai.stanford.edu/~amaas/data/sentiment/>`_ webpage.
In this example, we'll show how to download, tokenize, and train a model on the IMDb reviews dataset. This task takes
the text of a review and requires the model to predict whether the sentiment of the review is positive or negative.
Let's start by downloading the dataset from the `Large Movie Review Dataset
<http://ai.stanford.edu/~amaas/data/sentiment/>`_ webpage.
.. code-block:: bash
......@@ -62,9 +62,8 @@ read this in.
train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')
We now have a train and test dataset, but let's also also create a validation set which we can use for for
evaluation and tuning without training our test set results. Sklearn has a convenient utility for creating such
splits:
We now have a train and test dataset, but let's also also create a validation set which we can use for for evaluation
and tuning without training our test set results. Sklearn has a convenient utility for creating such splits:
.. code-block:: python
......@@ -80,8 +79,8 @@ pre-trained DistilBert, so let's use the DistilBert tokenizer.
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
Now we can simply pass our texts to the tokenizer. We'll pass ``truncation=True`` and ``padding=True``, which will
ensure that all of our sequences are padded to the same length and are truncated to be no longer model's maximum
input length. This will allow us to feed batches of sequences into the model at the same time.
ensure that all of our sequences are padded to the same length and are truncated to be no longer model's maximum input
length. This will allow us to feed batches of sequences into the model at the same time.
.. code-block:: python
......@@ -90,9 +89,9 @@ input length. This will allow us to feed batches of sequences into the model at
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
Now, let's turn our labels and encodings into a Dataset object. In PyTorch, this is done by subclassing a
``torch.utils.data.Dataset`` object and implementing ``__len__`` and ``__getitem__``. In TensorFlow, we pass our input encodings and
labels to the ``from_tensor_slices`` constructor method. We put the data in this format so that the data can be
easily batched such that each key in the batch encoding corresponds to a named parameter of the
``torch.utils.data.Dataset`` object and implementing ``__len__`` and ``__getitem__``. In TensorFlow, we pass our input
encodings and labels to the ``from_tensor_slices`` constructor method. We put the data in this format so that the data
can be easily batched such that each key in the batch encoding corresponds to a named parameter of the
:meth:`~transformers.DistilBertForSequenceClassification.forward` method of the model we will train.
.. code-block:: python
......@@ -133,17 +132,17 @@ easily batched such that each key in the batch encoding corresponds to a named p
))
Now that our datasets our ready, we can fine-tune a model either with the 🤗
:class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` or with native PyTorch/TensorFlow. See
:doc:`training <training>`.
:class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` or with native PyTorch/TensorFlow. See :doc:`training
<training>`.
.. _ft_trainer:
Fine-tuning with Trainer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The steps above prepared the datasets in the way that the trainer is expected. Now all we need to do is create a
model to fine-tune, define the :class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments`
and instantiate a :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`.
The steps above prepared the datasets in the way that the trainer is expected. Now all we need to do is create a model
to fine-tune, define the :class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments` and
instantiate a :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`.
.. code-block:: python
......@@ -248,15 +247,15 @@ Token Classification with W-NUT Emerging Entities
.. note::
This dataset can be explored in the Hugging Face model hub (`WNUT-17 <https://huggingface.co/datasets/wnut_17>`_), and can
be alternatively downloaded with the 🤗 NLP library with ``load_dataset("wnut_17")``.
This dataset can be explored in the Hugging Face model hub (`WNUT-17 <https://huggingface.co/datasets/wnut_17>`_),
and can be alternatively downloaded with the 🤗 NLP library with ``load_dataset("wnut_17")``.
Next we will look at token classification. Rather than classifying an entire sequence, this task classifies token by
token. We'll demonstrate how to do this with
`Named Entity Recognition <http://nlpprogress.com/english/named_entity_recognition.html>`_, which involves
identifying tokens which correspond to a predefined set of "entities". Specifically, we'll use the
`W-NUT Emerging and Rare entities <http://noisy-text.github.io/2017/emerging-rare-entities.html>`_ corpus. The data
is given as a collection of pre-tokenized documents where each token is assigned a tag.
token. We'll demonstrate how to do this with `Named Entity Recognition
<http://nlpprogress.com/english/named_entity_recognition.html>`_, which involves identifying tokens which correspond to
a predefined set of "entities". Specifically, we'll use the `W-NUT Emerging and Rare entities
<http://noisy-text.github.io/2017/emerging-rare-entities.html>`_ corpus. The data is given as a collection of
pre-tokenized documents where each token is assigned a tag.
Let's start by downloading the data.
......@@ -264,10 +263,10 @@ Let's start by downloading the data.
wget http://noisy-text.github.io/2017/files/wnut17train.conll
In this case, we'll just download the train set, which is a single text file. Each line of the file contains either
(1) a word and tag separated by a tab, or (2) a blank line indicating the end of a document. Let's write a
function to read this in. We'll take in the file path and return ``token_docs`` which is a list of lists of token
strings, and ``token_tags`` which is a list of lists of tag strings.
In this case, we'll just download the train set, which is a single text file. Each line of the file contains either (1)
a word and tag separated by a tab, or (2) a blank line indicating the end of a document. Let's write a function to read
this in. We'll take in the file path and return ``token_docs`` which is a list of lists of token strings, and
``token_tags`` which is a list of lists of tag strings.
.. code-block:: python
......@@ -303,8 +302,8 @@ Just to see what this data looks like, let's take a look at a segment of the fir
['for', 'two', 'weeks', '.', 'Empire', 'State', 'Building']
['O', 'O', 'O', 'O', 'B-location', 'I-location', 'I-location']
``location`` is an entity type, ``B-`` indicates the beginning of an entity, and ``I-`` indicates consecutive positions of
the same entity ("Empire State Building" is considered one entity). ``O`` indicates the token does not correspond to
``location`` is an entity type, ``B-`` indicates the beginning of an entity, and ``I-`` indicates consecutive positions
of the same entity ("Empire State Building" is considered one entity). ``O`` indicates the token does not correspond to
any entity.
Now that we've read the data in, let's create a train/validation split:
......@@ -314,8 +313,8 @@ Now that we've read the data in, let's create a train/validation split:
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_tags, val_tags = train_test_split(texts, tags, test_size=.2)
Next, let's create encodings for our tokens and tags. For the tags, we can start by just create a simple mapping
which we'll use in a moment:
Next, let's create encodings for our tokens and tags. For the tags, we can start by just create a simple mapping which
we'll use in a moment:
.. code-block:: python
......@@ -323,11 +322,11 @@ which we'll use in a moment:
tag2id = {tag: id for id, tag in enumerate(unique_tags)}
id2tag = {id: tag for tag, id in tag2id.items()}
To encode the tokens, we'll use a pre-trained DistilBert tokenizer. We can tell the tokenizer that we're dealing
with ready-split tokens rather than full sentence strings by passing ``is_split_into_words=True``. We'll also pass
``padding=True`` and ``truncation=True`` to pad the sequences to be the same length. Lastly, we can tell the model
to return information about the tokens which are split by the wordpiece tokenization process, which we will need in
a moment.
To encode the tokens, we'll use a pre-trained DistilBert tokenizer. We can tell the tokenizer that we're dealing with
ready-split tokens rather than full sentence strings by passing ``is_split_into_words=True``. We'll also pass
``padding=True`` and ``truncation=True`` to pad the sequences to be the same length. Lastly, we can tell the model to
return information about the tokens which are split by the wordpiece tokenization process, which we will need in a
moment.
.. code-block:: python
......@@ -339,24 +338,24 @@ a moment.
Great, so now our tokens are nicely encoded in the format that they need to be in to feed them into our DistilBert
model below.
Now we arrive at a common obstacle with using pre-trained models for token-level classification: many of the tokens
in the W-NUT corpus are not in DistilBert's vocabulary. Bert and many models like it use a method called WordPiece
Tokenization, meaning that single words are split into multiple tokens such that each token is likely to be in
the vocabulary. For example, DistilBert's tokenizer would split the Twitter handle ``@huggingface`` into the tokens
``['@', 'hugging', '##face']``. This is a problem for us because we have exactly one tag per token. If the tokenizer
splits a token into multiple sub-tokens, then we will end up with a mismatch between our tokens and our labels.
Now we arrive at a common obstacle with using pre-trained models for token-level classification: many of the tokens in
the W-NUT corpus are not in DistilBert's vocabulary. Bert and many models like it use a method called WordPiece
Tokenization, meaning that single words are split into multiple tokens such that each token is likely to be in the
vocabulary. For example, DistilBert's tokenizer would split the Twitter handle ``@huggingface`` into the tokens ``['@',
'hugging', '##face']``. This is a problem for us because we have exactly one tag per token. If the tokenizer splits a
token into multiple sub-tokens, then we will end up with a mismatch between our tokens and our labels.
One way to handle this is to only train on the tag labels for the first subtoken of a split token. We can do this in
🤗 Transformers by setting the labels we wish to ignore to ``-100``. In the example above, if the label for
One way to handle this is to only train on the tag labels for the first subtoken of a split token. We can do this in 🤗
Transformers by setting the labels we wish to ignore to ``-100``. In the example above, if the label for
``@HuggingFace`` is ``3`` (indexing ``B-corporation``), we would set the labels of ``['@', 'hugging', '##face']`` to
``[3, -100, -100]``.
Let's write a function to do this. This is where we will use the ``offset_mapping`` from the tokenizer as mentioned
above. For each sub-token returned by the tokenizer, the offset mapping gives us a tuple indicating the sub-token's
start position and end position relative to the original token it was split from. That means that if the first
position in the tuple is anything other than ``0``, we will set its corresponding label to ``-100``. While we're at
it, we can also set labels to ``-100`` if the second position of the offset mapping is ``0``, since this means it must
be a special token like ``[PAD]`` or ``[CLS]``.
start position and end position relative to the original token it was split from. That means that if the first position
in the tuple is anything other than ``0``, we will set its corresponding label to ``-100``. While we're at it, we can
also set labels to ``-100`` if the second position of the offset mapping is ``0``, since this means it must be a
special token like ``[PAD]`` or ``[CLS]``.
.. note::
......@@ -447,8 +446,9 @@ Question Answering with SQuAD 2.0
.. note::
This dataset can be explored in the Hugging Face model hub (`SQuAD V2 <https://huggingface.co/datasets/squad_v2>`_), and can
be alternatively downloaded with the 🤗 NLP library with ``load_dataset("squad_v2")``.
This dataset can be explored in the Hugging Face model hub (`SQuAD V2
<https://huggingface.co/datasets/squad_v2>`_), and can be alternatively downloaded with the 🤗 NLP library with
``load_dataset("squad_v2")``.
Question answering comes in many forms. In this example, we'll look at the particular type of extractive QA that
involves answering a question about a passage by highlighting the segment of the passage that answers the question.
......@@ -464,8 +464,8 @@ We will start by downloading the data:
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json -O squad/dev-v2.0.json
Each split is in a structured json file with a number of questions and answers for each passage (or context). We'll
take this apart into parallel lists of contexts, questions, and answers (note that the contexts here are repeated
since there are multiple questions per context):
take this apart into parallel lists of contexts, questions, and answers (note that the contexts here are repeated since
there are multiple questions per context):
.. code-block:: python
......@@ -495,13 +495,13 @@ since there are multiple questions per context):
train_contexts, train_questions, train_answers = read_squad('squad/train-v2.0.json')
val_contexts, val_questions, val_answers = read_squad('squad/dev-v2.0.json')
The contexts and questions are just strings. The answers are dicts containing the subsequence of the passage with
the correct answer as well as an integer indicating the character at which the answer begins. In order to train a
model on this data we need (1) the tokenized context/question pairs, and (2) integers indicating at which *token*
positions the answer begins and ends.
The contexts and questions are just strings. The answers are dicts containing the subsequence of the passage with the
correct answer as well as an integer indicating the character at which the answer begins. In order to train a model on
this data we need (1) the tokenized context/question pairs, and (2) integers indicating at which *token* positions the
answer begins and ends.
First, let's get the *character* position at which the answer ends in the passage (we are given the starting
position). Sometimes SQuAD answers are off by one or two characters, so we will also adjust for that.
First, let's get the *character* position at which the answer ends in the passage (we are given the starting position).
Sometimes SQuAD answers are off by one or two characters, so we will also adjust for that.
.. code-block:: python
......@@ -524,9 +524,9 @@ position). Sometimes SQuAD answers are off by one or two characters, so we will
add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)
Now ``train_answers`` and ``val_answers`` include the character end positions and the corrected start positions.
Next, let's tokenize our context/question pairs. 🤗 Tokenizers can accept parallel lists of sequences and encode
them together as sequence pairs.
Now ``train_answers`` and ``val_answers`` include the character end positions and the corrected start positions. Next,
let's tokenize our context/question pairs. 🤗 Tokenizers can accept parallel lists of sequences and encode them together
as sequence pairs.
.. code-block:: python
......@@ -536,8 +536,8 @@ them together as sequence pairs.
train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True)
Next we need to convert our character start/end positions to token start/end positions. When using 🤗 Fast
Tokenizers, we can use the built in :func:`~transformers.BatchEncoding.char_to_token` method.
Next we need to convert our character start/end positions to token start/end positions. When using 🤗 Fast Tokenizers,
we can use the built in :func:`~transformers.BatchEncoding.char_to_token` method.
.. code-block:: python
......@@ -557,9 +557,9 @@ Tokenizers, we can use the built in :func:`~transformers.BatchEncoding.char_to_t
add_token_positions(train_encodings, train_answers)
add_token_positions(val_encodings, val_answers)
Our data is ready. Let's just put it in a PyTorch/TensorFlow dataset so that we can easily use it for
training. In PyTorch, we define a custom ``Dataset`` class. In TensorFlow, we pass a tuple of
``(inputs_dict, labels_dict)`` to the ``from_tensor_slices`` method.
Our data is ready. Let's just put it in a PyTorch/TensorFlow dataset so that we can easily use it for training. In
PyTorch, we define a custom ``Dataset`` class. In TensorFlow, we pass a tuple of ``(inputs_dict, labels_dict)`` to the
``from_tensor_slices`` method.
.. code-block:: python
......@@ -668,12 +668,11 @@ Additional Resources
Using the 🤗 NLP Datasets & Metrics library
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This tutorial demonstrates how to read in datasets from various raw text formats and prepare them for training with
🤗 Transformers so that you can do the same thing with your own custom datasets. However, we recommend users use the
`🤗 NLP library <https://github.com/huggingface/nlp>`_ for working with the 150+ datasets included in the
`hub <https://huggingface.co/datasets>`_, including the three datasets used in this tutorial. As a very brief overview,
we will show how to use the NLP library to download and prepare the IMDb dataset from the first example,
:ref:`seq_imdb`.
This tutorial demonstrates how to read in datasets from various raw text formats and prepare them for training with 🤗
Transformers so that you can do the same thing with your own custom datasets. However, we recommend users use the `🤗
NLP library <https://github.com/huggingface/nlp>`_ for working with the 150+ datasets included in the `hub
<https://huggingface.co/datasets>`_, including the three datasets used in this tutorial. As a very brief overview, we
will show how to use the NLP library to download and prepare the IMDb dataset from the first example, :ref:`seq_imdb`.
Start by downloading the dataset:
......@@ -689,8 +688,8 @@ Each dataset has multiple columns corresponding to different features. Let's see
>>> print(train.column_names)
['label', 'text']
Great. Now let's tokenize the text. We can do this using the ``map`` method. We'll also rename the ``label`` column
to ``labels`` to match the model's input arguments.
Great. Now let's tokenize the text. We can do this using the ``map`` method. We'll also rename the ``label`` column to
``labels`` to match the model's input arguments.
.. code-block:: python
......@@ -711,5 +710,5 @@ dataset elements.
>>> {key: val.shape for key, val in train[0].items()})
{'labels': TensorShape([]), 'input_ids': TensorShape([512]), 'attention_mask': TensorShape([512])}
We now have a fully-prepared dataset. Check out `the 🤗 NLP docs <https://huggingface.co/nlp/processing.html>`_ for
a more thorough introduction.
\ No newline at end of file
We now have a fully-prepared dataset. Check out `the 🤗 NLP docs <https://huggingface.co/nlp/processing.html>`_ for a
more thorough introduction.
......@@ -57,8 +57,8 @@ The tokenizer takes care of splitting the sequence into tokens available in the
>>> tokenized_sequence = tokenizer.tokenize(sequence)
The tokens are either words or subwords. Here for instance, "VRAM" wasn't in the model vocabulary, so it's been split
in "V", "RA" and "M". To indicate those tokens are not separate words but parts of the same word, a double-hash prefix is
added for "RA" and "M":
in "V", "RA" and "M". To indicate those tokens are not separate words but parts of the same word, a double-hash prefix
is added for "RA" and "M":
.. code-block::
......@@ -66,8 +66,8 @@ added for "RA" and "M":
['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding
the sentence to the tokenizer, which leverages the Rust implementation of
`huggingface/tokenizers <https://github.com/huggingface/tokenizers>`__ for peak performance.
the sentence to the tokenizer, which leverages the Rust implementation of `huggingface/tokenizers
<https://github.com/huggingface/tokenizers>`__ for peak performance.
.. code-block::
......@@ -105,8 +105,8 @@ because this is the way a :class:`~transformers.BertModel` is going to expect it
Attention mask
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The attention mask is an optional argument used when batching sequences together. This argument indicates to the
model which tokens should be attended to, and which should not.
The attention mask is an optional argument used when batching sequences together. This argument indicates to the model
which tokens should be attended to, and which should not.
For example, consider these two sequences:
......@@ -145,10 +145,10 @@ We can see that 0s have been added on the right of the first sentence to make it
>>> padded_sequences["input_ids"]
[[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]
This can then be converted into a tensor in PyTorch or TensorFlow. The attention mask is a binary tensor indicating
the position of the padded indices so that the model does not attend to them. For the
:class:`~transformers.BertTokenizer`, :obj:`1` indicates a value that should be attended to, while :obj:`0` indicates
a padded value. This attention mask is in the dictionary returned by the tokenizer under the key "attention_mask":
This can then be converted into a tensor in PyTorch or TensorFlow. The attention mask is a binary tensor indicating the
position of the padded indices so that the model does not attend to them. For the :class:`~transformers.BertTokenizer`,
:obj:`1` indicates a value that should be attended to, while :obj:`0` indicates a padded value. This attention mask is
in the dictionary returned by the tokenizer under the key "attention_mask":
.. code-block::
......@@ -161,15 +161,16 @@ Token Type IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some models' purpose is to do sequence classification or question answering. These require two different sequences to
be joined in a single "input_ids" entry, which usually is performed with the help of special tokens, such as the classifier (``[CLS]``) and separator (``[SEP]``)
tokens. For example, the BERT model builds its two sequence input as such:
be joined in a single "input_ids" entry, which usually is performed with the help of special tokens, such as the
classifier (``[CLS]``) and separator (``[SEP]``) tokens. For example, the BERT model builds its two sequence input as
such:
.. code-block::
>>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]
We can use our tokenizer to automatically generate such a sentence by passing the two sequences to ``tokenizer`` as two arguments (and
not a list, like before) like this:
We can use our tokenizer to automatically generate such a sentence by passing the two sequences to ``tokenizer`` as two
arguments (and not a list, like before) like this:
.. code-block::
......@@ -189,8 +190,8 @@ which will return:
[CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
This is enough for some models to understand where one sequence ends and where another begins. However, other models,
such as BERT, also deploy token type IDs (also called segment IDs). They are represented as a binary
mask identifying the two types of sequence in the model.
such as BERT, also deploy token type IDs (also called segment IDs). They are represented as a binary mask identifying
the two types of sequence in the model.
The tokenizer returns this mask as the "token_type_ids" entry:
......@@ -209,14 +210,15 @@ Some models, like :class:`~transformers.XLNetModel` use an additional token repr
Position IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Contrary to RNNs that have the position of each token embedded within them,
transformers are unaware of the position of each token. Therefore, the position IDs (``position_ids``) are used by the model to identify each token's position in the list of tokens.
Contrary to RNNs that have the position of each token embedded within them, transformers are unaware of the position of
each token. Therefore, the position IDs (``position_ids``) are used by the model to identify each token's position in
the list of tokens.
They are an optional parameter. If no ``position_ids`` is passed to the model, the IDs are automatically created as absolute
positional embeddings.
They are an optional parameter. If no ``position_ids`` is passed to the model, the IDs are automatically created as
absolute positional embeddings.
Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models
use other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.
Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models use
other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.
.. _labels:
......@@ -224,43 +226,41 @@ Labels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The labels are an optional argument which can be passed in order for the model to compute the loss itself. These labels
should be the expected prediction of the model: it will use the standard loss in order to compute the loss between
its predictions and the expected value (the label).
should be the expected prediction of the model: it will use the standard loss in order to compute the loss between its
predictions and the expected value (the label).
These labels are different according to the model head, for example:
- For sequence classification models (e.g., :class:`~transformers.BertForSequenceClassification`), the model expects
a tensor of dimension :obj:`(batch_size)` with each value of the batch corresponding to the expected label of the
- For sequence classification models (e.g., :class:`~transformers.BertForSequenceClassification`), the model expects a
tensor of dimension :obj:`(batch_size)` with each value of the batch corresponding to the expected label of the
entire sequence.
- For token classification models (e.g., :class:`~transformers.BertForTokenClassification`), the model expects
a tensor of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each
individual token.
- For masked language modeling (e.g., :class:`~transformers.BertForMaskedLM`), the model expects
a tensor of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each
individual token: the labels being the token ID for the masked token, and values to be ignored for the rest (usually
-100).
- For token classification models (e.g., :class:`~transformers.BertForTokenClassification`), the model expects a tensor
of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each individual
token.
- For masked language modeling (e.g., :class:`~transformers.BertForMaskedLM`), the model expects a tensor of dimension
:obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each individual token: the
labels being the token ID for the masked token, and values to be ignored for the rest (usually -100).
- For sequence to sequence tasks,(e.g., :class:`~transformers.BartForConditionalGeneration`,
:class:`~transformers.MBartForConditionalGeneration`), the model expects a tensor of dimension
:obj:`(batch_size, tgt_seq_length)` with each value corresponding to the target sequences associated with each
input sequence. During training, both `BART` and `T5` will make the appropriate `decoder_input_ids` and decoder
attention masks internally. They usually do not need to be supplied. This does not apply to models leveraging the
Encoder-Decoder framework.
See the documentation of each model for more information on each specific model's labels.
:class:`~transformers.MBartForConditionalGeneration`), the model expects a tensor of dimension :obj:`(batch_size,
tgt_seq_length)` with each value corresponding to the target sequences associated with each input sequence. During
training, both `BART` and `T5` will make the appropriate `decoder_input_ids` and decoder attention masks internally.
They usually do not need to be supplied. This does not apply to models leveraging the Encoder-Decoder framework. See
the documentation of each model for more information on each specific model's labels.
The base models (e.g., :class:`~transformers.BertModel`) do not accept labels, as these are the base transformer models,
simply outputting features.
The base models (e.g., :class:`~transformers.BertModel`) do not accept labels, as these are the base transformer
models, simply outputting features.
.. _decoder-input-ids:
Decoder input IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder.
These inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually
built in a way specific to each model.
This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder. These
inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually built in a
way specific to each model.
Most encoder-decoder models (BART, T5) create their :obj:`decoder_input_ids` on their own from the :obj:`labels`.
In such models, passing the :obj:`labels` is the preferred way to handle training.
Most encoder-decoder models (BART, T5) create their :obj:`decoder_input_ids` on their own from the :obj:`labels`. In
such models, passing the :obj:`labels` is the preferred way to handle training.
Please check each model's docs to see how they handle these input IDs for sequence to sequence training.
......@@ -270,17 +270,17 @@ Feed Forward Chunking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In each residual attention block in transformers the self-attention layer is usually followed by 2 feed forward layers.
The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (e.g.,
for ``bert-base-uncased``).
The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (e.g., for
``bert-base-uncased``).
For an input of size ``[batch_size, sequence_length]``, the memory required to store the intermediate feed forward
embeddings ``[batch_size, sequence_length, config.intermediate_size]`` can account for a large fraction of the memory
use. The authors of `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_ noticed that since the
computation is independent of the ``sequence_length`` dimension, it is mathematically equivalent to compute the output
embeddings of both feed forward layers ``[batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n``
individually and concat them afterward to ``[batch_size, sequence_length, config.hidden_size]`` with
``n = sequence_length``, which trades increased computation time against reduced memory use, but yields a
mathematically **equivalent** result.
individually and concat them afterward to ``[batch_size, sequence_length, config.hidden_size]`` with ``n =
sequence_length``, which trades increased computation time against reduced memory use, but yields a mathematically
**equivalent** result.
For models employing the function :func:`~.transformers.apply_chunking_to_forward`, the ``chunk_size`` defines the
number of output embeddings that are computed in parallel and thus defines the trade-off between memory and time
......
......@@ -47,6 +47,7 @@ The documentation is organized in five parts:
- **RESEARCH** focuses on tutorials that have less to do with how to use the library but more about general resarch in
transformers model
- The three last section contain the documentation of each public class and function, grouped in:
- **MAIN CLASSES** for the main classes exposing the important APIs of the library.
- **MODELS** for the classes and functions related to each model implemented in the library.
- **INTERNAL HELPERS** for the classes and functions we use internally.
......
......@@ -25,6 +25,7 @@ SpecialTokensMixin
Enums and namedtuples
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.tokenization_utils_base.ExplicitEnum
.. autoclass:: transformers.tokenization_utils_base.PaddingStrategy
......
Pipelines
-----------------------------------------------------------------------------------------------------------------------
The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most
of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity
The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of
the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity
Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. See the
:doc:`task summary <../task_summary>` for examples of use.
......@@ -26,8 +26,8 @@ There are two categories of pipeline abstractions to be aware about:
The pipeline abstraction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated as any
other pipeline but requires an additional argument which is the `task`.
The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated as any other
pipeline but requires an additional argument which is the `task`.
.. autofunction:: transformers.pipeline
......
......@@ -8,8 +8,8 @@ Processors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All processors follow the same architecture which is that of the
:class:`~transformers.data.processors.utils.DataProcessor`. The processor returns a list
of :class:`~transformers.data.processors.utils.InputExample`. These
:class:`~transformers.data.processors.utils.DataProcessor`. The processor returns a list of
:class:`~transformers.data.processors.utils.InputExample`. These
:class:`~transformers.data.processors.utils.InputExample` can be converted to
:class:`~transformers.data.processors.utils.InputFeatures` in order to be fed to the model.
......@@ -28,14 +28,16 @@ of :class:`~transformers.data.processors.utils.InputExample`. These
GLUE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`General Language Understanding Evaluation (GLUE) <https://gluebenchmark.com/>`__ is a benchmark that evaluates
the performance of models across a diverse set of existing NLU tasks. It was released together with the paper
`GLUE: A multi-task benchmark and analysis platform for natural language understanding <https://openreview.net/pdf?id=rJ4km2R5t7>`__
`General Language Understanding Evaluation (GLUE) <https://gluebenchmark.com/>`__ is a benchmark that evaluates the
performance of models across a diverse set of existing NLU tasks. It was released together with the paper `GLUE: A
multi-task benchmark and analysis platform for natural language understanding
<https://openreview.net/pdf?id=rJ4km2R5t7>`__
This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched),
CoLA, SST2, STSB, QQP, QNLI, RTE and WNLI.
This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB,
QQP, QNLI, RTE and WNLI.
Those processors are:
- :class:`~transformers.data.processors.utils.MrpcProcessor`
- :class:`~transformers.data.processors.utils.MnliProcessor`
- :class:`~transformers.data.processors.utils.MnliMismatchedProcessor`
......@@ -54,36 +56,39 @@ Additionally, the following method can be used to load values from a data file
Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
An example using these processors is given in the `run_glue.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_glue.py>`__ script.
An example using these processors is given in the `run_glue.py
<https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_glue.py>`__ script.
XNLI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`The Cross-Lingual NLI Corpus (XNLI) <https://www.nyu.edu/projects/bowman/xnli/>`__ is a benchmark that evaluates
the quality of cross-lingual text representations.
XNLI is crowd-sourced dataset based on `MultiNLI <http://www.nyu.edu/projects/bowman/multinli/>`: pairs of text are labeled with textual entailment
annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili).
`The Cross-Lingual NLI Corpus (XNLI) <https://www.nyu.edu/projects/bowman/xnli/>`__ is a benchmark that evaluates the
quality of cross-lingual text representations. XNLI is crowd-sourced dataset based on `MultiNLI
<http://www.nyu.edu/projects/bowman/multinli/>`: pairs of text are labeled with textual entailment annotations for 15
different languages (including both high-resource language such as English and low-resource languages such as Swahili).
It was released together with the paper
`XNLI: Evaluating Cross-lingual Sentence Representations <https://arxiv.org/abs/1809.05053>`__
It was released together with the paper `XNLI: Evaluating Cross-lingual Sentence Representations
<https://arxiv.org/abs/1809.05053>`__
This library hosts the processor to load the XNLI data:
- :class:`~transformers.data.processors.utils.XnliProcessor`
Please note that since the gold labels are available on the test set, evaluation is performed on the test set.
An example using these processors is given in the
`run_xnli.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_xnli.py>`__ script.
An example using these processors is given in the `run_xnli.py
<https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_xnli.py>`__ script.
SQuAD
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`The Stanford Question Answering Dataset (SQuAD) <https://rajpurkar.github.io/SQuAD-explorer//>`__ is a benchmark that evaluates
the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper
`SQuAD: 100,000+ Questions for Machine Comprehension of Text <https://arxiv.org/abs/1606.05250>`__. The second version (v2.0) was released alongside
the paper `Know What You Don't Know: Unanswerable Questions for SQuAD <https://arxiv.org/abs/1806.03822>`__.
`The Stanford Question Answering Dataset (SQuAD) <https://rajpurkar.github.io/SQuAD-explorer//>`__ is a benchmark that
evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version
(v1.1) was released together with the paper `SQuAD: 100,000+ Questions for Machine Comprehension of Text
<https://arxiv.org/abs/1606.05250>`__. The second version (v2.0) was released alongside the paper `Know What You Don't
Know: Unanswerable Questions for SQuAD <https://arxiv.org/abs/1806.03822>`__.
This library hosts a processor for each of the two versions:
......@@ -91,6 +96,7 @@ Processors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Those processors are:
- :class:`~transformers.data.processors.utils.SquadV1Processor`
- :class:`~transformers.data.processors.utils.SquadV2Processor`
......@@ -99,17 +105,18 @@ They both inherit from the abstract class :class:`~transformers.data.processors.
.. autoclass:: transformers.data.processors.squad.SquadProcessor
:members:
Additionally, the following method can be used to convert SQuAD examples into :class:`~transformers.data.processors.utils.SquadFeatures`
that can be used as model inputs.
Additionally, the following method can be used to convert SQuAD examples into
:class:`~transformers.data.processors.utils.SquadFeatures` that can be used as model inputs.
.. automethod:: transformers.data.processors.squad.squad_convert_examples_to_features
These processors as well as the aforementionned method can be used with files containing the data as well as with the `tensorflow_datasets` package.
Examples are given below.
These processors as well as the aforementionned method can be used with files containing the data as well as with the
`tensorflow_datasets` package. Examples are given below.
Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example using the processors as well as the conversion method using data files:
.. code-block::
......@@ -149,5 +156,5 @@ Using `tensorflow_datasets` is as easy as using a data file:
)
Another example using these processors is given in the
`run_squad.py <https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py>`__ script.
Another example using these processors is given in the `run_squad.py
<https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py>`__ script.
......@@ -29,11 +29,12 @@ methods for using all the tokenizers:
:class:`~transformers.BatchEncoding` holds the output of the tokenizer's encoding methods (``__call__``,
``encode_plus`` and ``batch_encode_plus``) and is derived from a Python dictionary. When the tokenizer is a pure python
tokenizer, this class behaves just like a standard python dictionary and holds the various model inputs computed by these
methods (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e., backed by HuggingFace
`tokenizers library <https://github.com/huggingface/tokenizers>`__), this class provides in addition several advanced
alignment methods which can be used to map between the original string (character and words) and the token space (e.g.,
getting the index of the token comprising a given character or the span of characters corresponding to a given token).
tokenizer, this class behaves just like a standard python dictionary and holds the various model inputs computed by
these methods (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e., backed by
HuggingFace `tokenizers library <https://github.com/huggingface/tokenizers>`__), this class provides in addition
several advanced alignment methods which can be used to map between the original string (character and words) and the
token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding
to a given token).
PreTrainedTokenizer
......
......@@ -19,14 +19,14 @@ downstream tasks. However, at some point further model increases become harder d
longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction
techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows
that our proposed methods lead to models that scale much better compared to the original BERT. We also use a
self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream
tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE,
RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.*
self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks
with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and
SQuAD benchmarks while having fewer parameters compared to BERT-large.*
Tips:
- ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on
the right rather than the left.
- ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather
than the left.
- ALBERT uses repeating layers which results in a small memory footprint, however the computational cost remains
similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same
number of (repeating) layers.
......
......@@ -2,9 +2,8 @@ AutoClasses
-----------------------------------------------------------------------------------------------------------------------
In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you
are supplying to the :obj:`from_pretrained()` method.
AutoClasses are here to do this job for you so that you automatically retrieve the relevant model given the name/path
to the pretrained weights/config/vocabulary.
are supplying to the :obj:`from_pretrained()` method. AutoClasses are here to do this job for you so that you
automatically retrieve the relevant model given the name/path to the pretrained weights/config/vocabulary.
Instantiating one of :class:`~transformers.AutoConfig`, :class:`~transformers.AutoModel`, and
:class:`~transformers.AutoTokenizer` will directly create a class of the relevant architecture. For instance
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment