Unverified Commit 08f534d2 authored by Sylvain Gugger, committed by GitHub

Doc styling (#8067)

* Important files

* Styling them all

* Revert "Styling them all"

This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.

* Styling them for realsies

* Fix syntax error

* Fix benchmark_utils

* More fixes

* Fix modeling auto and script

* Remove new line

* Fixes

* More fixes

* Fix more files

* Style

* Add FSMT

* More fixes

* More fixes

* More fixes

* More fixes

* Fixes

* More fixes

* More fixes

* Last fixes

* Make sphinx happy
parent 04a17f85
@@ -17,7 +17,7 @@ work properly.
the text you give it into tokens the same way as for the pretraining corpus, and it will use the same
token-to-index correspondence (that we usually call a `vocab`) as during pretraining.

To automatically download the vocab used during pretraining or fine-tuning a given model, you can use the
:func:`~transformers.AutoTokenizer.from_pretrained` method:

.. code-block::
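As a concrete illustration, here is a minimal sketch of that method (assuming the ``bert-base-cased`` checkpoint
mentioned later on this page; the example sentence is illustrative):

.. code-block::

    from transformers import AutoTokenizer

    # Download the vocab and tokenizer configuration associated with the checkpoint
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    # The tokenizer's __call__ runs the whole preprocessing pipeline
    encoded_input = tokenizer("Hello, I'm a single sentence!")
    print(encoded_input)  # dict with 'input_ids', 'token_type_ids', 'attention_mask'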
@@ -39,10 +39,10 @@ is its ``__call__``: you just need to feed your sentence to your tokenizer objec
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

This returns a dictionary mapping strings to lists of ints. The `input_ids <glossary.html#input-ids>`__ are the indices
corresponding to each token in our sentence. We will see below what the `attention_mask
<glossary.html#attention-mask>`__ is used for, and in :ref:`the next section <sentence-pairs>`, the goal of
`token_type_ids <glossary.html#token-type-ids>`__.

The tokenizer can decode a list of token ids into a proper sentence:
@@ -51,10 +51,10 @@ The tokenizer can decode a list of token ids in a proper sentence:
>>> tokenizer.decode(encoded_input["input_ids"])
"[CLS] Hello, I'm a single sentence! [SEP]"

As you can see, the tokenizer automatically added some special tokens that the model expects. Not all models need
special tokens; for instance, if we had used `gpt2-medium` instead of `bert-base-cased` to create our tokenizer, we
would have seen the same sentence as the original one here. You can disable this behavior (which is only advised if you
have added those special tokens yourself) by passing ``add_special_tokens=False``.
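For instance, a hedged sketch of disabling the special tokens (reusing the ``tokenizer`` from the sketch above):

.. code-block::

    encoded_input = tokenizer("Hello, I'm a single sentence!", add_special_tokens=False)
    # No [CLS]/[SEP] are added, so decoding gives back the original sentence
    print(tokenizer.decode(encoded_input["input_ids"]))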
If you have several sentences you want to process, you can do this efficiently by sending them as a list to the
tokenizer:

@@ -114,9 +114,9 @@ You can do all of this by using the following options when feeding your list of
[1, 1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 0]])}

It returns a dictionary with string keys and tensor values. We can now see what the `attention_mask
<glossary.html#attention-mask>`__ is all about: it points out which tokens the model should pay attention to and which
ones it should not (because they represent padding in this case).
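Here is a minimal sketch of such a batched call (the sentences and the ``padding``/``truncation``/``return_tensors``
values are illustrative assumptions):

.. code-block::

    batch = tokenizer(
        ["Hello, I'm a single sentence!",
         "And another sentence",
         "And the very very last one"],
        padding=True,        # pad to the longest sequence in the batch
        truncation=True,     # truncate to the model maximum length if needed
        return_tensors="pt"  # return PyTorch tensors
    )
    print(batch["attention_mask"])  # 0s mark the padded positions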
Note that if your model does not have a maximum length associated with it, the command above will throw a warning. You
@@ -127,9 +127,9 @@ can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer

Preprocessing pairs of sentences
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sometimes you need to feed a pair of sentences to your model. For instance, you might want to classify whether two
sentences in a pair are similar, or you might be using a question-answering model, which takes a context and a
question. For BERT models, the input is then represented like this: :obj:`[CLS] Sequence A [SEP] Sequence B [SEP]`

You can encode a pair of sentences in the format expected by your model by supplying the two sentences as two arguments
(not a list, since a list of two sentences would be interpreted as a batch of two single sentences, as we saw before).
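A minimal sketch of such a call (the two sentences are illustrative):

.. code-block::

    encoded_input = tokenizer("How old are you?", "I'm 6 years old")
    print(encoded_input["token_type_ids"])  # 0s for the first sentence, 1s for the second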
@@ -146,8 +146,8 @@ This will once again return a dict string to list of ints:
This shows us what the `token_type_ids <glossary.html#token-type-ids>`__ are for: they indicate to the model which part
of the inputs corresponds to the first sentence and which part corresponds to the second sentence. Note that
`token_type_ids` are not required or handled by all models. By default, a tokenizer will only return the inputs that
its associated model expects. You can force the return (or the non-return) of any of those special arguments by using
``return_input_ids`` or ``return_token_type_ids``.

If we decode the token ids we obtained, we will see that the special tokens have been properly added.

@@ -215,7 +215,7 @@ three arguments you need to know for this are :obj:`padding`, :obj:`truncation`
  a single sequence).
- :obj:`'max_length'` to pad to a length specified by the :obj:`max_length` argument or the maximum length accepted
  by the model if no :obj:`max_length` is provided (``max_length=None``). If you only provide a single sequence,
  padding will still be applied to it.
- :obj:`False` or :obj:`'do_not_pad'` to not pad the sequences. As we have seen before, this is the default
  behavior.
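As an illustration of the :obj:`'max_length'` strategy above, here is a hedged sketch (``batch_sentences`` and
``max_length=16`` are arbitrary values chosen for the example):

.. code-block::

    batch_sentences = ["Hello, I'm a single sentence!", "And another sentence"]
    batch = tokenizer(batch_sentences, padding="max_length", max_length=16)
    # sequences shorter than 16 tokens are padded up to length 16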
@@ -238,9 +238,9 @@ three arguments you need to know for this are :obj:`padding`, :obj:`truncation`
truncation/padding to :obj:`max_length` is deactivated.

Here is a table summarizing the recommended way to set up padding and truncation. If you use pairs of input sequences
in any of the following examples, you can replace :obj:`truncation=True` by a :obj:`STRATEGY` selected in
:obj:`['only_first', 'only_second', 'longest_first']`, i.e. :obj:`truncation='only_second'` or
:obj:`truncation='longest_first'`, to control how both sequences in the pair are truncated as detailed before.

+------------+---------+-------------+
| Truncation | Padding | Instruction |
......
@@ -3,7 +3,8 @@ Pretrained models
Here is the full list of the currently provided pretrained models together with a short presentation of each model.

For a list that includes community-uploaded models, refer to `https://huggingface.co/models
<https://huggingface.co/models>`__.

+--------------+---------------+----------------------+
| Architecture | Shortcut name | Details of the model |
......
Quick tour
=======================================================================================================================

Let's have a quick look at the 🤗 Transformers library features. The library downloads pretrained models for Natural
Language Understanding (NLU) tasks, such as analyzing the sentiment of a text, and Natural Language Generation (NLG),
such as completing a prompt with new text or translating into another language.

First we will see how to easily leverage the pipeline API to quickly use those pretrained models at inference. Then, we
@@ -29,8 +29,8 @@ provides the following tasks out of the box:

- Translation: translate a text into another language.
- Feature extraction: return a tensor representation of the text.

Let's see how this works for sentiment analysis (the other tasks are all covered in the :doc:`task summary
</task_summary>`):

.. code-block::
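A minimal sketch of that pipeline call (the input sentence is illustrative; the default checkpoint is downloaded on
first use):

.. code-block::

    from transformers import pipeline

    classifier = pipeline("sentiment-analysis")
    print(classifier("We are very happy to show you the 🤗 Transformers library."))
    # e.g. [{'label': 'POSITIVE', 'score': ...}]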
@@ -160,9 +160,10 @@ To apply these steps on a given text, we can just feed it to our tokenizer:
>>> inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")

This returns a dictionary mapping strings to lists of ints. It contains the `ids of the tokens
<glossary.html#input-ids>`__, as mentioned before, but also additional arguments that will be useful to the model.
Here, for instance, we also have an `attention mask <glossary.html#attention-mask>`__ that the model will use to have a
better understanding of the sequence:

.. code-block::
@@ -191,8 +192,8 @@ and get tensors back. You can specify all of that to the tokenizer:
... return_tensors="tf"
... )

The padding is automatically applied on the side expected by the model (in this case, on the right), with the padding
token the model was pretrained with. The attention mask is also adapted to take the padding into account:

.. code-block::
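For a PyTorch model, the analogous call would use ``return_tensors="pt"``; here is a hedged sketch (the sentences are
illustrative and ``tokenizer`` is assumed to have been loaded as earlier on this page):

.. code-block::

    pt_batch = tokenizer(
        ["We are very happy to show you the 🤗 Transformers library.",
         "We hope you don't hate it."],
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    print(pt_batch["attention_mask"])  # padded positions are masked with 0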
@@ -213,8 +214,8 @@ Using the model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Once your input has been preprocessed by the tokenizer, you can send it directly to the model. As we mentioned, it will
contain all the relevant information the model needs. If you're using a TensorFlow model, you can pass the dictionary
directly to the model; for a PyTorch model, you need to unpack the dictionary by adding :obj:`**`.

.. code-block::

@@ -223,8 +224,8 @@ dictionary keys directly to tensors, for a PyTorch model, you need to unpack the
>>> ## TENSORFLOW CODE
>>> tf_outputs = tf_model(tf_batch)
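The PyTorch counterpart is a one-liner; a hedged sketch, assuming a PyTorch model loaded under the hypothetical name
``pt_model`` and the ``pt_batch`` from the sketch above:

.. code-block::

    ## PYTORCH CODE (hypothetical names)
    pt_outputs = pt_model(**pt_batch)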
In 🤗 Transformers, all outputs are tuples (with only one element potentially). Here, we get a tuple with just the final
activations of the model.

.. code-block::

@@ -239,11 +240,10 @@ final activations of the model.
[ 0.08181786, -0.04179301]], dtype=float32)>,)

The model can return more than just the final activations, which is why the output is a tuple. Here we only asked for
the final activations, so we get a tuple with one element.

.. note::

    All 🤗 Transformers models (PyTorch or TensorFlow) return the activations of the model *before* the final activation
    function (like SoftMax) since this final activation function is often fused with the loss.

Let's apply the SoftMax activation to get predictions.
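A hedged sketch of that step, reusing the ``tf_outputs``/``pt_outputs`` names from above:

.. code-block::

    ## PYTORCH CODE
    import torch.nn.functional as F
    pt_predictions = F.softmax(pt_outputs[0], dim=-1)

    ## TENSORFLOW CODE
    import tensorflow as tf
    tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)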
@@ -281,11 +281,11 @@ If you have labels, you can provide them to the model, it will return a tuple wi
>>> import tensorflow as tf
>>> tf_outputs = tf_model(tf_batch, labels = tf.constant([1, 0]))
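The PyTorch equivalent, as a hedged sketch with the same hypothetical ``pt_model``/``pt_batch`` names:

.. code-block::

    ## PYTORCH CODE (hypothetical names)
    import torch
    pt_outputs = pt_model(**pt_batch, labels=torch.tensor([1, 0]))
    # when labels are provided, the loss is returned as the first element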
Models are standard `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ or `tf.keras.Model
<https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ so you can use them in your usual training loop. 🤗
Transformers also provides a :class:`~transformers.Trainer` (or :class:`~transformers.TFTrainer` if you are using
TensorFlow) class to help with your training (taking care of things such as distributed training, mixed precision,
etc.). See the :doc:`training tutorial <training>` for more details.
.. note::

@@ -336,13 +336,13 @@ The :obj:`AutoModel` and :obj:`AutoTokenizer` classes are just shortcuts that wi
pretrained model. Behind the scenes, the library has one model class per combination of architecture plus class, so the
code is easy to access and tweak if you need to.

In our previous example, the model was called "distilbert-base-uncased-finetuned-sst-2-english", which means it's using
the :doc:`DistilBERT </model_doc/distilbert>` architecture. As
:class:`~transformers.AutoModelForSequenceClassification` (or
:class:`~transformers.TFAutoModelForSequenceClassification` if you are using TensorFlow) was used, the model
automatically created is then a :class:`~transformers.DistilBertForSequenceClassification`. You can look at its
documentation for all details relevant to that specific model, or browse the source code. This is how you would
directly instantiate the model and tokenizer without the auto magic:

.. code-block::
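A hedged sketch of that direct instantiation (the PyTorch classes are shown; the TensorFlow ones would be the
``TFDistilBert...`` equivalents):

.. code-block::

    from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

    model_name = "distilbert-base-uncased-finetuned-sst-2-english"
    model = DistilBertForSequenceClassification.from_pretrained(model_name)
    tokenizer = DistilBertTokenizer.from_pretrained(model_name)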
......
@@ -5,16 +5,18 @@ Exporting transformers models
ONNX / ONNXRuntime
=======================================================================================================================

Projects `ONNX (Open Neural Network eXchange) <http://onnx.ai>`_ and `ONNXRuntime (ORT)
<https://microsoft.github.io/onnxruntime/>`_ are part of an effort from leading industries in the AI field to provide a
unified and community-driven format to store and, by extension, efficiently execute neural networks, leveraging a
variety of hardware and dedicated optimizations.

Starting from transformers v2.10.0 we partnered with ONNX Runtime to provide an easy export of transformers models to
the ONNX format. You can have a look at the effort in our joint blog post `Accelerate your NLP pipelines using Hugging
Face Transformers and ONNX Runtime
<https://medium.com/microsoftazure/accelerate-your-nlp-pipelines-using-hugging-face-transformers-and-onnx-runtime-2443578f4333>`_.

Exporting a model is done through the script `convert_graph_to_onnx.py` at the root of the transformers sources. The
following command shows how easy it is to export a BERT model from the library; simply run:

.. code-block:: bash
@@ -27,62 +29,66 @@ The conversion tool works for both PyTorch and Tensorflow models and ensures:
* The generated model can be correctly loaded through onnxruntime.

.. note::

    Currently, inputs and outputs are always exported with dynamic sequence axes, preventing some optimizations in
    ONNX Runtime. If you would like to see such support for fixed-length inputs/outputs, please open up an issue on
    transformers.

Also, the conversion tool supports different options which let you tune the behavior of the generated model:

* **Change the target opset version of the generated model.** (A more recent opset generally supports more operators
  and enables faster inference)
* **Export pipeline-specific prediction heads.** (Allows exporting the model along with its task-specific prediction
  head(s))
* **Use the external data format (PyTorch only).** (Lets you export models whose size is above 2GB (`More info
  <https://github.com/pytorch/pytorch/pull/33062>`_))
Optimizations
-----------------------------------------------------------------------------------------------------------------------

ONNXRuntime includes some transformers-specific transformations to leverage optimized operations in the graph. Below
are some of the operators which can be enabled to speed up inference through ONNXRuntime (*see note below*):

* Constant folding
* Attention Layer fusing
* Skip connection LayerNormalization fusing
* FastGeLU approximation

Some of the optimizations performed by ONNX runtime can be hardware-specific and thus lead to different performance if
used on another machine with a different hardware configuration than the one used for exporting the model. For this
reason, when using ``convert_graph_to_onnx.py`` optimizations are not enabled, ensuring the model can be easily
exported to various hardware. Optimizations can then be enabled when loading the model through ONNX runtime for
inference.
.. note::

    When quantization is enabled (see below), the ``convert_graph_to_onnx.py`` script will enable optimizations on the
    model, because quantization would modify the underlying graph, making it impossible for ONNX runtime to do the
    optimizations afterwards.

.. note::

    For more information about the optimizations enabled by ONNXRuntime, please have a look at the `ONNXRuntime Github
    <https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers>`_.
Quantization
-----------------------------------------------------------------------------------------------------------------------

The ONNX exporter supports generating a quantized version of the model to allow efficient inference.

Quantization works by converting the memory representation of the parameters in the neural network to a compact integer
format. By default, weights of a neural network are stored as single-precision float (`float32`), which can express a
wide range of floating-point numbers with decent precision. These properties are especially interesting at training
time, where you want a fine-grained representation.

On the other hand, after the training phase, it has been shown one can greatly reduce the range and the precision of
`float32` numbers without changing the performance of the neural network.

More technically, `float32` parameters are converted to a type requiring fewer bits to represent each number, thus
reducing the overall size of the model. Here, we are enabling `float32` mapping to `int8` values (a non-floating,
single-byte number representation) according to the following formula:

.. math::

    y_{float32} = scale * x_{int8} - zero\_point
@@ -96,9 +102,9 @@ Leveraging tiny-integers has numerous advantages when it comes to inference:
* Integer operations execute an order of magnitude faster on modern hardware
* Integer operations require less power to do the computations

In order to convert a transformers model to ONNX IR with quantized weights you just need to specify ``--quantize`` when
using ``convert_graph_to_onnx.py``. Also, you can have a look at the ``quantize()`` utility-method in this same script
file.

Example of quantized BERT model export:
@@ -111,26 +117,27 @@ Example of quantized BERT model export:
.. note::

    When exporting a quantized model you will end up with two different ONNX files. The one specified at the end of the
    above command will contain the original ONNX model storing `float32` weights. The second one, with the
    ``-quantized`` suffix, will hold the quantized parameters.
TorchScript
=======================================================================================================================

.. note::

    This is the very beginning of our experiments with TorchScript and we are still exploring its capabilities with
    variable-input-size models. It is a focus of interest to us and we will deepen our analysis in upcoming releases,
    with more code examples, a more flexible implementation, and benchmarks comparing Python-based code with compiled
    TorchScript.

According to Pytorch's documentation: "TorchScript is a way to create serializable and optimizable models from PyTorch
code". Pytorch's two modules `JIT and TRACE <https://pytorch.org/docs/stable/jit.html>`_ allow the developer to export
their model to be re-used in other programs, such as efficiency-oriented C++ programs.

We have provided an interface that allows the export of 🤗 Transformers models to TorchScript so that they can be reused
in a different environment than a Pytorch-based python program. Here we explain how to export and use our models using
TorchScript.
Exporting a model requires two things:

@@ -145,13 +152,14 @@ Implications
TorchScript flag and tied weights
-----------------------------------------------------------------------------------------------------------------------

This flag is necessary because most of the language models in this repository have tied weights between their
``Embedding`` layer and their ``Decoding`` layer. TorchScript does not allow the export of models that have tied
weights, so it is necessary to untie and clone the weights beforehand.

This implies that models instantiated with the ``torchscript`` flag have their ``Embedding`` layer and ``Decoding``
layer separate, which means that they should not be trained down the line. Training would de-synchronize the two
layers, leading to unexpected results.

This is not the case for models that do not have a Language Model head, as those do not have tied weights. These models
can be safely exported without the ``torchscript`` flag.
@@ -160,8 +168,8 @@ Dummy inputs and standard lengths
-----------------------------------------------------------------------------------------------------------------------

The dummy inputs are used to do a model forward pass. While the inputs' values are propagating through the layers,
Pytorch keeps track of the different operations executed on each tensor. These recorded operations are then used to
create the "trace" of the model.

The trace is created relative to the inputs' dimensions. It is therefore constrained by the dimensions of the dummy
input, and will not work for any other sequence length or batch size. When trying with a different size, an error such

@@ -185,8 +193,8 @@ Below is an example, showing how to save, load models as well as how to use the
Saving a model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This snippet shows how to use TorchScript to export a ``BertModel``. Here the ``BertModel`` is instantiated according
to a ``BertConfig`` class and then saved to disk under the filename ``traced_bert.pt``.

.. code-block:: python
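A minimal sketch of such an export follows (the config values and the dummy input shape are illustrative assumptions,
not the exact snippet from the original page):

.. code-block:: python

    import torch
    from transformers import BertConfig, BertModel

    # Instantiate the model with the torchscript flag so weights can be untied/cloned
    config = BertConfig(torchscript=True)
    model = BertModel(config)
    model.eval()

    # Dummy inputs: batch of 1, sequence length 12 (illustrative values)
    input_ids = torch.ones(1, 12, dtype=torch.long)
    attention_mask = torch.ones(1, 12, dtype=torch.long)

    # Trace the forward pass and save the result as traced_bert.pt
    traced_model = torch.jit.trace(model, (input_ids, attention_mask))
    torch.jit.save(traced_model, "traced_bert.pt")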
......
@@ -2,30 +2,30 @@ Summary of the tasks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This page shows the most frequent use-cases when using the library. The models available allow for many different
configurations and a great versatility in use-cases. The simplest ones are presented here, showcasing usage for tasks
such as question answering, sequence classification, named entity recognition and others.

These examples leverage auto-models, which are classes that will instantiate a model according to a given checkpoint,
automatically selecting the correct model architecture. Please check the :class:`~transformers.AutoModel` documentation
for more information. Feel free to modify the code to be more specific and adapt it to your specific use-case.

In order for a model to perform well on a task, it must be loaded from a checkpoint corresponding to that task. These
checkpoints are usually pre-trained on a large corpus of data and fine-tuned on a specific task. This means the
following:

- Not all models were fine-tuned on all tasks. If you want to fine-tune a model on a specific task, you can leverage
  one of the `run_$TASK.py` scripts in the `examples
  <https://github.com/huggingface/transformers/tree/master/examples>`__ directory.
- Fine-tuned models were fine-tuned on a specific dataset. This dataset may or may not overlap with your use-case and
  domain. As mentioned previously, you may leverage the `examples
  <https://github.com/huggingface/transformers/tree/master/examples>`__ scripts to fine-tune your model, or you may
  create your own training script.

In order to run inference on a task, several mechanisms are made available by the library:

- Pipelines: very easy-to-use abstractions, which require as little as two lines of code.
- Direct model use: fewer abstractions, but more flexibility and power via direct access to a tokenizer
  (PyTorch/TensorFlow) and full inference capacity.

Both approaches are showcased here.
@@ -40,15 +40,17 @@ Both approaches are showcased here.
Sequence Classification
-----------------------------------------------------------------------------------------------------------------------

Sequence classification is the task of classifying sequences according to a given number of classes. An example of
sequence classification is the GLUE dataset, which is entirely based on that task. If you would like to fine-tune a
model on a GLUE sequence classification task, you may leverage the `run_glue.py
<https://github.com/huggingface/transformers/tree/master/examples/text-classification/run_glue.py>`__ and
`run_pl_glue.py
<https://github.com/huggingface/transformers/tree/master/examples/text-classification/run_pl_glue.py>`__ or
`run_tf_glue.py
<https://github.com/huggingface/transformers/tree/master/examples/text-classification/run_tf_glue.py>`__ scripts.

Here is an example of using pipelines to do sentiment analysis: identifying if a sequence is positive or negative. It
leverages a fine-tuned model on sst2, which is a GLUE task.

This returns a label ("POSITIVE" or "NEGATIVE") alongside a score, as follows:
@@ -67,18 +69,16 @@ This returns a label ("POSITIVE" or "NEGATIVE") alongside a score, as follows:
label: POSITIVE, with score: 0.9999

Here is an example of doing a sequence classification using a model to determine if two sequences are paraphrases of
each other. The process is the following:

1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and is loaded
   with the weights stored in the checkpoint.
2. Build a sequence from the two sentences, with the correct model-specific separators, token type ids and attention
   masks (:func:`~transformers.PreTrainedTokenizer.encode` and :func:`~transformers.PreTrainedTokenizer.__call__` take
   care of this).
3. Pass this sequence through the model so that it is classified in one of the two available classes: 0 (not a
   paraphrase) and 1 (is a paraphrase).
4. Compute the softmax of the result to get probabilities over the classes.
5. Print the results.
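Putting these steps together, here is a hedged sketch in PyTorch (the ``bert-base-cased-finetuned-mrpc`` checkpoint and
the example sentences are assumptions chosen for illustration):

.. code-block::

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    checkpoint = "bert-base-cased-finetuned-mrpc"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

    classes = ["not paraphrase", "is paraphrase"]
    sequence_0 = "The company HuggingFace is based in New York City"
    sequence_1 = "HuggingFace's headquarters are situated in Manhattan"

    # Encode the pair with the model-specific separators, token type ids and attention mask
    inputs = tokenizer(sequence_0, sequence_1, return_tensors="pt")
    logits = model(**inputs)[0]
    probs = torch.softmax(logits, dim=1)[0]

    for cls, prob in zip(classes, probs.tolist()):
        print(f"{cls}: {round(prob * 100)}%")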
@@ -155,14 +155,15 @@ Extractive Question Answering
-----------------------------------------------------------------------------------------------------------------------

Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune a
model on a SQuAD task, you may leverage the `run_squad.py
<https://github.com/huggingface/transformers/tree/master/examples/question-answering/run_squad.py>`__ and
`run_tf_squad.py
<https://github.com/huggingface/transformers/tree/master/examples/question-answering/run_tf_squad.py>`__ scripts.

Here is an example of using pipelines to do question answering: extracting an answer from a text given a question. It
leverages a fine-tuned model on SQuAD.

.. code-block::

@@ -176,8 +177,8 @@ It leverages a fine-tuned model on SQuAD.
... a model on a SQuAD task, you may leverage the examples/question-answering/run_squad.py script.
... """

This returns an answer extracted from the text, a confidence score, alongside "start" and "end" values, which are the
positions of the extracted answer in the text.
.. code-block::

@@ -192,16 +193,13 @@ are the positions of the extracted answer in the text.
Here is an example of question answering using a model and a tokenizer. The process is the following:

1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and is loaded
   with the weights stored in the checkpoint.
2. Define a text and a few questions.
3. Iterate over the questions and build a sequence from the text and the current question, with the correct
   model-specific separators, token type ids and attention masks.
4. Pass this sequence through the model. This outputs a range of scores across the entire sequence of tokens (question
   and text), for both the start and end positions.
5. Compute the softmax of the result to get probabilities over the tokens.
6. Fetch the tokens from the identified start and stop values, convert those tokens to a string.
7. Print the results.
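A compact hedged sketch of these steps for a single question (PyTorch; the SQuAD-fine-tuned checkpoint and the
text/question are illustrative assumptions):

.. code-block::

    import torch
    from transformers import AutoTokenizer, AutoModelForQuestionAnswering

    checkpoint = "bert-large-uncased-whole-word-masking-finetuned-squad"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

    text = "🤗 Transformers provides general-purpose architectures for Natural Language Understanding and Generation."
    question = "What does 🤗 Transformers provide?"

    inputs = tokenizer(question, text, return_tensors="pt")
    outputs = model(**inputs)
    start_scores, end_scores = outputs[0], outputs[1]

    # Most likely beginning and end of the answer span
    answer_start = torch.argmax(start_scores)
    answer_end = torch.argmax(end_scores) + 1

    answer_ids = inputs["input_ids"][0][answer_start:answer_end]
    print(tokenizer.decode(answer_ids.tolist()))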
@@ -299,22 +297,22 @@ Language Modeling
Language Modeling
-----------------------------------------------------------------------------------------------------------------------

Language modeling is the task of fitting a model to a corpus, which can be domain specific. All popular
transformer-based models are trained using a variant of language modeling, e.g. BERT with masked language modeling,
GPT-2 with causal language modeling.

Language modeling can be useful outside of pre-training as well, for example to shift the model distribution to be
domain-specific: using a language model trained over a very large corpus, and then fine-tuning it on a news dataset or
on scientific papers e.g. `LysandreJik/arxiv-nlp <https://huggingface.co/lysandre/arxiv-nlp>`__.

Masked Language Modeling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Masked language modeling is the task of masking tokens in a sequence with a masking token, and prompting the model to
fill that mask with an appropriate token. This allows the model to attend to both the right context (tokens on the
right of the mask) and the left context (tokens on the left of the mask). Such a training creates a strong basis for
downstream tasks requiring bi-directional context, such as SQuAD (question answering, see `Lewis, Lui, Goyal et al.
<https://arxiv.org/abs/1910.13461>`__, part 4.2).
Here is an example of using pipelines to replace a mask from a sequence:

@@ -324,8 +322,7 @@ Here is an example of using pipelines to replace a mask from a sequence:
>>> nlp = pipeline("fill-mask")

This outputs the sequences with the mask filled, the confidence score, and the token id in the tokenizer vocabulary:

.. code-block::

@@ -359,14 +356,12 @@ vocabulary:
Here is an example of doing masked language modeling using a model and a tokenizer. The process is the following:

1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a DistilBERT model and is
   loaded with the weights stored in the checkpoint.
2. Define a sequence with a masked token, placing the :obj:`tokenizer.mask_token` instead of a word.
3. Encode that sequence into a list of IDs and find the position of the masked token in that list.
4. Retrieve the predictions at the index of the mask token: this tensor has the same size as the vocabulary, and the
   values are the scores attributed to each token. The model gives a higher score to tokens it deems probable in that
   context.
5. Retrieve the top 5 tokens using the PyTorch :obj:`topk` or TensorFlow :obj:`top_k` methods.
6. Replace the mask token by the tokens and print the results.
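Putting those steps together, here is a minimal sketch in PyTorch. The ``distilbert-base-cased`` checkpoint and the
example sentence are illustrative choices, not necessarily those of the full example elided from this diff:

.. code-block:: python

    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
    model = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")

    # Step 2: a sequence with the tokenizer's mask token in place of a word.
    sequence = f"Using distilled models instead of their larger versions helps {tokenizer.mask_token} our carbon footprint."

    # Step 3: encode and locate the masked position.
    input_ids = tokenizer(sequence, return_tensors="pt")["input_ids"]
    mask_index = torch.where(input_ids == tokenizer.mask_token_id)[1]

    # Step 4: scores over the whole vocabulary at the masked position.
    logits = model(input_ids)[0]
    mask_logits = logits[0, mask_index, :]

    # Steps 5 and 6: top 5 candidate tokens, printed back into the sentence.
    top_5 = torch.topk(mask_logits, 5, dim=1).indices[0].tolist()
    for token_id in top_5:
        print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token_id])))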
Causal Language Modeling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Causal language modeling is the task of predicting the token following a sequence of tokens. In this situation, the
model only attends to the left context (tokens on the left of the mask). Such training is particularly interesting for
generation tasks.

Usually, the next token is predicted by sampling from the logits of the last hidden state the model produces from the
input sequence.
Here is an example of using the tokenizer and model and leveraging the
:func:`~transformers.PreTrainedModel.top_k_top_p_filtering` method to sample the next token following an input sequence
of tokens.

.. code-block::
This outputs a (hopefully) coherent next token following the original sequence, which in our case is the word *has*:

.. code-block::

    >>> print(resulting_string)
    Hugging Face is based in DUMBO, New York City, and has
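For reference, here is a rough, self-contained sketch of that sampling step in PyTorch. The ``gpt2`` checkpoint and the
input sentence are illustrative choices; the example elided from this diff remains the authoritative one:

.. code-block:: python

    import torch
    from torch.nn import functional as F
    from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    input_ids = tokenizer("Hugging Face is based in DUMBO, New York City, and", return_tensors="pt")["input_ids"]

    # Filter the logits of the last position with top-k/top-p, then sample one token.
    next_token_logits = model(input_ids)[0][:, -1, :]
    filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
    probs = F.softmax(filtered_logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)

    print(tokenizer.decode(torch.cat([input_ids, next_token], dim=-1)[0]))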
In the next section, we show how this functionality is leveraged in :func:`~transformers.PreTrainedModel.generate` to
generate multiple tokens up to a user-defined length.

Text Generation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In text generation (*a.k.a.* *open-ended text generation*) the goal is to create a coherent portion of text that is a
continuation of the given context. The following example shows how *GPT-2* can be used in pipelines to generate text.
By default, all models apply *Top-K* sampling when used in pipelines, as configured in their respective configurations
(see the `gpt-2 config <https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json>`__ for example).

.. code-block::
Here, the model generates random text with a total maximal length of *50* tokens from the context *"As far as I am
concerned, I will"*. The default arguments of ``PreTrainedModel.generate()`` can be directly overridden in the
pipeline, as is shown above for the argument ``max_length``.
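As a quick illustration (not the exact snippet referenced above), overriding a ``generate()`` default such as
``max_length`` directly in the pipeline call can look like this:

.. code-block:: python

    from transformers import pipeline

    # Extra keyword arguments are forwarded to PreTrainedModel.generate().
    text_generator = pipeline("text-generation")
    print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))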
Here is an example of text generation using ``XLNet`` and its tokenizer.

>>> print(generated)
Today the weather is really nice and I am planning on anning on taking a nice...... of a great time!<eop>...............
Text generation is currently possible with *GPT-2*, *OpenAI-GPT*, *CTRL*, *XLNet*, *Transfo-XL* and *Reformer* in
PyTorch and for most models in TensorFlow as well. As can be seen in the example above, *XLNet* and *Transfo-XL* often
need to be padded to work well. GPT-2 is usually a good choice for *open-ended text generation* because it was trained
on millions of webpages with a causal language modeling objective.

For more information on how to apply different decoding strategies for text generation, please also refer to our text
generation blog post `here <https://huggingface.co/blog/how-to-generate>`__.
Named Entity Recognition
-----------------------------------------------------------------------------------------------------------------------

Named Entity Recognition (NER) is the task of classifying tokens according to a class, for example, identifying a token
as a person, an organisation or a location. An example of a named entity recognition dataset is the CoNLL-2003 dataset,
which is entirely based on that task. If you would like to fine-tune a model on an NER task, you may leverage the
`run_ner.py <https://github.com/huggingface/transformers/tree/master/examples/token-classification/run_ner.py>`__
(PyTorch), `run_pl_ner.py
<https://github.com/huggingface/transformers/tree/master/examples/token-classification/run_pl_ner.py>`__ (leveraging
pytorch-lightning) or the `run_tf_ner.py
<https://github.com/huggingface/transformers/tree/master/examples/token-classification/run_tf_ner.py>`__ (TensorFlow)
scripts.
Here is an example of using pipelines to do named entity recognition, specifically, trying to identify tokens as
belonging to one of 9 classes:

- O, Outside of a named entity
- B-MIS, Beginning of a miscellaneous entity right after another miscellaneous entity
- I-MIS, Miscellaneous entity
- B-PER, Beginning of a person's name right after another person's name
- I-PER, Person's name
- B-ORG, Beginning of an organisation right after another organisation
- I-ORG, Organisation
- B-LOC, Beginning of a location right after another location
- I-LOC, Location
It leverages a fine-tuned model on CoNLL-2003, fine-tuned by `@stefan-it <https://github.com/stefan-it>`__ from `dbmdz
<https://github.com/dbmdz>`__.

.. code-block::

    ... "close to the Manhattan Bridge which is visible from the window."
This outputs a list of all words that have been identified as one of the entities from the 9 classes defined above.
Here are the expected results:

.. code-block::

    {'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
    ]
Note how the tokens of the sequence "Hugging Face" have been identified as an organisation, and "New York City",
"DUMBO" and "Manhattan Bridge" have been identified as locations.
Here is an example of doing named entity recognition, using a model and a tokenizer. The process is the following:

1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and is loaded
   with the weights stored in the checkpoint.
2. Define the label list the model was trained with.
3. Define a sequence with known entities, such as "Hugging Face" as an organisation and "New York City" as a location.
4. Split words into tokens so that they can be mapped to predictions. We use a small hack: we first completely encode
   and decode the sequence, so that we're left with a string that contains the special tokens.
5. Encode that sequence into IDs (special tokens are added automatically).
6. Retrieve the predictions by passing the input to the model and getting the first output. This results in a
   distribution over the 9 possible classes for each token. We take the argmax to retrieve the most likely class for
   each token.
7. Zip together each token with its prediction and print it.
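A rough sketch of those steps in PyTorch follows; the checkpoint name, the label list and the example sentence are
illustrative assumptions rather than the exact values used in the example elided from this diff:

.. code-block:: python

    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    # Illustrative checkpoint; any token-classification model fine-tuned on CoNLL-2003 works.
    model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(model_name)

    # Label list assumed to match the checkpoint's training labels.
    label_list = ["O", "B-MISC", "I-MISC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

    sequence = "Hugging Face Inc. is a company based in New York City."

    # Encoding then decoding leaves us with a string that contains the special tokens.
    tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
    input_ids = tokenizer.encode(sequence, return_tensors="pt")

    outputs = model(input_ids)[0]
    predictions = torch.argmax(outputs, dim=2)

    for token, prediction in zip(tokens, predictions[0].tolist()):
        print(token, label_list[prediction])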
.. code-block::

    >>> predictions = tf.argmax(outputs, axis=2)
This outputs a list of each token mapped to its corresponding prediction. Unlike the pipeline, here every token has a
prediction, as we didn't remove the "0"th class, which means that no particular entity was found on that token. The
following array should be the output:

.. code-block::
Summarization
-----------------------------------------------------------------------------------------------------------------------

Summarization is the task of summarizing a document or an article into a shorter text.

An example of a summarization dataset is the CNN / Daily Mail dataset, which consists of long news articles and was
created for the task of summarization. If you would like to fine-tune a model on a summarization task, various
approaches are described in this `document
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md>`__.

Here is an example of using the pipelines to do summarization. It leverages a Bart model that was fine-tuned on the CNN
/ Daily Mail data set.

.. code-block::

    ... If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.
    ... """
Because the summarization pipeline depends on the ``PreTrainedModel.generate()`` method, we can override the default
arguments of ``PreTrainedModel.generate()`` directly in the pipeline for ``max_length`` and ``min_length`` as shown
below. This outputs the following summary:

.. code-block::
Here is an example of doing summarization using a model and a tokenizer. The process is the following:

1. Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder
   model, such as ``Bart`` or ``T5``.
2. Define the article that should be summarized.
3. Add the T5 specific prefix "summarize: ".
4. Use the ``PreTrainedModel.generate()`` method to generate the summary.

In this example we use Google's T5 model. Even though it was pre-trained only on a multi-task mixed dataset (including
CNN / Daily Mail), it yields very good results.
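As a rough sketch of those four steps (the full example is elided from this diff; the ``t5-base`` checkpoint and the
sample text below are illustrative):

.. code-block:: python

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("t5-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

    article = (
        "America has changed dramatically during recent years. Not only has the number of graduates in traditional "
        "engineering disciplines declined, but in most of the premier American universities engineering curricula "
        "now concentrate on and encourage largely the study of engineering science."
    )

    # T5 uses a task prefix; the generation arguments mirror the pipeline overrides discussed above.
    inputs = tokenizer("summarize: " + article, return_tensors="pt", max_length=512, truncation=True)
    summary_ids = model.generate(inputs["input_ids"], max_length=60, min_length=20, num_beams=4, early_stopping=True)
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))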
.. code-block::

Translation
-----------------------------------------------------------------------------------------------------------------------

Translation is the task of translating a text from one language to another.

An example of a translation dataset is the WMT English to German dataset, which has sentences in English as the input
data and the corresponding sentences in German as the target data. If you would like to fine-tune a model on a
translation task, various approaches are described in this `document
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md>`__.

Here is an example of using the pipelines to do translation. It leverages a T5 model that was only pre-trained on a
multi-task mixture dataset (including WMT), yet it yields impressive translation results.

.. code-block::

    >>> print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))
    [{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]
Because the translation pipeline depends on the ``PreTrainedModel.generate()`` method, we can override the default
arguments of ``PreTrainedModel.generate()`` directly in the pipeline as is shown for ``max_length`` above.
Here is an example of doing translation using a model and a tokenizer. The process is the following:

1. Instantiate a tokenizer and a model from the checkpoint name. Translation is usually done using an encoder-decoder
   model, such as ``Bart`` or ``T5``.
2. Define the text that should be translated.
3. Add the T5 specific prefix "translate English to German: ".
4. Use the ``PreTrainedModel.generate()`` method to perform the translation.
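And a matching sketch of those translation steps, again assuming the ``t5-base`` checkpoint and an illustrative input
sentence:

.. code-block:: python

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("t5-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

    text = "Hugging Face is a technology company based in New York and Paris."
    inputs = tokenizer("translate English to German: " + text, return_tensors="pt")

    translated_ids = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)
    print(tokenizer.decode(translated_ids[0], skip_special_tokens=True))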
How transformers are tested
-----------------------------------------------------------------------------------------------------------------------

1. Once a PR is submitted, it gets tested with 9 CircleCI jobs. Every new commit to that PR gets retested. These jobs
   are defined in this `config file <https://github.com/huggingface/transformers/blob/master/.circleci/config.yml>`__,
   so that if needed you can reproduce the same environment on your machine.

   These CI jobs don't run ``@slow`` tests.
2. There are 3 jobs run by `github actions <https://github.com/huggingface/transformers/actions>`__:

   * `torch hub integration
     <https://github.com/huggingface/transformers/blob/master/.github/workflows/github-torch-hub.yml>`__: checks
     whether torch hub integration works.

   * `self-hosted (push) <https://github.com/huggingface/transformers/blob/master/.github/workflows/self-push.yml>`__:
     runs fast tests on GPU only on commits on ``master``. It only runs if a commit on ``master`` has updated the code
     in one of the following folders: ``src``, ``tests``, ``.github`` (to prevent running on added model cards,
     notebooks, etc.)

   * `self-hosted runner
     <https://github.com/huggingface/transformers/blob/master/.github/workflows/self-scheduled.yml>`__: runs normal and
     slow tests on GPU in ``tests`` and ``examples``:

.. code-block:: bash
Running tests
-----------------------------------------------------------------------------------------------------------------------

Choosing which tests to run
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This document goes into many details of how tests can be run. If, after reading everything, you need even more details,
you will find them `here <https://docs.pytest.org/en/latest/usage.html>`__.

Here are some of the most useful ways of running tests.
All tests of a given test file:

.. code-block:: bash

    pytest tests/test_optimization.py --collect-only -q
Run a specific test module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To run an individual test module:

.. code-block:: bash

    pytest tests/test_logging.py
Run specific tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since unittest is used inside most of the tests, to run specific subtests you need to know the name of the unittest
class containing those tests. For example, it could be:

.. code-block:: bash
As mentioned earlier, you can see what tests are contained inside the ``OptimizationTest`` class by running:

.. code-block:: bash

    pytest tests/test_optimization.py::OptimizationTest --collect-only -q
You can run tests by keyword expressions.

To run only tests whose name contains ``adam``:
Run only modified tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can run the tests related to the unstaged files or the current branch (according to Git) by using `pytest-picked
<https://github.com/anapaulagomes/pytest-picked>`__. This is a great way of quickly testing your changes didn't break
anything, since it won't run the tests related to files you didn't touch.

.. code-block:: bash

    pytest --picked
All tests will be run from files and folders which are modified, but not yet committed.

Automatically rerun failed tests on source modification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`pytest-xdist <https://github.com/pytest-dev/pytest-xdist>`__ provides a very useful feature of detecting all failed
tests, and then waiting for you to modify files and continuously re-running those failing tests until they pass while
you fix them, so that you don't need to restart pytest after you make the fix. This is repeated until all tests pass,
after which again a full run is performed.

.. code-block:: bash
To enter the mode: ``pytest -f`` or ``pytest --looponfail``

File changes are detected by looking at ``looponfailroots`` root directories and all of their contents (recursively).
If the default for this value does not work for you, you can change it in your project by setting a configuration
option in ``setup.cfg``:

.. code-block:: ini
or ``pytest.ini``/``tox.ini`` files:

.. code-block:: ini

    [pytest]
    looponfailroots = transformers tests
This would lead to only looking for file changes in the respective directories, specified relatively to the ini-file's
directory.

`pytest-watch <https://github.com/joeyespo/pytest-watch>`__ is an alternative implementation of this functionality.
Skip a test module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you want to run all test modules except a few, you can exclude them by giving an explicit list of tests to run. For
example, to run all except ``test_modeling_*.py`` tests:

.. code-block:: bash
Clearing state
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In CI builds, and when isolation is important (at the expense of speed), the cache should be cleared:

.. code-block:: bash
Running tests in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As mentioned earlier, ``make test`` runs tests in parallel via the ``pytest-xdist`` plugin (``-n X`` argument, e.g.
``-n 2`` to run 2 parallel jobs).

``pytest-xdist``'s ``--dist=`` option allows one to control how the tests are grouped. ``--dist=loadfile`` puts the
tests located in one file onto the same process.
Since the order of executed tests is different and unpredictable, if running the test suite with ``pytest-xdist``
produces failures (meaning we have some undetected coupled tests), use `pytest-replay
<https://github.com/ESSS/pytest-replay>`__ to replay the tests in the same order, which should help to then reduce that
failing sequence to a minimum.
Test order and repetition
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It's good to repeat the tests several times, in sequence, randomly, or in sets, to detect any potential
inter-dependency and state-related bugs (tear down). Straightforward multiple repetition is also good for detecting
some problems that get uncovered by the randomness of DL.

Repeat tests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
And then run every test multiple times (50 by default):

.. code-block:: bash

    pytest --flake-finder --flake-runs=5 tests/test_failing_test.py
.. note::

    This plugin doesn't work with ``-n`` flag from ``pytest-xdist``.

.. note::

    There is another plugin ``pytest-repeat``, but it doesn't work with ``unittest``.
Run tests in a random order
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    pip install pytest-random-order
Important: the presence of ``pytest-random-order`` will automatically randomize tests, no configuration change or
command line option is required.

As explained earlier, this allows detection of coupled tests - where one test's state affects the state of another.
When ``pytest-random-order`` is installed it will print the random seed it used for that session, e.g.:

.. code-block:: bash

    Using --random-order-bucket=module
    Using --random-order-seed=573663
So if the given particular sequence fails, you can reproduce it by adding that exact seed, e.g.:

.. code-block:: bash

    Using --random-order-bucket=module
    Using --random-order-seed=573663
It will only reproduce the exact order if you use the exact same list of tests (or no list at all). Once you start
manually narrowing down the list, you can no longer rely on the seed, but have to list the tests manually in the exact
order they failed and tell pytest not to randomize them instead, using ``--random-order-bucket=none``, e.g.:

.. code-block:: bash
To disable the shuffling for all tests:

.. code-block:: bash

    pytest --random-order-bucket=none
By default ``--random-order-bucket=module`` is implied, which will shuffle the files on the module level. It can also
shuffle on ``class``, ``package``, ``global`` and ``none`` levels. For the complete details please see its
`documentation <https://github.com/jbasko/pytest-random-order>`__.
Another randomization alternative is `pytest-randomly <https://github.com/pytest-dev/pytest-randomly>`__. This module
has a very similar functionality/interface, but it doesn't have the bucket modes available in ``pytest-random-order``.
It has the same problem of imposing itself once installed.
Look and feel variations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

pytest-sugar
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
`pytest-sugar <https://github.com/Frozenball/pytest-sugar>`__ is a plugin that improves the look-n-feel, adds a
progressbar, and shows tests that fail and the assert instantly. It gets activated automatically upon installation.

.. code-block:: bash

    pip install pytest-sugar

To run tests without it, run:
or uninstall it.
Report each sub-test name and its progress
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For a single or a group of tests via ``pytest`` (after ``pip install pytest-pspec``):

.. code-block:: bash
Instantly shows failed tests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

`pytest-instafail <https://github.com/pytest-dev/pytest-instafail>`__ shows failures and errors instantly instead of
waiting until the end of the test session.

.. code-block:: bash
To GPU or not to GPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On a GPU-enabled setup, to test in CPU-only mode add ``CUDA_VISIBLE_DEVICES=""``:

.. code-block:: bash

    CUDA_VISIBLE_DEVICES="" pytest tests/test_logging.py
or if you have multiple GPUs, you can specify which one is to be used by ``pytest``. For example, to use only the
second GPU if you have GPUs ``0`` and ``1``, you can run:

.. code-block:: bash

    CUDA_VISIBLE_DEVICES="1" pytest tests/test_logging.py

This is handy when you want to run different tasks on different GPUs.
Some tests must be run on CPU-only, others on either CPU or GPU or TPU, yet others on multiple GPUs. The following skip
decorators are used to set the requirements of tests CPU/GPU/TPU-wise:

* ``require_torch`` - this test will run only under torch
* ``require_torch_gpu`` - as ``require_torch`` plus requires at least 1 GPU
If a test requires ``tensorflow``, use the ``require_tf`` decorator. For example:

.. code-block:: python

    @require_tf
    def test_tf_thing_with_tensorflow():
These decorators can be stacked. For example, if a test is slow and requires at least one GPU under pytorch, here is
how to set it up:

.. code-block:: python

    @slow
    def test_example_slow_on_gpu():
Some decorators like ``@parametrized`` rewrite test names, therefore ``@require_*`` skip decorators have to be listed
last for them to work correctly. Here is an example of the correct usage:

.. code-block:: python

    @require_torch_multigpu
    def test_integration_foo():
This order problem doesn't exist with ``@pytest.mark.parametrize``: you can put it first or last and it will still
work. But it only works with non-unittests.
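For illustration, here is a hypothetical non-unittest test where the decorator order doesn't matter; ``require_torch``
is assumed to be imported from ``transformers.testing_utils``:

.. code-block:: python

    import pytest

    from transformers.testing_utils import require_torch


    @require_torch
    @pytest.mark.parametrize("value", [1, 2])
    def test_order_does_not_matter(value):
        # The skip decorator and the parametrize marker could be swapped here.
        assert value > 0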
Inside tests:

torch.cuda.device_count()
Distributed training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``pytest`` can't deal with distributed training directly. If this is attempted, the sub-processes don't do the right
thing and end up thinking they are ``pytest`` and start running the test suite in loops. It works, however, if one
spawns a normal process that then spawns off multiple workers and manages the IO pipes.
This is still under development but you can study 2 different tests that perform this successfully:

* `test_seq2seq_examples_multi_gpu.py
  <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/test_seq2seq_examples_multi_gpu.py>`__ - a
  ``pytorch-lightning``-running test (had to use PL's ``ddp`` spawning method which is the default)
* `test_finetune_trainer.py
  <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/test_finetune_trainer.py>`__ - a normal
  (non-PL) test

To jump right into the execution point, search for the ``execute_async_std`` function in those tests.
Output capture
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

During test execution any output sent to ``stdout`` and ``stderr`` is captured. If a test or a setup method fails, its
captured output will usually be shown along with the failure traceback.

To disable output capturing and to get the ``stdout`` and ``stderr`` normally, use ``-s`` or ``--capture=no``:
.. code-block:: bash

Creating a URL for each test failure:

.. code-block:: bash

    pytest --pastebin=failed tests/test_logging.py
This will submit test run information to a remote Paste service and provide a URL for each failure. You may select
tests as usual or add for example ``-x`` if you only want to send one particular failure.

Creating a URL for a whole test session log:
Writing tests
-----------------------------------------------------------------------------------------------------------------------

🤗 transformers tests are based on ``unittest``, but run by ``pytest``, so most of the time features from both systems
can be used.
You can read `here <https://docs.pytest.org/en/stable/unittest.html>`__ which features are supported, but the important
thing to remember is that most ``pytest`` fixtures don't work. Neither does parametrization, but we use the module
``parameterized`` that works in a similar way.
Parametrization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Often, there is a need to run the same test multiple times, but with different arguments. It could be done from within
the test, but then there is no way of running that test for just one set of arguments.
.. code-block:: python

    # test_this1.py
    import unittest
    from parameterized import parameterized

        def test_floor(self, name, input, expected):
            assert_equal(math.floor(input), expected)
Now, by default this test will be run 3 times, each time with the last 3 arguments of ``test_floor`` being assigned the
corresponding arguments in the parameter list.

and you could run just the ``negative`` and ``integer`` sets of params with:
or all but ``negative`` sub-tests, with:

.. code-block:: bash

    pytest -k "not negative" tests/test_mytest.py
Besides using the ``-k`` filter that was just mentioned, you can find out the exact name of each sub-test and run any
or all of them using their exact names.

.. code-block:: bash

    pytest test_this1.py --collect-only -q
and it will list:

.. code-block:: bash

    test_this1.py::TestMathUnitTest::test_floor_0_negative
So now you can run just 2 specific sub-tests:

.. code-block:: bash

    pytest test_this1.py::TestMathUnitTest::test_floor_0_negative test_this1.py::TestMathUnitTest::test_floor_1_integer
The module `parameterized <https://pypi.org/project/parameterized/>`__, which is already in the developer dependencies
of ``transformers``, works for both ``unittest`` and ``pytest`` tests.

If, however, the test is not a ``unittest``, you may use ``pytest.mark.parametrize`` (or you may see it being used in
some existing tests, mostly under ``examples``).

Here is the same example, this time using ``pytest``'s ``parametrize`` marker:
def test_floor(name, input, expected):
    assert_equal(math.floor(input), expected)
Same as with ``parameterized``, with ``pytest.mark.parametrize`` you can have a fine control over which sub-tests are
run, if the ``-k`` filter doesn't do the job. Except, this parametrization function creates a slightly different set of
names for the sub-tests. Here is what they look like:

.. code-block:: bash

    pytest test_this2.py --collect-only -q
and it will list:

.. code-block:: bash

    test_this2.py::test_floor[integer-1-1.0]

as in the previous example.
Temporary files and directories
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Using unique temporary files and directories is essential for parallel test running, so that the tests won't overwrite
each other's data. Also we want to get the temp files and directories removed at the end of each test that created
them. Therefore, using packages like ``tempfile``, which address these needs, is essential.

However, when debugging tests, you need to be able to see what goes into the temp file or directory and you want to
know its exact path and not have it randomized on every test re-run.

A helper class :obj:`transformers.testing_utils.TestCasePlus` is best used for such purposes. It's a sub-class of :obj:`unittest.TestCase`, so we can easily inherit from it in the test modules.

Here is an example of its usage:
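
A minimal sketch of such a test, assuming the :obj:`TestCasePlus` base class and the ``get_auto_remove_tmp_dir`` helper used throughout this section (the class and test names are illustrative):

.. code-block:: python

    from transformers.testing_utils import TestCasePlus


    class ExamplesTests(TestCasePlus):
        def test_whatever_which_needs_tmp_dir(self):
            # a fresh, unique temporary directory for this test
            tmp_dir = self.get_auto_remove_tmp_dir()
            # ... write and read test data under tmp_dir ...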

This code creates a unique temporary directory, and sets :obj:`tmp_dir` to its location.

In this and all the following scenarios the temporary directory will be auto-removed at the end of the test, unless ``after=False`` is passed to the helper function.

* Create a temporary directory of my choice and delete it at the end---useful for debugging when you want to monitor a specific directory:

  .. code-block:: python

      def test_whatever(self):
          tmp_dir = self.get_auto_remove_tmp_dir(tmp_dir="./tmp/run/test")

* Create a temporary directory of my choice and do not delete it at the end---useful for when you want to look at the temp results:

  .. code-block:: python

      def test_whatever(self):
          tmp_dir = self.get_auto_remove_tmp_dir(tmp_dir="./tmp/run/test", after=False)

* Create a temporary directory of my choice and ensure to delete it right away---useful for when you disabled deletion in the previous test run and want to make sure that the temporary directory is empty before the new test is run:

  .. code-block:: python

      def test_whatever(self):
          tmp_dir = self.get_auto_remove_tmp_dir(tmp_dir="./tmp/run/test", before=True)

.. note::

   In order to run the equivalent of ``rm -r`` safely, only subdirs of the project repository checkout are allowed if an explicit :obj:`tmp_dir` is used, so that no ``/tmp`` or similar important part of the filesystem gets nuked by mistake. That is, please always pass paths that start with ``./``.

.. note::

   Each test can register multiple temporary directories and they all will get auto-removed, unless requested otherwise.

Skipping tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is useful when a bug is found and a new test is written, yet the bug is not fixed yet. In order to be able to commit it to the main repository we need to make sure it's skipped during ``make test``.

Methods:

- A **skip** means that you expect your test to pass only if some conditions are met, otherwise pytest should skip running the test altogether. Common examples are skipping windows-only tests on non-windows platforms, or skipping tests that depend on an external resource which is not available at the moment (for example a database).

- An **xfail** means that you expect a test to fail for some reason. A common example is a test for a feature not yet implemented, or a bug not yet fixed. When a test passes despite being expected to fail (marked with ``pytest.mark.xfail``), it's an xpass and will be reported in the test summary.

One of the important differences between the two is that ``skip`` doesn't run the test, and ``xfail`` does. So if the code that's buggy causes some bad state that will affect other tests, do not use ``xfail``.
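
As a quick illustration of the two markers (plain ``pytest`` markers; the decorator arguments and test bodies are illustrative):

.. code-block:: python

    import sys

    import pytest


    @pytest.mark.skipif(sys.platform != "win32", reason="windows-only test")
    def test_windows_only_feature():
        ...


    @pytest.mark.xfail(reason="known bug, not fixed yet")
    def test_feature_with_known_bug():
        ...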

Implementation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For example, to skip a test only when a certain condition is met:

.. code-block:: python

    @unittest.skipIf(torch_device == "cpu", "Can't do half precision")
    def test_feature_x():

or skip the whole module:

.. code-block:: python
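
    # The original snippet isn't visible here; a standard pytest idiom for skipping
    # an entire module (an assumption, not necessarily the original example) is a
    # module-level mark:
    import pytest

    pytestmark = pytest.mark.skip(reason="tests in this module require a GPU")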

More details, examples and ways are described `here <https://docs.pytest.org/en/latest/skipping.html>`__.

Slow tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The library of tests is ever-growing, and some of the tests take minutes to run, so we can't afford waiting an hour for the test suite to complete on CI. Therefore, with some exceptions for essential tests, slow tests should be marked as in the example below:

.. code-block:: python
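
    # a sketch of marking a test as slow; the decorator is assumed to come from
    # transformers.testing_utils, as elsewhere in this document
    from transformers.testing_utils import slow


    @slow
    def test_integration_foo():
        ...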

Once a test is marked as ``@slow``, to run such tests set the ``RUN_SLOW=1`` env var, e.g.:

.. code-block:: bash

    RUN_SLOW=1 pytest tests

Some decorators like ``@parameterized`` rewrite test names, therefore ``@slow`` and the rest of the skip decorators ``@require_*`` have to be listed last for them to work correctly. Here is an example of the correct usage:

.. code-block:: python

    @parameterized.expand(...)  # the decorator that rewrites test names comes first
    @slow
    def test_integration_foo():

As explained at the beginning of this document, slow tests get run on a scheduled basis, rather than in PR CI checks. So it's possible that some problems will be missed during a PR submission and get merged. Such problems will get caught during the next scheduled CI job. But it also means that it's important to run the slow tests on your machine before submitting the PR.

Here is a rough decision-making mechanism for choosing which tests should be marked as slow:

If the test is focused on one of the library's internal components (e.g., modeling files, tokenization files, pipelines), then we should run that test in the non-slow test suite. If it's focused on another aspect of the library, such as the documentation or the examples, then we should run these tests in the slow test suite. And then, to refine this approach, we should have exceptions:

* All tests that need to download a heavy set of weights (e.g., model or tokenizer integration tests, pipeline integration tests) should be set to slow. If you're adding a new model, you should create and upload to the hub a tiny version of it (with random weights) for integration tests. This is discussed in the following paragraphs.

* All tests that need to run a training that is not specifically optimized to be fast should be set to slow.

* We can introduce exceptions if some of these should-be-non-slow tests are excruciatingly slow, and set them to ``@slow``. Auto-modeling tests, which save and load large files to disk, are a good example of tests that are marked as ``@slow``.

* If a test completes under 1 second on CI (including downloads if any) then it should be a normal test regardless.

Collectively, all the non-slow tests need to cover entirely the different internals, while remaining fast. For example, significant coverage can be achieved by testing with specially created tiny models with random weights. Such models have a very minimal number of layers (e.g., 2), vocab size (e.g., 1000), etc. Then the ``@slow`` tests can use large slow models to do qualitative testing. To see the use of these simply look for *tiny* models with:

.. code-block:: bash

    grep tiny tests examples

Here is an example of a `script <https://github.com/huggingface/transformers/blob/master/scripts/fsmt/fsmt-make-tiny-model.py>`__ that created the tiny model `stas/tiny-wmt19-en-de <https://huggingface.co/stas/tiny-wmt19-en-de>`__. You can easily adjust it to your specific model's architecture.
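
The general recipe is small enough to sketch here; the configuration values below are illustrative, and the model class should be swapped for the architecture you are testing:

.. code-block:: python

    from transformers import BertConfig, BertModel

    # a tiny config: very few layers, a tiny hidden size and a small vocab
    config = BertConfig(
        vocab_size=1000,
        hidden_size=32,
        num_hidden_layers=2,
        num_attention_heads=2,
        intermediate_size=64,
    )
    model = BertModel(config)  # random weights, since we don't load a checkpoint
    model.save_pretrained("tiny-random-bert")  # then upload this folder to the hub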

It's easy to measure the run-time incorrectly if, for example, there is an overhead of downloading a huge model, but if you test it locally the downloaded files would be cached and thus the download time not measured. Hence, check the execution speed report in CI logs instead (the output of ``pytest --durations=0 tests``).

That report is also useful to find slow outliers that aren't marked as such, or which need to be rewritten to be fast. If you notice that the test suite starts getting slow on CI, the top listing of this report will show the slowest tests.

Testing the stdout/stderr output
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to test functions that write to ``stdout`` and/or ``stderr``, the test can access those streams using ``pytest``'s `capsys system <https://docs.pytest.org/en/latest/capture.html>`__. Here is how this is accomplished:

.. code-block:: python
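
    # A sketch of a capsys-based test: `capsys` is the standard pytest fixture and
    # `capsys.readouterr()` is its real API; the rest of the names are illustrative.
    import sys

    def test_result_and_stdout(capsys):
        msg = "Hello"
        print(msg)             # goes to stdout
        sys.stderr.write(msg)  # goes to stderr
        out, err = capsys.readouterr()  # consume the captured streams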

        assert msg in out
        assert msg in err

And, of course, most of the time, ``stderr`` will come as a part of an exception, so try/except has to be used in such a case:

.. code-block:: python
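
    # a sketch: when the message is raised rather than printed, capture the
    # exception text instead (function and variable names here are illustrative)
    def test_something_exception(capsys):
        msg = "Not a good value"
        error = ""
        try:
            raise ValueError(msg)
        except Exception as e:
            error = str(e)
            _ = capsys.readouterr()  # consume anything that was printed
        assert msg in error, f"{msg} is not in the exception:\n{error}"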

Another approach to capturing stdout is via ``contextlib.redirect_stdout``:

.. code-block:: python
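
    # a sketch using the standard-library contextlib.redirect_stdout; the helper
    # function and the message are illustrative
    from contextlib import redirect_stdout
    from io import StringIO

    def print_to_stdout(s):
        print(s)

    def test_result_and_stdout():
        msg = "Hello"
        buffer = StringIO()
        with redirect_stdout(buffer):
            print_to_stdout(msg)
        out = buffer.getvalue()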

        # test:
        assert msg in out

An important potential issue with capturing stdout is that it may contain ``\r`` characters that in a normal ``print`` reset everything that has been printed so far. There is no problem with ``pytest``, but with ``pytest -s`` these characters get included in the buffer, so to be able to have the test run with and without ``-s``, you have to do an extra cleanup of the captured output, using ``re.sub(r'~.*\r', '', buf, 0, re.M)``.

But then we have a helper context manager wrapper that automatically takes care of it all, regardless of whether the output has some ``\r``'s in it or not, so it's as simple as:

.. code-block:: python
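
    # a sketch, assuming the CaptureStdout helper from transformers.testing_utils,
    # which exposes the captured text as `cs.out`; the called function is a placeholder
    from transformers.testing_utils import CaptureStdout

    with CaptureStdout() as cs:
        function_that_writes_to_stdout()
    print(cs.out)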

Here is a full test example:

.. code-block:: python
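
    # the setup below is a sketch; the `msg`/`final` values are illustrative, the
    # point being that `msg` ends with a `\r` and therefore gets erased
    from transformers.testing_utils import CaptureStdout

    msg = "Secret message\r"
    final = "Hello World"
    with CaptureStdout() as cs: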
        print(msg + final)
    assert cs.out == final + "\n", f"captured: {cs.out}, expecting {final}"

If you'd like to capture ``stderr`` use the :obj:`CaptureStderr` class instead:

.. code-block:: python

    from transformers.testing_utils import CaptureStderr

    with CaptureStderr() as cs:
        function_that_writes_to_stderr()
    print(cs.err)

If you need to capture both streams at once, use the parent :obj:`CaptureStd` class:

.. code-block:: python
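
    # a sketch, assuming the parent CaptureStd helper exposes both `.out` and `.err`;
    # the called function is a placeholder
    from transformers.testing_utils import CaptureStd

    with CaptureStd() as cs:
        function_that_writes_to_stdout_and_stderr()
    print(cs.err, cs.out)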

If you need to validate the output of a logger, you can use :obj:`CaptureLogger`.

Testing with environment variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you want to test the impact of environment variables for a specific test you can use the helper decorator ``transformers.testing_utils.mockenv``:

.. code-block:: python
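
    # a sketch of the mockenv decorator; the environment variable and the test body
    # are illustrative
    import os

    from transformers.testing_utils import mockenv


    @mockenv(TRANSFORMERS_VERBOSITY="error")
    def test_env_override():
        env_level_str = os.getenv("TRANSFORMERS_VERBOSITY", None)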

Getting reproducible results
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In some situations you may want to remove randomness from your tests. To get identical reproducible results, you will need to fix the seed:

.. code-block:: python
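
    # a sketch: fix the seed for every generator your test actually exercises
    import random

    import numpy as np
    import torch

    seed = 42

    random.seed(seed)                 # python RNG
    np.random.seed(seed)              # numpy RNG
    torch.manual_seed(seed)           # pytorch RNGs (CPU)
    torch.cuda.manual_seed_all(seed)  # pytorch RNGs (all GPUs)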

Tokenizer summary
-----------------------------------------------------------------------------------------------------------------------

On this page, we will have a closer look at tokenization. As we saw in :doc:`the preprocessing tutorial <preprocessing>`, tokenizing a text is splitting it into words or subwords, which then are converted to ids. The second part is pretty straightforward; here we will focus on the first part. More specifically, we will look at the three main kinds of tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>` and :ref:`SentencePiece <sentencepiece>`, and provide examples of models using each of those.

Note that on each model page, you can look at the documentation of the associated tokenizer to know which of those algorithms the pretrained model used. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see it uses :ref:`WordPiece <wordpiece>`.

Introduction to tokenization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Splitting a text into smaller chunks is a task that's harder than it looks, and there are multiple ways of doing it. For instance, let's look at the sentence "Don't you love 🤗 Transformers? We sure do." A first simple way of tokenizing this text is just to split it by spaces, which would give:

.. code-block::

    ["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]

A better way is to also take punctuation into account, which is what rule-based tokenizers do. On the text above, they'd output something like:

.. code-block::

    ["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]

Space/punctuation-tokenization and rule-based tokenization are both examples of word tokenization, which is splitting a sentence into words. While it's the most intuitive way to separate texts into smaller chunks, it can have a problem when you have a huge corpus: it usually yields a very big vocabulary (the set of all unique tokens used). :doc:`Transformer XL <model_doc/transformerxl>` for instance uses space/punctuation-tokenization, and has a vocabulary size of 267,735!

A huge vocabulary size means a huge embedding matrix at the start of the model, which will cause memory problems. TransformerXL deals with it by using a special kind of embeddings called adaptive embeddings, but in general transformers models rarely have a vocabulary size greater than 50,000. So instead of splitting on whole words, models rely on subword tokenization: a rare word like "annoyingly" can be decomposed as "annoying" and "ly". This is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together some subwords.

This allows the model to keep a reasonable vocabulary while still learning useful representations for common words or subwords. This also enables the model to process words it has never seen before, by decomposing them into subwords it knows. For instance, the base :class:`~transformers.BertTokenizer` will tokenize "I have a new GPU!" like this:

.. code-block::

    >>> from transformers import BertTokenizer
    >>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    >>> tokenizer.tokenize("I have a new GPU!")
    ['i', 'have', 'a', 'new', 'gp', '##u', '!']

Since we are considering the uncased model, the sentence was lowercased first. All the words were present in the vocabulary of the tokenizer except for "gpu", so the tokenizer splits it into subwords it knows: "gp" and "##u". The "##" means that the rest of the token should be attached to the previous one, without space (for when we need to decode predictions and reverse the tokenization).

Another example is when we use the base :class:`~transformers.XLNetTokenizer` to tokenize our previous text: the tokens are then prefixed with a special '▁' character wherever a space preceded them, a convention we will come back to when we look at :ref:`SentencePiece <sentencepiece>` below.

.. _byte-pair-encoding:

Byte-Pair Encoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Byte-Pair Encoding was introduced in `this paper <https://arxiv.org/abs/1508.07909>`__. It relies on a pretokenizer splitting the training data into words, which can be a simple space tokenization (:doc:`GPT-2 <model_doc/gpt2>` and :doc:`Roberta <model_doc/roberta>` use this, for instance) or a rule-based tokenizer (:doc:`XLM <model_doc/xlm>` uses Moses for most languages, as does :doc:`FlauBERT <model_doc/flaubert>`, while :doc:`GPT <model_doc/gpt>` uses Spacy and ftfy), and counts the frequency of each word in the training corpus.

BPE then learns its vocabulary by repeatedly merging the most frequent pair of symbols. On a toy corpus containing the words 'hug' (10 occurrences), 'pug' (5), 'pun' (12), 'bun' (4) and 'hugs' (5), after a few of these merges the corpus is represented as

.. code-block::

    ('hug', 10), ('p' 'ug', 5), ('p' 'un', 12), ('b' 'un', 4), ('hug' 's', 5)

If we stop there, the tokenizer can apply the rules it learned to new words (as long as they don't contain characters that were not in the base vocabulary). For instance 'bug' would be tokenized as ``['b', 'ug']`` but 'mug' would be tokenized as ``['<unk>', 'ug']`` since the 'm' is not in the base vocabulary. This doesn't happen to letters in general (since the base corpus uses all of them), but to special characters like emojis.

As we said before, the vocabulary size (which is the base vocabulary size + the number of merges) is a hyperparameter to choose. For instance :doc:`GPT <model_doc/gpt>` has a vocabulary size of 40,478 since they have 478 base characters and chose to stop training after 40,000 merges.

Byte-level BPE
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To deal with the fact that the base vocabulary needs to contain all base characters, which can be quite big if one allows for all unicode characters, the `GPT-2 paper <https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`__ introduces a clever trick, which is to use bytes as the base vocabulary (which gives a size of 256). With some additional rules to deal with punctuation, this manages to tokenize every text without needing an unknown token. For instance, the :doc:`GPT-2 model <model_doc/gpt2>` has a vocabulary size of 50,257, which corresponds to the 256 byte base tokens, a special end-of-text token and the symbols learned with 50,000 merges.

.. _wordpiece:

WordPiece
=======================================================================================================================

WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>` (as well as :doc:`DistilBERT <model_doc/distilbert>` and :doc:`Electra <model_doc/electra>`) and was outlined in `this paper <https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__. It relies on the same base as BPE, which is to initialize the vocabulary to every character present in the corpus and progressively learn a given number of merge rules; the difference is that it doesn't choose the pair that is the most frequent but the one that will maximize the likelihood on the corpus once merged.

What does this mean? Well, in the previous example, it means we would only merge 'u' and 'g' if the probability of having 'ug' divided by the probability of having 'u' then 'g' is greater than for any other pair of symbols. It's subtly different from what BPE does, in the sense that it evaluates what it "loses" by merging two symbols.

Unigram
=======================================================================================================================

Unigram is a subword tokenization algorithm that, contrary to BPE and WordPiece, starts from a large vocabulary and progressively trims it down. It is often used in conjunction with :ref:`SentencePiece <sentencepiece>`.

More specifically, at a given step, unigram computes a loss from the corpus we have and the current vocabulary, then, for each subword, evaluates how much the loss would increase if the subword was removed from the vocabulary. It then sorts the subwords by this quantity (which represents how much worse the loss becomes if the token is removed) and removes all the worst p tokens (for instance p could be 10% or 20%). It then repeats the process until the vocabulary has reached the desired size, always keeping the base characters (to be able to tokenize any word written with them, like BPE or WordPiece).

Contrary to BPE and WordPiece that work out rules in a certain order that you can then apply in the same order when tokenizing new text, Unigram will have several ways of tokenizing a new text. For instance, if it ends up with both 'hug' and 'ug' in its vocabulary (along with the single characters), the word 'hugs' could be tokenized as ``['hug', 's']``, ``['h', 'ug', 's']`` or ``['h', 'u', 'g', 's']``, with the probability of each token estimated on the training corpus. You can then give a probability to each tokenization (which is the product of the probabilities of the tokens forming it) and pick the most likely one (or, if you want to apply some data augmentation, you could sample one of the tokenizations according to their probabilities).

Those probabilities define the loss that trains the tokenizer: if our corpus consists of the words :math:`x_{1}, \dots, x_{N}` and if for the word :math:`x_{i}` we note :math:`S(x_{i})` the set of all possible tokenizations of :math:`x_{i}` (with the current vocabulary), then the loss is defined as

.. math::

    \mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )

.. _sentencepiece:

SentencePiece
=======================================================================================================================

SentencePiece (introduced in `this paper <https://arxiv.org/pdf/1808.06226.pdf>`__) treats the input as a raw stream of characters, includes the space in the set of characters to use, then uses BPE or unigram to construct the appropriate vocabulary.

That's why in the example we saw before using :class:`~transformers.XLNetTokenizer` (which uses SentencePiece), we had the '▁' character, which represents space. Decoding a tokenized text is then super easy: we just have to concatenate all the tokens together and replace '▁' with space.
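
This round-trip is easy to try out; the sketch below uses the real :class:`~transformers.XLNetTokenizer` API, with the checkpoint name being an assumption and the sample sentence taken from earlier on this page:

.. code-block:: python

    from transformers import XLNetTokenizer

    tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
    tokens = tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
    print(tokens)  # pieces prefixed with '▁' wherever a space preceded them
    print(tokenizer.convert_tokens_to_string(tokens))  # back to (roughly) the original text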

All transformers models in the library that use SentencePiece use it with unigram. Examples of models using it are :doc:`ALBERT <model_doc/albert>`, :doc:`XLNet <model_doc/xlnet>` or the :doc:`Marian framework <model_doc/marian>`.

Training and fine-tuning
=======================================================================================================================

Model classes in 🤗 Transformers are designed to be compatible with native PyTorch and TensorFlow 2 and can be used seamlessly with either. In this quickstart, we will show how to fine-tune (or train from scratch) a model using the standard training tools available in either framework. We will also show how to use our included :func:`~transformers.Trainer` class which handles much of the complexity of training for you.

This guide assumes that you are already familiar with loading and using our models for inference; otherwise, see the :doc:`task summary <task_summary>`. We also assume that you are familiar with training deep neural networks in either PyTorch or TF2, and focus specifically on the nuances and tools for training models in 🤗 Transformers.

Sections:

- Fine-tuning in native PyTorch
- Fine-tuning in native TensorFlow 2
- Trainer
- Additional resources

Fine-tuning in native PyTorch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Model classes in 🤗 Transformers that don't begin with ``TF`` are `PyTorch Modules <https://pytorch.org/docs/master/generated/torch.nn.Module.html>`_, meaning that you can use them just as you would any model in PyTorch for both inference and optimization.

Let's consider the common task of fine-tuning a masked language model like BERT on a sequence classification dataset. When we instantiate a model with :func:`~transformers.PreTrainedModel.from_pretrained`, the model configuration and pre-trained weights of the specified model are used to initialize the model. The library also includes a number of task-specific final layers or 'heads' whose weights are instantiated randomly when not present in the specified pre-trained model. For example, instantiating a model with ``BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)`` will create a BERT model instance with encoder weights copied from the ``bert-base-uncased`` model and a randomly initialized sequence classification head on top of the encoder with an output size of 2. Models are initialized in ``eval`` mode by default. We can call ``model.train()`` to put it in train mode.

.. code-block:: python

    from transformers import BertForSequenceClassification

    model = BertForSequenceClassification.from_pretrained('bert-base-uncased', return_dict=True)
    model.train()

This is useful because it allows us to make use of the pre-trained BERT encoder and easily train it on whatever sequence classification dataset we choose. We can use any PyTorch optimizer, but our library also provides the :func:`~transformers.AdamW` optimizer which implements gradient bias correction as well as weight decay.

.. code-block:: python

    from transformers import AdamW

    optimizer = AdamW(model.parameters(), lr=1e-5)

The optimizer allows us to apply different hyperparameters for specific parameter groups. For example, we can apply weight decay to all parameters other than bias and layer normalization terms:

.. code-block:: python

    no_decay = ['bias', 'LayerNorm.weight']
    # the weight decay value (0.01) applied to the first group is illustrative
    optimizer_grouped_parameters = [
        {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
        {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)

Now we can set up a simple dummy training batch using :func:`~transformers.PreTrainedTokenizer.__call__`. This returns a :func:`~transformers.BatchEncoding` instance which prepares everything we might need to pass to the model.

.. code-block:: python
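
    # a sketch of building the dummy batch; the tokenizer class matches the model
    # above, and the two example sentences are illustrative
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    text_batch = ["I love Pixar.", "I don't care for Pixar."]
    encoding = tokenizer(text_batch, return_tensors='pt', padding=True, truncation=True)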
    input_ids = encoding['input_ids']
    attention_mask = encoding['attention_mask']

When we call a classification model with the ``labels`` argument, the first returned element is the Cross Entropy loss between the predictions and the passed labels. Having already set up our optimizer, we can then do a backwards pass and update the weights:

.. code-block:: python
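
    # one label per sentence in the batch above; the label values are illustrative
    import torch

    labels = torch.tensor([1, 0])
    outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
    loss = outputs.loss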
    loss.backward()
    optimizer.step()

Alternatively, you can just get the logits and calculate the loss yourself. The following is equivalent to the previous example:

.. code-block:: python
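
    # a sketch: compute the cross entropy loss manually from the logits
    import torch
    from torch.nn import functional as F

    labels = torch.tensor([1, 0])
    outputs = model(input_ids, attention_mask=attention_mask)
    loss = F.cross_entropy(outputs.logits, labels)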
    loss.backward()
    optimizer.step()

Of course, you can train on GPU by calling ``to('cuda')`` on the model and inputs as usual.

We also provide a few learning rate scheduling tools. With the following, we can set up a scheduler which warms up for ``num_warmup_steps`` and then linearly decays to 0 by the end of training.

.. code-block:: python
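
    # a sketch of creating the warmup + linear decay scheduler; num_warmup_steps and
    # num_train_steps are assumed to be defined elsewhere
    from transformers import get_linear_schedule_with_warmup

    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_train_steps)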

Then all we have to do is call ``scheduler.step()`` after ``optimizer.step()``.

.. code-block:: python

    loss.backward()
    optimizer.step()
    scheduler.step()

We highly recommend using :func:`~transformers.Trainer`, discussed below, which conveniently handles the moving parts of training 🤗 Transformers models with features like mixed precision and easy tensorboard logging.

Freezing the encoder
-----------------------------------------------------------------------------------------------------------------------

In some cases, you might be interested in keeping the weights of the pre-trained encoder frozen and optimizing only the weights of the head layers. To do so, simply set the ``requires_grad`` attribute to ``False`` on the encoder parameters, which can be accessed with the ``base_model`` submodule on any task-specific model in the library:

.. code-block:: python
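
    # a sketch: freeze every parameter of the pre-trained encoder
    for param in model.base_model.parameters():
        param.requires_grad = False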

Fine-tuning in native TensorFlow 2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Models can also be trained natively in TensorFlow 2. Just as with PyTorch, TensorFlow models can be instantiated with :func:`~transformers.PreTrainedModel.from_pretrained` to load the weights of the encoder from a pretrained model.

.. code-block:: python

    from transformers import TFBertForSequenceClassification

    model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

Let's use ``tensorflow_datasets`` to load in the `MRPC dataset <https://www.tensorflow.org/datasets/catalog/glue#gluemrpc>`_ from GLUE. We can then use our built-in :func:`~transformers.data.processors.glue.glue_convert_examples_to_features` to tokenize MRPC and convert it to a TensorFlow ``Dataset`` object. Note that tokenizers are framework-agnostic, so there is no need to prepend ``TF`` to the pretrained tokenizer name.

.. code-block:: python
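
    # a sketch of preparing the MRPC data; the shuffle/batch/repeat values are illustrative
    import tensorflow_datasets as tfds
    from transformers import BertTokenizer, glue_convert_examples_to_features

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    data = tfds.load('glue/mrpc')
    train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, max_length=128, task='mrpc')
    train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)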

The model can then be compiled and trained as any Keras model:

.. code-block:: python
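
    # the optimizer and loss below are a sketch of a typical choice for this setup
    import tensorflow as tf

    optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)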
    model.compile(optimizer=optimizer, loss=loss)
    model.fit(train_dataset, epochs=2, steps_per_epoch=115)

With the tight interoperability between TensorFlow and PyTorch models, you can even save the model and then reload it as a PyTorch model (or vice-versa):

.. code-block:: python
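
    # a sketch: save the TF model, then reload it in PyTorch with from_tf=True
    # (the directory name is illustrative)
    from transformers import BertForSequenceClassification

    model.save_pretrained('./my_mrpc_model/')
    pytorch_model = BertForSequenceClassification.from_pretrained('./my_mrpc_model/', from_tf=True)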

Trainer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We also provide a simple but feature-complete training and evaluation interface through :func:`~transformers.Trainer` and :func:`~transformers.TFTrainer`. You can train, fine-tune, and evaluate any 🤗 Transformers model with a wide range of training options and with built-in features like logging, gradient accumulation, and mixed precision.

.. code-block:: python
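
    # a sketch of setting up TFTrainer; every argument value below is illustrative
    from transformers import TFTrainer, TFTrainingArguments

    training_args = TFTrainingArguments(
        output_dir='./results',          # output directory
        num_train_epochs=3,              # total number of training epochs
        per_device_train_batch_size=16,  # batch size per device during training
        per_device_eval_batch_size=64,   # batch size for evaluation
        warmup_steps=500,                # number of warmup steps for the learning rate scheduler
        weight_decay=0.01,               # strength of weight decay
        logging_dir='./logs',            # directory for storing logs
    )

    trainer = TFTrainer(
        model=model,                        # the instantiated 🤗 Transformers model to be trained
        args=training_args,                 # training arguments, defined above
        train_dataset=tfds_train_dataset,   # tensorflow_datasets training dataset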
        eval_dataset=tfds_test_dataset      # tensorflow_datasets evaluation dataset
    )

Now simply call ``trainer.train()`` to train and ``trainer.evaluate()`` to evaluate. You can use your own module as well, but the first argument returned from ``forward`` must be the loss which you wish to optimize.

:func:`~transformers.Trainer` uses a built-in default function to collate batches and prepare them to be fed into the model. If needed, you can also use the ``data_collator`` argument to pass your own collator function which takes in the data in the format provided by your dataset and returns a batch ready to be fed into the model. Note that :func:`~transformers.TFTrainer` expects the passed datasets to be dataset objects from ``tensorflow_datasets``.

To calculate metrics in addition to the loss, you can also define your own ``compute_metrics`` function and pass it to the trainer.

.. code-block:: python
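
    # a sketch of a compute_metrics function using scikit-learn; binary averaging
    # matches the two-label setup used in this guide
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    def compute_metrics(pred):
        labels = pred.label_ids
        preds = pred.predictions.argmax(-1)
        precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
        acc = accuracy_score(labels, preds)
        return {
            'accuracy': acc,
            'f1': f1,
            'precision': precision,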
            'recall': recall
        }

Finally, you can view the results, including any calculated metrics, by launching tensorboard in your specified ``logging_dir`` directory.

.. _additional-resources:

Additional resources
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- `A lightweight colab demo <https://colab.research.google.com/drive/1-JIJlao4dI-Ilww_NnTc0rxtp-ymgDgM?usp=sharing>`_ which uses ``Trainer`` for IMDb sentiment classification.
- `🤗 Transformers Examples <https://github.com/huggingface/transformers/tree/master/examples>`_ including scripts for training and fine-tuning on GLUE, SQuAD, and several other tasks.
- `How to train a language model <https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb>`_, a detailed colab notebook which uses ``Trainer`` to train a masked language model from scratch on Esperanto.
- `🤗 Transformers Notebooks <notebooks.html>`_ which contain dozens of example notebooks from the community for training and using 🤗 Transformers on a variety of tasks.
...@@ -14,18 +14,19 @@ def swish(x):

def _gelu_python(x):
    """
    Original implementation of the gelu activation function in the Google BERT repo when initially created. For
    information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
    0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
    This is now written in C in torch.nn.functional. Also see https://arxiv.org/abs/1606.08415
    """
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))


def gelu_new(x):
    """
    Implementation of the gelu activation function currently in the Google BERT repo (identical to OpenAI GPT). Also
    see https://arxiv.org/abs/1606.08415
    """
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))
...
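As a quick, purely illustrative check of the "slightly different results" mentioned in the docstrings above (this
snippet is not part of the diff), the erf-based gelu and its tanh approximation can be compared directly:

import math

import torch

x = torch.linspace(-3.0, 3.0, steps=101)
gelu_exact = x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
gelu_tanh = 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))
# The maximum absolute difference over this range is small (on the order of 1e-3 or less).
print((gelu_exact - gelu_tanh).abs().max())
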
...@@ -4,11 +4,11 @@ import tensorflow as tf

def gelu(x):
    """
    Gaussian Error Linear Unit. Original implementation of the gelu activation function in the Google BERT repo when
    initially created. For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
    0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
    Also see https://arxiv.org/abs/1606.08415
    """
    x = tf.convert_to_tensor(x)
    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.math.sqrt(2.0)))
...@@ -17,11 +17,12 @@ def gelu(x):

def gelu_new(x):
    """
    Gaussian Error Linear Unit. This is a smoother version of the GELU. Original paper:
    https://arxiv.org/abs/1606.08415

    Args:
        x: float Tensor to perform activation.

    Returns:
        `x` with the GELU activation applied.
    """
...
...@@ -46,8 +46,9 @@ class PyTorchBenchmarkArguments(BenchmarkArguments):
    ]

    def __init__(self, **kwargs):
        """
        This __init__ is there for legacy code. When removing deprecated args completely, the class can simply be
        deleted.
        """
        for deprecated_arg in self.deprecated_args:
            if deprecated_arg in kwargs:
...
...@@ -43,8 +43,9 @@ class TensorFlowBenchmarkArguments(BenchmarkArguments):
    ]

    def __init__(self, **kwargs):
        """
        This __init__ is there for legacy code. When removing deprecated args completely, the class can simply be
        deleted.
        """
        for deprecated_arg in self.deprecated_args:
            if deprecated_arg in kwargs:
...
# coding=utf-8
# Copyright 2018 The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import dataclasses
import json
from dataclasses import dataclass, field
from time import time
from typing import List

from ..utils import logging


logger = logging.get_logger(__name__)


def list_field(default=None, metadata=None):
    return field(default_factory=lambda: default, metadata=metadata)


@dataclass
class BenchmarkArguments:
    """
    BenchmarkArguments are arguments we use in our benchmark scripts **which relate to the training loop itself**.

    Using `HfArgumentParser` we can turn this class into argparse arguments to be able to specify them on the command
    line.
    """

    models: List[str] = list_field(
        default=[],
        metadata={
            "help": "Model checkpoints to be provided to the AutoModel classes. Leave blank to benchmark the base version of all available models"
        },
    )

    batch_sizes: List[int] = list_field(
        default=[8], metadata={"help": "List of batch sizes for which memory and time performance will be evaluated"}
    )

    sequence_lengths: List[int] = list_field(
        default=[8, 32, 128, 512],
        metadata={"help": "List of sequence lengths for which memory and time performance will be evaluated"},
    )

    inference: bool = field(
        default=True,
        metadata={"help": "Whether to benchmark inference of model. Inference can be disabled via --no-inference."},
    )
    cuda: bool = field(
        default=True,
        metadata={"help": "Whether to run on available cuda devices. Cuda can be disabled via --no-cuda."},
    )
    tpu: bool = field(
        default=True, metadata={"help": "Whether to run on available tpu devices. TPU can be disabled via --no-tpu."}
    )
    fp16: bool = field(default=False, metadata={"help": "Use FP16 to accelerate inference."})
    training: bool = field(default=False, metadata={"help": "Benchmark training of model"})
    verbose: bool = field(default=False, metadata={"help": "Verbose memory tracing"})
    speed: bool = field(
        default=True,
        metadata={"help": "Whether to perform speed measurements. Speed measurements can be disabled via --no-speed."},
    )
    memory: bool = field(
        default=True,
        metadata={
            "help": "Whether to perform memory measurements. Memory measurements can be disabled via --no-memory"
        },
    )
    trace_memory_line_by_line: bool = field(default=False, metadata={"help": "Trace memory line by line"})
    save_to_csv: bool = field(default=False, metadata={"help": "Save result to a CSV file"})
    log_print: bool = field(default=False, metadata={"help": "Save all print statements in a log file"})
    env_print: bool = field(default=False, metadata={"help": "Whether to print environment information"})
    multi_process: bool = field(
        default=True,
        metadata={
            "help": "Whether to use multiprocessing for memory and speed measurement. It is highly recommended to use multiprocessing for accurate CPU and GPU memory measurements. This option should only be disabled for debugging / testing and on TPU."
        },
    )
    inference_time_csv_file: str = field(
        default=f"inference_time_{round(time())}.csv",
        metadata={"help": "CSV filename used if saving time results to csv."},
    )
    inference_memory_csv_file: str = field(
        default=f"inference_memory_{round(time())}.csv",
        metadata={"help": "CSV filename used if saving memory results to csv."},
    )
    train_time_csv_file: str = field(
        default=f"train_time_{round(time())}.csv",
        metadata={"help": "CSV filename used if saving time results to csv for training."},
    )
    train_memory_csv_file: str = field(
        default=f"train_memory_{round(time())}.csv",
        metadata={"help": "CSV filename used if saving memory results to csv for training."},
    )
    env_info_csv_file: str = field(
        default=f"env_info_{round(time())}.csv",
        metadata={"help": "CSV filename used if saving environment information."},
    )
    log_filename: str = field(
        default=f"log_{round(time())}.csv",
        metadata={"help": "Log filename used if print statements are saved in log."},
    )
    repeat: int = field(default=3, metadata={"help": "Times an experiment will be run."})
    only_pretrain_model: bool = field(
        default=False,
        metadata={
            "help": "Instead of loading the model as defined in `config.architectures` if exists, just load the pretrain model weights."
        },
    )

    def to_json_string(self):
        """
        Serializes this instance to a JSON string.
        """
        return json.dumps(dataclasses.asdict(self), indent=2)

    @property
    def model_names(self):
        assert (
            len(self.models) > 0
        ), "Please make sure you provide at least one model name / model identifier, *e.g.* `--models bert-base-cased` or `args.models = ['bert-base-cased']`."
        return self.models

    @property
    def do_multi_processing(self):
        if not self.multi_process:
            return False
        elif self.is_tpu:
            logger.info("Multiprocessing is currently not possible on TPU.")
            return False
        else:
            return True
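
The docstring above mentions `HfArgumentParser`; as a hedged usage sketch (not part of this file), the benchmark
arguments are typically built either directly or from the command line, here with `bert-base-cased` as an illustrative
checkpoint:

from transformers import HfArgumentParser, PyTorchBenchmarkArguments

# Direct construction of the framework-specific benchmark arguments.
args = PyTorchBenchmarkArguments(models=["bert-base-cased"], batch_sizes=[8], sequence_lengths=[8, 32])

# Or parse them from the command line, e.g. `python run_benchmark.py --models bert-base-cased --no-memory`.
parser = HfArgumentParser(PyTorchBenchmarkArguments)
benchmark_args = parser.parse_args_into_dataclasses()[0]
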
# This file is adapted from the AllenNLP library at https://github.com/allenai/allennlp
# Copyright by the AllenNLP authors.
"""
Utilities for working with the local dataset cache.
"""

import copy
import csv
import linecache
import os
import platform
import sys
from abc import ABC, abstractmethod
from collections import defaultdict, namedtuple
from datetime import datetime
from multiprocessing import Pipe, Process, Queue
from multiprocessing.connection import Connection
from typing import Callable, Iterable, List, NamedTuple, Optional, Union

from transformers import AutoConfig, PretrainedConfig
from transformers import __version__ as version

from ..file_utils import is_psutil_available, is_py3nvml_available, is_tf_available, is_torch_available
from ..utils import logging
from .benchmark_args_utils import BenchmarkArguments


if is_torch_available():
    from torch.cuda import empty_cache as torch_empty_cache

if is_tf_available():
    from tensorflow.python.eager import context as tf_context

if is_psutil_available():
    import psutil

if is_py3nvml_available():
    import py3nvml.py3nvml as nvml

if platform.system() == "Windows":
    from signal import CTRL_C_EVENT as SIGKILL
else:
    from signal import SIGKILL


logger = logging.get_logger(__name__)  # pylint: disable=invalid-name

_is_memory_tracing_enabled = False

BenchmarkOutput = namedtuple(
    "BenchmarkOutput",
    [
        "time_inference_result",
        "memory_inference_result",
        "time_train_result",
        "memory_train_result",
        "inference_summary",
        "train_summary",
    ],
)


def separate_process_wrapper_fn(func: Callable[[], None], do_multi_processing: bool) -> Callable[[], None]:
    """
    This function wraps another function into its own separate process. In order to ensure accurate memory
    measurements it is important that the function is executed in a separate process.

    Args:

        - `func`: (`callable`): function() -> ... generic function which will be executed in its own separate process
        - `do_multi_processing`: (`bool`) Whether to run function on separate process or not
    """

    def multi_process_func(*args, **kwargs):
        # run function in an individual
        # process to get correct memory
        def wrapper_func(queue: Queue, *args):
            try:
                result = func(*args)
            except Exception as e:
                logger.error(e)
                print(e)
                result = "N/A"
            queue.put(result)

        queue = Queue()
        p = Process(target=wrapper_func, args=[queue] + list(args))
        p.start()
        result = queue.get()
        p.join()
        return result

    if do_multi_processing:
        logger.info(f"Function {func} is executed in its own process...")
        return multi_process_func
    else:
        return func

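
# Hedged usage sketch (illustration only, not part of this module): wrap an arbitrary
# measurement function so that it runs in its own child process and its allocations
# do not pollute the parent process. `_example_measure` is a placeholder name.
def _example_measure():
    return 42  # placeholder for e.g. a forward pass returning a timing or memory figure


if __name__ == "__main__":
    wrapped = separate_process_wrapper_fn(_example_measure, do_multi_processing=True)
    print(wrapped())  # executed in a child process, result returned through a Queue
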

def is_memory_tracing_enabled():
    global _is_memory_tracing_enabled
    return _is_memory_tracing_enabled


class Frame(NamedTuple):
    """
    `Frame` is a NamedTuple used to gather the current frame state. `Frame` has the following fields:

        - 'filename' (string): Name of the file currently executed
        - 'module' (string): Name of the module currently executed
        - 'line_number' (int): Number of the line currently executed
        - 'event' (string): Event that triggered the tracing (default will be "line")
        - 'line_text' (string): Text of the line in the python script
    """

    filename: str
    module: str
    line_number: int
    event: str
    line_text: str


class UsedMemoryState(NamedTuple):
    """
    `UsedMemoryState` are named tuples with the following fields:

        - 'frame': a `Frame` namedtuple (see below) storing information on the current tracing frame (current file,
          location in current file)
        - 'cpu_memory': CPU RSS memory state *before* executing the line
        - 'gpu_memory': GPU used memory *before* executing the line (sum for all GPUs or for only `gpus_to_trace` if
          provided)
    """

    frame: Frame
    cpu_memory: int
    gpu_memory: int


class Memory(NamedTuple):
    """
    `Memory` NamedTuple has a single field `bytes`; you can get a human readable string of the number of mega bytes by
    calling `__repr__`.

        - `bytes` (integer): number of bytes
    """

    bytes: int

    def __repr__(self) -> str:
        return str(bytes_to_mega_bytes(self.bytes))


class MemoryState(NamedTuple):
    """
    `MemoryState` are namedtuples listing frame + CPU/GPU memory with the following fields:

        - `frame` (`Frame`): the current frame (see above)
        - `cpu`: CPU memory consumed during the current frame as a `Memory` named tuple
        - `gpu`: GPU memory consumed during the current frame as a `Memory` named tuple
        - `cpu_gpu`: CPU + GPU memory consumed during the current frame as a `Memory` named tuple
    """

    frame: Frame
    cpu: Memory
    gpu: Memory
    cpu_gpu: Memory


class MemorySummary(NamedTuple):
    """
    `MemorySummary` namedtuple with the following fields:

        - `sequential`: a list of `MemoryState` namedtuple (see below) computed from the provided `memory_trace` by
          subtracting the memory after executing each line from the memory before executing said line.
        - `cumulative`: a list of `MemoryState` namedtuple (see below) with cumulative increase in memory for each line
          obtained by summing repeated memory increase for a line if it's executed several times. The list is sorted
          from the frame with the largest memory consumption to the frame with the smallest (can be negative if memory
          is released)
        - `current`: a list of `MemoryState` namedtuple (see below) with the current (absolute) memory for each line,
          sorted from the largest to the smallest consumption
        - `total`: total memory increase during the full tracing as a `Memory` named tuple (see below). Lines with
          memory release (negative consumption) are ignored if `ignore_released_memory` is `True` (default).
    """

    sequential: List[MemoryState]
    cumulative: List[MemoryState]
    current: List[MemoryState]
    total: Memory


MemoryTrace = List[UsedMemoryState]


def measure_peak_memory_cpu(function: Callable[[], None], interval=0.5, device_idx=None) -> int:
    """
    Measures peak cpu memory consumption of a given `function`, running the function for at least interval seconds and
    at most 20 * interval seconds. This function is heavily inspired by `memory_usage` of the package
    `memory_profiler`:
    https://github.com/pythonprofilers/memory_profiler/blob/895c4ac7a08020d66ae001e24067da6dcea42451/memory_profiler.py#L239

    Args:

        - `function`: (`callable`): function() -> ... function without any arguments for which to measure the peak
          memory

        - `interval`: (`float`, `optional`, defaults to `0.5`) interval in second for which to measure the memory usage

        - `device_idx`: (`int`, `optional`, defaults to `None`) device id for which to measure gpu usage

    Returns:

        - `max_memory`: (`int`) consumed memory peak in Bytes
    """

    def get_cpu_memory(process_id: int) -> int:
        """
        Measures current cpu memory usage of a given `process_id`

        Args:

            - `process_id`: (`int`) process_id for which to measure memory

        Returns:

            - `memory`: (`int`) consumed memory in Bytes
        """
        process = psutil.Process(process_id)
        try:
            meminfo_attr = "memory_info" if hasattr(process, "memory_info") else "get_memory_info"
            memory = getattr(process, meminfo_attr)()[0]
        except psutil.AccessDenied:
            raise ValueError("Error with Psutil.")
        return memory

    if not is_psutil_available():
        logger.warning(
            "Psutil not installed, we won't log CPU memory usage. "
            "Install Psutil (pip install psutil) to use CPU memory tracing."
        )
        max_memory = "N/A"
    else:

        class MemoryMeasureProcess(Process):
            """
            `MemoryMeasureProcess` inherits from `Process` and overwrites its `run()` method. Used to measure the
            memory usage of a process.
            """

            def __init__(self, process_id: int, child_connection: Connection, interval: float):
                super().__init__()
                self.process_id = process_id
                self.interval = interval
                self.connection = child_connection
                self.num_measurements = 1
                self.mem_usage = get_cpu_memory(self.process_id)

            def run(self):
                self.connection.send(0)
                stop = False
                while True:
                    self.mem_usage = max(self.mem_usage, get_cpu_memory(self.process_id))
                    self.num_measurements += 1

                    if stop:
                        break

                    stop = self.connection.poll(self.interval)

                # send results to parent pipe
                self.connection.send(self.mem_usage)
                self.connection.send(self.num_measurements)

        while True:
            # create child, parent connection
            child_connection, parent_connection = Pipe()

            # instantiate process
            mem_process = MemoryMeasureProcess(os.getpid(), child_connection, interval)
            mem_process.start()

            # wait until we get memory
            parent_connection.recv()

            try:
                # execute function
                function()

                # start parent connection
                parent_connection.send(0)

                # receive memory and num measurements
                max_memory = parent_connection.recv()
                num_measurements = parent_connection.recv()
            except Exception:
                # kill process in a clean way
                parent = psutil.Process(os.getpid())
                for child in parent.children(recursive=True):
                    os.kill(child.pid, SIGKILL)
                mem_process.join(0)
                raise RuntimeError("Process killed. Error in Process")

            # run process at least 20 * interval or until it finishes
            mem_process.join(20 * interval)

            if (num_measurements > 4) or (interval < 1e-6):
                break

            # reduce interval
            interval /= 10

    return max_memory

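
# Hedged usage sketch (illustration only, not part of this module): measure the peak
# CPU memory of an arbitrary workload; requires psutil, and `_example_workload` is a
# placeholder name.
def _example_workload():
    data = [0] * (10 ** 7)  # temporarily allocate a large list
    return sum(data)


if __name__ == "__main__":
    peak_bytes = measure_peak_memory_cpu(_example_workload)
    print(f"peak memory: {peak_bytes >> 20} MB")
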

def start_memory_tracing(
    modules_to_trace: Optional[Union[str, Iterable[str]]] = None,
    modules_not_to_trace: Optional[Union[str, Iterable[str]]] = None,
    events_to_trace: str = "line",
    gpus_to_trace: Optional[List[int]] = None,
) -> MemoryTrace:
    """
    Setup line-by-line tracing to record rss mem (RAM) at each line of a module or sub-module. See `./benchmark.py` for
    usage examples. Current memory consumption is returned using psutil and in particular is the RSS memory "Resident
    Set Size" (the non-swapped physical memory the process is using). See
    https://psutil.readthedocs.io/en/latest/#psutil.Process.memory_info

    Args:

        - `modules_to_trace`: (None, string, list/tuple of string) if None, all events are recorded; if string or list
          of strings: only events from the listed module/sub-module will be recorded (e.g. 'fairseq' or
          'transformers.modeling_gpt2')
        - `modules_not_to_trace`: (None, string, list/tuple of string) if None, no module is avoided; if string or list
          of strings: events from the listed module/sub-module will not be recorded (e.g. 'torch')
        - `events_to_trace`: string or list of string of events to be recorded (see official python doc for
          `sys.settrace` for the list of events), defaults to line
        - `gpus_to_trace`: (optional list, default None) list of GPUs to trace. Default to tracing all GPUs

    Return:

        - `memory_trace` is a list of `UsedMemoryState` for each event (default each line of the traced script).
          `UsedMemoryState` are named tuples with the following fields:

            - 'frame': a `Frame` namedtuple (see below) storing information on the current tracing frame (current
              file, location in current file)
            - 'cpu_memory': CPU RSS memory state *before* executing the line
            - 'gpu_memory': GPU used memory *before* executing the line (sum for all GPUs or for only `gpus_to_trace`
              if provided)

    `Frame` is a namedtuple used by `UsedMemoryState` to list the current frame state. `Frame` has the following
    fields:

        - 'filename' (string): Name of the file currently executed
        - 'module' (string): Name of the module currently executed
        - 'line_number' (int): Number of the line currently executed
        - 'event' (string): Event that triggered the tracing (default will be "line")
        - 'line_text' (string): Text of the line in the python script
    """
    if is_psutil_available():
        process = psutil.Process(os.getpid())
    else:
        logger.warning(
            "Psutil not installed, we won't log CPU memory usage. "
            "Install psutil (pip install psutil) to use CPU memory tracing."
        )
        process = None

    if is_py3nvml_available():
        try:
            nvml.nvmlInit()
            devices = list(range(nvml.nvmlDeviceGetCount())) if gpus_to_trace is None else gpus_to_trace
            nvml.nvmlShutdown()
        except (OSError, nvml.NVMLError):
            logger.warning("Error while initializing communication with GPU. We won't perform GPU memory tracing.")
            log_gpu = False
        else:
            log_gpu = is_torch_available() or is_tf_available()
    else:
        logger.warning(
            "py3nvml not installed, we won't log GPU memory usage. "
            "Install py3nvml (pip install py3nvml) to use GPU memory tracing."
        )
        log_gpu = False

    memory_trace = []

    def traceit(frame, event, args):
        """
        Tracing method executed before running each line in a module or sub-module. Records the memory allocated in a
        list with debugging information.
        """
        global _is_memory_tracing_enabled

        if not _is_memory_tracing_enabled:
            return traceit

        # Filter events
        if events_to_trace is not None:
            if isinstance(events_to_trace, str) and event != events_to_trace:
                return traceit
            elif isinstance(events_to_trace, (list, tuple)) and event not in events_to_trace:
                return traceit

        if "__name__" not in frame.f_globals:
            return traceit

        # Filter modules
        name = frame.f_globals["__name__"]
        if not isinstance(name, str):
            return traceit
        else:
            # Filter whitelist of modules to trace
            if modules_to_trace is not None:
                if isinstance(modules_to_trace, str) and modules_to_trace not in name:
                    return traceit
                elif isinstance(modules_to_trace, (list, tuple)) and all(m not in name for m in modules_to_trace):
                    return traceit

            # Filter blacklist of modules not to trace
            if modules_not_to_trace is not None:
                if isinstance(modules_not_to_trace, str) and modules_not_to_trace in name:
                    return traceit
                elif isinstance(modules_not_to_trace, (list, tuple)) and any(m in name for m in modules_not_to_trace):
                    return traceit

        # Record current tracing state (file, location in file...)
        lineno = frame.f_lineno
        filename = frame.f_globals["__file__"]
        if filename.endswith(".pyc") or filename.endswith(".pyo"):
            filename = filename[:-1]
        line = linecache.getline(filename, lineno).rstrip()
        traced_state = Frame(filename, name, lineno, event, line)

        # Record current memory state (rss memory) and compute difference with previous memory state
        cpu_mem = 0
        if process is not None:
            mem = process.memory_info()
            cpu_mem = mem.rss

        gpu_mem = 0
        if log_gpu:
            # Clear GPU caches
            if is_torch_available():
                torch_empty_cache()
            if is_tf_available():
                tf_context.context()._clear_caches()  # See https://github.com/tensorflow/tensorflow/issues/20218#issuecomment-416771802

            # Sum used memory for all GPUs
            nvml.nvmlInit()

            for i in devices:
                handle = nvml.nvmlDeviceGetHandleByIndex(i)
                meminfo = nvml.nvmlDeviceGetMemoryInfo(handle)
                gpu_mem += meminfo.used

            nvml.nvmlShutdown()

        mem_state = UsedMemoryState(traced_state, cpu_mem, gpu_mem)
        memory_trace.append(mem_state)

        return traceit

    sys.settrace(traceit)

    global _is_memory_tracing_enabled
    _is_memory_tracing_enabled = True

    return memory_trace


def stop_memory_tracing(
    memory_trace: Optional[MemoryTrace] = None, ignore_released_memory: bool = True
) -> Optional[MemorySummary]:
    """
    Stop memory tracing cleanly and return a summary of the memory trace if a trace is given.

    Args:

        - `memory_trace` (optional output of start_memory_tracing, default: None): memory trace to convert in summary
        - `ignore_released_memory` (boolean, default: True): if True we only sum memory increases to compute the total
          memory

    Return:

        - None if `memory_trace` is None
        - `MemorySummary` namedtuple otherwise with the fields:

            - `sequential`: a list of `MemoryState` namedtuple (see below) computed from the provided `memory_trace` by
              subtracting the memory after executing each line from the memory before executing said line.
            - `cumulative`: a list of `MemoryState` namedtuple (see below) with cumulative increase in memory for each
              line obtained by summing repeated memory increase for a line if it's executed several times. The list is
              sorted from the frame with the largest memory consumption to the frame with the smallest (can be negative
              if memory is released)
            - `total`: total memory increase during the full tracing as a `Memory` named tuple (see below). Lines with
              memory release (negative consumption) are ignored if `ignore_released_memory` is `True` (default).

    `Memory` named tuple has fields:

        - `byte` (integer): number of bytes,
        - `string` (string): same as human readable string (ex: "3.5MB")

    `Frame` is a namedtuple used to list the current frame state and has the following fields:

        - 'filename' (string): Name of the file currently executed
        - 'module' (string): Name of the module currently executed
        - 'line_number' (int): Number of the line currently executed
        - 'event' (string): Event that triggered the tracing (default will be "line")
        - 'line_text' (string): Text of the line in the python script

    `MemoryState` are namedtuples listing frame + CPU/GPU memory with the following fields:

        - `frame` (`Frame`): the current frame (see above)
        - `cpu`: CPU memory consumed during the current frame as a `Memory` named tuple
        - `gpu`: GPU memory consumed during the current frame as a `Memory` named tuple
        - `cpu_gpu`: CPU + GPU memory consumed during the current frame as a `Memory` named tuple
    """
    global _is_memory_tracing_enabled
    _is_memory_tracing_enabled = False

    if memory_trace is not None and len(memory_trace) > 1:
        memory_diff_trace = []
        memory_curr_trace = []

        cumulative_memory_dict = defaultdict(lambda: [0, 0, 0])

        for (
            (frame, cpu_mem, gpu_mem),
            (next_frame, next_cpu_mem, next_gpu_mem),
        ) in zip(memory_trace[:-1], memory_trace[1:]):
            cpu_mem_inc = next_cpu_mem - cpu_mem
            gpu_mem_inc = next_gpu_mem - gpu_mem
            cpu_gpu_mem_inc = cpu_mem_inc + gpu_mem_inc
            memory_diff_trace.append(
                MemoryState(
                    frame=frame,
                    cpu=Memory(cpu_mem_inc),
                    gpu=Memory(gpu_mem_inc),
                    cpu_gpu=Memory(cpu_gpu_mem_inc),
                )
            )

            memory_curr_trace.append(
                MemoryState(
                    frame=frame,
                    cpu=Memory(next_cpu_mem),
                    gpu=Memory(next_gpu_mem),
                    cpu_gpu=Memory(next_gpu_mem + next_cpu_mem),
                )
            )

            cumulative_memory_dict[frame][0] += cpu_mem_inc
            cumulative_memory_dict[frame][1] += gpu_mem_inc
            cumulative_memory_dict[frame][2] += cpu_gpu_mem_inc

        cumulative_memory = sorted(
            list(cumulative_memory_dict.items()), key=lambda x: x[1][2], reverse=True
        )  # order by the total CPU + GPU memory increase
        cumulative_memory = list(
            MemoryState(
                frame=frame,
                cpu=Memory(cpu_mem_inc),
                gpu=Memory(gpu_mem_inc),
                cpu_gpu=Memory(cpu_gpu_mem_inc),
            )
            for frame, (cpu_mem_inc, gpu_mem_inc, cpu_gpu_mem_inc) in cumulative_memory
        )

        memory_curr_trace = sorted(memory_curr_trace, key=lambda x: x.cpu_gpu.bytes, reverse=True)

        if ignore_released_memory:
            total_memory = sum(max(0, step_trace.cpu_gpu.bytes) for step_trace in memory_diff_trace)
        else:
            total_memory = sum(step_trace.cpu_gpu.bytes for step_trace in memory_diff_trace)

        total_memory = Memory(total_memory)

        return MemorySummary(
            sequential=memory_diff_trace,
            cumulative=cumulative_memory,
            current=memory_curr_trace,
            total=total_memory,
        )

    return None

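
# Hedged usage sketch (illustration only, not part of this module): trace RSS memory
# line by line for code in the `transformers` package, then print the five most
# memory-hungry lines from the summary.
if __name__ == "__main__":
    trace = start_memory_tracing("transformers")
    # ... run the code to profile here, e.g. a model forward pass ...
    summary = stop_memory_tracing(trace)
    if summary is not None:
        print("total memory increase:", summary.total)
        for state in summary.cumulative[:5]:
            print(state.frame.filename, state.frame.line_number, state.cpu_gpu)
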

def bytes_to_mega_bytes(memory_amount: int) -> int:
    """Utility to convert a number of bytes (int) into a number of mega bytes (int)"""
    return memory_amount >> 20

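
# Illustrative sanity check (not part of this module): one gibibyte in bytes shifts
# down to 1024 mega bytes.
assert bytes_to_mega_bytes(1 << 30) == 1024
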

class Benchmark(ABC):
    """
    Benchmarks is a simple but feature-complete benchmarking script to compare memory and time performance of models in
    Transformers.
    """

    args: BenchmarkArguments
    configs: PretrainedConfig
    framework: str

    def __init__(self, args: BenchmarkArguments = None, configs: PretrainedConfig = None):
        self.args = args
        if configs is None:
            self.config_dict = {
                model_name: AutoConfig.from_pretrained(model_name) for model_name in self.args.model_names
            }
        else:
            self.config_dict = {model_name: config for model_name, config in zip(self.args.model_names, configs)}

        if self.args.memory and os.getenv("TRANSFORMERS_USE_MULTIPROCESSING") == 0:
            logger.warning(
                "Memory consumption will not be measured accurately if `args.multi_process` is set to `False.` The flag 'TRANSFORMERS_USE_MULTIPROCESSING' should only be disabled for debugging / testing."
            )

        self._print_fn = None
        self._framework_version = None
        self._environment_info = None

    @property
    def print_fn(self):
        if self._print_fn is None:
            if self.args.log_print:

                def print_and_log(*args):
                    with open(self.args.log_filename, "a") as log_file:
                        log_file.write("".join(args) + "\n")
                    print(*args)

                self._print_fn = print_and_log
            else:
                self._print_fn = print
        return self._print_fn

    @property
    @abstractmethod
    def framework_version(self):
        pass

    @abstractmethod
    def _inference_speed(self, model_name: str, batch_size: int, sequence_length: int) -> float:
        pass

    @abstractmethod
    def _train_speed(self, model_name: str, batch_size: int, sequence_length: int) -> float:
        pass

    @abstractmethod
    def _inference_memory(
        self, model_name: str, batch_size: int, sequence_length: int
    ) -> [Memory, Optional[MemorySummary]]:
        pass

    @abstractmethod
    def _train_memory(
        self, model_name: str, batch_size: int, sequence_length: int
    ) -> [Memory, Optional[MemorySummary]]:
        pass

    def inference_speed(self, *args, **kwargs) -> float:
        return separate_process_wrapper_fn(self._inference_speed, self.args.do_multi_processing)(*args, **kwargs)

    def train_speed(self, *args, **kwargs) -> float:
        return separate_process_wrapper_fn(self._train_speed, self.args.do_multi_processing)(*args, **kwargs)

    def inference_memory(self, *args, **kwargs) -> [Memory, Optional[MemorySummary]]:
        return separate_process_wrapper_fn(self._inference_memory, self.args.do_multi_processing)(*args, **kwargs)

    def train_memory(self, *args, **kwargs) -> [Memory, Optional[MemorySummary]]:
        return separate_process_wrapper_fn(self._train_memory, self.args.do_multi_processing)(*args, **kwargs)

    def run(self):
        result_dict = {model_name: {} for model_name in self.args.model_names}
        inference_result_time = copy.deepcopy(result_dict)
        inference_result_memory = copy.deepcopy(result_dict)
        train_result_time = copy.deepcopy(result_dict)
        train_result_memory = copy.deepcopy(result_dict)

        for c, model_name in enumerate(self.args.model_names):
            self.print_fn(f"{c + 1} / {len(self.args.model_names)}")

            model_dict = {
                "bs": self.args.batch_sizes,
                "ss": self.args.sequence_lengths,
                "result": {i: {} for i in self.args.batch_sizes},
            }
            inference_result_time[model_name] = copy.deepcopy(model_dict)
            inference_result_memory[model_name] = copy.deepcopy(model_dict)
            train_result_time[model_name] = copy.deepcopy(model_dict)
            train_result_memory[model_name] = copy.deepcopy(model_dict)

            inference_summary = train_summary = None

            for batch_size in self.args.batch_sizes:
                for sequence_length in self.args.sequence_lengths:
                    if self.args.inference:
                        if self.args.memory:
                            memory, inference_summary = self.inference_memory(model_name, batch_size, sequence_length)
                            inference_result_memory[model_name]["result"][batch_size][sequence_length] = memory
                        if self.args.speed:
                            time = self.inference_speed(model_name, batch_size, sequence_length)
                            inference_result_time[model_name]["result"][batch_size][sequence_length] = time

                    if self.args.training:
                        if self.args.memory:
                            memory, train_summary = self.train_memory(model_name, batch_size, sequence_length)
                            train_result_memory[model_name]["result"][batch_size][sequence_length] = memory
                        if self.args.speed:
                            time = self.train_speed(model_name, batch_size, sequence_length)
                            train_result_time[model_name]["result"][batch_size][sequence_length] = time

        if self.args.inference:
            if self.args.speed:
                self.print_fn("\n" + 20 * "=" + ("INFERENCE - SPEED - RESULT").center(40) + 20 * "=")
                self.print_results(inference_result_time, type_label="Time in s")
                self.save_to_csv(inference_result_time, self.args.inference_time_csv_file)
                if self.args.is_tpu:
                    self.print_fn(
                        "TPU was used for inference. Note that the time after compilation stabilized (after ~10 inferences model.forward(..) calls) was measured."
                    )

            if self.args.memory:
                self.print_fn("\n" + 20 * "=" + ("INFERENCE - MEMORY - RESULT").center(40) + 20 * "=")
                self.print_results(inference_result_memory, type_label="Memory in MB")
                self.save_to_csv(inference_result_memory, self.args.inference_memory_csv_file)

            if self.args.trace_memory_line_by_line:
                self.print_fn("\n" + 20 * "=" + ("INFERENCE - MEMORY - LINE BY LINE - SUMMARY").center(40) + 20 * "=")
                self.print_memory_trace_statistics(inference_summary)

        if self.args.training:
            if self.args.speed:
                self.print_fn("\n" + 20 * "=" + ("TRAIN - SPEED - RESULTS").center(40) + 20 * "=")
                self.print_results(train_result_time, "Time in s")
                self.save_to_csv(train_result_time, self.args.train_time_csv_file)
if self.args.is_tpu: self.save_to_csv(inference_result_time, self.args.inference_time_csv_file)
self.print_fn( if self.args.is_tpu:
"TPU was used for training. Note that the time after compilation stabilized (after ~10 train loss=model.forward(...) + loss.backward() calls) was measured." self.print_fn(
) "TPU was used for inference. Note that the time after compilation stabilized (after ~10 inferences model.forward(..) calls) was measured."
)
if self.args.memory:
self.print_fn("\n" + 20 * "=" + ("TRAIN - MEMORY - RESULTS").center(40) + 20 * "=") if self.args.memory:
self.print_results(train_result_memory, type_label="Memory in MB") self.print_fn("\n" + 20 * "=" + ("INFERENCE - MEMORY - RESULT").center(40) + 20 * "=")
self.save_to_csv(train_result_memory, self.args.train_memory_csv_file) self.print_results(inference_result_memory, type_label="Memory in MB")
self.save_to_csv(inference_result_memory, self.args.inference_memory_csv_file)
if self.args.trace_memory_line_by_line:
self.print_fn("\n" + 20 * "=" + ("TRAIN - MEMOMRY - LINE BY LINE - SUMMARY").center(40) + 20 * "=") if self.args.trace_memory_line_by_line:
self.print_memory_trace_statistics(train_summary) self.print_fn("\n" + 20 * "=" + ("INFERENCE - MEMOMRY - LINE BY LINE - SUMMARY").center(40) + 20 * "=")
self.print_memory_trace_statistics(inference_summary)
if self.args.env_print:
self.print_fn("\n" + 20 * "=" + ("ENVIRONMENT INFORMATION").center(40) + 20 * "=") if self.args.training:
self.print_fn( if self.args.speed:
"\n".join(["- {}: {}".format(prop, val) for prop, val in self.environment_info.items()]) + "\n" self.print_fn("\n" + 20 * "=" + ("TRAIN - SPEED - RESULTS").center(40) + 20 * "=")
) self.print_results(train_result_time, "Time in s")
self.save_to_csv(train_result_time, self.args.train_time_csv_file)
if self.args.save_to_csv: if self.args.is_tpu:
with open(self.args.env_info_csv_file, mode="w", newline="") as csv_file: self.print_fn(
writer = csv.writer(csv_file) "TPU was used for training. Note that the time after compilation stabilized (after ~10 train loss=model.forward(...) + loss.backward() calls) was measured."
for key, value in self.environment_info.items(): )
writer.writerow([key, value])
if self.args.memory:
return BenchmarkOutput( self.print_fn("\n" + 20 * "=" + ("TRAIN - MEMORY - RESULTS").center(40) + 20 * "=")
inference_result_time, self.print_results(train_result_memory, type_label="Memory in MB")
inference_result_memory, self.save_to_csv(train_result_memory, self.args.train_memory_csv_file)
train_result_time,
train_result_memory, if self.args.trace_memory_line_by_line:
inference_summary, self.print_fn("\n" + 20 * "=" + ("TRAIN - MEMOMRY - LINE BY LINE - SUMMARY").center(40) + 20 * "=")
train_summary, self.print_memory_trace_statistics(train_summary)
)
if self.args.env_print:
@property self.print_fn("\n" + 20 * "=" + ("ENVIRONMENT INFORMATION").center(40) + 20 * "=")
def environment_info(self): self.print_fn(
if self._environment_info is None: "\n".join(["- {}: {}".format(prop, val) for prop, val in self.environment_info.items()]) + "\n"
info = {} )
info["transformers_version"] = version
info["framework"] = self.framework if self.args.save_to_csv:
if self.framework == "PyTorch": with open(self.args.env_info_csv_file, mode="w", newline="") as csv_file:
info["use_torchscript"] = self.args.torchscript writer = csv.writer(csv_file)
if self.framework == "TensorFlow": for key, value in self.environment_info.items():
info["eager_mode"] = self.args.eager_mode writer.writerow([key, value])
info["use_xla"] = self.args.use_xla
info["framework_version"] = self.framework_version return BenchmarkOutput(
info["python_version"] = platform.python_version() inference_result_time,
info["system"] = platform.system() inference_result_memory,
info["cpu"] = platform.processor() train_result_time,
info["architecture"] = platform.architecture()[0] train_result_memory,
info["date"] = datetime.date(datetime.now()) inference_summary,
info["time"] = datetime.time(datetime.now()) train_summary,
info["fp16"] = self.args.fp16 )
info["use_multiprocessing"] = self.args.do_multi_processing
info["only_pretrain_model"] = self.args.only_pretrain_model @property
def environment_info(self):
if is_psutil_available(): if self._environment_info is None:
info["cpu_ram_mb"] = bytes_to_mega_bytes(psutil.virtual_memory().total) info = {}
else: info["transformers_version"] = version
logger.warning( info["framework"] = self.framework
"Psutil not installed, we won't log available CPU memory." if self.framework == "PyTorch":
"Install psutil (pip install psutil) to log available CPU memory." info["use_torchscript"] = self.args.torchscript
) if self.framework == "TensorFlow":
info["cpu_ram_mb"] = "N/A" info["eager_mode"] = self.args.eager_mode
info["use_xla"] = self.args.use_xla
info["use_gpu"] = self.args.is_gpu info["framework_version"] = self.framework_version
if self.args.is_gpu: info["python_version"] = platform.python_version()
info["num_gpus"] = 1 # TODO(PVP) Currently only single GPU is supported info["system"] = platform.system()
if is_py3nvml_available(): info["cpu"] = platform.processor()
nvml.nvmlInit() info["architecture"] = platform.architecture()[0]
handle = nvml.nvmlDeviceGetHandleByIndex(self.args.device_idx) info["date"] = datetime.date(datetime.now())
info["gpu"] = nvml.nvmlDeviceGetName(handle) info["time"] = datetime.time(datetime.now())
info["gpu_ram_mb"] = bytes_to_mega_bytes(nvml.nvmlDeviceGetMemoryInfo(handle).total) info["fp16"] = self.args.fp16
info["gpu_power_watts"] = nvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000 info["use_multiprocessing"] = self.args.do_multi_processing
info["gpu_performance_state"] = nvml.nvmlDeviceGetPerformanceState(handle) info["only_pretrain_model"] = self.args.only_pretrain_model
nvml.nvmlShutdown()
else: if is_psutil_available():
logger.warning( info["cpu_ram_mb"] = bytes_to_mega_bytes(psutil.virtual_memory().total)
"py3nvml not installed, we won't log GPU memory usage. " else:
"Install py3nvml (pip install py3nvml) to log information about GPU." logger.warning(
) "Psutil not installed, we won't log available CPU memory."
info["gpu"] = "N/A" "Install psutil (pip install psutil) to log available CPU memory."
info["gpu_ram_mb"] = "N/A" )
info["gpu_power_watts"] = "N/A" info["cpu_ram_mb"] = "N/A"
info["gpu_performance_state"] = "N/A"
info["use_gpu"] = self.args.is_gpu
info["use_tpu"] = self.args.is_tpu if self.args.is_gpu:
# TODO(PVP): See if we can add more information about TPU info["num_gpus"] = 1 # TODO(PVP) Currently only single GPU is supported
# see: https://github.com/pytorch/xla/issues/2180 if is_py3nvml_available():
nvml.nvmlInit()
self._environment_info = info handle = nvml.nvmlDeviceGetHandleByIndex(self.args.device_idx)
return self._environment_info info["gpu"] = nvml.nvmlDeviceGetName(handle)
info["gpu_ram_mb"] = bytes_to_mega_bytes(nvml.nvmlDeviceGetMemoryInfo(handle).total)
def print_results(self, result_dict, type_label): info["gpu_power_watts"] = nvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000
self.print_fn(80 * "-") info["gpu_performance_state"] = nvml.nvmlDeviceGetPerformanceState(handle)
self.print_fn( nvml.nvmlShutdown()
"Model Name".center(30) + "Batch Size".center(15) + "Seq Length".center(15) + type_label.center(15) else:
) logger.warning(
self.print_fn(80 * "-") "py3nvml not installed, we won't log GPU memory usage. "
for model_name in self.args.model_names: "Install py3nvml (pip install py3nvml) to log information about GPU."
for batch_size in result_dict[model_name]["bs"]: )
for sequence_length in result_dict[model_name]["ss"]: info["gpu"] = "N/A"
result = result_dict[model_name]["result"][batch_size][sequence_length] info["gpu_ram_mb"] = "N/A"
if isinstance(result, float): info["gpu_power_watts"] = "N/A"
result = round(1000 * result) / 1000 info["gpu_performance_state"] = "N/A"
result = "< 0.001" if result == 0.0 else str(result)
else: info["use_tpu"] = self.args.is_tpu
result = str(result) # TODO(PVP): See if we can add more information about TPU
self.print_fn( # see: https://github.com/pytorch/xla/issues/2180
model_name[:30].center(30) + str(batch_size).center(15),
str(sequence_length).center(15), self._environment_info = info
result.center(15), return self._environment_info
)
self.print_fn(80 * "-") def print_results(self, result_dict, type_label):
self.print_fn(80 * "-")
def print_memory_trace_statistics(self, summary: MemorySummary): self.print_fn(
self.print_fn( "Model Name".center(30) + "Batch Size".center(15) + "Seq Length".center(15) + type_label.center(15)
"\nLine by line memory consumption:\n" )
+ "\n".join( self.print_fn(80 * "-")
f"{state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}" for model_name in self.args.model_names:
for state in summary.sequential for batch_size in result_dict[model_name]["bs"]:
) for sequence_length in result_dict[model_name]["ss"]:
) result = result_dict[model_name]["result"][batch_size][sequence_length]
self.print_fn( if isinstance(result, float):
"\nLines with top memory consumption:\n" result = round(1000 * result) / 1000
+ "\n".join( result = "< 0.001" if result == 0.0 else str(result)
f"=> {state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}" else:
for state in summary.cumulative[:6] result = str(result)
) self.print_fn(
) model_name[:30].center(30) + str(batch_size).center(15),
self.print_fn( str(sequence_length).center(15),
"\nLines with lowest memory consumption:\n" result.center(15),
+ "\n".join( )
f"=> {state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}" self.print_fn(80 * "-")
for state in summary.cumulative[-6:]
) def print_memory_trace_statistics(self, summary: MemorySummary):
) self.print_fn(
self.print_fn(f"\nTotal memory increase: {summary.total}") "\nLine by line memory consumption:\n"
+ "\n".join(
def save_to_csv(self, result_dict, filename): f"{state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}"
if not self.args.save_to_csv: for state in summary.sequential
return )
self.print_fn("Saving results to csv.") )
with open(filename, mode="w") as csv_file: self.print_fn(
"\nLines with top memory consumption:\n"
assert len(self.args.model_names) > 0, "At least 1 model should be defined, but got {}".format( + "\n".join(
self.model_names f"=> {state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}"
) for state in summary.cumulative[:6]
)
fieldnames = ["model", "batch_size", "sequence_length"] )
writer = csv.DictWriter(csv_file, fieldnames=fieldnames + ["result"]) self.print_fn(
writer.writeheader() "\nLines with lowest memory consumption:\n"
+ "\n".join(
for model_name in self.args.model_names: f"=> {state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}"
result_dict_model = result_dict[model_name]["result"] for state in summary.cumulative[-6:]
for bs in result_dict_model: )
for ss in result_dict_model[bs]: )
result_model = result_dict_model[bs][ss] self.print_fn(f"\nTotal memory increase: {summary.total}")
writer.writerow(
{ def save_to_csv(self, result_dict, filename):
"model": model_name, if not self.args.save_to_csv:
"batch_size": bs, return
"sequence_length": ss, self.print_fn("Saving results to csv.")
"result": ("{}" if not isinstance(result_model, float) else "{:.4f}").format( with open(filename, mode="w") as csv_file:
result_model
), assert len(self.args.model_names) > 0, "At least 1 model should be defined, but got {}".format(
} self.model_names
) )
fieldnames = ["model", "batch_size", "sequence_length"]
writer = csv.DictWriter(csv_file, fieldnames=fieldnames + ["result"])
writer.writeheader()
for model_name in self.args.model_names:
result_dict_model = result_dict[model_name]["result"]
for bs in result_dict_model:
for ss in result_dict_model[bs]:
result_model = result_dict_model[bs][ss]
writer.writerow(
{
"model": model_name,
"batch_size": bs,
"sequence_length": ss,
"result": ("{}" if not isinstance(result_model, float) else "{:.4f}").format(
result_model
),
}
)
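For context, a minimal sketch of how this Benchmark API is typically driven; the model name, batch sizes and sequence lengths below are illustrative, and it assumes the PyTorch backend of the benchmark utilities is available:

    from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

    # Illustrative values; any model identifiers from the model hub could be used.
    args = PyTorchBenchmarkArguments(
        models=["bert-base-uncased"], batch_sizes=[8], sequence_lengths=[32, 128]
    )
    benchmark = PyTorchBenchmark(args)
    # run() loops over models x batch_sizes x sequence_lengths as shown above and
    # returns a BenchmarkOutput holding the time/memory result dictionaries.
    results = benchmark.run()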
...@@ -16,9 +16,9 @@ def convert_command_factory(args: Namespace):
    )

IMPORT_ERROR_MESSAGE = """
transformers can only be used from the commandline to convert TensorFlow models in PyTorch, In that case, it requires
TensorFlow to be installed. Please see https://www.tensorflow.org/install/ for installation instructions.
"""
...
...@@ -164,9 +164,9 @@ class ServeCommand(BaseTransformersCLICommand):
    def tokenize(self, text_input: str = Body(None, embed=True), return_ids: bool = Body(False, embed=True)):
        """
        Tokenize the provided input and eventually returns corresponding tokens id: - **text_input**: String to
        tokenize - **return_ids**: Boolean flags indicating if the tokens have to be converted to their integer
        mapping.
        """
        try:
            tokens_txt = self._pipeline.tokenizer.tokenize(text_input)
...@@ -187,10 +187,9 @@ class ServeCommand(BaseTransformersCLICommand):
        cleanup_tokenization_spaces: bool = Body(True, embed=True),
    ):
        """
        Detokenize the provided tokens ids to readable text: - **tokens_ids**: List of tokens ids -
        **skip_special_tokens**: Flag indicating to not try to decode special tokens - **cleanup_tokenization_spaces**:
        Flag indicating to remove all leading/trailing spaces and intermediate ones.
        """
        try:
            decoded_str = self._pipeline.tokenizer.decode(tokens_ids, skip_special_tokens, cleanup_tokenization_spaces)
...
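Under the hood these two endpoints simply delegate to the pipeline's tokenizer; a rough equivalent with a standalone tokenizer (model name and inputs are illustrative) looks like this:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    tokens = tokenizer.tokenize("Hello, world!")            # what the tokenize endpoint returns by default
    ids = tokenizer.convert_tokens_to_ids(tokens)           # what return_ids=True adds on top
    text = tokenizer.decode(ids, skip_special_tokens=True)  # what the detokenize endpoint computes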
...@@ -37,9 +37,8 @@ class AlbertConfig(PretrainedConfig):
    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
    configuration to that of the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
    outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.

    Args:
        vocab_size (:obj:`int`, `optional`, defaults to 30000):
...@@ -61,15 +60,15 @@ class AlbertConfig(PretrainedConfig):
        inner_group_num (:obj:`int`, `optional`, defaults to 1):
            The number of inner repetition of attention and ffn.
        hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu_new"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string,
            :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
        hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0):
            The dropout ratio for the attention probabilities.
        max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            (e.g., 512 or 1024 or 2048).
        type_vocab_size (:obj:`int`, `optional`, defaults to 2):
            The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.AlbertModel` or
            :class:`~transformers.TFAlbertModel`.
...
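As with the other configuration classes, a default instance reproduces the reference architecture and any of the documented arguments can be overridden at construction time. A short sketch (the overridden values are illustrative):

    from transformers import AlbertConfig, AlbertModel

    configuration = AlbertConfig()       # albert-xxlarge-v2-style defaults
    model = AlbertModel(configuration)   # randomly initialized model from that config
    small_config = AlbertConfig(hidden_size=768, num_attention_heads=12, intermediate_size=3072)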
...@@ -258,8 +258,8 @@ class AutoConfig:
        r"""
        Instantiate one of the configuration classes of the library from a pretrained model configuration.

        The configuration class to instantiate is selected based on the :obj:`model_type` property of the config object
        that is loaded, or when it's missing, by falling back to using pattern matching on
        :obj:`pretrained_model_name_or_path`:

        List options
...@@ -287,9 +287,8 @@ class AutoConfig:
                Whether or not to delete incompletely received files. Will attempt to resume the download if such a
                file exists.
            proxies (:obj:`Dict[str, str]`, `optional`):
                A dictionary of proxy servers to use by protocol or endpoint, e.g., :obj:`{'http': 'foo.bar:3128',
                'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
            return_unused_kwargs (:obj:`bool`, `optional`, defaults to :obj:`False`):
                If :obj:`False`, then this function returns just the final configuration object.
...@@ -298,8 +297,8 @@ class AutoConfig:
                the part of ``kwargs`` which has not been used to update ``config`` and is otherwise ignored.
            kwargs(additional keyword arguments, `optional`):
                The values in kwargs of any keys which are configuration attributes will be used to override the loaded
                values. Behavior concerning key/value pairs whose keys are *not* configuration attributes is controlled
                by the ``return_unused_kwargs`` keyword parameter.

        Examples::
...
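A short sketch of the behavior described above, including ``return_unused_kwargs`` (the extra ``foo`` kwarg is illustrative, there only to show it being returned unused):

    from transformers import AutoConfig

    # Downloads (and caches) the configuration selected via pattern matching on the name.
    config = AutoConfig.from_pretrained("bert-base-uncased")

    # Configuration attributes are overridden; keys that are not attributes come back in a dict.
    config, unused_kwargs = AutoConfig.from_pretrained(
        "bert-base-uncased", output_attentions=True, foo=False, return_unused_kwargs=True
    )
    assert unused_kwargs == {"foo": False}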
...@@ -36,9 +36,8 @@ class BartConfig(PretrainedConfig):
    This is the configuration class to store the configuration of a :class:`~transformers.BartModel`. It is used to
    instantiate a BART model according to the specified arguments, defining the model architecture.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
    outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.

    Args:
        vocab_size (:obj:`int`, `optional`, defaults to 50265):
...@@ -59,8 +58,8 @@ class BartConfig(PretrainedConfig):
        encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
            Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
        activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string,
            :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
        dropout (:obj:`float`, `optional`, defaults to 0.1):
            The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
        attention_dropout (:obj:`float`, `optional`, defaults to 0.0):
...@@ -70,8 +69,8 @@ class BartConfig(PretrainedConfig):
        classifier_dropout (:obj:`float`, `optional`, defaults to 0.0):
            The dropout ratio for classifier.
        max_position_embeddings (:obj:`int`, `optional`, defaults to 1024):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        init_std (:obj:`float`, `optional`, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        add_bias_logits (:obj:`bool`, `optional`, defaults to :obj:`False`):
...@@ -95,11 +94,11 @@ class BartConfig(PretrainedConfig):
        bos_token_id (:obj:`int`, `optional`, defaults to 0)
            Beginning of stream token id.
        encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
            The LayerDrop probability for the encoder. See the `LayerDrop paper <see
            https://arxiv.org/abs/1909.11556>`__ for more details.
        decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
            The LayerDrop probability for the decoder. See the `LayerDrop paper <see
            https://arxiv.org/abs/1909.11556>`__ for more details.
        extra_pos_embeddings: (:obj:`int`, `optional`, defaults to 2):
            How many extra learned positional embeddings to use. Should be set to :obj:`pad_token_id+1`.
        num_labels: (:obj:`int`, `optional`, defaults to 3):
...@@ -107,8 +106,8 @@ class BartConfig(PretrainedConfig):
        is_encoder_decoder (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether this is an encoder/decoder model.
        force_bos_token_to_be_generated (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether or not to force BOS token to be generated at step 1 (after ``decoder_start_token_id``), only
            :obj:`True` for `bart-large-cnn`.
    """

    model_type = "bart"
...
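The arguments documented above map directly onto keyword arguments of the class; for instance (the chosen values are illustrative, not recommendations):

    from transformers import BartConfig, BartModel

    configuration = BartConfig(activation_function="gelu", max_position_embeddings=1024)
    model = BartModel(configuration)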
...@@ -51,13 +51,12 @@ BERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class BertConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a :class:`~transformers.BertModel` or a
    :class:`~transformers.TFBertModel`. It is used to instantiate a BERT model according to the specified arguments,
    defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration
    to that of the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
    outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.

    Args:
...@@ -74,15 +73,15 @@ class BertConfig(PretrainedConfig):
        intermediate_size (:obj:`int`, `optional`, defaults to 3072):
            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
        hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string,
            :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
        hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
            The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
            The dropout ratio for the attention probabilities.
        max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        type_vocab_size (:obj:`int`, `optional`, defaults to 2):
            The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.BertModel` or
            :class:`~transformers.TFBertModel`.
...
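The same pattern applies to BERT, where the configuration can also be read back from an instantiated model:

    from transformers import BertConfig, BertModel

    configuration = BertConfig()        # bert-base-uncased-style defaults
    model = BertModel(configuration)    # randomly initialized model from that config
    configuration = model.config        # access the model's configuration afterwards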