Unverified Commit 08f534d2 authored by Sylvain Gugger, committed by GitHub

Doc styling (#8067)

* Important files

* Styling them all

* Revert "Styling them all"

This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.

* Syling them for realsies

* Fix syntax error

* Fix benchmark_utils

* More fixes

* Fix modeling auto and script

* Remove new line

* Fixes

* More fixes

* Fix more files

* Style

* Add FSMT

* More fixes

* More fixes

* More fixes

* More fixes

* Fixes

* More fixes

* More fixes

* Last fixes

* Make sphinx happy
parent 04a17f85
@@ -39,10 +39,10 @@ is its ``__call__``: you just need to feed your sentence to your tokenizer object

     'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

This returns a dictionary mapping strings to lists of ints. The `input_ids <glossary.html#input-ids>`__ are the indices
corresponding to each token in our sentence. We will see below what the `attention_mask
<glossary.html#attention-mask>`__ is used for and in :ref:`the next section <sentence-pairs>` the goal of
`token_type_ids <glossary.html#token-type-ids>`__.

The tokenizer can decode a list of token ids into a proper sentence:

@@ -51,10 +51,10 @@ The tokenizer can decode a list of token ids into a proper sentence:

    >>> tokenizer.decode(encoded_input["input_ids"])
    "[CLS] Hello, I'm a single sentence! [SEP]"

As you can see, the tokenizer automatically added some special tokens that the model expects. Not all models need
special tokens; for instance, if we had used `gpt2-medium` instead of `bert-base-cased` to create our tokenizer, we
would have seen the same sentence as the original one here. You can disable this behavior (which is only advised if you
have added those special tokens yourself) by passing ``add_special_tokens=False``.
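As a hedged illustration (reusing the ``tokenizer`` built from ``bert-base-cased`` earlier in this guide), the decoded
text should then match the original sentence, with no ``[CLS]`` or ``[SEP]``:

.. code-block::

    >>> encoded_input = tokenizer("Hello, I'm a single sentence!", add_special_tokens=False)
    >>> tokenizer.decode(encoded_input["input_ids"])
    "Hello, I'm a single sentence!"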
If you have several sentences you want to process, you can do this efficiently by sending them as a list to the
tokenizer:

@@ -114,9 +114,9 @@ You can do all of this by using the following options when feeding your list of

            [1, 1, 1, 1, 1, 0, 0, 0, 0],
            [1, 1, 1, 1, 1, 1, 1, 1, 0]])}

It returns a dictionary with string keys and tensor values. We can now see what the `attention_mask
<glossary.html#attention-mask>`__ is all about: it points out which tokens the model should pay attention to and which
ones it should not (because they represent padding in this case).

Note that if your model does not have a maximum length associated with it, the command above will throw a warning. You
@@ -127,9 +127,9 @@ can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer

Preprocessing pairs of sentences
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sometimes you need to feed a pair of sentences to your model, for instance to classify whether two sentences in a pair
are similar, or for question-answering models, which take a context and a question. For BERT models, the input is then
represented like this: :obj:`[CLS] Sequence A [SEP] Sequence B [SEP]`

You can encode a pair of sentences in the format expected by your model by supplying the two sentences as two arguments
(not a list, since a list of two sentences will be interpreted as a batch of two single sentences, as we saw before).
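A minimal sketch of such a call (the two sentences are purely illustrative; the full output appears in the collapsed
part of this diff):

.. code-block::

    >>> encoded_input = tokenizer("How old are you?", "I'm 6 years old")
    >>> encoded_input["token_type_ids"]  # 0 for tokens of the first sentence, 1 for the second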
@@ -146,8 +146,8 @@ This will once again return a dict string to list of ints:

This shows us what the `token_type_ids <glossary.html#token-type-ids>`__ are for: they indicate to the model which part
of the inputs corresponds to the first sentence and which part corresponds to the second sentence. Note that
`token_type_ids` are not required or handled by all models. By default, a tokenizer will only return the inputs that
its associated model expects. You can force the return (or the non-return) of any of those special arguments by using
``return_input_ids`` or ``return_token_type_ids``.

If we decode the token ids we obtained, we will see that the special tokens have been properly added.

@@ -239,8 +239,8 @@ three arguments you need to know for this are :obj:`padding`, :obj:`truncation`

Here is a table summarizing the recommended way to set up padding and truncation. If you use pairs of input sequences in
any of the following examples, you can replace :obj:`truncation=True` by a :obj:`STRATEGY` selected in
:obj:`['only_first', 'only_second', 'longest_first']`, i.e. :obj:`truncation='only_second'` or
:obj:`truncation='longest_first'`, to control how both sequences in the pair are truncated as detailed before.

+--------------------------------------+-----------------------------------+---------------------------------------------------------------------------------------------+
| Truncation                           | Padding                           | Instruction                                                                                 |
...
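To make the padding and truncation strategies above concrete, here is a hedged sketch of one combination
(``batch_sentences`` and ``batch_of_second_sentences`` stand in for two lists of strings of the same length; any such
lists work):

.. code-block::

    >>> batch = tokenizer(batch_sentences, batch_of_second_sentences,
    ...                   padding=True, truncation='only_second', return_tensors="pt")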
@@ -3,7 +3,8 @@ Pretrained models

Here is the full list of the currently provided pretrained models together with a short presentation of each model.
For a list that includes community-uploaded models, refer to `https://huggingface.co/models <https://huggingface.co/models>`__.

+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| Architecture       | Shortcut name                                                | Details of the model                                                                                                                  |
...

Quick tour
=======================================================================================================================

Let's have a quick look at the 🤗 Transformers library features. The library downloads pretrained models for Natural
Language Understanding (NLU) tasks, such as analyzing the sentiment of a text, and Natural Language Generation (NLG),
such as completing a prompt with new text or translating in another language.

First we will see how to easily leverage the pipeline API to quickly use those pretrained models at inference. Then, we
@@ -29,8 +29,8 @@ provides the following tasks out of the box:

- Translation: translate a text in another language.
- Feature extraction: return a tensor representation of the text.

Let's see how this works for sentiment analysis (the other tasks are all covered in the :doc:`task summary
</task_summary>`):

.. code-block::
@@ -160,9 +160,10 @@ To apply these steps on a given text, we can just feed it to our tokenizer:

    >>> inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")

This returns a dictionary mapping strings to lists of ints. It contains the `ids of the tokens
<glossary.html#input-ids>`__, as mentioned before, but also additional arguments that will be useful to the model. Here
for instance, we also have an `attention mask <glossary.html#attention-mask>`__ that the model will use to have a
better understanding of the sequence:

.. code-block::

@@ -191,8 +192,8 @@ and get tensors back. You can specify all of that to the tokenizer:

    ...     return_tensors="tf"
    ... )

The padding is automatically applied on the side expected by the model (in this case, on the right), with the padding
token the model was pretrained with. The attention mask is also adapted to take the padding into account:

.. code-block::

@@ -213,8 +214,8 @@ Using the model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Once your input has been preprocessed by the tokenizer, you can send it directly to the model. As we mentioned, it will
contain all the relevant information the model needs. If you're using a TensorFlow model, you can pass the dictionary
directly to the model; for a PyTorch model, you need to unpack the dictionary by adding :obj:`**`.

.. code-block::

@@ -223,8 +224,8 @@ dictionary keys directly to tensors, for a PyTorch model, you need to unpack the

    >>> ## TENSORFLOW CODE
    >>> tf_outputs = tf_model(tf_batch)
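Only the TensorFlow line survives in this excerpt; for reference, a hedged sketch of the PyTorch counterpart (assuming
the ``pt_model`` and ``pt_batch`` objects built in the full tutorial) unpacks the dictionary with ``**``:

.. code-block::

    >>> ## PYTORCH CODE
    >>> pt_outputs = pt_model(**pt_batch)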
In 🤗 Transformers, all outputs are tuples (with potentially only one element). Here, we get a tuple with just the final
activations of the model.

.. code-block::

@@ -239,11 +240,10 @@ final activations of the model.

            [ 0.08181786, -0.04179301]], dtype=float32)>,)

The model can return more than just the final activations, which is why the output is a tuple. Here we only asked for
the final activations, so we get a tuple with one element.

.. note::

    All 🤗 Transformers models (PyTorch or TensorFlow) return the activations of the model *before* the final activation
    function (like SoftMax) since this final activation function is often fused with the loss.

Let's apply the SoftMax activation to get predictions.
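As a hedged sketch of that step (the full tutorial applies it to the ``pt_outputs`` / ``tf_outputs`` obtained above;
the variable names here follow that convention):

.. code-block::

    >>> ## PYTORCH CODE
    >>> import torch.nn.functional as F
    >>> pt_predictions = F.softmax(pt_outputs[0], dim=-1)
    >>> ## TENSORFLOW CODE
    >>> import tensorflow as tf
    >>> tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)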
@@ -281,11 +281,11 @@ If you have labels, you can provide them to the model, it will return a tuple wi

    >>> import tensorflow as tf
    >>> tf_outputs = tf_model(tf_batch, labels = tf.constant([1, 0]))

Models are standard `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ or `tf.keras.Model
<https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ so you can use them in your usual training loop. 🤗
Transformers also provides a :class:`~transformers.Trainer` (or :class:`~transformers.TFTrainer` if you are using
TensorFlow) class to help with your training (taking care of things such as distributed training, mixed precision,
etc.). See the :doc:`training tutorial <training>` for more details.

.. note::

@@ -336,13 +336,13 @@ The :obj:`AutoModel` and :obj:`AutoTokenizer` classes are just shortcuts that wi

pretrained model. Behind the scenes, the library has one model class per combination of architecture plus class, so the
code is easy to access and tweak if you need to.

In our previous example, the model was called "distilbert-base-uncased-finetuned-sst-2-english", which means it's using
the :doc:`DistilBERT </model_doc/distilbert>` architecture. As
:class:`~transformers.AutoModelForSequenceClassification` (or
:class:`~transformers.TFAutoModelForSequenceClassification` if you are using TensorFlow) was used, the model
automatically created is then a :class:`~transformers.DistilBertForSequenceClassification`. You can look at its
documentation for all details relevant to that specific model, or browse the source code. This is how you would
directly instantiate model and tokenizer without the auto magic:

.. code-block::

...
@@ -5,16 +5,18 @@ Exporting transformers models

ONNX / ONNXRuntime
=======================================================================================================================

Projects `ONNX (Open Neural Network eXchange) <http://onnx.ai>`_ and `ONNXRuntime (ORT)
<https://microsoft.github.io/onnxruntime/>`_ are part of an effort from leading industries in the AI field to provide a
unified and community-driven format to store and, by extension, efficiently execute neural networks leveraging a
variety of hardware and dedicated optimizations.

Starting from transformers v2.10.0 we partnered with ONNX Runtime to provide an easy export of transformers models to
the ONNX format. You can have a look at the effort by looking at our joint blog post `Accelerate your NLP pipelines
using Hugging Face Transformers and ONNX Runtime
<https://medium.com/microsoftazure/accelerate-your-nlp-pipelines-using-hugging-face-transformers-and-onnx-runtime-2443578f4333>`_.

Exporting a model is done through the script `convert_graph_to_onnx.py` at the root of the transformers sources. The
following command shows how easy it is to export a BERT model from the library; simply run:

.. code-block:: bash

@@ -27,62 +29,66 @@ The conversion tool works for both PyTorch and Tensorflow models and ensures:

* The generated model can be correctly loaded through onnxruntime.

.. note::

    Currently, inputs and outputs are always exported with dynamic sequence axes, preventing some optimizations in the
    ONNX Runtime. If you would like to see such support for fixed-length inputs/outputs, please open up an issue on
    transformers.

Also, the conversion tool supports different options which let you tune the behavior of the generated model:

* **Change the target opset version of the generated model.** (More recent opsets generally support more operators and
  enable faster inference)

* **Export pipeline-specific prediction heads.** (Allows exporting the model along with its task-specific prediction
  head(s))

* **Use the external data format (PyTorch only).** (Lets you export models whose size is above 2GB; `more info
  <https://github.com/pytorch/pytorch/pull/33062>`_)
Optimizations
-----------------------------------------------------------------------------------------------------------------------

ONNXRuntime includes some transformers-specific transformations to leverage optimized operations in the graph. Below
are some of the operators which can be enabled to speed up inference through ONNXRuntime (*see note below*):

* Constant folding
* Attention Layer fusing
* Skip connection LayerNormalization fusing
* FastGeLU approximation

Some of the optimizations performed by ONNX runtime can be hardware specific and thus lead to different performances if
used on another machine with a different hardware configuration than the one used for exporting the model. For this
reason, when using ``convert_graph_to_onnx.py`` optimizations are not enabled, ensuring the model can be easily
exported to various hardware. Optimizations can then be enabled when loading the model through ONNX runtime for
inference.
.. note::

    When quantization is enabled (see below), the ``convert_graph_to_onnx.py`` script will enable optimizations on the
    model, because quantization would modify the underlying graph, making it impossible for ONNX runtime to do the
    optimizations afterwards.

.. note::

    For more information about the optimizations enabled by ONNXRuntime, please have a look at the `ONNXRuntime GitHub
    <https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers>`_.
Quantization
-----------------------------------------------------------------------------------------------------------------------

The ONNX exporter supports generating a quantized version of the model to allow efficient inference.

Quantization works by converting the memory representation of the parameters in the neural network to a compact integer
format. By default, weights of a neural network are stored as single-precision floats (`float32`), which can express a
wide range of floating-point numbers with decent precision. These properties are especially interesting at training
time, where you want a fine-grained representation.

On the other hand, after the training phase, it has been shown one can greatly reduce the range and the precision of
`float32` numbers without changing the performances of the neural network.

More technically, `float32` parameters are converted to a type requiring fewer bits to represent each number, thus
reducing the overall size of the model. Here, we are enabling `float32` mapping to `int8` values (a non-floating,
single-byte number representation) according to the following formula:

.. math::

    y_{float32} = scale * x_{int8} - zero\_point
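To make the formula concrete, here is a toy Python sketch of a symmetric affine quantization/dequantization round trip.
This is only an illustration of the arithmetic above, not ONNX Runtime's actual implementation; the function names and
sample values are invented for the example.

.. code-block:: python

    import numpy as np

    def quantize(y_float32: np.ndarray):
        """Toy affine quantization of float32 values to int8, mirroring the formula above."""
        # Map the observed float range onto the signed 8-bit range [-127, 127].
        scale = np.abs(y_float32).max() / 127.0
        zero_point = 0.0  # symmetric quantization keeps the zero point at 0
        x_int8 = np.clip(np.round((y_float32 + zero_point) / scale), -127, 127).astype(np.int8)
        return x_int8, scale, zero_point

    def dequantize(x_int8, scale, zero_point):
        # Inverse mapping: y_float32 = scale * x_int8 - zero_point
        return scale * x_int8.astype(np.float32) - zero_point

    weights = np.array([-0.52, 0.13, 0.98, -0.07], dtype=np.float32)
    q, scale, zp = quantize(weights)
    # The recovered values are close to, but not exactly equal to, the originals.
    print(q, dequantize(q, scale, zp))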
@@ -96,9 +102,9 @@ Leveraging tiny-integers has numerous advantages when it comes to inference:

* Integer operations execute a magnitude faster on modern hardware
* Integer operations require less power to do the computations

In order to convert a transformers model to ONNX IR with quantized weights you just need to specify ``--quantize`` when
using ``convert_graph_to_onnx.py``. Also, you can have a look at the ``quantize()`` utility method in this same script
file.

Example of quantized BERT model export:

@@ -111,26 +117,27 @@ Example of quantized BERT model export:

.. note::

    When exporting a quantized model, you will end up with two different ONNX files. The one specified at the end of
    the above command will contain the original ONNX model storing `float32` weights. The second one, with the
    ``-quantized`` suffix, will hold the quantized parameters.
TorchScript
=======================================================================================================================

.. note::

    This is the very beginning of our experiments with TorchScript and we are still exploring its capabilities with
    variable-input-size models. It is a focus of interest to us and we will deepen our analysis in upcoming releases,
    with more code examples, a more flexible implementation, and benchmarks comparing Python-based code with compiled
    TorchScript.

According to PyTorch's documentation: "TorchScript is a way to create serializable and optimizable models from PyTorch
code". PyTorch's two modules, `JIT and TRACE <https://pytorch.org/docs/stable/jit.html>`_, allow the developer to export
their model to be re-used in other programs, such as efficiency-oriented C++ programs.

We have provided an interface that allows the export of 🤗 Transformers models to TorchScript so that they can be reused
in a different environment than a PyTorch-based Python program. Here we explain how to export and use our models using
TorchScript.
Exporting a model requires two things:

@@ -145,13 +152,14 @@ Implications

TorchScript flag and tied weights
-----------------------------------------------------------------------------------------------------------------------

This flag is necessary because most of the language models in this repository have tied weights between their
``Embedding`` layer and their ``Decoding`` layer. TorchScript does not allow the export of models that have tied
weights; it is therefore necessary to untie and clone the weights beforehand.

This implies that models instantiated with the ``torchscript`` flag have their ``Embedding`` layer and ``Decoding``
layer separate, which means that they should not be trained down the line. Training would de-synchronize the two
layers, leading to unexpected results.

This is not the case for models that do not have a Language Model head, as those do not have tied weights. These models
can be safely exported without the ``torchscript`` flag.

@@ -160,8 +168,8 @@ Dummy inputs and standard lengths
-----------------------------------------------------------------------------------------------------------------------

The dummy inputs are used to do a model forward pass. While the inputs' values are propagating through the layers,
PyTorch keeps track of the different operations executed on each tensor. These recorded operations are then used to
create the "trace" of the model.
The trace is created relative to the inputs' dimensions. It is therefore constrained by the dimensions of the dummy
input, and will not work for any other sequence length or batch size. When trying with a different size, an error such
@@ -185,8 +193,8 @@ Below is an example, showing how to save, load models as well as how to use the

Saving a model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This snippet shows how to use TorchScript to export a ``BertModel``. Here the ``BertModel`` is instantiated according
to a ``BertConfig`` class and then saved to disk under the filename ``traced_bert.pt``.

.. code-block:: python

...
This diff is collapsed.
This diff is collapsed.
Tokenizer summary
-----------------------------------------------------------------------------------------------------------------------

On this page, we will have a closer look at tokenization. As we saw in :doc:`the preprocessing tutorial
<preprocessing>`, tokenizing a text is splitting it into words or subwords, which then are converted to ids. The second
part is pretty straightforward; here we will focus on the first part. More specifically, we will look at the three main
kinds of tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`,
:ref:`WordPiece <wordpiece>` and :ref:`SentencePiece <sentencepiece>`, and provide examples of models using each of
those.

Note that on each model page, you can look at the documentation of the associated tokenizer to know which of those
algorithms the pretrained model used. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see it's
@@ -16,8 +16,8 @@ Introduction to tokenization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Splitting a text into smaller chunks is a task that's harder than it looks, and there are multiple ways of doing it. For
instance, let's look at the sentence "Don't you love 🤗 Transformers? We sure do." A first simple way of tokenizing this
text is just to split it by spaces, which would give:

.. code-block::

@@ -46,9 +46,8 @@ rule-based tokenizers. On the text above, they'd output something like:

Space/punctuation-tokenization and rule-based tokenization are both examples of word tokenization, which is splitting a
sentence into words. While it's the most intuitive way to separate texts into smaller chunks, it can have a problem when
you have a huge corpus: it usually yields a very big vocabulary (the set of all unique tokens used). :doc:`Transformer
XL <model_doc/transformerxl>` for instance uses space/punctuation-tokenization, and has a vocabulary size of 267,735!

A huge vocabulary size means a huge embedding matrix at the start of the model, which will cause memory problems.
TransformerXL deals with it by using a special kind of embeddings called adaptive embeddings, but in general,
@@ -69,9 +68,8 @@ decomposed as "annoying" and "ly". This is especially useful in agglutinative la

form (almost) arbitrarily long complex words by stringing together some subwords.

This allows the model to keep a reasonable vocabulary while still learning useful representations for common words or
subwords. This also enables the model to process words it has never seen before, by decomposing them into subwords it
knows. For instance, the base :class:`~transformers.BertTokenizer` will tokenize "I have a new GPU!" like this:

.. code-block::

@@ -81,8 +79,8 @@ this:

    ['i', 'have', 'a', 'new', 'gp', '##u', '!']

Since we are considering the uncased model, the sentence was lowercased first. Then all the words were present in the
vocabulary of the tokenizer, except for "gpu", so the tokenizer splits it into subwords it knows: "gp" and "##u". The
"##" means that the rest of the token should be attached to the previous one, without space (for when we need to decode
predictions and reverse the tokenization).
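To see this in the other direction, a small hedged sketch that glues the pieces above back together;
:meth:`~transformers.PreTrainedTokenizer.convert_tokens_to_string` undoes the "##" continuation markers:

.. code-block::

    >>> from transformers import BertTokenizer
    >>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    >>> tokenizer.convert_tokens_to_string(['i', 'have', 'a', 'new', 'gp', '##u', '!'])
    'i have a new gpu !'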
Another example is when we use the base :class:`~transformers.XLNetTokenizer` to tokenize our previous text:

@@ -106,9 +104,9 @@ Byte-Pair Encoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Byte-Pair Encoding was introduced in `this paper <https://arxiv.org/abs/1508.07909>`__. It relies on a pretokenizer
splitting the training data into words, which can be a simple space tokenization (:doc:`GPT-2 <model_doc/gpt2>` and
:doc:`Roberta <model_doc/roberta>` use this, for instance) or a rule-based tokenizer (:doc:`XLM <model_doc/xlm>` uses
Moses for most languages, as does :doc:`FlauBERT <model_doc/flaubert>`, while :doc:`GPT <model_doc/gpt>` uses Spacy and
ftfy), and counts the frequency of each word in the training corpus.

@@ -148,10 +146,10 @@ represented as

    ('hug', 10), ('p' 'ug', 5), ('p' 'un', 12), ('b' 'un', 4), ('hug' 's', 5)

If we stop there, the tokenizer can apply the rules it learned to new words (as long as they don't contain characters
that were not in the base vocabulary). For instance 'bug' would be tokenized as ``['b', 'ug']`` but 'mug' would be
tokenized as ``['<unk>', 'ug']`` since the 'm' is not in the base vocabulary. This doesn't happen to letters in general
(since the base corpus uses all of them), but to special characters like emojis.
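Applying learned merges to a new word can be sketched in a few lines of Python. This is a toy illustration of the idea,
not the library's implementation, and the merge order below is assumed (inferred from the example corpus above):

.. code-block:: python

    def bpe_tokenize(word, merges, base_vocab):
        """Toy BPE: split a word into characters, then apply the learned merges in order."""
        symbols = ["<unk>" if ch not in base_vocab else ch for ch in word]
        for left, right in merges:  # merges are applied in the order they were learned
            i = 0
            while i < len(symbols) - 1:
                if symbols[i] == left and symbols[i + 1] == right:
                    symbols[i:i + 2] = [left + right]
                else:
                    i += 1
        return symbols

    base_vocab = {"b", "g", "h", "n", "p", "s", "u"}
    merges = [("u", "g"), ("u", "n"), ("h", "ug")]  # assumed order, for illustration only

    print(bpe_tokenize("bug", merges, base_vocab))  # ['b', 'ug']
    print(bpe_tokenize("mug", merges, base_vocab))  # ['<unk>', 'ug']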
As we said before, the vocabulary size (which is the base vocabulary size + the number of merges) is a hyperparameter
to choose. For instance :doc:`GPT <model_doc/gpt>` has a vocabulary size of 40,478 since they have 478 base characters

@@ -161,24 +159,24 @@ Byte-level BPE
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To deal with the fact that the base vocabulary needs to include all base characters, which can be quite big if one
allows for all unicode characters, the `GPT-2 paper
<https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`__ introduces a
clever trick, which is to use bytes as the base vocabulary (which gives a size of 256). With some additional rules to
deal with punctuation, this manages to be able to tokenize every text without needing an unknown token. For instance,
the :doc:`GPT-2 model <model_doc/gpt>` has a vocabulary size of 50,257, which corresponds to the 256 bytes base tokens,
a special end-of-text token and the symbols learned with 50,000 merges.

.. _wordpiece:

WordPiece
=======================================================================================================================

WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>` (as well as :doc:`DistilBERT
<model_doc/distilbert>` and :doc:`Electra <model_doc/electra>`) and was outlined in `this paper
<https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__. It relies on the same
base as BPE, which is to initialize the vocabulary to every character present in the corpus and progressively learn a
given number of merge rules; the difference is that it doesn't choose the pair that is the most frequent but the one
that will maximize the likelihood on the corpus once merged.
What does this mean? Well, in the previous example, it means we would only merge 'u' and 'g' if the probability of
having 'ug' divided by the probability of having 'u' then 'g' is greater than for any other pair of symbols. It's

@@ -198,10 +196,10 @@ with :ref:`SentencePiece <sentencepiece>`.

More specifically, at a given step, unigram computes a loss from the corpus we have and the current vocabulary, then,
for each subword, evaluates how much the loss would increase if the subword was removed from the vocabulary. It then
sorts the subwords by this quantity (that represents how much worse the loss becomes if the token is removed) and
removes all the worst p tokens (for instance p could be 10% or 20%). It then repeats the process until the vocabulary
has reached the desired size, always keeping the base characters (to be able to tokenize any word written with them,
like BPE or WordPiece).

Contrary to BPE and WordPiece that work out rules in a certain order that you can then apply in the same order when
tokenizing new text, Unigram will have several ways of tokenizing a new text. For instance, if it ends up with the

@@ -217,9 +215,9 @@ training corpus. You can then give a probability to each tokenization (which is

tokens forming it) and pick the most likely one (or if you want to apply some data augmentation, you could sample one
of the tokenizations according to their probabilities).

Those probabilities define the loss that trains the tokenizer: if our corpus consists of the words :math:`x_{1}, \dots,
x_{N}` and if for the word :math:`x_{i}` we note :math:`S(x_{i})` the set of all possible tokenizations of
:math:`x_{i}` (with the current vocabulary), then the loss is defined as

.. math::

    \mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )
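To make the loss concrete, here is a toy Python sketch that enumerates the tokenizations of each word and evaluates the
sum above. The vocabulary and probabilities are invented for illustration; this is not the library's trainer.

.. code-block:: python

    import math

    # Toy unigram probabilities over a tiny vocabulary (made-up numbers, for illustration only).
    token_prob = {"h": 0.05, "u": 0.05, "g": 0.05, "hu": 0.1, "ug": 0.2, "hug": 0.55}

    def tokenizations(word):
        """Enumerate every way of splitting `word` into tokens of the current vocabulary (S(x_i))."""
        if not word:
            return [[]]
        results = []
        for end in range(1, len(word) + 1):
            piece = word[:end]
            if piece in token_prob:
                results += [[piece] + rest for rest in tokenizations(word[end:])]
        return results

    def corpus_loss(words):
        """-sum_i log( sum_{x in S(x_i)} p(x) ), where p(x) is the product of its token probabilities."""
        loss = 0.0
        for word in words:
            total = sum(math.prod(token_prob[t] for t in x) for x in tokenizations(word))
            loss -= math.log(total)
        return loss

    print(corpus_loss(["hug", "hug", "ug"]))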
@@ -236,8 +234,8 @@ SentencePiece (introduced in `this paper <https://arxiv.org/pdf/1808.06226.pdf>`

includes the space in the set of characters to use, then uses BPE or unigram to construct the appropriate vocabulary.

That's why in the example we saw before using :class:`~transformers.XLNetTokenizer` (which uses SentencePiece), we had
the '▁' character, which represents space. Decoding a tokenized text is then super easy: we just have to concatenate all
of them together and replace '▁' with space.
All transformers models in the library that use SentencePiece use it with unigram. Examples of models using it are
:doc:`ALBERT <model_doc/albert>`, :doc:`XLNet <model_doc/xlnet>` or the :doc:`Marian framework <model_doc/marian>`.
Training and fine-tuning
=======================================================================================================================

Model classes in 🤗 Transformers are designed to be compatible with native PyTorch and TensorFlow 2 and can be used
seamlessly with either. In this quickstart, we will show how to fine-tune (or train from scratch) a model using the
standard training tools available in either framework. We will also show how to use our included
:func:`~transformers.Trainer` class which handles much of the complexity of training for you.

This guide assumes that you are already familiar with loading and using our models for inference; otherwise, see the
:doc:`task summary <task_summary>`. We also assume that you are familiar with training deep neural networks in either
PyTorch or TF2, and focus specifically on the nuances and tools for training models in 🤗 Transformers.
Sections:

@@ -26,25 +22,19 @@
Fine-tuning in native PyTorch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Model classes in 🤗 Transformers that don't begin with ``TF`` are `PyTorch Modules
<https://pytorch.org/docs/master/generated/torch.nn.Module.html>`_, meaning that you can use them just as you would any
model in PyTorch for both inference and optimization.
Let's consider the common task of fine-tuning a masked language model like BERT on a sequence classification dataset.
When we instantiate a model with :func:`~transformers.PreTrainedModel.from_pretrained`, the model configuration and
pre-trained weights of the specified model are used to initialize the model. The library also includes a number of
task-specific final layers or 'heads' whose weights are instantiated randomly when not present in the specified
pre-trained model. For example, instantiating a model with
``BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)`` will create a BERT model instance
with encoder weights copied from the ``bert-base-uncased`` model and a randomly initialized sequence classification
head on top of the encoder with an output size of 2. Models are initialized in ``eval`` mode by default. We can call
``model.train()`` to put it in train mode.
@@ -52,20 +42,17 @@

.. code-block:: python

    from transformers import BertForSequenceClassification

    model = BertForSequenceClassification.from_pretrained('bert-base-uncased', return_dict=True)
    model.train()
This is useful because it allows us to make use of the pre-trained BERT encoder and easily train it on whatever
sequence classification dataset we choose. We can use any PyTorch optimizer, but our library also provides the
:func:`~transformers.AdamW` optimizer which implements gradient bias correction as well as weight decay.
.. code-block:: python

    from transformers import AdamW

    optimizer = AdamW(model.parameters(), lr=1e-5)
The optimizer allows us to apply different hyperparameters for specific parameter groups. For example, we can apply
weight decay to all parameters other than bias and layer normalization terms:
@@ -76,10 +63,8 @@

.. code-block:: python

    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)
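For a fuller picture, here is a sketch of how such grouped parameters can be defined, building on the ``model`` and the
``AdamW`` import above; the ``no_decay`` name patterns and the 0.01 weight-decay value are illustrative choices:

.. code-block:: python

    no_decay = ['bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        # Weight decay for everything except biases and LayerNorm weights ...
        {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
         'weight_decay': 0.01},
        # ... and no weight decay for the rest.
        {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
         'weight_decay': 0.0},
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)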
Now we can set up a simple dummy training batch using :func:`~transformers.PreTrainedTokenizer.__call__`. This returns
a :func:`~transformers.BatchEncoding` instance which prepares everything we might need to pass to the model.
@@ -90,10 +75,9 @@

.. code-block:: python

    input_ids = encoding['input_ids']
    attention_mask = encoding['attention_mask']
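A sketch of the full batch preparation, assuming a BERT tokenizer and two example sentences of our own choosing:

.. code-block:: python

    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
    text_batch = ["I love Pixar.", "I don't care for Pixar."]
    # Pad to a common length and return PyTorch tensors.
    encoding = tokenizer(text_batch, return_tensors='pt', padding=True, truncation=True)
    input_ids = encoding['input_ids']
    attention_mask = encoding['attention_mask']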
When we call a classification model with the ``labels`` argument, the first returned element is the cross-entropy loss
between the predictions and the passed labels. Having already set up our optimizer, we can then do a backward pass and
update the weights:
@@ -103,8 +87,8 @@

.. code-block:: python

    loss.backward()
    optimizer.step()
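Spelled out end to end, a sketch of this step under the assumptions above (toy labels for the two sentences, and the
``return_dict=True`` model from earlier) could be:

.. code-block:: python

    import torch

    labels = torch.tensor([1, 0])  # toy labels for the two example sentences
    outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
    loss = outputs.loss  # the loss comes first when labels are passed
    loss.backward()
    optimizer.step()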
Alternatively, you can just get the logits and calculate the loss yourself. The following is equivalent to the previous
example:
@@ -115,12 +99,10 @@

.. code-block:: python

    loss.backward()
    optimizer.step()
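Under the same assumptions, a sketch of the manual loss computation:

.. code-block:: python

    import torch
    import torch.nn.functional as F

    labels = torch.tensor([1, 0])
    outputs = model(input_ids, attention_mask=attention_mask)
    loss = F.cross_entropy(outputs.logits, labels)
    loss.backward()
    optimizer.step()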
Of course, you can train on GPU by calling ``to('cuda')`` on the model and inputs as usual.
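For example (the explicit device guard is our own addition):

.. code-block:: python

    import torch

    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    model.to(device)
    input_ids = input_ids.to(device)
    attention_mask = attention_mask.to(device)
    labels = labels.to(device)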
We also provide a few learning rate scheduling tools. With the following, we can set up a scheduler which warms up for
``num_warmup_steps`` and then linearly decays to 0 by the end of training.
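A sketch of what that setup can look like with the library's linear-warmup helper; the step counts are placeholder
values:

.. code-block:: python

    from transformers import get_linear_schedule_with_warmup

    num_warmup_steps = 100   # placeholder
    num_train_steps = 1000   # placeholder
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_train_steps
    )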
@@ -135,19 +117,16 @@
Then all we have to do is call ``scheduler.step()`` after ``optimizer.step()``.

.. code-block:: python

    optimizer.step()
    scheduler.step()
We highly recommend using :func:`~transformers.Trainer`, discussed below, which conveniently handles the moving parts
of training 🤗 Transformers models with features like mixed precision and easy tensorboard logging.
Freezing the encoder
-----------------------------------------------------------------------------------------------------------------------
In some cases, you might be interested in keeping the weights of the pre-trained encoder frozen and optimizing only the
weights of the head layers. To do so, simply set the ``requires_grad`` attribute to ``False`` on the encoder
parameters, which can be accessed with the ``base_model`` submodule on any task-specific model in the library:
@@ -160,10 +139,8 @@

.. code-block:: python
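    # A minimal sketch: freeze every parameter of the pre-trained encoder (reachable
    # through ``base_model``) so that only the task-specific head is updated.
    for param in model.base_model.parameters():
        param.requires_grad = False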
Fine-tuning in native TensorFlow 2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Models can also be trained natively in TensorFlow 2. Just as with PyTorch, TensorFlow models can be instantiated with
:func:`~transformers.PreTrainedModel.from_pretrained` to load the weights of the encoder from a pretrained model.
@@ -171,11 +148,9 @@

.. code-block:: python

    from transformers import TFBertForSequenceClassification

    model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')
Let's use ``tensorflow_datasets`` to load in the `MRPC dataset
<https://www.tensorflow.org/datasets/catalog/glue#gluemrpc>`_ from GLUE. We can then use our built-in
:func:`~transformers.data.processors.glue.glue_convert_examples_to_features` to tokenize MRPC and convert it to a
TensorFlow ``Dataset`` object. Note that tokenizers are framework-agnostic, so there is no need to prepend ``TF`` to
the pretrained tokenizer name.
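A sketch of that data-loading step; the sequence length, batching values, optimizer and loss choices below are
illustrative assumptions rather than prescriptions:

.. code-block:: python

    import tensorflow as tf
    import tensorflow_datasets as tfds
    from transformers import BertTokenizerFast, glue_convert_examples_to_features

    tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
    data = tfds.load('glue/mrpc')
    # Convert the raw MRPC examples into a tf.data.Dataset of model inputs.
    train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, max_length=128, task='mrpc')
    train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)

    # Keras objects used by ``model.compile`` below (illustrative choices).
    optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)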
@@ -197,8 +172,8 @@
The model can then be compiled and trained as any Keras model:

.. code-block:: python

    model.compile(optimizer=optimizer, loss=loss)
    model.fit(train_dataset, epochs=2, steps_per_epoch=115)
With the tight interoperability between TensorFlow and PyTorch models, you can even save the model and then reload it
as a PyTorch model (or vice-versa):
@@ -212,12 +187,9 @@

.. code-block:: python
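    # A sketch of the round-trip; the directory name is a placeholder.
    from transformers import BertForSequenceClassification

    model.save_pretrained('./my_mrpc_model/')
    pytorch_model = BertForSequenceClassification.from_pretrained('./my_mrpc_model/', from_tf=True)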
Trainer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We also provide a simple but feature-complete training and evaluation interface through :func:`~transformers.Trainer`
and :func:`~transformers.TFTrainer`. You can train, fine-tune, and evaluate any 🤗 Transformers model with a wide range
of training options and with built-in features like logging, gradient accumulation, and mixed precision.
@@ -264,21 +236,16 @@

.. code-block:: python

        eval_dataset=tfds_test_dataset   # tensorflow_datasets evaluation dataset
    )
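For context, a fuller sketch of such a setup; every argument value and the ``tfds_*`` dataset variables are
placeholders:

.. code-block:: python

    from transformers import TFTrainer, TFTrainingArguments

    training_args = TFTrainingArguments(
        output_dir='./results',          # output directory (placeholder)
        num_train_epochs=3,              # total number of training epochs
        per_device_train_batch_size=16,  # batch size per device during training
        per_device_eval_batch_size=64,   # batch size for evaluation
        warmup_steps=500,                # warmup steps for the learning rate scheduler
        weight_decay=0.01,               # strength of weight decay
        logging_dir='./logs',            # directory for storing logs
    )

    trainer = TFTrainer(
        model=model,                        # the instantiated 🤗 Transformers model to be trained
        args=training_args,                 # training arguments, defined above
        train_dataset=tfds_train_dataset,   # tensorflow_datasets training dataset
        eval_dataset=tfds_test_dataset      # tensorflow_datasets evaluation dataset
    )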
Now simply call ``trainer.train()`` to train and ``trainer.evaluate()`` to evaluate. You can use your own module as
well, but the first argument returned from ``forward`` must be the loss which you wish to optimize.
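To illustrate that contract, a bare-bones, entirely hypothetical module whose ``forward`` returns the loss first:

.. code-block:: python

    import torch.nn as nn

    class TinyClassifier(nn.Module):
        """Hypothetical example: the only requirement is that ``forward`` returns the loss first."""

        def __init__(self, hidden_size=768, num_labels=2):
            super().__init__()
            self.classifier = nn.Linear(hidden_size, num_labels)
            self.loss_fct = nn.CrossEntropyLoss()

        def forward(self, features, labels=None):
            logits = self.classifier(features)
            if labels is not None:
                return self.loss_fct(logits, labels), logits  # loss first
            return (logits,)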
:func:`~transformers.Trainer` uses a built-in default function to collate batches and prepare them to be fed into the
model. If needed, you can also use the ``data_collator`` argument to pass your own collator function which takes in the
data in the format provided by your dataset and returns a batch ready to be fed into the model. Note that
:func:`~transformers.TFTrainer` expects the passed datasets to be dataset objects from ``tensorflow_datasets``.
To compute additional metrics besides the loss, you can also define your own ``compute_metrics`` function and pass it
to the trainer.
@@ -296,8 +263,8 @@

.. code-block:: python

            'recall': recall
        }
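A fuller sketch of such a ``compute_metrics`` function; the use of scikit-learn and binary averaging are assumptions
for a two-label task:

.. code-block:: python

    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    def compute_metrics(pred):
        labels = pred.label_ids
        preds = pred.predictions.argmax(-1)
        precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
        acc = accuracy_score(labels, preds)
        return {
            'accuracy': acc,
            'f1': f1,
            'precision': precision,
            'recall': recall
        }

The function can then be passed to the trainer through its ``compute_metrics`` argument.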
Finally, you can view the results, including any calculated metrics, by launching tensorboard in your specified
``logging_dir`` directory.
.. _additional-resources:

@@ -308,11 +275,12 @@ Additional resources
- `A lightweight colab demo <https://colab.research.google.com/drive/1-JIJlao4dI-Ilww_NnTc0rxtp-ymgDgM?usp=sharing>`_
  which uses ``Trainer`` for IMDb sentiment classification.
- `🤗 Transformers Examples <https://github.com/huggingface/transformers/tree/master/examples>`_ including scripts for
  training and fine-tuning on GLUE, SQuAD, and several other tasks.
- `How to train a language model
  <https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb>`_, a detailed
  colab notebook which uses ``Trainer`` to train a masked language model from scratch on Esperanto.
- `🤗 Transformers Notebooks <notebooks.html>`_ which contain dozens of example notebooks from the community for
  training and using 🤗 Transformers on a variety of tasks.
@@ -14,18 +14,19 @@ def swish(x):
def _gelu_python(x):
    """
    Original Implementation of the gelu activation function in Google Bert repo when initially created. For
    information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
    0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3)))). This is now written in C in
    torch.nn.functional. Also see https://arxiv.org/abs/1606.08415.
    """
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
def gelu_new(x):
    """
    Implementation of the gelu activation function currently in Google Bert repo (identical to OpenAI GPT). Also see
    https://arxiv.org/abs/1606.08415
    """
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))
@@ -4,11 +4,11 @@ import tensorflow as tf
def gelu(x):
    """
    Gaussian Error Linear Unit. Original Implementation of the gelu activation function in Google Bert repo when
    initially created. For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
    0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3)))) Also see
    https://arxiv.org/abs/1606.08415
    """
    x = tf.convert_to_tensor(x)
    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.math.sqrt(2.0)))
@@ -17,11 +17,12 @@ def gelu(x):
def gelu_new(x):
    """
    Gaussian Error Linear Unit. This is a smoother version of the GELU. Original paper: https://arxiv.org/abs/1606.08415

    Args:
        x: float Tensor to perform activation.

    Returns:
        `x` with the GELU activation applied.
    """
@@ -46,8 +46,9 @@ class PyTorchBenchmarkArguments(BenchmarkArguments):
    ]

    def __init__(self, **kwargs):
        """
        This __init__ is there for legacy code. When removing deprecated args completely, the class can simply be
        deleted
        """
        for deprecated_arg in self.deprecated_args:
            if deprecated_arg in kwargs:
@@ -43,8 +43,9 @@ class TensorFlowBenchmarkArguments(BenchmarkArguments):
    ]

    def __init__(self, **kwargs):
        """
        This __init__ is there for legacy code. When removing deprecated args completely, the class can simply be
        deleted
        """
        for deprecated_arg in self.deprecated_args:
            if deprecated_arg in kwargs:
@@ -33,12 +33,10 @@ def list_field(default=None, metadata=None):
@dataclass
class BenchmarkArguments:
    """
    BenchmarkArguments are arguments we use in our benchmark scripts **which relate to the training loop itself**.

    Using `HfArgumentParser` we can turn this class into argparse arguments to be able to specify them on the command
    line.
    """

    models: List[str] = list_field(
@@ -16,9 +16,9 @@ def convert_command_factory(args: Namespace):
    )

IMPORT_ERROR_MESSAGE = """
transformers can only be used from the command line to convert TensorFlow models to PyTorch. In that case, it requires
TensorFlow to be installed. Please see https://www.tensorflow.org/install/ for installation instructions.
"""
@@ -164,9 +164,9 @@ class ServeCommand(BaseTransformersCLICommand):
    def tokenize(self, text_input: str = Body(None, embed=True), return_ids: bool = Body(False, embed=True)):
        """
        Tokenize the provided input and eventually returns the corresponding token ids:

        - **text_input**: String to tokenize
        - **return_ids**: Boolean flag indicating whether the tokens have to be converted to their integer mapping.
        """
        try:
            tokens_txt = self._pipeline.tokenizer.tokenize(text_input)
@@ -187,10 +187,9 @@ class ServeCommand(BaseTransformersCLICommand):
        cleanup_tokenization_spaces: bool = Body(True, embed=True),
    ):
        """
        Detokenize the provided token ids into readable text:

        - **tokens_ids**: List of token ids
        - **skip_special_tokens**: Flag indicating to not try to decode special tokens
        - **cleanup_tokenization_spaces**: Flag indicating to remove all leading/trailing spaces and intermediate ones.
        """
        try:
            decoded_str = self._pipeline.tokenizer.decode(tokens_ids, skip_special_tokens, cleanup_tokenization_spaces)
@@ -37,9 +37,8 @@ class AlbertConfig(PretrainedConfig):
    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
    configuration to that of the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
    outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.

    Args:
        vocab_size (:obj:`int`, `optional`, defaults to 30000):

@@ -61,15 +60,15 @@ class AlbertConfig(PretrainedConfig):
        inner_group_num (:obj:`int`, `optional`, defaults to 1):
            The number of inner repetition of attention and ffn.
        hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu_new"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string,
            :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
        hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0):
            The dropout ratio for the attention probabilities.
        max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            (e.g., 512 or 1024 or 2048).
        type_vocab_size (:obj:`int`, `optional`, defaults to 2):
            The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.AlbertModel` or
            :class:`~transformers.TFAlbertModel`.