**********************************************
Exporting transformers models
**********************************************

ONNX / ONNXRuntime
==============================================

The ONNX (Open Neural Network eXchange) and ONNXRuntime (ORT) projects are part of an effort by leaders in the AI industry
to provide a unified, community-driven format to store and, by extension, efficiently execute neural networks on a variety
of hardware with dedicated optimizations.

Starting with transformers v2.10.0, we partnered with ONNX Runtime to provide an easy export of transformers models to
the ONNX format. You can read more about this effort in our joint blog post `Accelerate your NLP pipelines using
Hugging Face Transformers and ONNX Runtime <https://medium.com/microsoftazure/accelerate-your-nlp-pipelines-using-hugging-face-transformers-and-onnx-runtime-2443578f4333>`_.

Exporting a model is done through the script ``convert_graph_to_onnx.py`` at the root of the transformers sources.
For example, the following command exports a BERT model from the library:

.. code-block:: bash

    python convert_graph_to_onnx.py --framework <pt, tf> --model bert-base-cased bert-base-cased.onnx

The conversion tool works for both PyTorch and TensorFlow models and ensures:
    * The model and its weights are correctly initialized from the Hugging Face model hub or a local checkpoint.
    * The inputs and outputs are correctly generated to their ONNX counterpart.
    * The generated model can be correctly loaded through onnxruntime (a short inference sketch is given at the end of this section).

.. note::
    Currently, inputs and outputs are always exported with dynamic sequence axes, which prevents some optimizations
    in ONNX Runtime. If you would like to see support for fixed-length inputs/outputs, please
    open up an issue on transformers.


The conversion tool also supports several options which let you tune the behavior of the generated model:
    * Change the target opset version of the generated model: more recent opsets generally support more operators and enable faster inference.
    * Export pipeline-specific prediction heads: allows exporting the model along with its task-specific prediction head(s).
    * Use the external data format (PyTorch only): lets you export models larger than 2GB (`More info <https://github.com/pytorch/pytorch/pull/33062>`_).
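
Once exported, the graph can be loaded and executed directly with onnxruntime. Below is a minimal sketch (not part of
the conversion script itself), assuming the model was exported to ``bert-base-cased.onnx`` with the command above; the
graph's input names usually match the tokenizer's output keys and can be inspected with ``session.get_inputs()``.

.. code-block:: python

    import numpy as np
    from onnxruntime import InferenceSession
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    session = InferenceSession("bert-base-cased.onnx")

    # onnxruntime expects int64 NumPy arrays keyed by the graph's input names
    # (assumed here to be the tokenizer's output keys: input_ids, token_type_ids, attention_mask).
    encoded = tokenizer.encode_plus("Who was Jim Henson ?")
    onnx_inputs = {name: np.array([ids], dtype=np.int64) for name, ids in encoded.items()}

    # The first argument selects which outputs to compute; None returns all of them.
    outputs = session.run(None, onnx_inputs)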


TorchScript
=======================================

.. note::
    This is the very beginning of our experiments with TorchScript and we are still exploring its capabilities
    with variable-input-size models. It is a focus of interest to us and we will deepen our analysis in upcoming
    releases, with more code examples, a more flexible implementation, and benchmarks comparing Python-based code
    with compiled TorchScript.


According to PyTorch's documentation: "TorchScript is a way to create serializable and optimizable models from PyTorch code".
PyTorch's two modules, `JIT and TRACE <https://pytorch.org/docs/stable/jit.html>`_, allow developers to export
their models to be reused in other programs, such as efficiency-oriented C++ programs.

We have provided an interface that allows the export of 🤗 Transformers models to TorchScript so that they can
be reused in a different environment than a PyTorch-based Python program. Here we explain how to use our models so that
they can be exported, and what to be mindful of when using these models with TorchScript.

Exporting a model requires two things:

* dummy inputs with which to execute a forward pass of the model.
* the model instantiated with the ``torchscript`` flag.

These necessities imply several things developers should be careful about, which are detailed below.


Implications
------------------------------------------------

TorchScript flag and tied weights
------------------------------------------------

This flag is necessary because most of the language models in this repository have tied weights between their
``Embedding`` layer and their ``Decoding`` layer. TorchScript does not allow the export of models that have tied weights,
so it is necessary to untie the weights beforehand.

This implies that models instantiated with the ``torchscript`` flag have their ``Embedding`` layer and ``Decoding`` layer
separate, which means that they should not be trained down the line. Training would de-synchronize the two layers,
leading to unexpected results.

This is not the case for models that do not have a Language Model head, as those do not have tied weights. These models
can be safely exported without the ``torchscript`` flag.
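
A quick way to observe the effect of the flag is to compare the input and output embeddings of a model with a
language modeling head. The following is a minimal sketch (not taken from the library's documentation), assuming a
``BertForMaskedLM`` checkpoint and the ``get_input_embeddings`` / ``get_output_embeddings`` helpers:

.. code-block:: python

    from transformers import BertForMaskedLM

    # Without the flag, both layers share the very same weight tensor (tied weights).
    tied_model = BertForMaskedLM.from_pretrained("bert-base-uncased")
    assert tied_model.get_input_embeddings().weight is tied_model.get_output_embeddings().weight

    # With torchscript=True, the output embeddings are a clone, i.e. two independent copies.
    untied_model = BertForMaskedLM.from_pretrained("bert-base-uncased", torchscript=True)
    assert untied_model.get_input_embeddings().weight is not untied_model.get_output_embeddings().weight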

Dummy inputs and standard lengths
------------------------------------------------

The dummy inputs are used to do a model forward pass. While the inputs' values are propagating through the layers,
Pytorch keeps track of the different operations executed on each tensor. These recorded operations are then used
to create the "trace" of the model.

The trace is created relative to the inputs' dimensions. It is therefore constrained by the dimensions of the dummy
input, and will not work for any other sequence length or batch size. When trying with a different size, an error such
as:

``The expanded size of the tensor (3) must match the existing size (7) at non-singleton dimension 2``

will be raised. It is therefore recommended to trace the model with a dummy input size at least as large as the largest
input that will be fed to the model during inference. Padding can be performed to fill the missing values. However, as the
model will have been traced with a large input size, the dimensions of the different matrices will be large as well,
resulting in more calculations.

It is recommended to keep an eye on the total number of operations performed on each input and to monitor performance
closely when exporting models with varying sequence lengths.
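
As an illustration, the sketch below (not part of the original example) pads a shorter sequence up to the traced
length before calling the traced model. It assumes the model was traced with a sequence length of 14, as in the
snippet of the next section, and that ``traced_model`` has already been created or loaded.

.. code-block:: python

    import torch
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    traced_length = 14  # sequence length the model was traced with

    # Tokenize a shorter input and pad it with the tokenizer's padding token up to the traced length.
    token_ids = tokenizer.encode("Who was Jim Henson ?")
    token_ids += [tokenizer.pad_token_id] * (traced_length - len(token_ids))

    tokens_tensor = torch.tensor([token_ids])
    segments_tensors = torch.zeros_like(tokens_tensor)

    # The padded tensors now match the dimensions the trace was recorded with.
    # outputs = traced_model(tokens_tensor, segments_tensors)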

Using TorchScript in Python
-------------------------------------------------

Below are examples of how to use Python to save and load models, as well as how to use the trace for inference.

Saving a model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This snippet shows how to use TorchScript to export a ``BertModel``. Here the ``BertModel`` is instantiated
according to a ``BertConfig`` class and then saved to disk under the filename ``traced_bert.pt``.

.. code-block:: python

    from transformers import BertModel, BertTokenizer, BertConfig
    import torch

    enc = BertTokenizer.from_pretrained("bert-base-uncased")

    # Tokenizing input text
    text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
    tokenized_text = enc.tokenize(text)

    # Masking one of the input tokens
    masked_index = 8
    tokenized_text[masked_index] = '[MASK]'
    indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
    segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

    # Creating a dummy input
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])
    dummy_input = [tokens_tensor, segments_tensors]

    # Initializing the model with the torchscript flag
    # Flag set to True even though it is not necessary as this model does not have an LM Head.
    config = BertConfig(vocab_size=32000, hidden_size=768,
        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, torchscript=True)

    # Instantiating the model
    model = BertModel(config)

    # The model needs to be in evaluation mode
    model.eval()

    # If you are instantiating the model with `from_pretrained` you can also easily set the TorchScript flag
    model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)

    # Creating the trace
    traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors])
    torch.jit.save(traced_model, "traced_bert.pt")

Loading a model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This snippet shows how to load the ``BertModel`` that was previously saved to disk under the name ``traced_bert.pt``.
We are re-using the previously initialised ``dummy_input``.

.. code-block:: python

    loaded_model = torch.jit.load("traced_bert.pt")
    loaded_model.eval()

    all_encoder_layers, pooled_output = loaded_model(*dummy_input)

Using a traced model for inference
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Using the traced model for inference is as simple as using its ``__call__`` dunder method:

.. code-block:: python

    traced_model(tokens_tensor, segments_tensors)