.. 
    Copyright 2020 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

Exporting transformers models
***********************************************************************************************************************

ONNX / ONNXRuntime
=======================================================================================================================

Projects `ONNX (Open Neural Network eXchange) <http://onnx.ai>`_ and `ONNXRuntime (ORT)
<https://microsoft.github.io/onnxruntime/>`_ are part of an effort from leading industries in the AI field to provide a
unified and community-driven format to store and, by extension, efficiently execute neural networks leveraging a variety
of hardware and dedicated optimizations.

Starting from transformers v2.10.0 we partnered with ONNX Runtime to provide an easy export of transformers models to
the ONNX format. You can have a look at the effort by looking at our joint blog post `Accelerate your NLP pipelines
using Hugging Face Transformers and ONNX Runtime
<https://medium.com/microsoftazure/accelerate-your-nlp-pipelines-using-hugging-face-transformers-and-onnx-runtime-2443578f4333>`_.

Configuration-based approach
-----------------------------------------------------------------------------------------------------------------------

Transformers v4.9.0 introduces a new package: ``transformers.onnx``. This package allows converting checkpoints to an
ONNX graph by leveraging configuration objects. These configuration objects come ready-made for a number of model
architectures, and are made to be easily extendable to other architectures.

Ready-made configurations include the following models:

..
    This table is automatically generated by make style, do not fill manually!

- ALBERT
- BART
- BERT
- CamemBERT
- DistilBERT
- GPT Neo
- LayoutLM
- Longformer
- mBART
- OpenAI GPT-2
- RoBERTa
- T5
- XLM-RoBERTa

This conversion is handled with the PyTorch version of the models; it therefore requires PyTorch to be installed. If you
would like to be able to convert from TensorFlow, please let us know by opening an issue.

.. note::
    The models showcased here are close to fully feature complete, but do lack some features that are currently in
    development. Namely, the ability to handle the past key values for decoder models is currently in the works.


Converting an ONNX model using the ``transformers.onnx`` package
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The package may be used as a Python module:

.. code-block::

    python -m transformers.onnx --help

    usage: Hugging Face ONNX Exporter tool [-h] -m MODEL -f {pytorch} [--features {default}] [--opset OPSET] [--atol ATOL] output

    positional arguments:
      output                Path indicating where to store generated ONNX model.

    optional arguments:
      -h, --help            show this help message and exit
      -m MODEL, --model MODEL
                        Model's name or path on disk to load.
      --features {default}  Export the model with some additional features.
      --opset OPSET         ONNX opset version to export the model with (default 12).
      --atol ATOL           Absolute difference tolerance when validating the model.

Exporting a checkpoint using a ready-made configuration can be done as follows:

.. code-block::

    python -m transformers.onnx --model=bert-base-cased onnx/bert-base-cased/

This exports an ONNX graph of the mentioned checkpoint. Here it is ``bert-base-cased``, but it can be any model from the
hub, or a local path.

It will be exported under ``onnx/bert-base-cased``. You should see logs similar to the following:

.. code-block::

    Validating ONNX model...
            -[✓] ONNX model outputs' name match reference model ({'pooler_output', 'last_hidden_state'})
            - Validating ONNX Model output "last_hidden_state":
                    -[✓] (2, 8, 768) matches (2, 8, 768)
                    -[✓] all values close (atol: 0.0001)
            - Validating ONNX Model output "pooler_output":
                    -[✓] (2, 768) matches (2, 768)
                    -[✓] all values close (atol: 0.0001)
    All good, model saved at: onnx/bert-base-cased/model.onnx

This export can now be used in the ONNX inference runtime:

.. code-block::

    import onnxruntime as ort

    from transformers import BertTokenizerFast
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

    ort_session = ort.InferenceSession("onnx/bert-base-cased/model.onnx")

    inputs = tokenizer("Using BERT in ONNX!", return_tensors="np")
    outputs = ort_session.run(["last_hidden_state", "pooler_output"], dict(inputs))

The outputs used (:obj:`["last_hidden_state", "pooler_output"]`) can be obtained by taking a look at the ONNX
configuration of each model. For example, for BERT:

.. code-block::

    from transformers.models.bert import BertOnnxConfig, BertConfig

    config = BertConfig()
    onnx_config = BertOnnxConfig(config)
    output_keys = list(onnx_config.outputs.keys())

Implementing a custom configuration for an unsupported architecture
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Let's take a look at the changes necessary to add a custom configuration for an unsupported architecture. Firstly, we
will need a custom ONNX configuration object that details the model inputs and outputs. The BERT ONNX configuration is
visible below:

.. code-block::

    class BertOnnxConfig(OnnxConfig):
        @property
        def inputs(self) -> Mapping[str, Mapping[int, str]]:
            return OrderedDict(
                [
                    ("input_ids", {0: "batch", 1: "sequence"}),
                    ("attention_mask", {0: "batch", 1: "sequence"}),
                    ("token_type_ids", {0: "batch", 1: "sequence"}),
                ]
            )

        @property
        def outputs(self) -> Mapping[str, Mapping[int, str]]:
            return OrderedDict([("last_hidden_state", {0: "batch", 1: "sequence"}), ("pooler_output", {0: "batch"})])

Let's understand what's happening here. This configuration has two properties: the inputs, and the outputs.

The inputs return a dictionary, where each key corresponds to an expected input, and each value indicates the axis of
that input.

For BERT, there are three necessary inputs. These three inputs have the same shape, which is made up of two
dimensions: the batch is the first dimension, and the sequence is the second.

The outputs return a similar dictionary, where, once again, each key corresponds to an expected output, and each value
indicates the axis of that output.

Once this is done, a single step remains: adding this configuration object to the initialisation of the model class,
and to the general ``transformers`` initialisation.

An important fact to notice is the use of `OrderedDict` in both the inputs and outputs properties. This is a requirement,
as inputs are matched against their relative position within the `PreTrainedModel.forward()` prototype and outputs are
matched against their position in the returned `BaseModelOutputX` instance.
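
To illustrate why the ordering matters, here is a minimal sketch using a hypothetical ``forward()`` prototype standing
in for `PreTrainedModel.forward()`: the keys of the inputs `OrderedDict` must follow the positional order of the
corresponding forward arguments.

```python
import inspect
from collections import OrderedDict

# Hypothetical stand-in for PreTrainedModel.forward() (illustration only)
def forward(input_ids, attention_mask=None, token_type_ids=None):
    pass

# The keys of the inputs OrderedDict must follow the positional order
# of the corresponding forward() parameters.
onnx_inputs = OrderedDict(
    [
        ("input_ids", {0: "batch", 1: "sequence"}),
        ("attention_mask", {0: "batch", 1: "sequence"}),
        ("token_type_ids", {0: "batch", 1: "sequence"}),
    ]
)

forward_params = list(inspect.signature(forward).parameters)
assert list(onnx_inputs) == forward_params[: len(onnx_inputs)]
```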

An example of such an addition is visible here, for the MBart model: `Making MBART ONNX-convertible
<https://github.com/huggingface/transformers/pull/13049/commits/d097adcebd89a520f04352eb215a85916934204f>`__

If you would like to contribute your addition to the library, we recommend you implement tests. An example of such
tests is visible here: `Adding tests to the MBART ONNX conversion
<https://github.com/huggingface/transformers/pull/13049/commits/5d642f65abf45ceeb72bd855ca7bfe2506a58e6a>`__

Graph conversion
-----------------------------------------------------------------------------------------------------------------------

.. note::
    The approach detailed here is being deprecated. We recommend you follow the section above for an up-to-date approach.


Exporting a model is done through the script `convert_graph_to_onnx.py` at the root of the transformers sources. The
following command shows how easy it is to export a BERT model from the library; simply run:

.. code-block:: bash

    python convert_graph_to_onnx.py --framework <pt, tf> --model bert-base-cased bert-base-cased.onnx

The conversion tool works for both PyTorch and TensorFlow models and ensures:

* The model and its weights are correctly initialized from the Hugging Face model hub or a local checkpoint.
* The inputs and outputs are correctly mapped to their ONNX counterparts.
* The generated model can be correctly loaded through onnxruntime.

.. note::
    Currently, inputs and outputs are always exported with dynamic sequence axes, which prevents some optimizations in
    ONNX Runtime. If you would like to see support for fixed-length inputs/outputs, please open up an issue on
    transformers.


Also, the conversion tool supports different options which let you tune the behavior of the generated model:

* **Change the target opset version of the generated model.** (More recent opsets generally support more operators and
  enable faster inference)

* **Export pipeline-specific prediction heads.** (Allows exporting the model along with its task-specific prediction
  head(s))

* **Use the external data format (PyTorch only).** (Lets you export models whose size is above 2GB (`More info
  <https://github.com/pytorch/pytorch/pull/33062>`_))


Optimizations
-----------------------------------------------------------------------------------------------------------------------

ONNXRuntime includes some transformers-specific transformations to leverage optimized operations in the graph. Below
are some of the transformations which can be enabled to speed up inference through ONNXRuntime (*see note below*):

* Constant folding
* Attention Layer fusing
* Skip connection LayerNormalization fusing
* FastGeLU approximation

Some of the optimizations performed by ONNX Runtime can be hardware-specific and thus lead to different performance if
used on another machine with a different hardware configuration than the one used for exporting the model. For this
reason, when using ``convert_graph_to_onnx.py``, optimizations are not enabled, ensuring the model can be easily
exported to various hardware. Optimizations can then be enabled when loading the model through ONNX Runtime for
inference.


.. note::
    When quantization is enabled (see below), the ``convert_graph_to_onnx.py`` script will enable optimizations on the
    model, because quantization would modify the underlying graph, making it impossible for ONNX Runtime to perform the
    optimizations afterwards.

.. note::
    For more information about the optimizations enabled by ONNXRuntime, please have a look at the `ONNXRuntime GitHub
    <https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers>`_.

Quantization
-----------------------------------------------------------------------------------------------------------------------

The ONNX exporter supports generating a quantized version of the model to allow efficient inference.

Quantization works by converting the memory representation of the parameters in the neural network to a compact integer
format. By default, weights of a neural network are stored as single-precision floats (`float32`), which can express a
wide range of floating-point numbers with decent precision. These properties are especially interesting during training,
where you want fine-grained representations.

On the other hand, after the training phase, it has been shown that one can greatly reduce the range and the precision
of `float32` numbers without changing the performance of the neural network.

More technically, `float32` parameters are converted to a type requiring fewer bits to represent each number, thus
reducing the overall size of the model. Here, we are enabling `float32` mapping to `int8` values (a non-floating-point,
single-byte number representation) according to the following formula:

.. math::
    y_{float32} = scale * x_{int8} - zero\_point

.. note::
    The quantization process will infer the parameters `scale` and `zero_point` from the neural network parameters.

Leveraging tiny-integers has numerous advantages when it comes to inference:

* Storing 8 bits per parameter instead of the 32 bits of `float32` reduces the size of the model and makes it load faster.
* Integer operations execute significantly faster on modern hardware.
* Integer operations require less power to do the computations.
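
The affine mapping above can be sketched in a few lines of NumPy. This is only an illustration of the formula, not the
actual quantization routine used by the exporter:

```python
import numpy as np

def quantize(y):
    """Map float32 values to int8 following y_float32 = scale * x_int8 - zero_point."""
    y_min, y_max = float(y.min()), float(y.max())
    scale = (y_max - y_min) / 255  # spread the observed range over the 256 int8 values
    x_int8 = (np.round((y - y_min) / scale) - 128).astype(np.int8)
    zero_point = -(y_min + 128 * scale)
    return x_int8, scale, zero_point

def dequantize(x_int8, scale, zero_point):
    return scale * x_int8.astype(np.float32) - zero_point

weights = np.linspace(-0.5, 0.5, num=1000, dtype=np.float32)
x_int8, scale, zero_point = quantize(weights)
recovered = dequantize(x_int8, scale, zero_point)

# The reconstruction error is bounded by half a quantization step
assert float(np.abs(weights - recovered).max()) <= scale / 2 + 1e-6
```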

In order to convert a transformers model to ONNX IR with quantized weights you just need to specify ``--quantize`` when
using ``convert_graph_to_onnx.py``. Also, you can have a look at the ``quantize()`` utility method in this same script
file.

Example of quantized BERT model export:

.. code-block:: bash

    python convert_graph_to_onnx.py --framework <pt, tf> --model bert-base-cased --quantize bert-base-cased.onnx

.. note::
    Quantization support requires ONNX Runtime >= 1.4.0

.. note::
    When exporting a quantized model you will end up with two different ONNX files. The one specified at the end of the
    above command will contain the original ONNX model storing `float32` weights. The second one, with the ``-quantized``
    suffix, will hold the quantized parameters.


TorchScript
=======================================================================================================================

.. note::
    This is the very beginning of our experiments with TorchScript and we are still exploring its capabilities with
    variable-input-size models. It is a focus of interest to us and we will deepen our analysis in upcoming releases,
    with more code examples, a more flexible implementation, and benchmarks comparing Python-based code with compiled
    TorchScript.


According to PyTorch's documentation: "TorchScript is a way to create serializable and optimizable models from PyTorch
code". PyTorch's two modules `JIT and TRACE <https://pytorch.org/docs/stable/jit.html>`_ allow the developer to export
their model to be re-used in other programs, such as efficiency-oriented C++ programs.

We have provided an interface that allows the export of 🤗 Transformers models to TorchScript so that they can be reused
in a different environment than a PyTorch-based Python program. Here we explain how to export and use our models using
TorchScript.

Exporting a model requires two things:

* a forward pass with dummy inputs.
* model instantiation with the ``torchscript`` flag.

These necessities imply several things developers should be careful about. These are detailed below.


Implications
-----------------------------------------------------------------------------------------------------------------------

TorchScript flag and tied weights
-----------------------------------------------------------------------------------------------------------------------

This flag is necessary because most of the language models in this repository have tied weights between their
``Embedding`` layer and their ``Decoding`` layer. TorchScript does not allow the export of models that have tied
weights, therefore it is necessary to untie and clone the weights beforehand.

This implies that models instantiated with the ``torchscript`` flag have their ``Embedding`` layer and ``Decoding``
layer separate, which means that they should not be trained down the line. Training would de-synchronize the two
layers, leading to unexpected results.

This is not the case for models that do not have a Language Model head, as those do not have tied weights. These models
can be safely exported without the ``torchscript`` flag.

Dummy inputs and standard lengths
-----------------------------------------------------------------------------------------------------------------------

The dummy inputs are used to do a model forward pass. While the inputs' values are propagating through the layers,
PyTorch keeps track of the different operations executed on each tensor. These recorded operations are then used to
create the "trace" of the model.

The trace is created relative to the inputs' dimensions. It is therefore constrained by the dimensions of the dummy
input, and will not work for any other sequence length or batch size. When trying with a different size, an error such
as:

``The expanded size of the tensor (3) must match the existing size (7) at non-singleton dimension 2``

will be raised. It is therefore recommended to trace the model with a dummy input size at least as large as the largest
input that will be fed to the model during inference. Padding can be performed to fill the missing values. As the model
will have been traced with a large input size however, the dimensions of the different matrices will be large as well,
resulting in more calculations.

It is recommended to be careful of the total number of operations done on each input and to follow performance closely
when exporting varying sequence-length models.
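
As an illustration, padding token ids up to the traced length can be done by hand. This is a minimal sketch with
hypothetical token ids, a hypothetical traced length of 16, and a pad token id of 0; the actual values depend on your
tokenizer and on the dummy input used for tracing:

```python
# Sequence length of the dummy input used when tracing the model (assumption)
trace_length = 16
# Padding token id; 0 is common but depends on the tokenizer (assumption)
pad_token_id = 0

# Hypothetical token ids for a short input sequence
token_ids = [101, 2040, 2001, 3958, 27227, 1029, 102]
attention_mask = [1] * len(token_ids)

# Pad up to the traced length so the input matches the trace's dimensions;
# the attention mask marks the padded positions so they are ignored.
padding = trace_length - len(token_ids)
token_ids = token_ids + [pad_token_id] * padding
attention_mask = attention_mask + [0] * padding

assert len(token_ids) == trace_length and len(attention_mask) == trace_length
```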

Using TorchScript in Python
-----------------------------------------------------------------------------------------------------------------------

Below is an example showing how to save and load models, as well as how to use the trace for inference.

Saving a model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This snippet shows how to use TorchScript to export a ``BertModel``. Here the ``BertModel`` is instantiated according
to a ``BertConfig`` class, and then saved to disk under the filename ``traced_bert.pt``.

.. code-block:: python

    from transformers import BertModel, BertTokenizer, BertConfig
    import torch

    enc = BertTokenizer.from_pretrained("bert-base-uncased")

    # Tokenizing input text
    text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
    tokenized_text = enc.tokenize(text)

    # Masking one of the input tokens
    masked_index = 8
    tokenized_text[masked_index] = '[MASK]'
    indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
    segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

    # Creating a dummy input
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])
    dummy_input = [tokens_tensor, segments_tensors]

    # Initializing the model with the torchscript flag
    # Flag set to True even though it is not necessary as this model does not have an LM Head.
    config = BertConfig(vocab_size=32000, hidden_size=768,
        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, torchscript=True)

    # Instantiating the model
    model = BertModel(config)

    # The model needs to be in evaluation mode
    model.eval()

    # If you are instantiating the model with `from_pretrained` you can also easily set the TorchScript flag
    model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)

    # Creating the trace
    traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors])
    torch.jit.save(traced_model, "traced_bert.pt")

Loading a model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This snippet shows how to load the ``BertModel`` that was previously saved to disk under the name ``traced_bert.pt``.
We are re-using the previously initialised ``dummy_input``.

.. code-block:: python

    loaded_model = torch.jit.load("traced_bert.pt")
    loaded_model.eval()

    all_encoder_layers, pooled_output = loaded_model(*dummy_input)

Using a traced model for inference
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Using the traced model for inference is as simple as using its ``__call__`` dunder method:

.. code-block:: python

    traced_model(tokens_tensor, segments_tensors)