<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Export 🤗 Transformers Models

If you need to deploy 🤗 Transformers models in production environments, we
recommend exporting them to a serialized format that can be loaded and executed
on specialized runtimes and hardware. In this guide, we'll show you how to
export 🤗 Transformers models in two widely used formats: ONNX and TorchScript.

Once exported, a model can be optimized for inference via techniques such as
quantization and pruning. If you are interested in optimizing your models to run
with maximum efficiency, check out the [🤗 Optimum
library](https://github.com/huggingface/optimum).

## ONNX

The [ONNX (Open Neural Network eXchange)](http://onnx.ai) project is an open
standard that defines a common set of operators and a common file format to
represent deep learning models in a wide variety of frameworks, including
PyTorch and TensorFlow. When a model is exported to the ONNX format, these
operators are used to construct a computational graph (often called an
_intermediate representation_) which represents the flow of data through the
neural network.

By exposing a graph with standardized operators and data types, ONNX makes it
easy to switch between frameworks. For example, a model trained in PyTorch can
be exported to ONNX format and then imported in TensorFlow (and vice versa).
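
To get a feel for this intermediate representation, you can inspect the operators of an exported graph with the
`onnx` Python package. The snippet below is a minimal sketch; it assumes you already have an exported file on disk
(the `onnx/model.onnx` path matches the export produced later in this guide):

```python
import onnx

# Load the serialized graph from disk
onnx_model = onnx.load("onnx/model.onnx")

# Every node in the graph is an instance of a standardized ONNX operator
print({node.op_type for node in onnx_model.graph.node})
```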

馃 Transformers provides a `transformers.onnx` package that enables you to
convert model checkpoints to an ONNX graph by leveraging configuration objects.
These configuration objects come ready made for a number of model architectures,
and are designed to be easily extendable to other architectures.

Ready-made configurations include the following architectures:

<!--This table is automatically generated by `make fix-copies`, do not fill manually!-->

- ALBERT
- BART
- BEiT
- BERT
- BigBird
- BigBird-Pegasus
- Blenderbot
- BlenderbotSmall
- BLOOM
- CamemBERT
- CLIP
- CodeGen
- ConvBERT
- ConvNeXT
- Data2VecText
- Data2VecVision
- DeBERTa
- DeBERTa-v2
- DeiT
- DETR
- DistilBERT
- ELECTRA
- ERNIE
- FlauBERT
- GPT Neo
- GPT-J
- GroupViT
- I-BERT
- LayoutLM
- LayoutLMv3
- LeViT
- Longformer
- LongT5
- M2M100
- Marian
- mBART
- MobileBERT
- MobileViT
- MT5
- OpenAI GPT-2
- OWL-ViT
- Perceiver
- PLBart
- ResNet
- RoBERTa
- RoFormer
- SegFormer
- SqueezeBERT
- T5
- ViT
- XLM
- XLM-RoBERTa
- XLM-RoBERTa-XL
- YOLOS

In the next two sections, we'll show you how to:

* Export a supported model using the `transformers.onnx` package.
* Export a custom model for an unsupported architecture.

### Exporting a model to ONNX

To export a 🤗 Transformers model to ONNX, you'll first need to install some
extra dependencies:

```bash
pip install transformers[onnx]
```

The `transformers.onnx` package can then be used as a Python module:

```bash
python -m transformers.onnx --help

usage: Hugging Face Transformers ONNX exporter [-h] -m MODEL [--feature {causal-lm, ...}] [--opset OPSET] [--atol ATOL] output

positional arguments:
  output                Path indicating where to store generated ONNX model.

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Model ID on huggingface.co or path on disk to load model from.
  --feature {causal-lm, ...}
                        The type of features to export the model with.
  --opset OPSET         ONNX opset version to export the model with.
  --atol ATOL           Absolute difference tolerence when validating the model.
```

Exporting a checkpoint using a ready-made configuration can be done as follows:

```bash
python -m transformers.onnx --model=distilbert-base-uncased onnx/
```

which should show the following logs:

```bash
Validating ONNX model...
        -[✓] ONNX model output names match reference model ({'last_hidden_state'})
        - Validating ONNX Model output "last_hidden_state":
                -[✓] (2, 8, 768) matches (2, 8, 768)
                -[✓] all values close (atol: 1e-05)
All good, model saved at: onnx/model.onnx
```

This exports an ONNX graph of the checkpoint defined by the `--model` argument.
In this example it is `distilbert-base-uncased`, but it can be any checkpoint on
the Hugging Face Hub or one that's stored locally.

The resulting `model.onnx` file can then be run on one of the [many
accelerators](https://onnx.ai/supported-tools.html#deployModel) that support the
ONNX standard. For example, we can load and run the model with [ONNX
Runtime](https://onnxruntime.ai/) as follows:

```python
>>> from transformers import AutoTokenizer
>>> from onnxruntime import InferenceSession

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
>>> session = InferenceSession("onnx/model.onnx")
>>> # ONNX Runtime expects NumPy arrays as input
>>> inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np")
>>> outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))
```

The required output names (i.e. `["last_hidden_state"]`) can be obtained by
taking a look at the ONNX configuration of each model. For example, for
DistilBERT we have:

```python
>>> from transformers.models.distilbert import DistilBertConfig, DistilBertOnnxConfig

>>> config = DistilBertConfig()
>>> onnx_config = DistilBertOnnxConfig(config)
>>> print(list(onnx_config.outputs.keys()))
["last_hidden_state"]
```
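
If you prefer not to dig into the configuration classes, the same information is available at runtime: ONNX
Runtime's `InferenceSession` exposes the inputs and outputs of the loaded graph. A small sketch against the same
`onnx/model.onnx` file:

```python
from onnxruntime import InferenceSession

session = InferenceSession("onnx/model.onnx")

# Read the graph signature directly instead of going through the ONNX configuration
print([graph_input.name for graph_input in session.get_inputs()])  # e.g. ['input_ids', 'attention_mask']
print([graph_output.name for graph_output in session.get_outputs()])  # e.g. ['last_hidden_state']
```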

The process is identical for TensorFlow checkpoints on the Hub. For example, we
can export a pure TensorFlow checkpoint from the [Keras
organization](https://huggingface.co/keras-io) as follows:

```bash
python -m transformers.onnx --model=keras-io/transformers-qa onnx/
```

To export a model that's stored locally, you'll need to have the model's weights
and tokenizer files stored in a directory. For example, we can load and save a
checkpoint as follows:

<frameworkcontent>
<pt>
```python
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification

>>> # Load tokenizer and PyTorch weights from the Hub
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
>>> pt_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
>>> # Save to disk
>>> tokenizer.save_pretrained("local-pt-checkpoint")
>>> pt_model.save_pretrained("local-pt-checkpoint")
```

Once the checkpoint is saved, we can export it to ONNX by pointing the `--model`
argument of the `transformers.onnx` package to the desired directory:

```bash
python -m transformers.onnx --model=local-pt-checkpoint onnx/
```
</pt>
<tf>
```python
>>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

>>> # Load tokenizer and TensorFlow weights from the Hub
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
>>> # Save to disk
>>> tokenizer.save_pretrained("local-tf-checkpoint")
>>> tf_model.save_pretrained("local-tf-checkpoint")
```

Once the checkpoint is saved, we can export it to ONNX by pointing the `--model`
argument of the `transformers.onnx` package to the desired directory:

```bash
python -m transformers.onnx --model=local-tf-checkpoint onnx/
```
</tf>
</frameworkcontent>

### Selecting features for different model topologies

Each ready-made configuration comes with a set of _features_ that enable you to
export models for different types of topologies or tasks. As shown in the table
below, each feature is associated with a different auto class:

| Feature                              | Auto Class                           |
| ------------------------------------ | ------------------------------------ |
| `causal-lm`, `causal-lm-with-past`   | `AutoModelForCausalLM`               |
| `default`, `default-with-past`       | `AutoModel`                          |
| `masked-lm`                          | `AutoModelForMaskedLM`               |
| `question-answering`                 | `AutoModelForQuestionAnswering`      |
| `seq2seq-lm`, `seq2seq-lm-with-past` | `AutoModelForSeq2SeqLM`              |
| `sequence-classification`            | `AutoModelForSequenceClassification` |
| `token-classification`               | `AutoModelForTokenClassification`    |

For each configuration, you can find the list of supported features via the
`FeaturesManager`. For example, for DistilBERT we have:

```python
>>> from transformers.onnx.features import FeaturesManager

>>> distilbert_features = list(FeaturesManager.get_supported_features_for_model_type("distilbert").keys())
>>> print(distilbert_features)
["default", "masked-lm", "causal-lm", "sequence-classification", "token-classification", "question-answering"]
```
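
Each feature is resolved to one of the auto classes above when the exporter loads the checkpoint. As a sketch
(assuming the `FeaturesManager.get_model_class_for_feature` helper, which recent versions of 🤗 Transformers
provide), you can check which class a given feature maps to:

```python
from transformers.onnx.features import FeaturesManager

# Resolve the auto class that the exporter will use to load the checkpoint for this feature
model_class = FeaturesManager.get_model_class_for_feature("sequence-classification")
print(model_class.__name__)  # AutoModelForSequenceClassification
```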

You can then pass one of these features to the `--feature` argument in the
`transformers.onnx` package. For example, to export a text-classification model
we can pick a fine-tuned model from the Hub and run:

```bash
python -m transformers.onnx --model=distilbert-base-uncased-finetuned-sst-2-english \
                            --feature=sequence-classification onnx/
```

which will display the following logs:

```bash
Validating ONNX model...
        -[✓] ONNX model output names match reference model ({'logits'})
        - Validating ONNX Model output "logits":
                -[✓] (2, 2) matches (2, 2)
                -[✓] all values close (atol: 1e-05)
All good, model saved at: onnx/model.onnx
```

Notice that in this case, the output names from the fine-tuned model are
`logits` instead of the `last_hidden_state` we saw with the
`distilbert-base-uncased` checkpoint earlier. This is expected since the
fine-tuned model has a sequence classification head.
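
As a quick sanity check, here is a sketch of running the exported classifier with ONNX Runtime. It assumes the
export above succeeded and that `onnx/model.onnx` now contains the sequence classification graph:

```python
import numpy as np
from onnxruntime import InferenceSession
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
session = InferenceSession("onnx/model.onnx")

# The classification head exposes a single "logits" output
inputs = tokenizer("ONNX Runtime is a joy to work with!", return_tensors="np")
logits = session.run(output_names=["logits"], input_feed=dict(inputs))[0]
predicted_class_id = int(np.argmax(logits, axis=-1)[0])  # 0 = NEGATIVE, 1 = POSITIVE for this checkpoint
```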

<Tip>

The features that have a `with-past` suffix (e.g. `causal-lm-with-past`)
correspond to model topologies with precomputed hidden states (keys and values
in the attention blocks) that can be used for fast autoregressive decoding.

</Tip>

### Exporting a model for an unsupported architecture

If you wish to export a model whose architecture is not natively supported by
the library, there are three main steps to follow:

1. Implement a custom ONNX configuration.
2. Export the model to ONNX.
3. Validate the outputs of the PyTorch and exported models.

In this section, we'll look at how DistilBERT was implemented to show what's
involved with each step.

#### Implementing a custom ONNX configuration

Let's start with the ONNX configuration object. We provide three abstract
classes that you should inherit from, depending on the type of model
architecture you wish to export:

* Encoder-based models inherit from [`~onnx.config.OnnxConfig`]
* Decoder-based models inherit from [`~onnx.config.OnnxConfigWithPast`]
* Encoder-decoder models inherit from [`~onnx.config.OnnxSeq2SeqConfigWithPast`]
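
For instance, the ready-made configuration for GPT-2 targets a decoder-based model, so it builds on
`OnnxConfigWithPast`. A small sketch to illustrate (import paths as in recent versions of 🤗 Transformers):

```python
from transformers import GPT2Config
from transformers.models.gpt2 import GPT2OnnxConfig
from transformers.onnx import OnnxConfigWithPast

# GPT-2 is decoder-based, so its ready-made ONNX configuration inherits from OnnxConfigWithPast
onnx_config = GPT2OnnxConfig(GPT2Config())
print(isinstance(onnx_config, OnnxConfigWithPast))  # True
```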

<Tip>

A good way to implement a custom ONNX configuration is to look at the existing
implementation in the `configuration_<model_name>.py` file of a similar architecture.

</Tip>

Since DistilBERT is an encoder-based model, its configuration inherits from
`OnnxConfig`:

```python
>>> from typing import Mapping, OrderedDict
>>> from transformers.onnx import OnnxConfig


>>> class DistilBertOnnxConfig(OnnxConfig):
...     @property
...     def inputs(self) -> Mapping[str, Mapping[int, str]]:
...         return OrderedDict(
...             [
...                 ("input_ids", {0: "batch", 1: "sequence"}),
...                 ("attention_mask", {0: "batch", 1: "sequence"}),
...             ]
...         )
```

Every configuration object must implement the `inputs` property and return a
mapping, where each key corresponds to an expected input, and each value
indicates the axis of that input. For DistilBERT, we can see that two inputs are
required: `input_ids` and `attention_mask`. Both inputs have the same shape of
`(batch_size, sequence_length)`, which is why we see the same axes used in the
configuration.

<Tip>

Notice that the `inputs` property for `DistilBertOnnxConfig` returns an
`OrderedDict`. This ensures that the inputs are matched with their relative
position within the `PreTrainedModel.forward()` method when tracing the graph.
We recommend using an `OrderedDict` for the `inputs` and `outputs` properties
when implementing custom ONNX configurations.

</Tip>

Once you have implemented an ONNX configuration, you can instantiate it by
providing the base model's configuration as follows:

```python
>>> from transformers import AutoConfig

>>> config = AutoConfig.from_pretrained("distilbert-base-uncased")
>>> onnx_config = DistilBertOnnxConfig(config)
```

The resulting object has several useful properties. For example, you can view the
ONNX operator set that will be used during the export:

```python
>>> print(onnx_config.default_onnx_opset)
11
```

You can also view the outputs associated with the model as follows:

```python
>>> print(onnx_config.outputs)
OrderedDict([("last_hidden_state", {0: "batch", 1: "sequence"})])
```

Notice that the `outputs` property follows the same structure as the inputs; it
returns an `OrderedDict` of named outputs and their shapes. The output structure
is linked to the choice of feature that the configuration is initialized with.
By default, the ONNX configuration is initialized with the `default` feature
that corresponds to exporting a model loaded with the `AutoModel` class. If you
want to export a different model topology, just provide a different feature to
the `task` argument when you initialize the ONNX configuration. For example, if
we wished to export DistilBERT with a sequence classification head, we could
use:

```python
>>> from transformers import AutoConfig

>>> config = AutoConfig.from_pretrained("distilbert-base-uncased")
>>> onnx_config_for_seq_clf = DistilBertOnnxConfig(config, task="sequence-classification")
>>> print(onnx_config_for_seq_clf.outputs)
OrderedDict([('logits', {0: 'batch'})])
```

<Tip>

All of the base properties and methods associated with [`~onnx.config.OnnxConfig`] and the
other configuration classes can be overridden if needed. Check out
[`BartOnnxConfig`] for an advanced example.

</Tip>

#### Exporting the model

Once you have implemented the ONNX configuration, the next step is to export the
model. Here we can use the `export()` function provided by the
`transformers.onnx` package. This function expects the ONNX configuration, along
with the base model and tokenizer, and the path to save the exported file:

```python
>>> from pathlib import Path
>>> from transformers.onnx import export
>>> from transformers import AutoTokenizer, AutoModel

>>> onnx_path = Path("model.onnx")
>>> model_ckpt = "distilbert-base-uncased"
>>> base_model = AutoModel.from_pretrained(model_ckpt)
>>> tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

>>> onnx_inputs, onnx_outputs = export(tokenizer, base_model, onnx_config, onnx_config.default_onnx_opset, onnx_path)
```

The `onnx_inputs` and `onnx_outputs` returned by the `export()` function are
lists of the keys defined in the `inputs` and `outputs` properties of the
configuration. Once the model is exported, you can test that the model is well
formed as follows:

```python
>>> import onnx

>>> onnx_model = onnx.load("model.onnx")
>>> onnx.checker.check_model(onnx_model)
```

<Tip>

If your model is larger than 2GB, you will see that many additional files are
created during the export. This is _expected_ because ONNX uses [Protocol
Buffers](https://developers.google.com/protocol-buffers/) to store the model and
these have a size limit of 2GB. See the [ONNX
documentation](https://github.com/onnx/onnx/blob/master/docs/ExternalData.md)
for instructions on how to load models with external data.

</Tip>
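
For reference, here is a rough sketch of loading such a model back with the `onnx` package. It assumes the
external data files were kept next to `model.onnx`:

```python
import onnx

# The external weight files are picked up automatically when they sit next to model.onnx
onnx_model = onnx.load("model.onnx", load_external_data=True)

# For models above 2GB, pass the path (rather than the loaded proto) to the checker
onnx.checker.check_model("model.onnx")
```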

#### Validating the model outputs

The final step is to validate that the outputs from the base and exported model
agree within some absolute tolerance. Here we can use the
`validate_model_outputs()` function provided by the `transformers.onnx` package
as follows:

```python
>>> from transformers.onnx import validate_model_outputs

>>> validate_model_outputs(
...     onnx_config, tokenizer, base_model, onnx_path, onnx_outputs, onnx_config.atol_for_validation
... )
```

This function uses the `OnnxConfig.generate_dummy_inputs()` method to generate
inputs for the base and exported model, and the absolute tolerance can be
defined in the configuration. We generally find numerical agreement in the 1e-6
to 1e-4 range, although anything smaller than 1e-3 is likely to be OK.
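
If you want to inspect those dummy inputs yourself, a sketch along these lines should work (the `framework`
argument takes a `TensorType` value; the exact keyword arguments are an assumption based on recent versions of
🤗 Transformers):

```python
from transformers import AutoTokenizer, TensorType

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# The same kind of dummy batch that validate_model_outputs() feeds to both models
dummy_inputs = onnx_config.generate_dummy_inputs(tokenizer, framework=TensorType.PYTORCH)
print(list(dummy_inputs.keys()))  # expected: ['input_ids', 'attention_mask']
```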

### Contributing a new configuration to 🤗 Transformers

We are looking to expand the set of ready-made configurations and welcome
contributions from the community! If you would like to contribute your addition
to the library, you will need to:

* Implement the ONNX configuration in the corresponding `configuration_<model_name>.py`
file
* Include the model architecture and corresponding features in [`~onnx.features.FeaturesManager`]
* Add your model architecture to the tests in `test_onnx_v2.py`

Check out how the configuration for [IBERT was
contributed](https://github.com/huggingface/transformers/pull/14868/files) to
get an idea of what's involved.

## TorchScript

<Tip>

This is the very beginning of our experiments with TorchScript and we are still exploring its capabilities with
variable-input-size models. It is a focus of interest to us and we will deepen our analysis in upcoming releases,
with more code examples, a more flexible implementation, and benchmarks comparing Python-based code with compiled
TorchScript.

</Tip>

According to PyTorch's documentation: "TorchScript is a way to create serializable and optimizable models from PyTorch
code". PyTorch's two modules [JIT and TRACE](https://pytorch.org/docs/stable/jit.html) allow the developer to export
their model to be re-used in other programs, such as efficiency-oriented C++ programs.

We have provided an interface that allows the export of 🤗 Transformers models to TorchScript so that they can be reused
in a different environment than a PyTorch-based Python program. Here we explain how to export and use our models using
TorchScript.

Exporting a model requires two things:

- a forward pass with dummy inputs.
- model instantiation with the `torchscript` flag.

These necessities imply several things developers should be careful about. These are detailed below.

### TorchScript flag and tied weights

This flag is necessary because most of the language models in this repository have tied weights between their
`Embedding` layer and their `Decoding` layer. TorchScript does not allow the export of models that have tied
weights, so it is necessary to untie and clone the weights beforehand.

This implies that models instantiated with the `torchscript` flag have their `Embedding` layer and `Decoding`
layer separate, which means that they should not be trained down the line. Training would de-synchronize the two
layers, leading to unexpected results.

This is not the case for models that do not have a Language Model head, as those do not have tied weights. These models
can be safely exported without the `torchscript` flag.
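
As a concrete illustration (a sketch, not part of the original example below), a model with a language modeling
head has to be loaded with the flag, while a headless `BertModel` can be exported without it:

```python
from transformers import BertForMaskedLM, BertModel

# Tied input/output embeddings: the flag unties and clones them so the model can be traced
lm_model = BertForMaskedLM.from_pretrained("bert-base-uncased", torchscript=True)

# No language model head, no tied weights: the flag is optional here
base_model = BertModel.from_pretrained("bert-base-uncased")
```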

### Dummy inputs and standard lengths

The dummy inputs are used to do a model forward pass. While the inputs' values are propagating through the layers,
PyTorch keeps track of the different operations executed on each tensor. These recorded operations are then used to
create the "trace" of the model.

The trace is created relative to the inputs' dimensions. It is therefore constrained by the dimensions of the dummy
input, and will not work for any other sequence length or batch size. When trying with a different size, an error such
as:

`The expanded size of the tensor (3) must match the existing size (7) at non-singleton dimension 2`

will be raised. It is therefore recommended to trace the model with a dummy input size at least as large as the largest
input that will be fed to the model during inference. Padding can be performed to fill the missing values. However, as
the model will have been traced with a large input size, the dimensions of the different matrices will be large as well,
resulting in more calculations.

It is recommended to be careful of the total number of operations done on each input and to follow performance closely
when exporting varying sequence-length models.
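
In practice, this means tracing with a dummy input padded to the largest length you plan to serve, and padding
every later input to that same length. A sketch of this approach (the 512-token maximum is an assumption, pick
whatever your workload requires):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)
model.eval()

# Pad the dummy input to the largest size that will ever be fed to the traced model
dummy = tokenizer("Who was Jim Henson?", padding="max_length", max_length=512, return_tensors="pt")
traced_model = torch.jit.trace(model, [dummy["input_ids"], dummy["attention_mask"]])

# Shorter sequences reuse the same trace as long as they are padded to the same length
short = tokenizer("Jim Henson was a puppeteer", padding="max_length", max_length=512, return_tensors="pt")
outputs = traced_model(short["input_ids"], short["attention_mask"])
```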

### Using TorchScript in Python

Below are examples showing how to save and load models, as well as how to use the trace for inference.

#### Saving a model

This snippet shows how to use TorchScript to export a `BertModel`. Here the `BertModel` is instantiated according
to a `BertConfig` class and then saved to disk under the filename `traced_bert.pt`.

```python
from transformers import BertModel, BertTokenizer, BertConfig
import torch

enc = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenizing input text
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = enc.tokenize(text)

# Masking one of the input tokens
masked_index = 8
tokenized_text[masked_index] = "[MASK]"
indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Creating a dummy input
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
dummy_input = [tokens_tensor, segments_tensors]

# Initializing the model with the torchscript flag
# Flag set to True even though it is not necessary as this model does not have an LM Head.
config = BertConfig(
    vocab_size=32000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    torchscript=True,
)

# Instantiating the model
model = BertModel(config)

# The model needs to be in evaluation mode
model.eval()

# If you are instantiating the model with *from_pretrained* you can also easily set the TorchScript flag
model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)

# Creating the trace
traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors])
torch.jit.save(traced_model, "traced_bert.pt")
```

#### Loading a model

This snippet shows how to load the `BertModel` that was previously saved to disk under the name `traced_bert.pt`.
We are re-using the previously initialized `dummy_input`.

```python
loaded_model = torch.jit.load("traced_bert.pt")
loaded_model.eval()

all_encoder_layers, pooled_output = loaded_model(*dummy_input)
```

#### Using a traced model for inference

Using the traced model for inference is as simple as using its `__call__` dunder method:

```python
traced_model(tokens_tensor, segments_tensors)
```

### Deploying HuggingFace TorchScript models on AWS using the Neuron SDK

AWS introduced the [Amazon EC2 Inf1](https://aws.amazon.com/ec2/instance-types/inf1/)
instance family for low cost, high performance machine learning inference in the cloud.
The Inf1 instances are powered by the AWS Inferentia chip, a custom-built hardware accelerator,
specializing in deep learning inferencing workloads.
[AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/#)
is the SDK for Inferentia that supports tracing and optimizing transformers models for
deployment on Inf1. The Neuron SDK provides:


1. Easy-to-use API with one line of code change to trace and optimize a TorchScript model for inference in the cloud.
2. Out of the box performance optimizations for [improved cost-performance](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/benchmark/).
3. Support for HuggingFace transformers models built with either [PyTorch](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/bert_tutorial/tutorial_pretrained_bert.html)
   or [TensorFlow](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/tensorflow/huggingface_bert/huggingface_bert.html).

#### Implications

Transformers models based on the [BERT (Bidirectional Encoder Representations from Transformers)](https://huggingface.co/docs/transformers/main/model_doc/bert)
architecture, or its variants such as [distilBERT](https://huggingface.co/docs/transformers/main/model_doc/distilbert)
and [roBERTa](https://huggingface.co/docs/transformers/main/model_doc/roberta), will run best on Inf1 for non-generative
tasks such as Extractive Question Answering, Sequence Classification, and Token Classification. Alternatively, text generation
tasks can be adapted to run on Inf1, according to this [AWS Neuron MarianMT tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/transformers-marianmt.html).
More information about models that can be converted out of the box on Inferentia can be
found in the [Model Architecture Fit section of the Neuron documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/models/models-inferentia.html#models-inferentia).

#### Dependencies

Using AWS Neuron to convert models requires the following dependencies and environment:

* A [Neuron SDK environment](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/pytorch-neuron/index.html#installation-guide),
  which comes pre-configured on [AWS Deep Learning AMI](https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-inferentia-launching.html).

#### Converting a Model for AWS Neuron

Using the same script as in [Using TorchScript in Python](https://huggingface.co/docs/transformers/main/en/serialization#using-torchscript-in-python)
to trace a `BertModel`, import the `torch.neuron` framework extension to access
the components of the Neuron SDK through a Python API.

```python
from transformers import BertModel, BertTokenizer, BertConfig
import torch
import torch.neuron
```

Then, modify only the tracing line of code

from:

```python
torch.jit.trace(model, [tokens_tensor, segments_tensors])
```

to:

```python
torch.neuron.trace(model, [tokens_tensor, segments_tensors])
```

This change enables the Neuron SDK to trace the model and optimize it to run on Inf1 instances.

To learn more about AWS Neuron SDK features, tools, example tutorials and latest updates,
please see the [AWS NeuronSDK documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html).