onnx_export.ipynb 14.6 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Export to ONNX and inference using TensorRT\n",
    "\n",
    "<div class=\"alert alert-info\">\n",
    "\n",
    "<b>Note:</b>\n",
    "\n",
    "Currently, export to ONNX is supported only for high precision, FP8 delayed scaling and MXFP8.\n",
    "\n",
    "</div>\n",
    "\n",
    "Transformer Engine (TE) is a library designed primarily for training DL models in low precision. It is not specifically optimized for inference tasks, so other dedicated solutions should be used. NVIDIA provides several [inference tools](https://www.nvidia.com/en-us/solutions/ai/inference/) that enhance the entire inference pipeline. Two prominent NVIDIA inference SDKs are [TensorRT](https://github.com/NVIDIA/TensorRT) and [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM).\n",
    "\n",
    "This tutorial illustrates how one can export a PyTorch model to ONNX format and subsequently perform inference with TensorRT. This approach is particularly beneficial if model integrates Transformer Engine layers within more complex architectures. It's important to highlight that for Transformer-based large language models (LLMs), TensorRT-LLM could provide a more optimized inference experience. However, the ONNX-to-TensorRT approach described here may be more suitable for other models, such as diffusion-based architectures or vision transformers.\n",
    "\n",
    "#### Creating models with TE\n",
    "\n",
    "Let's begin by defining a simple model composed of layers both from Transformer Engine and standard PyTorch:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch \n",
    "import torch.nn as nn\n",
    "import transformer_engine as te\n",
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\")\n",
    "\n",
    "# batch size, sequence length, hidden dimension\n",
    "B, S, H = 256, 512, 256\n",
    "\n",
    "class Model(torch.nn.Module):\n",
    "    def __init__(self, hidden_dim=H, num_non_te_layers=16, num_te_layers=4, num_te_heads=4):\n",
    "        super(Model, self).__init__()\n",
    "        self.non_te_part = nn.Sequential(\n",
    "            *[nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.GELU()) for _ in range(num_non_te_layers)]\n",
    "        )\n",
    "        self.te_part = nn.Sequential(\n",
    "            *[te.pytorch.TransformerLayer(hidden_dim, hidden_dim, num_te_heads) for _ in range(num_te_layers)]\n",
    "        )\n",
    "\n",
    "    def forward(self, x):\n",
    "        x = self.non_te_part(x)\n",
    "        return self.te_part(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's run some simple inference benchmarks:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average inference time FP32: 0.065 ms\n",
      "Average inference time FP8: 0.062 ms\n"
     ]
    }
   ],
   "source": [
    "from utils import  _measure_time\n",
    "\n",
    "model = Model().eval().cuda()\n",
    "inps = (torch.randn([S, B, H], device=\"cuda\"),)\n",
    "def _inference(fp8_enabled):\n",
    "     with torch.no_grad(), te.pytorch.fp8_autocast(enabled=fp8_enabled):\n",
    "        model(*inps)\n",
    "\n",
    "te_fp32_time = _measure_time(lambda: _inference(fp8_enabled=False))\n",
    "te_fp8_time = _measure_time(lambda: _inference(fp8_enabled=True))\n",
    "\n",
    "print(f\"Average inference time FP32: {te_fp32_time} ms\")\n",
    "print(f\"Average inference time FP8: {te_fp8_time} ms\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exporting the TE Model to ONNX Format\n",
    "\n",
    "PyTorch developed a new [ONNX exporter](https://pytorch.org/docs/stable/onnx.html) built on TorchDynamo and plans to phase out the existing TorchScript exporter. As this feature is currently in active development, we recommend running this process with the latest PyTorch version.\n",
    "\n",
    "\n",
    "To export a Transformer Engine model into ONNX format, follow these steps:\n",
    "\n",
    "- Conduct warm-up run within autocast using the recipe intended for export.\n",
    "- Encapsulate your export-related code within `te.onnx_export`, ensuring warm-up runs remain outside this wrapper.\n",
    "- Use the PyTorch Dynamo ONNX exporter by invoking: `torch.onnx.export(..., dynamo=True)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Exporting model_fp8.onnx\n",
      "[torch.onnx] Obtain model graph for `Model([...]` with `torch.export.export(..., strict=False)`...\n",
      "[torch.onnx] Obtain model graph for `Model([...]` with `torch.export.export(..., strict=False)`... ✅\n",
      "[torch.onnx] Run decomposition...\n",
      "[torch.onnx] Run decomposition... ✅\n",
      "[torch.onnx] Translate the graph into ONNX...\n",
      "[torch.onnx] Translate the graph into ONNX... ✅\n",
      "Applied 12 of general pattern rewrite rules.\n",
      "Exporting model_fp32.onnx\n",
      "[torch.onnx] Obtain model graph for `Model([...]` with `torch.export.export(..., strict=False)`...\n",
      "[torch.onnx] Obtain model graph for `Model([...]` with `torch.export.export(..., strict=False)`... ✅\n",
      "[torch.onnx] Run decomposition...\n",
      "[torch.onnx] Run decomposition... ✅\n",
      "[torch.onnx] Translate the graph into ONNX...\n",
      "[torch.onnx] Translate the graph into ONNX... ✅\n",
      "Applied 12 of general pattern rewrite rules.\n"
     ]
    }
   ],
   "source": [
    "from transformer_engine.pytorch.export import te_translation_table\n",
    "\n",
    "def export(model, fname, inputs, fp8=True):\n",
    "    with torch.no_grad(), te.pytorch.fp8_autocast(enabled=fp8):\n",
    "        # ! IMPORTANT !\n",
    "        # Transformer Engine models must have warm-up run\n",
    "        # before export. FP8 recipe during warm-up should  \n",
    "        # match the recipe used during export.\n",
    "        model(*inputs)\n",
    "    \n",
    "        # Only dynamo=True mode is supported;\n",
    "        # dynamo=False is deprecated and unsupported.\n",
    "        #\n",
    "        # te_translation_table contains necessary ONNX translations\n",
    "        # for FP8 quantize/dequantize operators.\n",
    "        print(f\"Exporting {fname}\")\n",
    "        with te.pytorch.onnx_export(enabled=True):\n",
    "            torch.onnx.export(\n",
    "                model,\n",
    "                inputs,\n",
    "                fname,\n",
    "                output_names=[\"output\"],\n",
    "                dynamo=True,\n",
    "                custom_translation_table=te_translation_table\n",
    "            )\n",
    "\n",
    "# Example usage:\n",
    "export(model, \"model_fp8.onnx\", inps, fp8=True)\n",
    "export(model, \"model_fp32.onnx\", inps, fp8=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Inference with TensorRT\n",
    "\n",
    "TensorRT is a high-performance deep learning inference optimizer and runtime developed by NVIDIA. It enables optimized deployment of neural network models by maximizing inference throughput and reducing latency on NVIDIA GPUs. TensorRT performs various optimization techniques, including layer fusion, precision calibration, kernel tuning, and memory optimization. \n",
    "For detailed information and documentation, refer to the official [TensorRT documentation](https://developer.nvidia.com/tensorrt).\n",
    "\n",
    "When using TensorRT, ONNX model must first be compiled into a TensorRT engine. This compilation step involves converting the ONNX model into an optimized representation tailored specifically to the target GPU platform. The compiled engine file can then be loaded into applications for rapid and efficient inference execution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "!trtexec --onnx=model_fp32.onnx --saveEngine=model_fp32.engine > output_fp32.log 2>&1\n",
    "!trtexec --onnx=model_fp8.onnx --saveEngine=model_fp8.engine > output_fp8.log 2>&1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's run the benchmarks for inference:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average inference time without TRT (FP32 for all layers):                       0.065 ms\n",
      "Average inference time without TRT (FP8 for TE layers, FP32 for non-TE layers): 0.062 ms, speedup = 1.05x\n",
      "Average inference time with TRT (FP32 for all layers):                          0.0500 ms, speedup = 1.30x\n",
      "Average inference time with TRT (FP8 for TE layers, FP32 for non-TE layers):    0.0470 ms, speedup = 1.38x\n"
     ]
    }
   ],
   "source": [
    "import tensorrt as trt\n",
    "\n",
    "# Output tensor is allocated - TRT needs static memory address.\n",
    "output_tensor = torch.empty_like(model(*inps))\n",
    "\n",
    "# Loads TRT engine from file.\n",
    "def load_engine(engine_file_path):\n",
    "    logger = trt.Logger(trt.Logger.WARNING)\n",
    "    runtime = trt.Runtime(logger)\n",
    "    \n",
    "    with open(engine_file_path, \"rb\") as f:\n",
    "        engine_data = f.read()\n",
    "    \n",
    "    engine = runtime.deserialize_cuda_engine(engine_data)\n",
    "    return engine\n",
    "\n",
    "def benchmark_inference(model_name):\n",
    "    engine = load_engine(model_name)\n",
    "    context = engine.create_execution_context()\n",
    "    stream = torch.cuda.Stream()\n",
    "    \n",
    "    # TRT need static input and output addresses.\n",
    "    # Here they are set.\n",
    "    for i in range(len(inps)):\n",
    "        context.set_tensor_address(engine.get_tensor_name(i), inps[i].data_ptr())    \n",
    "    context.set_tensor_address(\"output\", output_tensor.data_ptr())\n",
    "    \n",
    "    def _inference():\n",
    "        # The data is loaded from static input addresses\n",
    "        # and output is written to static output address.\n",
    "        context.execute_async_v3(stream_handle=stream.cuda_stream)\n",
    "        stream.synchronize()\n",
    "        \n",
    "    return _measure_time(_inference)\n",
    "\n",
    "\n",
    "trt_fp8_time = benchmark_inference(\"model_fp8.engine\")\n",
    "trt_fp32_time = benchmark_inference(\"model_fp32.engine\")\n",
    "\n",
    "print(f\"Average inference time without TRT (FP32 for all layers):                       {te_fp32_time} ms\")\n",
    "print(f\"Average inference time without TRT (FP8 for TE layers, FP32 for non-TE layers): {te_fp8_time} ms, speedup = {te_fp32_time/te_fp8_time:.2f}x\")\n",
    "print(f\"Average inference time with TRT (FP32 for all layers):                          {trt_fp32_time:.4f} ms, speedup = {te_fp32_time/trt_fp32_time:.2f}x\")\n",
    "print(f\"Average inference time with TRT (FP8 for TE layers, FP32 for non-TE layers):    {trt_fp8_time:.4f} ms, speedup = {te_fp32_time/trt_fp8_time:.2f}x\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p>\n",
    "\n",
    "\n",
    "| Run                               | Inference Time (ms) | Speedup             |\n",
    "| ----------------------------------| ------------------- | ------------------- |\n",
    "| PyTorch + TE                      | 0.065               | 1.00x               |\n",
    "| PyTorch + TE (FP8 for TE layers)  | 0.062               | 1.05x               |\n",
    "| TRT                               | 0.0500              | 1.30x               |\n",
    "| TRT (FP8 for TE layers)           | 0.047               | 1.38x               |\n",
    "\n",
    "Note that this example highlights how TensorRT can speed up models composed of both TE and non-TE layers.\n",
    "If a larger part of the model's layers were implemented with TE, the benefits of using FP8 for inference could be greater.\n",
    "\n",
    "</p>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We clearly observe performance improvements when using FP8 and the TensorRT inference engine. These improvements may become even more significant with more complex models, as TensorRT could potentially identify additional optimization opportunities.\n",
    "\n",
    "#### Appendix: Low Precision Operators in ONNX and TensorRT\n",
    "\n",
    "The ONNX standard does not currently support all precision types provided by the Transformer Engine. All available ONNX operators are listed on [this website](https://onnx.ai/onnx/operators/). Consequently, TensorRT and the Transformer Engine utilize certain specialized low-precision operators, detailed below.\n",
    "\n",
    "**TRT_FP8_QUANTIZE**\n",
    "\n",
    "- **Name**: TRT_FP8_QUANTIZE\n",
    "- **Domain**: trt\n",
    "- **Inputs**:\n",
    "    - `x`: float32 tensor\n",
    "    - `scale`: float32 scalar\n",
    "- **Outputs**:\n",
    "    - `y`: int8 tensor\n",
    "\n",
    "Produces an int8 tensor that represents the binary encoding of FP8 values.\n",
    "\n",
    "**TRT_FP8_DEQUANTIZE**\n",
    "\n",
    "- **Name**: TRT_FP8_DEQUANTIZE\n",
    "- **Domain**: trt\n",
    "- **Inputs**:\n",
    "    - `x`: int8 tensor\n",
    "    - `scale`: float32 scalar\n",
    "- **Outputs**:\n",
    "    - `y`: float32 tensor\n",
    "\n",
    "Converts FP8-encoded int8 tensor data back into float32 precision.\n",
    "\n",
    "<div class=\"alert alert-info\">\n",
    "\n",
    "<b>Note:</b>\n",
    "\n",
    "Since standard ONNX operators do not support certain input and output precision types, a workaround is employed: tensors are dequantized to higher precision (float32) before input into these operators or quantized to lower precision after processing. TensorRT recognizes such quantize-dequantize patterns and replaces them with optimized operations. More details are available in [this section](https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-quantized-types.html#tensorrt-processing-of-q-dq-networks) of the TensorRT documentation.\n",
    "\n",
    "</div>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}