{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "24184f3f",
   "metadata": {},
   "source": [
    "# Performance Optimizations"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6dcbf25a",
   "metadata": {},
   "source": [
    "This guide is a follow-up to the discussion in the [Getting Started guide](../getting_started/index.rst). We will focus on techniques to achieve maximum performance when training a basic GPT encoder layer. For convenience, we use some helper functions defined in [quickstart_utils.py](quickstart_utils.py). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "2b53dfa7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import transformer_engine.pytorch as te\n",
    "from transformer_engine.common.recipe import Format, DelayedScaling\n",
    "import quickstart_utils as utils\n",
    "\n",
    "# Layer configuration\n",
    "hidden_size = 4096\n",
    "sequence_length = 2048\n",
    "batch_size = 4\n",
    "ffn_hidden_size = 16384\n",
    "num_attention_heads = 32\n",
    "dtype = torch.float16\n",
    "\n",
    "# Synthetic data\n",
    "x = torch.rand(sequence_length, batch_size, hidden_size).cuda().to(dtype=dtype)\n",
    "dy = torch.rand(sequence_length, batch_size, hidden_size).cuda().to(dtype=dtype)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "b96a9ef6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean time: 27.82952880859375 ms\n"
     ]
    }
   ],
   "source": [
    "# Construct layer\n",
    "basic_transformer = te.TransformerLayer(\n",
    "    hidden_size,\n",
    "    ffn_hidden_size,\n",
    "    num_attention_heads,\n",
    ")\n",
    "basic_transformer.to(dtype=dtype).cuda()\n",
    "\n",
    "fp8_format = Format.HYBRID\n",
    "fp8_recipe = DelayedScaling(\n",
    "    fp8_format=fp8_format,\n",
    "    amax_history_len=16,\n",
    "    amax_compute_algo=\"max\",\n",
    ")\n",
    "# Training step\n",
    "with te.autocast(enabled=True, recipe=fp8_recipe):\n",
    "    y = basic_transformer(x, attention_mask=None)\n",
    "y.backward(dy)\n",
    "\n",
    "# Measure step time\n",
    "utils.speedometer(\n",
    "    basic_transformer,\n",
    "    x,\n",
    "    dy,\n",
    "    forward_kwargs = { \"attention_mask\": None },\n",
    "    autocast_kwargs = { \"enabled\": True, \"recipe\": fp8_recipe },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11367f5b",
   "metadata": {},
   "source": [
    "## Multi-GPU training\n",
    "\n",
    "<div class=\"alert alert-info\">\n",
    "\n",
    "<b>Summary</b>\n",
    "    \n",
    "We parallelize a Transformer layer with data, tensor, and sequence parallelism.\n",
    "\n",
    "</div>\n",
    "\n",
    "A variety of parallelism strategies can be used to enable multi-GPU training of Transformer models, often based on different approaches to distribute their $\\text{sequence_length} \\cdot \\text{batch_size} \\cdot \\text{hidden_size}$ activation tensors. The most common approach is data parallelism, which distributes along the $\\text{batch_size}$ dimension. By storing duplicate copies of the model on each GPU, the forward and backward passes of the training step can be done independently, followed by a gradient synchronization. A more advanced strategy is tensor parallelism, a type of model parallelism that distributes along the $\\text{hidden_size}$ dimension. This allows us to scale past the limits of data parallelism (typically $\\text{hidden_size} > \\text{batch_size}$) and to reduce the per-GPU memory usage (since model parameters are also distributed), but it also incurs the overhead of communicating activation tensors between GPUs at every step. For a more detailed explanation, please see the [Megatron-LM paper](https://arxiv.org/pdf/1909.08053.pdf). Finally, sequence parallelism distributes along the $\\text{sequence_length}$ dimension. This can be used when tensor parallelism is enabled in order to parallelize operations that run outside the tensor-parallel region (e.g. layer norm). For more details, please see [this paper](https://arxiv.org/pdf/2205.05198.pdf).\n",
    "\n",
    "To show this in action, let's first initialize NCCL with a trivial process group:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "fca06ec3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Configure parallel groups\n",
    "import os\n",
    "import torch\n",
    "torch.distributed.init_process_group(\n",
    "    \"nccl\",\n",
    "    init_method=\"file:///tmp/rdzv\",\n",
    "    world_size=1,\n",
    "    rank=0,\n",
    ")\n",
    "world_group = torch.distributed.new_group(ranks=[0], backend=\"nccl\")\n",
    "data_parallel_group = torch.distributed.new_group(ranks=[0], backend=\"nccl\")\n",
    "tensor_parallel_group = torch.distributed.new_group(ranks=[0], backend=\"nccl\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1f2b80d0",
   "metadata": {},
   "source": [
    "We initialize with only one GPU to keep this example simple. Please consult the [torch.distributed](https://pytorch.org/docs/stable/distributed.html) documentation for guidance on running with multiple GPUs. Note that we require each distributed process to correspond to exactly one GPU, so we treat the two interchangeably. In practice, multiple factors can affect the optimal parallel layout: the system hardware, the network topology, and the use of other parallelism schemes like pipeline parallelism. A rough rule of thumb is to interpret the GPUs as a 2D grid with dimensions $\\text{num_nodes} \\times \\text{gpus_per_node}$, where the rows are tensor-parallel groups and the columns are data-parallel groups.\n",
    "\n",
    "Enabling data parallelism with Transformer Engine is similar to enabling data parallelism with standard PyTorch models: simply wrap the modules with [torch.nn.parallel.DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html). Transformer Engine modules also have native support for tensor and sequence parallelism. If the user provides a process group for tensor parallelism, the modules will distribute the data and perform communication internally. If sequence parallelism is enabled, it will be applied for operations that are not amenable to tensor parallelism and it will use the tensor-parallel process group.\n",
    "\n",
    "One important consideration for multi-GPU FP8 training is how to synchronize the FP8 scaling factors between GPUs. If tensor parallelism is enabled, the scales must be synchronized over the tensor-parallel group. However, synchronizing over both the data-parallel and tensor-parallel groups is recommended for the best convergence. This can be configured with the **fp8_group** argument in the [autocast](../api/pytorch.rst#transformer_engine.pytorch.autocast) context manager."
   ]
  },
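  {
   "cell_type": "markdown",
   "id": "3a1b2c4d",
   "metadata": {},
   "source": [
    "For instance (a sketch with hypothetical values; a real job would take these from its launcher), 2 nodes with 4 GPUs each could be grouped as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e8d7c6b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical 2D parallel layout: rows are tensor-parallel groups,\n",
    "# columns are data-parallel groups\n",
    "num_nodes, gpus_per_node = 2, 4\n",
    "world_size = num_nodes * gpus_per_node\n",
    "tensor_parallel_ranks = [\n",
    "    list(range(node * gpus_per_node, (node + 1) * gpus_per_node))\n",
    "    for node in range(num_nodes)\n",
    "]  # [[0, 1, 2, 3], [4, 5, 6, 7]]\n",
    "data_parallel_ranks = [\n",
    "    list(range(gpu, world_size, gpus_per_node))\n",
    "    for gpu in range(gpus_per_node)\n",
    "]  # [[0, 4], [1, 5], [2, 6], [3, 7]]"
   ]
  },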
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "1892cc9d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean time: 29.09606689453125 ms\n"
     ]
    }
   ],
   "source": [
    "# Construct layer\n",
    "parallel_transformer = te.TransformerLayer(\n",
    "    hidden_size,\n",
    "    ffn_hidden_size,\n",
    "    num_attention_heads,\n",
    "    set_parallel_mode=True,\n",
    "    tp_group=tensor_parallel_group,\n",
    "    sequence_parallel=True,\n",
    ")\n",
    "parallel_transformer.to(dtype=dtype).cuda()\n",
    "parallel_transformer = torch.nn.parallel.DistributedDataParallel(\n",
    "    parallel_transformer,\n",
    "    process_group=data_parallel_group,\n",
    ")\n",
    "\n",
    "# Training step\n",
    "with te.autocast(enabled=True, recipe=fp8_recipe, amax_reduction_group=world_group):\n",
    "    y = parallel_transformer(x, attention_mask=None)\n",
    "y.backward(dy)\n",
    "\n",
    "# Measure step time\n",
    "utils.speedometer(\n",
    "    parallel_transformer,\n",
    "    x,\n",
    "    dy,\n",
    "    forward_kwargs = { \"attention_mask\": None },\n",
    "    autocast_kwargs = {\n",
    "        \"enabled\": True,\n",
    "        \"recipe\": fp8_recipe,\n",
    "        \"amax_reduction_group\": world_group,\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f03f6d8",
   "metadata": {},
   "source": [
    "## Gradient accumulation fusion\n",
    "\n",
    "<div class=\"alert alert-info\">\n",
    "\n",
    "<b>Summary</b>\n",
    "    \n",
    "We take advantage of the ability of Tensor Cores to accumulate outputs directly into FP32.\n",
    "\n",
    "</div>\n",
    "\n",
    "PyTorch's autograd functionality assumes that a model parameter and its corresponding gradient have the same data type. However, while low-precision data types like FP8 are sufficient for evaluating a neural network's forward and backward passes, the optimization step typically requires full FP32 precision to avoid significant learning degradation. In addition, Tensor Cores on Hopper GPUs have the option to accumulate matrix products directly into FP32, resulting in better numerical accuracy and avoiding the need for a separate casting kernel. Thus, Transformer Engine provides an option to directly generate FP32 gradients for weight tensors. The FP32 gradients are not output to the parameter's `grad` tensor, but rather to a `main_grad` tensor that must be initialized before the backward pass."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "a7f612ec",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean time: 27.510029296875 ms\n"
     ]
    }
   ],
   "source": [
    "# Construct layer\n",
    "wgrad_transformer = te.TransformerLayer(\n",
    "    hidden_size,\n",
    "    ffn_hidden_size,\n",
    "    num_attention_heads,\n",
    "    fuse_wgrad_accumulation=True,\n",
    "    fuse_qkv_params=True, # Required for fuse_wgrad_accumulation\n",
    ")\n",
    "wgrad_transformer.to(dtype=dtype).cuda()\n",
    "for param in wgrad_transformer.parameters():\n",
    "    param.grad = None\n",
    "    param.main_grad = torch.zeros_like(param, dtype=torch.float32)\n",
    "\n",
    "# Training step\n",
    "with te.autocast(enabled=True, recipe=fp8_recipe):\n",
    "    y = wgrad_transformer(x, attention_mask=None)\n",
    "y.backward(dy)\n",
    "for param in wgrad_transformer.parameters():\n",
    "    if param.grad is not None:\n",
    "        param.main_grad.copy_(param.grad)\n",
    "        param.grad = None\n",
    "\n",
    "# Measure step time\n",
    "utils.speedometer(\n",
    "    wgrad_transformer,\n",
    "    x,\n",
    "    dy,\n",
    "    forward_kwargs = { \"attention_mask\": None },\n",
    "    autocast_kwargs = { \"enabled\": True, \"recipe\": fp8_recipe },\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "add64bd5",
   "metadata": {},
   "source": [
    "## FP8 weight caching\n",
    "\n",
    "<div class=\"alert alert-info\">\n",
    "\n",
    "<b>Summary</b>\n",
    "    \n",
    "We avoid redundant FP8 casting when training with multiple gradient accumulation steps.\n",
    "\n",
    "</div>\n",
    "\n",
    "Since weights are typically trained in FP32, a type conversion is required before we can perform compute in FP8. By default, the [autocast](../api/pytorch.rst#transformer_engine.pytorch.autocast) context manager will handle this internally by casting non-FP8 tensors to FP8 as they are encountered. However, we can improve upon this in some cases. In particular, if our training iteration is split into multiple gradient accumulation steps, each micro-batch will encounter the same weight tensors. Thus, we only need to cast the weights to FP8 in the first gradient accumulation step and we can cache the resulting FP8 weights for the remaining gradient accumulation steps.\n",
    "\n",
    "<div class=\"alert alert-warning\">\n",
    "\n",
    "<b>Warning!</b> \n",
    "\n",
    "The precise numerical outputs with and without the FP8 weight caching optimization may not be bitwise identical. This is because while the weights remain frozen across a gradient accumulation cycle, the scaling factors and amaxes for the FP8 weights can change as they are updated at the end of every iteration. These changes in amax tensors are incorporated into the amax history, which is not frozen.\n",
    "\n",
    "</div>"
   ]
  },
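  {
   "cell_type": "markdown",
   "id": "5c6d7e8f",
   "metadata": {},
   "source": [
    "When a training iteration is split into several gradient accumulation steps, the overall pattern looks like the following sketch (`microbatches` and `optimizer` are hypothetical placeholders, assuming the recipe defined above):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0f1e2d3c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: cast weights to FP8 only on the first micro-batch and reuse\n",
    "# the cached FP8 weights for the rest of the iteration\n",
    "for i, (x_mb, dy_mb) in enumerate(microbatches):\n",
    "    with te.autocast(enabled=True, recipe=fp8_recipe):\n",
    "        y = weight_caching_transformer(x_mb, attention_mask=None, is_first_microbatch=(i == 0))\n",
    "    y.backward(dy_mb)\n",
    "optimizer.step()  # Weight update invalidates the cached FP8 weights\n",
    "optimizer.zero_grad()"
   ]
  },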
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "abbc218e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean time: 27.262666015625 ms\n"
     ]
    }
   ],
   "source": [
    "# Construct layer\n",
    "weight_caching_transformer = te.TransformerLayer(\n",
    "    hidden_size,\n",
    "    ffn_hidden_size,\n",
    "    num_attention_heads,\n",
    ")\n",
    "weight_caching_transformer.to(dtype=dtype).cuda()\n",
    "\n",
    "# Cast weights in first gradient accumulation step\n",
    "with te.autocast(enabled=True, recipe=fp8_recipe):\n",
    "    y = weight_caching_transformer(x, attention_mask=None, is_first_microbatch=True)\n",
    "y.backward(dy)\n",
    "\n",
    "# Reuse FP8 weights in subsequent gradient accumulation steps\n",
    "with te.autocast(enabled=True, recipe=fp8_recipe):\n",
    "    y = weight_caching_transformer(x, attention_mask=None, is_first_microbatch=False)\n",
    "y.backward(dy)\n",
    "\n",
    "# Measure step time\n",
    "utils.speedometer(\n",
    "    weight_caching_transformer,\n",
    "    x,\n",
    "    dy,\n",
    "    forward_kwargs = { \"attention_mask\": None, \"is_first_microbatch\": False },\n",
    "    autocast_kwargs = { \"enabled\": True, \"recipe\": fp8_recipe },\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}