1_getting_started.rst 11.8 KB
Newer Older
1
2
3
4
5
6
..
    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.

Getting started
Paweł Gadziński's avatar
Paweł Gadziński committed
7
===============
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23

.. note::

   Precision debug tools with `Nvidia-DL-Framework-Inspect <https://github.com/NVIDIA/nvidia-dlfw-inspect>`_ for Transformer Engine are currently supported only for PyTorch.

Transformer Engine provides a set of precision debug tools which allow you to easily:

- log the statistics for each of the tensors in every matrix multiply (GEMM) operation,
- run selected GEMMs in higher precision,
- run current scaling - with one scaling factor per tensor - for particular GEMMs,
- test new precisions and integrate them with FP8 training,
- ... and many more.

There are 4 things one needs to do to use Transformer Engine debug features:

1. Create a configuration YAML file to configure the desired features.
24
2. Import, initialize, and install the `Nvidia-DL-Framework-Inspect <https://github.com/NVIDIA/nvidia-dlfw-inspect>`_ tool.
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
3. One can pass ``name="..."`` when creating TE layers to easier identify layer names. If this is not provided, names will be inferred automatically.
4. Invoke ``debug_api.step()`` at the end of one forward-backward pass.

To start debugging, one needs to create a configuration YAML file. This file lists the features to be used in particular layers. There are 2 kinds of features:

- provided by the Transformer Engine - for example, DisableFP8GEMM or LogTensorStats - they are listed in the :doc:`debug features API <3_api_features>` section
- defined by the user. For details on how to create a custom feature - please read the :doc:`calls to Nvidia-DL-Framework-Inspect <3_api_te_calls>` section.

.. figure:: ./img/introduction.svg
   :align: center

   Fig 1: Example of Nvidia-DL-Framework-Inspect affecting training script with 3 TE Linear Layers. 
   ``config.yaml`` contains the specification of the features used for each Linear layer. Some feature classes are provided by TE,
   one - ``UserProvidedPrecision`` - is a custom feature implemented by the user. Nvidia-DL-Framework-Inspect inserts features into the layers according to the config.

Example training script
Paweł Gadziński's avatar
Paweł Gadziński committed
41
-----------------------
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71

Let's look at a simple example of training a Transformer layer using Transformer Engine with FP8 precision. This example demonstrates how to set up the layer, define an optimizer, and perform a few training iterations using synthetic data.

.. code-block:: python

    # train.py

    from transformer_engine.pytorch import TransformerLayer
    import torch
    import torch.nn as nn
    import torch.optim as optim
    import transformer_engine.pytorch as te

    hidden_size = 512
    num_attention_heads = 8

    transformer_layer = TransformerLayer(
        hidden_size=hidden_size,
        ffn_hidden_size=hidden_size,
        num_attention_heads=num_attention_heads
    ).cuda()

    dummy_input = torch.randn(10, 32, hidden_size).cuda()
    criterion = nn.MSELoss()
    optimizer = optim.Adam(transformer_layer.parameters(), lr=1e-4)
    dummy_target = torch.randn(10, 32, hidden_size).cuda()

    for epoch in range(5):
        transformer_layer.train()
        optimizer.zero_grad()
72
        with te.autocast(enabled=True):
73
74
75
76
77
78
79
80
81
82
83
            output = transformer_layer(dummy_input)
        loss = criterion(output, dummy_target)
        loss.backward()
        optimizer.step()

We will demonstrate two debug features on the code above:

1. Disabling FP8 precision for specific GEMM operations, such as the FC1 and FC2 forward propagation GEMM.
2. Logging statistics for other GEMM operations, such as gradient statistics for data gradient GEMM within the LayerNormLinear sub-layer of the TransformerLayer.

Config file
Paweł Gadziński's avatar
Paweł Gadziński committed
84
-----------
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116

We need to prepare the configuration YAML file, as below

.. code-block:: yaml

    # config.yaml

    fc1_fprop_to_fp8:
      enabled: True
      layers:
        layer_types: [fc1, fc2] # contains fc1 or fc2 in name
      transformer_engine:
        DisableFP8GEMM:
          enabled: True
          gemms: [fprop]

    log_tensor_stats:
      enabled: True
      layers:
        layer_types: [layernorm_linear] # contains layernorm_linear in name
      transformer_engine:
        LogTensorStats:
          enabled: True
          stats: [max, min, mean, std, l1_norm]
          tensors: [activation]
          freq: 1
          start_step: 2
          end_step: 5

Further explanation on how to create config files is in the :doc:`next part of the documentation <2_config_file_structure>`.

Adjusting Python file
Paweł Gadziński's avatar
Paweł Gadziński committed
117
118
---------------------

119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144

.. code-block:: python

    # (...)

    import nvdlfw_inspect.api as debug_api
    debug_api.initialize(
        config_file="./config.yaml",
        feature_dirs=["/path/to/transformer_engine/debug/features"],
        log_dir="./log",
        default_logging_enabled=True)

    # initialization of the TransformerLayer with the name
    transformer_layer = TransformerLayer(
      name="transformer_layer",
      # ...)

    # (...)
    for epoch in range(5):
      # forward and backward pass
      # ...
      debug_api.step()

In the modified code above, the following changes were made:

1. Added an import for ``nvdlfw_inspect.api``.
145
2. Initialized the Nvidia-DL-Framework-Inspect by calling ``debug_api.initialize()`` with appropriate configuration, specifying the path to the config file, feature directories, and log directory. The directory with Transformer Engine features is located `here <https://github.com/NVIDIA/TransformerEngine/tree/main/transformer_engine/debug/features>`_. The full parameters description could be found :doc:`here <3_api_debug_setup>`.
146
147
148
3. Added ``debug_api.step()`` after each of the forward-backward pass.

Inspecting the logs
Paweł Gadziński's avatar
Paweł Gadziński committed
149
150
-------------------

151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217

Let's look at the files with the logs. Two files will be created:

1. debug logs.
2. statistics logs.

Let's look inside them!

In the main log file, you can find detailed information about the transformer layer's GEMMs behavior. You can see that ``fc1`` and ``fc2`` fprop GEMMs are run in high precision, as intended.

.. code-block:: text

    # log/nvdlfw_inspect_logs/nvdlfw_inspect_globalrank-0.log

    INFO - Default logging to file enabled at ./log
    INFO - Reading config from ./config.yaml.
    INFO - Loaded configs for dict_keys(['fc1_fprop_to_fp8', 'log_tensor_stats']).
    INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: activation, gemm fprop - FP8 quantization
    INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: activation, gemm wgrad - FP8 quantization
    INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: weight, gemm fprop - FP8 quantization
    INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: weight, gemm dgrad - FP8 quantization
    INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: gradient, gemm dgrad - FP8 quantization
    INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: gradient, gemm wgrad - FP8 quantization
    INFO - transformer_layer.self_attention.proj: Tensor: activation, gemm fprop - FP8 quantization
    INFO - transformer_layer.self_attention.proj: Tensor: activation, gemm wgrad - FP8 quantization
    INFO - transformer_layer.self_attention.proj: Tensor: weight, gemm fprop - FP8 quantization
    INFO - transformer_layer.self_attention.proj: Tensor: weight, gemm dgrad - FP8 quantization
    INFO - transformer_layer.self_attention.proj: Tensor: gradient, gemm dgrad - FP8 quantization
    INFO - transformer_layer.self_attention.proj: Tensor: gradient, gemm wgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc1: Tensor: activation, gemm fprop - High precision
    INFO - transformer_layer.layernorm_mlp.fc1: Tensor: activation, gemm wgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc1: Tensor: weight, gemm fprop - High precision
    INFO - transformer_layer.layernorm_mlp.fc1: Tensor: weight, gemm dgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc1: Tensor: gradient, gemm dgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc1: Tensor: gradient, gemm wgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc2: Tensor: activation, gemm fprop - High precision
    INFO - transformer_layer.layernorm_mlp.fc2: Tensor: activation, gemm wgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc2: Tensor: weight, gemm fprop - High precision
    INFO - transformer_layer.layernorm_mlp.fc2: Tensor: weight, gemm dgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc2: Tensor: gradient, gemm dgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc2: Tensor: gradient, gemm wgrad - FP8 quantization
    INFO - transformer_layer.self_attention.layernorm_qkv: Feature=LogTensorStats, API=look_at_tensor_before_process: activation
    ....

The second log file (``nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-0.log``) contains statistics for tensors we requested in ``config.yaml``.

.. code-block:: text

    # log/nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-0.log

    INFO - transformer_layer.self_attention.layernorm_qkv_activation_max                 iteration=000002                  value=4.3188
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_min                 iteration=000002                  value=-4.3386
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_mean                iteration=000002                  value=0.0000
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_std                 iteration=000002                  value=0.9998
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_l1_norm             iteration=000002                  value=130799.6953
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_max                 iteration=000003                  value=4.3184
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_min                 iteration=000003                  value=-4.3381
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_mean                iteration=000003                  value=0.0000
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_std                 iteration=000003                  value=0.9997
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_l1_norm             iteration=000003                  value=130788.1016
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_max                 iteration=000004                  value=4.3181
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_min                 iteration=000004                  value=-4.3377
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_mean                iteration=000004                  value=0.0000
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_std                 iteration=000004                  value=0.9996
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_l1_norm             iteration=000004                  value=130776.7969

Logging using TensorBoard
Paweł Gadziński's avatar
Paweł Gadziński committed
218
219
-------------------------

220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243

Precision debug tools support logging using `TensorBoard <https://www.tensorflow.org/tensorboard>`_. To enable it, one needs to pass the argument ``tb_writer`` to the ``debug_api.initialize()``.  Let's modify ``train.py`` file.

.. code-block:: python

    # (...)

    from torch.utils.tensorboard import SummaryWriter
    tb_writer = SummaryWriter('./tensorboard_dir/run1')

    # add tb_writer to the Debug API initialization
    debug_api.initialize(
        config_file="./config.yaml",
        feature_dirs=["/path/to/transformer_engine/debug/features"],
        log_dir="./log",
        tb_writer=tb_writer)

    # (...)

Let's run training and open TensorBoard by ``tensorboard --logdir=./tensorboard_dir/run1``:

.. figure:: ./img/tensorboard.png
   :align: center

244
   Fig 2: TensorBoard with plotted stats.