"src/models/vscode:/vscode.git/clone" did not exist on "760b769ee5c90d953b2f737445ef89ca96d513fe"
Commit 2b05e121 authored by yuguo's avatar yuguo
Browse files


Merge commit 'a69692ac' of https://github.com/NVIDIA/TransformerEngine
parents 0fd441c2 a69692ac
@@ -18,8 +18,8 @@ jobs:
      - name: 'Dependencies'
        run: |
          apt-get update
          apt-get install -y git python3.9 pip ninja-build cudnn9-cuda-12
          pip install cmake==3.21.0
          apt-get install -y git python3.9 pip cudnn9-cuda-12
          pip install cmake==3.21.0 pybind11[global] ninja
      - name: 'Checkout'
        uses: actions/checkout@v3
        with:
@@ -42,8 +42,8 @@ jobs:
      - name: 'Dependencies'
        run: |
          apt-get update
          apt-get install -y git python3.9 pip ninja-build cudnn9-cuda-12
          pip install cmake torch pydantic importlib-metadata>=1.0 packaging pybind11
          apt-get install -y git python3.9 pip cudnn9-cuda-12
          pip install cmake torch ninja pydantic importlib-metadata>=1.0 packaging pybind11 numpy einops
      - name: 'Checkout'
        uses: actions/checkout@v3
        with:
@@ -54,7 +54,6 @@ jobs:
          NVTE_FRAMEWORK: pytorch
          MAX_JOBS: 1
      - name: 'Sanity check'
        if: false  # Sanity import test requires Flash Attention
        run: python3 tests/pytorch/test_sanity_import.py
  jax:
    name: 'JAX'
@@ -63,6 +62,8 @@
      image: ghcr.io/nvidia/jax:jax
      options: --user root
    steps:
      - name: 'Dependencies'
        run: pip install pybind11[global]
      - name: 'Checkout'
        uses: actions/checkout@v3
        with:
@@ -73,4 +74,24 @@ jobs:
          NVTE_FRAMEWORK: jax
          MAX_JOBS: 1
      - name: 'Sanity check'
        run: python tests/jax/test_sanity_import.py
        run: python3 tests/jax/test_sanity_import.py
  all:
    name: 'All'
    runs-on: ubuntu-latest
    container:
      image: ghcr.io/nvidia/jax:jax
      options: --user root
    steps:
      - name: 'Dependencies'
        run: pip install torch pybind11[global] einops
      - name: 'Checkout'
        uses: actions/checkout@v3
        with:
          submodules: recursive
      - name: 'Build'
        run: pip install --no-build-isolation . -v --no-deps
        env:
          NVTE_FRAMEWORK: all
          MAX_JOBS: 1
      - name: 'Sanity check'
        run: python3 tests/pytorch/test_sanity_import.py && python3 tests/jax/test_sanity_import.py
@@ -53,6 +53,7 @@ jobs:
        || github.actor == 'lhb8125'
        || github.actor == 'kunlunl'
        || github.actor == 'pstjohn'
        || github.actor == 'mk-61'
      )
    steps:
      - name: Check if comment is issued by authorized person
@@ -146,7 +146,7 @@ Installation
============
System Requirements
^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^
* **Hardware:** Blackwell, Hopper, Grace Hopper/Blackwell, Ada, Ampere
@@ -164,10 +164,10 @@ System Requirements
* **Notes:** FP8 features require Compute Capability 8.9+ (Ada/Hopper/Blackwell)
Installation Methods
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^
Docker (Recommended)
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^
The quickest way to get started with Transformer Engine is by using Docker images on
`NVIDIA GPU Cloud (NGC) Catalog <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch>`_.
@@ -192,7 +192,7 @@ Where 25.04 (corresponding to April 2025 release) is the container version.
* NGC PyTorch 23.08+ containers include FlashAttention-2
pip Installation
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^
**Prerequisites for pip installation:**
@@ -216,13 +216,25 @@ Alternatively, install directly from the GitHub repository:

.. code-block:: bash

    pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
    pip install --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@stable

When installing from GitHub, you can explicitly specify frameworks using the environment variable:

.. code-block:: bash

    NVTE_FRAMEWORK=pytorch,jax pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
    NVTE_FRAMEWORK=pytorch,jax pip install --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@stable

conda Installation
^^^^^^^^^^^^^^^^^^

To install the latest stable version with conda from conda-forge:

.. code-block:: bash

    # For PyTorch integration
    conda install -c conda-forge transformer-engine-torch

    # JAX integration (coming soon)

Source Installation
^^^^^^^^^^^^^^^^^^^
@@ -230,7 +242,7 @@ Source Installation
`See the installation guide <https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/installation.html#installation-from-source>`_
Environment Variables
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^
These environment variables can be set before installation to customize the build process:
* **CUDA_PATH**: Path to CUDA installation
@@ -241,7 +253,7 @@ These environment variables can be set before installation to customize the build process:
* **NVTE_BUILD_THREADS_PER_JOB**: Control threads per build job
Compiling with FlashAttention
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Transformer Engine supports both FlashAttention-2 and FlashAttention-3 in PyTorch for improved performance. FlashAttention-3 was added in release v1.11 and is prioritized over FlashAttention-2 when both are present in the environment.
You can verify which FlashAttention version is being used by setting these environment variables:
@@ -253,8 +265,9 @@ You can verify which FlashAttention version is being used by setting these environment variables:
It is a known issue that FlashAttention-2 compilation is resource-intensive and requires a large amount of RAM (see `bug <https://github.com/Dao-AILab/flash-attention/issues/358>`_), which may lead to out of memory errors during the installation of Transformer Engine. Please try setting **MAX_JOBS=1** in the environment to circumvent the issue.
.. troubleshooting-begin-marker-do-not-remove
Troubleshooting
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^
**Common Issues and Solutions:**
@@ -388,7 +401,7 @@ Papers
Videos
======
* `Stable and Scalable FP8 Deep Learning Training on Blackwell | GTC 2025 <https://www.nvidia.com/en-us/on-demand/session/gtc24-s62457/>`_
* `Stable and Scalable FP8 Deep Learning Training on Blackwell | GTC 2025 <https://www.nvidia.com/en-us/on-demand/session/gtc24-s62457/>`__
* `Blackwell Numerics for AI | GTC 2025 <https://www.nvidia.com/en-us/on-demand/session/gtc25-s72458/>`_
* `Building LLMs: Accelerating Pretraining of Foundational Models With FP8 Precision | GTC 2025 <https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=zoho#/session/1726152813607001vnYK>`_
* `From FP8 LLM Training to Inference: Language AI at Scale | GTC 2025 <https://www.nvidia.com/en-us/on-demand/session/gtc25-s72799/>`_
@@ -4,7 +4,6 @@
"""JAX related extensions."""
import os
import shutil
from pathlib import Path
import setuptools
@@ -13,6 +12,16 @@ from .utils import get_cuda_include_dirs, all_files_in_dir, debug_build_enabled
from typing import List
def install_requirements() -> List[str]:
    """Install dependencies for TE/JAX extensions."""
    return ["jax", "flax>=0.7.1"]


def test_requirements() -> List[str]:
    """Test dependencies for TE/JAX extensions."""
    return ["numpy"]


def xla_path() -> str:
    """XLA root path lookup.
    Throws FileNotFoundError if XLA source is not found."""
@@ -66,20 +75,9 @@ def setup_jax_extension(
    # Define TE/JAX as a Pybind11Extension
    from pybind11.setup_helpers import Pybind11Extension

    class Pybind11CPPExtension(Pybind11Extension):
        """Modified Pybind11Extension to allow custom CXX flags."""

        def _add_cflags(self, flags: List[str]) -> None:
            if isinstance(self.extra_compile_args, dict):
                cxx_flags = self.extra_compile_args.pop("cxx", [])
                cxx_flags += flags
                self.extra_compile_args["cxx"] = cxx_flags
            else:
                self.extra_compile_args[:0] = flags

    return Pybind11CPPExtension(
    return Pybind11Extension(
        "transformer_engine_jax",
        sources=[str(path) for path in sources],
        include_dirs=[str(path) for path in include_dirs],
        extra_compile_args={"cxx": cxx_flags},
        extra_compile_args=cxx_flags,
    )
@@ -9,6 +9,22 @@ from pathlib import Path
import setuptools
from .utils import all_files_in_dir, cuda_version, get_cuda_include_dirs, debug_build_enabled, rocm_build, hipify
from typing import List
def install_requirements() -> List[str]:
    """Install dependencies for TE/PyTorch extensions."""
    reqs = ["torch>=2.1", "einops"]
    reqs.append(
        "nvdlfw-inspect @"
        " git+https://github.com/NVIDIA/nvidia-dlfw-inspect.git@v0.1#egg=nvdlfw-inspect"
    )
    return reqs


def test_requirements() -> List[str]:
    """Test dependencies for TE/PyTorch extensions."""
    return ["numpy", "torchvision", "transformers"]
def setup_pytorch_extension(
@@ -21,13 +21,7 @@ from typing import List, Optional, Tuple, Union
@functools.lru_cache(maxsize=None)
def debug_build_enabled() -> bool:
    """Whether to build with a debug configuration"""
    for arg in sys.argv:
        if arg == "--debug":
            sys.argv.remove(arg)
            return True
    if int(os.getenv("NVTE_BUILD_DEBUG", "0")):
        return True
    return False
    return bool(int(os.getenv("NVTE_BUILD_DEBUG", "0")))
@functools.lru_cache(maxsize=None)
@@ -280,9 +274,12 @@ def get_cuda_include_dirs() -> Tuple[str, str]:
def cuda_archs() -> str:
    version = cuda_version()
    if os.getenv("NVTE_CUDA_ARCHS") is None:
        os.environ["NVTE_CUDA_ARCHS"] = (
            "70;80;89;90;100;120" if version >= (12, 8) else "70;80;89;90"
        )
        if version >= (13, 0):
            os.environ["NVTE_CUDA_ARCHS"] = "75;80;89;90;100;120"
        elif version >= (12, 8):
            os.environ["NVTE_CUDA_ARCHS"] = "70;80;89;90;100;120"
        else:
            os.environ["NVTE_CUDA_ARCHS"] = "70;80;89;90"
    return os.getenv("NVTE_CUDA_ARCHS")
@@ -455,10 +452,3 @@ def hipify(base_dir, src_dir, sources, include_dirs):
            # *never* absolute paths
            hipified_sources.add(os.path.relpath(fname, cwd))
    return list(hipified_sources)


def install_and_import(package):
    """Install a package via pip (if not already installed) and import into globals."""
    main_package = package.split("[")[0]
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])
    globals()[main_package] = importlib.import_module(main_package)
@@ -20,6 +20,9 @@ cd /TransformerEngine
git checkout $TARGET_BRANCH
git submodule update --init --recursive
# Install deps
/opt/python/cp310-cp310/bin/pip install cmake pybind11[global] ninja
if $BUILD_METAPACKAGE ; then
    cd /TransformerEngine
    NVTE_BUILD_METAPACKAGE=1 /opt/python/cp310-cp310/bin/python setup.py bdist_wheel 2>&1 | tee /wheelhouse/logs/metapackage.txt
@@ -31,15 +34,15 @@ if $BUILD_COMMON ; then
    WHL_BASE="transformer_engine-${VERSION}"
    # Create the wheel.
    /opt/python/cp38-cp38/bin/python setup.py bdist_wheel --verbose --python-tag=py3 --plat-name=$PLATFORM 2>&1 | tee /wheelhouse/logs/common.txt
    /opt/python/cp310-cp310/bin/python setup.py bdist_wheel --verbose --python-tag=py3 --plat-name=$PLATFORM 2>&1 | tee /wheelhouse/logs/common.txt
    # Repack the wheel for cuda specific package, i.e. cu12.
    /opt/python/cp38-cp38/bin/wheel unpack dist/*
    /opt/python/cp310-cp310/bin/wheel unpack dist/*
    # From python 3.10 to 3.11, the package name delimiter in metadata got changed from - (hyphen) to _ (underscore).
    sed -i "s/Name: transformer-engine/Name: transformer-engine-cu12/g" "transformer_engine-${VERSION}/transformer_engine-${VERSION}.dist-info/METADATA"
    sed -i "s/Name: transformer_engine/Name: transformer_engine_cu12/g" "transformer_engine-${VERSION}/transformer_engine-${VERSION}.dist-info/METADATA"
    mv "${WHL_BASE}/${WHL_BASE}.dist-info" "${WHL_BASE}/transformer_engine_cu12-${VERSION}.dist-info"
    /opt/python/cp38-cp38/bin/wheel pack ${WHL_BASE}
    /opt/python/cp310-cp310/bin/wheel pack ${WHL_BASE}
    # Rename the wheel to make it python version agnostic.
    whl_name=$(basename dist/*)
@@ -51,14 +54,14 @@ fi
if $BUILD_PYTORCH ; then
    cd /TransformerEngine/transformer_engine/pytorch
    /opt/python/cp38-cp38/bin/pip install torch
    /opt/python/cp38-cp38/bin/python setup.py sdist 2>&1 | tee /wheelhouse/logs/torch.txt
    /opt/python/cp310-cp310/bin/pip install torch
    /opt/python/cp310-cp310/bin/python setup.py sdist 2>&1 | tee /wheelhouse/logs/torch.txt
    cp dist/* /wheelhouse/
fi
if $BUILD_JAX ; then
    cd /TransformerEngine/transformer_engine/jax
    /opt/python/cp310-cp310/bin/pip install "jax[cuda12_local]" jaxlib
    /opt/python/cp310-cp310/bin/pip install "jax[cuda12_local]" jaxlib
    /opt/python/cp310-cp310/bin/python setup.py sdist 2>&1 | tee /wheelhouse/logs/jax.txt
    cp dist/* /wheelhouse/
fi
..
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
cast_transpose_noop.h
=====================
.. doxygenfile:: cast_transpose_noop.h
..
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
cudnn.h
=======
.. doxygenfile:: cudnn.h
@@ -14,10 +14,13 @@ directly from C/C++, without Python.
    transformer_engine.h <transformer_engine>
    activation.h <activation>
    cast_transpose_noop.h <cast_transpose_noop>
    cast.h <cast>
    cudnn.h <cudnn>
    fused_attn.h <fused_attn>
    fused_rope.h <fused_rope>
    gemm.h <gemm>
    multi_tensor.h <multi_tensor>
    normalization.h <normalization>
    padding.h <padding>
    permutation.h <permutation>
..
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
multi_tensor.h
==============
.. doxygenfile:: multi_tensor.h
@@ -11,3 +11,7 @@ Common API
.. autoapiclass:: transformer_engine.common.recipe.DelayedScaling(margin=0, fp8_format=Format.HYBRID, amax_history_len=1024, amax_compute_algo="max", scaling_factor_compute_algo=None)
.. autoapiclass:: transformer_engine.common.recipe.MXFP8BlockScaling(fp8_format=Format.E4M3)
.. autoapiclass:: transformer_engine.common.recipe.Float8CurrentScaling(fp8_format=Format.HYBRID)
.. autoapiclass:: transformer_engine.common.recipe.Float8BlockScaling(fp8_format=Format.E4M3)
..
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
Precision debug tools
==============================================
.. toctree::
    :caption: Precision debug tools

    debug/1_getting_started.rst
    debug/2_config_file_structure.rst
    debug/api
    debug/4_distributed.rst
\ No newline at end of file
..
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
Getting started
===============
.. note::
    Precision debug tools with `Nvidia-DL-Framework-Inspect <https://github.com/NVIDIA/nvidia-dlfw-inspect>`_ for Transformer Engine are currently supported only for PyTorch.
Transformer Engine provides a set of precision debug tools which allow you to easily:
- log the statistics for each of the tensors in every matrix multiply (GEMM) operation,
- run selected GEMMs in higher precision,
- run current scaling - with one scaling factor per tensor - for particular GEMMs,
- test new precisions and integrate them with FP8 training,
- ... and many more.
There are 4 things one needs to do to use Transformer Engine debug features:
1. Create a configuration YAML file to configure the desired features.
2. Import and initialize the `Nvidia-DL-Framework-Inspect <https://github.com/NVIDIA/nvidia-dlfw-inspect>`_ tool, which is installed as a dependency of Transformer Engine.
3. One can pass ``name="..."`` when creating TE layers to make them easier to identify. If a name is not provided, names will be inferred automatically.
4. Invoke ``debug_api.step()`` at the end of every forward-backward pass.
To start debugging, one needs to create a configuration YAML file. This file lists the features to be used in particular layers. There are 2 kinds of features:
- features provided by Transformer Engine - for example, DisableFP8GEMM or LogTensorStats - which are listed in the :doc:`debug features API <3_api_features>` section,
- features defined by the user. For details on how to create a custom feature, please read the :doc:`calls to Nvidia-DL-Framework-Inspect <3_api_te_calls>` section.
.. figure:: ./img/introduction.svg
    :align: center

    Fig 1: Example of Nvidia-DL-Framework-Inspect affecting a training script with 3 TE Linear layers. ``config.yaml`` contains the specification of the features used for each Linear layer. Some feature classes are provided by TE; one - ``UserProvidedPrecision`` - is a custom feature implemented by the user. Nvidia-DL-Framework-Inspect inserts features into the layers according to the config.
Example training script
-----------------------
Let's look at a simple example of training a Transformer layer using Transformer Engine with FP8 precision. This example demonstrates how to set up the layer, define an optimizer, and perform a few training iterations using synthetic data.
.. code-block:: python

    # train.py
    from transformer_engine.pytorch import TransformerLayer
    import torch
    import torch.nn as nn
    import torch.optim as optim
    import transformer_engine.pytorch as te

    hidden_size = 512
    num_attention_heads = 8

    transformer_layer = TransformerLayer(
        hidden_size=hidden_size,
        ffn_hidden_size=hidden_size,
        num_attention_heads=num_attention_heads,
    ).cuda()
    dummy_input = torch.randn(10, 32, hidden_size).cuda()
    criterion = nn.MSELoss()
    optimizer = optim.Adam(transformer_layer.parameters(), lr=1e-4)
    dummy_target = torch.randn(10, 32, hidden_size).cuda()

    for epoch in range(5):
        transformer_layer.train()
        optimizer.zero_grad()
        with te.fp8_autocast(enabled=True):
            output = transformer_layer(dummy_input)
        loss = criterion(output, dummy_target)
        loss.backward()
        optimizer.step()
We will demonstrate two debug features on the code above:
1. Disabling FP8 precision for specific GEMM operations, such as the FC1 and FC2 forward propagation GEMMs.
2. Logging statistics for other GEMM operations, such as gradient statistics for the data gradient GEMM within the LayerNormLinear sub-layer of the TransformerLayer.
Config file
-----------
We need to prepare the configuration YAML file, as below:
.. code-block:: yaml

    # config.yaml

    fc1_fprop_to_fp8:
      enabled: True
      layers:
        layer_types: [fc1, fc2]  # contains fc1 or fc2 in name
      transformer_engine:
        DisableFP8GEMM:
          enabled: True
          gemms: [fprop]

    log_tensor_stats:
      enabled: True
      layers:
        layer_types: [layernorm_linear]  # contains layernorm_linear in name
      transformer_engine:
        LogTensorStats:
          enabled: True
          stats: [max, min, mean, std, l1_norm]
          tensors: [activation]
          freq: 1
          start_step: 2
          end_step: 5
Further explanation on how to create config files is in the :doc:`next part of the documentation <2_config_file_structure>`.
Adjusting Python file
---------------------
.. code-block:: python

    # (...)

    import nvdlfw_inspect.api as debug_api

    debug_api.initialize(
        config_file="./config.yaml",
        feature_dirs=["/path/to/transformer_engine/debug/features"],
        log_dir="./log",
        default_logging_enabled=True,
    )

    # initialization of the TransformerLayer with the name
    transformer_layer = TransformerLayer(
        name="transformer_layer",
        # ...
    )

    # (...)

    for epoch in range(5):
        # forward and backward pass
        # ...
        debug_api.step()
In the modified code above, the following changes were made:
1. Added an import for ``nvdlfw_inspect.api``.
2. Initialized Nvidia-DL-Framework-Inspect by calling ``debug_api.initialize()`` with the appropriate configuration, specifying the path to the config file, the feature directories, and the log directory.
3. Added ``debug_api.step()`` after each forward-backward pass.
Inspecting the logs
-------------------
Let's look at the files with the logs. Two files will be created:

1. the debug log,
2. the statistics log.

Let's look inside them!

In the main log file, you can find detailed information about the behavior of the transformer layer's GEMMs. You can see that the ``fc1`` and ``fc2`` fprop GEMMs are run in high precision, as intended.
.. code-block:: text

    # log/nvdlfw_inspect_logs/nvdlfw_inspect_globalrank-0.log

    INFO - Default logging to file enabled at ./log
    INFO - Reading config from ./config.yaml.
    INFO - Loaded configs for dict_keys(['fc1_fprop_to_fp8', 'log_tensor_stats']).
    INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: activation, gemm fprop - FP8 quantization
    INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: activation, gemm wgrad - FP8 quantization
    INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: weight, gemm fprop - FP8 quantization
    INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: weight, gemm dgrad - FP8 quantization
    INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: gradient, gemm dgrad - FP8 quantization
    INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: gradient, gemm wgrad - FP8 quantization
    INFO - transformer_layer.self_attention.proj: Tensor: activation, gemm fprop - FP8 quantization
    INFO - transformer_layer.self_attention.proj: Tensor: activation, gemm wgrad - FP8 quantization
    INFO - transformer_layer.self_attention.proj: Tensor: weight, gemm fprop - FP8 quantization
    INFO - transformer_layer.self_attention.proj: Tensor: weight, gemm dgrad - FP8 quantization
    INFO - transformer_layer.self_attention.proj: Tensor: gradient, gemm dgrad - FP8 quantization
    INFO - transformer_layer.self_attention.proj: Tensor: gradient, gemm wgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc1: Tensor: activation, gemm fprop - High precision
    INFO - transformer_layer.layernorm_mlp.fc1: Tensor: activation, gemm wgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc1: Tensor: weight, gemm fprop - High precision
    INFO - transformer_layer.layernorm_mlp.fc1: Tensor: weight, gemm dgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc1: Tensor: gradient, gemm dgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc1: Tensor: gradient, gemm wgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc2: Tensor: activation, gemm fprop - High precision
    INFO - transformer_layer.layernorm_mlp.fc2: Tensor: activation, gemm wgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc2: Tensor: weight, gemm fprop - High precision
    INFO - transformer_layer.layernorm_mlp.fc2: Tensor: weight, gemm dgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc2: Tensor: gradient, gemm dgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc2: Tensor: gradient, gemm wgrad - FP8 quantization
    INFO - transformer_layer.self_attention.layernorm_qkv: Feature=LogTensorStats, API=look_at_tensor_before_process: activation
    ....
The second log file (``nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-0.log``) contains statistics for tensors we requested in ``config.yaml``.
.. code-block:: text

    # log/nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-0.log

    INFO - transformer_layer.self_attention.layernorm_qkv_activation_max iteration=000002 value=4.3188
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_min iteration=000002 value=-4.3386
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_mean iteration=000002 value=0.0000
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_std iteration=000002 value=0.9998
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_l1_norm iteration=000002 value=130799.6953
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_max iteration=000003 value=4.3184
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_min iteration=000003 value=-4.3381
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_mean iteration=000003 value=0.0000
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_std iteration=000003 value=0.9997
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_l1_norm iteration=000003 value=130788.1016
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_max iteration=000004 value=4.3181
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_min iteration=000004 value=-4.3377
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_mean iteration=000004 value=0.0000
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_std iteration=000004 value=0.9996
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_l1_norm iteration=000004 value=130776.7969
Logging using TensorBoard
-------------------------
Precision debug tools support logging using `TensorBoard <https://www.tensorflow.org/tensorboard>`_. To enable it, pass the ``tb_writer`` argument to ``debug_api.initialize()``. Let's modify the ``train.py`` file.
.. code-block:: python

    # (...)

    from torch.utils.tensorboard import SummaryWriter

    tb_writer = SummaryWriter('./tensorboard_dir/run1')

    # add tb_writer to the Debug API initialization
    debug_api.initialize(
        config_file="./config.yaml",
        feature_dirs=["/path/to/transformer_engine/debug/features"],
        log_dir="./log",
        tb_writer=tb_writer)

    # (...)
Let's run the training and open TensorBoard with ``tensorboard --logdir=./tensorboard_dir/run1``:
.. figure:: ./img/tensorboard.png
    :align: center

    Fig 2: TensorBoard with plotted stats.
\ No newline at end of file
..
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
Config File Structure
=====================
To enable debug features, create a configuration YAML file to specify the desired behavior, such as determining which GEMMs (General Matrix Multiply operations) should run in higher precision rather than FP8 and defining which statistics to log.
Below, we outline how to structure the configuration YAML file.
General Format
--------------
A config file can have one or more sections, each containing settings for specific layers and features:
.. code-block:: yaml

    section_name_1:
      enabled: ...
      layers:
        # Specify layers here...
      transformer_engine:
        Feature1Name:
          enabled: ...
          # Feature details...
        Feature2Name:
          enabled: ...
          # Feature details...

    section_name_2:
      enabled: ...
      layers:
        # Specify layers here...
      Feature1Name: # If feature has no namespace, then it is in the default namespace.
        enabled: ...
        # Feature details...

    section_name_3:
      enabled: ...
      layers:
        # Specify layers here...
      transformer_engine:
        Feature1Name:
          enabled: ...
          # Feature details...
        Feature2Name:
          enabled: ...
          # Feature details...
Sections may have any name and must contain:
1. An ``enabled`` field that specifies whether the features in that section will be active.
2. A ``layers`` field specifying which layers the section applies to. Each layer can belong to only one section.
3. Additional fields describing features for those layers.
Layer Specification
-------------------
Debug layers can be identified by a ``name`` parameter:
.. code-block:: python

    linear = transformer_engine.debug.pytorch.Linear(in_features, out_features, name="linear1")
This name is used in the config file to identify the layer. To specify the ``layers`` field, you can use one of the following methods:
1. ``layer_name_regex_pattern``: Use a regular expression to match layer names. This expression must adhere to the Python ``re`` module syntax.
2. ``layer_types``: Provide a list of strings, where a layer will be selected if any string matches part of its name.
Examples:
.. code-block:: yaml

    # Example 1: Using a regular expression to select layers
    my_section:
      enabled: ...
      layers:
        layer_name_regex_pattern: 'self_attn.*'
      transformer_engine:
        (...)

    # Example 2: Using layer types to select layers
    another_section:
      enabled: ...
      layers:
        layer_types: ['fc1', 'layernorm_linear']
      transformer_engine:
        (...)
Names in Transformer Layers
---------------------------
There are three ways to assign a name to a layer in the Transformer Engine:
- Initialize the layer with the ``name=...`` argument.
- Use ``debug_api.infer_and_assign_layer_names(model)``, which assigns names based on class names (see the sketch after this list).
- Rely on the default names assigned during module initialization, such as ``Layer_n``, where ``n`` represents the layer number.
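
For the second option, a minimal sketch might look as follows (the model below and the exact inferred names are illustrative assumptions; only ``debug_api.infer_and_assign_layer_names`` itself is part of the API described above):

.. code-block:: python

    import torch.nn as nn
    import nvdlfw_inspect.api as debug_api
    import transformer_engine.pytorch as te

    # Assumes debug_api.initialize(...) has already been called.

    # Illustrative model; any module hierarchy containing TE layers works.
    model = nn.Sequential(
        te.Linear(512, 512),
        te.Linear(512, 512),
    ).cuda()

    # Assigns names derived from the class names, so the config file
    # can refer to these layers in its layers section.
    debug_api.infer_and_assign_layer_names(model)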
The ``TransformerLayer`` in Transformer Engine is a composition of multiple sub-layers. We can modify some of these layers using precision debug tools, particularly those that contain exactly one linear layer. To see the names of all such layers, we can inspect log files. For instance, a ``TransformerLayer`` named ``transformer_layer`` might consist of:
- ``transformer_layer.self_attn.layernorm_linear_qkv`` / ``transformer_layer.self_attn.linear_qkv`` / ``transformer_layer.self_attn.layernorm_linear_q`` / ``transformer_layer.self_attn.linear_q`` / ``transformer_layer.self_attn.linear_kv``,
- ``transformer_layer.self_attn.proj``,
- ``transformer_layer.inter_attn.*`` for ``layer_type="decoder"``,
- ``transformer_layer.layernorm_mlp.fc1``,
- ``transformer_layer.layernorm_mlp.fc2``,
depending on the configuration. Some layers, like ``LayerNormLinear``, are fusions of two layers: ``LayerNorm`` and ``Linear``. When referring to such layers in precision debug tools, only the ``Linear`` part is affected.
Below is an example ``TransformerLayer`` with four linear layers that can be influenced by the precision debug tools.
.. figure:: ./img/names.svg
    :align: center
    :width: 80%

    Fig 1: Names of layers in an example configuration of TransformerLayer. The most nested blocks represent the most basic layers, each containing one linear layer. Layers that do not contain linear layers, such as ``DotProductAttention``, are omitted.
**Configuration File Example**
.. code-block:: yaml

    # Disables wgrad in all 4 GEMMs
    section1:
      enabled: True
      layers:
        layer_types: [transformer_layer]
      transformer_engine:
        DisableFP8GEMM:
          enabled: True
          gemms: [wgrad]

    # Disables all GEMMs in the layernorm_mlp layer
    section2:
      enabled: True
      layers:
        layer_types: [layernorm_mlp]
      transformer_engine:
        DisableFP8Layer:
          enabled: True

    # Logs wgrad stats in fc1
    section3:
      enabled: True
      layers:
        layer_types: [fc1]
      transformer_engine:
        LogTensorStats:
          enabled: True
          stats: [min]
          tensors: [wgrad]
          freq: 1
          start_step: 0
          end_step: 50
Structured Configuration for GEMMs and Tensors
----------------------------------------------
Sometimes a feature is parameterized by a list of tensors or by a list of GEMMs.
There are multiple ways of describing this parameterization.
We can pass lists, as below.
.. code-block:: yaml

    Feature:
      enabled: ...
      gemms: [gemm1, gemm2]
      tensors: [tensor1, tensor2]
      ...
We can use a struct for tensors.
.. code-block:: yaml

    Feature:
      gemms: [gemm1, gemm2]
      tensors_struct:
        - tensor: tensor1
          feature_param1: value
        - tensor: tensor2
          feature_param1: value
      gemm_feature_param1: value
Similarly, we can use a struct for GEMMs.
.. code-block:: yaml

    Feature:
      enabled: ...
      tensors: [tensor1, tensor2]
      gemms_struct:
        - gemm: gemm1
          feature_param1: value
        - gemm: gemm2
          feature_param1: value
          gemm_feature_param1: value
We can use structs for both tensors and GEMMs; ``tensors_struct`` should be nested inside ``gemms_struct``.
.. code-block:: yaml

    Feature:
      enabled: ...
      gemms_struct:
        - gemm: gemm1
          tensors: [tensor1, tensor2]
          tensor_feature_param1: value
          gemm_feature_param1: value
        - gemm: gemm2
          tensors_struct:
            - tensor: tensor1
              tensor_feature_param1: value
            - tensor: tensor2
              tensor_feature_param2: value
          gemm_feature_param1: value
Enabling or Disabling Sections and Features
-------------------------------------------
Debug features can be enabled or disabled with the ``enabled`` keyword:
.. code-block:: yaml

    section1:
      enabled: True
      layers:
        layer_types: [self_attention]
      transformer_engine:
        LogTensorStats:
          enabled: False  # Disables the LogTensorStats feature
          stats: [max, min, mean, std, l1_norm]

    section2:
      enabled: False  # Disables entire section2
      transformer_engine:
        LogFp8TensorStats:
          enabled: True  # Does not enable the LogFp8TensorStats feature, because section2 is disabled
          stats: [underflows, overflows]
By organizing your ``config.yaml`` properly, you can easily manage debugging features, ensuring a more streamlined and customizable debugging experience.
..
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
Setup
=====
Precision debug tools for Transformer Engine use the `Nvidia-DL-Framework-Inspect <https://github.com/NVIDIA/nvidia-dlfw-inspect>`_ package from NVIDIA.
Please refer to the Nvidia-DL-Framework-Inspect `documentation <https://github.com/NVIDIA/nvidia-dlfw-inspect/tree/main/docs>`_ for more details.
Below, we outline the steps for debug initialization.
initialize()
------------
Must be called once on every rank in the global context to initialize Nvidia-DL-Framework-Inspect.
**Parameters**
- **config_file** (*str*, default=""): Path to the configuration YAML file containing features to enable and layer names. If one wants to run without the configuration file, pass ``""``.
- **feature_dirs** (*List[str] | str*): List of directories containing features to load and register. One needs to pass ``[/path/to/transformerengine/transformer_engine/debug/features]`` to use TE features.
- **logger** (*Union[BaseLogger, None]*, default=None): Logger for logging tensor statistics. Should adhere to ``BaseLogger`` from the `Nvidia-DL-Framework-Inspect <https://github.com/NVIDIA/nvidia-dlfw-inspect>`_ package.
- **log_dir** (*str*, default="."): Directory path to hold ``debug_logs`` and ``debug_statistics_logs``.
- **tb_writer** (*TensorBoardWriter*, default=None): TensorBoard writer for logging.
- **default_logging_enabled** (*bool*, default=False): Enable default logging to the file.
.. code-block:: python

    import nvdlfw_inspect.api as debug_api

    debug_api.initialize(
        config_file="./config.yaml",
        feature_dirs=["/path/to/transformer_engine/debug/features"],
        log_dir="./log_dir")
set_tensor_reduction_group()
----------------------------
Needed only for logging tensor stats. In multi-GPU training, activation and gradient tensors are distributed across multiple nodes. This method lets you specify the group for the reduction of stats; see the `reduction group section <./4_distributed.rst#reduction-groups>`_ for more details.
If the tensor reduction group is not specified, then statistics are reduced across all nodes in the run.
**Parameters**
- **group** (torch.distributed.ProcessGroup): The process group across which tensors will be reduced to get stats.
.. code-block:: python

    import nvdlfw_inspect.api as debug_api

    # initialization
    # (...)

    pipeline_parallel_group = initialize_pipeline_parallel_group()
    debug_api.set_tensor_reduction_group(pipeline_parallel_group)

    # training
    # (...)
    # activation/gradient tensor statistics are reduced along pipeline_parallel_group
set_weight_tensor_tp_group_reduce()
-----------------------------------
By default, weight tensor statistics are reduced within the tensor parallel group. This function allows you to disable that behavior; for more details, see `reduction group section <./4_distributed.rst#reduction-groups>`_.
This method is provided not by ``debug_api``, but by ``transformer_engine.debug``.
**Parameters**
- **enabled** (*bool*, default=True): A boolean flag to enable or disable the reduction of weight tensor statistics within the tensor parallel group.
.. code-block:: python

    import nvdlfw_inspect.api as debug_api
    from transformer_engine.debug import set_weight_tensor_tp_group_reduce

    # initialization
    # (...)

    set_weight_tensor_tp_group_reduce(False)

    # training
    # (...)
    # weight tensor statistics are not reduced
..
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
Debug features
==============
.. autoapiclass:: transformer_engine.debug.features.log_tensor_stats.LogTensorStats
.. autoapiclass:: transformer_engine.debug.features.log_fp8_tensor_stats.LogFp8TensorStats
.. autoapiclass:: transformer_engine.debug.features.disable_fp8_gemm.DisableFP8GEMM
.. autoapiclass:: transformer_engine.debug.features.disable_fp8_layer.DisableFP8Layer
.. autoapiclass:: transformer_engine.debug.features.per_tensor_scaling.PerTensorScaling
.. autoapiclass:: transformer_engine.debug.features.fake_quant.FakeQuant
..
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
Calls to Nvidia-DL-Framework-Inspect
====================================
Let's look deeper into how Nvidia-DL-Framework-Inspect and Transformer Engine work together. Transformer Engine layers have hook calls inside each of the GEMMs. Users can define feature classes or use the feature classes provided with TE. The file ``config.yaml`` describes which hooks need to be used for which layers. Nvidia-DL-Framework-Inspect combines 3 things: TE training, the feature classes, and ``config.yaml``, and takes care of inserting hooks in the correct places. This process is illustrated in the image below.
.. figure:: ./img/api_calls1.svg
    :align: center

    Fig 1: Example of Nvidia-DL-Framework-Inspect affecting a training script with 1 Linear layer. For tensors mentioned in ``config.yaml``, the behavior of the ``modify_tensor_enabled()`` and ``modify_tensor()`` calls is substituted with definitions from the feature class. Other calls return default values - in fact, they do nothing.
This page lists all the calls from Transformer Engine to Nvidia-DL-Framework-Inspect for each GEMM. The order of these calls is illustrated in the image below.
.. figure:: ./img/api_calls2.svg
    :align: center

    Fig 2: The calls made by Transformer Engine to Nvidia-DL-Framework-Inspect. There are 2 types of calls: GEMM calls and routing calls.
There are 2 categories of API calls, each used for a different purpose:

- GEMM calls - invoked during every GEMM, used to process or quantize tensors and collect information about them,
- routing calls - invoked at the beginning of every forward pass - they indicate whether a feature is going to use ``modify_tensor()``, etc.

If all routing calls for a layer return ``False``, the layer is invoked in an optimized version with Transformer Engine fusions.
If any routing call returns ``True``, the layer runs without the fusions; this is necessary because otherwise some tensors could not be accessed
when fusions happen. An important remark is that if no feature is used for a layer, it should perform as fast as a layer run without ``debug_api`` initialized.
.. autoapifunction:: transformer_engine.debug.features.api.TEDefaultFeatures.modify_tensor
.. autoapifunction:: transformer_engine.debug.features.api.TEDefaultFeatures.inspect_tensor
.. autoapifunction:: transformer_engine.debug.features.api.TEDefaultFeatures.inspect_tensor_postquantize
.. autoapifunction:: transformer_engine.debug.features.api.TEDefaultFeatures.modify_tensor_enabled
.. autoapifunction:: transformer_engine.debug.features.api.TEDefaultFeatures.fp8_gemm_enabled
.. autoapifunction:: transformer_engine.debug.features.api.TEDefaultFeatures.inspect_tensor_enabled
.. autoapifunction:: transformer_engine.debug.features.api.TEDefaultFeatures.inspect_tensor_postquantize_enabled
..
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
Distributed training
====================
Nvidia-DL-Framework-Inspect with Transformer Engine supports multi-GPU training. This guide describes how to run it and how the supported features work in the distributed setting.
To use precision debug tools in multi-GPU training, one needs to:
1. Run ``debug_api.initialize(...)`` and provide the same configuration YAML file on every node.
2. If one wants to log stats, one may want to invoke ``debug_api.set_tensor_reduction_group`` with a proper reduction group, as in the sketch below.
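
A minimal sketch of such a setup (the process-group construction is illustrative; pick the group that matches your parallelism layout):

.. code-block:: python

    import torch.distributed as dist
    import nvdlfw_inspect.api as debug_api

    dist.init_process_group(backend="nccl")

    # Step 1: same config file and feature directories on every rank.
    debug_api.initialize(
        config_file="./config.yaml",
        feature_dirs=["/path/to/transformer_engine/debug/features"],
        log_dir="./log")

    # Step 2 (optional, for stats logging): reduce statistics within a
    # chosen group of ranks instead of across all nodes.
    stats_group = dist.new_group(ranks=[0, 1])  # illustrative choice
    debug_api.set_tensor_reduction_group(stats_group)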
Behavior of the features
------------------------
In a distributed setting, **DisableFP8GEMM** and **DisableFP8Layer** function similarly to the single-GPU case, with no notable differences.
**PerTensorScaling** and **FakeQuant** calculate FP8 scaling factors independently on each node, meaning the number of GPUs may affect results. This differs from the delayed scaling FP8 recipe behavior, in which scaling factors are synchronized.
.. figure:: ./img/scaling_factors.svg
    :align: center

    Fig 1: For **PerTensorScaling** and **FakeQuant**, tensor scaling factors are computed separately for each of the tensor shards. This is not the case for delayed scaling FP8 scaling factors, which are synchronized.
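
The effect can be sketched with plain tensor operations (a toy illustration of the idea, not Transformer Engine's internal code):

.. code-block:: python

    import torch

    FP8_MAX = 448.0  # largest representable magnitude in FP8 E4M3

    full_tensor = torch.randn(4, 1024)
    shards = full_tensor.chunk(2)  # as if split across 2 GPUs

    # Per-shard scaling (PerTensorScaling / FakeQuant): each "GPU"
    # derives its scaling factor from its own shard's amax.
    per_shard_scales = [FP8_MAX / shard.abs().max() for shard in shards]

    # Synchronized scaling (delayed scaling recipe): one factor derived
    # from the global amax, identical on every "GPU".
    global_scale = FP8_MAX / full_tensor.abs().max()

    # The factors generally differ, so the quantization error depends
    # on how many shards the tensor is split across.
    print(per_shard_scales, global_scale)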
Logging-related features are more complex and will be discussed further in the next sections.
Reduction groups
----------------
In setups with tensor, data, or pipeline parallelism, some tensors are distributed across multiple GPUs, requiring a reduction operation to compute statistics for these tensors.
The weight tensor is always split among the tensor parallel group, and debug tools automatically reduce statistics within this group by default. To disable this automatic reduction, use:
.. code-block:: python

    transformer_engine.debug.set_weight_tensor_tp_group_reduce(False)
In cases of data parallelism, Transformer Engine modules lack the process group needed for reduction. To manually specify the group, use:
.. code-block:: python

    debug_api.set_tensor_reduction_group(group)
This command ensures statistics are reduced across the defined group. Activation statistics are logged after the forward pass (immediately after exiting autocast), while gradient (dgrad and wgrad) statistics are logged following the backward pass.
Below, we illustrate configurations for a 4-node setup with tensor parallelism size 2 and data parallelism size 2, showcasing different reduction configurations.
.. figure:: ./img/reduction1.svg
    :align: center

    Fig 2: There is a single tensor reduction group composed of all nodes. As a result, each node logs the same statistics for the tensors, as they are fully reduced across all nodes.
.. figure:: ./img/reduction2.svg
    :align: center

    Fig 3: Every node is set with a tensor reduction group consisting of itself. Every node prints the same statistics for weights (which are still synchronized within TP groups), but the statistics of activations and gradients are not synchronized.
.. figure:: ./img/reduction3.svg
    :align: center

    Fig 4: Weight synchronization is disabled by ``set_weight_tensor_tp_group_reduce(False)``, so every node logs stats for its shard of the weight.
Microbatching
-------------
Let's dive into how statistics collection works with microbatching. By microbatching, we mean invoking multiple ``forward()`` calls for each ``debug_api.step()``. The behavior is as follows:
- For weight tensors, the stats remain the same for each microbatch because the weight does not change.
- For other tensors, the stats are accumulated over the microbatches, as in the sketch below.
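
In code, this looks like the following sketch (``transformer_layer`` and the inputs are placeholders in the spirit of the earlier examples):

.. code-block:: python

    # Two microbatches per logged step: activation and gradient stats
    # are accumulated over both forward-backward passes, while weight
    # stats stay the same because the weights do not change in between.
    for microbatch in [dummy_input_a, dummy_input_b]:
        output = transformer_layer(microbatch)
        output.sum().backward()

    # A single debug_api.step() closes the accumulation window.
    debug_api.step()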
Logging to files and TensorBoard
--------------------------------
In a single-node setup with ``default_logging_enabled=True``, all logs are saved by default to ``log_dir/nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-0.log``. In multi-GPU training, each node writes its reduced statistics to its unique file, named ``log_dir/nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-i.log`` for rank i. Because these logs contain reduced statistics, the logged values are identical for all nodes within a reduction group.
If certain nodes are given a TensorBoard writer, only those nodes will log to TensorBoard. This is useful in scenarios involving pipeline, data, and tensor parallelism, such as with two transformer layers and settings TP_SIZE = 2, DP_SIZE = 2, and PP_SIZE = 2. To log all stats to TensorBoard, you should pass a TensorBoard writer to one process in each pipeline parallel group.
.. figure:: ./img/pipeline_logging.svg
    :align: center

    Fig 5: Example with pipeline parallelism, where a ``tb_writer`` is assigned to one node within each pipeline parallel group, setting these as tensor reduction groups.
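
A sketch of this pattern (the rank selection is illustrative and depends on how your pipeline parallel groups are laid out):

.. code-block:: python

    import torch.distributed as dist
    from torch.utils.tensorboard import SummaryWriter
    import nvdlfw_inspect.api as debug_api

    rank = dist.get_rank()
    ranks_with_writer = {0, 4}  # hypothetical: one rank per PP group

    # Only the chosen ranks get a writer; the rest pass tb_writer=None.
    tb_writer = None
    if rank in ranks_with_writer:
        tb_writer = SummaryWriter(f"./tensorboard_dir/rank{rank}")

    debug_api.initialize(
        config_file="./config.yaml",
        feature_dirs=["/path/to/transformer_engine/debug/features"],
        log_dir="./log",
        tb_writer=tb_writer)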
Alternatively, setting the tensor reduction group to None will yield unreduced statistics for wgrad and dgrad tensors on each node, allowing for post-processing. For weight statistics without reduction in the TP parallel group, use:
.. code-block:: python

    transformer_engine.debug.set_weight_tensor_tp_group_reduce(False)
\ No newline at end of file