"src/models/vscode:/vscode.git/clone" did not exist on "760b769ee5c90d953b2f737445ef89ca96d513fe"
Commit 2b05e121 authored by yuguo's avatar yuguo
Browse files


Merge commit 'a69692ac' of https://github.com/NVIDIA/TransformerEngine
parents 0fd441c2 a69692ac
@@ -18,8 +18,8 @@ jobs:
      - name: 'Dependencies'
        run: |
          apt-get update
          apt-get install -y git python3.9 pip ninja-build cudnn9-cuda-12
          pip install cmake==3.21.0
          apt-get install -y git python3.9 pip cudnn9-cuda-12
          pip install cmake==3.21.0 pybind11[global] ninja
      - name: 'Checkout'
        uses: actions/checkout@v3
        with:
@@ -42,8 +42,8 @@ jobs:
      - name: 'Dependencies'
        run: |
          apt-get update
          apt-get install -y git python3.9 pip ninja-build cudnn9-cuda-12
          pip install cmake torch pydantic importlib-metadata>=1.0 packaging pybind11
          apt-get install -y git python3.9 pip cudnn9-cuda-12
          pip install cmake torch ninja pydantic importlib-metadata>=1.0 packaging pybind11 numpy einops
      - name: 'Checkout'
        uses: actions/checkout@v3
        with:
@@ -54,7 +54,6 @@ jobs:
          NVTE_FRAMEWORK: pytorch
          MAX_JOBS: 1
      - name: 'Sanity check'
        if: false  # Sanity import test requires Flash Attention
        run: python3 tests/pytorch/test_sanity_import.py
  jax:
    name: 'JAX'
@@ -63,6 +62,8 @@
      image: ghcr.io/nvidia/jax:jax
      options: --user root
    steps:
      - name: 'Dependencies'
        run: pip install pybind11[global]
      - name: 'Checkout'
        uses: actions/checkout@v3
        with:
@@ -73,4 +74,24 @@ jobs:
          NVTE_FRAMEWORK: jax
          MAX_JOBS: 1
      - name: 'Sanity check'
        run: python tests/jax/test_sanity_import.py
        run: python3 tests/jax/test_sanity_import.py
  all:
    name: 'All'
    runs-on: ubuntu-latest
    container:
      image: ghcr.io/nvidia/jax:jax
      options: --user root
    steps:
      - name: 'Dependencies'
        run: pip install torch pybind11[global] einops
      - name: 'Checkout'
        uses: actions/checkout@v3
        with:
          submodules: recursive
      - name: 'Build'
        run: pip install --no-build-isolation . -v --no-deps
        env:
          NVTE_FRAMEWORK: all
          MAX_JOBS: 1
      - name: 'Sanity check'
        run: python3 tests/pytorch/test_sanity_import.py && python3 tests/jax/test_sanity_import.py
@@ -53,6 +53,7 @@ jobs:
        || github.actor == 'lhb8125'
        || github.actor == 'kunlunl'
        || github.actor == 'pstjohn'
        || github.actor == 'mk-61'
      )
    steps:
      - name: Check if comment is issued by authorized person
@@ -146,7 +146,7 @@ Installation
============
System Requirements
^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^
* **Hardware:** Blackwell, Hopper, Grace Hopper/Blackwell, Ada, Ampere
@@ -164,10 +164,10 @@ System Requirements
* **Notes:** FP8 features require Compute Capability 8.9+ (Ada/Hopper/Blackwell)
Installation Methods
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^
Docker (Recommended)
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^
The quickest way to get started with Transformer Engine is by using Docker images on
`NVIDIA GPU Cloud (NGC) Catalog <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch>`_.
@@ -192,7 +192,7 @@ Where 25.04 (corresponding to April 2025 release) is the container version.
* NGC PyTorch 23.08+ containers include FlashAttention-2
pip Installation
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^
**Prerequisites for pip installation:**
@@ -216,13 +216,25 @@ Alternatively, install directly from the GitHub repository:

.. code-block:: bash

    pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
    pip install --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@stable

When installing from GitHub, you can explicitly specify frameworks using the environment variable:

.. code-block:: bash

    NVTE_FRAMEWORK=pytorch,jax pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
    NVTE_FRAMEWORK=pytorch,jax pip install --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@stable

conda Installation
^^^^^^^^^^^^^^^^^^

To install the latest stable version with conda from conda-forge:

.. code-block:: bash

    # For PyTorch integration
    conda install -c conda-forge transformer-engine-torch

    # JAX integration (coming soon)

Source Installation
^^^^^^^^^^^^^^^^^^^
@@ -230,7 +242,7 @@ Source Installation
`See the installation guide <https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/installation.html#installation-from-source>`_
Environment Variables
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^
These environment variables can be set before installation to customize the build process:
* **CUDA_PATH**: Path to CUDA installation
@@ -241,7 +253,7 @@ These environment variables can be set before installation to customize the build process:
* **NVTE_BUILD_THREADS_PER_JOB**: Control threads per build job
Compiling with FlashAttention
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Transformer Engine supports both FlashAttention-2 and FlashAttention-3 in PyTorch for improved performance. FlashAttention-3 was added in release v1.11 and is prioritized over FlashAttention-2 when both are present in the environment.
You can verify which FlashAttention version is being used by setting these environment variables:
@@ -253,8 +265,9 @@ You can verify which FlashAttention version is being used by setting these environment variables:
It is a known issue that FlashAttention-2 compilation is resource-intensive and requires a large amount of RAM (see `bug <https://github.com/Dao-AILab/flash-attention/issues/358>`_), which may lead to out of memory errors during the installation of Transformer Engine. Please try setting **MAX_JOBS=1** in the environment to circumvent the issue.
.. troubleshooting-begin-marker-do-not-remove
Troubleshooting
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^
**Common Issues and Solutions:**
@@ -388,7 +401,7 @@ Papers
Videos
======
* `Stable and Scalable FP8 Deep Learning Training on Blackwell | GTC 2025 <https://www.nvidia.com/en-us/on-demand/session/gtc24-s62457/>`_
* `Stable and Scalable FP8 Deep Learning Training on Blackwell | GTC 2025 <https://www.nvidia.com/en-us/on-demand/session/gtc24-s62457/>`__
* `Blackwell Numerics for AI | GTC 2025 <https://www.nvidia.com/en-us/on-demand/session/gtc25-s72458/>`_
* `Building LLMs: Accelerating Pretraining of Foundational Models With FP8 Precision | GTC 2025 <https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=zoho#/session/1726152813607001vnYK>`_
* `From FP8 LLM Training to Inference: Language AI at Scale | GTC 2025 <https://www.nvidia.com/en-us/on-demand/session/gtc25-s72799/>`_
@@ -4,7 +4,6 @@
"""JAX related extensions."""
import os
import shutil
from pathlib import Path
import setuptools
@@ -13,6 +12,16 @@ from .utils import get_cuda_include_dirs, all_files_in_dir, debug_build_enabled
from typing import List
def install_requirements() -> List[str]:
    """Install dependencies for TE/JAX extensions."""
    return ["jax", "flax>=0.7.1"]


def test_requirements() -> List[str]:
    """Test dependencies for TE/JAX extensions."""
    return ["numpy"]


def xla_path() -> str:
    """XLA root path lookup.
    Throws FileNotFoundError if XLA source is not found."""
@@ -66,20 +75,9 @@ def setup_jax_extension(
    # Define TE/JAX as a Pybind11Extension
    from pybind11.setup_helpers import Pybind11Extension

    class Pybind11CPPExtension(Pybind11Extension):
        """Modified Pybind11Extension to allow custom CXX flags."""

        def _add_cflags(self, flags: List[str]) -> None:
            if isinstance(self.extra_compile_args, dict):
                cxx_flags = self.extra_compile_args.pop("cxx", [])
                cxx_flags += flags
                self.extra_compile_args["cxx"] = cxx_flags
            else:
                self.extra_compile_args[:0] = flags

    return Pybind11CPPExtension(
    return Pybind11Extension(
        "transformer_engine_jax",
        sources=[str(path) for path in sources],
        include_dirs=[str(path) for path in include_dirs],
        extra_compile_args={"cxx": cxx_flags},
        extra_compile_args=cxx_flags,
    )
@@ -9,6 +9,22 @@ from pathlib import Path
import setuptools
from .utils import all_files_in_dir, cuda_version, get_cuda_include_dirs, debug_build_enabled, rocm_build, hipify
from typing import List
def install_requirements() -> List[str]:
    """Install dependencies for TE/PyTorch extensions."""
    reqs = ["torch>=2.1", "einops"]
    reqs.append(
        "nvdlfw-inspect @"
        " git+https://github.com/NVIDIA/nvidia-dlfw-inspect.git@v0.1#egg=nvdlfw-inspect"
    )
    return reqs


def test_requirements() -> List[str]:
    """Test dependencies for TE/PyTorch extensions."""
    return ["numpy", "torchvision", "transformers"]
def setup_pytorch_extension(
@@ -21,13 +21,7 @@ from typing import List, Optional, Tuple, Union
@functools.lru_cache(maxsize=None)
def debug_build_enabled() -> bool:
    """Whether to build with a debug configuration"""
    for arg in sys.argv:
        if arg == "--debug":
            sys.argv.remove(arg)
            return True
    if int(os.getenv("NVTE_BUILD_DEBUG", "0")):
        return True
    return False
    return bool(int(os.getenv("NVTE_BUILD_DEBUG", "0")))
@functools.lru_cache(maxsize=None)
@@ -280,9 +274,12 @@ def get_cuda_include_dirs() -> Tuple[str, str]:
def cuda_archs() -> str:
    version = cuda_version()
    if os.getenv("NVTE_CUDA_ARCHS") is None:
        os.environ["NVTE_CUDA_ARCHS"] = (
            "70;80;89;90;100;120" if version >= (12, 8) else "70;80;89;90"
        )
        if version >= (13, 0):
            os.environ["NVTE_CUDA_ARCHS"] = "75;80;89;90;100;120"
        elif version >= (12, 8):
            os.environ["NVTE_CUDA_ARCHS"] = "70;80;89;90;100;120"
        else:
            os.environ["NVTE_CUDA_ARCHS"] = "70;80;89;90"
    return os.getenv("NVTE_CUDA_ARCHS")
@@ -455,10 +452,3 @@ def hipify(base_dir, src_dir, sources, include_dirs):
            # *never* absolute paths
            hipified_sources.add(os.path.relpath(fname, cwd))
    return list(hipified_sources)


def install_and_import(package):
    """Install a package via pip (if not already installed) and import into globals."""
    main_package = package.split("[")[0]
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])
    globals()[main_package] = importlib.import_module(main_package)
@@ -20,6 +20,9 @@ cd /TransformerEngine
git checkout $TARGET_BRANCH
git submodule update --init --recursive
# Install deps
/opt/python/cp310-cp310/bin/pip install cmake pybind11[global] ninja
if $BUILD_METAPACKAGE ; then
    cd /TransformerEngine
    NVTE_BUILD_METAPACKAGE=1 /opt/python/cp310-cp310/bin/python setup.py bdist_wheel 2>&1 | tee /wheelhouse/logs/metapackage.txt
@@ -31,15 +34,15 @@ if $BUILD_COMMON ; then
    WHL_BASE="transformer_engine-${VERSION}"
    # Create the wheel.
    /opt/python/cp38-cp38/bin/python setup.py bdist_wheel --verbose --python-tag=py3 --plat-name=$PLATFORM 2>&1 | tee /wheelhouse/logs/common.txt
    /opt/python/cp310-cp310/bin/python setup.py bdist_wheel --verbose --python-tag=py3 --plat-name=$PLATFORM 2>&1 | tee /wheelhouse/logs/common.txt
    # Repack the wheel for cuda specific package, i.e. cu12.
    /opt/python/cp38-cp38/bin/wheel unpack dist/*
    /opt/python/cp310-cp310/bin/wheel unpack dist/*
    # From python 3.10 to 3.11, the package name delimiter in metadata got changed from - (hyphen) to _ (underscore).
    sed -i "s/Name: transformer-engine/Name: transformer-engine-cu12/g" "transformer_engine-${VERSION}/transformer_engine-${VERSION}.dist-info/METADATA"
    sed -i "s/Name: transformer_engine/Name: transformer_engine_cu12/g" "transformer_engine-${VERSION}/transformer_engine-${VERSION}.dist-info/METADATA"
    mv "${WHL_BASE}/${WHL_BASE}.dist-info" "${WHL_BASE}/transformer_engine_cu12-${VERSION}.dist-info"
    /opt/python/cp38-cp38/bin/wheel pack ${WHL_BASE}
    /opt/python/cp310-cp310/bin/wheel pack ${WHL_BASE}
    # Rename the wheel to make it python version agnostic.
    whl_name=$(basename dist/*)
@@ -51,14 +54,14 @@ fi
if $BUILD_PYTORCH ; then
    cd /TransformerEngine/transformer_engine/pytorch
    /opt/python/cp38-cp38/bin/pip install torch
    /opt/python/cp38-cp38/bin/python setup.py sdist 2>&1 | tee /wheelhouse/logs/torch.txt
    /opt/python/cp310-cp310/bin/pip install torch
    /opt/python/cp310-cp310/bin/python setup.py sdist 2>&1 | tee /wheelhouse/logs/torch.txt
    cp dist/* /wheelhouse/
fi
if $BUILD_JAX ; then
    cd /TransformerEngine/transformer_engine/jax
    /opt/python/cp310-cp310/bin/pip install "jax[cuda12_local]" jaxlib
    /opt/python/cp310-cp310/bin/pip install "jax[cuda12_local]" jaxlib
    /opt/python/cp310-cp310/bin/python setup.py sdist 2>&1 | tee /wheelhouse/logs/jax.txt
    cp dist/* /wheelhouse/
fi
..
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
cast_transpose_noop.h
=====================
.. doxygenfile:: cast_transpose_noop.h
..
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
cudnn.h
=======
.. doxygenfile:: cudnn.h
@@ -14,10 +14,13 @@ directly from C/C++, without Python.
    transformer_engine.h <transformer_engine>
    activation.h <activation>
    cast_transpose_noop.h <cast_transpose_noop>
    cast.h <cast>
    cudnn.h <cudnn>
    fused_attn.h <fused_attn>
    fused_rope.h <fused_rope>
    gemm.h <gemm>
    multi_tensor.h <multi_tensor>
    normalization.h <normalization>
    padding.h <padding>
    permutation.h <permutation>
..
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
multi_tensor.h
==============
.. doxygenfile:: multi_tensor.h
@@ -11,3 +11,7 @@ Common API
.. autoapiclass:: transformer_engine.common.recipe.DelayedScaling(margin=0, fp8_format=Format.HYBRID, amax_history_len=1024, amax_compute_algo="max", scaling_factor_compute_algo=None)
.. autoapiclass:: transformer_engine.common.recipe.MXFP8BlockScaling(fp8_format=Format.E4M3)
.. autoapiclass:: transformer_engine.common.recipe.Float8CurrentScaling(fp8_format=Format.HYBRID)
.. autoapiclass:: transformer_engine.common.recipe.Float8BlockScaling(fp8_format=Format.E4M3)
..
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
Precision debug tools
==============================================
.. toctree::
    :caption: Precision debug tools

    debug/1_getting_started.rst
    debug/2_config_file_structure.rst
    debug/api
    debug/4_distributed.rst
\ No newline at end of file
..
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
Getting started
===============
.. note::
    Precision debug tools with `Nvidia-DL-Framework-Inspect <https://github.com/NVIDIA/nvidia-dlfw-inspect>`_ for Transformer Engine are currently supported only for PyTorch.
Transformer Engine provides a set of precision debug tools which allow you to easily:
- log the statistics for each of the tensors in every matrix multiply (GEMM) operation,
- run selected GEMMs in higher precision,
- run current scaling - with one scaling factor per tensor - for particular GEMMs,
- test new precisions and integrate them with FP8 training,
- ... and many more.
There are 4 things one needs to do to use Transformer Engine debug features:
1. Create a configuration YAML file to configure the desired features.
2. Import and initialize the `Nvidia-DL-Framework-Inspect <https://github.com/NVIDIA/nvidia-dlfw-inspect>`_ tool, which is installed as a dependency of Transformer Engine.
3. One can pass ``name="..."`` when creating TE layers to make them easier to identify. If a name is not provided, names will be inferred automatically.
4. Invoke ``debug_api.step()`` at the end of every forward-backward pass.
To start debugging, one needs to create a configuration YAML file. This file lists the features to be used in particular layers. There are 2 kinds of features:
- features provided by Transformer Engine - for example, DisableFP8GEMM or LogTensorStats - which are listed in the :doc:`debug features API <3_api_features>` section,
- features defined by the user. For details on how to create a custom feature, please read the :doc:`calls to Nvidia-DL-Framework-Inspect <3_api_te_calls>` section.
.. figure:: ./img/introduction.svg
    :align: center

    Fig 1: Example of Nvidia-DL-Framework-Inspect affecting a training script with 3 TE Linear layers. ``config.yaml`` contains the specification of the features used for each Linear layer. Some feature classes are provided by TE; one - ``UserProvidedPrecision`` - is a custom feature implemented by the user. Nvidia-DL-Framework-Inspect inserts features into the layers according to the config.
Example training script
-----------------------
Let's look at a simple example of training a Transformer layer using Transformer Engine with FP8 precision. This example demonstrates how to set up the layer, define an optimizer, and perform a few training iterations using synthetic data.
.. code-block:: python

    # train.py
    from transformer_engine.pytorch import TransformerLayer
    import torch
    import torch.nn as nn
    import torch.optim as optim
    import transformer_engine.pytorch as te

    hidden_size = 512
    num_attention_heads = 8

    transformer_layer = TransformerLayer(
        hidden_size=hidden_size,
        ffn_hidden_size=hidden_size,
        num_attention_heads=num_attention_heads,
    ).cuda()
    dummy_input = torch.randn(10, 32, hidden_size).cuda()
    criterion = nn.MSELoss()
    optimizer = optim.Adam(transformer_layer.parameters(), lr=1e-4)
    dummy_target = torch.randn(10, 32, hidden_size).cuda()

    for epoch in range(5):
        transformer_layer.train()
        optimizer.zero_grad()
        with te.fp8_autocast(enabled=True):
            output = transformer_layer(dummy_input)
        loss = criterion(output, dummy_target)
        loss.backward()
        optimizer.step()
We will demonstrate two debug features on the code above:
1. Disabling FP8 precision for specific GEMM operations, such as the FC1 and FC2 forward propagation GEMMs.
2. Logging statistics for other GEMM operations, such as gradient statistics for the data gradient GEMM within the LayerNormLinear sub-layer of the TransformerLayer.
Config file
-----------
We need to prepare the configuration YAML file, as below:
.. code-block:: yaml

    # config.yaml

    fc1_fprop_to_fp8:
      enabled: True
      layers:
        layer_types: [fc1, fc2]  # contains fc1 or fc2 in name
      transformer_engine:
        DisableFP8GEMM:
          enabled: True
          gemms: [fprop]

    log_tensor_stats:
      enabled: True
      layers:
        layer_types: [layernorm_linear]  # contains layernorm_linear in name
      transformer_engine:
        LogTensorStats:
          enabled: True
          stats: [max, min, mean, std, l1_norm]
          tensors: [activation]
          freq: 1
          start_step: 2
          end_step: 5
Further explanation on how to create config files is in the :doc:`next part of the documentation <2_config_file_structure>`.
Adjusting Python file
---------------------
.. code-block:: python

    # (...)

    import nvdlfw_inspect.api as debug_api

    debug_api.initialize(
        config_file="./config.yaml",
        feature_dirs=["/path/to/transformer_engine/debug/features"],
        log_dir="./log",
        default_logging_enabled=True,
    )

    # initialization of the TransformerLayer with the name
    transformer_layer = TransformerLayer(
        name="transformer_layer",
        # ...
    )

    # (...)

    for epoch in range(5):
        # forward and backward pass
        # ...
        debug_api.step()
In the modified code above, the following changes were made:
1. Added an import for ``nvdlfw_inspect.api``.
2. Initialized Nvidia-DL-Framework-Inspect by calling ``debug_api.initialize()`` with the appropriate configuration, specifying the path to the config file, the feature directories, and the log directory.
3. Added ``debug_api.step()`` after each forward-backward pass.
Inspecting the logs
-------------------
Let's look at the files with the logs. Two files will be created:

1. the debug log,
2. the statistics log.

Let's look inside them!

In the main log file, you can find detailed information about the behavior of the transformer layer's GEMMs. You can see that the ``fc1`` and ``fc2`` fprop GEMMs are run in high precision, as intended.
.. code-block:: text

    # log/nvdlfw_inspect_logs/nvdlfw_inspect_globalrank-0.log

    INFO - Default logging to file enabled at ./log
    INFO - Reading config from ./config.yaml.
    INFO - Loaded configs for dict_keys(['fc1_fprop_to_fp8', 'log_tensor_stats']).
    INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: activation, gemm fprop - FP8 quantization
    INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: activation, gemm wgrad - FP8 quantization
    INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: weight, gemm fprop - FP8 quantization
    INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: weight, gemm dgrad - FP8 quantization
    INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: gradient, gemm dgrad - FP8 quantization
    INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: gradient, gemm wgrad - FP8 quantization
    INFO - transformer_layer.self_attention.proj: Tensor: activation, gemm fprop - FP8 quantization
    INFO - transformer_layer.self_attention.proj: Tensor: activation, gemm wgrad - FP8 quantization
    INFO - transformer_layer.self_attention.proj: Tensor: weight, gemm fprop - FP8 quantization
    INFO - transformer_layer.self_attention.proj: Tensor: weight, gemm dgrad - FP8 quantization
    INFO - transformer_layer.self_attention.proj: Tensor: gradient, gemm dgrad - FP8 quantization
    INFO - transformer_layer.self_attention.proj: Tensor: gradient, gemm wgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc1: Tensor: activation, gemm fprop - High precision
    INFO - transformer_layer.layernorm_mlp.fc1: Tensor: activation, gemm wgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc1: Tensor: weight, gemm fprop - High precision
    INFO - transformer_layer.layernorm_mlp.fc1: Tensor: weight, gemm dgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc1: Tensor: gradient, gemm dgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc1: Tensor: gradient, gemm wgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc2: Tensor: activation, gemm fprop - High precision
    INFO - transformer_layer.layernorm_mlp.fc2: Tensor: activation, gemm wgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc2: Tensor: weight, gemm fprop - High precision
    INFO - transformer_layer.layernorm_mlp.fc2: Tensor: weight, gemm dgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc2: Tensor: gradient, gemm dgrad - FP8 quantization
    INFO - transformer_layer.layernorm_mlp.fc2: Tensor: gradient, gemm wgrad - FP8 quantization
    INFO - transformer_layer.self_attention.layernorm_qkv: Feature=LogTensorStats, API=look_at_tensor_before_process: activation
    ....
The second log file (``nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-0.log``) contains statistics for tensors we requested in ``config.yaml``.
.. code-block:: text

    # log/nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-0.log

    INFO - transformer_layer.self_attention.layernorm_qkv_activation_max iteration=000002 value=4.3188
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_min iteration=000002 value=-4.3386
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_mean iteration=000002 value=0.0000
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_std iteration=000002 value=0.9998
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_l1_norm iteration=000002 value=130799.6953
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_max iteration=000003 value=4.3184
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_min iteration=000003 value=-4.3381
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_mean iteration=000003 value=0.0000
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_std iteration=000003 value=0.9997
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_l1_norm iteration=000003 value=130788.1016
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_max iteration=000004 value=4.3181
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_min iteration=000004 value=-4.3377
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_mean iteration=000004 value=0.0000
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_std iteration=000004 value=0.9996
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_l1_norm iteration=000004 value=130776.7969
Logging using TensorBoard
-------------------------
Precision debug tools support logging using `TensorBoard <https://www.tensorflow.org/tensorboard>`_. To enable it, pass the ``tb_writer`` argument to ``debug_api.initialize()``. Let's modify the ``train.py`` file.
.. code-block:: python

    # (...)

    from torch.utils.tensorboard import SummaryWriter

    tb_writer = SummaryWriter('./tensorboard_dir/run1')

    # add tb_writer to the Debug API initialization
    debug_api.initialize(
        config_file="./config.yaml",
        feature_dirs=["/path/to/transformer_engine/debug/features"],
        log_dir="./log",
        tb_writer=tb_writer)

    # (...)
Let's run the training and open TensorBoard with ``tensorboard --logdir=./tensorboard_dir/run1``:
.. figure:: ./img/tensorboard.png
    :align: center

    Fig 2: TensorBoard with plotted stats.
\ No newline at end of file
..
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
Config File Structure
=====================
To enable debug features, create a configuration YAML file to specify the desired behavior, such as determining which GEMMs (General Matrix Multiply operations) should run in higher precision rather than FP8 and defining which statistics to log.
Below, we outline how to structure the configuration YAML file.
General Format
--------------
A config file can have one or more sections, each containing settings for specific layers and features:
.. code-block:: yaml

    section_name_1:
      enabled: ...
      layers:
        # Specify layers here...
      transformer_engine:
        Feature1Name:
          enabled: ...
          # Feature details...
        Feature2Name:
          enabled: ...
          # Feature details...

    section_name_2:
      enabled: ...
      layers:
        # Specify layers here...
      Feature1Name: # If feature has no namespace, then it is in the default namespace.
        enabled: ...
        # Feature details...

    section_name_3:
      enabled: ...
      layers:
        # Specify layers here...
      transformer_engine:
        Feature1Name:
          enabled: ...
          # Feature details...
        Feature2Name:
          enabled: ...
          # Feature details...
Sections may have any name and must contain:
1. An ``enabled`` field that specifies whether the features in that section will be active.
2. A ``layers`` field specifying which layers the section applies to. Each layer can belong to only one section.
3. Additional fields describing features for those layers.
Layer Specification
-------------------
Debug layers can be identified by a ``name`` parameter:
.. code-block:: python

    linear = transformer_engine.debug.pytorch.Linear(in_features, out_features, name="linear1")
This name is used in the config file to identify the layer. To specify the ``layers`` field, you can use one of the following methods:
1. ``layer_name_regex_pattern``: Use a regular expression to match layer names. This expression must adhere to the Python ``re`` module syntax.
2. ``layer_types``: Provide a list of strings, where a layer will be selected if any string matches part of its name.
Examples:
.. code-block:: yaml

    # Example 1: Using a regular expression to select layers
    my_section:
      enabled: ...
      layers:
        layer_name_regex_pattern: 'self_attn.*'
      transformer_engine:
        (...)

    # Example 2: Using layer types to select layers
    another_section:
      enabled: ...
      layers:
        layer_types: ['fc1', 'layernorm_linear']
      transformer_engine:
        (...)
Names in Transformer Layers
---------------------------
There are three ways to assign a name to a layer in the Transformer Engine:
- Initialize the layer with the ``name=...`` argument.
- Use ``debug_api.infer_and_assign_layer_names(model)``, which assigns names based on class names (see the sketch after this list).
- Rely on the default names assigned during module initialization, such as ``Layer_n``, where ``n`` represents the layer number.
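
For the second option, a minimal sketch might look as follows (the model below and the exact inferred names are illustrative assumptions; only ``debug_api.infer_and_assign_layer_names`` itself is part of the API described above):

.. code-block:: python

    import torch.nn as nn
    import nvdlfw_inspect.api as debug_api
    import transformer_engine.pytorch as te

    # Assumes debug_api.initialize(...) has already been called.

    # Illustrative model; any module hierarchy containing TE layers works.
    model = nn.Sequential(
        te.Linear(512, 512),
        te.Linear(512, 512),
    ).cuda()

    # Assigns names derived from the class names, so the config file
    # can refer to these layers in its layers section.
    debug_api.infer_and_assign_layer_names(model)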
The ``TransformerLayer`` in Transformer Engine is a composition of multiple sub-layers. We can modify some of these layers using precision debug tools, particularly those that contain exactly one linear layer. To see the names of all such layers, we can inspect log files. For instance, a ``TransformerLayer`` named ``transformer_layer`` might consist of:
- ``transformer_layer.self_attn.layernorm_linear_qkv`` / ``transformer_layer.self_attn.linear_qkv`` / ``transformer_layer.self_attn.layernorm_linear_q`` / ``transformer_layer.self_attn.linear_q`` / ``transformer_layer.self_attn.linear_kv``,
- ``transformer_layer.self_attn.proj``,
- ``transformer_layer.inter_attn.*`` for ``layer_type="decoder"``,
- ``transformer_layer.layernorm_mlp.fc1``,
- ``transformer_layer.layernorm_mlp.fc2``,
depending on the configuration. Some layers, like ``LayerNormLinear``, are fusions of two layers: ``LayerNorm`` and ``Linear``. When referring to such layers in precision debug tools, only the ``Linear`` part is affected.
Below is an example ``TransformerLayer`` with four linear layers that can be influenced by the precision debug tools.
.. figure:: ./img/names.svg
    :align: center
    :width: 80%

    Fig 1: Names of layers in an example configuration of TransformerLayer. The most nested blocks represent the most basic layers, each containing one linear layer. Layers that do not contain linear layers, such as ``DotProductAttention``, are omitted.
**Configuration File Example**
.. code-block:: yaml

    # Disables wgrad in all 4 GEMMs
    section1:
      enabled: True
      layers:
        layer_types: [transformer_layer]
      transformer_engine:
        DisableFP8GEMM:
          enabled: True
          gemms: [wgrad]

    # Disables all GEMMs in the layernorm_mlp layer
    section2:
      enabled: True
      layers:
        layer_types: [layernorm_mlp]
      transformer_engine:
        DisableFP8Layer:
          enabled: True

    # Logs wgrad stats in fc1
    section3:
      enabled: True
      layers:
        layer_types: [fc1]
      transformer_engine:
        LogTensorStats:
          enabled: True
          stats: [min]
          tensors: [wgrad]
          freq: 1
          start_step: 0
          end_step: 50
Structured Configuration for GEMMs and Tensors
----------------------------------------------
Sometimes a feature is parameterized by a list of tensors or by a list of GEMMs.
There are multiple ways of describing this parameterization.
We can pass lists, as below.
.. code-block:: yaml

    Feature:
      enabled: ...
      gemms: [gemm1, gemm2]
      tensors: [tensor1, tensor2]
      ...
We can use a struct for tensors.
.. code-block:: yaml

    Feature:
      gemms: [gemm1, gemm2]
      tensors_struct:
        - tensor: tensor1
          feature_param1: value
        - tensor: tensor2
          feature_param1: value
      gemm_feature_param1: value
Similarly, we can use a struct for GEMMs.
.. code-block:: yaml

    Feature:
      enabled: ...
      tensors: [tensor1, tensor2]
      gemms_struct:
        - gemm: gemm1
          feature_param1: value
        - gemm: gemm2
          feature_param1: value
          gemm_feature_param1: value
We can use structs for both tensors and GEMMs; ``tensors_struct`` should be nested inside ``gemms_struct``.
.. code-block:: yaml

    Feature:
      enabled: ...
      gemms_struct:
        - gemm: gemm1
          tensors: [tensor1, tensor2]
          tensor_feature_param1: value
          gemm_feature_param1: value
        - gemm: gemm2
          tensors_struct:
            - tensor: tensor1
              tensor_feature_param1: value
            - tensor: tensor2
              tensor_feature_param2: value
          gemm_feature_param1: value
Enabling or Disabling Sections and Features
-------------------------------------------
Debug features can be enabled or disabled with the ``enabled`` keyword:
.. code-block:: yaml

    section1:
      enabled: True
      layers:
        layer_types: [self_attention]
      transformer_engine:
        LogTensorStats:
          enabled: False  # Disables the LogTensorStats feature
          stats: [max, min, mean, std, l1_norm]

    section2:
      enabled: False  # Disables entire section2
      transformer_engine:
        LogFp8TensorStats:
          enabled: True  # Does not enable the LogFp8TensorStats feature, because section2 is disabled
          stats: [underflows, overflows]
By organizing your ``config.yaml`` properly, you can easily manage debugging features, ensuring a more streamlined and customizable debugging experience.
..
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
Setup
=====
Precision debug tools for Transformer Engine use the `Nvidia-DL-Framework-Inspect <https://github.com/NVIDIA/nvidia-dlfw-inspect>`_ package from NVIDIA.
Please refer to the Nvidia-DL-Framework-Inspect `documentation <https://github.com/NVIDIA/nvidia-dlfw-inspect/tree/main/docs>`_ for more details.
Below, we outline the steps for debug initialization.
initialize()
------------
Must be called once on every rank in the global context to initialize Nvidia-DL-Framework-Inspect.
**Parameters**
- **config_file** (*str*, default=""): Path to the configuration YAML file containing features to enable and layer names. If one wants to run without the configuration file, pass ``""``.
- **feature_dirs** (*List[str] | str*): List of directories containing features to load and register. One needs to pass ``[/path/to/transformerengine/transformer_engine/debug/features]`` to use TE features.
- **logger** (*Union[BaseLogger, None]*, default=None): Logger for logging tensor statistics. Should adhere to ``BaseLogger`` from the `Nvidia-DL-Framework-Inspect <https://github.com/NVIDIA/nvidia-dlfw-inspect>`_ package.
- **log_dir** (*str*, default="."): Directory path to hold ``debug_logs`` and ``debug_statistics_logs``.
- **tb_writer** (*TensorBoardWriter*, default=None): TensorBoard writer for logging.
- **default_logging_enabled** (*bool*, default=False): Enable default logging to the file.
.. code-block:: python

    import nvdlfw_inspect.api as debug_api

    debug_api.initialize(
        config_file="./config.yaml",
        feature_dirs=["/path/to/transformer_engine/debug/features"],
        log_dir="./log_dir")
set_tensor_reduction_group()
----------------------------
Needed only for logging tensor stats. In multi-GPU training, activation and gradient tensors are distributed across multiple nodes. This method lets you specify the group for the reduction of stats; see the `reduction group section <./4_distributed.rst#reduction-groups>`_ for more details.
If the tensor reduction group is not specified, then statistics are reduced across all nodes in the run.
**Parameters**
- **group** (torch.distributed.ProcessGroup): The process group across which tensors will be reduced to get stats.
.. code-block:: python

    import nvdlfw_inspect.api as debug_api

    # initialization
    # (...)

    pipeline_parallel_group = initialize_pipeline_parallel_group()
    debug_api.set_tensor_reduction_group(pipeline_parallel_group)

    # training
    # (...)
    # activation/gradient tensor statistics are reduced along pipeline_parallel_group
set_weight_tensor_tp_group_reduce()
-----------------------------------
By default, weight tensor statistics are reduced within the tensor parallel group. This function allows you to disable that behavior; for more details, see `reduction group section <./4_distributed.rst#reduction-groups>`_.
This method is provided not by ``debug_api``, but by ``transformer_engine.debug``.
**Parameters**
- **enabled** (*bool*, default=True): A boolean flag to enable or disable the reduction of weight tensor statistics within the tensor parallel group.
.. code-block:: python

    import nvdlfw_inspect.api as debug_api
    from transformer_engine.debug import set_weight_tensor_tp_group_reduce

    # initialization
    # (...)

    set_weight_tensor_tp_group_reduce(False)

    # training
    # (...)
    # weight tensor statistics are not reduced
..
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
Debug features
==============
.. autoapiclass:: transformer_engine.debug.features.log_tensor_stats.LogTensorStats
.. autoapiclass:: transformer_engine.debug.features.log_fp8_tensor_stats.LogFp8TensorStats
.. autoapiclass:: transformer_engine.debug.features.disable_fp8_gemm.DisableFP8GEMM
.. autoapiclass:: transformer_engine.debug.features.disable_fp8_layer.DisableFP8Layer
.. autoapiclass:: transformer_engine.debug.features.per_tensor_scaling.PerTensorScaling
.. autoapiclass:: transformer_engine.debug.features.fake_quant.FakeQuant
..
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
Calls to Nvidia-DL-Framework-Inspect
====================================
Let's look deeper into how Nvidia-DL-Framework-Inspect and Transformer Engine work together. Transformer Engine layers have hook calls inside each of the GEMMs. Users can define feature classes or use the feature classes provided with TE. The file ``config.yaml`` describes which hooks need to be used for which layers. Nvidia-DL-Framework-Inspect combines 3 things: TE training, the feature classes, and ``config.yaml``, and takes care of inserting hooks in the correct places. This process is illustrated in the image below.
.. figure:: ./img/api_calls1.svg
    :align: center

    Fig 1: Example of Nvidia-DL-Framework-Inspect affecting a training script with 1 Linear layer. For tensors mentioned in ``config.yaml``, the behavior of the ``modify_tensor_enabled()`` and ``modify_tensor()`` calls is substituted with definitions from the feature class. Other calls return default values - in fact, they do nothing.
This page lists all the calls from Transformer Engine to Nvidia-DL-Framework-Inspect for each GEMM. The order of these calls is illustrated in the image below.
.. figure:: ./img/api_calls2.svg
    :align: center

    Fig 2: The calls made by Transformer Engine to Nvidia-DL-Framework-Inspect. There are 2 types of calls: GEMM calls and routing calls.
There are 2 categories of API calls, each used for a different purpose:

- GEMM calls - invoked during every GEMM, used to process or quantize tensors and collect information about them,
- routing calls - invoked at the beginning of every forward pass - they indicate whether a feature is going to use ``modify_tensor()``, etc.

If all routing calls for a layer return ``False``, the layer is invoked in an optimized version with Transformer Engine fusions.
If any routing call returns ``True``, the layer runs without the fusions; this is necessary because otherwise some tensors could not be accessed
when fusions happen. An important remark is that if no feature is used for a layer, it should perform as fast as a layer run without ``debug_api`` initialized.
.. autoapifunction:: transformer_engine.debug.features.api.TEDefaultFeatures.modify_tensor
.. autoapifunction:: transformer_engine.debug.features.api.TEDefaultFeatures.inspect_tensor
.. autoapifunction:: transformer_engine.debug.features.api.TEDefaultFeatures.inspect_tensor_postquantize
.. autoapifunction:: transformer_engine.debug.features.api.TEDefaultFeatures.modify_tensor_enabled
.. autoapifunction:: transformer_engine.debug.features.api.TEDefaultFeatures.fp8_gemm_enabled
.. autoapifunction:: transformer_engine.debug.features.api.TEDefaultFeatures.inspect_tensor_enabled
.. autoapifunction:: transformer_engine.debug.features.api.TEDefaultFeatures.inspect_tensor_postquantize_enabled
..
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
Distributed training
====================
Nvidia-DL-Framework-Inspect with Transformer Engine supports multi-GPU training. This guide describes how to run it and how the supported features work in the distributed setting.
To use precision debug tools in multi-GPU training, one needs to:
1. Run ``debug_api.initialize(...)`` and provide the same configuration YAML file on every node.
2. If one wants to log stats, one may want to invoke ``debug_api.set_tensor_reduction_group`` with a proper reduction group, as in the sketch below.
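
A minimal sketch of such a setup (the process-group construction is illustrative; pick the group that matches your parallelism layout):

.. code-block:: python

    import torch.distributed as dist
    import nvdlfw_inspect.api as debug_api

    dist.init_process_group(backend="nccl")

    # Step 1: same config file and feature directories on every rank.
    debug_api.initialize(
        config_file="./config.yaml",
        feature_dirs=["/path/to/transformer_engine/debug/features"],
        log_dir="./log")

    # Step 2 (optional, for stats logging): reduce statistics within a
    # chosen group of ranks instead of across all nodes.
    stats_group = dist.new_group(ranks=[0, 1])  # illustrative choice
    debug_api.set_tensor_reduction_group(stats_group)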
Behavior of the features
------------------------
In a distributed setting, **DisableFP8GEMM** and **DisableFP8Layer** function similarly to the single-GPU case, with no notable differences.
**PerTensorScaling** and **FakeQuant** calculate FP8 scaling factors independently on each node, meaning the number of GPUs may affect results. This differs from the delayed scaling FP8 recipe behavior, in which scaling factors are synchronized.
.. figure:: ./img/scaling_factors.svg
    :align: center

    Fig 1: For **PerTensorScaling** and **FakeQuant**, tensor scaling factors are computed separately for each of the tensor shards. This is not the case for delayed scaling FP8 scaling factors, which are synchronized.
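
The effect can be sketched with plain tensor operations (a toy illustration of the idea, not Transformer Engine's internal code):

.. code-block:: python

    import torch

    FP8_MAX = 448.0  # largest representable magnitude in FP8 E4M3

    full_tensor = torch.randn(4, 1024)
    shards = full_tensor.chunk(2)  # as if split across 2 GPUs

    # Per-shard scaling (PerTensorScaling / FakeQuant): each "GPU"
    # derives its scaling factor from its own shard's amax.
    per_shard_scales = [FP8_MAX / shard.abs().max() for shard in shards]

    # Synchronized scaling (delayed scaling recipe): one factor derived
    # from the global amax, identical on every "GPU".
    global_scale = FP8_MAX / full_tensor.abs().max()

    # The factors generally differ, so the quantization error depends
    # on how many shards the tensor is split across.
    print(per_shard_scales, global_scale)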
Logging-related features are more complex and will be discussed further in the next sections.
Reduction groups
----------------
In setups with tensor, data, or pipeline parallelism, some tensors are distributed across multiple GPUs, requiring a reduction operation to compute statistics for these tensors.
The weight tensor is always split among the tensor parallel group, and debug tools automatically reduce statistics within this group by default. To disable this automatic reduction, use:
.. code-block:: python

    transformer_engine.debug.set_weight_tensor_tp_group_reduce(False)
In cases of data parallelism, Transformer Engine modules lack the process group needed for reduction. To manually specify the group, use:
.. code-block:: python

    debug_api.set_tensor_reduction_group(group)
This command ensures statistics are reduced across the defined group. Activation statistics are logged after the forward pass (immediately after exiting autocast), while gradient (dgrad and wgrad) statistics are logged following the backward pass.
Below, we illustrate configurations for a 4-node setup with tensor parallelism size 2 and data parallelism size 2, showcasing different reduction configurations.
.. figure:: ./img/reduction1.svg
    :align: center

    Fig 2: There is a single tensor reduction group composed of all nodes. As a result, each node logs the same statistics for the tensors, as they are fully reduced across all nodes.
.. figure:: ./img/reduction2.svg
    :align: center

    Fig 3: Every node is set with a tensor reduction group consisting of itself. Every node prints the same statistics for weights (which are still synchronized within TP groups), but the statistics of activations and gradients are not synchronized.
.. figure:: ./img/reduction3.svg
    :align: center

    Fig 4: Weight synchronization is disabled by ``set_weight_tensor_tp_group_reduce(False)``, so every node logs stats for its shard of the weight.
Microbatching
-------------
Let's dive into how statistics collection works with microbatching. By microbatching, we mean invoking multiple ``forward()`` calls for each ``debug_api.step()``. The behavior is as follows:
- For weight tensors, the stats remain the same for each microbatch because the weight does not change.
- For other tensors, the stats are accumulated over the microbatches, as in the sketch below.
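
In code, this looks like the following sketch (``transformer_layer`` and the inputs are placeholders in the spirit of the earlier examples):

.. code-block:: python

    # Two microbatches per logged step: activation and gradient stats
    # are accumulated over both forward-backward passes, while weight
    # stats stay the same because the weights do not change in between.
    for microbatch in [dummy_input_a, dummy_input_b]:
        output = transformer_layer(microbatch)
        output.sum().backward()

    # A single debug_api.step() closes the accumulation window.
    debug_api.step()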
Logging to files and TensorBoard
--------------------------------
In a single-node setup with ``default_logging_enabled=True``, all logs are saved by default to ``log_dir/nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-0.log``. In multi-GPU training, each node writes its reduced statistics to its unique file, named ``log_dir/nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-i.log`` for rank i. Because these logs contain reduced statistics, the logged values are identical for all nodes within a reduction group.
If certain nodes are given a TensorBoard writer, only those nodes will log to TensorBoard. This is useful in scenarios involving pipeline, data, and tensor parallelism, such as with two transformer layers and settings TP_SIZE = 2, DP_SIZE = 2, and PP_SIZE = 2. To log all stats to TensorBoard, you should pass a TensorBoard writer to one process in each pipeline parallel group.
.. figure:: ./img/pipeline_logging.svg
    :align: center

    Fig 5: Example with pipeline parallelism, where a ``tb_writer`` is assigned to one node within each pipeline parallel group, setting these as tensor reduction groups.
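
A sketch of this pattern (the rank selection is illustrative and depends on how your pipeline parallel groups are laid out):

.. code-block:: python

    import torch.distributed as dist
    from torch.utils.tensorboard import SummaryWriter
    import nvdlfw_inspect.api as debug_api

    rank = dist.get_rank()
    ranks_with_writer = {0, 4}  # hypothetical: one rank per PP group

    # Only the chosen ranks get a writer; the rest pass tb_writer=None.
    tb_writer = None
    if rank in ranks_with_writer:
        tb_writer = SummaryWriter(f"./tensorboard_dir/rank{rank}")

    debug_api.initialize(
        config_file="./config.yaml",
        feature_dirs=["/path/to/transformer_engine/debug/features"],
        log_dir="./log",
        tb_writer=tb_writer)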
Alternatively, setting the tensor reduction group to None will yield unreduced statistics for wgrad and dgrad tensors on each node, allowing for post-processing. For weight statistics without reduction in the TP parallel group, use:
.. code-block:: python

    transformer_engine.debug.set_weight_tensor_tp_group_reduce(False)
\ No newline at end of file