Merge branch 'nv_main' of v2.12

Commit 0d874a4e authored Mar 03, 2026 by wenjh
20 changed files
--- a/docs/api/c/gemm.rst
+++ b/docs/api/c/gemm.rst
 ..
-    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.


--- a/docs/api/c/index.rst
+++ b/docs/api/c/index.rst
 ..
-    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.


--- a/docs/api/c/multi_tensor.rst
+++ b/docs/api/c/multi_tensor.rst
 ..
-    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.


--- a/docs/api/c/normalization.rst
+++ b/docs/api/c/normalization.rst
 ..
-    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.


--- a/docs/api/c/padding.rst
+++ b/docs/api/c/padding.rst
 ..
-    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.


--- a/docs/api/c/permutation.rst
+++ b/docs/api/c/permutation.rst
 ..
-    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.


--- a/docs/api/c/recipe.rst
+++ b/docs/api/c/recipe.rst
 ..
-    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.


--- a/docs/api/c/softmax.rst
+++ b/docs/api/c/softmax.rst
 ..
-    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.


--- a/docs/api/c/swizzle.rst
+++ b/docs/api/c/swizzle.rst
 ..
-    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.


--- a/docs/api/c/transformer_engine.rst
+++ b/docs/api/c/transformer_engine.rst
 ..
-    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.


--- a/docs/api/c/transpose.rst
+++ b/docs/api/c/transpose.rst
 ..
-    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.


--- a/docs/api/common.rst
+++ b/docs/api/common.rst
 ..
-    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.

@@ -17,3 +17,5 @@ Common API
 .. autoapiclass:: transformer_engine.common.recipe.Float8CurrentScaling(fp8_format=Format.HYBRID)

 .. autoapiclass:: transformer_engine.common.recipe.Float8BlockScaling(fp8_format=Format.E4M3)
+
+.. autoapiclass:: transformer_engine.common.recipe.CustomRecipe(qfactory, fp8_dpa=False, fp8_mha=False)
--- a/docs/api/framework.rst
+++ b/docs/api/framework.rst
 ..
-    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.


--- a/docs/api/jax.rst
+++ b/docs/api/jax.rst
 ..
-    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.

 Jax
-=======
+===

 Pre-defined Variable of Logical Axes
 ------------------------------------
@@ -20,11 +20,11 @@ Variables are available in `transformer_engine.jax.sharding`.


 Checkpointing
------------------------------------
+-------------
 When using checkpointing with Transformer Engine JAX, please be aware of the checkpointing policy being applied to your model. Any JAX checkpointing policy using `dot`, such as `jax.checkpoint_policies.dots_with_no_batch_dims`, may not work with GEMMs provided by Transformer Engine as they do not always use the `jax.lax.dot_general` primitive. Instead, you can use `transformer_engine.jax.checkpoint_policies.dots_and_te_gemms_with_no_batch_dims` or similar policies that are designed to work with Transformer Engine's GEMMs and `jax.lax.dot_general` GEMMs. You may also use any JAX policies that do not filter by primitive, such as `jax.checkpoint_policies.save_only_these_names` or `jax.checkpoint_policies.everything_saveable`.

 Modules
------------------------------------
+-------
 .. autoapiclass:: transformer_engine.jax.flax.TransformerLayerType
 .. autoapiclass:: transformer_engine.jax.MeshResource()


--- a/docs/api/pytorch.rst
+++ b/docs/api/pytorch.rst
 ..
-    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.

-pyTorch
+PyTorch
 =======

 .. autoapiclass:: transformer_engine.pytorch.Linear(in_features, out_features, bias=True, **kwargs)
@@ -37,9 +37,6 @@ pyTorch
 .. autoapiclass:: transformer_engine.pytorch.CudaRNGStatesTracker()
  :members: reset, get_states, set_states, add, fork

-.. autoapifunction:: transformer_engine.pytorch.fp8_autocast
-
-.. autoapifunction:: transformer_engine.pytorch.fp8_model_init

 .. autoapifunction:: transformer_engine.pytorch.autocast

@@ -47,6 +44,16 @@ pyTorch

 .. autoapifunction:: transformer_engine.pytorch.checkpoint

+
+.. autoapifunction:: transformer_engine.pytorch.make_graphed_callables
+
+.. autoapifunction:: transformer_engine.pytorch.get_cpu_offload_context
+
+.. autoapifunction:: transformer_engine.pytorch.parallel_cross_entropy
+
+Recipe availability
+-------------------
+
 .. autoapifunction:: transformer_engine.pytorch.is_fp8_available

 .. autoapifunction:: transformer_engine.pytorch.is_mxfp8_available
@@ -63,9 +70,8 @@ pyTorch

 .. autoapifunction:: transformer_engine.pytorch.get_default_recipe

-.. autoapifunction:: transformer_engine.pytorch.make_graphed_callables
-
-.. autoapifunction:: transformer_engine.pytorch.get_cpu_offload_context
+Mixture of Experts (MoE) functions
+----------------------------------

 .. autoapifunction:: transformer_engine.pytorch.moe_permute

@@ -75,13 +81,71 @@ pyTorch

 .. autoapifunction:: transformer_engine.pytorch.moe_sort_chunks_by_index

-.. autoapifunction:: transformer_engine.pytorch.parallel_cross_entropy
-
 .. autoapifunction:: transformer_engine.pytorch.moe_sort_chunks_by_index_with_probs

+
+Communication-computation overlap
+---------------------------------
+
 .. autoapifunction:: transformer_engine.pytorch.initialize_ub

 .. autoapifunction:: transformer_engine.pytorch.destroy_ub

 .. autoapiclass:: transformer_engine.pytorch.UserBufferQuantizationMode
  :members: FP8, NONE
+
+
+Quantized tensors
+-----------------
+
+.. autoapiclass:: transformer_engine.pytorch.QuantizedTensorStorage
+   :members: update_usage, prepare_for_saving, restore_from_saved
+
+.. autoapiclass:: transformer_engine.pytorch.QuantizedTensor(shape, dtype, *, requires_grad=False, device=None)
+   :members: dequantize, quantize_
+
+.. autoapiclass:: transformer_engine.pytorch.Float8TensorStorage(data, fp8_scale_inv, fp8_dtype, data_transpose=None, quantizer=None)
+
+.. autoapiclass:: transformer_engine.pytorch.MXFP8TensorStorage(rowwise_data, rowwise_scale_inv, columnwise_data, columnwise_scale_inv, fp8_dtype, quantizer)
+
+.. autoapiclass:: transformer_engine.pytorch.Float8BlockwiseQTensorStorage(rowwise_data, rowwise_scale_inv, columnwise_data, columnwise_scale_inv, fp8_dtype, quantizer, is_2D_scaled, data_format)
+
+.. autoapiclass:: transformer_engine.pytorch.NVFP4TensorStorage(rowwise_data, rowwise_scale_inv, columnwise_data, columnwise_scale_inv, amax_rowwise, amax_columnwise, fp4_dtype, quantizer)
+
+.. autoapiclass:: transformer_engine.pytorch.Float8Tensor(shape, dtype, data, fp8_scale_inv, fp8_dtype, requires_grad=False, data_transpose=None, quantizer=None)
+
+.. autoapiclass:: transformer_engine.pytorch.MXFP8Tensor(rowwise_data, rowwise_scale_inv, columnwise_data, columnwise_scale_inv, fp8_dtype, quantizer)
+
+.. autoapiclass:: transformer_engine.pytorch.Float8BlockwiseQTensor(rowwise_data, rowwise_scale_inv, columnwise_data, columnwise_scale_inv, fp8_dtype, quantizer, is_2D_scaled, data_format)
+
+.. autoapiclass:: transformer_engine.pytorch.NVFP4Tensor(rowwise_data, rowwise_scale_inv, columnwise_data, columnwise_scale_inv, amax_rowwise, amax_columnwise, fp4_dtype, quantizer)
+
+Quantizers
+----------
+
+.. autoapiclass:: transformer_engine.pytorch.Quantizer(rowwise, columnwise)
+   :members: update_quantized, quantize
+
+.. autoapiclass:: transformer_engine.pytorch.Float8Quantizer(scale, amax, fp8_dtype, *, rowwise=True, columnwise=True)
+
+.. autoapiclass:: transformer_engine.pytorch.Float8CurrentScalingQuantizer(fp8_dtype, device, *, rowwise=True, columnwise=True, **kwargs)
+
+.. autoapiclass:: transformer_engine.pytorch.MXFP8Quantizer(fp8_dtype, *, rowwise=True, columnwise=True)
+
+.. autoapiclass:: transformer_engine.pytorch.Float8BlockQuantizer(fp8_dtype, *, rowwise, columnwise, **kwargs)
+
+.. autoapiclass:: transformer_engine.pytorch.NVFP4Quantizer(fp4_dtype, *, rowwise=True, columnwise=True, **kwargs)
+
+Tensor saving and restoring functions
+-------------------------------------
+
+.. autoapifunction:: transformer_engine.pytorch.prepare_for_saving
+
+.. autoapifunction:: transformer_engine.pytorch.restore_from_saved
+
+Deprecated functions
+--------------------
+
+.. autoapifunction:: transformer_engine.pytorch.fp8_autocast
+
+.. autoapifunction:: transformer_engine.pytorch.fp8_model_init
--- a/docs/conf.py
+++ b/docs/conf.py
-# Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # See LICENSE for license information.

@@ -58,10 +58,15 @@ extensions = [
    "nbsphinx",
    "breathe",
    "autoapi.extension",
+    "sphinx_tabs.tabs",
 ]

 templates_path = ["_templates"]
-exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
+exclude_patterns = [
+    "_build",
+    "Thumbs.db",
+    "sphinx_rtd_theme",
+]

 source_suffix = ".rst"

@@ -79,6 +84,8 @@ html_show_sphinx = False
 html_css_files = [
    "css/nvidia_font.css",
    "css/nvidia_footer.css",
+    "css/rtabs.css",
+    "css/output-style.css",
 ]

 html_theme_options = {
@@ -94,6 +101,7 @@ napoleon_custom_sections = [
    ("Values", "params_style"),
    ("Graphing parameters", "params_style"),
    ("FP8-related parameters", "params_style"),
+    ("Quantization parameters", "params_style"),
 ]

 breathe_projects = {"TransformerEngine": root_path / "docs" / "doxygen" / "xml"}
@@ -101,3 +109,23 @@ breathe_default_project = "TransformerEngine"

 autoapi_generate_api_docs = False
 autoapi_dirs = [root_path / "transformer_engine"]
+autoapi_ignore = ["*test*"]
+
+
+# There are 2 warnings about the same namespace (transformer_engine) in two different c++ api
+# docs pages. This seems to be the only way to suppress these warnings.
+def setup(app):
+    """Custom Sphinx setup to filter warnings."""
+    import logging
+
+    # Filter out duplicate C++ declaration warnings
+    class DuplicateDeclarationFilter(logging.Filter):
+        def filter(self, record):
+            message = record.getMessage()
+            if "Duplicate C++ declaration" in message and "transformer_engine" in message:
+                return False
+            return True
+
+    # Apply filter to Sphinx logger
+    logger = logging.getLogger("sphinx")
+    logger.addFilter(DuplicateDeclarationFilter())
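
Note on the ``setup`` hook added to ``docs/conf.py`` above: the filter's behavior can be checked outside of Sphinx with the standard ``logging`` module alone. The sketch below is an illustration under stated assumptions, not part of the commit. The logger name ``demo_sphinx_filter``, the ``ListHandler`` helper, the ``run_demo`` function, and both sample messages are hypothetical. Only the two substrings the filter matches on are taken from the diff.

```python
import logging


class DuplicateDeclarationFilter(logging.Filter):
    """Drop records mentioning both marker substrings; keep everything else."""

    def filter(self, record):
        message = record.getMessage()
        return not (
            "Duplicate C++ declaration" in message
            and "transformer_engine" in message
        )


class ListHandler(logging.Handler):
    """Hypothetical helper: collects emitted messages so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def run_demo():
    """Log two sample warnings and return the ones that pass the filter."""
    logger = logging.getLogger("demo_sphinx_filter")  # hypothetical logger name
    logger.setLevel(logging.WARNING)
    logger.propagate = False  # keep the demo output out of the root logger

    handler = ListHandler()
    duplicate_filter = DuplicateDeclarationFilter()
    logger.addHandler(handler)
    logger.addFilter(duplicate_filter)
    try:
        # Made-up message containing both substrings: expected to be dropped.
        logger.warning("Duplicate C++ declaration in namespace transformer_engine")
        # Made-up message containing only one substring: expected to be kept.
        logger.warning("Unrelated warning about %s", "transformer_engine")
    finally:
        logger.removeFilter(duplicate_filter)
        logger.removeHandler(handler)
    return handler.messages


if __name__ == "__main__":
    print(run_demo())
```

One caveat from the standard library's documented behavior: a filter attached to a logger is consulted only for records logged directly through that logger, and records that propagate up from descendant loggers are not passed through it. Whether that matters for the ``"sphinx"`` logger used in the diff depends on which logger Sphinx emits the warning from, which this sketch does not attempt to verify.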
--- a/docs/debug.rst
+++ b/docs/debug.rst
 ..
-    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.
+
 Precision debug tools
-==============================================
+=====================

 .. toctree::
   :caption: Precision debug tools

--- a/docs/debug/1_getting_started.rst
+++ b/docs/debug/1_getting_started.rst
 ..
-    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.

 Getting started
-==============
+===============

 .. note::

@@ -21,7 +21,7 @@ Transformer Engine provides a set of precision debug tools which allow you to ea
 There are 4 things one needs to do to use Transformer Engine debug features:

 1. Create a configuration YAML file to configure the desired features.
-2. Import, initialize, and install the `Nvidia-DL-Framework-Inspect <https://github.com/NVIDIA/nvidia-dlfw-inspect>`_ tool.
+2. Import and initialize the `Nvidia-DL-Framework-Inspect <https://github.com/NVIDIA/nvidia-dlfw-inspect>`_ tool, which is installed as a dependency of Transformer Engine.
 3. One can pass ``name="..."`` when creating TE layers to easier identify layer names. If this is not provided, names will be inferred automatically.
 4. Invoke ``debug_api.step()`` at the end of one forward-backward pass.

@@ -38,7 +38,7 @@ To start debugging, one needs to create a configuration YAML file. This file lis
   one - ``UserProvidedPrecision`` - is a custom feature implemented by the user. Nvidia-DL-Framework-Inspect inserts features into the layers according to the config.

 Example training script
----------------------
+-----------------------

 Let's look at a simple example of training a Transformer layer using Transformer Engine with FP8 precision. This example demonstrates how to set up the layer, define an optimizer, and perform a few training iterations using synthetic data.

@@ -81,7 +81,7 @@ We will demonstrate two debug features on the code above:
 2. Logging statistics for other GEMM operations, such as gradient statistics for data gradient GEMM within the LayerNormLinear sub-layer of the TransformerLayer.

 Config file
----------
+-----------

 We need to prepare the configuration YAML file, as below

@@ -114,7 +114,8 @@ We need to prepare the configuration YAML file, as below
 Further explanation on how to create config files is in the :doc:`next part of the documentation <2_config_file_structure>`.

 Adjusting Python file
--------------------
+---------------------
+

 .. code-block:: python

@@ -145,7 +146,8 @@ In the modified code above, the following changes were made:
 3. Added ``debug_api.step()`` after each of the forward-backward pass.

 Inspecting the logs
------------------
+-------------------
+

 Let's look at the files with the logs. Two files will be created:

@@ -213,7 +215,8 @@ The second log file (``nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-
    INFO - transformer_layer.self_attention.layernorm_qkv_activation_l1_norm             iteration=000004                  value=130776.7969

 Logging using TensorBoard
------------------------
+-------------------------
+

 Precision debug tools support logging using `TensorBoard <https://www.tensorflow.org/tensorboard>`_. To enable it, one needs to pass the argument ``tb_writer`` to the ``debug_api.initialize()``.  Let's modify ``train.py`` file.


--- a/docs/debug/2_config_file_structure.rst
+++ b/docs/debug/2_config_file_structure.rst
 ..
-    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.

 Config File Structure
-====================
+=====================

 To enable debug features, create a configuration YAML file to specify the desired behavior, such as determining which GEMMs (General Matrix Multiply operations) should run in higher precision rather than FP8 and defining which statistics to log. 
 Below, we outline how to structure the configuration YAML file.

 General Format
-------------
+--------------
+

 A config file can have one or more sections, each containing settings for specific layers and features:

@@ -55,7 +56,8 @@ Sections may have any name and must contain:
 3. Additional fields describing features for those layers.

 Layer Specification
------------------
+-------------------
+

 Debug layers can be identified by a ``name`` parameter:

@@ -89,7 +91,8 @@ Examples:
        (...)

 Names in Transformer Layers
--------------------------
+---------------------------
+

 There are three ways to assign a name to a layer in the Transformer Engine:

@@ -107,6 +110,8 @@ The ``TransformerLayer`` in Transformer Engine is a composition of multiple sub-

 depending on the configuration. Some layers, like ``LayerNormLinear``, are fusions of two layers: ``LayerNorm`` and ``Linear``. When referring to such layers in precision debug tools, only the ``Linear`` part is affected.

+For `GroupedLinear` layer, the names of underlying GEMMS are of the form `layer_name.gemm_n`, where `n` is the index of the GEMM.
+
 Below is an example ``TransformerLayer`` with four linear layers that can be influenced by the precision debug tools.

 .. figure:: ./img/names.svg
@@ -154,7 +159,7 @@ Below is an example ``TransformerLayer`` with four linear layers that can be inf


 Structured Configuration for GEMMs and Tensors
---------------------------------------------
+----------------------------------------------

 Sometimes a feature is parameterized by a list of tensors or by a list of GEMMs.
 There are multiple ways of describing this parameterization.
@@ -216,7 +221,7 @@ We can use both structs for tensors and GEMMs. The tensors_struct should be nest
          gemm_feature_param1: value

 Enabling or Disabling Sections and Features
------------------------------------------
+-------------------------------------------

 Debug features can be enabled or disabled with the ``enabled`` keyword:


--- a/docs/debug/3_api_debug_setup.rst
+++ b/docs/debug/3_api_debug_setup.rst
 ..
-    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.

@@ -11,7 +11,8 @@ Please refer to the Nvidia-DL-Framework-Inspect `documentation <https://github.c
 Below, we outline the steps for debug initialization.

 initialize()
-----------
+------------
+

 Must be called once on every rank in the global context to initialize Nvidia-DL-Framework-Inspect.

@@ -34,7 +35,7 @@ Must be called once on every rank in the global context to initialize Nvidia-DL-
        log_dir="./log_dir")

 set_tensor_reduction_group()
--------------------------
+----------------------------

 Needed only for logging tensor stats. In multi-GPU training, activation and gradient tensors are distributed across multiple nodes. This method lets you specify the group for the reduction of stats; see the `reduction group section <./4_distributed.rst#reduction-groups>`_ for more details.

@@ -61,7 +62,7 @@ If the tensor reduction group is not specified, then statistics are reduced acro
    # activation/gradient tensor statistics are reduced along pipeline_parallel_group

 set_weight_tensor_tp_group_reduce()
---------------------------------
+-----------------------------------

 By default, weight tensor statistics are reduced within the tensor parallel group. This function allows you to disable that behavior; for more details, see `reduction group section <./4_distributed.rst#reduction-groups>`_.