Unverified Commit df39a7c2 authored by Paweł Gadziński, committed by GitHub

Docs fix (#2301)



* init
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* line lengths
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* subtitle --- fix in many files
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* cross entropy _input -> input rename
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* cross entropy _input -> input rename
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* a lot of small fixes
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* torch_version() change
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* add missing module and fix warnings
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* removed trailing whitespace
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Update docs/api/pytorch.rst
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

* Fix import
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix more imports
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix NumPy docstring parameter spacing and indentation

- Standardize parameter documentation to use 'param : type' format (space before and after colon) per NumPy style guide
- Fix inconsistent indentation in cpu_offload.py docstring
- Modified 51 Python files across transformer_engine/pytorch
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
parent ca468ebe
......@@ -22,10 +22,10 @@ jobs:
sudo apt-get install -y pandoc graphviz doxygen
export GIT_SHA=$(git show-ref --hash HEAD)
- name: 'Build docs'
run: |
run: | # SPHINXOPTS="-W" errors out on warnings
doxygen docs/Doxyfile
cd docs
make html
make html SPHINXOPTS="-W"
- name: 'Upload docs'
uses: actions/upload-artifact@v4
with:
......
......@@ -4,7 +4,7 @@
See LICENSE for license information.
Jax
=======
===
Pre-defined Variable of Logical Axes
------------------------------------
......@@ -20,11 +20,11 @@ Variables are available in `transformer_engine.jax.sharding`.
Checkpointing
------------------------------------
-------------
When using checkpointing with Transformer Engine JAX, please be aware of the checkpointing policy being applied to your model. Any JAX checkpointing policy using `dot`, such as `jax.checkpoint_policies.dots_with_no_batch_dims`, may not work with GEMMs provided by Transformer Engine as they do not always use the `jax.lax.dot_general` primitive. Instead, you can use `transformer_engine.jax.checkpoint_policies.dots_and_te_gemms_with_no_batch_dims` or similar policies that are designed to work with Transformer Engine's GEMMs and `jax.lax.dot_general` GEMMs. You may also use any JAX policies that do not filter by primitive, such as `jax.checkpoint_policies.save_only_these_names` or `jax.checkpoint_policies.everything_saveable`.
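A minimal sketch of the recommended setup, assuming only the policy path quoted above; the layer function and its arguments are placeholders for your own model code:

```python
from functools import partial

import jax
import transformer_engine.jax as te_jax

# TE-aware rematerialization policy: unlike dot-only policies, it also treats
# Transformer Engine GEMMs that bypass jax.lax.dot_general as saveable.
te_policy = te_jax.checkpoint_policies.dots_and_te_gemms_with_no_batch_dims


@partial(jax.checkpoint, policy=te_policy)
def layer_forward(params, hidden_states):
    # Placeholder for a TE/flax layer apply call, e.g. a TransformerLayer.
    ...
```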
Modules
------------------------------------
-------
.. autoapiclass:: transformer_engine.jax.flax.TransformerLayerType
.. autoapiclass:: transformer_engine.jax.MeshResource()
......
......@@ -3,7 +3,7 @@
See LICENSE for license information.
pyTorch
PyTorch
=======
.. autoapiclass:: transformer_engine.pytorch.Linear(in_features, out_features, bias=True, **kwargs)
......@@ -37,9 +37,6 @@ pyTorch
.. autoapiclass:: transformer_engine.pytorch.CudaRNGStatesTracker()
:members: reset, get_states, set_states, add, fork
.. autoapifunction:: transformer_engine.pytorch.fp8_autocast
.. autoapifunction:: transformer_engine.pytorch.fp8_model_init
.. autoapifunction:: transformer_engine.pytorch.autocast
......@@ -47,6 +44,16 @@ pyTorch
.. autoapifunction:: transformer_engine.pytorch.checkpoint
.. autoapifunction:: transformer_engine.pytorch.make_graphed_callables
.. autoapifunction:: transformer_engine.pytorch.get_cpu_offload_context
.. autoapifunction:: transformer_engine.pytorch.parallel_cross_entropy
Recipe availability
-------------------
.. autoapifunction:: transformer_engine.pytorch.is_fp8_available
.. autoapifunction:: transformer_engine.pytorch.is_mxfp8_available
......@@ -63,9 +70,8 @@ pyTorch
.. autoapifunction:: transformer_engine.pytorch.get_default_recipe
.. autoapifunction:: transformer_engine.pytorch.make_graphed_callables
.. autoapifunction:: transformer_engine.pytorch.get_cpu_offload_context
Mixture of Experts (MoE) functions
----------------------------------
.. autoapifunction:: transformer_engine.pytorch.moe_permute
......@@ -75,10 +81,12 @@ pyTorch
.. autoapifunction:: transformer_engine.pytorch.moe_sort_chunks_by_index
.. autoapifunction:: transformer_engine.pytorch.parallel_cross_entropy
.. autoapifunction:: transformer_engine.pytorch.moe_sort_chunks_by_index_with_probs
Communication-computation overlap
---------------------------------
.. autoapifunction:: transformer_engine.pytorch.initialize_ub
.. autoapifunction:: transformer_engine.pytorch.destroy_ub
......@@ -86,6 +94,7 @@ pyTorch
.. autoapiclass:: transformer_engine.pytorch.UserBufferQuantizationMode
:members: FP8, NONE
Quantized tensors
-----------------
......@@ -133,3 +142,10 @@ Tensor saving and restoring functions
.. autoapifunction:: transformer_engine.pytorch.prepare_for_saving
.. autoapifunction:: transformer_engine.pytorch.restore_from_saved
Deprecated functions
--------------------
.. autoapifunction:: transformer_engine.pytorch.fp8_autocast
.. autoapifunction:: transformer_engine.pytorch.fp8_model_init
......@@ -61,7 +61,11 @@ extensions = [
]
templates_path = ["_templates"]
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
exclude_patterns = [
"_build",
"Thumbs.db",
"sphinx_rtd_theme",
]
source_suffix = ".rst"
......@@ -94,6 +98,7 @@ napoleon_custom_sections = [
("Values", "params_style"),
("Graphing parameters", "params_style"),
("FP8-related parameters", "params_style"),
("Quantization parameters", "params_style"),
]
breathe_projects = {"TransformerEngine": root_path / "docs" / "doxygen" / "xml"}
......@@ -101,4 +106,23 @@ breathe_default_project = "TransformerEngine"
autoapi_generate_api_docs = False
autoapi_dirs = [root_path / "transformer_engine"]
autoapi_ignore = ["*/_[!_]*"]
autoapi_ignore = ["*test*"]
# There are 2 warnings about the same namespace (transformer_engine) in two different c++ api
# docs pages. This seems to be the only way to suppress these warnings.
def setup(app):
"""Custom Sphinx setup to filter warnings."""
import logging
# Filter out duplicate C++ declaration warnings
class DuplicateDeclarationFilter(logging.Filter):
def filter(self, record):
message = record.getMessage()
if "Duplicate C++ declaration" in message and "transformer_engine" in message:
return False
return True
# Apply filter to Sphinx logger
logger = logging.getLogger("sphinx")
logger.addFilter(DuplicateDeclarationFilter())
......@@ -2,8 +2,9 @@
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
Precision debug tools
==============================================
=====================
.. toctree::
:caption: Precision debug tools
......
......@@ -4,7 +4,7 @@
See LICENSE for license information.
Getting started
==============
===============
.. note::
......@@ -38,7 +38,7 @@ To start debugging, one needs to create a configuration YAML file. This file lis
one - ``UserProvidedPrecision`` - is a custom feature implemented by the user. Nvidia-DL-Framework-Inspect inserts features into the layers according to the config.
Example training script
----------------------
-----------------------
Let's look at a simple example of training a Transformer layer using Transformer Engine with FP8 precision. This example demonstrates how to set up the layer, define an optimizer, and perform a few training iterations using synthetic data.
......@@ -81,7 +81,7 @@ We will demonstrate two debug features on the code above:
2. Logging statistics for other GEMM operations, such as gradient statistics for data gradient GEMM within the LayerNormLinear sub-layer of the TransformerLayer.
Config file
----------
-----------
We need to prepare the configuration YAML file, as below
......@@ -114,7 +114,8 @@ We need to prepare the configuration YAML file, as below
Further explanation on how to create config files is in the :doc:`next part of the documentation <2_config_file_structure>`.
Adjusting Python file
--------------------
---------------------
.. code-block:: python
......@@ -145,7 +146,8 @@ In the modified code above, the following changes were made:
3. Added ``debug_api.step()`` after each forward-backward pass.
Inspecting the logs
------------------
-------------------
Let's look at the files with the logs. Two files will be created:
......@@ -213,7 +215,8 @@ The second log file (``nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-
INFO - transformer_layer.self_attention.layernorm_qkv_activation_l1_norm iteration=000004 value=130776.7969
Logging using TensorBoard
------------------------
-------------------------
Precision debug tools support logging using `TensorBoard <https://www.tensorflow.org/tensorboard>`_. To enable it, one needs to pass the argument ``tb_writer`` to ``debug_api.initialize()``. Let's modify the ``train.py`` file.
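A minimal sketch of that modification; the ``tb_writer`` argument comes from the note above, while the config path and feature directory are illustrative placeholders:

```python
import nvdlfw_inspect.api as debug_api
from torch.utils.tensorboard import SummaryWriter

# Hand a SummaryWriter to debug_api so the logged statistics are mirrored
# to TensorBoard in addition to the log files.
tb_writer = SummaryWriter("./tb_log_dir")

debug_api.initialize(
    config_file="./config.yaml",               # illustrative path
    feature_dirs=["/path/to/debug/features"],  # illustrative path
    log_dir="./log_dir",
    tb_writer=tb_writer,
)
```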
......
......@@ -4,13 +4,14 @@
See LICENSE for license information.
Config File Structure
====================
=====================
To enable debug features, create a configuration YAML file to specify the desired behavior, such as determining which GEMMs (General Matrix Multiply operations) should run in higher precision rather than FP8 and defining which statistics to log.
Below, we outline how to structure the configuration YAML file.
General Format
-------------
--------------
A config file can have one or more sections, each containing settings for specific layers and features:
......@@ -55,7 +56,8 @@ Sections may have any name and must contain:
3. Additional fields describing features for those layers.
Layer Specification
------------------
-------------------
Debug layers can be identified by a ``name`` parameter:
......@@ -89,7 +91,8 @@ Examples:
(...)
Names in Transformer Layers
--------------------------
---------------------------
There are three ways to assign a name to a layer in the Transformer Engine:
......@@ -156,7 +159,7 @@ Below is an example ``TransformerLayer`` with four linear layers that can be inf
Structured Configuration for GEMMs and Tensors
---------------------------------------------
----------------------------------------------
Sometimes a feature is parameterized by a list of tensors or by a list of GEMMs.
There are multiple ways of describing this parameterization.
......@@ -218,7 +221,7 @@ We can use both structs for tensors and GEMMs. The tensors_struct should be nest
gemm_feature_param1: value
Enabling or Disabling Sections and Features
------------------------------------------
-------------------------------------------
Debug features can be enabled or disabled with the ``enabled`` keyword:
......
......@@ -11,7 +11,8 @@ Please refer to the Nvidia-DL-Framework-Inspect `documentation <https://github.c
Below, we outline the steps for debug initialization.
initialize()
-----------
------------
Must be called once on every rank in the global context to initialize Nvidia-DL-Framework-Inspect.
......@@ -34,7 +35,7 @@ Must be called once on every rank in the global context to initialize Nvidia-DL-
log_dir="./log_dir")
set_tensor_reduction_group()
--------------------------
----------------------------
Needed only for logging tensor stats. In multi-GPU training, activation and gradient tensors are distributed across multiple nodes. This method lets you specify the group for the reduction of stats; see the `reduction group section <./4_distributed.rst#reduction-groups>`_ for more details.
......@@ -61,7 +62,7 @@ If the tensor reduction group is not specified, then statistics are reduced acro
# activation/gradient tensor statistics are reduced along pipeline_parallel_group
set_weight_tensor_tp_group_reduce()
---------------------------------
-----------------------------------
By default, weight tensor statistics are reduced within the tensor parallel group. This function allows you to disable that behavior; for more details, see `reduction group section <./4_distributed.rst#reduction-groups>`_.
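A one-line sketch of disabling that reduction; the boolean argument is assumed from the description above:

```python
import nvdlfw_inspect.api as debug_api

# Stop reducing weight-tensor statistics within the tensor-parallel group.
debug_api.set_weight_tensor_tp_group_reduce(False)
```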
......
......@@ -4,7 +4,7 @@
See LICENSE for license information.
Debug features
==========
==============
.. autoapiclass:: transformer_engine.debug.features.log_tensor_stats.LogTensorStats
.. autoapiclass:: transformer_engine.debug.features.log_fp8_tensor_stats.LogFp8TensorStats
......
......@@ -4,7 +4,7 @@
See LICENSE for license information.
Distributed training
===================
====================
Nvidia-Pytorch-Inspect with Transformer Engine supports multi-GPU training. This guide describes how to run it and how the supported features work in the distributed setting.
......@@ -14,7 +14,8 @@ To use precision debug tools in multi-GPU training, one needs to:
2. If one wants to log stats, one may want to invoke ``debug_api.set_tensor_reduction_group`` with a proper reduction group.
Behavior of the features
-----------------------
------------------------
In a distributed setting, **DisableFP8GEMM** and **DisableFP8Layer** function similarly to the single-GPU case, with no notable differences.
......@@ -28,7 +29,8 @@ In a distributed setting, **DisableFP8GEMM** and **DisableFP8Layer** function si
Logging-related features are more complex and will be discussed further in the next sections.
Reduction groups
--------------
----------------
In setups with tensor, data, or pipeline parallelism, some tensors are distributed across multiple GPUs, requiring a reduction operation to compute statistics for these tensors.
......@@ -65,7 +67,8 @@ Below, we illustrate configurations for a 4-node setup with tensor parallelism s
Microbatching
-----------
-------------
Let's dive into how statistics collection works with microbatching. By microbatching, we mean invoking multiple ``forward()`` calls for each ``debug_api.step()``. The behavior is as follows:
......@@ -73,7 +76,7 @@ Let's dive into how statistics collection works with microbatching. By microbatc
- For other tensors, the stats are accumulated.
Logging to files and TensorBoard
------------------------------
--------------------------------
In a single-node setup with ``default_logging_enabled=True``, all logs are saved by default to ``log_dir/nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-0.log``. In multi-GPU training, each node writes its reduced statistics to its unique file, named ``log_dir/nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-i.log`` for rank i. Because these logs contain reduced statistics, the logged values are identical for all nodes within a reduction group.
......
......@@ -2,8 +2,9 @@
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
API
============
===
.. toctree::
:caption: Precision debug tools API
......
......@@ -100,7 +100,7 @@
"\n",
"</div>\n",
"\n",
"A variety of parallelism strategies can be used to enable multi-GPU training of Transformer models, often based on different approaches to distribute their $\\text{sequence_length} \\times \\text{batch_size} \\times \\text{hidden_size}$ activation tensors. The most common approach is data parallelism, which distributes along the $\\text{batch_size}$ dimension. By storing duplicate copies of the model on each GPU, the forward and backward passes of the training step can be done independently, followed by a gradient synchronization. A more advanced strategy is tensor parallelism, a type of model parallelism that distributes along the $\\text{hidden_size}$ dimension. This allows us to scale past the limits of data parallelism (typically $\\text{hidden_size} > \\text{batch_size}$) and to reduce the per-GPU memory usage (since model parameters are also distributed), but it also incurs the overhead of communicating activation tensors between GPUs at every step. For a more detailed explanation, please see the [Megatron-LM paper](https://arxiv.org/pdf/1909.08053.pdf). Finally, sequence parallelism distributes along the $\\text{sequence_length}$ dimension. This can be used when tensor parallelism is enabled in order to parallelize operations that run outside the tensor-parallel region (e.g. layer norm). For more details, please see [this paper](https://arxiv.org/pdf/2205.05198.pdf).\n",
"A variety of parallelism strategies can be used to enable multi-GPU training of Transformer models, often based on different approaches to distribute their $\\text{sequence_length} \\cdot \\text{batch_size} \\cdot \\text{hidden_size}$ activation tensors. The most common approach is data parallelism, which distributes along the $\\text{batch_size}$ dimension. By storing duplicate copies of the model on each GPU, the forward and backward passes of the training step can be done independently, followed by a gradient synchronization. A more advanced strategy is tensor parallelism, a type of model parallelism that distributes along the $\\text{hidden_size}$ dimension. This allows us to scale past the limits of data parallelism (typically $\\text{hidden_size} > \\text{batch_size}$) and to reduce the per-GPU memory usage (since model parameters are also distributed), but it also incurs the overhead of communicating activation tensors between GPUs at every step. For a more detailed explanation, please see the [Megatron-LM paper](https://arxiv.org/pdf/1909.08053.pdf). Finally, sequence parallelism distributes along the $\\text{sequence_length}$ dimension. This can be used when tensor parallelism is enabled in order to parallelize operations that run outside the tensor-parallel region (e.g. layer norm). For more details, please see [this paper](https://arxiv.org/pdf/2205.05198.pdf).\n",
"\n",
"To show this in action, let's first initialize NCCL with a trivial process group:"
]
......@@ -131,7 +131,7 @@
"id": "1f2b80d0",
"metadata": {},
"source": [
"We only initialize with one GPU to keep this example simple. Please consult the documentation [torch.distributed](https://pytorch.org/docs/stable/distributed.html) for guidance on running with multiple GPUs. Note that we require that each distributed process corresponds to exactly one GPU, so we treat them interchangeably. In practice, there are multiple factors that can affect the optimal parallel layout: the system hardware, the network topology, usage of other parallelism schemes like pipeline parallelism. A rough rule-of-thumb is to interpret the GPUs as a 2D grid with dimensions of $\\text{num_nodes} \\times \\text{gpus_per_node}$. The rows are tensor-parallel groups and the columns are data-parallel groups.\n",
"We only initialize with one GPU to keep this example simple. Please consult the documentation [torch.distributed](https://pytorch.org/docs/stable/distributed.html) for guidance on running with multiple GPUs. Note that we require that each distributed process corresponds to exactly one GPU, so we treat them interchangeably. In practice, there are multiple factors that can affect the optimal parallel layout: the system hardware, the network topology, usage of other parallelism schemes like pipeline parallelism. A rough rule-of-thumb is to interpret the GPUs as a 2D grid with dimensions of $\\text{num_nodes} \\cdot \\text{gpus_per_node}$. The rows are tensor-parallel groups and the columns are data-parallel groups.\n",
"\n",
"Enabling data parallelism with Transformer Engine is similar to enabling data parallelism with standard PyTorch models: simply wrap the modules with [torch.nn.parallel.DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html). Transformer Engine modules also have native support for tensor and sequence parallelism. If the user provides a process group for tensor parallelism, the modules will distribute the data and perform communication internally. If sequence parallelism is enabled, it will be applied for operations that are not amenable to tensor parallelism and it will use the tensor-parallel process group.\n",
"\n",
......
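A minimal sketch of the two options described in the cell above; the feature sizes, process-group setup, and `parallel_mode` choice are illustrative:

```python
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Data parallelism: wrap the TE module like any other PyTorch module.
layer = te.Linear(1024, 1024).cuda()
ddp_layer = torch.nn.parallel.DistributedDataParallel(layer)

# Tensor parallelism instead: pass a tensor-parallel process group directly to
# the module, e.g. te.Linear(1024, 1024, tp_group=tp_group, parallel_mode="column").
```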
......@@ -174,7 +174,7 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"id": "50852cb5",
"metadata": {},
"outputs": [
......@@ -266,7 +266,7 @@
},
{
"cell_type": "code",
"execution_count": 24,
"execution_count": null,
"id": "906b8cf1",
"metadata": {},
"outputs": [
......@@ -299,7 +299,7 @@
},
{
"cell_type": "code",
"execution_count": 23,
"execution_count": null,
"id": "d3637094",
"metadata": {},
"outputs": [
......@@ -509,10 +509,10 @@
"\n",
"* PyTorch: When both options are provided by the user, `cu_seqlens` is preferred as there is no extra conversion needed.\n",
" - `cu_seqlens`: Users can provide cumulative sequence length tensors `cu_seqlens_q` and `cu_seqlens_kv` for `q` and `k`/`v` to the flash-attention or cuDNN attention backend. An example of `cu_seqlens` is `[0, 2, 6, 7]` for a batch of 3 `[aa000, bbbb0, c0000]`.\n",
" - `attention_mask`: Users can also provide `attention_mask` as an alternative, which will then be converted to `cu_seqlens`. For self-attention, `attention_mask` should be one single tensor in shape `[batch_size, 1, 1, seqlen_q]`, and for cross-attention, `attention_mask` should be a list of two tensors in shapes `[batch_size, 1, 1, seqlen_q]` and `[batch_size, 1, 1, seqlen_kv]`, respectively.\n",
" - `attention_mask`: Users can also provide `attention_mask` as an alternative, which will then be converted to `cu_seqlens`. For self-attention, `attention_mask` should be one single tensor of shape `[batch_size, 1, 1, seqlen_q]`, and for cross-attention, `attention_mask` should be a list of two tensors of shapes `[batch_size, 1, 1, seqlen_q]` and `[batch_size, 1, 1, seqlen_kv]`, respectively.\n",
"\n",
"\n",
"* JAX: Users should provide the `attention_mask` tensor in shape `[batch_size, 1, seqlen_q, seqlen_kv]`.\n",
"* JAX: Users should provide the `attention_mask` tensor of shape `[batch_size, 1, seqlen_q, seqlen_kv]`.\n",
"\n",
"**qkv_format=thd:** Transformer Engine extracts the max sequence length information from `q`, `k`, `v` if `max_seqlen_q` and `max_seqlen_kv` are not provided. This requires GPU-CPU copy and synchronization operations. For performance reasons, please set `max_seqlen_q` and `max_seqlen_kv` to their appropriate values for `thd` QKV format.\n",
"\n",
......@@ -521,7 +521,7 @@
},
{
"cell_type": "code",
"execution_count": 33,
"execution_count": null,
"id": "a1f25a9b",
"metadata": {},
"outputs": [
......
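As a hedged illustration of the `cu_seqlens` example given earlier, the batch of 3 sequences `[aa000, bbbb0, c0000]` has lengths `[2, 4, 1]`:

```python
import torch

# Cumulative sequence lengths start at 0 and end at the total token count.
seq_lens = torch.tensor([2, 4, 1], dtype=torch.int32)
cu_seqlens_q = torch.cat(
    [torch.zeros(1, dtype=torch.int32), torch.cumsum(seq_lens, dim=0, dtype=torch.int32)]
)
# cu_seqlens_q == tensor([0, 2, 6, 7]); for self-attention, cu_seqlens_kv is the same.
```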
......@@ -502,7 +502,7 @@
"\n",
"</div>\n",
"\n",
"Enabling FP8 support is very simple in Transformer Engine. We just need to wrap the modules within an [autocast](.../api/jax.rst#transformer_engine.jax.fp8_autocast) context manager. See the [FP8 tutorial](fp8_primer.ipynb) for a detailed explanation of FP8 recipes and the supported options.\n",
"Enabling FP8 support is very simple in Transformer Engine. We just need to wrap the modules within an [autocast](../api/jax.rst#transformer_engine.jax.fp8_autocast) context manager. See the [FP8 tutorial](fp8_primer.ipynb) for a detailed explanation of FP8 recipes and the supported options.\n",
"\n",
"<div class=\"alert alert-warning\">\n",
"\n",
......
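A minimal sketch of the autocast usage mentioned above; the recipe choice and the elided model code are illustrative:

```python
import transformer_engine.jax as te_jax
from transformer_engine.common import recipe

# Module construction and apply calls placed inside the context manager run
# their supported GEMMs in FP8.
with te_jax.fp8_autocast(enabled=True, fp8_recipe=recipe.DelayedScaling()):
    ...  # build / call transformer_engine.jax.flax modules here
```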
......@@ -38,7 +38,7 @@
"\n",
"For those seeking a deeper understanding of text generation mechanisms in Transformers, it is recommended to check out the [HuggingFace generation tutorial](https://huggingface.co/docs/transformers/llm_tutorial).\n",
"\n",
"In a previous tutorial on [Llama](../te_llama/tutorial_accelerate_hf_llama_finetuning_with_te.ipynb), it was demonstrated how finetuning of an open-source Llama model can be accelerated using Transformer Engine's `TransformerLayer`. Building on that foundation, this tutorial showcases how to accelerate the token generation from the open-source Hugging Face Gemma 7B model.\n",
"In a previous tutorial on [Llama](../te_llama/tutorial_accelerate_hf_llama_with_te.ipynb), it was demonstrated how finetuning of an open-source Llama model can be accelerated using Transformer Engine's `TransformerLayer`. Building on that foundation, this tutorial showcases how to accelerate the token generation from the open-source Hugging Face Gemma 7B model.\n",
"\n",
"This tutorial introduces several features of the Transformer Engine library that contribute towards this goal. A brief explanation is as follows:\n",
"\n",
......
......@@ -4,7 +4,7 @@
See LICENSE for license information.
Transformer Engine documentation
==============================================
=================================
.. ifconfig:: "dev" in release
......
......@@ -28,7 +28,7 @@ on `NVIDIA GPU Cloud <https://ngc.nvidia.com>`_.
pip - from PyPI
-----------------------
---------------
Transformer Engine can be directly installed from `our PyPI <https://pypi.org/project/transformer-engine/>`_, e.g.
......@@ -47,7 +47,7 @@ The core package from Transformer Engine (without any framework extensions) can
By default, this will install the core library compiled for CUDA 12. The CUDA major version can be specified by modifying the extra dependency to `core_cu12` or `core_cu13`.
pip - from GitHub
-----------------------
-----------------
Additional Prerequisites
^^^^^^^^^^^^^^^^^^^^^^^^
......
......@@ -278,7 +278,7 @@ void convert_bshd_to_thd(Tensor tensor, Tensor cu_seqlens, Tensor new_tensor, in
/***************************************************************************************************
* KV Cache: Copy new KV tokens to the KV cache
* 1. new_k and new_v are in qkv_format; k_cache and v_cache are in 'bshd' format
* 2. cu_new_lens and cu_cached_lens are in shape [b + 1]; cu_cached_lens include the added lens
* 2. cu_new_lens and cu_cached_lens are of shape [b + 1]; cu_cached_lens include the added lens
* in current step
* 3. Non-paged KV cache is a special case of paged KV cache, with page_table = [b, 1] and
* max_pages_per_seq = 1. We use the same underlying kernel for both non-paged and paged.
......
......@@ -131,7 +131,7 @@ enum NVTE_Mask_Type {
* NVTE_VANILLA_SOFTMAX: S[:,:,:,i] = exp(S[:,:,:,i])/sum(exp(S[:,:,:,:]), dim=-1),
* NVTE_OFF_BY_ONE_SOFTMAX: S[:,:,:,i] = exp(S[:,:,:,i])/(1 + sum(exp(S[:,:,:,:]), dim=-1)), and
* NVTE_LEARNABLE_SOFTMAX: S[:,j,:,i] = exp(S[:,j,:,i])/(exp(alpha[j]) + sum(exp(S[:,j,:,:]), dim=-1)),
* where alpha is a learnable parameter in shape [H].
* where alpha is a learnable parameter of shape [H].
*/
enum NVTE_Softmax_Type {
/*! Vanilla softmax */
......
......@@ -50,7 +50,7 @@ class MMParams:
Parameters
----------
use_split_accumulator : bool, default = `True`
use_split_accumulator : bool, default = True
Use FP8 fast accumulation on Hopper or Ada. For more details,
see CUBLASLT_MATMUL_DESC_FAST_ACCUM option for cublasLtMatmul.
"""
......@@ -159,7 +159,7 @@ class DelayedScaling(Recipe):
recipe: DelayedScaling) -> Tensor
where `Tensor` is a framework tensor type.
reduce_amax: bool, default = `True`
reduce_amax: bool, default = True
By default, if `torch.distributed` is initialized, the `amax` value for FP8
tensors is reduced across the `amax_reduction_group` (specified in the `autocast`
call). This keeps the amaxes and scaling factors synced across the given
......@@ -167,13 +167,13 @@ class DelayedScaling(Recipe):
GPU maintains local amaxes and scaling factors. To ensure results are
numerically identical across checkpointing boundaries in this case, all
ranks must checkpoint in order to store the local tensors.
fp8_dpa: bool, default = `False`
fp8_dpa: bool, default = False
Whether to enable FP8 dot product attention (DPA). When the model is placed in an
`autocast(enabled=True)` region and `fp8_dpa` is set to `True`, DPA casts the
inputs from higher precision to FP8, performs attention in FP8, and casts tensors
back to higher precision as outputs. FP8 DPA currently is only supported in the
`FusedAttention` backend.
fp8_mha: bool, default = `False`
fp8_mha: bool, default = False
Whether to enable FP8 multi-head attention (MHA). When `True`, it removes the casting
operations mentioned above at the DPA boundaries. Currently only standard MHA modules
i.e. `LayerNormLinear/Linear + DPA + Linear`, are supported for this feature. When
......@@ -422,11 +422,11 @@ class NVFP4BlockScaling(Recipe):
----------
fp4_format : {Format.E2M1}, default = Format.E2M1
FP4 data type.
disable_rht : bool, default = `False`
disable_rht : bool, default = False
If set to `True`, random Hadamard transforms are not applied to any tensor.
disable_stochastic_rounding : bool, default = `False`
disable_stochastic_rounding : bool, default = False
If set to `True`, stochastic rounding is disabled during quantization for all tensors.
disable_2d_quantization : bool, default = `False`
disable_2d_quantization : bool, default = False
If set to `True`, 1D block scaling with block size 16 is used for all tensors.
"""
......@@ -494,13 +494,15 @@ class CustomRecipe(Recipe):
qfactory : Callable
Factory callable that returns a quantizer instance for a
given semantic tensor role.
The callable is typically invoked as:
The callable is typically invoked as::
qfactory(
role: str,
)
Where `role` is one of the following strings for e.g. te.Linear
(stable public contract):
- forward: "linear_input", "linear_weight", "linear_output"
- backward: "linear_grad_output", "linear_grad_input"
"""
......
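A hedged sketch of a `qfactory` callable matching the contract above; the quantizer helpers are hypothetical placeholders for whatever quantizers a given recipe actually provides:

```python
def qfactory(role: str):
    # Roles follow the stable public contract listed above for te.Linear.
    forward_roles = ("linear_input", "linear_weight", "linear_output")
    backward_roles = ("linear_grad_output", "linear_grad_input")
    if role in forward_roles:
        return make_forward_quantizer()   # hypothetical helper
    if role in backward_roles:
        return make_backward_quantizer()  # hypothetical helper
    raise ValueError(f"Unknown tensor role: {role}")
```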