More detailed documentation for recipes (#2343)

* Code drop: Update recipes documentation and remove custom recipes from low precision training
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* Fix SVG css import path for diagrams
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* Refactor low_precision_training docs: remove optimizers, fix imports, add GPU checks
  Changes:
  - Remove optimizer code from all recipe examples (keep only forward/backward)
  - Fix Format imports (use Format.E4M3 instead of string 'E4M3')
  - Fix params_dtype for PyTorch examples (add params_dtype=torch.bfloat16)
  - Add GPU capability assertions before START blocks for blockwise/mxfp8/nvfp4
  - Fix JAX imports (Float8CurrentScaling from common.recipe, NVFP4BlockScaling)
  - Add global_shard_guard for TransformerLayer examples in JAX
  - Fix fused_layers_jax.py return tuple unpacking
  - Update memory_usage JAX examples with dynamic GPU measurement
  - Remove memory_usage_3_jax (JAX doesn't support FP8 weight storage)
  - Update performance_considerations.rst for JAX differences
  - Delete unused .out files and fp8_autocast_jax.py
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Fix JAX memory usage .out files with correct output
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* responded to comments
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* applied suggestions form greptile
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* year change
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* jax compute capability fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fixes
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Commit 3ceb248e (unverified), authored Feb 02, 2026 by Paweł Gadziński, committed by GitHub Feb 02, 2026
11 changed files
--- a/docs/features/low_precision_training/performance_considerations/memory_usage_1_pytorch.py
+++ b/docs/features/low_precision_training/performance_considerations/memory_usage_1_pytorch.py
+# Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# See LICENSE for license information.
+
+import torch
+
+# Requires Ada (SM89) or Hopper (SM90), different results on Blackwell+
+cc = torch.cuda.get_device_capability()
+assert (cc[0] == 8 and cc[1] >= 9) or cc[0] == 9, "This example requires SM89 (Ada) or SM90 (Hopper)"
+
+print("# START_MEMORY_USAGE_1")
+import torch
+import transformer_engine.pytorch as te
+
+
+def measure_memory():
+    torch.cuda.empty_cache()
+    torch.cuda.reset_peak_memory_stats()
+
+    init_memory = torch.cuda.memory_allocated()
+    layer = te.Linear(1024, 1024, params_dtype=torch.bfloat16)
+
+    inp = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")
+    out = layer(inp)
+    del inp  # Input is saved by model for backward, not by user script
+
+    mem_after_forward = torch.cuda.memory_allocated() - init_memory
+    return mem_after_forward
+
+
+# Warmup run
+measure_memory()
+
+# Actual measurement
+mem_after_forward = measure_memory()
+print(f"Memory usage after forward pass: {mem_after_forward/1024**2:.2f} MB")
+# END_MEMORY_USAGE_1
+print("# END_MEMORY_USAGE_1")
--- a/docs/features/low_precision_training/performance_considerations/memory_usage_2_jax.out
+++ b/docs/features/low_precision_training/performance_considerations/memory_usage_2_jax.out
+# START_MEMORY_USAGE_2
+Tensors in memory:
+  Shape: (1024, 1024), Dtype: float8_e4m3fn, Size: 1024.0 KB
+  Shape: (1024, 1024), Dtype: float8_e4m3fn, Size: 1024.0 KB
+  Shape: (1024, 1024), Dtype: bfloat16, Size: 2048.0 KB
+  Total from all live arrays: 4.02 MB
+# END_MEMORY_USAGE_2
+Processing events...
+Generated:
+	No reports were generated
--- a/docs/features/low_precision_training/performance_considerations/memory_usage_2_jax.py
+++ b/docs/features/low_precision_training/performance_considerations/memory_usage_2_jax.py
+# Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# See LICENSE for license information.
+
+# Requires Ada (SM89) or Hopper (SM90), different results on Blackwell+
+
+print("# START_MEMORY_USAGE_2")
+
+import jax
+import jax.numpy as jnp
+import transformer_engine.jax as te
+from transformer_engine.jax.flax import DenseGeneral
+from transformer_engine.common.recipe import DelayedScaling
+
+
+key = jax.random.PRNGKey(0)
+recipe = DelayedScaling()
+jax.clear_caches()
+
+
+# Initialize layer with BF16 parameters (outside autocast)
+layer = DenseGeneral(features=1024, dtype=jnp.bfloat16)
+x = jax.random.normal(key, (1024, 1024), dtype=jnp.bfloat16)
+
+
+# Forward and backward pass with FP8 compute
+with te.autocast(enabled=True, recipe=recipe):
+    var_collect = layer.init(key, x)
+
+    @jax.jit
+    def loss_fn(var_collect, x):
+        output = layer.apply(var_collect, x)
+        return output.sum()
+
+    # jax.vjp runs the forward pass and keeps the tensors saved for backward alive
+    _, backward_fn = jax.vjp(loss_fn, var_collect, x)
+
+del x
+
+print("Tensors in memory:")
+total_bytes = 0
+for arr in jax.live_arrays():
+    total_bytes += arr.nbytes
+    if arr.nbytes > 200000:  # print only large tensors; small ones still count toward the total
+        print(f"  Shape: {arr.shape}, Dtype: {arr.dtype}, Size: {arr.nbytes / 1024:.1f} KB")
+print(f"  Total from all live arrays: {total_bytes / (1024**2):.2f} MB")
+
+print("# END_MEMORY_USAGE_2")
--- a/docs/features/low_precision_training/performance_considerations/memory_usage_2_pytorch.out
+++ b/docs/features/low_precision_training/performance_considerations/memory_usage_2_pytorch.out
+
+# START_MEMORY_USAGE_2
+Memory after forward pass: 6.02 MB
+# END_MEMORY_USAGE_2
--- a/docs/features/low_precision_training/performance_considerations/memory_usage_2_pytorch.py
+++ b/docs/features/low_precision_training/performance_considerations/memory_usage_2_pytorch.py
+# Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# See LICENSE for license information.
+
+import torch
+
+# Requires Ada (SM89) or Hopper (SM90), different results on Blackwell+
+cc = torch.cuda.get_device_capability()
+assert (cc[0] == 8 and cc[1] >= 9) or cc[0] == 9, "This example requires SM89 (Ada) or SM90 (Hopper)"
+
+print("# START_MEMORY_USAGE_2")
+import torch
+import transformer_engine.pytorch as te
+
+
+def measure_memory():
+    torch.cuda.empty_cache()
+    torch.cuda.reset_peak_memory_stats()
+
+    init_memory = torch.cuda.memory_allocated()
+    layer = te.Linear(1024, 1024, params_dtype=torch.bfloat16)
+
+    inp = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")
+    with te.autocast(enabled=True):
+        out = layer(inp)
+    del inp  # Input is saved by model for backward, not by user script
+
+    mem_after_forward = torch.cuda.memory_allocated() - init_memory
+    return mem_after_forward
+
+
+# Warmup run
+measure_memory()
+
+# Actual measurement
+mem_after_forward = measure_memory()
+print(f"Memory after forward pass: {mem_after_forward/1024**2:.2f} MB")
+# END_MEMORY_USAGE_2
+print("# END_MEMORY_USAGE_2")
--- a/docs/features/low_precision_training/performance_considerations/memory_usage_3_pytorch.out
+++ b/docs/features/low_precision_training/performance_considerations/memory_usage_3_pytorch.out
+
+# START_MEMORY_USAGE_3
+Memory after forward pass: 3.02 MB
+# END_MEMORY_USAGE_3
--- a/docs/features/low_precision_training/performance_considerations/memory_usage_3_pytorch.py
+++ b/docs/features/low_precision_training/performance_considerations/memory_usage_3_pytorch.py
+# Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# See LICENSE for license information.
+
+import torch
+
+# Requires Ada (SM89) or Hopper (SM90), different results on Blackwell+
+cc = torch.cuda.get_device_capability()
+assert (cc[0] == 8 and cc[1] >= 9) or cc[0] == 9, "This example requires SM89 (Ada) or SM90 (Hopper)"
+
+print("# START_MEMORY_USAGE_3")
+import torch
+import transformer_engine.pytorch as te
+
+
+def measure_memory():
+    torch.cuda.empty_cache()
+    torch.cuda.reset_peak_memory_stats()
+
+    init_memory = torch.cuda.memory_allocated()
+
+    # FP8 inference with FP8 weights
+    with te.quantized_model_init(enabled=True), torch.no_grad():
+        layer_fp8 = te.Linear(1024, 1024, params_dtype=torch.bfloat16)
+
+    with torch.no_grad():
+        inp = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")
+        with te.autocast(enabled=True):
+            out = layer_fp8(inp)
+        del inp  # Input is not saved by model for backward in inference
+
+    mem_after_forward = torch.cuda.memory_allocated() - init_memory
+
+    return mem_after_forward
+
+
+# Warmup run
+measure_memory()
+
+# Actual measurement
+mem_after_forward = measure_memory()
+print(f"Memory after forward pass: {mem_after_forward/1024**2:.2f} MB")
+# END_MEMORY_USAGE_3
+print("# END_MEMORY_USAGE_3")
--- a/docs/features/low_precision_training/performance_considerations/performance_considerations.rst
+++ b/docs/features/low_precision_training/performance_considerations/performance_considerations.rst
+..
+    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+    See LICENSE for license information.
+
+Performance Considerations
+===================================
+
+.. _handling_transposes:
+
+Handling transposes
+-------------------
+
+In the previous chapter we showed that for FP8 on the Hopper architecture,
+some tensors must be physically transposed in memory before the required GEMMs can run.
+Handling these transposes in low precision Transformer training requires some care.
+Let's start by introducing the concept of *tensor usages*.
+
+**Tensor usages**
+
+Each quantized tensor may have two usages:
+
+- *rowwise usage* -- used in matrix multiplication when consecutive elements of a row are accessed,
+- *columnwise usage* -- used in matrix multiplication when consecutive elements of a column are accessed.
+
+To understand what access of consecutive elements means, let's consider two matrices ``A`` and ``B``
+and analyze how their elements are accessed during multiplication.
+
+For NN (non-transposed, non-transposed) multiplication ``C = A * B``, the formula is ``C_ij = sum_k(A_ik * B_kj)``. 
+To compute element ``C_ij``, we iterate over the i-th row of ``A`` (elements ``A_i0, A_i1, ...``) 
+and the j-th column of ``B`` (elements ``B_0j, B_1j, ...``). Thus, ``A`` is accessed rowwise 
+and ``B`` is accessed columnwise.
+
+For NT (non-transposed, transposed) multiplication ``C = A * B^T``, the formula changes to ``C_ij = sum_k(A_ik * B_jk)``.
+Now we iterate over the i-th row of ``A`` and the j-th row of ``B`` (elements ``B_j0, B_j1, ...``).
+Both tensors are accessed rowwise.
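A minimal plain-Python sketch of the two formulas above. It is illustrative only and not Transformer Engine code; the helper names `matmul_nn`, `matmul_nt` and `transpose` are hypothetical. The indexing in the inner sum makes the access direction explicit: `B[k][j]` walks down a column of `B`, while `B[j][k]` walks along a row.

```python
def matmul_nn(A, B):
    """C = A * B, i.e. C_ij = sum_k A_ik * B_kj (row of A, column of B)."""
    rows, inner, cols = len(A), len(B), len(B[0])
    return [
        [sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def matmul_nt(A, B):
    """C = A * B^T, i.e. C_ij = sum_k A_ik * B_jk (row of A, row of B)."""
    rows, inner, cols = len(A), len(A[0]), len(B)
    return [
        [sum(A[i][k] * B[j][k] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def transpose(M):
    """Return a physically transposed copy of M."""
    return [list(col) for col in zip(*M)]


A = [[1, 2, 3], [4, 5, 6]]    # 2 x 3
B = [[1, 0], [0, 1], [2, 2]]  # 3 x 2

C_nn = matmul_nn(A, B)             # B is read column by column
C_nt = matmul_nt(A, transpose(B))  # the transposed copy is read row by row
```

Both calls compute the same product; the NT variant simply needs the second operand stored transposed so that it can be read along rows.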
+
+The figure below illustrates these access patterns:
+
+.. figure:: img/gemm_access_pattern.svg
+   :align: center
+   :width: 60%
+   :alt: Matrix multiplication access pattern showing rowwise access for first tensor and columnwise access for second tensor
+
+   Figure 1: Access patterns in matrix multiplication for matrices in ``A * B`` and ``A * B^T`` operations.
+
+Based on the visualization above, we can derive general rules for when each matrix 
+is accessed in rowwise or columnwise fashion. The key insight is that:
+
+- The **first tensor** in a matrix multiplication is accessed along its rows (rowwise) when non-transposed,
+  or along its columns (columnwise) when transposed.
+- The **second tensor** follows the opposite pattern: columnwise when non-transposed, rowwise when transposed.
+
+.. table:: Table 1: Summary of tensor access patterns based on transpose state.
+   :align: center
+
+   +------------------+--------------+---------------+
+   |                  | First tensor | Second tensor |
+   +------------------+--------------+---------------+
+   | Non-transposed   | rowwise      | columnwise    |
+   +------------------+--------------+---------------+
+   | Transposed       | columnwise   | rowwise       |
+   +------------------+--------------+---------------+
+
+**Input, weight and output gradient usages**
+
+Now let's apply these rules to a Linear layer. During training, a Linear layer performs 
+three GEMM operations: one in the forward pass and two in the backward pass.
+
+
+.. table:: Table 2: Tensor access patterns for GEMM operations in a Linear layer during training.
+   :align: center
+
+   +-------------------+-------------------------------------+---------------------------+---------------------------+
+   | GEMM              | Formula                             | First tensor usage        | Second tensor usage       |
+   +===================+=====================================+===========================+===========================+
+   | Forward           | ``output = input * weight^T``       | input: rowwise            | weight: rowwise           |
+   +-------------------+-------------------------------------+---------------------------+---------------------------+
+   | Weight gradient   | ``wgrad = output_grad^T * input``   | output_grad: columnwise   | input: columnwise         |
+   +-------------------+-------------------------------------+---------------------------+---------------------------+
+   | Input gradient    | ``dgrad = output_grad * weight``    | output_grad: rowwise      | weight: columnwise        |
+   +-------------------+-------------------------------------+---------------------------+---------------------------+
+
+An important observation is that the **forward pass uses only rowwise tensors** - both input 
+and weight are accessed rowwise.
+
+The backward pass introduces columnwise access. For weight gradient, both output gradient and input
+are accessed columnwise. For input gradient, output gradient is rowwise while weight is columnwise.
+
+As a result, each tensor (input, weight, output gradient) needs both rowwise and columnwise 
+usages during training. This has implications for memory layout and transpose operations.
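The rules of Table 1 and the three GEMMs of Table 2 can be sketched as a small lookup in plain Python. The names `tensor_usage`, `LINEAR_GEMMS` and `usages_per_tensor` are hypothetical helpers for illustration, not part of Transformer Engine.

```python
def tensor_usage(position, transposed):
    """Return the access pattern from Table 1 for one GEMM operand."""
    if position not in ("first", "second"):
        raise ValueError(f"unknown operand position: {position!r}")
    # First operand: rowwise unless transposed. Second operand: the opposite.
    is_rowwise = (position == "first") != transposed
    return "rowwise" if is_rowwise else "columnwise"


# (tensor name, is_transposed) for the first and second operand of each GEMM in Table 2.
LINEAR_GEMMS = {
    "forward": (("input", False), ("weight", True)),       # input * weight^T
    "wgrad": (("output_grad", True), ("input", False)),    # output_grad^T * input
    "dgrad": (("output_grad", False), ("weight", False)),  # output_grad * weight
}


def usages_per_tensor():
    """Collect every usage each tensor needs across the three GEMMs."""
    needed = {}
    for first, second in LINEAR_GEMMS.values():
        for position, (name, transposed) in (("first", first), ("second", second)):
            needed.setdefault(name, set()).add(tensor_usage(position, transposed))
    return needed


for name, usages in sorted(usages_per_tensor().items()):
    print(name, sorted(usages))
```

Running the loop lists both `columnwise` and `rowwise` for each of `input`, `output_grad` and `weight`, which is the observation made in the paragraph above.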
+
+
+**Architecture differences**
+
+The physical memory layout requirements for rowwise and columnwise usages differ between architectures 
+and recipes. For FP8 tensors:
+
+- *Hopper*: cannot efficiently access elements in columnwise fashion, so columnwise tensors need to be physically transposed in memory. Note that higher precision formats (BF16/FP16) do not have this limitation.
+- *Blackwell*: supports columnwise access natively, so no transpose is needed.
+
+We will see that for most of the recipes and devices, rowwise usage and columnwise usage need different tensors.
+Thus by *rowwise tensor* and *columnwise tensor* we mean tensors that are used in rowwise and columnwise usages respectively.
+
+.. figure:: img/hopper_vs_blackwell_layout.svg
+   :align: center
+   :alt: Comparison of rowwise and columnwise tensor layouts on Blackwell vs Hopper
+
+   Figure 2: On Blackwell, rowwise and columnwise usages share the same memory layout. 
+   On Hopper, columnwise usage requires a physical transpose.
+
+**Quantization fusions**
+
+This section is relevant only for recipes for which columnwise tensors
+are different from rowwise tensors. 
+
+Note that performing rowwise and columnwise quantization at the same time
+enables some fusions, which usually lead to better performance.
+Below are three example scenarios for producing quantized tensors in rowwise and columnwise usages:
+
+1. *Computation of quantized tensor in both rowwise and columnwise usages in a single kernel in forward pass*. 
+
+   This is the fastest option. However, the columnwise usage is saved for the backward pass,
+   so memory usage increases if the high precision tensor must also be saved for backward
+   (for example, an attention output that is saved anyway).
+
+2. *Computation of quantized tensor in rowwise usage in forward pass and fused quantization to produce columnwise usage in backward pass*. 
+
+   This is usually slower than the previous option, since the high precision tensor needs to be read twice.
+   It is used, for example, when the high precision tensor is gathered in both forward and backward
+   and gathering the quantized tensor is not implemented for the recipe.
+
+3. *Computation of quantized tensor in rowwise usage in forward pass and transpose to columnwise usage in backward pass*. 
+
+   This is more memory efficient than scenario 1, but not all recipes can use it: for some recipes,
+   deriving the columnwise usage from already-quantized data would require a second quantization,
+   and the accumulated quantization error would reduce accuracy.
+
+Transformer Engine chooses the best possible fusion internally taking the recipe and the operation into account.
+
+.. raw:: html
+   :file: img/transpose_fusion.svg
+
+*Figure 3: Three scenarios of producing quantized tensors in rowwise and columnwise usages.*
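One way to see why scenario 3 is not available to every recipe is a toy experiment in plain Python. Everything here is an assumption made for illustration: `quantize` is a simple rounding grid rather than a real FP8 format, and per-tensor versus per-row scaling are stand-ins for recipes with coarse and fine-grained scaling factors. It does not reproduce Transformer Engine kernels.

```python
def quantize(values, levels=7):
    """Toy symmetric quantizer with one shared scale; returns dequantized values.

    Values are rounded to a grid of 2 * levels + 1 points spanning [-amax, amax].
    """
    amax = max(abs(v) for v in values)
    if amax == 0:
        return list(values)
    scale = amax / levels
    return [round(v / scale) * scale for v in values]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def quantize_per_tensor(matrix):
    """One scale for the whole matrix."""
    cols = len(matrix[0])
    flat = quantize([v for row in matrix for v in row])
    return [flat[i:i + cols] for i in range(0, len(flat), cols)]


def quantize_per_row(matrix):
    """One scale per row, i.e. scaling blocks that run along the rows."""
    return [quantize(row) for row in matrix]


x = [[1.0, 100.0], [2.0, 3.0]]

# Per-tensor scaling: transposing the quantized data equals quantizing the transpose.
per_tensor_commutes = transpose(quantize_per_tensor(x)) == quantize_per_tensor(transpose(x))

# Per-row scaling: the columnwise usage needs scales along columns, so here it is
# quantized from the high precision data instead of being obtained by a transpose.
rowwise = quantize_per_row(x)
columnwise = transpose(quantize_per_row(transpose(x)))
print(per_tensor_commutes, rowwise[0][0], columnwise[0][0])
```

In the toy per-row case the element `1.0` is rounded to zero in the rowwise version because it shares a scale with `100.0`, while the columnwise version keeps it close to its original value. Producing the columnwise version from the rowwise one would therefore lose information that the high precision tensor still has.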
+
+
+
+Memory usage
+------------
+
+This section discusses memory usage in low precision training. 
+Contrary to intuition, FP8 training does not always reduce memory compared to BF16/FP16.
+
+*Master weights*
+
+Transformer Engine by default stores weights in high precision and quantizes them to low precision before each GEMM.
+One can also specify which high precision format is used to store the weights in the
+model (FP32/BF16/FP16), or choose not to store high precision weights in the model at all.
+There are multiple scenarios to consider; three of them are listed below:
+
+1. model weights are in FP32 and quantized to low precision before each GEMM,
+2. model weights are in BF16/FP16 and quantized to low precision before each GEMM; master weights in the optimizer are in FP32,
+3. model weights are stored directly in low precision; master weights in the optimizer are in FP32.
+
+Note that each of these scenarios may have a different memory footprint.
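A rough back-of-the-envelope sketch of the persistent weight storage in the three scenarios, in plain Python. The helper `weight_bytes_per_param` is hypothetical, and the count is deliberately incomplete: it assumes a 1-byte FP8 format for scenario 3 and ignores quantized copies created for GEMMs, scaling factors, gradients and optimizer state.

```python
# Bytes per element for each storage format (FP8 taken as the low precision format).
BYTES = {"fp32": 4, "bf16": 2, "fp16": 2, "fp8": 1}


def weight_bytes_per_param(model_dtype, optimizer_master_dtype=None):
    """Persistent bytes per parameter: model copy plus optional optimizer master copy."""
    total = BYTES[model_dtype]
    if optimizer_master_dtype is not None:
        total += BYTES[optimizer_master_dtype]
    return total


scenarios = {
    "1. FP32 model weights": weight_bytes_per_param("fp32"),
    "2. BF16 model weights + FP32 master": weight_bytes_per_param("bf16", "fp32"),
    "3. FP8 model weights + FP32 master": weight_bytes_per_param("fp8", "fp32"),
}

for name, nbytes in scenarios.items():
    print(f"{name}: {nbytes} bytes per parameter")
```

Under these assumptions the three scenarios store 4, 6 and 5 bytes per parameter respectively, which is why the footprint depends on the chosen configuration rather than on the compute precision alone.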
+
+*Activations saved for backward*
+
+Unlike weights, activations do not require a high precision copy for optimizer updates. 
+As shown in Table 2, the input needs rowwise usage in forward and columnwise usage 
+for weight gradient computation in backward — so it must be saved between passes.
+
+The memory impact depends on which scenario from Figure 3 is used.
+Additionally, on architectures where rowwise and columnwise usage tensors share the same memory layout 
+(e.g., FP8 on Blackwell, as shown in Figure 2), a single quantized tensor serves both usages, 
+reducing memory overhead compared to architectures requiring separate tensors.
+
+Output gradients, on the other hand, are computed during backward and do not need to be saved — 
+both rowwise and columnwise usages are produced on the fly as needed.
+
+The FP8 examples below are analyzed on Hopper (SM90) or Ada (SM89) architecture, where rowwise 
+and columnwise tensors require separate memory layouts.
+
+.. tabs::
+
+   .. tab:: PyTorch
+
+      **1. Baseline: high precision forward pass**
+
+      Let's start with a forward pass in higher precision to establish a baseline.
+
+      .. raw:: html
+
+         <div style="background: #f0f4f8; border-left: 3px solid #5c7cfa; padding: 6px 12px; font-size: 13px; color: #495057; margin-bottom: 0; border-radius: 4px 4px 0 0;">
+            Needs to be run on SM89 (Ada) or SM90 (Hopper)
+         </div>
+
+      .. literalinclude:: memory_usage_1_pytorch.py
+         :language: python
+         :start-after: # START_MEMORY_USAGE_1
+         :end-before: # END_MEMORY_USAGE_1
+
+      .. raw:: html
+
+         <div style="background: #f5f5f5; border-left: 3px solid #9ca3af; padding: 4px 12px; font-size: 12px; color: #6b7280; margin-top: -16px;">
+            Output:
+         </div>
+      
+      .. container:: program-output
+      
+         .. literalinclude:: memory_usage_1_pytorch.out
+            :language: text
+            :start-after: # START_MEMORY_USAGE_1
+            :end-before: # END_MEMORY_USAGE_1
+      
+      Layer size is ``1024 * 1024 * 2 (2 bytes per parameter) = 2 MB``.
+      Memory after forward pass is ``2 MB (weight) + 2 MB (input saved for backward) + 2 MB (output) = 6 MB``.
+      
+      **2. FP8 training with model weights in BF16**
+
+      Now let's see the memory usage in FP8 training with high precision weights.
+
+      .. raw:: html
+
+         <div style="background: #f0f4f8; border-left: 3px solid #5c7cfa; padding: 6px 12px; font-size: 13px; color: #495057; margin-bottom: 0; border-radius: 4px 4px 0 0;">
+            Needs to be run on SM89 (Ada) or SM90 (Hopper)
+         </div>
+      
+      .. literalinclude:: memory_usage_2_pytorch.py
+         :language: python
+         :start-after: # START_MEMORY_USAGE_2
+         :end-before: # END_MEMORY_USAGE_2
+
+      .. raw:: html
+
+         <div style="background: #f5f5f5; border-left: 3px solid #9ca3af; padding: 4px 12px; font-size: 12px; color: #6b7280; margin-top: -16px;">
+            Output:
+         </div>
+      
+      .. container:: program-output
+      
+         .. literalinclude:: memory_usage_2_pytorch.out
+            :language: text
+            :start-after: # START_MEMORY_USAGE_2
+            :end-before: # END_MEMORY_USAGE_2
+      
+      Total memory usage is ``2 MB (weight) + 1 MB (weight in FP8) + 1 MB (input in FP8 saved for backward) + 2 MB (output) = 6 MB``.
+      
+      **3. FP8 inference with model weights stored directly in low precision**
+
+      For inference scenarios, model weights can be stored directly in low precision. Since we are only 
+      performing forward passes without gradient updates, master weights in high precision are not needed.
+
+      .. raw:: html
+
+         <div style="background: #f0f4f8; border-left: 3px solid #5c7cfa; padding: 6px 12px; font-size: 13px; color: #495057; margin-bottom: 0; border-radius: 4px 4px 0 0;">
+            Needs to be run on SM89 (Ada) or SM90 (Hopper)
+         </div>
+
+      .. literalinclude:: memory_usage_3_pytorch.py
+         :language: python
+         :start-after: # START_MEMORY_USAGE_3
+         :end-before: # END_MEMORY_USAGE_3
+
+      .. raw:: html
+
+         <div style="background: #f5f5f5; border-left: 3px solid #9ca3af; padding: 4px 12px; font-size: 12px; color: #6b7280; margin-top: -16px;">
+            Output:
+         </div>
+      
+      .. container:: program-output
+      
+         .. literalinclude:: memory_usage_3_pytorch.out
+            :language: text
+            :start-after: # START_MEMORY_USAGE_3
+            :end-before: # END_MEMORY_USAGE_3
+
+      Total memory usage is ``1 MB (weight in FP8) + 2 MB (output) = 3 MB``.
+      This is lower than the BF16 baseline (6 MB) since no copies are saved for backward in inference mode.
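The three breakdowns above can be reproduced with simple arithmetic. This is a plain-Python sketch with a hypothetical helper `total_mb`; it only counts the 1024 x 1024 tensors named in the text and does not model the small remainder that makes the measured values (for example 6.02 MB and 3.02 MB) slightly larger.

```python
MB = 1024 ** 2
N = 1024 * 1024  # elements in the 1024 x 1024 weight, input and output


def total_mb(components):
    """Sum a {label: (num_elements, bytes_per_element)} breakdown in MB."""
    return sum(count * width for count, width in components.values()) / MB


baseline = {
    "weight (BF16)": (N, 2),
    "input saved for backward (BF16)": (N, 2),
    "output (BF16)": (N, 2),
}
fp8_training = {
    "weight (BF16)": (N, 2),
    "weight (FP8)": (N, 1),
    "input saved for backward (FP8)": (N, 1),
    "output (BF16)": (N, 2),
}
fp8_inference = {
    "weight (FP8)": (N, 1),
    "output (BF16)": (N, 2),
}

for name, parts in [
    ("baseline", baseline),
    ("FP8 training", fp8_training),
    ("FP8 inference", fp8_inference),
]:
    print(f"{name}: {total_mb(parts):.2f} MB")
```

The FP8 training case matches the baseline total because the 1 MB saved on the input is offset by the extra 1 MB FP8 copy of the weight.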
+      
+      **4. Saving original input instead of quantized**
+
+      By default, TE saves the columnwise quantized input for the backward pass (needed for weight gradient).
+      However, when the high precision input is already being saved (e.g., for a residual connection),
+      keeping an additional quantized copy wastes memory.
+
+      The ``save_original_input=True`` option tells the layer to reference the original high precision input
+      instead of caching a separate quantized copy. The input is re-quantized during backward when needed.
+      Below is an example with a residual block where input is kept for the addition:
+
+      .. raw:: html
+
+         <div style="background: #f0f4f8; border-left: 3px solid #5c7cfa; padding: 6px 12px; font-size: 13px; color: #495057; margin-bottom: 0; border-radius: 4px 4px 0 0;">
+            Needs to be run on SM89 (Ada) or SM90 (Hopper)
+         </div>
+
+      .. literalinclude:: save_original_input_pytorch.py
+         :language: python
+         :start-after: # START_SAVE_ORIGINAL_INPUT
+         :end-before: # END_SAVE_ORIGINAL_INPUT
+
+      .. raw:: html
+
+         <div style="background: #f5f5f5; border-left: 3px solid #9ca3af; padding: 4px 12px; font-size: 12px; color: #6b7280; margin-top: -16px;">
+            Output:
+         </div>
+      
+      .. container:: program-output
+
+         .. literalinclude:: save_original_input_pytorch.out
+            :language: text
+            :start-after: # START_SAVE_ORIGINAL_INPUT
+            :end-before: # END_SAVE_ORIGINAL_INPUT
+
+   .. tab:: JAX
+
+      **1. Baseline: high precision forward pass**
+
+      Let's start with a forward pass in higher precision to establish a baseline.
+
+      .. raw:: html
+
+         <div style="background: #f0f4f8; border-left: 3px solid #5c7cfa; padding: 6px 12px; font-size: 13px; color: #495057; margin-bottom: 0; border-radius: 4px 4px 0 0;">
+            Needs to be run on SM89 (Ada) or SM90 (Hopper)
+         </div>
+
+      .. literalinclude:: memory_usage_1_jax.py
+         :language: python
+         :start-after: # START_MEMORY_USAGE_1
+         :end-before: # END_MEMORY_USAGE_1
+
+      .. raw:: html
+
+         <div style="background: #f5f5f5; border-left: 3px solid #9ca3af; padding: 4px 12px; font-size: 12px; color: #6b7280; margin-top: -16px;">
+            Output:
+         </div>
+      
+      .. container:: program-output
+      
+         .. literalinclude:: memory_usage_1_jax.out
+            :language: text
+            :start-after: # START_MEMORY_USAGE_1
+            :end-before: # END_MEMORY_USAGE_1
+      
+      Layer size is ``1024 * 1024 * 2 (2 bytes per parameter) = 2 MB``.
+      Memory after forward pass is ``2 MB (weight) + 2 MB (input saved for backward) = 4 MB``.
+      
+      **2. FP8 training with master weights in BF16**
+
+      Now let's see the memory usage in FP8 training with high precision weights.
+
+      .. raw:: html
+
+         <div style="background: #f0f4f8; border-left: 3px solid #5c7cfa; padding: 6px 12px; font-size: 13px; color: #495057; margin-bottom: 0; border-radius: 4px 4px 0 0;">
+            Needs to be run on SM89 (Ada) or SM90 (Hopper)
+         </div>
+      
+      .. literalinclude:: memory_usage_2_jax.py
+         :language: python
+         :start-after: # START_MEMORY_USAGE_2
+         :end-before: # END_MEMORY_USAGE_2
+
+      .. raw:: html
+
+         <div style="background: #f5f5f5; border-left: 3px solid #9ca3af; padding: 4px 12px; font-size: 12px; color: #6b7280; margin-top: -16px;">
+            Output:
+         </div>
+      
+      .. container:: program-output
+      
+         .. literalinclude:: memory_usage_2_jax.out
+            :language: text
+            :start-after: # START_MEMORY_USAGE_2
+            :end-before: # END_MEMORY_USAGE_2
+      
+      Memory after forward pass is ``2 MB (weight in BF16) + 1 MB (input in FP8) + 1 MB (weight in FP8) = 4 MB``.
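The per-tensor sizes listed in `memory_usage_2_jax.out` follow directly from shape and element width. A plain-Python sketch, with a hypothetical `ITEMSIZE` table and `size_kb` helper (the real script reads `arr.nbytes` from `jax.live_arrays()` instead):

```python
from math import prod

# Bytes per element for the two dtypes that appear in the output listing.
ITEMSIZE = {"float8_e4m3fn": 1, "bfloat16": 2}

# (shape, dtype) pairs copied from the listing.
live = [
    ((1024, 1024), "float8_e4m3fn"),
    ((1024, 1024), "float8_e4m3fn"),
    ((1024, 1024), "bfloat16"),
]


def size_kb(shape, dtype):
    """Size of a dense array in KB."""
    return prod(shape) * ITEMSIZE[dtype] / 1024


sizes = [size_kb(shape, dtype) for shape, dtype in live]
total_mb = sum(sizes) / 1024
print(sizes, f"{total_mb:.2f} MB")
```

The three large arrays account for 4 MB; the reported total of 4.02 MB also includes the small arrays that the script counts but does not print.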
+
+Fused layers
+------------
+
+
+Transformer Engine provides fused layers such as ``LayerNormLinear`` (``LayerNormDenseGeneral`` in JAX) and ``LayerNormMLP`` 
+that enable kernel fusion optimizations. One key optimization is fusing layer normalization 
+with quantization.
+
+In a typical Transformer architecture, LayerNorm precedes a Linear layer. Without fusion, 
+the LayerNorm outputs in high precision, and the Linear layer must then quantize this input before 
+performing the GEMM — adding overhead. With ``LayerNormLinear``, these operations are fused 
+into a single kernel: the LayerNorm output is quantized directly, eliminating the separate 
+quantization step and reducing memory movement.
+
+
+.. raw:: html
+   :file: img/fused_layers.svg
+
+*Figure 4: Comparison of separate LayerNorm and Linear layers versus fused LayerNormLinear layer, showing reduced quantization overhead.*
+
+
+Let's see how we can use fused layers in different frameworks.
+
+.. tabs::
+
+   .. tab:: PyTorch
+
+      In PyTorch, Transformer Engine provides fused layers like ``LayerNormLinear`` and ``LayerNormMLP``.
+      These layers combine normalization and linear operations with optimized quantization.
+
+      .. raw:: html
+
+         <div style="background: #f0f4f8; border-left: 3px solid #5c7cfa; padding: 6px 12px; font-size: 13px; color: #495057; margin-bottom: 0; border-radius: 4px 4px 0 0;">
+            Needs to be run on SM89+ (Ada, Hopper, Blackwell, or newer)
+         </div>
+
+      .. literalinclude:: fused_layers_pytorch.py
+         :language: python
+         :start-after: # START_FUSED_LAYERS
+         :end-before: # END_FUSED_LAYERS
+      
+      The fused ``LayerNormLinear`` layer is particularly efficient in FP8 training because 
+      it avoids an intermediate quantization step. The LayerNorm output is directly quantized 
+      for the GEMM operation, reducing memory movement and improving performance.
+
+   .. tab:: JAX
+
+      In JAX, Transformer Engine provides fused layers like ``LayerNormDenseGeneral`` and ``LayerNormMLP``.
+      These layers combine normalization and dense operations with optimized quantization.
+
+      .. raw:: html
+
+         <div style="background: #f0f4f8; border-left: 3px solid #5c7cfa; padding: 6px 12px; font-size: 13px; color: #495057; margin-bottom: 0; border-radius: 4px 4px 0 0;">
+            Needs to be run on SM89+ (Ada, Hopper, Blackwell, or newer)
+         </div>
+
+      .. literalinclude:: fused_layers_jax.py
+         :language: python
+         :start-after: # START_FUSED_LAYERS
+         :end-before: # END_FUSED_LAYERS
+      
+      The fused ``LayerNormDenseGeneral`` layer is particularly efficient in FP8 training because 
+      it avoids an intermediate quantization step. The LayerNorm output is directly quantized 
+      for the GEMM operation, reducing memory movement and improving performance.
+
+
+Distributed training
+--------------------
+
+Transformer Engine handles collective operations internally, so users typically don't need to manage 
+the interaction between communication and low precision computation.
+
+Recall that each Linear layer involves six tensors: the weight, input, and output, plus the gradient of each. 
+Of these, the output, input gradient, and weight gradient are returned in high precision, and weights 
+are generally not communicated (except in FSDP, which is outside the scope of this section). This leaves 
+two tensors where low precision communication matters: **input** and **output gradient**.
+
+For sequence parallelism, TE supports all-gather of quantized tensors. This provides several benefits:
+
+1. *Reduced memory usage* — no need to store high precision tensors for backward pass.
+2. *Reduced communication* — smaller tensors mean less data to transfer.
+3. *Parallelized quantization* — quantization work is distributed across GPUs.
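+
+The reduced-communication benefit listed above can be estimated with simple arithmetic. The sketch
+below is an illustration under assumptions, not a description of TE internals: the tensor size and
+number of ranks are hypothetical, the tensor is sharded evenly, and scaling factors are ignored.
+

```python
# Bytes each rank receives in an all-gather of an evenly sharded tensor.
# Assumptions: 8 ranks and a 1024 x 1024 tensor (both hypothetical); scaling factors ignored.


def all_gather_received_bytes(total_elements, bytes_per_element, world_size):
    """Each rank already holds one shard and receives the other world_size - 1 shards."""
    shard_elements = total_elements // world_size
    return shard_elements * bytes_per_element * (world_size - 1)


total_elements = 1024 * 1024
world_size = 8

bf16_bytes = all_gather_received_bytes(total_elements, 2, world_size)  # gather, then quantize
fp8_bytes = all_gather_received_bytes(total_elements, 1, world_size)  # quantize, then gather

print(f"BF16 all-gather: {bf16_bytes} bytes received per rank")
print(f"FP8 all-gather:  {fp8_bytes} bytes received per rank")
```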
+
+Support varies by recipe — for example, columnwise quantized all-gather is not available 
+for all configurations.
+
+The figure below illustrates one possible all-gather scenario for input and output gradient tensors. 
+Actual behavior depends on the recipe and module configuration.
+
+.. raw:: html
+   :file: img/sequence_parallel_quantization.svg
+
+*Figure 5: All-gather of quantized input and output gradient tensors. 
+This is one possible scenario; actual behavior varies depending on the recipe and module configuration.*
+
+
--- a/docs/features/low_precision_training/performance_considerations/save_original_input_pytorch.out
+++ b/docs/features/low_precision_training/performance_considerations/save_original_input_pytorch.out
+# START_SAVE_ORIGINAL_INPUT
+save_original_input=False: 25.0 MB
+save_original_input=True: 24.0 MB
+# END_SAVE_ORIGINAL_INPUT
--- a/docs/features/low_precision_training/performance_considerations/save_original_input_pytorch.py
+++ b/docs/features/low_precision_training/performance_considerations/save_original_input_pytorch.py
+# Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# See LICENSE for license information.
+
+import torch
+
+# Requires Ada (SM89) or Hopper (SM90); results differ on Blackwell and newer.
+cc = torch.cuda.get_device_capability()
+assert (cc[0] == 8 and cc[1] >= 9) or cc[0] == 9, "This example requires SM89 (Ada) or SM90 (Hopper)"
+
+print("# START_SAVE_ORIGINAL_INPUT")
+# START_SAVE_ORIGINAL_INPUT
+import torch
+import transformer_engine.pytorch as te
+from transformer_engine.common.recipe import Float8CurrentScaling
+
+recipe = Float8CurrentScaling()
+
+
+def residual_block(layer, inp):
+    """Residual connection: input is saved for addition after linear."""
+    out = layer(inp)
+    return out + inp  # inp must be kept for this addition
+
+
+def measure_memory(use_save_original):
+    """Return peak GPU memory in MB for one forward and backward pass."""
+    torch.cuda.empty_cache()
+    torch.cuda.reset_peak_memory_stats()
+
+    layer = te.Linear(
+        1024, 1024, params_dtype=torch.bfloat16, save_original_input=use_save_original
+    )
+    inp = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda", requires_grad=True)
+
+    with te.autocast(enabled=True, recipe=recipe):
+        out = residual_block(layer, inp)
+    out.sum().backward()
+
+    return torch.cuda.max_memory_allocated() / 1024**2
+
+
+# Warmup runs
+measure_memory(False)
+measure_memory(True)
+
+# Actual measurements
+for use_save_original in [False, True]:
+    peak = measure_memory(use_save_original)
+    print(f"save_original_input={use_save_original}: {peak:.1f} MB")
+# END_SAVE_ORIGINAL_INPUT
+print("# END_SAVE_ORIGINAL_INPUT")
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -39,6 +39,14 @@ Transformer Engine documentation
   api/common
   api/framework

+
+.. toctree::
+   :hidden:
+   :caption: Features
+
+   features/low_precision_training/index.rst
+
+
 .. toctree::
   :hidden:
   :caption: Examples and Tutorials