More detailed documentation for recipes (#2343)

* Code drop: Update recipes documentation and remove custom recipes from low precision training
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* Fix SVG css import path for diagrams
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* Refactor low_precision_training docs: remove optimizers, fix imports, add GPU checks
  Changes:
  - Remove optimizer code from all recipe examples (keep only forward/backward)
  - Fix Format imports (use Format.E4M3 instead of string 'E4M3')
  - Fix params_dtype for PyTorch examples (add params_dtype=torch.bfloat16)
  - Add GPU capability assertions before START blocks for blockwise/mxfp8/nvfp4
  - Fix JAX imports (Float8CurrentScaling from common.recipe, NVFP4BlockScaling)
  - Add global_shard_guard for TransformerLayer examples in JAX
  - Fix fused_layers_jax.py return tuple unpacking
  - Update memory_usage JAX examples with dynamic GPU measurement
  - Remove memory_usage_3_jax (JAX doesn't support FP8 weight storage)
  - Update performance_considerations.rst for JAX differences
  - Delete unused .out files and fp8_autocast_jax.py
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Fix JAX memory usage .out files with correct output
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* responded to comments
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* applied suggestions from greptile
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* year change
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* jax compute capability fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fixes
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Unverified Commit 3ceb248e authored Feb 02, 2026 by Paweł Gadziński Committed by GitHub Feb 02, 2026
20 changed files
--- a/docs/_static/css/diagram-colors.css
+++ b/docs/_static/css/diagram-colors.css
+/* Diagram color definitions for Transformer Engine documentation */
+
+/* High precision (BF16/FP16) elements */
+.hp {
+  fill: #ede7f6;
+  stroke: #673ab7;
+  stroke-width: 2;
+}
+
+/* FP8 precision elements */
+.fp8 {
+  fill: #fff8e1;
+  stroke: #ffa726;
+  stroke-width: 2;
+}
+
+/* GEMM/computation operations */
+.gemm {
+  fill: #ffe0b2;
+  stroke: #fb8c00;
+  stroke-width: 2.5;
+}
+
+/* Quantization operations */
+.quantize {
+  fill: #e8f5e9;
+  stroke: #66bb6a;
+  stroke-width: 2;
+}
+
+/* Amax computation operations */
+.amax {
+  fill: #e1f5fe;
+  stroke: #039be5;
+  stroke-width: 2;
+}
+
+/* Text styles */
+.text {
+  font-family: 'Segoe UI', Arial, sans-serif;
+  font-size: 14px;
+  text-anchor: middle;
+  fill: #212121;
+}
+
+.small-text {
+  font-family: 'Segoe UI', Arial, sans-serif;
+  font-size: 14px;
+  text-anchor: middle;
+  fill: #757575;
+}
+
+.label {
+  font-family: 'Segoe UI', Arial, sans-serif;
+  font-size: 14px;
+  text-anchor: middle;
+  fill: #424242;
+}
+
+.title {
+  font-family: 'Segoe UI', Arial, sans-serif;
+  font-size: 18px;
+  font-weight: 600;
+  text-anchor: middle;
+  fill: #212121;
+}
+
+.section-title {
+  font-family: 'Segoe UI', Arial, sans-serif;
+  font-size: 15px;
+  font-weight: 600;
+  text-anchor: middle;
+}
+
+/* Arrows */
+/* Note: marker-end references #arrowhead marker which must be defined in each SVG's <defs> section */
+.arrow {
+  stroke: #616161;
+  stroke-width: 2;
+  fill: none;
+  marker-end: url(#arrowhead);
+}
+
+/* Additional box and element styles */
+.box-blue {
+  fill: #e3f2fd;
+  stroke: #1976d2;
+  stroke-width: 2;
+}
+
+.box-orange {
+  fill: #fff3e0;
+  stroke: #f57c00;
+  stroke-width: 2;
+}
+
+.box-green {
+  fill: #c8e6c9;
+  stroke: #388e3c;
+  stroke-width: 2;
+}
+
+.box-dashed {
+  stroke-dasharray: 5,5;
+}
+
+/* LayerNorm specific */
+.layernorm {
+  fill: #b3e5fc;
+  stroke: #0277bd;
+  stroke-width: 2.5;
+}
+
+/* Fused layers */
+.fused {
+  fill: #b2dfdb;
+  stroke: #00695c;
+  stroke-width: 3;
+}
+
+/* Generic computation blocks */
+.computation {
+  fill: #f5f5f5;
+  stroke: #757575;
+  stroke-width: 2;
+}
+
+/* FP32 precision (alternative red) */
+.fp32 {
+  fill: #ffcdd2;
+  stroke: #d32f2f;
+  stroke-width: 2.5;
+}
+
--- a/docs/_static/css/sphinx_tabs.css
+++ b/docs/_static/css/sphinx_tabs.css
+/* Custom styling for sphinx-tabs */
+
+.sphinx-tabs {
+    margin-bottom: 1rem;
+}
+
+.sphinx-tabs-tab {
+    background-color: #f4f4f4;
+    border: 1px solid #ccc;
+    border-bottom: none;
+    padding: 0.5rem 1rem;
+    margin-right: 0.5rem;
+    cursor: pointer;
+    font-weight: 500;
+    transition: background-color 0.2s;
+}
+
+.sphinx-tabs-tab:hover {
+    background-color: #e0e0e0;
+}
+
+.sphinx-tabs-tab[aria-selected="true"] {
+    background-color: #76b900; /* NVIDIA green */
+    color: white;
+    border-color: #76b900;
+    margin-right: 0.5rem;
+}
+
+.sphinx-tabs-panel {
+    border: 1px solid #ccc;
+    padding: 1rem;
+    background-color: #f9f9f9;
+}
+
+/* Dark mode support for RTD theme */
+.rst-content .sphinx-tabs-tab {
+    color: #333;
+}
+
+.rst-content .sphinx-tabs-tab[aria-selected="true"] {
+    color: white;
+}
+
+
+
--- a/docs/_static/css/svg-responsive.css
+++ b/docs/_static/css/svg-responsive.css
+/* Responsive styling for SVG images */
+
+/* Make all SVG images responsive */
+.document svg,
+.document object[type="image/svg+xml"],
+.rst-content svg {
+    max-width: 100%;
+    height: auto;
+    display: block;
+    margin: 1em auto;
+}
+
+/* For raw HTML embedded SVGs */
+.document .raw-html svg {
+    max-width: 100%;
+    height: auto;
+    width: 100%;
+}
+
+/* Ensure container doesn't overflow */
+.document .raw-html {
+    max-width: 100%;
+    overflow-x: auto;
+}
+
+/* Figure containers with captions */
+.svg-figure {
+    text-align: center;
+    margin: 20px auto;
+}
+
+.svg-figure img {
+    display: block;
+    margin: 0 auto;
+    height: auto;
+}
+
+/* Different width classes for figures */
+.svg-figure.width-70 img {
+    width: 70%;
+    max-width: 100%;
+}
+
+.svg-figure.width-80 img {
+    width: 80%;
+    max-width: 100%;
+}
+
+.svg-figure.width-90 img {
+    width: 90%;
+    max-width: 100%;
+}
+
+.svg-figure.width-100 img {
+    width: 100%;
+}
+
+/* Figure captions */
+.svg-caption {
+    font-style: italic;
+    margin-top: 10px;
+    color: #555;
+    font-size: 0.95em;
+    line-height: 1.4;
+}
+
+
+
+
+
+
+
--- a/docs/_templates/layout.html
+++ b/docs/_templates/layout.html
@@ -67,6 +67,10 @@
        overflow: visible !important;
    }

+    .quant {
+        background-color: yellow !important;
+    }
+
  </style>
  <style>
  a:link, a:visited {

--- a/docs/conf.py
+++ b/docs/conf.py
@@ -84,8 +84,11 @@ html_show_sphinx = False
 html_css_files = [
    "css/nvidia_font.css",
    "css/nvidia_footer.css",
-    "css/rtabs.css",
    "css/output-style.css",
+    "css/diagram-colors.css",
+    "css/sphinx_tabs.css",
+    "css/svg-responsive.css",
+    "css/rtabs.css",
 ]

 html_theme_options = {

--- a/docs/features/low_precision_training/fp8_blockwise_scaling/fp8_blockwise_scaling.rst
+++ b/docs/features/low_precision_training/fp8_blockwise_scaling/fp8_blockwise_scaling.rst
+..
+    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+    See LICENSE for license information.
+
+FP8 Blockwise Scaling
+===================================
+
+.. warning::
+
+   ``Float8BlockScaling`` is **currently not supported** in JAX.
+
+The FP8 Blockwise Scaling recipe is inspired by the quantization scheme used to train the `DeepSeek-v3 model <https://arxiv.org/abs/2412.19437>`__,
+a large-scale open-source LLM trained with FP8 mixed precision.
+Unlike the previous recipes, it assigns a dedicated scaling factor to each block of elements.
+
+
+Data Format
+--------------------------
+
+The representation of an FP8 tensor element ``x`` in blockwise precision is given by:
+
+.. code-block:: python
+
+    x = x_fp8 * s_block
+
+where
+
+* ``x_fp8`` is the FP8 value (E4M3 or E5M2),
+* ``s_block`` is a local **FP32** scaling factor shared by a block of elements.
+
+
+.. raw:: html
+   :file: img/combined_scaling.svg
+
+*Figure 1. Top: Comparison of standard FP8 scaling (left) using a single scaling factor per tensor versus 
+FP8 blockwise scaling in 1 dimension (right) using multiple scaling factors, one per block of 128 elements.
+Bottom: FP8 blockwise scaling in 2 dimensions where each 128×128 block in the data tensor has a corresponding
+scaling factor.*
+
+**FP8 format**
+
+Unlike FP8 Current/Delayed Scaling, this recipe uses E4M3 by default for both the forward and backward passes.
+Tensor-scaled recipes use E5M2 for gradients because of its higher dynamic range,
+but with one scaling factor per block each factor has to cover a smaller range of values, so E4M3 is usually sufficient.
+The ``fp8_format`` parameter also supports ``HYBRID`` mode (E4M3 for forward, E5M2 for backward).
+Pure E5M2 training is not supported.
+
+
+**Block size**
+
+Block size is 128.
+Blocks can be:
+
+* one dimensional – containing 128 consecutive values,
+* two dimensional – containing tiles of 128×128 values.
+
+By default:
+
+* activations use 1D scaling (``x_block_scaling_dim=1``),
+* weights use 2D scaling (``w_block_scaling_dim=2``),
+* gradients use 1D scaling (``grad_block_scaling_dim=1``).
+
+These can be changed in the recipe, but 2D × 2D GEMMs are not supported 
+– at most one operand can use 2D scaling.
+
+One-dimensional scaling is more granular, but 2D scaling offers two advantages:
+
+* *Performance*: On Hopper, block-scaled GEMMs are software-emulated. GEMMs with mixed
+  1D/2D scaled tensors have lower overhead than pure 1D scaled GEMMs.
+* *Numerical stability*: 2D scaling behaves better when transposed (details in the next section).
+
+Both 1D and 2D scaling impose the following requirements on the tensor shape:
+
+* the tensor must have at least 2 dimensions,
+* the last dimension must be divisible by 128,
+* the product of all dimensions except the last must be divisible by 128.
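The three shape requirements above can be checked before a tensor is handed to the recipe. The following is a minimal sketch in plain Python; the helper name ``supports_blockwise_scaling`` is hypothetical and is not part of the Transformer Engine API.

```python
import math

BLOCK = 128  # block size stated in the text above


def supports_blockwise_scaling(shape):
    """Return True if `shape` meets the three requirements listed above.

    Hypothetical helper for illustration only; not a Transformer Engine function.
    """
    if len(shape) < 2:
        return False  # the tensor must have at least 2 dimensions
    last = shape[-1]
    leading = math.prod(shape[:-1])  # product of all dimensions except the last
    return last % BLOCK == 0 and leading % BLOCK == 0


print(supports_blockwise_scaling((4, 32, 1024)))  # leading product is 4 * 32 = 128
print(supports_blockwise_scaling((100, 128)))     # leading dimension is not a multiple of 128
```

A shape such as ``(4, 32, 1024)`` passes because only the product of the leading dimensions has to be a multiple of 128, not each leading dimension on its own.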
+
+**Scaling factors**
+
+Scaling factors are stored as 32-bit floating point numbers.
+By default, they are constrained to powers of 2 (utilizing the 8 exponent bits of FP32).
+On Hopper, this constraint can be relaxed by setting the environment variable ``NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1``.
+On Blackwell, only powers of 2 are supported.
+
+Each block's scaling factor is computed through the following steps:
+
+1. Find the maximum absolute value (``amax_block``) across all elements in the block
+   (128 consecutive values for 1D blocks, or 128×128 values for 2D blocks).
+2. Calculate the quantization multiplier ``max_fp8 / amax_block``, where ``max_fp8`` is
+   the maximum representable value in the FP8 format (448 for E4M3, 57344 for E5M2).
+3. If the power-of-2 constraint is enabled, round the multiplier down to the nearest power of 2
+   by zeroing out the mantissa bits, retaining only the sign and exponent.
+4. Multiply each element in the block by the multiplier before converting to FP8.
+   The factor ``s_block`` from the Data Format section, which recovers ``x`` from ``x_fp8``,
+   is the reciprocal of this multiplier.
+
+This approach ensures that the largest value in each block fits within the FP8 representable range without overflow.
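The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the Transformer Engine kernel: the function names are hypothetical, Python floats are double precision rather than FP32, and returning 1.0 for an all-zero block is an assumption made only to avoid a division by zero. The value returned is the multiplier applied before the FP8 conversion, so dequantization divides by it.

```python
import math

# Maximum representable magnitudes quoted in the text above.
FP8_MAX = {"E4M3": 448.0, "E5M2": 57344.0}


def floor_pow2(value):
    """Round a positive float down to the nearest power of 2."""
    # frexp returns (m, e) with value == m * 2**e and 0.5 <= m < 1,
    # so 2**(e - 1) is the largest power of 2 not exceeding value.
    _, exponent = math.frexp(value)
    return math.ldexp(0.5, exponent)


def block_quant_multiplier(block, fp8_format="E4M3", power_of_2=True):
    """Multiplier applied to every element of `block` before FP8 conversion."""
    amax = max(abs(v) for v in block)          # step 1
    if amax == 0.0:
        return 1.0                             # assumption: all-zero block
    scale = FP8_MAX[fp8_format] / amax         # step 2
    return floor_pow2(scale) if power_of_2 else scale  # step 3


block = [0.5, -3.5, 1.0]
scale = block_quant_multiplier(block)
scaled = [v * scale for v in block]            # step 4, before the FP8 cast
print(scale, max(abs(v) for v in scaled))
```

Because the multiplier is only ever rounded down, the largest scaled magnitude never exceeds ``max_fp8``; the cost is that up to one bit of the FP8 range can go unused in a block.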
+
+
+Handling transposes
+------------------------
+
+On Hopper, columnwise tensor access requires data to be transposed in memory.
+For 1D scaling, the block direction must align with the access pattern:
+
+* *Rowwise access*: 1 scaling factor per 128 consecutive elements in a row.
+* *Columnwise access*: 1 scaling factor per 128 consecutive elements in a row of the transposed tensor,
+  corresponding to 128 consecutive elements in a column of the original tensor.
+
+For 2D scaling, each 128×128 tile has one scaling factor regardless of access direction.
+
+This is illustrated below:
+
+.. raw:: html
+   :file: img/transpose_handling.svg
+
+*Figure 2. Quantization directions for original and transposed tensors.*
+
+Note that for 1D scaling, the rowwise and columnwise quantized tensors may be numerically different,
+so the gradient computation may be affected. This issue is not present for 2D scaling.
+
+
+Activations and weights use the rowwise version in the forward pass and the columnwise version in the backward pass.
+Experiments have shown that 2D scaling for weights is more helpful for numerical stability than for activations,
+so by default 1D scaling is used for activations – as it is more granular – and 2D scaling is used for weights.
+
+
+Unlike FP8 Current/Delayed Scaling, transposing a 1D quantized tensor is not supported.
+Rowwise and columnwise blocks cover different sets of elements, so their scaling factors differ.
+Both versions must be quantized separately from the high-precision source.
+
+For 2D scaling, columnwise data can be created from rowwise data by transposing 
+both the quantized data and the scaling factors. Each 128×128 block covers the same 
+elements regardless of access direction, so the scaling factors remain valid.
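The difference between 1D and 2D scaling under transposition can be seen on a toy matrix. The sketch below uses a block size of 2 instead of 128 purely for readability and computes only the per-block ``amax`` values that the scaling factors are derived from; all function names are hypothetical.

```python
def amax_1d(matrix, block):
    """Amax of each run of `block` consecutive elements within every row."""
    return [
        [max(abs(v) for v in row[c:c + block]) for c in range(0, len(row), block)]
        for row in matrix
    ]


def amax_2d(matrix, block):
    """Amax of each `block` x `block` tile."""
    rows, cols = len(matrix), len(matrix[0])
    return [
        [
            max(abs(matrix[r][c])
                for r in range(r0, r0 + block)
                for c in range(c0, c0 + block))
            for c0 in range(0, cols, block)
        ]
        for r0 in range(0, rows, block)
    ]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


x = [
    [1.0, 8.0, 2.0, 3.0],
    [4.0, 0.5, 6.0, 1.0],
    [7.0, 2.0, 0.25, 5.0],
    [3.0, 9.0, 1.0, 2.0],
]

rowwise = amax_1d(x, 2)                 # blocks run along rows of x
columnwise = amax_1d(transpose(x), 2)   # blocks run along columns of x

# The element x[0][0] shares a block with 8.0 rowwise but with 4.0 columnwise,
# so it is quantized with a different scale in each version.
print(rowwise[0][0], columnwise[0][0])

# For 2D tiles, transposing the tensor simply transposes the grid of amax values.
print(amax_2d(transpose(x), 2) == transpose(amax_2d(x, 2)))
```

The same element ends up in different 1D blocks depending on direction, which is why both versions are quantized from the high-precision source, whereas the 2D tile grid only needs to be transposed.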
+
+
+Distributed training
+-----------------------
+
+**Scale synchronization**
+
+The blockwise scaled tensor does not need any scale synchronization among the nodes.
+This is because each scaling factor is local to its 128 or 128×128 element block,
+unlike FP8 Current/Delayed Scaling where a single global scale applies to the entire tensor, even when sharded.
+
+**Quantized all-gather**
+
+FP8 Blockwise Scaling all-gather is supported.
+
+
+Examples
+--------
+
+Here's how to use the FP8 Blockwise Scaling recipe in PyTorch (the recipe is not yet available in JAX):
+
+.. note::
+
+   Requires SM90 (Hopper) or later.
+
+.. tabs::
+
+   .. tab:: PyTorch
+
+      .. literalinclude:: pytorch_blockwise_scaling_example.py
+         :language: python
+         :start-after: # START_BLOCKWISE_SCALING_EXAMPLE
+         :end-before: # END_BLOCKWISE_SCALING_EXAMPLE
+
+   .. tab:: JAX
+
+      ``Float8BlockScaling`` is **not currently supported** in JAX.
+
+Supported devices
+-----------------
+
+* Hopper (SM 9.0).
+* Blackwell and later (SM >= 10.0): the recipe is emulated with MXFP8, and only scaling factors
+  that are powers of 2 are supported. Note that MXFP8 is the preferred recipe on Blackwell.
+
+
+----
+
+Developer Notes
+---------------
+
+This section contains implementation details that may be useful for developers
+but are not required for using FP8 Blockwise Scaling in practice.
+
+Swizzle of scaling factors
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+FP8 Blockwise Scaling supports all-gather of both rowwise and columnwise tensors.
+To support that, it implements different data layouts for communication (all-gather)
+and computation (GEMM). We refer to the conversion between these formats as *swizzling*.
+
+A tensor of shape ``[A, B]`` can exist in two formats:
+
+**Compact format** (used for all-gather):
+
+The all-gather primitive only supports gathering non-transposed shards into a non-transposed full tensor,
+so all tensor components in this layout are stored without transposition.
+Moreover, all component tensors are stored without padding.
+
+.. list-table::
+   :widths: 30 70
+   :header-rows: 1
+
+   * - Component
+     - Shape
+   * - rowwise data
+     - ``[A, B]``
+   * - columnwise data
+     - ``[A, B]``
+   * - rowwise scales
+     - ``[A, B/128]``
+   * - columnwise scales
+     - ``[A/128, B]``
+
+**GEMM-ready format** (used for computation):
+
+Tensors are transposed and padded as required by the GEMM kernel.
+
+.. list-table::
+   :widths: 30 70
+   :header-rows: 1
+
+   * - Component
+     - Shape
+   * - rowwise data
+     - ``[A, B]``
+   * - columnwise data
+     - ``[B, A]`` (transposed)
+   * - rowwise scales
+     - ``[B/128, pad4(A)]`` (transposed, padded)
+   * - columnwise scales
+     - ``[A/128, pad4(B)]`` (padded)
+
+Swizzling converts from compact to GEMM-ready format. This can be fused with quantization 
+when no all-gather is needed, or performed separately after all-gather.
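The two layout tables above can be restated as a small shape calculator. This is a sketch only: the function names are hypothetical, and ``pad4`` is assumed to mean rounding up to the next multiple of 4, which the text does not define.

```python
BLOCK = 128  # block size stated in the text above


def pad4(n):
    """Assumption: round n up to the next multiple of 4."""
    return (n + 3) // 4 * 4


def compact_shapes(a, b):
    """Component shapes of an [a, b] tensor in the compact (all-gather) format."""
    return {
        "rowwise data": (a, b),
        "columnwise data": (a, b),
        "rowwise scales": (a, b // BLOCK),
        "columnwise scales": (a // BLOCK, b),
    }


def gemm_ready_shapes(a, b):
    """Component shapes of an [a, b] tensor in the GEMM-ready format."""
    return {
        "rowwise data": (a, b),
        "columnwise data": (b, a),                   # transposed
        "rowwise scales": (b // BLOCK, pad4(a)),     # transposed, padded
        "columnwise scales": (a // BLOCK, pad4(b)),  # padded
    }


for name, shape in gemm_ready_shapes(256, 384).items():
    print(name, compact_shapes(256, 384)[name], "->", shape)
```

Swizzling leaves the number of scaling factors unchanged apart from padding; it changes only how they are laid out relative to the data.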
+
+.. raw:: html
+   :file: img/blockwise_swizzle_flow.svg
+
+*Figure 3. FP8 Blockwise Scaling swizzle paths. Top: With all-gather communication – quantization produces 
+compact format, then swizzle is performed separately after communication. Bottom: Without all-gather – 
+quantize and swizzle are fused into a single operation, directly producing GEMM-ready format.*
+
+All-gather of columnwise tensors
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+All-gather of columnwise tensors is supported and necessary because:
+
+- columnwise quantized tensors cannot be computed from rowwise quantized ones,
+- gathering high-precision tensors is avoided in most cases for performance reasons.
--- a/docs/features/low_precision_training/fp8_blockwise_scaling/img/blockwise_swizzle_flow.svg
+++ b/docs/features/low_precision_training/fp8_blockwise_scaling/img/blockwise_swizzle_flow.svg
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 1150 400" width="100%" style="max-width: 900px;">
+  <defs>
+    <style>
+      @import url("../_static/css/diagram-colors.css");
+      
+      /* Diagram-specific styles */
+      .input-box { fill: #f3e5f5; stroke: #7b1fa2; stroke-width: 2.5; }
+      .blockwise-box { fill: #e3f2fd; stroke: #1976d2; stroke-width: 2.5; }
+      .fp8-tile { fill: #bbdefb; stroke: #1565c0; stroke-width: 1.5; }
+      .scale-tile { fill: #a5d6a7; stroke: #388e3c; stroke-width: 1.5; }
+      .scale-swizzled { fill: #ffb74d; stroke: #e65100; stroke-width: 1.5; }
+      .swizzle-box { fill: #fff3e0; stroke: #f57c00; stroke-width: 2; }
+      .quantize-box { fill: #ede7f6; stroke: #5e35b1; stroke-width: 2; }
+      .quantize-fused-box { fill: #d1c4e9; stroke: #5e35b1; stroke-width: 2.5; }
+      .comm-box { fill: #fff9c4; stroke: #f57f17; stroke-width: 2; }
+      .gemm-box { fill: #c8e6c9; stroke: #388e3c; stroke-width: 2; }
+      
+      /* Arrow override */
+      .arrow { marker-end: url(#arrowhead); stroke: #616161; stroke-width: 1.5; fill: none; }
+    </style>
+    
+    <!-- Arrow marker -->
+    <marker id="arrowhead" markerWidth="6" markerHeight="6" refX="5" refY="2" orient="auto">
+      <polygon points="0 0, 6 2, 0 4" fill="#616161" />
+    </marker>
+  </defs>
+  
+  <!-- Section 1: With Communication (Separate Swizzle) -->
+  <g id="with-communication">
+    <!-- Step 0: Input Tensor -->
+    <g id="input-fp32-tensor-1">
+      <text x="80" y="25" class="text" text-anchor="middle" font-weight="600">Input Tensor</text>
+      <rect x="20" y="40" width="120" height="110" rx="6" class="input-box"/>
+      <text x="80" y="100" class="text" text-anchor="middle" fill="#fff">FP32/BF16</text>
+    </g>
+    
+    <!-- Arrow 0 -->
+    <path d="M 140 95 L 175 95" class="arrow"/>
+    
+    <!-- Step 1: Quantize -->
+    <rect x="175" y="60" width="80" height="70" rx="6" class="quantize-box"/>
+    <text x="215" y="100" class="text">Quantize</text>
+    
+    <!-- Arrow 1 -->
+    <path d="M 255 95 L 290 95" class="arrow"/>
+    
+    <!-- Step 2: Blockwise Tensor (Compact) -->
+    <g id="blockwise-tensor-compact">
+      <text x="375" y="25" class="text" text-anchor="middle" font-weight="600">FP8 (Compact)</text>
+      <rect x="290" y="40" width="170" height="110" rx="6" class="blockwise-box"/>
+      
+      <!-- FP32 Scales sub-tile (green) -->
+      <rect x="305" y="52" width="140" height="32" rx="3" class="scale-tile"/>
+      <text x="375" y="73" class="text" text-anchor="middle" fill="#fff">FP32 Scales</text>
+      
+      <!-- FP8 Data sub-tile -->
+      <rect x="305" y="92" width="140" height="45" rx="3" class="fp8-tile"/>
+      <text x="375" y="120" class="text" fill="#fff">FP8 Data</text>
+    </g>
+    
+    <!-- Arrow 2 -->
+    <path d="M 460 95 L 495 95" class="arrow"/>
+    
+    <!-- Step 3: Communication -->
+    <rect x="495" y="60" width="100" height="70" rx="6" class="comm-box"/>
+    <text x="545" y="100" class="text">All-Gather</text>
+    
+    <!-- Arrow 3 -->
+    <path d="M 595 95 L 630 95" class="arrow"/>
+    
+    <!-- Step 4: Swizzle -->
+    <rect x="630" y="60" width="90" height="70" rx="6" class="swizzle-box"/>
+    <text x="675" y="100" class="text">Swizzle</text>
+    
+    <!-- Arrow 4 -->
+    <path d="M 720 95 L 755 95" class="arrow"/>
+    
+    <!-- Step 5: Blockwise Tensor (GEMM Ready) -->
+    <g id="swizzled-tensor-1">
+      <text x="840" y="25" class="text" text-anchor="middle" font-weight="600">FP8 (GEMM Ready)</text>
+      <rect x="755" y="40" width="170" height="110" rx="6" class="blockwise-box"/>
+      
+      <!-- Swizzled Scales sub-tile (orange) -->
+      <rect x="770" y="52" width="140" height="32" rx="3" class="scale-swizzled"/>
+      <text x="840" y="73" class="text" text-anchor="middle" fill="#fff">Swizzled Scales</text>
+      
+      <!-- FP8 Data sub-tile -->
+      <rect x="770" y="92" width="140" height="45" rx="3" class="fp8-tile"/>
+      <text x="840" y="120" class="text" fill="#fff">FP8 Data</text>
+    </g>
+    
+    <!-- Arrow 5 -->
+    <path d="M 925 95 L 960 95" class="arrow"/>
+    
+    <!-- Step 6: GEMM -->
+    <rect x="960" y="60" width="80" height="70" rx="6" class="gemm-box"/>
+    <text x="1000" y="100" class="text">GEMM</text>
+  </g>
+  
+  <!-- Separator Line -->
+  <line x1="20" y1="185" x2="1050" y2="185" stroke="#bdbdbd" stroke-width="1" stroke-dasharray="8,4"/>
+  
+  <!-- Section 2: Without Communication (Fused Quantize + Swizzle) -->
+  <g id="without-communication" transform="translate(0, 170)">
+    <!-- Step 0: Input Tensor -->
+    <g id="input-fp32-tensor-2">
+      <text x="80" y="45" class="text" text-anchor="middle" font-weight="600">Input Tensor</text>
+      <rect x="20" y="60" width="120" height="110" rx="6" class="input-box"/>
+      <text x="80" y="120" class="text" text-anchor="middle" fill="#fff">FP32/BF16</text>
+    </g>
+    
+    <!-- Arrow 0 -->
+    <path d="M 140 115 L 190 115" class="arrow"/>
+    
+    <!-- Step 1: Fused Quantize + Swizzle -->
+    <rect x="190" y="70" width="120" height="90" rx="6" class="quantize-fused-box"/>
+    <text x="250" y="105" class="text">Quantize</text>
+    <text x="250" y="122" class="text">+</text>
+    <text x="250" y="139" class="text">Swizzle</text>
+    
+    <!-- Arrow 1 -->
+    <path d="M 310 115 L 360 115" class="arrow"/>
+    
+    <!-- Step 2: Blockwise Tensor (GEMM Ready) - directly produced -->
+    <g id="swizzled-tensor-2">
+      <text x="455" y="45" class="text" text-anchor="middle" font-weight="600">FP8 (GEMM Ready)</text>
+      <rect x="360" y="60" width="190" height="110" rx="6" class="blockwise-box"/>
+      
+      <!-- Swizzled Scales sub-tile (orange) -->
+      <rect x="378" y="72" width="155" height="32" rx="3" class="scale-swizzled"/>
+      <text x="455" y="93" class="text" text-anchor="middle" fill="#fff">Swizzled Scales</text>
+      
+      <!-- FP8 Data sub-tile -->
+      <rect x="378" y="112" width="155" height="45" rx="3" class="fp8-tile"/>
+      <text x="455" y="140" class="text" fill="#fff">FP8 Data</text>
+    </g>
+    
+    <!-- Arrow 2 -->
+    <path d="M 550 115 L 600 115" class="arrow"/>
+    
+    <!-- Step 3: GEMM -->
+    <rect x="600" y="80" width="80" height="70" rx="6" class="gemm-box"/>
+    <text x="640" y="120" class="text">GEMM</text>
+  </g>
+</svg>
--- a/docs/features/low_precision_training/fp8_blockwise_scaling/img/combined_scaling.svg
+++ b/docs/features/low_precision_training/fp8_blockwise_scaling/img/combined_scaling.svg
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 55 900 715">
+  <defs>
+    <style>
+      @import url("../_static/css/diagram-colors.css");
+      
+      .title { font: bold 18px sans-serif; fill: #333; text-anchor: middle; }
+      .dots-text { font: bold 24px sans-serif; fill: #333; text-anchor: middle; }
+      
+      /* Tensor colors */
+      .fp8-tensor { fill: #87CEEB; stroke: #444; stroke-width: 2; }
+      .fp8-block { fill: #87CEEB; stroke: #555; stroke-width: 1.5; }
+      .fp8-block-alt { fill: #5F9FCC; stroke: #555; stroke-width: 1.5; }
+      
+      /* Scaling factor colors */
+      .scale-factor { fill: #FFA500; stroke: #444; stroke-width: 2; }
+      
+      .grid-line { stroke: #444; stroke-width: 2; }
+      .boundary-line { stroke: #444; stroke-width: 2; }
+    </style>
+  </defs>
+  
+  <!-- FIRST IMAGE: Standard vs Blockwise Scaling -->
+  
+  <!-- LEFT SIDE: Standard FP8 Scaling -->
+  <g id="standard-scaling">
+    <text x="225" y="85" class="title">Delayed/Current FP8 Scaling</text>
+    <text x="225" y="108" class="label">(Single scaling factor per tensor)</text>
+    
+    <!-- FP8 Tensor - solid blue with white cross -->
+    <g id="left-tensor">
+      <!-- Solid blue background -->
+      <rect x="105" y="140" width="240" height="120" class="fp8-tensor"/>
+      
+      <!-- White backgrounds for dots areas - cross pattern -->
+      <rect x="225.0" y="140.0" width="40" height="120" fill="#FFFFFF" stroke="none"/>
+      <rect x="105.0" y="190.0" width="240" height="30" fill="#FFFFFF" stroke="none"/>
+      
+      <!-- Three dots in VERTICAL white bar -->
+      <text x="245" y="167.5" class="dots-text">…</text>
+      <text x="245" y="242.5" class="dots-text">…</text>
+      
+      <!-- Three dots in HORIZONTAL white bar -->
+      <text x="165" y="205" class="dots-text">…</text>
+      <text x="305" y="205" class="dots-text">…</text>
+      
+      <!-- ONE diagonal dot at intersection -->
+      <text x="245" y="205" class="dots-text" transform="rotate(45 245 205)">…</text>
+      
+      <!-- Main outline -->
+      <rect x="105.0" y="140.0" width="240" height="120" fill="none" stroke="#444" stroke-width="2"/>
+    </g>
+    
+    <!-- Single scaling factor - one 10x10 square -->
+    <rect x="220" y="285" width="10" height="10" class="scale-factor" stroke="#444" stroke-width="1"/>
+    
+    <text x="225" y="315" class="small-text" text-anchor="middle">1 scaling factor</text>
+  </g>
+  
+  <!-- RIGHT SIDE: FP8 Blockwise Scaling -->
+  <g id="blockwise-scaling">
+    <text x="675" y="85" class="title">Blockwise FP8 Scaling – 1 dimension</text>
+    <text x="675" y="108" class="label">(One scaling factor per 128 elements)</text>
+    
+    <!-- FP8 Tensor split into many small blocks (40×10) - EXACT coordinates from Python script -->
+    <g id="tensor-blocks">
+      <!-- White backgrounds for dots areas - cross pattern -->
+      <rect x="675.0" y="140.0" width="40" height="120" fill="#FFFFFF" stroke="none"/>
+      <rect x="555.0" y="190.0" width="240" height="30" fill="#FFFFFF" stroke="none"/>
+      
+      <!-- Blocks ONLY where they don't overlap with white cross (from Python script) -->
+      <rect x="555" y="140" width="40" height="10" class="fp8-block"/>
+      <rect x="595" y="140" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="635" y="140" width="40" height="10" class="fp8-block"/>
+      <rect x="715" y="140" width="40" height="10" class="fp8-block"/>
+      <rect x="755" y="140" width="40" height="10" class="fp8-block-alt"/>
+      
+      <rect x="555" y="150" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="595" y="150" width="40" height="10" class="fp8-block"/>
+      <rect x="635" y="150" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="715" y="150" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="755" y="150" width="40" height="10" class="fp8-block"/>
+      
+      <rect x="555" y="160" width="40" height="10" class="fp8-block"/>
+      <rect x="595" y="160" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="635" y="160" width="40" height="10" class="fp8-block"/>
+      <rect x="715" y="160" width="40" height="10" class="fp8-block"/>
+      <rect x="755" y="160" width="40" height="10" class="fp8-block-alt"/>
+      
+      <rect x="555" y="170" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="595" y="170" width="40" height="10" class="fp8-block"/>
+      <rect x="635" y="170" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="715" y="170" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="755" y="170" width="40" height="10" class="fp8-block"/>
+      
+      <rect x="555" y="180" width="40" height="10" class="fp8-block"/>
+      <rect x="595" y="180" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="635" y="180" width="40" height="10" class="fp8-block"/>
+      <rect x="715" y="180" width="40" height="10" class="fp8-block"/>
+      <rect x="755" y="180" width="40" height="10" class="fp8-block-alt"/>
+      
+      <!-- Three dots in VERTICAL white bar -->
+      <text x="695" y="167.5" class="dots-text">…</text>
+      <text x="695" y="242.5" class="dots-text">…</text>
+      
+      <!-- Three dots in HORIZONTAL white bar -->
+      <text x="615" y="205" class="dots-text">…</text>
+      <text x="755" y="205" class="dots-text">…</text>
+      
+      <!-- ONE diagonal dot at intersection -->
+      <text x="695" y="205" class="dots-text" transform="rotate(45 695 205)">…</text>
+      
+      <!-- Bottom rows (y >= 220 after horizontal white bar) -->
+      <rect x="555" y="220" width="40" height="10" class="fp8-block"/>
+      <rect x="595" y="220" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="635" y="220" width="40" height="10" class="fp8-block"/>
+      <rect x="715" y="220" width="40" height="10" class="fp8-block"/>
+      <rect x="755" y="220" width="40" height="10" class="fp8-block-alt"/>
+      
+      <rect x="555" y="230" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="595" y="230" width="40" height="10" class="fp8-block"/>
+      <rect x="635" y="230" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="715" y="230" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="755" y="230" width="40" height="10" class="fp8-block"/>
+      
+      <rect x="555" y="240" width="40" height="10" class="fp8-block"/>
+      <rect x="595" y="240" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="635" y="240" width="40" height="10" class="fp8-block"/>
+      <rect x="715" y="240" width="40" height="10" class="fp8-block"/>
+      <rect x="755" y="240" width="40" height="10" class="fp8-block-alt"/>
+      
+      <rect x="555" y="250" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="595" y="250" width="40" height="10" class="fp8-block"/>
+      <rect x="635" y="250" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="715" y="250" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="755" y="250" width="40" height="10" class="fp8-block"/>
+      
+      <!-- Main outline -->
+      <rect x="555.0" y="140.0" width="240" height="120" fill="none" stroke="#444" stroke-width="2"/>
+    </g>
+    
+    <!-- Scaling factors tensor - 3+2 columns of 10px squares -->
+    <g id="scale-factors">
+      <!-- Orange background -->
+      <rect x="640" y="285" width="70" height="120" fill="#FFA500"/>
+      
+      <!-- White backgrounds for dots areas - cross pattern -->
+      <rect x="670" y="285" width="20" height="120" fill="#FFFFFF" stroke="none"/>
+      <rect x="640" y="335" width="70" height="30" fill="#FFFFFF" stroke="none"/>
+      
+      <!-- Grid lines showing 10x10 squares (3 left + 2 right columns) -->
+      <!-- Vertical lines every 10px (skipping white space) -->
+      <!-- Left 3 columns (640-670) -->
+      <line x1="650" y1="285" x2="650" y2="335" class="grid-line" stroke-width="1"/>
+      <line x1="660" y1="285" x2="660" y2="335" class="grid-line" stroke-width="1"/>
+      <line x1="670" y1="285" x2="670" y2="335" class="grid-line" stroke-width="1"/>
+      
+      <!-- Right 2 columns (690-710) -->
+      <line x1="690" y1="285" x2="690" y2="335" class="grid-line" stroke-width="1"/>
+      <line x1="700" y1="285" x2="700" y2="335" class="grid-line" stroke-width="1"/>
+      <line x1="710" y1="285" x2="710" y2="335" class="grid-line" stroke-width="1"/>
+      
+      <!-- Bottom sections -->
+      <line x1="650" y1="365" x2="650" y2="405" class="grid-line" stroke-width="1"/>
+      <line x1="660" y1="365" x2="660" y2="405" class="grid-line" stroke-width="1"/>
+      <line x1="670" y1="365" x2="670" y2="405" class="grid-line" stroke-width="1"/>
+      
+      <line x1="690" y1="365" x2="690" y2="405" class="grid-line" stroke-width="1"/>
+      <line x1="700" y1="365" x2="700" y2="405" class="grid-line" stroke-width="1"/>
+      <line x1="710" y1="365" x2="710" y2="405" class="grid-line" stroke-width="1"/>
+      
+      <!-- Horizontal lines every 10px -->
+      <line x1="640" y1="295" x2="670" y2="295" class="grid-line" stroke-width="1"/>
+      <line x1="690" y1="295" x2="710" y2="295" class="grid-line" stroke-width="1"/>
+      <line x1="640" y1="305" x2="670" y2="305" class="grid-line" stroke-width="1"/>
+      <line x1="690" y1="305" x2="710" y2="305" class="grid-line" stroke-width="1"/>
+      <line x1="640" y1="315" x2="670" y2="315" class="grid-line" stroke-width="1"/>
+      <line x1="690" y1="315" x2="710" y2="315" class="grid-line" stroke-width="1"/>
+      <line x1="640" y1="325" x2="670" y2="325" class="grid-line" stroke-width="1"/>
+      <line x1="690" y1="325" x2="710" y2="325" class="grid-line" stroke-width="1"/>
+      
+      <!-- Bottom boundary of the top section -->
+      <line x1="640" y1="335" x2="670" y2="335" class="grid-line" stroke-width="1"/>
+      <line x1="690" y1="335" x2="710" y2="335" class="grid-line" stroke-width="1"/>
+      
+      <line x1="640" y1="365" x2="670" y2="365" class="grid-line" stroke-width="1"/>
+      <line x1="690" y1="365" x2="710" y2="365" class="grid-line" stroke-width="1"/>
+      <line x1="640" y1="375" x2="670" y2="375" class="grid-line" stroke-width="1"/>
+      <line x1="690" y1="375" x2="710" y2="375" class="grid-line" stroke-width="1"/>
+      <line x1="640" y1="385" x2="670" y2="385" class="grid-line" stroke-width="1"/>
+      <line x1="690" y1="385" x2="710" y2="385" class="grid-line" stroke-width="1"/>
+      <line x1="640" y1="395" x2="670" y2="395" class="grid-line" stroke-width="1"/>
+      <line x1="690" y1="395" x2="710" y2="395" class="grid-line" stroke-width="1"/>
+      
+      <!-- Bottom boundaries -->
+      <line x1="640" y1="405" x2="670" y2="405" class="grid-line" stroke-width="1"/>
+      <line x1="690" y1="405" x2="710" y2="405" class="grid-line" stroke-width="1"/>
+      
+      <!-- Main outline -->
+      <rect x="640" y="285" width="70" height="120" fill="none" stroke="#444" stroke-width="2"/>
+      
+      <!-- Three dots -->
+      <text x="680" y="312.5" class="dots-text" style="font-size: 14px;">…</text>
+      <text x="680" y="387.5" class="dots-text" style="font-size: 14px;">…</text>
+      <text x="655" y="350" class="dots-text" style="font-size: 14px;">…</text>
+      <text x="700" y="350" class="dots-text" style="font-size: 14px;">…</text>
+      <text x="680" y="350" class="dots-text" style="font-size: 14px;" transform="rotate(45 680 350)">…</text>
+    </g>
+    
+    <text x="675" y="430" class="small-text" text-anchor="middle">Scaling factors (one per block)</text>
+  </g>
+
+  <!-- SECOND IMAGE: 2D Blockwise Scaling -->
+  <!-- Main Title -->
+  <text x="450" y="470" class="title">Blockwise FP8 Scaling – 2 dimensions</text>
+  <text x="450" y="495" class="label">(One scaling factor per 128×128 block of elements)</text>
+  
+  <!-- TOP: DATA TENSOR (20x20 blocks, with 3 extra columns on right) -->
+  <g id="data-tensor">
+    
+    <!-- Background for entire tensor -->
+    <rect x="390" y="525" width="180" height="120" class="fp8-tensor"/>
+    
+    <!-- White space for gaps (cross pattern) -->
+    <rect x="450" y="525" width="20" height="120" fill="#FFFFFF" stroke="none"/>
+    <rect x="390" y="585" width="180" height="20" fill="#FFFFFF" stroke="none"/>
+
+    <!-- Grid Lines (every 20px) -->
+    <!-- Vertical Lines Left (x=410, 430) -->
+    <line x1="410" y1="525" x2="410" y2="585" class="grid-line" stroke-width="1"/>
+    <line x1="430" y1="525" x2="430" y2="585" class="grid-line" stroke-width="1"/>
+    <line x1="410" y1="605" x2="410" y2="645" class="grid-line" stroke-width="1"/>
+    <line x1="430" y1="605" x2="430" y2="645" class="grid-line" stroke-width="1"/>
+    
+    <!-- Vertical Lines Right (x=490, 510, 530, 550) -->
+    <line x1="490" y1="525" x2="490" y2="585" class="grid-line" stroke-width="1"/>
+    <line x1="490" y1="605" x2="490" y2="645" class="grid-line" stroke-width="1"/>
+    <line x1="510" y1="525" x2="510" y2="585" class="grid-line" stroke-width="1"/>
+    <line x1="510" y1="605" x2="510" y2="645" class="grid-line" stroke-width="1"/>
+    <line x1="530" y1="525" x2="530" y2="585" class="grid-line" stroke-width="1"/>
+    <line x1="530" y1="605" x2="530" y2="645" class="grid-line" stroke-width="1"/>
+    <line x1="550" y1="525" x2="550" y2="585" class="grid-line" stroke-width="1"/>
+    <line x1="550" y1="605" x2="550" y2="645" class="grid-line" stroke-width="1"/>
+
+    <!-- Horizontal Lines Top (y=545, 565) -->
+    <line x1="390" y1="545" x2="450" y2="545" class="grid-line" stroke-width="1"/>
+    <line x1="470" y1="545" x2="570" y2="545" class="grid-line" stroke-width="1"/>
+    <line x1="390" y1="565" x2="450" y2="565" class="grid-line" stroke-width="1"/>
+    <line x1="470" y1="565" x2="570" y2="565" class="grid-line" stroke-width="1"/>
+
+    <!-- Horizontal Lines Bottom (y=625) -->
+    <line x1="390" y1="625" x2="450" y2="625" class="grid-line" stroke-width="1"/>
+    <line x1="470" y1="625" x2="570" y2="625" class="grid-line" stroke-width="1"/>
+
+    <!-- Dots / Ellipses -->
+    <!-- Horizontal dots in gap -->
+    <text x="460" y="552" class="dots-text" style="font-size: 14px;">…</text>
+    <text x="460" y="632" class="dots-text" style="font-size: 14px;">…</text>
+    
+    <!-- Vertical dots in gap -->
+    <text x="420" y="597" class="dots-text" style="font-size: 14px;">…</text>
+    <text x="540" y="597" class="dots-text" style="font-size: 14px;">…</text>
+    
+    <!-- Diagonal dot -->
+    <text x="460" y="597" class="dots-text" style="font-size: 14px;" transform="rotate(45 460 597)">…</text>
+
+    <!-- Boundaries around white spaces (excluding center intersection) -->
+    <!-- Vertical boundaries - broken at horizontal white space -->
+    <line x1="450" y1="525" x2="450" y2="585" class="boundary-line"/>
+    <line x1="450" y1="605" x2="450" y2="645" class="boundary-line"/>
+    <line x1="470" y1="525" x2="470" y2="585" class="boundary-line"/>
+    <line x1="470" y1="605" x2="470" y2="645" class="boundary-line"/>
+    <!-- Horizontal boundaries - broken at vertical white space -->
+    <line x1="390" y1="585" x2="450" y2="585" class="boundary-line"/>
+    <line x1="470" y1="585" x2="570" y2="585" class="boundary-line"/>
+    <line x1="390" y1="605" x2="450" y2="605" class="boundary-line"/>
+    <line x1="470" y1="605" x2="570" y2="605" class="boundary-line"/>
+
+    <!-- Main outline -->
+    <rect x="390" y="525" width="180" height="120" fill="none" stroke="#444" stroke-width="2"/>
+  </g>
+
+  <!-- BOTTOM: SCALING FACTORS (10x10 blocks, with 3 extra columns on right) -->
+  <g id="scaling-factors-2d">
+    <!-- Background for entire scaling tensor -->
+    <rect x="420" y="675" width="90" height="60" class="scale-factor"/>
+    
+    <!-- White space for gaps (cross pattern) -->
+    <rect x="450" y="675" width="10" height="60" fill="#FFFFFF" stroke="none"/>
+    <rect x="420" y="705" width="90" height="10" fill="#FFFFFF" stroke="none"/>
+
+    <!-- Grid Lines (every 10px) -->
+    <!-- Vertical Left -->
+    <line x1="430" y1="675" x2="430" y2="705" class="grid-line" stroke-width="1"/>
+    <line x1="440" y1="675" x2="440" y2="705" class="grid-line" stroke-width="1"/>
+    <line x1="430" y1="715" x2="430" y2="735" class="grid-line" stroke-width="1"/>
+    <line x1="440" y1="715" x2="440" y2="735" class="grid-line" stroke-width="1"/>
+
+    <!-- Vertical Right -->
+    <line x1="470" y1="675" x2="470" y2="705" class="grid-line" stroke-width="1"/>
+    <line x1="470" y1="715" x2="470" y2="735" class="grid-line" stroke-width="1"/>
+    <line x1="480" y1="675" x2="480" y2="705" class="grid-line" stroke-width="1"/>
+    <line x1="480" y1="715" x2="480" y2="735" class="grid-line" stroke-width="1"/>
+    <line x1="490" y1="675" x2="490" y2="705" class="grid-line" stroke-width="1"/>
+    <line x1="490" y1="715" x2="490" y2="735" class="grid-line" stroke-width="1"/>
+    <line x1="500" y1="675" x2="500" y2="705" class="grid-line" stroke-width="1"/>
+    <line x1="500" y1="715" x2="500" y2="735" class="grid-line" stroke-width="1"/>
+
+    <!-- Horizontal Top -->
+    <line x1="420" y1="685" x2="450" y2="685" class="grid-line" stroke-width="1"/>
+    <line x1="460" y1="685" x2="510" y2="685" class="grid-line" stroke-width="1"/>
+    <line x1="420" y1="695" x2="450" y2="695" class="grid-line" stroke-width="1"/>
+    <line x1="460" y1="695" x2="510" y2="695" class="grid-line" stroke-width="1"/>
+
+    <!-- Horizontal Bottom -->
+    <line x1="420" y1="725" x2="450" y2="725" class="grid-line" stroke-width="1"/>
+    <line x1="460" y1="725" x2="510" y2="725" class="grid-line" stroke-width="1"/>
+
+    <!-- Dots -->
+    <text x="455" y="692" class="dots-text" style="font-size: 12px;">…</text>
+    <text x="455" y="727" class="dots-text" style="font-size: 12px;">…</text>
+    <text x="435" y="711" class="dots-text" style="font-size: 12px;">…</text>
+    <text x="490" y="711" class="dots-text" style="font-size: 12px;">…</text>
+    <text x="455" y="711" class="dots-text" style="font-size: 12px;" transform="rotate(45 455 711)">…</text>
+
+    <!-- Boundaries around white spaces (excluding center intersection) -->
+    <!-- Vertical boundaries - broken at horizontal white space -->
+    <line x1="450" y1="675" x2="450" y2="705" class="boundary-line"/>
+    <line x1="450" y1="715" x2="450" y2="735" class="boundary-line"/>
+    <line x1="460" y1="675" x2="460" y2="705" class="boundary-line"/>
+    <line x1="460" y1="715" x2="460" y2="735" class="boundary-line"/>
+    <!-- Horizontal boundaries - broken at vertical white space -->
+    <line x1="420" y1="705" x2="450" y2="705" class="boundary-line"/>
+    <line x1="460" y1="705" x2="510" y2="705" class="boundary-line"/>
+    <line x1="420" y1="715" x2="450" y2="715" class="boundary-line"/>
+    <line x1="460" y1="715" x2="510" y2="715" class="boundary-line"/>
+
+    <!-- Main outline -->
+    <rect x="420" y="675" width="90" height="60" fill="none" stroke="#444" stroke-width="2"/>
+
+    <text x="465" y="755" class="small-text" text-anchor="middle">Scaling factors (1 per 2D block)</text>
+  </g>
+</svg>
--- a/docs/features/low_precision_training/fp8_blockwise_scaling/img/transpose_handling.svg
+++ b/docs/features/low_precision_training/fp8_blockwise_scaling/img/transpose_handling.svg
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 900 640">
+  <defs>
+    <style>
+      @import url("../_static/css/diagram-colors.css");
+      
+      .title { font: bold 16px sans-serif; fill: #333; text-anchor: middle; }
+      .label { font: 14px sans-serif; fill: #333; text-anchor: middle; }
+      .small-text { font: 12px sans-serif; fill: #555; }
+      .dots-text { font: bold 24px sans-serif; fill: #333; text-anchor: middle; }
+      
+      /* Tensor colors */
+      .fp8-block { fill: #87CEEB; stroke: #555; stroke-width: 1.5; }
+      .fp8-block-alt { fill: #5F9FCC; stroke: #555; stroke-width: 1.5; }
+    </style>
+  </defs>
+  
+  <!-- Section title for 1D -->
+  <text x="450" y="25" class="title" style="font-size: 18px; font-weight: bold;">1D Blockwise Scaling</text>
+  
+  <!-- LEFT SIDE: Original 1D Blockwise (Rowwise Quantization) -->
+  <g id="rowwise-quantization">
+    <text x="225" y="50" class="title">Rowwise Quantization</text>
+    
+    <!-- FP8 Tensor with horizontal stripes -->
+    <g id="left-tensor">
+      <!-- White backgrounds for dots areas - cross pattern -->
+      <rect x="225.0" y="100.0" width="40" height="120" fill="#FFFFFF" stroke="none"/>
+      <rect x="105.0" y="150.0" width="240" height="30" fill="#FFFFFF" stroke="none"/>
+      
+      <!-- Horizontal blocks (40×10 each) - rows of alternating colors -->
+      <!-- Top section (before horizontal gap) -->
+      <rect x="105" y="100" width="40" height="10" class="fp8-block"/>
+      <rect x="145" y="100" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="185" y="100" width="40" height="10" class="fp8-block"/>
+      <rect x="265" y="100" width="40" height="10" class="fp8-block"/>
+      <rect x="305" y="100" width="40" height="10" class="fp8-block-alt"/>
+      
+      <rect x="105" y="110" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="145" y="110" width="40" height="10" class="fp8-block"/>
+      <rect x="185" y="110" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="265" y="110" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="305" y="110" width="40" height="10" class="fp8-block"/>
+      
+      <rect x="105" y="120" width="40" height="10" class="fp8-block"/>
+      <rect x="145" y="120" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="185" y="120" width="40" height="10" class="fp8-block"/>
+      <rect x="265" y="120" width="40" height="10" class="fp8-block"/>
+      <rect x="305" y="120" width="40" height="10" class="fp8-block-alt"/>
+      
+      <rect x="105" y="130" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="145" y="130" width="40" height="10" class="fp8-block"/>
+      <rect x="185" y="130" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="265" y="130" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="305" y="130" width="40" height="10" class="fp8-block"/>
+      
+      <rect x="105" y="140" width="40" height="10" class="fp8-block"/>
+      <rect x="145" y="140" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="185" y="140" width="40" height="10" class="fp8-block"/>
+      <rect x="265" y="140" width="40" height="10" class="fp8-block"/>
+      <rect x="305" y="140" width="40" height="10" class="fp8-block-alt"/>
+      
+      <!-- Three dots in VERTICAL white bar -->
+      <text x="245" y="127.5" class="dots-text">…</text>
+      <text x="245" y="202.5" class="dots-text">…</text>
+      
+      <!-- Three dots in HORIZONTAL white bar -->
+      <text x="165" y="165" class="dots-text">…</text>
+      <text x="305" y="165" class="dots-text">…</text>
+      
+      <!-- ONE diagonal dot at intersection -->
+      <text x="245" y="165" class="dots-text" transform="rotate(45 245 165)">…</text>
+      
+      <!-- Bottom section (after horizontal gap) -->
+      <rect x="105" y="180" width="40" height="10" class="fp8-block"/>
+      <rect x="145" y="180" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="185" y="180" width="40" height="10" class="fp8-block"/>
+      <rect x="265" y="180" width="40" height="10" class="fp8-block"/>
+      <rect x="305" y="180" width="40" height="10" class="fp8-block-alt"/>
+      
+      <rect x="105" y="190" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="145" y="190" width="40" height="10" class="fp8-block"/>
+      <rect x="185" y="190" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="265" y="190" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="305" y="190" width="40" height="10" class="fp8-block"/>
+      
+      <rect x="105" y="200" width="40" height="10" class="fp8-block"/>
+      <rect x="145" y="200" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="185" y="200" width="40" height="10" class="fp8-block"/>
+      <rect x="265" y="200" width="40" height="10" class="fp8-block"/>
+      <rect x="305" y="200" width="40" height="10" class="fp8-block-alt"/>
+      
+      <rect x="105" y="210" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="145" y="210" width="40" height="10" class="fp8-block"/>
+      <rect x="185" y="210" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="265" y="210" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="305" y="210" width="40" height="10" class="fp8-block"/>
+      
+      <!-- Main outline -->
+      <rect x="105.0" y="100.0" width="240" height="120" fill="none" stroke="#444" stroke-width="2"/>
+    </g>
+  </g>
+  
+  <!-- RIGHT SIDE: Transposed (Columnwise Quantization) -->
+  <g id="columnwise-quantization">
+    <text x="625" y="50" class="title">Columnwise Quantization</text>
+    
+    <!-- FP8 Tensor - transposed shape (120 wide × 240 tall) with HORIZONTAL stripes -->
+    <g id="right-tensor">
+      <!-- White backgrounds for dots areas - cross pattern -->
+      <rect x="645.0" y="100.0" width="40" height="240" fill="#FFFFFF" stroke="none"/>
+      <rect x="565.0" y="260.0" width="120" height="30" fill="#FFFFFF" stroke="none"/>
+      
+      <!-- Horizontal stripes 40×10 (same as rowwise) -->
+      <!-- Top section (before horizontal gap) - 16 rows of 10px each = 160px -->
+      <rect x="565" y="100" width="40" height="10" class="fp8-block"/>
+      <rect x="605" y="100" width="40" height="10" class="fp8-block"/>
+      
+      <rect x="565" y="110" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="605" y="110" width="40" height="10" class="fp8-block-alt"/>
+      
+      <rect x="565" y="120" width="40" height="10" class="fp8-block"/>
+      <rect x="605" y="120" width="40" height="10" class="fp8-block"/>
+      
+      <rect x="565" y="130" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="605" y="130" width="40" height="10" class="fp8-block-alt"/>
+      
+      <rect x="565" y="140" width="40" height="10" class="fp8-block"/>
+      <rect x="605" y="140" width="40" height="10" class="fp8-block"/>
+      
+      <rect x="565" y="150" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="605" y="150" width="40" height="10" class="fp8-block-alt"/>
+      
+      <rect x="565" y="160" width="40" height="10" class="fp8-block"/>
+      <rect x="605" y="160" width="40" height="10" class="fp8-block"/>
+      
+      <rect x="565" y="170" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="605" y="170" width="40" height="10" class="fp8-block-alt"/>
+      
+      <rect x="565" y="180" width="40" height="10" class="fp8-block"/>
+      <rect x="605" y="180" width="40" height="10" class="fp8-block"/>
+      
+      <rect x="565" y="190" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="605" y="190" width="40" height="10" class="fp8-block-alt"/>
+      
+      <rect x="565" y="200" width="40" height="10" class="fp8-block"/>
+      <rect x="605" y="200" width="40" height="10" class="fp8-block"/>
+      
+      <rect x="565" y="210" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="605" y="210" width="40" height="10" class="fp8-block-alt"/>
+      
+      <rect x="565" y="220" width="40" height="10" class="fp8-block"/>
+      <rect x="605" y="220" width="40" height="10" class="fp8-block"/>
+      
+      <rect x="565" y="230" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="605" y="230" width="40" height="10" class="fp8-block-alt"/>
+      
+      <rect x="565" y="240" width="40" height="10" class="fp8-block"/>
+      <rect x="605" y="240" width="40" height="10" class="fp8-block"/>
+      
+      <rect x="565" y="250" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="605" y="250" width="40" height="10" class="fp8-block-alt"/>
+      
+      <!-- Three dots in VERTICAL white bar -->
+      <text x="665" y="200" class="dots-text">…</text>
+      <text x="665" y="330" class="dots-text">…</text>
+      
+      <!-- Three dots in HORIZONTAL white bar -->
+      <text x="605" y="275" class="dots-text">…</text>
+      
+      <!-- ONE diagonal dot at intersection -->
+      <text x="665" y="275" class="dots-text" transform="rotate(45 665 275)">…</text>
+      
+      <!-- Bottom section (after horizontal gap) - 5 rows of 10px each = 50px -->
+      <rect x="565" y="290" width="40" height="10" class="fp8-block"/>
+      <rect x="605" y="290" width="40" height="10" class="fp8-block"/>
+      
+      <rect x="565" y="300" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="605" y="300" width="40" height="10" class="fp8-block-alt"/>
+      
+      <rect x="565" y="310" width="40" height="10" class="fp8-block"/>
+      <rect x="605" y="310" width="40" height="10" class="fp8-block"/>
+      
+      <rect x="565" y="320" width="40" height="10" class="fp8-block-alt"/>
+      <rect x="605" y="320" width="40" height="10" class="fp8-block-alt"/>
+      
+      <rect x="565" y="330" width="40" height="10" class="fp8-block"/>
+      <rect x="605" y="330" width="40" height="10" class="fp8-block"/>
+      
+      <!-- Main outline -->
+      <rect x="565.0" y="100.0" width="120" height="240" fill="none" stroke="#444" stroke-width="2"/>
+    </g>
+  </g>
+
+  <!-- SECTION 2: 2D Blockwise Scaling -->
+  
+  <!-- Section title for 2D -->
+  <text x="450" y="380" class="title" style="font-size: 18px; font-weight: bold;">2D Blockwise Scaling</text>
+  
+  <!-- LEFT SIDE: Original 2D Blockwise (copied from combined_scaling.svg) -->
+  <g id="2d-original">
+    <text x="280" y="405" class="title">Rowwise Quantization</text>
+    
+    <!-- TOP: DATA TENSOR (20x20 blocks, with 3 extra columns on right) -->
+    <g id="data-tensor-left">
+      
+      <!-- Background for entire tensor -->
+      <rect x="190" y="445" width="180" height="120" fill="#87CEEB" stroke="#444" stroke-width="2"/>
+      
+      <!-- White space for gaps (cross pattern) -->
+      <rect x="250" y="445" width="20" height="120" fill="#FFFFFF" stroke="none"/>
+      <rect x="190" y="505" width="180" height="20" fill="#FFFFFF" stroke="none"/>
+
+      <!-- Grid Lines (every 20px) -->
+      <!-- Vertical Lines Left (x=210, 230) -->
+      <line x1="210" y1="445" x2="210" y2="505" stroke="#444" stroke-width="1"/>
+      <line x1="230" y1="445" x2="230" y2="505" stroke="#444" stroke-width="1"/>
+      <line x1="210" y1="525" x2="210" y2="565" stroke="#444" stroke-width="1"/>
+      <line x1="230" y1="525" x2="230" y2="565" stroke="#444" stroke-width="1"/>
+      
+      <!-- Vertical Lines Right (x=290, 310, 330, 350) -->
+      <line x1="290" y1="445" x2="290" y2="505" stroke="#444" stroke-width="1"/>
+      <line x1="290" y1="525" x2="290" y2="565" stroke="#444" stroke-width="1"/>
+      <line x1="310" y1="445" x2="310" y2="505" stroke="#444" stroke-width="1"/>
+      <line x1="310" y1="525" x2="310" y2="565" stroke="#444" stroke-width="1"/>
+      <line x1="330" y1="445" x2="330" y2="505" stroke="#444" stroke-width="1"/>
+      <line x1="330" y1="525" x2="330" y2="565" stroke="#444" stroke-width="1"/>
+      <line x1="350" y1="445" x2="350" y2="505" stroke="#444" stroke-width="1"/>
+      <line x1="350" y1="525" x2="350" y2="565" stroke="#444" stroke-width="1"/>
+
+      <!-- Horizontal Lines Top (y=465, 485) -->
+      <line x1="190" y1="465" x2="250" y2="465" stroke="#444" stroke-width="1"/>
+      <line x1="270" y1="465" x2="370" y2="465" stroke="#444" stroke-width="1"/>
+      <line x1="190" y1="485" x2="250" y2="485" stroke="#444" stroke-width="1"/>
+      <line x1="270" y1="485" x2="370" y2="485" stroke="#444" stroke-width="1"/>
+
+      <!-- Horizontal Lines Bottom (y=545) -->
+      <line x1="190" y1="545" x2="250" y2="545" stroke="#444" stroke-width="1"/>
+      <line x1="270" y1="545" x2="370" y2="545" stroke="#444" stroke-width="1"/>
+
+      <!-- Dots / Ellipses -->
+      <!-- Horizontal dots in gap -->
+      <text x="260" y="472" class="dots-text" style="font-size: 14px;">…</text>
+      <text x="260" y="552" class="dots-text" style="font-size: 14px;">…</text>
+      
+      <!-- Vertical dots in gap -->
+      <text x="220" y="517" class="dots-text" style="font-size: 14px;">…</text>
+      <text x="340" y="517" class="dots-text" style="font-size: 14px;">…</text>
+      
+      <!-- Diagonal dot -->
+      <text x="260" y="517" class="dots-text" style="font-size: 14px;" transform="rotate(45 260 517)">…</text>
+
+      <!-- Boundaries around white spaces (excluding center intersection) -->
+      <!-- Vertical boundaries - broken at horizontal white space -->
+      <line x1="250" y1="445" x2="250" y2="505" stroke="#444" stroke-width="2"/>
+      <line x1="250" y1="525" x2="250" y2="565" stroke="#444" stroke-width="2"/>
+      <line x1="270" y1="445" x2="270" y2="505" stroke="#444" stroke-width="2"/>
+      <line x1="270" y1="525" x2="270" y2="565" stroke="#444" stroke-width="2"/>
+      <!-- Horizontal boundaries - broken at vertical white space -->
+      <line x1="190" y1="505" x2="250" y2="505" stroke="#444" stroke-width="2"/>
+      <line x1="270" y1="505" x2="370" y2="505" stroke="#444" stroke-width="2"/>
+      <line x1="190" y1="525" x2="250" y2="525" stroke="#444" stroke-width="2"/>
+      <line x1="270" y1="525" x2="370" y2="525" stroke="#444" stroke-width="2"/>
+
+      <!-- Main outline -->
+      <rect x="190" y="445" width="180" height="120" fill="none" stroke="#444" stroke-width="2"/>
+    </g>
+  </g>
+
+  <!-- RIGHT SIDE: Transposed 2D Blockwise -->
+  <g id="2d-transposed">
+    <text x="605" y="405" class="title">Columnwise Quantization</text>
+    
+    <!-- DATA TENSOR TRANSPOSED (120x180 instead of 180x120) -->
+    <g id="data-tensor-right">
+      
+      <!-- Background for entire tensor -->
+      <rect x="545" y="435" width="120" height="180" fill="#87CEEB" stroke="#444" stroke-width="2"/>
+      
+      <!-- White space for gaps (cross pattern) - TRANSPOSED -->
+      <!-- Original: X structure (180): 60 + 20 + 100 → Y structure (180): 60 + 20 + 100 -->
+      <!-- Original: Y structure (120): 60 + 20 + 40 → X structure (120): 60 + 20 + 40 -->
+      <rect x="545" y="495" width="120" height="20" fill="#FFFFFF" stroke="none"/>
+      <rect x="605" y="435" width="20" height="180" fill="#FFFFFF" stroke="none"/>
+
+      <!-- Grid Lines (every 20px) - TRANSPOSED -->
+      <!-- Original vertical lines at x=210, 230 become horizontal at y=455, 475 -->
+      <line x1="545" y1="455" x2="605" y2="455" stroke="#444" stroke-width="1"/>
+      <line x1="625" y1="455" x2="665" y2="455" stroke="#444" stroke-width="1"/>
+      <line x1="545" y1="475" x2="605" y2="475" stroke="#444" stroke-width="1"/>
+      <line x1="625" y1="475" x2="665" y2="475" stroke="#444" stroke-width="1"/>
+      
+      <!-- Original vertical lines at x=290, 310, 330, 350 become horizontal at y=535, 555, 575, 595 -->
+      <line x1="545" y1="535" x2="605" y2="535" stroke="#444" stroke-width="1"/>
+      <line x1="625" y1="535" x2="665" y2="535" stroke="#444" stroke-width="1"/>
+      <line x1="545" y1="555" x2="605" y2="555" stroke="#444" stroke-width="1"/>
+      <line x1="625" y1="555" x2="665" y2="555" stroke="#444" stroke-width="1"/>
+      <line x1="545" y1="575" x2="605" y2="575" stroke="#444" stroke-width="1"/>
+      <line x1="625" y1="575" x2="665" y2="575" stroke="#444" stroke-width="1"/>
+      <line x1="545" y1="595" x2="605" y2="595" stroke="#444" stroke-width="1"/>
+      <line x1="625" y1="595" x2="665" y2="595" stroke="#444" stroke-width="1"/>
+
+      <!-- Original horizontal lines at y=465, 485 become vertical at x=565, 585 -->
+      <line x1="565" y1="435" x2="565" y2="495" stroke="#444" stroke-width="1"/>
+      <line x1="565" y1="515" x2="565" y2="615" stroke="#444" stroke-width="1"/>
+      <line x1="585" y1="435" x2="585" y2="495" stroke="#444" stroke-width="1"/>
+      <line x1="585" y1="515" x2="585" y2="615" stroke="#444" stroke-width="1"/>
+      
+      <!-- Original horizontal line at y=545 becomes vertical at x=645; x=605 and x=625 are the edges of the vertical gap -->
+      <line x1="605" y1="435" x2="605" y2="495" stroke="#444" stroke-width="1"/>
+      <line x1="605" y1="515" x2="605" y2="615" stroke="#444" stroke-width="1"/>
+      <line x1="625" y1="435" x2="625" y2="495" stroke="#444" stroke-width="1"/>
+      <line x1="625" y1="515" x2="625" y2="615" stroke="#444" stroke-width="1"/>
+      <line x1="645" y1="435" x2="645" y2="495" stroke="#444" stroke-width="1"/>
+      <line x1="645" y1="515" x2="645" y2="615" stroke="#444" stroke-width="1"/>
+
+      <!-- Dots / Ellipses - TRANSPOSED -->
+      <!-- Original: horizontal dots at (260, 472) and (260, 552) in vertical gap -->
+      <!-- Offsets: (70, 27) and (70, 107) → transposed to (27+545, 70+435) and (107+545, 70+435) -->
+      <text x="572" y="505" class="dots-text" style="font-size: 14px;">…</text>
+      <text x="652" y="505" class="dots-text" style="font-size: 14px;">…</text>
+      
+      <!-- Original: vertical dots at (220, 517) and (340, 517) in horizontal gap -->
+      <!-- Offsets: (30, 72) and (150, 72) → transposed to (72+545, 30+435) and (72+545, 150+435) -->
+      <text x="617" y="465" class="dots-text" style="font-size: 14px;">…</text>
+      <text x="617" y="585" class="dots-text" style="font-size: 14px;">…</text>
+      
+      <!-- Diagonal dot at (260, 517) → offset (70, 72) → transposed to (72+545, 70+435) -->
+      <text x="617" y="505" class="dots-text" style="font-size: 14px;" transform="rotate(45 617 505)">…</text>
+
+      <!-- Boundaries around white spaces - TRANSPOSED -->
+      <!-- Original vertical boundaries (x=250, x=270) become horizontal boundaries (y=495, y=515) -->
+      <line x1="545" y1="495" x2="605" y2="495" stroke="#444" stroke-width="2"/>
+      <line x1="625" y1="495" x2="665" y2="495" stroke="#444" stroke-width="2"/>
+      <line x1="545" y1="515" x2="605" y2="515" stroke="#444" stroke-width="2"/>
+      <line x1="625" y1="515" x2="665" y2="515" stroke="#444" stroke-width="2"/>
+      <!-- Original horizontal boundaries (y=505, y=525) become vertical boundaries (x=605, x=625) -->
+      <line x1="605" y1="435" x2="605" y2="495" stroke="#444" stroke-width="2"/>
+      <line x1="605" y1="515" x2="605" y2="615" stroke="#444" stroke-width="2"/>
+      <line x1="625" y1="435" x2="625" y2="495" stroke="#444" stroke-width="2"/>
+      <line x1="625" y1="515" x2="625" y2="615" stroke="#444" stroke-width="2"/>
+
+      <!-- Main outline -->
+      <rect x="545" y="435" width="120" height="180" fill="none" stroke="#444" stroke-width="2"/>
+    </g>
+  </g>
+
+</svg>
--- a/docs/features/low_precision_training/fp8_blockwise_scaling/pytorch_blockwise_scaling_example.py
+++ b/docs/features/low_precision_training/fp8_blockwise_scaling/pytorch_blockwise_scaling_example.py
+# Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# See LICENSE for license information.
+
+import torch
+
+# Check for Hopper or newer GPU
+major, minor = torch.cuda.get_device_capability()
+assert major >= 9, f"FP8 Blockwise Scaling requires SM90 (Hopper) or later, got SM{major}{minor}"
+
+# START_BLOCKWISE_SCALING_EXAMPLE
+
+import torch
+import transformer_engine.pytorch as te
+from transformer_engine.common.recipe import Float8BlockScaling, Format
+
+# Create FP8 Blockwise Scaling recipe
+recipe = Float8BlockScaling(
+    fp8_format=Format.E4M3,  # E4M3 or HYBRID (default: E4M3)
+    x_block_scaling_dim=1,  # 1D scaling for activations (default: 1)
+    w_block_scaling_dim=2,  # 2D scaling for weights (default: 2)
+    grad_block_scaling_dim=1,  # 1D scaling for gradients (default: 1)
+)
+
+# Create a linear layer with bfloat16 parameters
+layer = te.Linear(1024, 1024, params_dtype=torch.bfloat16)
+
+# Forward and backward pass
+inp = torch.randn(32, 128, 1024, dtype=torch.bfloat16, device="cuda")
+
+with te.autocast(enabled=True, recipe=recipe):
+    output = layer(inp)
+    loss = output.sum()
+
+loss.backward()
+
+# END_BLOCKWISE_SCALING_EXAMPLE
--- a/docs/features/low_precision_training/fp8_current_scaling/fp8_current_scaling.rst
+++ b/docs/features/low_precision_training/fp8_current_scaling/fp8_current_scaling.rst
+..
+    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+    See LICENSE for license information.
+
+FP8 Current Scaling
+===================================
+
+The FP8 Current Scaling recipe is the simplest low-precision recipe provided by Transformer Engine.
+To understand how it works, we first need to examine what the FP8 data type is and how it differs from other floating-point formats.
+
+
+FP8 data type
+-------------
+
+The FP8 datatype, introduced with the Hopper architecture, actually comprises two distinct datatypes, each useful in a different part of neural network training:
+
+* E4M3 -- consists of 1 sign bit, 4 exponent bits and 3 bits of mantissa. It can store values up to +/-448 and ``nan``.
+* E5M2 -- consists of 1 sign bit, 5 exponent bits and 2 bits of mantissa. It can store values up to +/-57344, +/- ``inf`` and ``nan``. The tradeoff of the increased dynamic range is lower precision of the stored values.
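
The maximum values quoted above follow directly from the bit layouts. The snippet below is a minimal sketch of that arithmetic, not part of Transformer Engine. The helper ``max_finite`` and its ``ieee_specials`` flag are hypothetical names introduced for illustration. It assumes that E5M2 reserves the all-ones exponent for ``inf``/``nan`` (IEEE-style), while E4M3 reserves only the single all-ones exponent and all-ones mantissa pattern for ``nan``.

```python
def max_finite(exp_bits: int, man_bits: int, ieee_specials: bool) -> float:
    """Largest finite value of a small float format (illustrative helper)."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_specials:
        # All-ones exponent is reserved for inf/nan, so the largest usable
        # exponent field is one below it; the mantissa can be all ones.
        max_exp = (2 ** exp_bits - 2) - bias
        max_man = 2 ** man_bits - 1
    else:
        # Only (all-ones exponent, all-ones mantissa) encodes nan, so the
        # largest finite value uses the top exponent with mantissa 0b1..10.
        max_exp = (2 ** exp_bits - 1) - bias
        max_man = 2 ** man_bits - 2
    return 2.0 ** max_exp * (1 + max_man / 2 ** man_bits)


E4M3_MAX = max_finite(4, 3, ieee_specials=False)
E5M2_MAX = max_finite(5, 2, ieee_specials=True)
print(E4M3_MAX, E5M2_MAX)  # 448.0 57344.0
```

Giving up the ``inf`` encodings is what lets E4M3 reach 448 despite having only 4 exponent bits.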
+
+.. raw:: html
+   :file: img/fp8_formats.svg
+
+*Figure 1: Structure of the floating point datatypes. All of the values shown (in FP16, BF16, FP8 E4M3 and FP8 E5M2) are the closest representations of value 0.3952.*
+
+
+**E4M3 and E5M2 usage in training**
+
+By default, Transformer Engine uses a hybrid approach:
+
+* *Forward pass* - activations and weights require more precision, so the E4M3 datatype is used to store them.
+* *Backward pass* - gradients are less susceptible to precision loss but require higher dynamic range, so the E5M2 datatype is preferred.
+
+The user can configure this behavior via the ``fp8_format`` parameter of the recipe.
+
+
+Scaling factors
+---------------
+
+
+The limited dynamic range of the FP8 datatype is insufficient for many tensors.
+To address this, values in the tensor are scaled. The FP8 Current Scaling recipe uses one **FP32** scaling factor per tensor. A tensor element ``x`` is recovered from its FP8 representation as:
+
+.. code-block:: python
+
+    x = x_fp8 * s
+
+where
+
+* ``x_fp8`` is the FP8 value (E4M3 or E5M2),
+* ``s`` is a global **FP32** scaling factor applied to the entire tensor.
+
+**FP8 Current Scaling quantization**
+
+Let's take a closer look at how quantization to FP8 with scaling factor is implemented in
+the FP8 Current Scaling recipe.
+
+.. raw:: html
+   :file: img/fp8_scaling_concept.svg
+
+*Figure 3: Quantization to FP8 consists of amax (absolute maximum) computation, scaling to fit the FP8 range and casting to the respective FP8 format.*
+
+Quantization to FP8 consists of 3 steps:
+
+1. Computation of the absolute maximum value of the tensor - we refer to it as ``amax``.
+2. Multiplying the tensor by ``fp8_max / amax`` (the reciprocal of the scaling factor ``s`` defined above) so that it fits into the FP8 range.
+3. Casting into the respective FP8 format using *Round To Nearest Even (RTNE)*. Each value is rounded to the nearest representable FP8 value; a value exactly halfway between two representable values is rounded to the one with an even mantissa, which minimizes systematic bias.
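
The three steps above can be sketched in plain Python. This is an illustrative emulation of E4M3 rounding on Python floats, not Transformer Engine's kernels. The names ``round_to_e4m3`` and ``quantize_current_scaling`` are hypothetical, and the sketch assumes ``amax > 0`` and that out-of-range values saturate at +/-448.

```python
import math

E4M3_MAX = 448.0      # largest finite E4M3 value
E4M3_MIN_EXP = -6     # exponent of the smallest normal E4M3 value
E4M3_MAN_BITS = 3     # mantissa bits


def round_to_e4m3(v: float) -> float:
    """Round v to the nearest E4M3-representable value, ties to even."""
    if v == 0.0:
        return 0.0
    a = abs(v)
    _, e = math.frexp(a)                    # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_EXP)          # subnormals share the minimum exponent
    step = 2.0 ** (exp - E4M3_MAN_BITS)     # spacing of representable values here
    q = round(a / step) * step              # Python's round() rounds ties to even
    return math.copysign(min(q, E4M3_MAX), v)  # saturate (assumption of this sketch)


def quantize_current_scaling(tensor):
    """Return (x_fp8, s) such that x is approximately x_fp8 * s."""
    amax = max(abs(v) for v in tensor)      # step 1: first pass over the tensor
    scale = E4M3_MAX / amax                 # fp8_max / amax, assumes amax > 0
    x_fp8 = [round_to_e4m3(v * scale) for v in tensor]  # steps 2-3: second pass
    return x_fp8, 1.0 / scale               # s is the reciprocal of the applied scale


tensor = [0.5, -3.0, 0.3952, 1000.0]
x_fp8, s = quantize_current_scaling(tensor)
dequantized = [q * s for q in x_fp8]
print(x_fp8)
print(dequantized)
```

The element with the largest magnitude maps to the edge of the FP8 range, and every other element keeps a relative error bounded by the 3-bit mantissa as long as it stays in the normal range after scaling.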
+
+**Performance analysis**
+
+Quantization is a memory-bound operation that requires reading the tensor twice:
+
+* First read: compute ``amax`` across all elements.
+* Second read: apply the scaling factor and cast to FP8.
+
+This is a significant overhead compared to other recipes, which typically require only a single memory read.
+
+.. raw:: html
+   :file: img/fp8_cast_process.svg
+
+*Figure 4: FP8 quantization with current scaling recipe - two tensor reads are needed, one to compute amax and one to apply the scaling factor and cast to FP8.*
+
+
+Transpose handling
+------------------
+
+
+
+*Ada and Hopper*
+
+On Ada and Hopper, the backward pass requires a transposed FP8 tensor.
+The columnwise layout is physically different from the rowwise layout, so a transpose operation is needed.
+All 3 options from :ref:`Performance Considerations Transpose handling section <handling_transposes>` are supported.
+
+*Blackwell and later*
+
+Blackwell hardware supports multiple GEMM layouts natively, eliminating the need for explicit transposes.
+The rowwise and columnwise tensors share the same physical memory layout.
+
+.. figure:: ../performance_considerations/img/hopper_vs_blackwell_layout.svg
+   :align: center
+   :alt: Comparison of rowwise and columnwise tensor layouts on Blackwell vs Hopper
+
+   *Figure 6: On Blackwell, rowwise and columnwise usages share the same memory layout. On Hopper, columnwise usage requires a physical transpose.*
+
+
+Distributed training 
+--------------------
+
+**Quantized all-gather**
+
+FP8 all-gather is supported on all architectures (Ada and later).
+
+**Amax reduction**
+
+Tensors that are gathered across nodes (e.g. input and gradient in sequence parallelism) require amax synchronization before quantization.
+Each node computes its local ``amax``, then a reduction produces the global maximum across all nodes.
+All nodes use this synchronized amax to compute identical scaling factors, enabling quantized all-gather.
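
Why the synchronization matters can be seen in a single-process simulation. The sketch below is a hypothetical illustration rather than Transformer Engine code: the shard values are made up, and Python's built-in ``max`` over the per-rank values stands in for the collective maximum reduction that a real distributed run would perform.

```python
E4M3_MAX = 448.0


def local_amax(shard):
    """Absolute maximum of one rank's shard."""
    return max(abs(v) for v in shard)


# Hypothetical shards of one logical tensor, one list per rank.
shards = [
    [0.1, -2.0, 0.7],
    [5.0, -0.3, 1.2],
    [-0.9, 0.4, 3.5],
]

local = [local_amax(s) for s in shards]

# Without synchronization every rank would derive a different scale,
# so the gathered FP8 values would not share one scaling factor.
independent_scales = [E4M3_MAX / a for a in local]

# With synchronization, all ranks agree on the global amax first.
global_amax = max(local)          # stands in for a max-reduction across ranks
scale = E4M3_MAX / global_amax    # identical on every rank

print(local, global_amax, scale)
```

Because every rank quantizes with the same ``scale``, the FP8 shards can be concatenated by the all-gather and described by a single per-tensor scaling factor.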
+
+.. raw:: html
+   :file: img/fp8_current_scaling_all_gather.svg
+
+*Figure 7: Quantization and all-gather flow for FP8 current scaling showing amax computation and synchronization.*
+
+
+Supported devices
+-----------------
+
+Ada and later (SM 8.9+)
+
+Examples
+--------
+
+Here's how to use FP8 Current Scaling recipe in PyTorch and JAX:
+
+.. tabs::
+
+   .. tab:: PyTorch
+
+      .. raw:: html
+
+         <div style="background: #f0f4f8; border-left: 3px solid #5c7cfa; padding: 6px 12px; font-size: 13px; color: #495057; margin-bottom: 0; border-radius: 4px 4px 0 0;">
+            Requires SM89 (Ada) or later
+         </div>
+
+      .. literalinclude:: pytorch_current_scaling_example.py
+         :language: python
+         :start-after: # START_CURRENT_SCALING_EXAMPLE
+         :end-before: # END_CURRENT_SCALING_EXAMPLE
+
+   .. tab:: JAX
+
+      .. raw:: html
+
+         <div style="background: #f0f4f8; border-left: 3px solid #5c7cfa; padding: 6px 12px; font-size: 13px; color: #495057; margin-bottom: 0; border-radius: 4px 4px 0 0;">
+            Requires SM89 (Ada) or later
+         </div>
+
+      .. literalinclude:: jax_current_scaling_example.py
+         :language: python
+         :start-after: # START_CURRENT_SCALING_EXAMPLE
+         :end-before: # END_CURRENT_SCALING_EXAMPLE
+
+
+----
+
+Developer Notes
+---------------
+
+This section contains implementation details that may be useful for developers
+but are not required for using FP8 Current Scaling in practice.
+
+All-gather of columnwise tensors
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+On Blackwell and later, rowwise and columnwise tensors share the same memory layout,
+so all-gather of columnwise tensors is directly supported.
+
+For Hopper and Ada, all-gather of transposed FP8 tensors is not supported. 
+The rowwise tensor is gathered first, then transposed to columnwise format.
\ No newline at end of file
--- a/docs/features/low_precision_training/fp8_current_scaling/img/fp8_cast_process.svg
+++ b/docs/features/low_precision_training/fp8_current_scaling/img/fp8_cast_process.svg
+<?xml version="1.0" encoding="UTF-8"?>
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 900 220">
+  <defs>
+    <style>
+      @import url("../_static/css/diagram-colors.css");
+      .arrow {
+        stroke: #616161;
+        stroke-width: 2;
+        fill: none;
+        marker-end: url(#arrowhead-cast);
+      }
+    </style>
+    <marker id="arrowhead-cast" markerWidth="10" markerHeight="10" refX="8" refY="3" orient="auto" markerUnits="strokeWidth">
+      <polygon points="0 0, 10 3, 0 6" fill="#616161" />
+    </marker>
+  </defs>
+  
+  <!-- Title -->
+  <text x="450" y="30" class="title" text-anchor="middle">FP8 quantization </text>
+  
+  <!-- Step 1: High Precision Tensor -->
+  <rect x="80" y="80" width="140" height="70" class="hp" rx="6"/>
+  <text x="150" y="110" class="text" text-anchor="middle">High Precision</text>
+  <text x="150" y="130" class="text" text-anchor="middle">Tensor</text>
+  
+  <!-- Arrow 1 -->
+  <path d="M 220 115 L 270 115" class="arrow"/>
+  
+  <!-- Quantize container box -->
+  <rect x="270" y="60" width="330" height="130" class="quantize" rx="6"/>
+  <text x="435" y="205" class="text" style="font-weight: 600; font-size: 14px;" text-anchor="middle">Quantize</text>
+  
+  <!-- Step 2: Compute Amax (sub-box) -->
+  <rect x="280" y="95" width="140" height="50" class="amax" rx="4"/>
+  <text x="350" y="118" class="text" style="font-weight: 600;" text-anchor="middle">Compute amax</text>
+  <text x="350" y="160" class="small-text" text-anchor="middle">1 tensor read</text>
+  
+  <!-- Arrow 2 (inside quantize box) -->
+  <path d="M 420 120 L 450 120" class="arrow"/>
+  
+  <!-- Step 3: Apply Scale + Cast (sub-box) -->
+  <rect x="450" y="95" width="140" height="50" class="quantize" rx="4"/>
+  <text x="520" y="115" class="text" style="font-weight: 600;" text-anchor="middle">Apply Scale</text>
+  <text x="520" y="130" class="text" style="font-weight: 600;" text-anchor="middle">+ Cast</text>
+  <text x="520" y="160" class="small-text" text-anchor="middle">1 tensor read</text>
+  
+  <!-- Arrow 3 -->
+  <path d="M 600 115 L 650 115" class="arrow"/>
+  
+  <!-- Step 4: FP8 Tensor -->
+  <rect x="650" y="80" width="140" height="70" class="fp8" rx="6"/>
+  <text x="720" y="110" class="text" text-anchor="middle">FP8</text>
+  <text x="720" y="130" class="text" text-anchor="middle">Tensor</text>
+  
+</svg>
--- a/docs/features/low_precision_training/fp8_current_scaling/img/fp8_current_scaling_all_gather.svg
+++ b/docs/features/low_precision_training/fp8_current_scaling/img/fp8_current_scaling_all_gather.svg
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 950 170" width="950" height="170">
+  <defs>
+    <style>
+      @import url("../_static/css/diagram-colors.css");
+      
+      /* Arrows */
+      .arrow { stroke: #616161; stroke-width: 2; fill: none; marker-end: url(#arrowhead-ag); }
+      
+      /* All-gather operations - fallback if CSS doesn't load */
+      .allgather {
+        fill: #e1f5fe;
+        stroke: #039be5;
+        stroke-width: 2;
+      }
+    </style>
+    <marker id="arrowhead-ag" markerWidth="6" markerHeight="6" refX="5" refY="2" orient="auto">
+      <polygon points="0 0, 6 2, 0 4" fill="#616161" />
+    </marker>
+  </defs>
+  
+  <!-- Title -->
+  <text x="475" y="30" class="title">Quantization + all gather for FP8 current scaling</text>
+  
+  <!-- High Precision Tensor -->
+  <rect x="30" y="80" width="110" height="55" class="hp" rx="6"/>
+  <text x="85" y="103" class="text">High Precision</text>
+  <text x="85" y="120" class="text">Tensor</text>
+  
+  <!-- Arrow -->
+  <path d="M 140 107 L 165 107" class="arrow"/>
+  
+  <!-- Compute Amax -->
+  <rect x="165" y="80" width="100" height="55" class="amax" rx="6"/>
+  <text x="215" y="103" class="text">Compute</text>
+  <text x="215" y="120" class="text">Amax</text>
+  
+  <!-- Arrow -->
+  <path d="M 265 107 L 290 107" class="arrow"/>
+  
+  <!-- Synchronize Amax -->
+  <rect x="290" y="80" width="100" height="55" class="amax" rx="6"/>
+  <text x="340" y="103" class="text">Synchronize</text>
+  <text x="340" y="120" class="text">Amax</text>
+  
+  <!-- Arrow -->
+  <path d="M 390 107 L 415 107" class="arrow"/>
+  
+  <!-- Scale + Cast -->
+  <rect x="415" y="80" width="100" height="55" class="quantize" rx="6"/>
+  <text x="465" y="103" class="text">Scale +</text>
+  <text x="465" y="120" class="text">Cast</text>
+  
+  <!-- Arrow -->
+  <path d="M 515 107 L 540 107" class="arrow"/>
+  
+  <!-- FP8 Tensor (intermediate) -->
+  <rect x="540" y="80" width="100" height="55" class="fp8" rx="6"/>
+  <text x="590" y="103" class="text">FP8</text>
+  <text x="590" y="120" class="text">Tensor</text>
+  
+  <!-- Arrow -->
+  <path d="M 640 107 L 665 107" class="arrow"/>
+  
+  <!-- All-Gather -->
+  <rect x="665" y="80" width="100" height="55" class="allgather" rx="6"/>
+  <text x="715" y="112" class="text">All-Gather</text>
+  
+  <!-- Arrow -->
+  <path d="M 765 107 L 790 107" class="arrow"/>
+  
+  <!-- FP8 Gathered Tensor -->
+  <rect x="790" y="80" width="130" height="55" class="fp8" rx="6"/>
+  <text x="855" y="103" class="text">FP8 Gathered</text>
+  <text x="855" y="120" class="text">Tensor</text>
+  
+</svg>
+
--- a/docs/features/low_precision_training/fp8_current_scaling/img/fp8_formats.svg
+++ b/docs/features/low_precision_training/fp8_current_scaling/img/fp8_formats.svg
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 900 280">
+  <defs>
+    <style>
+      @import url("../_static/css/diagram-colors.css");
+      .sign-bit { fill: #9db4d0; stroke: #333; stroke-width: 1; }
+      .exponent-bit { fill: #d9a066; stroke: #333; stroke-width: 1; }
+      .mantissa-bit { fill: #a8d99c; stroke: #333; stroke-width: 1; }
+      .bit-text { fill: #000; text-anchor: middle; dominant-baseline: middle; font-size: 16px; }
+      .header-text { fill: #555; font-weight: normal; text-anchor: middle; font-size: 18px; }
+      .value-text { fill: #333; font-size: 18px; }
+      .format-label { fill: #333; font-weight: bold; text-anchor: middle; dominant-baseline: middle; font-size: 20px; }
+    </style>
+  </defs>
+  
+  <!-- Header labels - centered -->
+  <text x="149" y="18" class="header-text">sign</text>
+  <text x="220" y="18" class="header-text">exponent</text>
+  <text x="420" y="18" class="header-text">mantissa</text>
+  
+  <!-- FP16 Format (16 bits: 1 + 5 + 10) -->
+  <text x="60" y="60" class="format-label">FP16</text>
+  
+  <!-- Sign bit (1) -->
+  <rect x="140" y="45" width="18" height="30" class="sign-bit"/>
+  <text x="149" y="60" class="bit-text">0</text>
+  
+  <!-- Exponent bits (5) -->
+  <rect x="163" y="45" width="18" height="30" class="exponent-bit"/>
+  <text x="172" y="60" class="bit-text">0</text>
+  <rect x="186" y="45" width="18" height="30" class="exponent-bit"/>
+  <text x="195" y="60" class="bit-text">1</text>
+  <rect x="209" y="45" width="18" height="30" class="exponent-bit"/>
+  <text x="218" y="60" class="bit-text">1</text>
+  <rect x="232" y="45" width="18" height="30" class="exponent-bit"/>
+  <text x="241" y="60" class="bit-text">0</text>
+  <rect x="255" y="45" width="18" height="30" class="exponent-bit"/>
+  <text x="264" y="60" class="bit-text">1</text>
+  
+  <!-- Mantissa bits (10) -->
+  <rect x="278" y="45" width="18" height="30" class="mantissa-bit"/>
+  <text x="287" y="60" class="bit-text">1</text>
+  <rect x="301" y="45" width="18" height="30" class="mantissa-bit"/>
+  <text x="310" y="60" class="bit-text">0</text>
+  <rect x="324" y="45" width="18" height="30" class="mantissa-bit"/>
+  <text x="333" y="60" class="bit-text">0</text>
+  <rect x="347" y="45" width="18" height="30" class="mantissa-bit"/>
+  <text x="356" y="60" class="bit-text">1</text>
+  <rect x="370" y="45" width="18" height="30" class="mantissa-bit"/>
+  <text x="379" y="60" class="bit-text">0</text>
+  <rect x="393" y="45" width="18" height="30" class="mantissa-bit"/>
+  <text x="402" y="60" class="bit-text">1</text>
+  <rect x="416" y="45" width="18" height="30" class="mantissa-bit"/>
+  <text x="425" y="60" class="bit-text">0</text>
+  <rect x="439" y="45" width="18" height="30" class="mantissa-bit"/>
+  <text x="448" y="60" class="bit-text">0</text>
+  <rect x="462" y="45" width="18" height="30" class="mantissa-bit"/>
+  <text x="471" y="60" class="bit-text">1</text>
+  <rect x="485" y="45" width="18" height="30" class="mantissa-bit"/>
+  <text x="494" y="60" class="bit-text">1</text>
+  
+  <text x="540" y="60" class="value-text">= 0.395264</text>
+  
+  
+  <!-- BF16 Format (16 bits: 1 + 8 + 7) -->
+  <text x="60" y="120" class="format-label">BF16</text>
+  
+  <!-- Sign bit (1) -->
+  <rect x="140" y="105" width="18" height="30" class="sign-bit"/>
+  <text x="149" y="120" class="bit-text">0</text>
+  
+  <!-- Exponent bits (8) -->
+  <rect x="163" y="105" width="18" height="30" class="exponent-bit"/>
+  <text x="172" y="120" class="bit-text">0</text>
+  <rect x="186" y="105" width="18" height="30" class="exponent-bit"/>
+  <text x="195" y="120" class="bit-text">1</text>
+  <rect x="209" y="105" width="18" height="30" class="exponent-bit"/>
+  <text x="218" y="120" class="bit-text">1</text>
+  <rect x="232" y="105" width="18" height="30" class="exponent-bit"/>
+  <text x="241" y="120" class="bit-text">1</text>
+  <rect x="255" y="105" width="18" height="30" class="exponent-bit"/>
+  <text x="264" y="120" class="bit-text">1</text>
+  <rect x="278" y="105" width="18" height="30" class="exponent-bit"/>
+  <text x="287" y="120" class="bit-text">1</text>
+  <rect x="301" y="105" width="18" height="30" class="exponent-bit"/>
+  <text x="310" y="120" class="bit-text">0</text>
+  <rect x="324" y="105" width="18" height="30" class="exponent-bit"/>
+  <text x="333" y="120" class="bit-text">1</text>
+  
+  <!-- Mantissa bits (7) -->
+  <rect x="347" y="105" width="18" height="30" class="mantissa-bit"/>
+  <text x="356" y="120" class="bit-text">1</text>
+  <rect x="370" y="105" width="18" height="30" class="mantissa-bit"/>
+  <text x="379" y="120" class="bit-text">0</text>
+  <rect x="393" y="105" width="18" height="30" class="mantissa-bit"/>
+  <text x="402" y="120" class="bit-text">0</text>
+  <rect x="416" y="105" width="18" height="30" class="mantissa-bit"/>
+  <text x="425" y="120" class="bit-text">1</text>
+  <rect x="439" y="105" width="18" height="30" class="mantissa-bit"/>
+  <text x="448" y="120" class="bit-text">0</text>
+  <rect x="462" y="105" width="18" height="30" class="mantissa-bit"/>
+  <text x="471" y="120" class="bit-text">1</text>
+  <rect x="485" y="105" width="18" height="30" class="mantissa-bit"/>
+  <text x="494" y="120" class="bit-text">0</text>
+  
+  <text x="540" y="120" class="value-text">= 0.394531</text>
+  
+  
+  <!-- FP8 E4M3 Format (8 bits: 1 + 4 + 3) -->
+  <text x="60" y="180" class="format-label">FP8 E4M3</text>
+  
+  <!-- Sign bit (1) -->
+  <rect x="140" y="165" width="18" height="30" class="sign-bit"/>
+  <text x="149" y="180" class="bit-text">0</text>
+  
+  <!-- Exponent bits (4) -->
+  <rect x="163" y="165" width="18" height="30" class="exponent-bit"/>
+  <text x="172" y="180" class="bit-text">0</text>
+  <rect x="186" y="165" width="18" height="30" class="exponent-bit"/>
+  <text x="195" y="180" class="bit-text">1</text>
+  <rect x="209" y="165" width="18" height="30" class="exponent-bit"/>
+  <text x="218" y="180" class="bit-text">0</text>
+  <rect x="232" y="165" width="18" height="30" class="exponent-bit"/>
+  <text x="241" y="180" class="bit-text">1</text>
+  
+  <!-- Mantissa bits (3) -->
+  <rect x="255" y="165" width="18" height="30" class="mantissa-bit"/>
+  <text x="264" y="180" class="bit-text">1</text>
+  <rect x="278" y="165" width="18" height="30" class="mantissa-bit"/>
+  <text x="287" y="180" class="bit-text">0</text>
+  <rect x="301" y="165" width="18" height="30" class="mantissa-bit"/>
+  <text x="310" y="180" class="bit-text">1</text>
+  
+  <text x="355" y="180" class="value-text">= 0.40625</text>
+  
+  
+  <!-- FP8 E5M2 Format (8 bits: 1 + 5 + 2) -->
+  <text x="60" y="240" class="format-label">FP8 E5M2</text>
+  
+  <!-- Sign bit (1) -->
+  <rect x="140" y="225" width="18" height="30" class="sign-bit"/>
+  <text x="149" y="240" class="bit-text">0</text>
+  
+  <!-- Exponent bits (5) -->
+  <rect x="163" y="225" width="18" height="30" class="exponent-bit"/>
+  <text x="172" y="240" class="bit-text">0</text>
+  <rect x="186" y="225" width="18" height="30" class="exponent-bit"/>
+  <text x="195" y="240" class="bit-text">1</text>
+  <rect x="209" y="225" width="18" height="30" class="exponent-bit"/>
+  <text x="218" y="240" class="bit-text">1</text>
+  <rect x="232" y="225" width="18" height="30" class="exponent-bit"/>
+  <text x="241" y="240" class="bit-text">0</text>
+  <rect x="255" y="225" width="18" height="30" class="exponent-bit"/>
+  <text x="264" y="240" class="bit-text">1</text>
+  
+  <!-- Mantissa bits (2) -->
+  <rect x="278" y="225" width="18" height="30" class="mantissa-bit"/>
+  <text x="287" y="240" class="bit-text">1</text>
+  <rect x="301" y="225" width="18" height="30" class="mantissa-bit"/>
+  <text x="310" y="240" class="bit-text">0</text>
+  
+  <text x="355" y="240" class="value-text">= 0.375</text>
+  
+</svg>
+
--- a/docs/features/low_precision_training/fp8_current_scaling/img/fp8_scaling_concept.svg
+++ b/docs/features/low_precision_training/fp8_current_scaling/img/fp8_scaling_concept.svg
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 900 380">
+  <style>
+    @import url("../_static/css/diagram-colors.css");
+    
+    .axis-line { stroke: #333; stroke-width: 2.5; }
+    .value-dot { fill: #2196f3; stroke: #1976d2; stroke-width: 1; }
+    .arrow { fill: #4caf50; }
+    .arrow-line { stroke: #4caf50; stroke-width: 3; }
+    .range-label { font-size: 14px; fill: #555; font-weight: 500; }
+  </style>
+
+  <!-- Top: Original values (before scaling) -->
+  <text x="450" y="55" class="section-title" text-anchor="middle">Original Tensor Values</text>
+  
+  <!-- Top axis -->
+  <line x1="80" y1="85" x2="820" y2="85" class="axis-line"/>
+  
+  <!-- Zero marker (center) -->
+  <line x1="450" y1="80" x2="450" y2="90" stroke="#333" stroke-width="2"/>
+  <text x="450" y="108" class="text" text-anchor="middle" font-size="12px">0</text>
+  
+  <!-- Value dots (before scaling - irregular, not symmetric around zero) -->
+  <circle cx="118" cy="85" r="6" fill="#e53935" stroke="#c62828" stroke-width="2"/>
+  <circle cx="159" cy="85" r="5" class="value-dot"/>
+  <circle cx="167" cy="85" r="5" class="value-dot"/>
+  <circle cx="187" cy="85" r="5" class="value-dot"/>
+  <circle cx="199" cy="85" r="5" class="value-dot"/>
+  <circle cx="228" cy="85" r="5" class="value-dot"/>
+  <circle cx="326" cy="85" r="5" class="value-dot"/>
+  <circle cx="368" cy="85" r="5" class="value-dot"/>
+  <circle cx="442" cy="85" r="5" class="value-dot"/>
+  <circle cx="621" cy="85" r="5" class="value-dot"/>
+  <circle cx="649" cy="85" r="5" class="value-dot"/>
+  <circle cx="725" cy="85" r="5" class="value-dot"/>
+  
+  <!-- amax label -->
+  <text x="118" y="70" class="text" fill="#e53935" font-weight="700" font-size="14px" text-anchor="middle">amax</text>
+  
+  <!-- Original range bracket spanning all values -->
+  <line x1="118" y1="100" x2="118" y2="110" stroke="#666" stroke-width="1.5"/>
+  <line x1="118" y1="110" x2="725" y2="110" stroke="#666" stroke-width="1.5"/>
+  <line x1="725" y1="100" x2="725" y2="110" stroke="#666" stroke-width="1.5"/>
+  <text x="750" y="114" class="range-label" text-anchor="start">Original range</text>
+  
+  <!-- Trapezoid showing compression from original range to FP8 range -->
+  <polygon points="118,115 725,115 650,165 250,165" fill="#e53935" opacity="0.2" stroke="#e53935" stroke-width="1.5"/>
+  
+  <!-- Bottom: After scaling -->
+  <text x="450" y="190" class="section-title" text-anchor="middle">Scaled Values (fit FP8 range)</text>
+  
+  <!-- Bottom axis -->
+  <line x1="80" y1="220" x2="820" y2="220" class="axis-line"/>
+  
+  <!-- Zero marker (center) -->
+  <line x1="450" y1="215" x2="450" y2="225" stroke="#333" stroke-width="2"/>
+  <text x="450" y="238" class="text" text-anchor="middle" font-size="12px">0</text>
+  
+  <!-- FP8 range bracket -->
+  <line x1="250" y1="245" x2="250" y2="255" stroke="#4caf50" stroke-width="1.5"/>
+  <line x1="250" y1="255" x2="650" y2="255" stroke="#4caf50" stroke-width="1.5"/>
+  <line x1="650" y1="245" x2="650" y2="255" stroke="#4caf50" stroke-width="1.5"/>
+  <text x="750" y="259" class="range-label" text-anchor="start" fill="#4caf50">FP8 range</text>
+  
+  <!-- Value dots (after scaling - homogeneous scaling from zero, all fit into FP8 range) -->
+  <circle cx="250" cy="220" r="6" fill="#e53935" stroke="#c62828" stroke-width="2"/>
+  <text x="250" y="205" class="text" fill="#e53935" font-weight="700" font-size="12px" text-anchor="middle">- FP8 range max</text>
+  <circle cx="275" cy="220" r="5" class="value-dot"/>
+  <circle cx="280" cy="220" r="5" class="value-dot"/>
+  <circle cx="292" cy="220" r="5" class="value-dot"/>
+  <circle cx="299" cy="220" r="5" class="value-dot"/>
+  <circle cx="316" cy="220" r="5" class="value-dot"/>
+  <circle cx="375" cy="220" r="5" class="value-dot"/>
+  <circle cx="401" cy="220" r="5" class="value-dot"/>
+  <circle cx="445" cy="220" r="5" class="value-dot"/>
+  <circle cx="553" cy="220" r="5" class="value-dot"/>
+  <circle cx="569" cy="220" r="5" class="value-dot"/>
+  <circle cx="615" cy="220" r="5" class="value-dot"/>
+
+  <!-- Third line: After cast to FP8 (quantized values) -->
+  <text x="450" y="290" class="section-title" text-anchor="middle">Cast to FP8 (quantized values)</text>
+  
+  <!-- Third axis -->
+  <line x1="80" y1="320" x2="820" y2="320" class="axis-line"/>
+  
+  <!-- Zero marker (center) -->
+  <line x1="450" y1="315" x2="450" y2="325" stroke="#333" stroke-width="2"/>
+  <text x="450" y="338" class="text" text-anchor="middle" font-size="12px">0</text>
+  
+  <!-- FP8 range bracket -->
+  <line x1="250" y1="345" x2="250" y2="355" stroke="#4caf50" stroke-width="1.5"/>
+  <line x1="250" y1="355" x2="650" y2="355" stroke="#4caf50" stroke-width="1.5"/>
+  <line x1="650" y1="345" x2="650" y2="355" stroke="#4caf50" stroke-width="1.5"/>
+  <text x="750" y="359" class="range-label" text-anchor="start" fill="#4caf50">FP8 range</text>
+  
+  <!-- Quantized dots - merged close values to show FP8 granularity -->
+  <circle cx="250" cy="320" r="6" fill="#e53935" stroke="#c62828" stroke-width="2"/>
+  <!-- merged: 275+280 -->
+  <circle cx="278" cy="317" r="4.5" class="value-dot"/>
+  <circle cx="278" cy="323" r="4.5" class="value-dot"/>
+  <!-- merged: 292+299 -->
+  <circle cx="296" cy="317" r="4.5" class="value-dot"/>
+  <circle cx="296" cy="323" r="4.5" class="value-dot"/>
+  <circle cx="318" cy="320" r="5" class="value-dot"/>
+  <circle cx="378" cy="320" r="5" class="value-dot"/>
+  <circle cx="404" cy="320" r="5" class="value-dot"/>
+  <circle cx="450" cy="320" r="5" class="value-dot"/>
+  <!-- merged: 553+569 -->
+  <circle cx="562" cy="317" r="4.5" class="value-dot"/>
+  <circle cx="562" cy="323" r="4.5" class="value-dot"/>
+  <circle cx="615" cy="320" r="5" class="value-dot"/>
+
+</svg>
--- a/docs/features/low_precision_training/fp8_current_scaling/jax_current_scaling_example.py
+++ b/docs/features/low_precision_training/fp8_current_scaling/jax_current_scaling_example.py
+# Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# See LICENSE for license information.
+
+# START_CURRENT_SCALING_EXAMPLE
+
+import jax
+import jax.numpy as jnp
+import transformer_engine.jax as te
+from transformer_engine.jax.flax import DenseGeneral
+from transformer_engine.common.recipe import Float8CurrentScaling, Format
+
+# Create FP8 Current Scaling recipe
+# Available formats:
+#   - Format.HYBRID (default) -- E4M3 for forward pass, E5M2 for backward pass
+#   - Format.E4M3 -- E4M3 for both forward and backward pass
+recipe = Float8CurrentScaling(fp8_format=Format.HYBRID)
+
+with te.autocast(enabled=True, recipe=recipe):
+    # Create and initialize layer
+    layer = DenseGeneral(features=1024)
+    key = jax.random.PRNGKey(0)
+    x = jax.random.normal(key, (32, 128, 1024), dtype=jnp.bfloat16)
+    var_collect = layer.init(key, x)
+
+    # Forward and backward pass
+    def loss_fn(var_collect):
+        output = layer.apply(var_collect, x)
+        return output.sum()
+
+    loss, grads = jax.value_and_grad(loss_fn)(var_collect)
+
+# END_CURRENT_SCALING_EXAMPLE
--- a/docs/features/low_precision_training/fp8_current_scaling/pytorch_current_scaling_example.py
+++ b/docs/features/low_precision_training/fp8_current_scaling/pytorch_current_scaling_example.py
+# Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# See LICENSE for license information.
+
+# START_CURRENT_SCALING_EXAMPLE
+
+import torch
+import transformer_engine.pytorch as te
+from transformer_engine.common.recipe import Float8CurrentScaling, Format
+
+# Create FP8 Current Scaling recipe
+# Available formats:
+#   - Format.HYBRID (default) -- E4M3 for forward pass, E5M2 for backward pass
+#   - Format.E4M3 -- E4M3 for both forward and backward pass
+recipe = Float8CurrentScaling(fp8_format=Format.HYBRID)
+
+# Create a simple linear layer with bfloat16 parameters
+layer = te.Linear(1024, 1024, params_dtype=torch.bfloat16)
+
+# Forward and backward pass
+inp = torch.randn(32, 128, 1024, dtype=torch.bfloat16, device="cuda")
+
+with te.autocast(enabled=True, recipe=recipe):
+    output = layer(inp)
+    loss = output.sum()
+
+loss.backward()
+
+# END_CURRENT_SCALING_EXAMPLE
--- a/docs/features/low_precision_training/fp8_delayed_scaling/fp8_delayed_scaling.rst
+++ b/docs/features/low_precision_training/fp8_delayed_scaling/fp8_delayed_scaling.rst
+..
+    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+    See LICENSE for license information.
+
+FP8 Delayed Scaling
+===================================
+
+The FP8 Delayed Scaling recipe estimates scaling factors from historical amax values rather than
+computing them from the tensor being quantized. Compared to the FP8 Current Scaling recipe,
+this reduces the number of tensor reads per quantization from two to one,
+which lowers memory traffic.
+
+Both this recipe and the :doc:`FP8 Current Scaling <../fp8_current_scaling/fp8_current_scaling>` recipe use
+the same FP8 formats (E4M3/E5M2) with one FP32 scaling factor per tensor.
+Reading the FP8 Current Scaling documentation first is recommended.
+
+Quantization with delayed scaling factors
+-----------------------------------------
+
+FP8 Current Scaling requires two tensor reads per quantization: one to compute amax and
+one to cast. FP8 Delayed Scaling eliminates the first read by predicting the scaling factor
+from historical amax values, hence *delayed* (using past values) versus *current* (using present values).
+
+The quantization process works as follows:
+
+1. **Compute scaling factor from history** (no tensor read needed):
+   The scaling factor is derived from stored ``amax_history`` using the formula:
+   
+   ``scaling_factor = FP8_MAX / amax``
+   
+   where ``amax`` is computed from the history using either the ``max`` algorithm
+   (maximum over the history window, the default) or the ``most_recent`` algorithm.
+
+2. **Quantize the tensor** (one tensor read):
+   Apply the scaling factor and cast to FP8. Values exceeding the FP8 range are clipped.
+
+3. **Update history**:
+   Record the actual amax from this quantization for future iterations.
+
+Each module maintains an ``amax_history`` tensor of configurable length (``amax_history_len``) 
+for each quantized tensor.
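The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the Transformer Engine implementation: the function names are hypothetical, the history is a simple list ordered oldest to newest, the E4M3 maximum of 448 is used as ``FP8_MAX``, the fallback scale of 1.0 for an all-zero history is a choice made for this sketch, and rounding to FP8 is omitted.

```python
# Framework-free sketch of per-tensor delayed scaling.
# All names are illustrative; none of them are Transformer Engine APIs.

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3


def amax_from_history(history, algo="max"):
    """Pick the amax used for scaling from previously recorded values."""
    if algo == "max":
        return max(history)
    if algo == "most_recent":
        return history[-1]
    raise ValueError(f"unknown algorithm: {algo}")


def quantize_delayed(values, history, algo="max"):
    """Scale and clip `values` using history only; also report the observed amax."""
    # Step 1: scaling factor from history (the tensor is not read here).
    amax = amax_from_history(history, algo)
    # Assumption of this sketch: fall back to 1.0 when the history is all zeros.
    scale = FP8_E4M3_MAX / amax if amax > 0.0 else 1.0

    # Step 2: a single pass over the tensor that scales, clips, and tracks amax.
    scaled = []
    observed_amax = 0.0
    for v in values:
        observed_amax = max(observed_amax, abs(v))
        s = v * scale
        scaled.append(max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, s)))

    # Step 3: the caller records `observed_amax` in the history for later iterations.
    return scaled, scale, observed_amax


history = [2.0, 4.0, 3.0]   # oldest -> newest
tensor = [0.5, -8.0, 1.0]   # current amax (8.0) exceeds the history maximum (4.0)
scaled, scale, observed = quantize_delayed(tensor, history)
```

With these inputs the scale is ``448 / 4 = 112``, so ``-8.0`` maps to ``-896`` and is clipped to ``-448``. The clipped value shows the cost of a stale history: when the current amax outgrows the recorded values, part of the tensor saturates until the history catches up.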
+
+.. raw:: html
+   :file: img/scaling_comparison.svg
+
+*Figure 1. Comparison of FP8 Current Scaling and FP8 Delayed Scaling quantization processes.*
+
+Amax History Management
+-----------------------
+
+The ``amax_history`` buffer acts as a sliding window of recent amax values.
+Position 0 serves as a staging area for the current amax, while positions 1 to N-1 
+store the history from oldest to newest. Each quantization writes the observed amax 
+to position 0, and after the pass completes, the history is rotated:
+
+.. code-block:: text
+
+   Before rotation: [amax_N, amax_1, amax_2, ..., amax_N-1]   (amax_N = current, amax_1 = oldest)
+   After rotation:  [0,      amax_2, ..., amax_N-1, amax_N]   (amax_1 dropped, amax_N appended)
+
+The scaling factor is computed **before** the rotation, so it uses all ``amax_history_len`` values,
+including the amax staged at position 0. Position 0 is zeroed after the scale update, ready for the next iteration's amax.
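The rotation can be mimicked with a short Python list manipulation. This is a sketch of the layout described above, assuming the ``max`` algorithm and a history of at least two entries; ``update_and_rotate`` is a hypothetical helper, not a Transformer Engine function.

```python
def update_and_rotate(history):
    """Return (amax_for_scale, rotated_history) for one amax_history buffer.

    Layout follows the text: index 0 holds the amax just observed,
    indices 1..N-1 hold older values ordered from oldest to newest.
    """
    if len(history) < 2:
        raise ValueError("this sketch expects a history of at least two entries")
    # The scale is derived before rotating, so all N entries take part.
    amax_for_scale = max(history)
    # Zero the staging slot, drop the oldest entry (index 1), append the staged amax.
    rotated = [0.0] + history[2:] + [history[0]]
    return amax_for_scale, rotated


history = [5.0, 1.0, 2.0, 3.0]   # [current, oldest, ..., newest]
amax, history = update_and_rotate(history)
```

After the call, ``amax`` is ``5.0`` and ``history`` is ``[0.0, 2.0, 3.0, 5.0]``: the oldest value ``1.0`` has been dropped, the staged ``5.0`` has become the newest entry, and position 0 is ready for the next iteration.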
+
+The implementation differs between PyTorch and JAX:
+
+.. tabs::
+
+   .. tab:: PyTorch
+
+      Each module creates two ``amax_history`` tensors, initialized to zero:
+      
+      - Forward: shape ``(amax_history_len, num_gemms * 3)`` — three FP8 tensors per GEMM (input, weight, output)
+      - Backward: shape ``(amax_history_len, num_gemms * 2)`` — two FP8 tensors per GEMM (grad_output, grad_input)
+      
+      When the autocast context exits, a single CUDA kernel processes all tensors at once — 
+      performing amax reduction across GPUs and history rotation. This batched approach 
+      minimizes kernel launch overhead compared to updating each tensor separately.
+
+   .. tab:: JAX
+
+      Each quantizer maintains its own ``amax_history`` with shape ``(amax_history_len,)``
+      and updates independently.
+
+Here's how to use FP8 Delayed Scaling in PyTorch and JAX:
+
+.. tabs::
+
+   .. tab:: PyTorch
+
+      .. raw:: html
+
+         <div style="background: #f0f4f8; border-left: 3px solid #5c7cfa; padding: 6px 12px; font-size: 13px; color: #495057; margin-bottom: 0; border-radius: 4px 4px 0 0;">
+            Requires SM89 (Ada) or later
+         </div>
+
+      .. literalinclude:: pytorch_delayed_scaling_example.py
+         :language: python
+         :start-after: # START_DELAYED_SCALING_EXAMPLE
+         :end-before: # END_DELAYED_SCALING_EXAMPLE
+
+   .. tab:: JAX
+
+      .. raw:: html
+
+         <div style="background: #f0f4f8; border-left: 3px solid #5c7cfa; padding: 6px 12px; font-size: 13px; color: #495057; margin-bottom: 0; border-radius: 4px 4px 0 0;">
+            Requires SM89 (Ada) or later
+         </div>
+
+      .. literalinclude:: jax_delayed_scaling_example.py
+         :language: python
+         :start-after: # START_DELAYED_SCALING_EXAMPLE
+         :end-before: # END_DELAYED_SCALING_EXAMPLE
+
+
+Distributed Training
+--------------------
+
+FP8 Delayed Scaling uses the same data formats as FP8 Current Scaling, so quantized all-gather is supported.
+However, amax reduction is handled differently by each framework.
+
+.. tabs::
+
+   .. tab:: PyTorch
+
+      Amax reduction is controlled by two parameters:
+      
+      - ``reduce_amax`` in recipe: enables/disables reduction (required for SP and CP)
+      - ``amax_reduction_group`` in ``autocast``: specifies the process group for reduction
+      
+      We recommend reducing amax across all GPUs where the tensor is sharded, 
+      including data parallel ranks.
+
+      .. literalinclude:: pytorch_delayed_scaling_distributed_example.py
+         :language: python
+         :start-after: # START_AMAX_REDUCTION_EXAMPLE
+         :end-before: # END_AMAX_REDUCTION_EXAMPLE
+
+      In data parallel training, some modules may not execute on certain ranks 
+      (e.g., MoE experts that receive no tokens). This is handled as follows:
+      
+      - **First iteration**: All modules must execute on all ranks to register 
+        their ``amax_history`` tensors in the global buffer. Mismatched registration
+        would cause the ``all_reduce`` to hang due to different tensor sizes across ranks.
+      - **Subsequent iterations**: The ``autocast`` context must be entered and exited
+        on all ranks (this triggers the collective reduction). Individual modules can be
+        skipped: if no rank executes a module, its history is not rotated and its scale
+        remains unchanged.
+
+
+   .. tab:: JAX
+
+      Amax reduction is always enabled and managed automatically.
+      Reduction scope: all parallelism axes except pipeline parallelism (TP, SP, DP/FSDP).
+
+      .. literalinclude:: jax_delayed_scaling_distributed_example.py
+         :language: python
+         :start-after: # START_AMAX_REDUCTION_EXAMPLE
+         :end-before: # END_AMAX_REDUCTION_EXAMPLE
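The effect of amax reduction can be illustrated without any distributed runtime. The sketch below simulates the collective in a single process, assuming the reduction is an elementwise maximum over the per-tensor amax values of all participating ranks. ``simulated_amax_all_reduce`` is a hypothetical helper, and the zero amax for a skipped module is an assumption of the sketch.

```python
def simulated_amax_all_reduce(per_rank_amax):
    """Elementwise max across ranks; every rank receives the same result."""
    sizes = {len(amax) for amax in per_rank_amax}
    if len(sizes) != 1:
        # A real collective could hang on mismatched sizes; the sketch raises instead.
        raise ValueError(f"ranks registered different numbers of tensors: {sorted(sizes)}")
    reduced = [max(column) for column in zip(*per_rank_amax)]
    return [list(reduced) for _ in per_rank_amax]


# Two ranks, three quantized tensors each.
local = [
    [1.0, 0.5, 3.0],    # rank 0
    [2.0, 0.25, 0.0],   # rank 1 skipped the third tensor (amax assumed to be 0)
]
reduced = simulated_amax_all_reduce(local)
```

Both ranks end up with ``[2.0, 0.5, 3.0]``, so they derive identical scaling factors for each tensor. Passing lists of different lengths raises an error, mirroring the requirement that every rank registers the same set of tensors before the first reduction.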
+
+Supported devices
+-----------------
+
+Ada and later (SM 8.9+)
\ No newline at end of file
--- a/docs/features/low_precision_training/fp8_delayed_scaling/img/scaling_comparison.svg
+++ b/docs/features/low_precision_training/fp8_delayed_scaling/img/scaling_comparison.svg
+<?xml version="1.0" encoding="UTF-8"?>
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 1000 420">
+  <defs>
+    <style>
+      /* Common styles loaded from diagram-colors.css: .hp, .fp8, .quantize, .amax, .text, .title, .label, .box-orange, .box-dashed */
+      /* Diagram-specific styles for arrows */
+      .arrow {
+        stroke: #616161;
+        stroke-width: 2;
+        fill: none;
+        marker-end: url(#arrowhead);
+      }
+    </style>
+    <marker id="arrowhead" markerWidth="10" markerHeight="10" refX="8" refY="3" orient="auto" markerUnits="strokeWidth">
+      <polygon points="0 0, 10 3, 0 6" fill="#616161" />
+    </marker>
+  </defs>
+  
+  <!-- Current Scaling Section -->
+  <text x="250" y="30" class="title">Current Scaling</text>
+  
+  <!-- Tensor box -->
+  <rect x="150" y="60" width="200" height="60" class="hp" rx="5"/>
+  <text x="250" y="95" class="text">Tensor</text>
+  
+  <!-- Arrow to amax computation -->
+  <path d="M 250 120 L 250 160" class="arrow"/>
+  
+  <!-- Amax computation box -->
+  <rect x="150" y="160" width="200" height="60" class="amax" rx="5"/>
+  <text x="250" y="195" class="text">Amax Computation</text>
+  
+  <!-- Arrow to quantization -->
+  <path d="M 250 220 L 250 260" class="arrow"/>
+  
+  <!-- Quantization box -->
+  <rect x="125" y="260" width="250" height="60" class="quantize" rx="5"/>
+  <text x="250" y="285" class="text">Quantization</text>
+  <text x="250" y="305" class="label">(uses tensor + amax)</text>
+  
+  <!-- Arrow to FP8 tensor -->
+  <path d="M 250 320 L 250 360" class="arrow"/>
+  
+  <!-- FP8 Tensor result -->
+  <rect x="150" y="360" width="200" height="40" class="fp8" rx="5"/>
+  <text x="250" y="385" class="text">FP8 Tensor</text>
+  
+  
+  <!-- Delayed Scaling Section -->
+  <text x="750" y="30" class="title">Delayed Scaling</text>
+  
+  <!-- Tensor box with amax history subbox -->
+  <rect x="650" y="60" width="200" height="80" class="hp" rx="5"/>
+  <text x="750" y="90" class="text">Tensor</text>
+  
+  <!-- Amax history subbox (below tensor) -->
+  <rect x="660" y="110" width="180" height="25" class="box-orange box-dashed" rx="3"/>
+  <text x="750" y="127" class="label">amax history</text>
+  
+  <!-- Arrow to quantization -->
+  <path d="M 750 140 L 750 180" class="arrow"/>
+  <text x="820" y="162" class="small-text" style="text-anchor: start;">read amax</text>
+  
+  <!-- Quantization box -->
+  <rect x="625" y="180" width="250" height="80" class="quantize" rx="5"/>
+  <text x="750" y="210" class="text">Quantization</text>
+  <text x="750" y="230" class="label">(uses tensor + amax from history)</text>
+  <text x="750" y="250" class="label">(updates amax history)</text>
+  
+  <!-- Arrow back to history (curved) -->
+  <path d="M 625 220 Q 590 220 590 127 L 660 127" class="arrow"/>
+  <text x="565" y="175" class="small-text" style="text-anchor: end;">update amax</text>
+  
+  <!-- Arrow to FP8 tensor -->
+  <path d="M 750 260 L 750 300" class="arrow"/>
+  
+  <!-- FP8 Tensor result -->
+  <rect x="650" y="300" width="200" height="40" class="fp8" rx="5"/>
+  <text x="750" y="325" class="text">FP8 Tensor</text>
+  
+</svg>
+
--- a/docs/features/low_precision_training/fp8_delayed_scaling/jax_delayed_scaling_distributed_example.py
+++ b/docs/features/low_precision_training/fp8_delayed_scaling/jax_delayed_scaling_distributed_example.py
+# Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# See LICENSE for license information.
+
+# START_AMAX_REDUCTION_EXAMPLE
+import transformer_engine.jax as te
+from transformer_engine.common.recipe import DelayedScaling
+
+# Amax reduction scope is managed internally
+recipe = DelayedScaling(reduce_amax=True)  # Must be True in JAX
+
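+# `layer`, `params`, and `inp` are assumed to be defined earlier
+# (for example a Flax module, its initialized variables, and an input array).
+with te.autocast(enabled=True, recipe=recipe):
+    output = layer.apply(params, inp)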
+
+# END_AMAX_REDUCTION_EXAMPLE