Commit 3ceb248e authored by Paweł Gadziński, committed by GitHub

More detailed documentation for recipes (#2343)



* Code drop: Update recipes documentation and remove custom recipes from low precision training
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Fix SVG css import path for diagrams
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Refactor low_precision_training docs: remove optimizers, fix imports, add GPU checks

Changes:
- Remove optimizer code from all recipe examples (keep only forward/backward)
- Fix Format imports (use Format.E4M3 instead of string 'E4M3')
- Fix params_dtype for PyTorch examples (add params_dtype=torch.bfloat16)
- Add GPU capability assertions before START blocks for blockwise/mxfp8/nvfp4
- Fix JAX imports (Float8CurrentScaling from common.recipe, NVFP4BlockScaling)
- Add global_shard_guard for TransformerLayer examples in JAX
- Fix fused_layers_jax.py return tuple unpacking
- Update memory_usage JAX examples with dynamic GPU measurement
- Remove memory_usage_3_jax (JAX doesn't support FP8 weight storage)
- Update performance_considerations.rst for JAX differences
- Delete unused .out files and fp8_autocast_jax.py
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix JAX memory usage .out files with correct output
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* responded to comments
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* applied suggestions from greptile
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* year change
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* jax compute capability fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
parent c3769cb7
/* Diagram color definitions for Transformer Engine documentation */
/* High precision (BF16/FP16) elements */
.hp {
fill: #ede7f6;
stroke: #673ab7;
stroke-width: 2;
}
/* FP8 precision elements */
.fp8 {
fill: #fff8e1;
stroke: #ffa726;
stroke-width: 2;
}
/* GEMM/computation operations */
.gemm {
fill: #ffe0b2;
stroke: #fb8c00;
stroke-width: 2.5;
}
/* Quantization operations */
.quantize {
fill: #e8f5e9;
stroke: #66bb6a;
stroke-width: 2;
}
/* Amax computation operations */
.amax {
fill: #e1f5fe;
stroke: #039be5;
stroke-width: 2;
}
/* Text styles */
.text {
font-family: 'Segoe UI', Arial, sans-serif;
font-size: 14px;
text-anchor: middle;
fill: #212121;
}
.small-text {
font-family: 'Segoe UI', Arial, sans-serif;
font-size: 14px;
text-anchor: middle;
fill: #757575;
}
.label {
font-family: 'Segoe UI', Arial, sans-serif;
font-size: 14px;
text-anchor: middle;
fill: #424242;
}
.title {
font-family: 'Segoe UI', Arial, sans-serif;
font-size: 18px;
font-weight: 600;
text-anchor: middle;
fill: #212121;
}
.section-title {
font-family: 'Segoe UI', Arial, sans-serif;
font-size: 15px;
font-weight: 600;
text-anchor: middle;
}
/* Arrows */
/* Note: marker-end references #arrowhead marker which must be defined in each SVG's <defs> section */
.arrow {
stroke: #616161;
stroke-width: 2;
fill: none;
marker-end: url(#arrowhead);
}
/* Additional box and element styles */
.box-blue {
fill: #e3f2fd;
stroke: #1976d2;
stroke-width: 2;
}
.box-orange {
fill: #fff3e0;
stroke: #f57c00;
stroke-width: 2;
}
.box-green {
fill: #c8e6c9;
stroke: #388e3c;
stroke-width: 2;
}
.box-dashed {
stroke-dasharray: 5,5;
}
/* LayerNorm specific */
.layernorm {
fill: #b3e5fc;
stroke: #0277bd;
stroke-width: 2.5;
}
/* Fused layers */
.fused {
fill: #b2dfdb;
stroke: #00695c;
stroke-width: 3;
}
/* Generic computation blocks */
.computation {
fill: #f5f5f5;
stroke: #757575;
stroke-width: 2;
}
/* FP32 precision (alternative red) */
.fp32 {
fill: #ffcdd2;
stroke: #d32f2f;
stroke-width: 2.5;
}
/* Custom styling for sphinx-tabs */
.sphinx-tabs {
margin-bottom: 1rem;
}
.sphinx-tabs-tab {
background-color: #f4f4f4;
border: 1px solid #ccc;
border-bottom: none;
padding: 0.5rem 1rem;
margin-right: 0.5rem;
cursor: pointer;
font-weight: 500;
transition: background-color 0.2s;
}
.sphinx-tabs-tab:hover {
background-color: #e0e0e0;
}
.sphinx-tabs-tab[aria-selected="true"] {
background-color: #76b900; /* NVIDIA green */
color: white;
border-color: #76b900;
margin-right: 0.5rem;
}
.sphinx-tabs-panel {
border: 1px solid #ccc;
padding: 1rem;
background-color: #f9f9f9;
}
/* Dark mode support for RTD theme */
.rst-content .sphinx-tabs-tab {
color: #333;
}
.rst-content .sphinx-tabs-tab[aria-selected="true"] {
color: white;
}
/* Responsive styling for SVG images */
/* Make all SVG images responsive */
.document svg,
.document object[type="image/svg+xml"],
.rst-content svg {
max-width: 100%;
height: auto;
display: block;
margin: 1em auto;
}
/* For raw HTML embedded SVGs */
.document .raw-html svg {
max-width: 100%;
height: auto;
width: 100%;
}
/* Ensure container doesn't overflow */
.document .raw-html {
max-width: 100%;
overflow-x: auto;
}
/* Figure containers with captions */
.svg-figure {
text-align: center;
margin: 20px auto;
}
.svg-figure img {
display: block;
margin: 0 auto;
height: auto;
}
/* Different width classes for figures */
.svg-figure.width-70 img {
width: 70%;
max-width: 100%;
}
.svg-figure.width-80 img {
width: 80%;
max-width: 100%;
}
.svg-figure.width-90 img {
width: 90%;
max-width: 100%;
}
.svg-figure.width-100 img {
width: 100%;
}
/* Figure captions */
.svg-caption {
font-style: italic;
margin-top: 10px;
color: #555;
font-size: 0.95em;
line-height: 1.4;
}
......@@ -67,6 +67,10 @@
overflow: visible !important;
}
.quant {
background-color: yellow !important;
}
</style>
<style>
a:link, a:visited {
......
......@@ -84,8 +84,11 @@ html_show_sphinx = False
html_css_files = [
"css/nvidia_font.css",
"css/nvidia_footer.css",
"css/rtabs.css",
"css/output-style.css",
"css/diagram-colors.css",
"css/sphinx_tabs.css",
"css/svg-responsive.css",
"css/rtabs.css",
]
html_theme_options = {
......
..
Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
FP8 Blockwise Scaling
===================================
.. warning::
``Float8BlockScaling`` is **currently not supported** in JAX.
The FP8 Blockwise Scaling recipe is inspired by the quantization scheme used to train the `DeepSeek-v3 model <https://arxiv.org/abs/2412.19437>`__,
an open-source large-scale LLM trained with an FP8 mixed-precision framework.
Unlike the previous recipes, it assigns a dedicated scaling factor to each block of elements.
Data Format
--------------------------
The representation of an FP8 tensor element ``x`` in blockwise precision is given by:
.. code-block:: python
x = x_fp8 * s_block
where
* ``x_fp8`` is the FP8 value (E4M3 or E5M2),
* ``s_block`` is a local **FP32** scaling factor shared by a block of elements.
.. raw:: html
:file: img/combined_scaling.svg
*Figure 1. Top: Comparison of standard FP8 scaling (left) using a single scaling factor per tensor versus
FP8 blockwise scaling in 1 dimension (right) using multiple scaling factors, one per block of 128 elements.
Bottom: FP8 blockwise scaling in 2 dimensions where each 128×128 block in the data tensor has a corresponding
scaling factor.*
**FP8 format**
Unlike FP8 Current/Delayed Scaling, E4M3 is used by default for both forward and backward passes.
Tensor-scaled recipes use E5M2 for gradients due to its higher dynamic range,
but with multiple scaling factors per tensor the dynamic range requirement is lowered, so E4M3 is usually sufficient.
The ``fp8_format`` parameter also supports ``HYBRID`` mode (E4M3 for forward, E5M2 for backward).
Pure E5M2 training is not supported.
**Block size**
Block size is 128.
Blocks can be:
* one dimensional – containing 128 consecutive values,
* two dimensional – containing tiles of 128×128 values.
By default:
* activations use 1D scaling (``x_block_scaling_dim=1``),
* weights use 2D scaling (``w_block_scaling_dim=2``),
* gradients use 1D scaling (``grad_block_scaling_dim=1``).
These can be changed in the recipe, but 2D × 2D GEMMs are not supported
– at most one operand can use 2D scaling.
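The "at most one 2D operand" rule can be illustrated with a small check. This is only a sketch: the helper names are hypothetical, and the pairing of operands per GEMM (forward uses activations and weights, dgrad uses gradients and weights, wgrad uses gradients and activations) is the usual linear-layer assumption rather than something taken from the recipe's implementation.

```python
def gemm_operand_dims(x_dim=1, w_dim=2, grad_dim=1):
    """Map each GEMM of a linear layer to the scaling dims of its two operands."""
    return {
        "fprop (x, w)": (x_dim, w_dim),
        "dgrad (grad, w)": (grad_dim, w_dim),
        "wgrad (grad, x)": (grad_dim, x_dim),
    }


def unsupported_gemms(x_dim=1, w_dim=2, grad_dim=1):
    """Return the GEMMs whose operands would both use 2D scaling."""
    dims = gemm_operand_dims(x_dim, w_dim, grad_dim)
    return [name for name, pair in dims.items() if pair == (2, 2)]


# The defaults (1D activations, 2D weights, 1D gradients) never pair two 2D operands.
default_conflicts = unsupported_gemms()
# Switching activations to 2D while weights stay 2D breaks the forward GEMM.
conflicts_with_2d_activations = unsupported_gemms(x_dim=2)
```

With the default dimensions the list of conflicts is empty, which is why the defaults work for all three GEMMs.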
One-dimensional scaling is more granular, but 2D scaling offers two advantages:
* *Performance*: On Hopper, block-scaled GEMMs are software-emulated. GEMMs with mixed
1D/2D scaled tensors have lower overhead than pure 1D scaled GEMMs.
* *Numerical stability*: 2D scaling behaves better when transposed (details in the next section).
The tensor dimensions must satisfy the following requirements (for both 1D and 2D scaling):
* the tensor must have at least 2 dimensions,
* the last dimension must be divisible by 128,
* the product of all dimensions except the last must be divisible by 128.
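The three shape conditions above can be written as a short predicate. The function name is hypothetical and the check is a plain-Python sketch of the stated rules, not the validation code used by the library.

```python
import math

BLOCK = 128  # block size used by the recipe


def supports_blockwise_quantization(shape):
    """Check the three shape conditions listed above for a tensor shape."""
    if len(shape) < 2:
        return False  # at least 2 dimensions are required
    last = shape[-1]
    leading = math.prod(shape[:-1])  # product of all dimensions except the last
    return last % BLOCK == 0 and leading % BLOCK == 0


ok = supports_blockwise_quantization((4, 32, 256))     # leading product is 128
too_few_dims = supports_blockwise_quantization((256,))
bad_leading = supports_blockwise_quantization((100, 128))
```

Note that a shape such as ``(4, 32, 256)`` passes even though no single leading dimension is a multiple of 128, because only the product of the leading dimensions matters.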
**Scaling factors**
Scaling factors are stored as 32-bit floating point numbers.
By default, they are constrained to powers of 2 (utilizing the 8 exponent bits of FP32).
On Hopper, this constraint can be relaxed by setting the environment variable ``NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1``.
On Blackwell, only powers of 2 are supported.
Each block's scaling factor is computed through the following steps:
1. Find the maximum absolute value (``amax_block``) across all elements in the block
(128 consecutive values for 1D blocks, or 128×128 values for 2D blocks).
2. Calculate ``s_block = max_fp8 / amax_block``, where ``max_fp8`` is
the maximum representable value in the FP8 format (448 for E4M3, 57344 for E5M2).
3. If the power-of-2 constraint is enabled, round down to the nearest power of 2
by zeroing out the mantissa bits, retaining only the sign and exponent.
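4. Multiply each element in the block by ``s_block`` before converting to FP8.
Because quantization multiplies by this factor, dequantization divides by it: the scaling factor
in the data-format formula ``x = x_fp8 * s_block`` corresponds to the reciprocal of the value computed here.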
This approach ensures that the largest value in each block fits within the FP8 representable range without overflow.
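The steps above can be sketched in plain Python. This is a minimal illustration, not the library's kernel: the helper name ``block_scale`` is hypothetical, Python floats stand in for FP32, a 4-element list stands in for a 128-element block, and the neutral scale returned for an all-zero block is an assumption.

```python
import math

# Largest representable magnitudes, as stated in the text above.
FP8_MAX = {"E4M3": 448.0, "E5M2": 57344.0}


def block_scale(block, fp8_format="E4M3", power_of_2=True):
    """Return the quantization scale for one block of values."""
    amax = max(abs(v) for v in block)           # step 1
    if amax == 0.0:
        return 1.0                              # assumption: neutral scale for all-zero blocks
    scale = FP8_MAX[fp8_format] / amax          # step 2
    if power_of_2:
        # Step 3: keeping only the exponent is the same as rounding down
        # to the nearest power of 2. frexp gives scale = m * 2**e, 0.5 <= m < 1.
        _, exponent = math.frexp(scale)
        scale = math.ldexp(1.0, exponent - 1)
    return scale


block = [0.5, -3.0, 1.25, 0.0]                  # stand-in for 128 values
s = block_scale(block)
scaled = [v * s for v in block]                 # step 4, before the FP8 cast
```

Here the raw scale 448 / 3 is rounded down to 128, so the largest scaled magnitude is 384, which stays inside the E4M3 range. Rounding down rather than to the nearest power of 2 is what keeps the result from overflowing.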
Handling transposes
------------------------
On Hopper, columnwise tensor access requires data to be transposed in memory.
For 1D scaling, the block direction must align with the access pattern:
* *Rowwise access*: 1 scaling factor per 128 consecutive elements in a row.
* *Columnwise access*: 1 scaling factor per 128 consecutive elements in a row of the transposed tensor,
corresponding to 128 consecutive elements in a column of the original tensor.
For 2D scaling, each 128×128 tile has one scaling factor regardless of access direction.
This is illustrated below:
.. raw:: html
:file: img/transpose_handling.svg
*Figure 2. Quantization directions for original and transposed tensors.*
Note that for 1D scaling, the rowwise and columnwise quantized tensors may be numerically different,
so the gradient computation may be affected. This issue is not present for 2D scaling.
Activations and weights use the rowwise version in the forward pass and the columnwise version in the backward pass.
Experiments have shown that 2D scaling for weights is more helpful for numerical stability than for activations,
so by default 1D scaling is used for activations – as it is more granular – and 2D scaling is used for weights.
Unlike FP8 Current/Delayed Scaling, transposing a 1D quantized tensor is not supported.
Rowwise and columnwise blocks cover different sets of elements, so their scaling factors differ.
Both versions must be quantized separately from the high-precision source.
For 2D scaling, columnwise data can be created from rowwise data by transposing
both the quantized data and the scaling factors. Each 128×128 block covers the same
elements regardless of access direction, so the scaling factors remain valid.
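The difference between the two cases can be seen on a toy matrix. The sketch below uses a block size of 2 instead of 128 purely for readability and compares per-block maxima (``amax``), from which the scaling factors are derived; all helper names are hypothetical.

```python
BLOCK = 2  # toy block size for illustration; the recipe uses 128


def transpose(m):
    return [list(col) for col in zip(*m)]


def amax_1d(m):
    """One amax per BLOCK consecutive elements in each row."""
    return [[max(abs(v) for v in row[j:j + BLOCK])
             for j in range(0, len(row), BLOCK)]
            for row in m]


def amax_2d(m):
    """One amax per BLOCK x BLOCK tile."""
    return [[max(abs(m[i + di][j + dj])
                 for di in range(BLOCK) for dj in range(BLOCK))
             for j in range(0, len(m[0]), BLOCK)]
            for i in range(0, len(m), BLOCK)]


x = [[1.0, 8.0, 2.0, 2.0],
     [4.0, 1.0, 16.0, 1.0],
     [1.0, 1.0, 1.0, 32.0],
     [2.0, 1.0, 1.0, 1.0]]

rowwise = amax_1d(x)                 # blocks run along rows of x
columnwise = amax_1d(transpose(x))   # blocks run along columns of x

# 2D: the tile maxima of the transposed tensor are the transposed tile maxima.
tiles_match = amax_2d(transpose(x)) == transpose(amax_2d(x))
```

The element ``x[0][0]`` falls into a rowwise block with amax 8 but a columnwise block with amax 4, so the two 1D quantizations use different scales for the same value. For 2D tiles ``tiles_match`` is true, which is why transposing data and scales together is enough.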
Distributed training
-----------------------
**Scale synchronization**
The blockwise scaled tensor does not need any scale synchronization among the nodes.
This is because each scaling factor is local to its 128 or 128×128 element block,
unlike FP8 Current/Delayed Scaling where a single global scale applies to the entire tensor, even when sharded.
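A toy example makes the locality argument concrete. The sketch below assumes shard boundaries that align with block boundaries and uses a block size of 4 instead of 128; the names are hypothetical and plain lists stand in for tensors sharded across devices.

```python
BLOCK = 4  # toy block size for illustration; the recipe uses 128


def block_amax(values):
    """One amax per BLOCK consecutive values."""
    return [max(abs(v) for v in values[i:i + BLOCK])
            for i in range(0, len(values), BLOCK)]


full = [0.1, -2.0, 0.3, 0.4, 5.0, 0.6, -0.7, 0.8,
        9.0, 1.0, 1.1, -1.2, 0.01, 0.02, -0.03, 0.04]
shards = [full[:8], full[8:]]        # two "devices", split on a block boundary

# Blockwise: each shard computes its own block maxima independently.
gathered = [a for shard in shards for a in block_amax(shard)]

# Per-tensor: each shard only sees its local maximum, so the values disagree
# until they are reduced across shards.
local_amax = [max(abs(v) for v in shard) for shard in shards]
global_amax = max(local_amax)
```

``gathered`` equals ``block_amax(full)``, so no communication is needed to agree on block scales, whereas the per-tensor maxima (5.0 and 9.0) differ between shards and must be reduced to the global value 9.0.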
**Quantized all-gather**
FP8 Blockwise Scaling all-gather is supported.
Examples
--------
Here's how to use the FP8 Blockwise Scaling recipe in PyTorch (the recipe is not yet available in JAX):
.. note::
Requires SM90 (Hopper) or later.
.. tabs::
.. tab:: PyTorch
.. literalinclude:: pytorch_blockwise_scaling_example.py
:language: python
:start-after: # START_BLOCKWISE_SCALING_EXAMPLE
:end-before: # END_BLOCKWISE_SCALING_EXAMPLE
.. tab:: JAX
``Float8BlockScaling`` is **not currently supported** in JAX.
Supported devices
-----------------
* Hopper (SM 9.0)
* Blackwell and later (SM >= 10.0) – the recipe is emulated with MXFP8, and only scaling factors that are powers of 2 are supported. Note that MXFP8 is the preferred recipe on Blackwell.
----
Developer Notes
---------------
This section contains implementation details that may be useful for developers
but are not required for using FP8 Blockwise Scaling in practice.
Swizzle of scaling factors
^^^^^^^^^^^^^^^^^^^^^^^^^^
FP8 Blockwise Scaling supports all-gather of both rowwise and columnwise tensors.
To support that, it implements different data layouts for communication (all-gather)
and computation (GEMM). We refer to the conversion between these formats as *swizzling*.
A tensor of shape ``[A, B]`` can exist in two formats:
**Compact format** (used for all-gather):
The all-gather primitive only supports gathering non-transposed shards into a non-transposed full tensor,
so all tensor components in this layout are stored without transposition.
Moreover, all component tensors are stored without padding.
.. list-table::
:widths: 30 70
:header-rows: 1
* - Component
- Shape
* - rowwise data
- ``[A, B]``
* - columnwise data
- ``[A, B]``
* - rowwise scales
- ``[A, B/128]``
* - columnwise scales
- ``[A/128, B]``
**GEMM-ready format** (used for computation):
Tensors are transposed and padded as required by the GEMM kernel.
.. list-table::
:widths: 30 70
:header-rows: 1
* - Component
- Shape
* - rowwise data
- ``[A, B]``
* - columnwise data
- ``[B, A]`` (transposed)
* - rowwise scales
- ``[B/128, pad4(A)]`` (transposed, padded)
* - columnwise scales
- ``[A/128, pad4(B)]`` (padded)
Swizzling converts from compact to GEMM-ready format. This can be fused with quantization
when no all-gather is needed, or performed separately after all-gather.
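The two tables can be summarized as shape functions. This is a sketch under stated assumptions: the function names are hypothetical, ``pad4`` is assumed to round up to the next multiple of 4, and the shapes follow the tables above, where scales are given per 128 consecutive elements.

```python
BLOCK = 128


def pad4(n):
    """Round n up to a multiple of 4 (assumed meaning of pad4 in the tables)."""
    return (n + 3) // 4 * 4


def compact_shapes(a, b):
    """Component shapes in the compact (all-gather) format for an [a, b] tensor."""
    return {
        "rowwise data": (a, b),
        "columnwise data": (a, b),
        "rowwise scales": (a, b // BLOCK),
        "columnwise scales": (a // BLOCK, b),
    }


def gemm_ready_shapes(a, b):
    """Component shapes in the GEMM-ready format for an [a, b] tensor."""
    return {
        "rowwise data": (a, b),
        "columnwise data": (b, a),                    # transposed
        "rowwise scales": (b // BLOCK, pad4(a)),      # transposed, padded
        "columnwise scales": (a // BLOCK, pad4(b)),   # padded
    }


compact = compact_shapes(256, 384)
gemm_ready = gemm_ready_shapes(256, 384)
```

For ``[256, 384]`` the rowwise scales change from ``(256, 3)`` to ``(3, 256)``. When both dimensions are multiples of 128, as in this example, the padding in this sketch is a no-op.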
.. raw:: html
:file: img/blockwise_swizzle_flow.svg
*Figure 3. FP8 Blockwise Scaling swizzle paths. Top: With all-gather communication – quantization produces
compact format, then swizzle is performed separately after communication. Bottom: Without all-gather –
quantize and swizzle are fused into a single operation, directly producing GEMM-ready format.*
All-gather of columnwise tensors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
All-gather of columnwise tensors is supported and necessary because:
- columnwise quantized tensors cannot be computed from rowwise quantized ones,
- gathering high-precision tensors is avoided in most cases for performance reasons.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 1150 400" width="100%" style="max-width: 900px;">
<defs>
<style>
@import url("../_static/css/diagram-colors.css");
/* Diagram-specific styles */
.input-box { fill: #f3e5f5; stroke: #7b1fa2; stroke-width: 2.5; }
.blockwise-box { fill: #e3f2fd; stroke: #1976d2; stroke-width: 2.5; }
.fp8-tile { fill: #bbdefb; stroke: #1565c0; stroke-width: 1.5; }
.scale-tile { fill: #a5d6a7; stroke: #388e3c; stroke-width: 1.5; }
.scale-swizzled { fill: #ffb74d; stroke: #e65100; stroke-width: 1.5; }
.swizzle-box { fill: #fff3e0; stroke: #f57c00; stroke-width: 2; }
.quantize-box { fill: #ede7f6; stroke: #5e35b1; stroke-width: 2; }
.quantize-fused-box { fill: #d1c4e9; stroke: #5e35b1; stroke-width: 2.5; }
.comm-box { fill: #fff9c4; stroke: #f57f17; stroke-width: 2; }
.gemm-box { fill: #c8e6c9; stroke: #388e3c; stroke-width: 2; }
/* Arrow override */
.arrow { marker-end: url(#arrowhead); stroke: #616161; stroke-width: 1.5; fill: none; }
</style>
<!-- Arrow marker -->
<marker id="arrowhead" markerWidth="6" markerHeight="6" refX="5" refY="2" orient="auto">
<polygon points="0 0, 6 2, 0 4" fill="#616161" />
</marker>
</defs>
<!-- Section 1: With Communication (Separate Swizzle) -->
<g id="with-communication">
<!-- Step 0: Input Tensor -->
<g id="input-fp32-tensor-1">
<text x="80" y="25" class="text" text-anchor="middle" font-weight="600">Input Tensor</text>
<rect x="20" y="40" width="120" height="110" rx="6" class="input-box"/>
<text x="80" y="100" class="text" text-anchor="middle" fill="#fff">FP32/BF16</text>
</g>
<!-- Arrow 0 -->
<path d="M 140 95 L 175 95" class="arrow"/>
<!-- Step 1: Quantize -->
<rect x="175" y="60" width="80" height="70" rx="6" class="quantize-box"/>
<text x="215" y="100" class="text">Quantize</text>
<!-- Arrow 1 -->
<path d="M 255 95 L 290 95" class="arrow"/>
<!-- Step 2: Blockwise Tensor (Compact) -->
<g id="blockwise-tensor-compact">
<text x="375" y="25" class="text" text-anchor="middle" font-weight="600">FP8 (Compact)</text>
<rect x="290" y="40" width="170" height="110" rx="6" class="blockwise-box"/>
<!-- FP32 Scales sub-tile (green) -->
<rect x="305" y="52" width="140" height="32" rx="3" class="scale-tile"/>
<text x="375" y="73" class="text" text-anchor="middle" fill="#fff">FP32 Scales</text>
<!-- FP8 Data sub-tile -->
<rect x="305" y="92" width="140" height="45" rx="3" class="fp8-tile"/>
<text x="375" y="120" class="text" fill="#fff">FP8 Data</text>
</g>
<!-- Arrow 2 -->
<path d="M 460 95 L 495 95" class="arrow"/>
<!-- Step 3: Communication -->
<rect x="495" y="60" width="100" height="70" rx="6" class="comm-box"/>
<text x="545" y="100" class="text">All-Gather</text>
<!-- Arrow 3 -->
<path d="M 595 95 L 630 95" class="arrow"/>
<!-- Step 4: Swizzle -->
<rect x="630" y="60" width="90" height="70" rx="6" class="swizzle-box"/>
<text x="675" y="100" class="text">Swizzle</text>
<!-- Arrow 4 -->
<path d="M 720 95 L 755 95" class="arrow"/>
<!-- Step 5: Blockwise Tensor (GEMM Ready) -->
<g id="swizzled-tensor-1">
<text x="840" y="25" class="text" text-anchor="middle" font-weight="600">FP8 (GEMM Ready)</text>
<rect x="755" y="40" width="170" height="110" rx="6" class="blockwise-box"/>
<!-- Swizzled Scales sub-tile (orange) -->
<rect x="770" y="52" width="140" height="32" rx="3" class="scale-swizzled"/>
<text x="840" y="73" class="text" text-anchor="middle" fill="#fff">Swizzled Scales</text>
<!-- FP8 Data sub-tile -->
<rect x="770" y="92" width="140" height="45" rx="3" class="fp8-tile"/>
<text x="840" y="120" class="text" fill="#fff">FP8 Data</text>
</g>
<!-- Arrow 5 -->
<path d="M 925 95 L 960 95" class="arrow"/>
<!-- Step 6: GEMM -->
<rect x="960" y="60" width="80" height="70" rx="6" class="gemm-box"/>
<text x="1000" y="100" class="text">GEMM</text>
</g>
<!-- Separator Line -->
<line x1="20" y1="185" x2="1050" y2="185" stroke="#bdbdbd" stroke-width="1" stroke-dasharray="8,4"/>
<!-- Section 2: Without Communication (Fused Quantize + Swizzle) -->
<g id="without-communication" transform="translate(0, 170)">
<!-- Step 0: Input Tensor -->
<g id="input-fp32-tensor-2">
<text x="80" y="45" class="text" text-anchor="middle" font-weight="600">Input Tensor</text>
<rect x="20" y="60" width="120" height="110" rx="6" class="input-box"/>
<text x="80" y="120" class="text" text-anchor="middle" fill="#fff">FP32/BF16</text>
</g>
<!-- Arrow 0 -->
<path d="M 140 115 L 190 115" class="arrow"/>
<!-- Step 1: Fused Quantize + Swizzle -->
<rect x="190" y="70" width="120" height="90" rx="6" class="quantize-fused-box"/>
<text x="250" y="105" class="text">Quantize</text>
<text x="250" y="122" class="text">+</text>
<text x="250" y="139" class="text">Swizzle</text>
<!-- Arrow 1 -->
<path d="M 310 115 L 360 115" class="arrow"/>
<!-- Step 2: Blockwise Tensor (GEMM Ready) - directly produced -->
<g id="swizzled-tensor-2">
<text x="455" y="45" class="text" text-anchor="middle" font-weight="600">FP8 (GEMM Ready)</text>
<rect x="360" y="60" width="190" height="110" rx="6" class="blockwise-box"/>
<!-- Swizzled Scales sub-tile (orange) -->
<rect x="378" y="72" width="155" height="32" rx="3" class="scale-swizzled"/>
<text x="455" y="93" class="text" text-anchor="middle" fill="#fff">Swizzled Scales</text>
<!-- FP8 Data sub-tile -->
<rect x="378" y="112" width="155" height="45" rx="3" class="fp8-tile"/>
<text x="455" y="140" class="text" fill="#fff">FP8 Data</text>
</g>
<!-- Arrow 2 -->
<path d="M 550 115 L 600 115" class="arrow"/>
<!-- Step 3: GEMM -->
<rect x="600" y="80" width="80" height="70" rx="6" class="gemm-box"/>
<text x="640" y="120" class="text">GEMM</text>
</g>
</svg>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 55 900 715">
<defs>
<style>
@import url("../_static/css/diagram-colors.css");
.title { font: bold 18px sans-serif; fill: #333; text-anchor: middle; }
.dots-text { font: bold 24px sans-serif; fill: #333; text-anchor: middle; }
/* Tensor colors */
.fp8-tensor { fill: #87CEEB; stroke: #444; stroke-width: 2; }
.fp8-block { fill: #87CEEB; stroke: #555; stroke-width: 1.5; }
.fp8-block-alt { fill: #5F9FCC; stroke: #555; stroke-width: 1.5; }
/* Scaling factor colors */
.scale-factor { fill: #FFA500; stroke: #444; stroke-width: 2; }
.grid-line { stroke: #444; stroke-width: 2; }
.boundary-line { stroke: #444; stroke-width: 2; }
</style>
</defs>
<!-- FIRST IMAGE: Standard vs Blockwise Scaling -->
<!-- LEFT SIDE: Standard FP8 Scaling -->
<g id="standard-scaling">
<text x="225" y="85" class="title">Delayed/Current FP8 Scaling</text>
<text x="225" y="108" class="label">(Single scaling factor per tensor)</text>
<!-- FP8 Tensor - solid blue with white cross -->
<g id="left-tensor">
<!-- Solid blue background -->
<rect x="105" y="140" width="240" height="120" class="fp8-tensor"/>
<!-- White backgrounds for dots areas - cross pattern -->
<rect x="225.0" y="140.0" width="40" height="120" fill="#FFFFFF" stroke="none"/>
<rect x="105.0" y="190.0" width="240" height="30" fill="#FFFFFF" stroke="none"/>
<!-- Three dots in VERTICAL white bar -->
<text x="245" y="167.5" class="dots-text"></text>
<text x="245" y="242.5" class="dots-text"></text>
<!-- Three dots in HORIZONTAL white bar -->
<text x="165" y="205" class="dots-text"></text>
<text x="305" y="205" class="dots-text"></text>
<!-- ONE diagonal dot at intersection -->
<text x="245" y="205" class="dots-text" transform="rotate(45 245 205)"></text>
<!-- Main outline -->
<rect x="105.0" y="140.0" width="240" height="120" fill="none" stroke="#444" stroke-width="2"/>
</g>
<!-- Single scaling factor - one 10x10 square -->
<rect x="220" y="285" width="10" height="10" class="scale-factor" stroke="#444" stroke-width="1"/>
<text x="225" y="315" class="small-text" text-anchor="middle">1 scaling factor</text>
</g>
<!-- RIGHT SIDE: FP8 Blockwise Scaling -->
<g id="blockwise-scaling">
<text x="675" y="85" class="title">Blockwise FP8 Scaling – 1 dimension</text>
<text x="675" y="108" class="label">(One scaling factor per 128 elements)</text>
<!-- FP8 Tensor split into many small blocks (40×10) - EXACT coordinates from Python script -->
<g id="tensor-blocks">
<!-- White backgrounds for dots areas - cross pattern -->
<rect x="675.0" y="140.0" width="40" height="120" fill="#FFFFFF" stroke="none"/>
<rect x="555.0" y="190.0" width="240" height="30" fill="#FFFFFF" stroke="none"/>
<!-- Blocks ONLY where they don't overlap with white cross (from Python script) -->
<rect x="555" y="140" width="40" height="10" class="fp8-block"/>
<rect x="595" y="140" width="40" height="10" class="fp8-block-alt"/>
<rect x="635" y="140" width="40" height="10" class="fp8-block"/>
<rect x="715" y="140" width="40" height="10" class="fp8-block"/>
<rect x="755" y="140" width="40" height="10" class="fp8-block-alt"/>
<rect x="555" y="150" width="40" height="10" class="fp8-block-alt"/>
<rect x="595" y="150" width="40" height="10" class="fp8-block"/>
<rect x="635" y="150" width="40" height="10" class="fp8-block-alt"/>
<rect x="715" y="150" width="40" height="10" class="fp8-block-alt"/>
<rect x="755" y="150" width="40" height="10" class="fp8-block"/>
<rect x="555" y="160" width="40" height="10" class="fp8-block"/>
<rect x="595" y="160" width="40" height="10" class="fp8-block-alt"/>
<rect x="635" y="160" width="40" height="10" class="fp8-block"/>
<rect x="715" y="160" width="40" height="10" class="fp8-block"/>
<rect x="755" y="160" width="40" height="10" class="fp8-block-alt"/>
<rect x="555" y="170" width="40" height="10" class="fp8-block-alt"/>
<rect x="595" y="170" width="40" height="10" class="fp8-block"/>
<rect x="635" y="170" width="40" height="10" class="fp8-block-alt"/>
<rect x="715" y="170" width="40" height="10" class="fp8-block-alt"/>
<rect x="755" y="170" width="40" height="10" class="fp8-block"/>
<rect x="555" y="180" width="40" height="10" class="fp8-block"/>
<rect x="595" y="180" width="40" height="10" class="fp8-block-alt"/>
<rect x="635" y="180" width="40" height="10" class="fp8-block"/>
<rect x="715" y="180" width="40" height="10" class="fp8-block"/>
<rect x="755" y="180" width="40" height="10" class="fp8-block-alt"/>
<!-- Three dots in VERTICAL white bar -->
<text x="695" y="167.5" class="dots-text"></text>
<text x="695" y="242.5" class="dots-text"></text>
<!-- Three dots in HORIZONTAL white bar -->
<text x="615" y="205" class="dots-text"></text>
<text x="755" y="205" class="dots-text"></text>
<!-- ONE diagonal dot at intersection -->
<text x="695" y="205" class="dots-text" transform="rotate(45 695 205)"></text>
<!-- Bottom rows (y >= 220 after horizontal white bar) -->
<rect x="555" y="220" width="40" height="10" class="fp8-block"/>
<rect x="595" y="220" width="40" height="10" class="fp8-block-alt"/>
<rect x="635" y="220" width="40" height="10" class="fp8-block"/>
<rect x="715" y="220" width="40" height="10" class="fp8-block"/>
<rect x="755" y="220" width="40" height="10" class="fp8-block-alt"/>
<rect x="555" y="230" width="40" height="10" class="fp8-block-alt"/>
<rect x="595" y="230" width="40" height="10" class="fp8-block"/>
<rect x="635" y="230" width="40" height="10" class="fp8-block-alt"/>
<rect x="715" y="230" width="40" height="10" class="fp8-block-alt"/>
<rect x="755" y="230" width="40" height="10" class="fp8-block"/>
<rect x="555" y="240" width="40" height="10" class="fp8-block"/>
<rect x="595" y="240" width="40" height="10" class="fp8-block-alt"/>
<rect x="635" y="240" width="40" height="10" class="fp8-block"/>
<rect x="715" y="240" width="40" height="10" class="fp8-block"/>
<rect x="755" y="240" width="40" height="10" class="fp8-block-alt"/>
<rect x="555" y="250" width="40" height="10" class="fp8-block-alt"/>
<rect x="595" y="250" width="40" height="10" class="fp8-block"/>
<rect x="635" y="250" width="40" height="10" class="fp8-block-alt"/>
<rect x="715" y="250" width="40" height="10" class="fp8-block-alt"/>
<rect x="755" y="250" width="40" height="10" class="fp8-block"/>
<!-- Main outline -->
<rect x="555.0" y="140.0" width="240" height="120" fill="none" stroke="#444" stroke-width="2"/>
</g>
<!-- Scaling factors tensor - 3+2 columns of 10px squares -->
<g id="scale-factors">
<!-- Orange background -->
<rect x="640" y="285" width="70" height="120" fill="#FFA500"/>
<!-- White backgrounds for dots areas - cross pattern -->
<rect x="670" y="285" width="20" height="120" fill="#FFFFFF" stroke="none"/>
<rect x="640" y="335" width="70" height="30" fill="#FFFFFF" stroke="none"/>
<!-- Grid lines showing 10x10 squares (3 left + 2 right columns) -->
<!-- Vertical lines every 10px (skipping white space) -->
<!-- Left 3 columns (640-670) -->
<line x1="650" y1="285" x2="650" y2="335" class="grid-line" stroke-width="1"/>
<line x1="660" y1="285" x2="660" y2="335" class="grid-line" stroke-width="1"/>
<line x1="670" y1="285" x2="670" y2="335" class="grid-line" stroke-width="1"/>
<!-- Right 2 columns (690-710) -->
<line x1="690" y1="285" x2="690" y2="335" class="grid-line" stroke-width="1"/>
<line x1="700" y1="285" x2="700" y2="335" class="grid-line" stroke-width="1"/>
<line x1="710" y1="285" x2="710" y2="335" class="grid-line" stroke-width="1"/>
<!-- Bottom sections -->
<line x1="650" y1="365" x2="650" y2="405" class="grid-line" stroke-width="1"/>
<line x1="660" y1="365" x2="660" y2="405" class="grid-line" stroke-width="1"/>
<line x1="670" y1="365" x2="670" y2="405" class="grid-line" stroke-width="1"/>
<line x1="690" y1="365" x2="690" y2="405" class="grid-line" stroke-width="1"/>
<line x1="700" y1="365" x2="700" y2="405" class="grid-line" stroke-width="1"/>
<line x1="710" y1="365" x2="710" y2="405" class="grid-line" stroke-width="1"/>
<!-- Horizontal lines every 10px -->
<line x1="640" y1="295" x2="670" y2="295" class="grid-line" stroke-width="1"/>
<line x1="690" y1="295" x2="710" y2="295" class="grid-line" stroke-width="1"/>
<line x1="640" y1="305" x2="670" y2="305" class="grid-line" stroke-width="1"/>
<line x1="690" y1="305" x2="710" y2="305" class="grid-line" stroke-width="1"/>
<line x1="640" y1="315" x2="670" y2="315" class="grid-line" stroke-width="1"/>
<line x1="690" y1="315" x2="710" y2="315" class="grid-line" stroke-width="1"/>
<line x1="640" y1="325" x2="670" y2="325" class="grid-line" stroke-width="1"/>
<line x1="690" y1="325" x2="710" y2="325" class="grid-line" stroke-width="1"/>
<!-- Top bottom boundaries -->
<line x1="640" y1="335" x2="670" y2="335" class="grid-line" stroke-width="1"/>
<line x1="690" y1="335" x2="710" y2="335" class="grid-line" stroke-width="1"/>
<line x1="640" y1="365" x2="670" y2="365" class="grid-line" stroke-width="1"/>
<line x1="690" y1="365" x2="710" y2="365" class="grid-line" stroke-width="1"/>
<line x1="640" y1="375" x2="670" y2="375" class="grid-line" stroke-width="1"/>
<line x1="690" y1="375" x2="710" y2="375" class="grid-line" stroke-width="1"/>
<line x1="640" y1="385" x2="670" y2="385" class="grid-line" stroke-width="1"/>
<line x1="690" y1="385" x2="710" y2="385" class="grid-line" stroke-width="1"/>
<line x1="640" y1="395" x2="670" y2="395" class="grid-line" stroke-width="1"/>
<line x1="690" y1="395" x2="710" y2="395" class="grid-line" stroke-width="1"/>
<!-- Bottom boundaries -->
<line x1="640" y1="405" x2="670" y2="405" class="grid-line" stroke-width="1"/>
<line x1="690" y1="405" x2="710" y2="405" class="grid-line" stroke-width="1"/>
<!-- Main outline -->
<rect x="640" y="285" width="70" height="120" fill="none" stroke="#444" stroke-width="2"/>
<!-- Three dots -->
<text x="680" y="312.5" class="dots-text" style="font-size: 14px;">⋯</text>
<text x="680" y="387.5" class="dots-text" style="font-size: 14px;">⋯</text>
<text x="655" y="350" class="dots-text" style="font-size: 14px;">⋮</text>
<text x="700" y="350" class="dots-text" style="font-size: 14px;">⋮</text>
<text x="680" y="350" class="dots-text" style="font-size: 14px;" transform="rotate(45 680 350)">⋯</text>
</g>
<text x="675" y="430" class="small-text" text-anchor="middle">Scaling factors (one per block)</text>
</g>
<!-- SECOND IMAGE: 2D Blockwise Scaling -->
<!-- Main Title -->
<text x="450" y="470" class="title">Blockwise FP8 Scaling – 2 dimensions</text>
<text x="450" y="495" class="label">(One scaling factor per 128x128 block of elements)</text>
<!-- TOP: DATA TENSOR (20x20 blocks, with 3 extra columns on right) -->
<g id="data-tensor">
<!-- Background for entire tensor -->
<rect x="390" y="525" width="180" height="120" class="fp8-tensor"/>
<!-- White space for gaps (cross pattern) -->
<rect x="450" y="525" width="20" height="120" fill="#FFFFFF" stroke="none"/>
<rect x="390" y="585" width="180" height="20" fill="#FFFFFF" stroke="none"/>
<!-- Grid Lines (every 20px) -->
<!-- Vertical Lines Left (x=410, 430) -->
<line x1="410" y1="525" x2="410" y2="585" class="grid-line" stroke-width="1"/>
<line x1="430" y1="525" x2="430" y2="585" class="grid-line" stroke-width="1"/>
<line x1="410" y1="605" x2="410" y2="645" class="grid-line" stroke-width="1"/>
<line x1="430" y1="605" x2="430" y2="645" class="grid-line" stroke-width="1"/>
<!-- Vertical Lines Right (x=490, 510, 530, 550) -->
<line x1="490" y1="525" x2="490" y2="585" class="grid-line" stroke-width="1"/>
<line x1="490" y1="605" x2="490" y2="645" class="grid-line" stroke-width="1"/>
<line x1="510" y1="525" x2="510" y2="585" class="grid-line" stroke-width="1"/>
<line x1="510" y1="605" x2="510" y2="645" class="grid-line" stroke-width="1"/>
<line x1="530" y1="525" x2="530" y2="585" class="grid-line" stroke-width="1"/>
<line x1="530" y1="605" x2="530" y2="645" class="grid-line" stroke-width="1"/>
<line x1="550" y1="525" x2="550" y2="585" class="grid-line" stroke-width="1"/>
<line x1="550" y1="605" x2="550" y2="645" class="grid-line" stroke-width="1"/>
<!-- Horizontal Lines Top (y=545, 565) -->
<line x1="390" y1="545" x2="450" y2="545" class="grid-line" stroke-width="1"/>
<line x1="470" y1="545" x2="570" y2="545" class="grid-line" stroke-width="1"/>
<line x1="390" y1="565" x2="450" y2="565" class="grid-line" stroke-width="1"/>
<line x1="470" y1="565" x2="570" y2="565" class="grid-line" stroke-width="1"/>
<!-- Horizontal Lines Bottom (y=625) -->
<line x1="390" y1="625" x2="450" y2="625" class="grid-line" stroke-width="1"/>
<line x1="470" y1="625" x2="570" y2="625" class="grid-line" stroke-width="1"/>
<!-- Dots / Ellipses -->
<!-- Horizontal dots in gap -->
<text x="460" y="552" class="dots-text" style="font-size: 14px;">⋯</text>
<text x="460" y="632" class="dots-text" style="font-size: 14px;">⋯</text>
<!-- Vertical dots in gap -->
<text x="420" y="597" class="dots-text" style="font-size: 14px;">⋮</text>
<text x="540" y="597" class="dots-text" style="font-size: 14px;">⋮</text>
<!-- Diagonal dot -->
<text x="460" y="597" class="dots-text" style="font-size: 14px;" transform="rotate(45 460 597)">⋯</text>
<!-- Boundaries around white spaces (excluding center intersection) -->
<!-- Vertical boundaries - broken at horizontal white space -->
<line x1="450" y1="525" x2="450" y2="585" class="boundary-line"/>
<line x1="450" y1="605" x2="450" y2="645" class="boundary-line"/>
<line x1="470" y1="525" x2="470" y2="585" class="boundary-line"/>
<line x1="470" y1="605" x2="470" y2="645" class="boundary-line"/>
<!-- Horizontal boundaries - broken at vertical white space -->
<line x1="390" y1="585" x2="450" y2="585" class="boundary-line"/>
<line x1="470" y1="585" x2="570" y2="585" class="boundary-line"/>
<line x1="390" y1="605" x2="450" y2="605" class="boundary-line"/>
<line x1="470" y1="605" x2="570" y2="605" class="boundary-line"/>
<!-- Main outline -->
<rect x="390" y="525" width="180" height="120" fill="none" stroke="#444" stroke-width="2"/>
</g>
<!-- BOTTOM: SCALING FACTORS (10x10 blocks, with 3 extra columns on right) -->
<g id="scaling-factors-2d">
<!-- Background for entire scaling tensor -->
<rect x="420" y="675" width="90" height="60" class="scale-factor"/>
<!-- White space for gaps (cross pattern) -->
<rect x="450" y="675" width="10" height="60" fill="#FFFFFF" stroke="none"/>
<rect x="420" y="705" width="90" height="10" fill="#FFFFFF" stroke="none"/>
<!-- Grid Lines (every 10px) -->
<!-- Vertical Left -->
<line x1="430" y1="675" x2="430" y2="705" class="grid-line" stroke-width="1"/>
<line x1="440" y1="675" x2="440" y2="705" class="grid-line" stroke-width="1"/>
<line x1="430" y1="715" x2="430" y2="735" class="grid-line" stroke-width="1"/>
<line x1="440" y1="715" x2="440" y2="735" class="grid-line" stroke-width="1"/>
<!-- Vertical Right -->
<line x1="470" y1="675" x2="470" y2="705" class="grid-line" stroke-width="1"/>
<line x1="470" y1="715" x2="470" y2="735" class="grid-line" stroke-width="1"/>
<line x1="480" y1="675" x2="480" y2="705" class="grid-line" stroke-width="1"/>
<line x1="480" y1="715" x2="480" y2="735" class="grid-line" stroke-width="1"/>
<line x1="490" y1="675" x2="490" y2="705" class="grid-line" stroke-width="1"/>
<line x1="490" y1="715" x2="490" y2="735" class="grid-line" stroke-width="1"/>
<line x1="500" y1="675" x2="500" y2="705" class="grid-line" stroke-width="1"/>
<line x1="500" y1="715" x2="500" y2="735" class="grid-line" stroke-width="1"/>
<!-- Horizontal Top -->
<line x1="420" y1="685" x2="450" y2="685" class="grid-line" stroke-width="1"/>
<line x1="460" y1="685" x2="510" y2="685" class="grid-line" stroke-width="1"/>
<line x1="420" y1="695" x2="450" y2="695" class="grid-line" stroke-width="1"/>
<line x1="460" y1="695" x2="510" y2="695" class="grid-line" stroke-width="1"/>
<!-- Horizontal Bottom -->
<line x1="420" y1="725" x2="450" y2="725" class="grid-line" stroke-width="1"/>
<line x1="460" y1="725" x2="510" y2="725" class="grid-line" stroke-width="1"/>
<!-- Dots -->
<text x="455" y="692" class="dots-text" style="font-size: 12px;">⋯</text>
<text x="455" y="727" class="dots-text" style="font-size: 12px;">⋯</text>
<text x="435" y="711" class="dots-text" style="font-size: 12px;">⋮</text>
<text x="490" y="711" class="dots-text" style="font-size: 12px;">⋮</text>
<text x="455" y="711" class="dots-text" style="font-size: 12px;" transform="rotate(45 455 711)">⋯</text>
<!-- Boundaries around white spaces (excluding center intersection) -->
<!-- Vertical boundaries - broken at horizontal white space -->
<line x1="450" y1="675" x2="450" y2="705" class="boundary-line"/>
<line x1="450" y1="715" x2="450" y2="735" class="boundary-line"/>
<line x1="460" y1="675" x2="460" y2="705" class="boundary-line"/>
<line x1="460" y1="715" x2="460" y2="735" class="boundary-line"/>
<!-- Horizontal boundaries - broken at vertical white space -->
<line x1="420" y1="705" x2="450" y2="705" class="boundary-line"/>
<line x1="460" y1="705" x2="510" y2="705" class="boundary-line"/>
<line x1="420" y1="715" x2="450" y2="715" class="boundary-line"/>
<line x1="460" y1="715" x2="510" y2="715" class="boundary-line"/>
<!-- Main outline -->
<rect x="420" y="675" width="90" height="60" fill="none" stroke="#444" stroke-width="2"/>
<text x="465" y="755" class="small-text" text-anchor="middle">Scaling factors (1 per 2D block)</text>
</g>
</svg>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 900 640">
<defs>
<style>
@import url("../_static/css/diagram-colors.css");
.title { font: bold 16px sans-serif; fill: #333; text-anchor: middle; }
.label { font: 14px sans-serif; fill: #333; text-anchor: middle; }
.small-text { font: 12px sans-serif; fill: #555; }
.dots-text { font: bold 24px sans-serif; fill: #333; text-anchor: middle; }
/* Tensor colors */
.fp8-block { fill: #87CEEB; stroke: #555; stroke-width: 1.5; }
.fp8-block-alt { fill: #5F9FCC; stroke: #555; stroke-width: 1.5; }
</style>
</defs>
<!-- Section title for 1D -->
<text x="450" y="25" class="title" style="font-size: 18px; font-weight: bold;">1D Blockwise Scaling</text>
<!-- LEFT SIDE: Original 1D Blockwise (Rowwise Quantization) -->
<g id="rowwise-quantization">
<text x="225" y="50" class="title">Rowwise Quantization</text>
<!-- FP8 Tensor with horizontal stripes -->
<g id="left-tensor">
<!-- White backgrounds for dots areas - cross pattern -->
<rect x="225.0" y="100.0" width="40" height="120" fill="#FFFFFF" stroke="none"/>
<rect x="105.0" y="150.0" width="240" height="30" fill="#FFFFFF" stroke="none"/>
<!-- Horizontal blocks (40×10 each) - rows of alternating colors -->
<!-- Top section (before horizontal gap) -->
<rect x="105" y="100" width="40" height="10" class="fp8-block"/>
<rect x="145" y="100" width="40" height="10" class="fp8-block-alt"/>
<rect x="185" y="100" width="40" height="10" class="fp8-block"/>
<rect x="265" y="100" width="40" height="10" class="fp8-block"/>
<rect x="305" y="100" width="40" height="10" class="fp8-block-alt"/>
<rect x="105" y="110" width="40" height="10" class="fp8-block-alt"/>
<rect x="145" y="110" width="40" height="10" class="fp8-block"/>
<rect x="185" y="110" width="40" height="10" class="fp8-block-alt"/>
<rect x="265" y="110" width="40" height="10" class="fp8-block-alt"/>
<rect x="305" y="110" width="40" height="10" class="fp8-block"/>
<rect x="105" y="120" width="40" height="10" class="fp8-block"/>
<rect x="145" y="120" width="40" height="10" class="fp8-block-alt"/>
<rect x="185" y="120" width="40" height="10" class="fp8-block"/>
<rect x="265" y="120" width="40" height="10" class="fp8-block"/>
<rect x="305" y="120" width="40" height="10" class="fp8-block-alt"/>
<rect x="105" y="130" width="40" height="10" class="fp8-block-alt"/>
<rect x="145" y="130" width="40" height="10" class="fp8-block"/>
<rect x="185" y="130" width="40" height="10" class="fp8-block-alt"/>
<rect x="265" y="130" width="40" height="10" class="fp8-block-alt"/>
<rect x="305" y="130" width="40" height="10" class="fp8-block"/>
<rect x="105" y="140" width="40" height="10" class="fp8-block"/>
<rect x="145" y="140" width="40" height="10" class="fp8-block-alt"/>
<rect x="185" y="140" width="40" height="10" class="fp8-block"/>
<rect x="265" y="140" width="40" height="10" class="fp8-block"/>
<rect x="305" y="140" width="40" height="10" class="fp8-block-alt"/>
<!-- Three dots in VERTICAL white bar -->
<text x="245" y="127.5" class="dots-text">⋯</text>
<text x="245" y="202.5" class="dots-text">⋯</text>
<!-- Three dots in HORIZONTAL white bar -->
<text x="165" y="165" class="dots-text">⋮</text>
<text x="305" y="165" class="dots-text">⋮</text>
<!-- ONE diagonal dot at intersection -->
<text x="245" y="165" class="dots-text" transform="rotate(45 245 165)">⋯</text>
<!-- Bottom section (after horizontal gap) -->
<rect x="105" y="180" width="40" height="10" class="fp8-block"/>
<rect x="145" y="180" width="40" height="10" class="fp8-block-alt"/>
<rect x="185" y="180" width="40" height="10" class="fp8-block"/>
<rect x="265" y="180" width="40" height="10" class="fp8-block"/>
<rect x="305" y="180" width="40" height="10" class="fp8-block-alt"/>
<rect x="105" y="190" width="40" height="10" class="fp8-block-alt"/>
<rect x="145" y="190" width="40" height="10" class="fp8-block"/>
<rect x="185" y="190" width="40" height="10" class="fp8-block-alt"/>
<rect x="265" y="190" width="40" height="10" class="fp8-block-alt"/>
<rect x="305" y="190" width="40" height="10" class="fp8-block"/>
<rect x="105" y="200" width="40" height="10" class="fp8-block"/>
<rect x="145" y="200" width="40" height="10" class="fp8-block-alt"/>
<rect x="185" y="200" width="40" height="10" class="fp8-block"/>
<rect x="265" y="200" width="40" height="10" class="fp8-block"/>
<rect x="305" y="200" width="40" height="10" class="fp8-block-alt"/>
<rect x="105" y="210" width="40" height="10" class="fp8-block-alt"/>
<rect x="145" y="210" width="40" height="10" class="fp8-block"/>
<rect x="185" y="210" width="40" height="10" class="fp8-block-alt"/>
<rect x="265" y="210" width="40" height="10" class="fp8-block-alt"/>
<rect x="305" y="210" width="40" height="10" class="fp8-block"/>
<!-- Main outline -->
<rect x="105.0" y="100.0" width="240" height="120" fill="none" stroke="#444" stroke-width="2"/>
</g>
</g>
<!-- RIGHT SIDE: Transposed (Columnwise Quantization) -->
<g id="columnwise-quantization">
<text x="625" y="50" class="title">Columnwise Quantization</text>
<!-- FP8 Tensor - transposed shape (120 wide × 240 tall) with HORIZONTAL stripes -->
<g id="right-tensor">
<!-- White backgrounds for dots areas - cross pattern -->
<rect x="645.0" y="100.0" width="40" height="240" fill="#FFFFFF" stroke="none"/>
<rect x="565.0" y="260.0" width="120" height="30" fill="#FFFFFF" stroke="none"/>
<!-- Horizontal stripes 40×10 (same as rowwise) -->
<!-- Top section (before horizontal gap) - 16 rows of 10px each = 160px -->
<rect x="565" y="100" width="40" height="10" class="fp8-block"/>
<rect x="605" y="100" width="40" height="10" class="fp8-block"/>
<rect x="565" y="110" width="40" height="10" class="fp8-block-alt"/>
<rect x="605" y="110" width="40" height="10" class="fp8-block-alt"/>
<rect x="565" y="120" width="40" height="10" class="fp8-block"/>
<rect x="605" y="120" width="40" height="10" class="fp8-block"/>
<rect x="565" y="130" width="40" height="10" class="fp8-block-alt"/>
<rect x="605" y="130" width="40" height="10" class="fp8-block-alt"/>
<rect x="565" y="140" width="40" height="10" class="fp8-block"/>
<rect x="605" y="140" width="40" height="10" class="fp8-block"/>
<rect x="565" y="150" width="40" height="10" class="fp8-block-alt"/>
<rect x="605" y="150" width="40" height="10" class="fp8-block-alt"/>
<rect x="565" y="160" width="40" height="10" class="fp8-block"/>
<rect x="605" y="160" width="40" height="10" class="fp8-block"/>
<rect x="565" y="170" width="40" height="10" class="fp8-block-alt"/>
<rect x="605" y="170" width="40" height="10" class="fp8-block-alt"/>
<rect x="565" y="180" width="40" height="10" class="fp8-block"/>
<rect x="605" y="180" width="40" height="10" class="fp8-block"/>
<rect x="565" y="190" width="40" height="10" class="fp8-block-alt"/>
<rect x="605" y="190" width="40" height="10" class="fp8-block-alt"/>
<rect x="565" y="200" width="40" height="10" class="fp8-block"/>
<rect x="605" y="200" width="40" height="10" class="fp8-block"/>
<rect x="565" y="210" width="40" height="10" class="fp8-block-alt"/>
<rect x="605" y="210" width="40" height="10" class="fp8-block-alt"/>
<rect x="565" y="220" width="40" height="10" class="fp8-block"/>
<rect x="605" y="220" width="40" height="10" class="fp8-block"/>
<rect x="565" y="230" width="40" height="10" class="fp8-block-alt"/>
<rect x="605" y="230" width="40" height="10" class="fp8-block-alt"/>
<rect x="565" y="240" width="40" height="10" class="fp8-block"/>
<rect x="605" y="240" width="40" height="10" class="fp8-block"/>
<rect x="565" y="250" width="40" height="10" class="fp8-block-alt"/>
<rect x="605" y="250" width="40" height="10" class="fp8-block-alt"/>
<!-- Three dots in VERTICAL white bar -->
<text x="665" y="200" class="dots-text">⋯</text>
<text x="665" y="330" class="dots-text">⋯</text>
<!-- Three dots in HORIZONTAL white bar -->
<text x="605" y="275" class="dots-text">⋮</text>
<!-- ONE diagonal dot at intersection -->
<text x="665" y="275" class="dots-text" transform="rotate(45 665 275)">⋯</text>
<!-- Bottom section (after horizontal gap) - 5 rows of 10px each = 50px -->
<rect x="565" y="290" width="40" height="10" class="fp8-block"/>
<rect x="605" y="290" width="40" height="10" class="fp8-block"/>
<rect x="565" y="300" width="40" height="10" class="fp8-block-alt"/>
<rect x="605" y="300" width="40" height="10" class="fp8-block-alt"/>
<rect x="565" y="310" width="40" height="10" class="fp8-block"/>
<rect x="605" y="310" width="40" height="10" class="fp8-block"/>
<rect x="565" y="320" width="40" height="10" class="fp8-block-alt"/>
<rect x="605" y="320" width="40" height="10" class="fp8-block-alt"/>
<rect x="565" y="330" width="40" height="10" class="fp8-block"/>
<rect x="605" y="330" width="40" height="10" class="fp8-block"/>
<!-- Main outline -->
<rect x="565.0" y="100.0" width="120" height="240" fill="none" stroke="#444" stroke-width="2"/>
</g>
</g>
<!-- SECTION 2: 2D Blockwise Scaling -->
<!-- Section title for 2D -->
<text x="450" y="380" class="title" style="font-size: 18px; font-weight: bold;">2D Blockwise Scaling</text>
<!-- LEFT SIDE: Original 2D Blockwise (copied from combined_scaling.svg) -->
<g id="2d-original">
<text x="280" y="405" class="title">Rowwise Quantization</text>
<!-- TOP: DATA TENSOR (20x20 blocks, with 3 extra columns on right) -->
<g id="data-tensor-left">
<!-- Background for entire tensor -->
<rect x="190" y="445" width="180" height="120" fill="#87CEEB" stroke="#444" stroke-width="2"/>
<!-- White space for gaps (cross pattern) -->
<rect x="250" y="445" width="20" height="120" fill="#FFFFFF" stroke="none"/>
<rect x="190" y="505" width="180" height="20" fill="#FFFFFF" stroke="none"/>
<!-- Grid Lines (every 20px) -->
<!-- Vertical Lines Left (x=210, 230) -->
<line x1="210" y1="445" x2="210" y2="505" stroke="#444" stroke-width="1"/>
<line x1="230" y1="445" x2="230" y2="505" stroke="#444" stroke-width="1"/>
<line x1="210" y1="525" x2="210" y2="565" stroke="#444" stroke-width="1"/>
<line x1="230" y1="525" x2="230" y2="565" stroke="#444" stroke-width="1"/>
<!-- Vertical Lines Right (x=290, 310, 330, 350) -->
<line x1="290" y1="445" x2="290" y2="505" stroke="#444" stroke-width="1"/>
<line x1="290" y1="525" x2="290" y2="565" stroke="#444" stroke-width="1"/>
<line x1="310" y1="445" x2="310" y2="505" stroke="#444" stroke-width="1"/>
<line x1="310" y1="525" x2="310" y2="565" stroke="#444" stroke-width="1"/>
<line x1="330" y1="445" x2="330" y2="505" stroke="#444" stroke-width="1"/>
<line x1="330" y1="525" x2="330" y2="565" stroke="#444" stroke-width="1"/>
<line x1="350" y1="445" x2="350" y2="505" stroke="#444" stroke-width="1"/>
<line x1="350" y1="525" x2="350" y2="565" stroke="#444" stroke-width="1"/>
<!-- Horizontal Lines Top (y=465, 485) -->
<line x1="190" y1="465" x2="250" y2="465" stroke="#444" stroke-width="1"/>
<line x1="270" y1="465" x2="370" y2="465" stroke="#444" stroke-width="1"/>
<line x1="190" y1="485" x2="250" y2="485" stroke="#444" stroke-width="1"/>
<line x1="270" y1="485" x2="370" y2="485" stroke="#444" stroke-width="1"/>
<!-- Horizontal Lines Bottom (y=545) -->
<line x1="190" y1="545" x2="250" y2="545" stroke="#444" stroke-width="1"/>
<line x1="270" y1="545" x2="370" y2="545" stroke="#444" stroke-width="1"/>
<!-- Dots / Ellipses -->
<!-- Horizontal dots in gap -->
<text x="260" y="472" class="dots-text" style="font-size: 14px;">⋯</text>
<text x="260" y="552" class="dots-text" style="font-size: 14px;">⋯</text>
<!-- Vertical dots in gap -->
<text x="220" y="517" class="dots-text" style="font-size: 14px;">⋮</text>
<text x="340" y="517" class="dots-text" style="font-size: 14px;">⋮</text>
<!-- Diagonal dot -->
<text x="260" y="517" class="dots-text" style="font-size: 14px;" transform="rotate(45 260 517)">⋯</text>
<!-- Boundaries around white spaces (excluding center intersection) -->
<!-- Vertical boundaries - broken at horizontal white space -->
<line x1="250" y1="445" x2="250" y2="505" stroke="#444" stroke-width="2"/>
<line x1="250" y1="525" x2="250" y2="565" stroke="#444" stroke-width="2"/>
<line x1="270" y1="445" x2="270" y2="505" stroke="#444" stroke-width="2"/>
<line x1="270" y1="525" x2="270" y2="565" stroke="#444" stroke-width="2"/>
<!-- Horizontal boundaries - broken at vertical white space -->
<line x1="190" y1="505" x2="250" y2="505" stroke="#444" stroke-width="2"/>
<line x1="270" y1="505" x2="370" y2="505" stroke="#444" stroke-width="2"/>
<line x1="190" y1="525" x2="250" y2="525" stroke="#444" stroke-width="2"/>
<line x1="270" y1="525" x2="370" y2="525" stroke="#444" stroke-width="2"/>
<!-- Main outline -->
<rect x="190" y="445" width="180" height="120" fill="none" stroke="#444" stroke-width="2"/>
</g>
</g>
<!-- RIGHT SIDE: Transposed 2D Blockwise -->
<g id="2d-transposed">
<text x="605" y="405" class="title">Columnwise Quantization</text>
<!-- DATA TENSOR TRANSPOSED (120x180 instead of 180x120) -->
<g id="data-tensor-right">
<!-- Background for entire tensor -->
<rect x="545" y="435" width="120" height="180" fill="#87CEEB" stroke="#444" stroke-width="2"/>
<!-- White space for gaps (cross pattern) - TRANSPOSED -->
<!-- Original: X structure (180): 60 + 20 + 100 → Y structure (180): 60 + 20 + 100 -->
<!-- Original: Y structure (120): 60 + 20 + 40 → X structure (120): 60 + 20 + 40 -->
<rect x="545" y="495" width="120" height="20" fill="#FFFFFF" stroke="none"/>
<rect x="605" y="435" width="20" height="180" fill="#FFFFFF" stroke="none"/>
<!-- Grid Lines (every 20px) - TRANSPOSED -->
<!-- Original vertical lines at x=210, 230 become horizontal at y=455, 475 -->
<line x1="545" y1="455" x2="605" y2="455" stroke="#444" stroke-width="1"/>
<line x1="625" y1="455" x2="665" y2="455" stroke="#444" stroke-width="1"/>
<line x1="545" y1="475" x2="605" y2="475" stroke="#444" stroke-width="1"/>
<line x1="625" y1="475" x2="665" y2="475" stroke="#444" stroke-width="1"/>
<!-- Original vertical lines at x=290, 310, 330, 350 become horizontal at y=535, 555, 575, 595 -->
<line x1="545" y1="535" x2="605" y2="535" stroke="#444" stroke-width="1"/>
<line x1="625" y1="535" x2="665" y2="535" stroke="#444" stroke-width="1"/>
<line x1="545" y1="555" x2="605" y2="555" stroke="#444" stroke-width="1"/>
<line x1="625" y1="555" x2="665" y2="555" stroke="#444" stroke-width="1"/>
<line x1="545" y1="575" x2="605" y2="575" stroke="#444" stroke-width="1"/>
<line x1="625" y1="575" x2="665" y2="575" stroke="#444" stroke-width="1"/>
<line x1="545" y1="595" x2="605" y2="595" stroke="#444" stroke-width="1"/>
<line x1="625" y1="595" x2="665" y2="595" stroke="#444" stroke-width="1"/>
<!-- Original horizontal lines at y=465, 485 become vertical at x=565, 585 -->
<line x1="565" y1="435" x2="565" y2="495" stroke="#444" stroke-width="1"/>
<line x1="565" y1="515" x2="565" y2="615" stroke="#444" stroke-width="1"/>
<line x1="585" y1="435" x2="585" y2="495" stroke="#444" stroke-width="1"/>
<line x1="585" y1="515" x2="585" y2="615" stroke="#444" stroke-width="1"/>
<!-- Original horizontal line at y=545 becomes vertical at x=645 (x=605 and x=625 are the gap boundaries) -->
<line x1="605" y1="435" x2="605" y2="495" stroke="#444" stroke-width="1"/>
<line x1="605" y1="515" x2="605" y2="615" stroke="#444" stroke-width="1"/>
<line x1="625" y1="435" x2="625" y2="495" stroke="#444" stroke-width="1"/>
<line x1="625" y1="515" x2="625" y2="615" stroke="#444" stroke-width="1"/>
<line x1="645" y1="435" x2="645" y2="495" stroke="#444" stroke-width="1"/>
<line x1="645" y1="515" x2="645" y2="615" stroke="#444" stroke-width="1"/>
<!-- Dots / Ellipses - TRANSPOSED -->
<!-- Original: horizontal dots at (260, 472) and (260, 552) in vertical gap -->
<!-- Offsets: (70, 27) and (70, 107) → transposed to (27+545, 70+435) and (107+545, 70+435) -->
<text x="572" y="505" class="dots-text" style="font-size: 14px;">⋮</text>
<text x="652" y="505" class="dots-text" style="font-size: 14px;">⋮</text>
<!-- Original: vertical dots at (220, 517) and (340, 517) in horizontal gap -->
<!-- Offsets: (30, 72) and (150, 72) → transposed to (72+545, 30+435) and (72+545, 150+435) -->
<text x="617" y="465" class="dots-text" style="font-size: 14px;">⋯</text>
<text x="617" y="585" class="dots-text" style="font-size: 14px;">⋯</text>
<!-- Diagonal dot at (260, 517) → offset (70, 72) → transposed to (72+545, 70+435) -->
<text x="617" y="505" class="dots-text" style="font-size: 14px;" transform="rotate(45 617 505)">⋯</text>
<!-- Boundaries around white spaces - TRANSPOSED -->
<!-- Original vertical boundaries (x=250, x=270) become horizontal boundaries (y=495, y=515) -->
<line x1="545" y1="495" x2="605" y2="495" stroke="#444" stroke-width="2"/>
<line x1="625" y1="495" x2="665" y2="495" stroke="#444" stroke-width="2"/>
<line x1="545" y1="515" x2="605" y2="515" stroke="#444" stroke-width="2"/>
<line x1="625" y1="515" x2="665" y2="515" stroke="#444" stroke-width="2"/>
<!-- Original horizontal boundaries (y=505, y=525) become vertical boundaries (x=605, x=625) -->
<line x1="605" y1="435" x2="605" y2="495" stroke="#444" stroke-width="2"/>
<line x1="605" y1="515" x2="605" y2="615" stroke="#444" stroke-width="2"/>
<line x1="625" y1="435" x2="625" y2="495" stroke="#444" stroke-width="2"/>
<line x1="625" y1="515" x2="625" y2="615" stroke="#444" stroke-width="2"/>
<!-- Main outline -->
<rect x="545" y="435" width="120" height="180" fill="none" stroke="#444" stroke-width="2"/>
</g>
</g>
</svg>
# Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.

import torch

# Check for Hopper or newer GPU
major, minor = torch.cuda.get_device_capability()
assert major >= 9, f"FP8 Blockwise Scaling requires SM90 (Hopper) or later, got SM{major}{minor}"

# START_BLOCKWISE_SCALING_EXAMPLE

import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Float8BlockScaling

# Create FP8 Blockwise Scaling recipe
recipe = Float8BlockScaling(
    fp8_format=te.common.recipe.Format.E4M3,  # E4M3 or HYBRID (default: E4M3)
    x_block_scaling_dim=1,  # 1D scaling for activations (default: 1)
    w_block_scaling_dim=2,  # 2D scaling for weights (default: 2)
    grad_block_scaling_dim=1,  # 1D scaling for gradients (default: 1)
)

# Create a linear layer with bfloat16 parameters
layer = te.Linear(1024, 1024, params_dtype=torch.bfloat16)

# Forward and backward pass
inp = torch.randn(32, 128, 1024, dtype=torch.bfloat16, device="cuda")

with te.autocast(enabled=True, recipe=recipe):
    output = layer(inp)
    loss = output.sum()

loss.backward()

# END_BLOCKWISE_SCALING_EXAMPLE
..
    Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.

FP8 Current Scaling
===================================
The FP8 Current Scaling recipe is the simplest low-precision recipe provided by Transformer Engine.
To understand how this recipe works, we first need to examine what the FP8 data type is and how it differs from other floating point formats.
FP8 data type
-------------
The FP8 datatype, introduced with the Hopper architecture, is actually two distinct datatypes, each useful in a different part of neural network training:

* E4M3 -- consists of 1 sign bit, 4 exponent bits and 3 bits of mantissa. It can store values up to +/-448 and ``nan``.
* E5M2 -- consists of 1 sign bit, 5 exponent bits and 2 bits of mantissa. It can store values up to +/-57344, +/- ``inf`` and ``nan``. The tradeoff for the increased dynamic range is lower precision of the stored values.

.. raw:: html
   :file: img/fp8_formats.svg

*Figure 1: Structure of the floating point datatypes. All of the values shown (in FP16, BF16, FP8 E4M3 and FP8 E5M2) are the closest representations of value 0.3952.*
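
The maximum values quoted above follow from the bit layouts. The sketch below is plain Python and not part of Transformer Engine; the helper name ``max_finite`` is hypothetical. It assumes the usual conventions: an exponent bias of ``2**(exponent_bits - 1) - 1``, E5M2 reserving the all-ones exponent for ``inf`` and ``nan``, and E4M3 reserving only the single encoding with all-ones exponent and all-ones mantissa for ``nan``.

```python
def max_finite(exponent_bits: int, mantissa_bits: int, ieee_like: bool) -> float:
    """Largest finite magnitude of a small floating point format (illustrative helper)."""
    bias = 2 ** (exponent_bits - 1) - 1
    all_ones_exponent = 2 ** exponent_bits - 1
    if ieee_like:
        # The all-ones exponent is reserved for inf/nan (assumed for E5M2).
        max_exponent = all_ones_exponent - 1
        max_mantissa = 2 ** mantissa_bits - 1
    else:
        # Only all-ones exponent with all-ones mantissa encodes nan (assumed for E4M3).
        max_exponent = all_ones_exponent
        max_mantissa = 2 ** mantissa_bits - 2
    return 2.0 ** (max_exponent - bias) * (1 + max_mantissa / 2 ** mantissa_bits)


E4M3_MAX = max_finite(4, 3, ieee_like=False)
E5M2_MAX = max_finite(5, 2, ieee_like=True)
print(E4M3_MAX, E5M2_MAX)
```

The extra exponent bit of E5M2 raises the largest exponent from 8 to 15, while the lost mantissa bit halves the number of representable values within each power of two.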
**E4M3 and E5M2 usage in training**
By default, Transformer Engine uses a hybrid approach:
* *Forward pass* - activations and weights require more precision, so E4M3 datatype is used to store them.
* *Backward pass* - gradients are less susceptible to precision loss but require higher dynamic range, so E5M2 datatype is preferred.
The user can configure this behavior via the ``fp8_format`` parameter of the recipe.
Scaling factors
---------------
The limited dynamic range of the FP8 datatypes is insufficient for many tensors.
To address this, values in the tensor are scaled. The FP8 Current Scaling recipe uses one **FP32** scale factor per tensor. The representation of a tensor element ``x`` in FP8 precision is given by:

.. code-block:: python

   x = x_fp8 * s

where
* ``x_fp8`` is the FP8 value (E4M3 or E5M2),
* ``s`` is a global **FP32** scaling factor applied to the entire tensor.
**FP8 Current Scaling quantization**
Let's take a closer look at how quantization to FP8 with scaling factor is implemented in
the FP8 Current Scaling recipe.
.. raw:: html
   :file: img/fp8_scaling_concept.svg

*Figure 3: Quantization to FP8 consists of amax (absolute maximum) computation, scaling to fit the FP8 range and casting to the respective FP8 format.*
Quantization to FP8 consists of 3 steps:

1. Computing the absolute maximum value of the tensor, referred to as ``amax``.
2. Multiplying the tensor by ``fp8_max / amax`` so that it fits into the FP8 range. In the notation above this factor is ``1 / s``, i.e. ``s = amax / fp8_max``.
3. Casting to the respective FP8 format using *Round To Nearest Even (RTNE)*: each value is rounded to the nearest representable FP8 value, and a value exactly halfway between two representable values is rounded to the one with an even mantissa, which minimizes systematic bias.

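
The three steps can be sketched in pure Python. This is only a simulation for building intuition, not how Transformer Engine implements quantization: ``round_to_e4m3`` and ``quantize_current_scaling`` are hypothetical helpers, the E4M3 grid is emulated with ordinary floats (4 exponent bits with bias 7, 3 mantissa bits, saturation at 448), and ``inf``/``nan`` inputs are not handled.

```python
import math

E4M3_MAX = 448.0  # largest finite E4M3 value


def round_to_e4m3(value: float) -> float:
    """Round to the nearest E4M3 value (ties to even), saturating at +/-448."""
    if value == 0.0:
        return 0.0
    magnitude = min(abs(value), E4M3_MAX)
    # floor(log2(magnitude)), clamped to the smallest normal exponent (-6)
    # so that subnormal values share the spacing 2**-9.
    exponent = max(math.frexp(magnitude)[1] - 1, -6)
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits -> 8 steps per power of two
    # Python's round() resolves exact halves towards the even integer.
    rounded = min(round(magnitude / step) * step, E4M3_MAX)
    return math.copysign(rounded, value)


def quantize_current_scaling(tensor):
    """Return (x_fp8, s) such that each x is approximately x_fp8 * s."""
    amax = max(abs(v) for v in tensor)                   # step 1: first read
    scale = E4M3_MAX / amax if amax > 0 else 1.0         # step 2: fp8_max / amax
    x_fp8 = [round_to_e4m3(v * scale) for v in tensor]   # step 3: second read
    return x_fp8, 1.0 / scale


x = [0.1, -0.5, 2.0, 0.003]
x_fp8, s = quantize_current_scaling(x)
x_restored = [q * s for q in x_fp8]
print(x_fp8)
print(x_restored)
```

The element equal to ``amax`` maps exactly onto the largest FP8 value, and the remaining elements are restored with a relative error bounded by the 3-bit mantissa.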
**Performance analysis**
Quantization is a memory-bound operation that requires reading the tensor twice:
* First read: compute ``amax`` across all elements.
* Second read: apply the scaling factor and cast to FP8.
This is a significant overhead compared to other recipes, which typically require only a single memory read.
.. raw:: html
   :file: img/fp8_cast_process.svg

*Figure 4: FP8 quantization with current scaling recipe - two tensor reads are needed, one to compute amax and one to apply the scaling factor and cast to FP8.*
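
A back-of-the-envelope estimate shows what the second read costs. The numbers below are assumptions for illustration only: a BF16 input (2 bytes per element), an FP8 output (1 byte per element), no cache reuse between the two passes, and a hypothetical tensor size. The helper ``quantization_traffic_bytes`` is not a Transformer Engine function.

```python
def quantization_traffic_bytes(num_elements: int, reads: int,
                               in_bytes: int = 2, out_bytes: int = 1) -> int:
    """Bytes moved: `reads` passes over the input plus one write of the FP8 output."""
    return num_elements * (reads * in_bytes + out_bytes)


n = 8192 * 8192  # hypothetical activation size
two_pass = quantization_traffic_bytes(n, reads=2)  # amax pass + scale-and-cast pass
one_pass = quantization_traffic_bytes(n, reads=1)  # scale already known in advance
print(two_pass / one_pass)
```

Under these assumptions the two-pass scheme moves 5 bytes per element instead of 3, so a purely memory-bound quantization kernel would take roughly two thirds longer.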
Transpose handling
------------------
*Ada and Hopper*
On Ada and Hopper, the backward pass requires a transposed FP8 tensor.
The columnwise layout is physically different from the rowwise layout, so a transpose operation is needed.
All 3 options from :ref:`Performance Considerations Transpose handling section <handling_transposes>` are supported.
*Blackwell and later*
Blackwell hardware supports multiple GEMM layouts natively, eliminating the need for explicit transposes.
The rowwise and columnwise tensors share the same physical memory layout.
.. figure:: ../performance_considerations/img/hopper_vs_blackwell_layout.svg
   :align: center
   :alt: Comparison of rowwise and columnwise tensor layouts on Blackwell vs Hopper

*Figure 6: On Blackwell, rowwise and columnwise usages share the same memory layout. On Hopper, columnwise usage requires a physical transpose.*
Distributed training
--------------------
**Quantized all-gather**
FP8 all-gather is supported on all supported architectures (Ada and later).
**Amax reduction**
Tensors that are gathered across ranks (e.g. input and gradient in sequence parallelism) require amax synchronization before quantization.
Each rank computes its local ``amax``, then a max-reduction produces the global maximum across all ranks.
All ranks use this synchronized amax to compute identical scaling factors, enabling quantized all-gather.
.. raw:: html
   :file: img/fp8_current_scaling_all_gather.svg

*Figure 7: Quantization and all-gather flow for FP8 current scaling showing amax computation and synchronization.*
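
The effect of the amax reduction can be illustrated without any distributed runtime. In the sketch below the shard values are made up and a plain ``max`` stands in for the collective; in a real PyTorch job this would typically be a max all-reduce such as ``torch.distributed.all_reduce`` with ``ReduceOp.MAX``.

```python
E4M3_MAX = 448.0

# Hypothetical local shards held by three processes before the all-gather.
shards = [
    [0.2, -1.5, 0.7],
    [3.0, 0.1, -0.4],
    [-0.9, 0.05, 2.2],
]

local_amax = [max(abs(v) for v in shard) for shard in shards]
global_amax = max(local_amax)  # stands in for the max-reduction

# With the synchronized amax every process derives the same scale ...
scales_synced = [E4M3_MAX / global_amax for _ in shards]
# ... whereas the local amax values would give scales that disagree.
scales_local = [E4M3_MAX / a for a in local_amax]

print(scales_synced)
print(scales_local)
```

Because all shards are quantized with one shared scale, the gathered FP8 tensor can be described by a single scaling factor, which is what a per-tensor recipe requires.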
Supported devices
-----------------
Ada and later (SM 8.9+)
Examples
--------
Here's how to use the FP8 Current Scaling recipe in PyTorch and JAX:

.. tabs::

   .. tab:: PyTorch

      .. raw:: html

         <div style="background: #f0f4f8; border-left: 3px solid #5c7cfa; padding: 6px 12px; font-size: 13px; color: #495057; margin-bottom: 0; border-radius: 4px 4px 0 0;">
            Requires SM89 (Ada) or later
         </div>

      .. literalinclude:: pytorch_current_scaling_example.py
         :language: python
         :start-after: # START_CURRENT_SCALING_EXAMPLE
         :end-before: # END_CURRENT_SCALING_EXAMPLE

   .. tab:: JAX

      .. raw:: html

         <div style="background: #f0f4f8; border-left: 3px solid #5c7cfa; padding: 6px 12px; font-size: 13px; color: #495057; margin-bottom: 0; border-radius: 4px 4px 0 0;">
            Requires SM89 (Ada) or later
         </div>

      .. literalinclude:: jax_current_scaling_example.py
         :language: python
         :start-after: # START_CURRENT_SCALING_EXAMPLE
         :end-before: # END_CURRENT_SCALING_EXAMPLE

----
Developer Notes
---------------
This section contains implementation details that may be useful for developers
but are not required for using FP8 Current Scaling in practice.
All-gather of columnwise tensors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
On Blackwell and later, rowwise and columnwise tensors share the same memory layout,
so all-gather of columnwise tensors is directly supported.
For Hopper and Ada, all-gather of transposed FP8 tensors is not supported.
The rowwise tensor is gathered first, then transposed to columnwise format.
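
One way to see why the order matters is a toy example with nested lists. This is only an illustration of the data layout, not Transformer Engine code: ``all_gather_rows`` is a hypothetical stand-in for an all-gather that concatenates shards along the first dimension.

```python
def transpose(matrix):
    """Transpose a list-of-rows matrix."""
    return [list(column) for column in zip(*matrix)]


def all_gather_rows(shards):
    """Concatenate shards along the first dimension (stand-in for an all-gather)."""
    return [row for shard in shards for row in shard]


shards = [
    [[1, 2, 3], [4, 5, 6]],      # process 0 holds rows 0-1
    [[7, 8, 9], [10, 11, 12]],   # process 1 holds rows 2-3
]

# Gather the rowwise shards first, then transpose the full tensor.
gathered_then_transposed = transpose(all_gather_rows(shards))

# Gathering already-transposed shards along the first dimension
# does not produce the transpose of the full tensor.
transposed_then_gathered = all_gather_rows([transpose(s) for s in shards])

print(gathered_then_transposed)
print(transposed_then_gathered)
```

The first result is the 3x4 transpose of the full tensor, while the second is a 6x2 stack of per-shard transposes, so a plain all-gather of transposed shards would need an additional rearrangement step.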
<?xml version="1.0" encoding="UTF-8"?>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 900 220">
<defs>
<style>
@import url("../_static/css/diagram-colors.css");
.arrow {
stroke: #616161;
stroke-width: 2;
fill: none;
marker-end: url(#arrowhead-cast);
}
</style>
<marker id="arrowhead-cast" markerWidth="10" markerHeight="10" refX="8" refY="3" orient="auto" markerUnits="strokeWidth">
<polygon points="0 0, 10 3, 0 6" fill="#616161" />
</marker>
</defs>
<!-- Title -->
<text x="450" y="30" class="title" text-anchor="middle">FP8 quantization </text>
<!-- Step 1: High Precision Tensor -->
<rect x="80" y="80" width="140" height="70" class="hp" rx="6"/>
<text x="150" y="110" class="text" text-anchor="middle">High Precision</text>
<text x="150" y="130" class="text" text-anchor="middle">Tensor</text>
<!-- Arrow 1 -->
<path d="M 220 115 L 270 115" class="arrow"/>
<!-- Quantize container box -->
<rect x="270" y="60" width="330" height="130" class="quantize" rx="6"/>
<text x="435" y="205" class="text" style="font-weight: 600; font-size: 14px;" text-anchor="middle">Quantize</text>
<!-- Step 2: Compute Amax (sub-box) -->
<rect x="280" y="95" width="140" height="50" class="amax" rx="4"/>
<text x="350" y="118" class="text" style="font-weight: 600;" text-anchor="middle">Compute amax</text>
<text x="350" y="160" class="small-text" text-anchor="middle">1 tensor read</text>
<!-- Arrow 2 (inside quantize box) -->
<path d="M 420 120 L 450 120" class="arrow"/>
<!-- Step 3: Apply Scale + Cast (sub-box) -->
<rect x="450" y="95" width="140" height="50" class="quantize" rx="4"/>
<text x="520" y="115" class="text" style="font-weight: 600;" text-anchor="middle">Apply Scale</text>
<text x="520" y="130" class="text" style="font-weight: 600;" text-anchor="middle">+ Cast</text>
<text x="520" y="160" class="small-text" text-anchor="middle">1 tensor read</text>
<!-- Arrow 3 -->
<path d="M 600 115 L 650 115" class="arrow"/>
<!-- Step 4: FP8 Tensor -->
<rect x="650" y="80" width="140" height="70" class="fp8" rx="6"/>
<text x="720" y="110" class="text" text-anchor="middle">FP8</text>
<text x="720" y="130" class="text" text-anchor="middle">Tensor</text>
</svg>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 950 170" width="950" height="170">
<defs>
<style>
@import url("../_static/css/diagram-colors.css");
/* Arrows */
.arrow { stroke: #616161; stroke-width: 2; fill: none; marker-end: url(#arrowhead-ag); }
/* All-gather operations - fallback if CSS doesn't load */
.allgather {
fill: #e1f5fe;
stroke: #039be5;
stroke-width: 2;
}
</style>
<marker id="arrowhead-ag" markerWidth="6" markerHeight="6" refX="5" refY="2" orient="auto">
<polygon points="0 0, 6 2, 0 4" fill="#616161" />
</marker>
</defs>
<!-- Title -->
<text x="475" y="30" class="title">Quantization + all gather for FP8 current scaling</text>
<!-- High Precision Tensor -->
<rect x="30" y="80" width="110" height="55" class="hp" rx="6"/>
<text x="85" y="103" class="text">High Precision</text>
<text x="85" y="120" class="text">Tensor</text>
<!-- Arrow -->
<path d="M 140 107 L 165 107" class="arrow"/>
<!-- Compute Amax -->
<rect x="165" y="80" width="100" height="55" class="amax" rx="6"/>
<text x="215" y="103" class="text">Compute</text>
<text x="215" y="120" class="text">Amax</text>
<!-- Arrow -->
<path d="M 265 107 L 290 107" class="arrow"/>
<!-- Synchronize Amax -->
<rect x="290" y="80" width="100" height="55" class="amax" rx="6"/>
<text x="340" y="103" class="text">Synchronize</text>
<text x="340" y="120" class="text">Amax</text>
<!-- Arrow -->
<path d="M 390 107 L 415 107" class="arrow"/>
<!-- Scale + Cast -->
<rect x="415" y="80" width="100" height="55" class="quantize" rx="6"/>
<text x="465" y="103" class="text">Scale +</text>
<text x="465" y="120" class="text">Cast</text>
<!-- Arrow -->
<path d="M 515 107 L 540 107" class="arrow"/>
<!-- FP8 Tensor (intermediate) -->
<rect x="540" y="80" width="100" height="55" class="fp8" rx="6"/>
<text x="590" y="103" class="text">FP8</text>
<text x="590" y="120" class="text">Tensor</text>
<!-- Arrow -->
<path d="M 640 107 L 665 107" class="arrow"/>
<!-- All-Gather -->
<rect x="665" y="80" width="100" height="55" class="allgather" rx="6"/>
<text x="715" y="112" class="text">All-Gather</text>
<!-- Arrow -->
<path d="M 765 107 L 790 107" class="arrow"/>
<!-- FP8 Gathered Tensor -->
<rect x="790" y="80" width="130" height="55" class="fp8" rx="6"/>
<text x="855" y="103" class="text">FP8 Gathered</text>
<text x="855" y="120" class="text">Tensor</text>
</svg>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 900 280">
<defs>
<style>
@import url("../_static/css/diagram-colors.css");
.sign-bit { fill: #9db4d0; stroke: #333; stroke-width: 1; }
.exponent-bit { fill: #d9a066; stroke: #333; stroke-width: 1; }
.mantissa-bit { fill: #a8d99c; stroke: #333; stroke-width: 1; }
.bit-text { fill: #000; text-anchor: middle; dominant-baseline: middle; font-size: 16px; }
.header-text { fill: #555; font-weight: normal; text-anchor: middle; font-size: 18px; }
.value-text { fill: #333; font-size: 18px; }
.format-label { fill: #333; font-weight: bold; text-anchor: middle; dominant-baseline: middle; font-size: 20px; }
</style>
</defs>
<!-- Header labels - centered -->
<text x="149" y="18" class="header-text">sign</text>
<text x="220" y="18" class="header-text">exponent</text>
<text x="420" y="18" class="header-text">mantissa</text>
<!-- FP16 Format (16 bits: 1 + 5 + 10) -->
<text x="60" y="60" class="format-label">FP16</text>
<!-- Sign bit (1) -->
<rect x="140" y="45" width="18" height="30" class="sign-bit"/>
<text x="149" y="60" class="bit-text">0</text>
<!-- Exponent bits (5) -->
<rect x="163" y="45" width="18" height="30" class="exponent-bit"/>
<text x="172" y="60" class="bit-text">0</text>
<rect x="186" y="45" width="18" height="30" class="exponent-bit"/>
<text x="195" y="60" class="bit-text">1</text>
<rect x="209" y="45" width="18" height="30" class="exponent-bit"/>
<text x="218" y="60" class="bit-text">1</text>
<rect x="232" y="45" width="18" height="30" class="exponent-bit"/>
<text x="241" y="60" class="bit-text">0</text>
<rect x="255" y="45" width="18" height="30" class="exponent-bit"/>
<text x="264" y="60" class="bit-text">1</text>
<!-- Mantissa bits (10) -->
<rect x="278" y="45" width="18" height="30" class="mantissa-bit"/>
<text x="287" y="60" class="bit-text">1</text>
<rect x="301" y="45" width="18" height="30" class="mantissa-bit"/>
<text x="310" y="60" class="bit-text">0</text>
<rect x="324" y="45" width="18" height="30" class="mantissa-bit"/>
<text x="333" y="60" class="bit-text">0</text>
<rect x="347" y="45" width="18" height="30" class="mantissa-bit"/>
<text x="356" y="60" class="bit-text">1</text>
<rect x="370" y="45" width="18" height="30" class="mantissa-bit"/>
<text x="379" y="60" class="bit-text">0</text>
<rect x="393" y="45" width="18" height="30" class="mantissa-bit"/>
<text x="402" y="60" class="bit-text">1</text>
<rect x="416" y="45" width="18" height="30" class="mantissa-bit"/>
<text x="425" y="60" class="bit-text">0</text>
<rect x="439" y="45" width="18" height="30" class="mantissa-bit"/>
<text x="448" y="60" class="bit-text">0</text>
<rect x="462" y="45" width="18" height="30" class="mantissa-bit"/>
<text x="471" y="60" class="bit-text">1</text>
<rect x="485" y="45" width="18" height="30" class="mantissa-bit"/>
<text x="494" y="60" class="bit-text">1</text>
<text x="540" y="60" class="value-text">= 0.395264</text>
<!-- BF16 Format (16 bits: 1 + 8 + 7) -->
<text x="60" y="120" class="format-label">BF16</text>
<!-- Sign bit (1) -->
<rect x="140" y="105" width="18" height="30" class="sign-bit"/>
<text x="149" y="120" class="bit-text">0</text>
<!-- Exponent bits (8) -->
<rect x="163" y="105" width="18" height="30" class="exponent-bit"/>
<text x="172" y="120" class="bit-text">0</text>
<rect x="186" y="105" width="18" height="30" class="exponent-bit"/>
<text x="195" y="120" class="bit-text">1</text>
<rect x="209" y="105" width="18" height="30" class="exponent-bit"/>
<text x="218" y="120" class="bit-text">1</text>
<rect x="232" y="105" width="18" height="30" class="exponent-bit"/>
<text x="241" y="120" class="bit-text">1</text>
<rect x="255" y="105" width="18" height="30" class="exponent-bit"/>
<text x="264" y="120" class="bit-text">1</text>
<rect x="278" y="105" width="18" height="30" class="exponent-bit"/>
<text x="287" y="120" class="bit-text">1</text>
<rect x="301" y="105" width="18" height="30" class="exponent-bit"/>
<text x="310" y="120" class="bit-text">0</text>
<rect x="324" y="105" width="18" height="30" class="exponent-bit"/>
<text x="333" y="120" class="bit-text">1</text>
<!-- Mantissa bits (7) -->
<rect x="347" y="105" width="18" height="30" class="mantissa-bit"/>
<text x="356" y="120" class="bit-text">1</text>
<rect x="370" y="105" width="18" height="30" class="mantissa-bit"/>
<text x="379" y="120" class="bit-text">0</text>
<rect x="393" y="105" width="18" height="30" class="mantissa-bit"/>
<text x="402" y="120" class="bit-text">0</text>
<rect x="416" y="105" width="18" height="30" class="mantissa-bit"/>
<text x="425" y="120" class="bit-text">1</text>
<rect x="439" y="105" width="18" height="30" class="mantissa-bit"/>
<text x="448" y="120" class="bit-text">0</text>
<rect x="462" y="105" width="18" height="30" class="mantissa-bit"/>
<text x="471" y="120" class="bit-text">1</text>
<rect x="485" y="105" width="18" height="30" class="mantissa-bit"/>
<text x="494" y="120" class="bit-text">0</text>
<text x="540" y="120" class="value-text">= 0.394531</text>
<!-- FP8 E4M3 Format (8 bits: 1 + 4 + 3) -->
<text x="60" y="180" class="format-label">FP8 E4M3</text>
<!-- Sign bit (1) -->
<rect x="140" y="165" width="18" height="30" class="sign-bit"/>
<text x="149" y="180" class="bit-text">0</text>
<!-- Exponent bits (4) -->
<rect x="163" y="165" width="18" height="30" class="exponent-bit"/>
<text x="172" y="180" class="bit-text">0</text>
<rect x="186" y="165" width="18" height="30" class="exponent-bit"/>
<text x="195" y="180" class="bit-text">1</text>
<rect x="209" y="165" width="18" height="30" class="exponent-bit"/>
<text x="218" y="180" class="bit-text">0</text>
<rect x="232" y="165" width="18" height="30" class="exponent-bit"/>
<text x="241" y="180" class="bit-text">1</text>
<!-- Mantissa bits (3) -->
<rect x="255" y="165" width="18" height="30" class="mantissa-bit"/>
<text x="264" y="180" class="bit-text">1</text>
<rect x="278" y="165" width="18" height="30" class="mantissa-bit"/>
<text x="287" y="180" class="bit-text">0</text>
<rect x="301" y="165" width="18" height="30" class="mantissa-bit"/>
<text x="310" y="180" class="bit-text">1</text>
<text x="355" y="180" class="value-text">= 0.40625</text>
<!-- FP8 E5M2 Format (8 bits: 1 + 5 + 2) -->
<text x="60" y="240" class="format-label">FP8 E5M2</text>
<!-- Sign bit (1) -->
<rect x="140" y="225" width="18" height="30" class="sign-bit"/>
<text x="149" y="240" class="bit-text">0</text>
<!-- Exponent bits (5) -->
<rect x="163" y="225" width="18" height="30" class="exponent-bit"/>
<text x="172" y="240" class="bit-text">0</text>
<rect x="186" y="225" width="18" height="30" class="exponent-bit"/>
<text x="195" y="240" class="bit-text">1</text>
<rect x="209" y="225" width="18" height="30" class="exponent-bit"/>
<text x="218" y="240" class="bit-text">1</text>
<rect x="232" y="225" width="18" height="30" class="exponent-bit"/>
<text x="241" y="240" class="bit-text">0</text>
<rect x="255" y="225" width="18" height="30" class="exponent-bit"/>
<text x="264" y="240" class="bit-text">1</text>
<!-- Mantissa bits (2) -->
<rect x="278" y="225" width="18" height="30" class="mantissa-bit"/>
<text x="287" y="240" class="bit-text">1</text>
<rect x="301" y="225" width="18" height="30" class="mantissa-bit"/>
<text x="310" y="240" class="bit-text">0</text>
<text x="355" y="240" class="value-text">= 0.375</text>
</svg>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 900 380">
<style>
@import url("../_static/css/diagram-colors.css");
.axis-line { stroke: #333; stroke-width: 2.5; }
.value-dot { fill: #2196f3; stroke: #1976d2; stroke-width: 1; }
.arrow { fill: #4caf50; }
.arrow-line { stroke: #4caf50; stroke-width: 3; }
.range-label { font-size: 14px; fill: #555; font-weight: 500; }
</style>
<!-- Top: Original values (before scaling) -->
<text x="450" y="55" class="section-title" text-anchor="middle">Original Tensor Values</text>
<!-- Top axis -->
<line x1="80" y1="85" x2="820" y2="85" class="axis-line"/>
<!-- Zero marker (center) -->
<line x1="450" y1="80" x2="450" y2="90" stroke="#333" stroke-width="2"/>
<text x="450" y="108" class="text" text-anchor="middle" font-size="12px">0</text>
<!-- Value dots (before scaling - irregular, not symmetric around zero) -->
<circle cx="118" cy="85" r="6" fill="#e53935" stroke="#c62828" stroke-width="2"/>
<circle cx="159" cy="85" r="5" class="value-dot"/>
<circle cx="167" cy="85" r="5" class="value-dot"/>
<circle cx="187" cy="85" r="5" class="value-dot"/>
<circle cx="199" cy="85" r="5" class="value-dot"/>
<circle cx="228" cy="85" r="5" class="value-dot"/>
<circle cx="326" cy="85" r="5" class="value-dot"/>
<circle cx="368" cy="85" r="5" class="value-dot"/>
<circle cx="442" cy="85" r="5" class="value-dot"/>
<circle cx="621" cy="85" r="5" class="value-dot"/>
<circle cx="649" cy="85" r="5" class="value-dot"/>
<circle cx="725" cy="85" r="5" class="value-dot"/>
<!-- amax label -->
<text x="118" y="70" class="text" fill="#e53935" font-weight="700" font-size="14px" text-anchor="middle">amax</text>
<!-- Original range bracket spanning all values -->
<line x1="118" y1="100" x2="118" y2="110" stroke="#666" stroke-width="1.5"/>
<line x1="118" y1="110" x2="725" y2="110" stroke="#666" stroke-width="1.5"/>
<line x1="725" y1="100" x2="725" y2="110" stroke="#666" stroke-width="1.5"/>
<text x="750" y="114" class="range-label" text-anchor="start">Original range</text>
<!-- Trapezoid showing compression from original range to FP8 range -->
<polygon points="118,115 725,115 650,165 250,165" fill="#e53935" opacity="0.2" stroke="#e53935" stroke-width="1.5"/>
<!-- Bottom: After scaling -->
<text x="450" y="190" class="section-title" text-anchor="middle">Scaled Values (fit FP8 range)</text>
<!-- Bottom axis -->
<line x1="80" y1="220" x2="820" y2="220" class="axis-line"/>
<!-- Zero marker (center) -->
<line x1="450" y1="215" x2="450" y2="225" stroke="#333" stroke-width="2"/>
<text x="450" y="238" class="text" text-anchor="middle" font-size="12px">0</text>
<!-- FP8 range bracket -->
<line x1="250" y1="245" x2="250" y2="255" stroke="#4caf50" stroke-width="1.5"/>
<line x1="250" y1="255" x2="650" y2="255" stroke="#4caf50" stroke-width="1.5"/>
<line x1="650" y1="245" x2="650" y2="255" stroke="#4caf50" stroke-width="1.5"/>
<text x="750" y="259" class="range-label" text-anchor="start" fill="#4caf50">FP8 range</text>
<!-- Value dots (after scaling - homogeneous scaling from zero, all fit into FP8 range) -->
<circle cx="250" cy="220" r="6" fill="#e53935" stroke="#c62828" stroke-width="2"/>
<text x="250" y="205" class="text" fill="#e53935" font-weight="700" font-size="12px" text-anchor="middle">- FP8 range max</text>
<circle cx="275" cy="220" r="5" class="value-dot"/>
<circle cx="280" cy="220" r="5" class="value-dot"/>
<circle cx="292" cy="220" r="5" class="value-dot"/>
<circle cx="299" cy="220" r="5" class="value-dot"/>
<circle cx="316" cy="220" r="5" class="value-dot"/>
<circle cx="375" cy="220" r="5" class="value-dot"/>
<circle cx="401" cy="220" r="5" class="value-dot"/>
<circle cx="445" cy="220" r="5" class="value-dot"/>
<circle cx="553" cy="220" r="5" class="value-dot"/>
<circle cx="569" cy="220" r="5" class="value-dot"/>
<circle cx="615" cy="220" r="5" class="value-dot"/>
<!-- Third line: After cast to FP8 (quantized values) -->
<text x="450" y="290" class="section-title" text-anchor="middle">Cast to FP8 (quantized values)</text>
<!-- Third axis -->
<line x1="80" y1="320" x2="820" y2="320" class="axis-line"/>
<!-- Zero marker (center) -->
<line x1="450" y1="315" x2="450" y2="325" stroke="#333" stroke-width="2"/>
<text x="450" y="338" class="text" text-anchor="middle" font-size="12px">0</text>
<!-- FP8 range bracket -->
<line x1="250" y1="345" x2="250" y2="355" stroke="#4caf50" stroke-width="1.5"/>
<line x1="250" y1="355" x2="650" y2="355" stroke="#4caf50" stroke-width="1.5"/>
<line x1="650" y1="345" x2="650" y2="355" stroke="#4caf50" stroke-width="1.5"/>
<text x="750" y="359" class="range-label" text-anchor="start" fill="#4caf50">FP8 range</text>
<!-- Quantized dots - merged close values to show FP8 granularity -->
<circle cx="250" cy="320" r="6" fill="#e53935" stroke="#c62828" stroke-width="2"/>
<!-- merged: 275+280 -->
<circle cx="278" cy="317" r="4.5" class="value-dot"/>
<circle cx="278" cy="323" r="4.5" class="value-dot"/>
<!-- merged: 292+299 -->
<circle cx="296" cy="317" r="4.5" class="value-dot"/>
<circle cx="296" cy="323" r="4.5" class="value-dot"/>
<circle cx="318" cy="320" r="5" class="value-dot"/>
<circle cx="378" cy="320" r="5" class="value-dot"/>
<circle cx="404" cy="320" r="5" class="value-dot"/>
<circle cx="450" cy="320" r="5" class="value-dot"/>
<!-- merged: 553+569 -->
<circle cx="562" cy="317" r="4.5" class="value-dot"/>
<circle cx="562" cy="323" r="4.5" class="value-dot"/>
<circle cx="615" cy="320" r="5" class="value-dot"/>
</svg>
# Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.
# START_CURRENT_SCALING_EXAMPLE
import jax
import jax.numpy as jnp
import transformer_engine.jax as te
from transformer_engine.jax.flax import DenseGeneral
from transformer_engine.common.recipe import Float8CurrentScaling, Format
# Create FP8 Current Scaling recipe
# Available formats:
# - Format.HYBRID (default) -- E4M3 for forward pass, E5M2 for backward pass
# - Format.E4M3 -- E4M3 for both forward and backward pass
recipe = Float8CurrentScaling(fp8_format=Format.HYBRID)
with te.autocast(enabled=True, recipe=recipe):
    # Create and initialize layer
    layer = DenseGeneral(features=1024)
    key = jax.random.PRNGKey(0)
    x = jax.random.normal(key, (32, 128, 1024), dtype=jnp.bfloat16)
    var_collect = layer.init(key, x)

    # Forward and backward pass
    def loss_fn(var_collect):
        output = layer.apply(var_collect, x)
        return output.sum()

    loss, grads = jax.value_and_grad(loss_fn)(var_collect)
# END_CURRENT_SCALING_EXAMPLE
# Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.
# START_CURRENT_SCALING_EXAMPLE
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Float8CurrentScaling, Format
# Create FP8 Current Scaling recipe
# Available formats:
# - Format.HYBRID (default) -- E4M3 for forward pass, E5M2 for backward pass
# - Format.E4M3 -- E4M3 for both forward and backward pass
recipe = Float8CurrentScaling(fp8_format=Format.HYBRID)
# Create a simple linear layer with bfloat16 parameters
layer = te.Linear(1024, 1024, params_dtype=torch.bfloat16)
# Forward and backward pass
inp = torch.randn(32, 128, 1024, dtype=torch.bfloat16, device="cuda")
with te.autocast(enabled=True, recipe=recipe):
    output = layer(inp)
    loss = output.sum()

loss.backward()
# END_CURRENT_SCALING_EXAMPLE
..
Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
FP8 Delayed Scaling
===================================
The FP8 Delayed Scaling recipe estimates scaling factors from historical amax values rather than computing them
from the current contents of each tensor. Compared to the Current Scaling recipe,
this reduces tensor reads per quantization from two to one,
which lowers memory traffic.
Both this recipe and the :doc:`FP8 Current Scaling <../fp8_current_scaling/fp8_current_scaling>` recipe use
the same FP8 formats (E4M3/E5M2) with one FP32 scaling factor per tensor.
Reading the FP8 Current Scaling documentation first is recommended.
Quantization with delayed scaling factors
-----------------------------------------
FP8 Current Scaling requires two tensor reads per quantization: one to compute amax,
one to cast. FP8 Delayed Scaling eliminates the first read by predicting the scaling factor
from historical amax values - hence *delayed* (using past values) versus *current* (using present values).
The quantization process works as follows:
1. **Compute scaling factor from history** (no tensor read needed):
   The scaling factor is derived from the stored ``amax_history`` using the formula:

   ``scaling_factor = FP8_MAX / amax``

   where ``amax`` is computed from the history using either the ``max`` algorithm (maximum over the window, default) or the ``most_recent`` algorithm.

2. **Quantize the tensor** (one tensor read):
   Apply the scaling factor and cast to FP8. Values exceeding the FP8 range are clipped.

3. **Update history**:
   Record the actual amax from this quantization for future iterations.

Each module maintains an ``amax_history`` tensor of configurable length (``amax_history_len``)
for each quantized tensor.
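The three steps can be sketched in pure Python as follows. This is a simplified illustration rather than the library code: the function name is hypothetical, only the ``max`` algorithm is modeled, rounding to the FP8 grid is omitted, ``FP8_E4M3_MAX = 448.0`` assumes the E4M3 format, and the fallback to a scale of ``1.0`` for an all-zero history is an assumption of this sketch.

```python
FP8_E4M3_MAX = 448.0  # assumption: E4M3, whose largest finite magnitude is 448


def quantize_delayed(tensor, past_amaxes, fp8_max=FP8_E4M3_MAX):
    """Quantize with a scale predicted from past amax values (``max`` algorithm)."""
    # Step 1: scale from history, no read of the current tensor is needed.
    amax = max(past_amaxes)
    scale = fp8_max / amax if amax > 0 else 1.0  # zero-history fallback is an assumption

    # Steps 2 and 3 share one pass over the tensor: scale, clip, and track the amax.
    observed_amax = 0.0
    quantized = []
    for v in tensor:
        observed_amax = max(observed_amax, abs(v))
        quantized.append(max(-fp8_max, min(fp8_max, v * scale)))

    # The observed amax is what would be recorded in the history afterwards.
    return quantized, scale, observed_amax


past_amaxes = [2.0, 4.0, 3.5]
tensor = [1.0, -8.0, 0.5]  # -8.0 is larger than anything seen in the history
quantized, scale, observed_amax = quantize_delayed(tensor, past_amaxes)
```

In this example the outlier ``-8.0`` exceeds the historical amax of ``4.0``, so it is clipped to the edge of the FP8 range. The observed amax of ``8.0`` only influences the scaling factor in later iterations.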
.. raw:: html
:file: img/scaling_comparison.svg
*Figure 1. Comparison of FP8 Current Scaling and FP8 Delayed Scaling quantization processes.*
Amax History Management
-----------------------
The ``amax_history`` buffer acts as a sliding window of recent amax values.
Position 0 serves as a staging area for the current amax, while positions 1 to N-1
store the history from oldest to newest. Each quantization writes the observed amax
to position 0, and after the pass completes, the history is rotated:
.. code-block:: text

   Before rotation: [amax_N, amax_1, amax_2, ..., amax_N-1]   (amax_N = current, amax_1 = oldest)
   After rotation:  [0, amax_2, ..., amax_N-1, amax_N]        (amax_1 dropped, amax_N appended)

The scaling factor is computed **before** the rotation, so it uses all ``amax_history_len`` values.
After the scale update, position 0 is zeroed so that it is ready for the next iteration's amax.
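The rotation can be written out with a Python list standing in for the ``amax_history`` buffer. This is a sketch under stated assumptions, not the actual kernel: the function names are hypothetical, only the ``max`` algorithm is modeled, ``FP8_E4M3_MAX = 448.0`` assumes the E4M3 format, and keeping the previous scale when the history is all zeros is an assumption of this sketch.

```python
FP8_E4M3_MAX = 448.0  # assumption: E4M3, whose largest finite magnitude is 448


def rotate_amax_history(history):
    """Drop the oldest entry, append the staged amax, and zero position 0."""
    rotated = history[1:] + history[:1]  # left-roll by one: staged amax moves to the end
    rotated[0] = 0.0                     # old amax_1 is overwritten; slot is staged for reuse
    return rotated


def end_of_pass_update(history, prev_scale, fp8_max=FP8_E4M3_MAX):
    """Compute the next scale from the full window, then rotate the history."""
    amax = max(history)  # computed before rotation, so the staged value is included
    scale = fp8_max / amax if amax > 0 else prev_scale  # fallback is an assumption
    return scale, rotate_amax_history(history)


# Position 0 holds the amax observed in this pass; positions 1..3 go from oldest to newest.
history = [8.0, 2.0, 4.0, 3.5]
new_scale, new_history = end_of_pass_update(history, prev_scale=1.0)
```

Here the oldest value ``2.0`` is dropped, the staged value ``8.0`` becomes the newest history entry, and position 0 is cleared for the next pass.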
The implementation differs between PyTorch and JAX:
.. tabs::
.. tab:: PyTorch
Each module creates two ``amax_history`` tensors, initialized to zero:
- Forward: shape ``(amax_history_len, num_gemms * 3)`` — three FP8 tensors per GEMM (input, weight, output)
- Backward: shape ``(amax_history_len, num_gemms * 2)`` — two FP8 tensors per GEMM (grad_output, grad_input)
When the autocast context exits, a single CUDA kernel processes all tensors at once —
performing amax reduction across GPUs and history rotation. This batched approach
minimizes kernel launch overhead compared to updating each tensor separately.
.. tab:: JAX
Each quantizer maintains its own ``amax_history`` with shape ``(amax_history_len,)``
and updates independently.
Here's how to use FP8 Delayed Scaling in PyTorch and JAX:
.. tabs::
.. tab:: PyTorch
.. raw:: html
<div style="background: #f0f4f8; border-left: 3px solid #5c7cfa; padding: 6px 12px; font-size: 13px; color: #495057; margin-bottom: 0; border-radius: 4px 4px 0 0;">
Requires SM89 (Ada) or later
</div>
.. literalinclude:: pytorch_delayed_scaling_example.py
:language: python
:start-after: # START_DELAYED_SCALING_EXAMPLE
:end-before: # END_DELAYED_SCALING_EXAMPLE
.. tab:: JAX
.. raw:: html
<div style="background: #f0f4f8; border-left: 3px solid #5c7cfa; padding: 6px 12px; font-size: 13px; color: #495057; margin-bottom: 0; border-radius: 4px 4px 0 0;">
Requires SM89 (Ada) or later
</div>
.. literalinclude:: jax_delayed_scaling_example.py
:language: python
:start-after: # START_DELAYED_SCALING_EXAMPLE
:end-before: # END_DELAYED_SCALING_EXAMPLE
Distributed Training
--------------------
FP8 Delayed Scaling uses the same data formats as FP8 Current Scaling, so quantized all-gather is supported.
However, amax reduction differs slightly between frameworks.
.. tabs::
.. tab:: PyTorch
Amax reduction is controlled by two parameters:
- ``reduce_amax`` in recipe: enables/disables reduction (required for SP and CP)
- ``amax_reduction_group`` in ``autocast``: specifies the process group for reduction
We recommend reducing amax across all GPUs where the tensor is sharded,
including data parallel ranks.
.. literalinclude:: pytorch_delayed_scaling_distributed_example.py
:language: python
:start-after: # START_AMAX_REDUCTION_EXAMPLE
:end-before: # END_AMAX_REDUCTION_EXAMPLE
In data parallel training, some modules may not execute on certain ranks
(e.g., MoE experts that receive no tokens). This is handled as follows:
- **First iteration**: All modules must execute on all ranks to register
their ``amax_history`` tensors in the global buffer. Mismatched registration
would cause the ``all_reduce`` to hang due to different tensor sizes across ranks.
- **Subsequent iterations**: The ``autocast`` context must be entered and exited
on all ranks (this triggers the collective reduction). Individual modules can be
skipped: if no rank executes a module, its history is not rotated and its scale
remains unchanged.
.. tab:: JAX
Amax reduction is always enabled and managed automatically.
Reduction scope: all parallelism axes except pipeline parallelism (TP, SP, DP/FSDP).
.. literalinclude:: jax_delayed_scaling_distributed_example.py
:language: python
:start-after: # START_AMAX_REDUCTION_EXAMPLE
:end-before: # END_AMAX_REDUCTION_EXAMPLE
Supported devices
-----------------
Ada and later (SM 8.9+)
<?xml version="1.0" encoding="UTF-8"?>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 1000 420">
<defs>
<style>
/* Common styles loaded from diagram-colors.css: .hp, .fp8, .quantize, .amax, .text, .title, .label, .box-orange, .box-dashed */
/* Diagram-specific styles for arrows */
.arrow {
stroke: #616161;
stroke-width: 2;
fill: none;
marker-end: url(#arrowhead);
}
</style>
<marker id="arrowhead" markerWidth="10" markerHeight="10" refX="8" refY="3" orient="auto" markerUnits="strokeWidth">
<polygon points="0 0, 10 3, 0 6" fill="#616161" />
</marker>
</defs>
<!-- Current Scaling Section -->
<text x="250" y="30" class="title">Current Scaling</text>
<!-- Tensor box -->
<rect x="150" y="60" width="200" height="60" class="hp" rx="5"/>
<text x="250" y="95" class="text">Tensor</text>
<!-- Arrow to amax computation -->
<path d="M 250 120 L 250 160" class="arrow"/>
<!-- Amax computation box -->
<rect x="150" y="160" width="200" height="60" class="amax" rx="5"/>
<text x="250" y="195" class="text">Amax Computation</text>
<!-- Arrow to quantization -->
<path d="M 250 220 L 250 260" class="arrow"/>
<!-- Quantization box -->
<rect x="125" y="260" width="250" height="60" class="quantize" rx="5"/>
<text x="250" y="285" class="text">Quantization</text>
<text x="250" y="305" class="label">(uses tensor + amax)</text>
<!-- Arrow to FP8 tensor -->
<path d="M 250 320 L 250 360" class="arrow"/>
<!-- FP8 Tensor result -->
<rect x="150" y="360" width="200" height="40" class="fp8" rx="5"/>
<text x="250" y="385" class="text">FP8 Tensor</text>
<!-- Delayed Scaling Section -->
<text x="750" y="30" class="title">Delayed Scaling</text>
<!-- Tensor box with amax history subbox -->
<rect x="650" y="60" width="200" height="80" class="hp" rx="5"/>
<text x="750" y="90" class="text">Tensor</text>
<!-- Amax history subbox (below tensor) -->
<rect x="660" y="110" width="180" height="25" class="box-orange box-dashed" rx="3"/>
<text x="750" y="127" class="label">amax history</text>
<!-- Arrow to quantization -->
<path d="M 750 140 L 750 180" class="arrow"/>
<text x="820" y="162" class="small-text" style="text-anchor: start;">read amax</text>
<!-- Quantization box -->
<rect x="625" y="180" width="250" height="80" class="quantize" rx="5"/>
<text x="750" y="210" class="text">Quantization</text>
<text x="750" y="230" class="label">(uses tensor + amax from history)</text>
<text x="750" y="250" class="label">(updates amax history)</text>
<!-- Arrow back to history (curved) -->
<path d="M 625 220 Q 590 220 590 127 L 660 127" class="arrow"/>
<text x="565" y="175" class="small-text" style="text-anchor: end;">update amax</text>
<!-- Arrow to FP8 tensor -->
<path d="M 750 260 L 750 300" class="arrow"/>
<!-- FP8 Tensor result -->
<rect x="650" y="300" width="200" height="40" class="fp8" rx="5"/>
<text x="750" y="325" class="text">FP8 Tensor</text>
</svg>
# Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.
# START_AMAX_REDUCTION_EXAMPLE
import transformer_engine.jax as te
from transformer_engine.common.recipe import DelayedScaling
# Amax reduction scope is managed internally
recipe = DelayedScaling(reduce_amax=True) # Must be True in JAX
# `layer`, `params`, and `inp` are assumed to be defined earlier
with te.autocast(enabled=True, recipe=recipe):
    output = layer.apply(params, inp)
# END_AMAX_REDUCTION_EXAMPLE