An important observation is that the **forward pass uses only rowwise tensors**: both the input
and the weight are accessed rowwise.
The backward pass introduces columnwise access. For the weight gradient, both the output gradient and the input
are accessed columnwise. For the input gradient, the output gradient is rowwise while the weight is columnwise.
As a result, each tensor (input, weight, output gradient) needs both rowwise and columnwise
usages during training. This has implications for memory layout and transpose operations.
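To make these access patterns concrete, below is a plain-PyTorch sketch of the
three GEMMs of a linear layer (the tensor names are illustrative, not part of
the TE API):

.. code-block:: python

    import torch

    x = torch.randn(32, 64)    # input: 32 tokens, 64 input features
    w = torch.randn(128, 64)   # weight: 128 output features
    dy = torch.randn(32, 128)  # output gradient

    # Forward: y = x @ w.T. The reduction runs over the feature dimension,
    # so both x and w are traversed rowwise.
    y = x @ w.t()

    # Weight gradient: dw = dy.T @ x. The reduction runs over the token
    # dimension, so both dy and x are traversed columnwise.
    dw = dy.t() @ x

    # Input gradient: dx = dy @ w. dy is traversed rowwise, while the
    # reduction runs over the rows of w, i.e. w is traversed columnwise.
    dx = dy @ w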
**Architecture differences**
The physical memory layout requirements for rowwise and columnwise usages differ between architectures
and recipes. For FP8 tensors:
- *Hopper*: cannot efficiently access elements in a columnwise fashion, so columnwise tensors need to be physically transposed in memory. Note that higher-precision formats (BF16/FP16) do not have this limitation.
- *Blackwell*: supports columnwise access natively, so no transpose is needed.
We will see that for most recipes and devices, rowwise and columnwise usages need different tensors.
Thus, by *rowwise tensor* and *columnwise tensor* we mean tensors used in rowwise and columnwise fashion, respectively.
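As a toy illustration of the layout difference (plain PyTorch, ignoring
scaling factors; this is not TE internals):

.. code-block:: python

    import torch

    x = torch.randn(1024, 1024)
    x_fp8 = x.to(torch.float8_e4m3fn)    # rowwise tensor, 1 byte/element

    # Hopper-style: columnwise usage requires a physically transposed copy,
    # costing an extra 1 byte/element.
    x_fp8_col = x_fp8.t().contiguous()

    # Blackwell-style: the GEMM can consume the same buffer in a columnwise
    # fashion, so a transposed view allocates no new memory.
    x_fp8_col_view = x_fp8.t()
    assert x_fp8_col_view.data_ptr() == x_fp8.data_ptr()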
.. figure:: img/hopper_vs_blackwell_layout.svg
:align: center
:alt: Comparison of rowwise and columnwise tensor layouts on Blackwell vs Hopper
Figure 2: On Blackwell, rowwise and columnwise usages share the same memory layout.
On Hopper, columnwise usage requires a physical transpose.
**Quantization fusions**
This section is relevant only for recipes in which columnwise tensors
differ from rowwise tensors.
Note that performing rowwise and columnwise quantization at the same time
enables some fusions, which usually lead to better performance.
We showcase three example scenarios of producing quantized tensors in rowwise and columnwise usages:
1. *Computation of the quantized tensor in both rowwise and columnwise usages in a single kernel in the forward pass*.
This is the fastest option,
but since the columnwise usage is saved for the backward pass, it may increase memory usage
if the high-precision tensor also needs to be saved for backward, for example when it is the attention output, which is saved anyway.
2. *Computation of the quantized tensor in rowwise usage in the forward pass and fused quantization to produce the columnwise usage in the backward pass*.
This is usually slower than the previous option, since the high-precision tensor needs to be read twice.
It is used, for example, when the high-precision tensor is gathered both in forward and in backward
and a quantized-tensor gather is not implemented for the given recipe.
3. *Computation of the quantized tensor in rowwise usage in the forward pass and a transpose to produce the columnwise usage in the backward pass*.
This is more memory efficient than option 1, but not all recipes can use it: for recipes whose
columnwise usage cannot be obtained by a pure transpose, requantization would be needed, and the quantization accuracy would drop due to double quantization errors.
Transformer Engine chooses the best possible fusion internally, taking the recipe, the TE module configuration, and the operation into account.
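A rough sketch of the three options with a toy per-tensor FP8 quantizer
(the ``quantize`` helper below is illustrative and far simpler than TE's
fused kernels):

.. code-block:: python

    import torch

    FP8_MAX = 448.0  # largest representable value of float8_e4m3fn

    def quantize(t: torch.Tensor):
        """Toy per-tensor quantization: a single scale for the whole tensor."""
        scale = FP8_MAX / t.abs().max().clamp(min=1e-12)
        return (t * scale).to(torch.float8_e4m3fn), scale

    x = torch.randn(512, 512)

    # Option 1: a single fused kernel writes both usages in forward;
    # the columnwise copy stays allocated until backward.
    x_row, s = quantize(x)
    x_col = x_row.t().contiguous()  # emulates the kernel's second output

    # Option 2: quantize rowwise in forward, then re-read the saved
    # high-precision x and quantize again (columnwise) in backward.
    x_row, s = quantize(x)       # forward
    x_col, s2 = quantize(x.t())  # backward: second pass over x

    # Option 3: quantize rowwise in forward, transpose the already
    # quantized tensor in backward. Exact here, because a per-tensor scale
    # survives a transpose; recipes whose scales do not would have to
    # requantize and suffer double quantization error.
    x_row, s = quantize(x)          # forward
    x_col = x_row.t().contiguous()  # backward: transpose only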
.. raw:: html
:file: img/transpose_fusion.svg
*Figure 3: Three scenarios of producing quantized tensors in rowwise and columnwise usages.*
Memory usage
------------
This section discusses memory usage in low-precision training.
Contrary to intuition, FP8 training does not always reduce memory usage compared to BF16/FP16.
**Master weights**
Transformer Engine by default stores weights in high precision and quantizes them to low precision before each GEMM.
Moreover, one can specify which high-precision format is used to store the weights in the
model (FP32/BF16/FP16), or choose not to store high-precision weights in the model at all.
There are multiple scenarios to consider; three of them are listed below:
1. model weights are in FP32 and are quantized to low precision before each GEMM,
2. model weights are in BF16/FP16 and are quantized to low precision before each GEMM, while master weights in the optimizer are kept in FP32,
3. model weights are stored directly in low precision, and master weights in the optimizer are kept in FP32.
Note that each of these scenarios may have a different memory footprint, as sketched below.
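For intuition, here is a back-of-the-envelope count of bytes per parameter for
the three scenarios, assuming an Adam-style optimizer with two FP32 moments
and ignoring transient quantized copies and scaling factors:

.. code-block:: python

    adam_moments = 2 * 4               # two FP32 moments per parameter

    scenario_1 = 4 + adam_moments      # FP32 model weights, updated in place
    scenario_2 = 2 + 4 + adam_moments  # BF16/FP16 model weights + FP32 master weights
    scenario_3 = 1 + 4 + adam_moments  # FP8 model weights + FP32 master weights

    print(scenario_1, scenario_2, scenario_3)  # 12 14 13

Note that under these assumptions scenario 3 is not automatically the smallest: the FP32 master copy in the optimizer can outweigh the savings from storing low-precision model weights.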
**Activations saved for backward**
Unlike weights, activations do not require a high-precision copy for optimizer updates.
As shown in Table 2, the input needs rowwise usage in forward and columnwise usage
for weight gradient computation in backward, so it must be saved between passes.
The memory impact depends on which scenario from Figure 3 is used.
Additionally, on architectures where rowwise and columnwise usage tensors share the same memory layout
(e.g., FP8 on Blackwell, as shown in Figure 2), a single quantized tensor serves both usages,
reducing memory overhead compared to architectures requiring separate tensors.
Output gradients, on the other hand, are computed during backward and do not need to be saved:
both rowwise and columnwise usages are produced on the fly as needed.
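A similar per-element count for the input of a single linear layer saved for
backward (again ignoring scaling factors; the exact numbers depend on the
recipe and on which scenario from Figure 3 applies):

.. code-block:: python

    bf16_baseline = 2  # BF16 input saved as-is

    # Hopper, scenario 1 from Figure 3: rowwise + physically transposed
    # columnwise FP8 copies are both kept until backward.
    fp8_hopper_scenario_1 = 1 + 1

    # Hopper, scenario 3 from Figure 3: only the rowwise FP8 copy is saved;
    # the columnwise copy is produced by a transpose during backward.
    fp8_hopper_scenario_3 = 1

    # Blackwell: a single FP8 buffer serves both usages.
    fp8_blackwell = 1

In scenario 1 the FP8 footprint matches the BF16 baseline, which is one way the counterintuitive claim above can materialize.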
The FP8 examples below are analyzed on the Hopper (SM90) or Ada (SM89) architectures, where rowwise
and columnwise tensors require separate memory layouts.