For a more comprehensive tutorial, check out our `Quickstart Notebook <https://github.com/NVIDIA/TransformerEngine/blob/main/docs/examples/quickstart.ipynb>`_.
For a more comprehensive tutorial, check out our `Getting Started Guide <https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/getting_started.html>`_.
.. overview-end-marker-do-not-remove
...
...
@@ -175,15 +175,22 @@ For example to use the NGC PyTorch container interactively,
.. code-block:: bash
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:25.08-py3
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:26.01-py3
For example to use the NGC JAX container interactively,
.. code-block:: bash
docker run --gpus all -it --rm nvcr.io/nvidia/jax:25.08-py3
docker run --gpus all -it --rm nvcr.io/nvidia/jax:26.01-py3
Where 25.08 (corresponding to August 2025 release) is the container version.
Where 26.01 (corresponding to January 2026 release) is the container version.
We recommend updating to the latest NGC container available here:
If you run any examples, please ensure you are using a matching version of TransformerEngine. TransformerEngine is pre-built and packaged inside the containers, with examples available at ``/opt/transformerengine`` or ``/opt/transformer-engine``. If you would like to use examples from the TE main branch and are running into import errors, please try the latest pip package or building from source, although NGC containers are recommended for ease of use for most users.
* **Solution:** This can occur when TE is built against the container's system installation of cuDNN, but the virtual environment pulls in the ``nvidia-cudnn-cu12/cu13`` pip packages. To resolve this, when building TE from source, please specify the following environment variables to point to the cuDNN in your virtual environment.
* **Symptoms:** Regular TE installs work correctly but UV wheel builds fail at runtime.
* **Solution:** Ensure that ``uv build --wheel --no-build-isolation -v`` is used during the wheel build as well as the pip installation of the wheel. Use ``-v`` for verbose output to verify that TE is not pulling in a mismatching version of PyTorch or JAX that differs from the UV environment's version.
* **Solution:** Ensure ``--no-build-isolation`` is used during installation. If pre-building wheels, ensure that the wheel is both built and installed with ``--no-build-isolation``. See "Problems using UV or Virtual Environments" above if using UV.
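For example, a minimal sketch of pre-building and installing a wheel inside a UV environment (the wheel filename below is illustrative and depends on your TE version and platform):

.. code-block:: bash

   # Build the wheel against the framework packages already present in the environment
   uv build --wheel --no-build-isolation -v
   # Install the resulting wheel, again without build isolation
   uv pip install --no-build-isolation -v dist/transformer_engine-*.whl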
"# Accelerating Hugging Face Llama 2 and 3 Fine-Tuning with Transformer Engine\n",
...
...
@@ -14,11 +13,11 @@
"This tutorial showcases how to accelerate finetuning a full [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf) or [Llama 3](https://huggingface.co/meta-llama/Meta-Llama-3-8B) models from Hugging Face by using `TransformerLayer` from the [Transformer Engine library](https://github.com/NVIDIA/TransformerEngine) in `BF16` and `FP8` precisions.\n",
"\n",
"</div>\n"
]
],
"id": "6a5b2993"
},
{
"cell_type": "markdown",
"id": "331f476a",
"metadata": {},
"source": [
"## Dependencies for this tutorial\n",
...
...
@@ -29,12 +28,11 @@
" - This file contains the code to load a Hugging Face Llama 2 or Llama 3 checkpoint in Transformer Engine's `TransformerLayer` instead of Hugging Face's `LlamaDecoderLayer`. This is used in the following two sections of the tutorial - \"Improvement 1\" and \"Improvement 2\".\n",
"2. `utils.py`\n",
" - This file contains the code related to dataloading, hyperparameters, setting up model/optimizers/accelerator, model training and other miscellaneous tasks like restarting the jupyter notebook from within the cell. \n",
"3. `media/`\n",
"3. `requirements.txt`\n",
" - This file contains the necessary Python packages for this tutorial.\n",
"4. `media/`\n",
" - This directory contains the images used in the following tutorial.\n",
"\n",
"These packages are necessary to run this tutorial:\n",
"This tutorial shows the cell outputs when run with Llama 2 7B weights. It can be run with Llama 3 8B weights simply by providing the directory with those weights (in Hugging Face format) instead of Llama 2 7B weights. These two models are almost identical, the biggest difference being the model dimension (the smallest Llama 3 model has 8B parameters, whereas the smallest Llama 2 has 7B), which enables this tutorial to work for both of them.\n",
"\n",
"</div>\n"
]
"</div>\n",
""
],
"id": "331f476a"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setup\n",
"\n",
"Install the required Python packages using the following command:"
],
"id": "b56526b3"
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Uncomment and run this cell when running the tutorial for the first time\n",
"# %pip install -r requirements.txt"
],
"id": "099697e2",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "44abae4f",
"metadata": {},
"source": [
"## Table of contents\n",
...
...
@@ -61,11 +81,11 @@
" - Mapping weights from HF's `LlamaDecoderLayer` to TE's `TransformerLayer`\n",
" <figcaption> Fig 2: Comparing GPT and Llama architectures. </figcaption>\n",
"</figure>"
]
],
"id": "e37e2cc1"
},
{
"cell_type": "markdown",
"id": "a110de1a",
"metadata": {},
"source": [
"## Hugging Face's `LlamaModel`\n",
...
...
@@ -155,7 +175,7 @@
")\n",
"```\n",
"\n",
"#### Hugging Face's `LlamaDecoderLayer`\n",
"### Hugging Face's `LlamaDecoderLayer`\n",
"\n",
"Let's take a closer look at `LlamaDecoderLayer`. It is composed of `input_layernorm`, `self_attn`, `post_attention_layernorm` and `mlp` modules. Each module has associated weights as shown in the diagram.\n",
"\n",
...
...
@@ -164,10 +184,10 @@
" <figcaption> Fig 4: Causal Llama Model Block Diagram (with simplified illustration of the [LlamaDecoderLayer](https://github.com/huggingface/transformers/blob/e770f0316d2a9b787c9d1440f204fcb65e176682/src/transformers/models/llama/modeling_llama.py#L695)). </figcaption>\n",
"</figure>\n",
"\n",
"##### Self_Attn Layer\n",
"#### Self_Attn Layer\n",
"For simplicity in the block diagram illustration of the \"self_attn\" box, we omit the \"Grouped Query Attention\" operation and only showcase the modules which have associated weights.\n",
" \n",
"##### MLP Layer\n",
"#### MLP Layer\n",
"\n",
"SwiGLU is an activation defined as follows in the [modeling_llama.py](https://github.com/huggingface/transformers/blob/7c4995f93d8d24aae05e1e43279c96dce736e5c8/src/transformers/models/llama/modeling_llama.py#L236) file in the Hugging Face github repo:\n",
"```\n",
...
...
@@ -184,11 +204,11 @@
"<img src=\"media/swiglu.svg\">\n",
" <figcaption> Fig 5: A look inside the feedforward layer with <code>swiglu</code> activation function. </figcaption>\n",
"The baseline implementation will be run in `BF16` precision.\n",
"\n",
"</div>"
]
],
"id": "c9529229"
},
{
"cell_type": "markdown",
"id": "b38eb3ac",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
...
...
@@ -224,23 +244,12 @@
"If the utility doesn't work, comment this line `restart_jupyter_notebook()` in the following cell and manually restart the jupyter notebook before running the cell. Repeat the same for other sections in this tutorial.\n",
"\n",
"</div>\n"
]
],
"id": "b38eb3ac"
},
{
"cell_type": "code",
"execution_count": 1,
"id": "2e9d7a8c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10 finetuning steps complete!\n",
"Average time taken per step: 248 milliseconds\n"
]
}
],
"source": [
"# Restart the notebook (to flush the GPU memory)\n",
"In addition to basic layers like `Linear` and `LayerNorm`, Transformer Engine offers larger modules like `MultiheadAttention` (combines \"LayerNorm\" and \"Self Attention\") and `LayerNormMLP` (combines \"LayerNorm\" and \"MLP\") that could replace their counterparts in the `LlamaDecoderLayer` and potentially provide a speedup. Transformer Engine also offers a full `TransformerLayer` (which further combines `MultiheadAttention` and `LayerNormMLP` layers) which could replace `LlamaDecoderLayer` and provide a speedup (with careful mapping of the weights since the name of the weights are different for those two layers). Let's take a closer look at Transformer Engine's `TransformerLayer`. \n",
"\n",
"#### Transformer Engine's `TransformerLayer`\n",
"### Transformer Engine's `TransformerLayer`\n",
"\n",
"At a higher level, TE's `TransformerLayer` could be visualized as an apt replacement for the `LlamaDecoderLayer`. But the internals of the `TransformerLayer` are organized a bit differently. \n",
"\n",
...
...
@@ -327,7 +347,7 @@
" <figcaption> Fig 8: Abstract illustration of the SwiGLU implementation in Transformer Engine. </figcaption>\n",
"</figure>\n",
"\n",
"#### `TransformerLayer` options explained\n",
"### `TransformerLayer` options explained\n",
"\n",
"<div class=\"alert alert-info\">\n",
"\n",
...
...
@@ -404,7 +424,7 @@
"A major portion of the Hugging Face model implementation (32 `LlamaDecoderLayer` layers) could be potentially replaced with Transformer Engine's `TransformerLayer` layers. Let's see how it is made possible.\n",
"\n",
"\n",
"#### Mapping weights from HF's `LlamaDecoderLayer` to TE's `TransformerLayer`\n",
"### Mapping weights from HF's `LlamaDecoderLayer` to TE's `TransformerLayer`\n",
"\n",
"Refer the accompanying file `te_llama.py` which provides a reference to create a Llama 2 model with TE's `TransformerLayer` after replacing HF's `LlamaDecoderLayer`.\n",
"\n",
...
...
@@ -559,23 +579,12 @@
"\n",
"Let's first run this \"TELlama\" implementation in `BF16` precision.\n",
"</div>"
]
],
"id": "3db90dff"
},
{
"cell_type": "code",
"execution_count": 1,
"id": "bdb34b91",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10 finetuning steps complete!\n",
"Average time taken per step: 185 milliseconds\n"
]
}
],
"source": [
"# Restart the notebook (to flush the GPU memory)\n",
"Compared to the \"baseline\" implementation, we see that using Transformer Engine's `TransformerLayer` in place of Huggging Face's `LlamaDecoderLayer` gives a speedup of **34%** even when using only BF16 precision!\n",
"Now that most of the HF Llama model implementation (`LlamaDecoderLayer`s) has been swapped with Transformer Engine implementation (`TELlamaDecoderLayer` or `TransformerLayer`), let's see how finetuning in `FP8` precision helps improve performance.\n",
"\n",
"#### How to run the model in `FP8` precision\n",
"### How to run the model in `FP8` precision\n",
"\n",
"After the substitution, the model can be run in `FP8` precision by the following change over the previous BF16 runs. (For more information, refer the corresponding `wrap_with_accelerator` function in the accompanying `utils.py` file).\n",
"\n",
...
...
@@ -648,23 +668,12 @@
" kwargs_handlers=fp8_kwarg_handler\n",
")\n",
"```"
]
],
"id": "98cd8efb"
},
{
"cell_type": "code",
"execution_count": 1,
"id": "772c6f22",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10 finetuning steps complete!\n",
"Average time taken per step: 160 milliseconds\n"
]
}
],
"source": [
"# Restart the notebook (to flush the GPU memory)\n",
"| Models | Precision | Step Time (or ms per batch) | Speedup (over baseline) |\n",
...
...
@@ -715,7 +735,7 @@
"\n",
"After turning on FP8 precision, we get even more speedup of **55%** (with Llama 2 7B)!\n",
"\n",
"#### Llama 3 performance results\n",
"### Llama 3 performance results\n",
"Running the same tutorial with **Llama 3 8B** yields the following performance numbers:\n",
"\n",
"| Models | Precision | Step Time (or ms per batch) | Speedup (over baseline) |\n",
...
...
@@ -726,17 +746,18 @@
"\n",
"For Llama 3 8B, we get the most speedup of **46%** with FP8 precision!\n",
"\n"
]
],
"id": "e7cf9c3a"
},
{
"cell_type": "markdown",
"id": "95d6c42b",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"Using `TransformerLayer` module from Transformer Engine as a substitute for Hugging Face's `LlamaDecoderLayer` provides a speedup over Hugging Face's native Llama 2 and Llama 3 implementations. This needs careful initialization of the model such that the model weights (which are meant for `LlamaDecoderLayer`) are correctly mapped to their counterparts in TE's `TransformerLayer`. Even with `BF16` precision, `TransformerLayer` provides a speedup over the baseline implementation. With `FP8` precision, the speed up is even more pronounced!"
Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
FP8 Current Scaling
===================================
The FP8 current scaling recipe is the simplest low-precision recipe provided by Transformer Engine.
To understand how this recipe works, we first need to examine what the FP8 data type is and how it differs from other floating point formats.
FP8 data type
-------------
The FP8 datatype, introduced in the Hopper architecture, is actually two distinct datatypes, useful in different parts of the training of neural networks:
* E4M3 -- consists of 1 sign bit, 4 exponent bits and 3 bits of mantissa. It can store values up to +/-448 and ``nan``.
* E5M2 -- consists of 1 sign bit, 5 exponent bits and 2 bits of mantissa. It can store values up to +/-57344, +/- ``inf`` and ``nan``. The tradeoff of the increased dynamic range is lower precision of the stored values.
.. raw:: html
:file: img/fp8_formats.svg
*Figure 1: Structure of the floating point datatypes. All of the values shown (in FP16, BF16, FP8 E4M3 and FP8 E5M2) are the closest representations of the value 0.3952.*
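As a quick sanity check of these ranges, the maximum representable values can be inspected directly in frameworks that expose the FP8 dtypes (a minimal sketch, assuming a recent PyTorch build with ``torch.float8_e4m3fn`` and ``torch.float8_e5m2``):

.. code-block:: python

    import torch

    # Largest finite values representable in each FP8 format
    print(torch.finfo(torch.float8_e4m3fn).max)  # 448.0
    print(torch.finfo(torch.float8_e5m2).max)    # 57344.0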
**E4M3 and E5M2 usage in training**
By default, Transformer Engine uses a hybrid approach:
* *Forward pass* - activations and weights require more precision, so the E4M3 datatype is used to store them.
* *Backward pass* - gradients are less susceptible to precision loss but require a higher dynamic range, so the E5M2 datatype is preferred.
The user can configure this behavior via the ``fp8_format`` parameter of the recipe.
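For example, a minimal sketch of selecting the format (assuming the ``Float8CurrentScaling`` recipe class and the ``fp8_autocast`` context manager from recent Transformer Engine releases, with a hypothetical ``model`` and input ``inp``):

.. code-block:: python

    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import Format, Float8CurrentScaling

    # HYBRID uses E4M3 in the forward pass and E5M2 in the backward pass
    recipe = Float8CurrentScaling(fp8_format=Format.HYBRID)

    with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
        out = model(inp)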
Scaling factors
---------------
The limited dynamic range of the FP8 datatype is insufficient for many tensors.
To address this, the values in a tensor are scaled. The FP8 Current Scaling recipe uses one **FP32** scaling factor per tensor. The representation of a tensor element ``x`` in FP8 precision is given by:
.. code-block:: python
x = x_fp8 * s
where
* ``x_fp8`` is the FP8 value (E4M3 or E5M2),
* ``s`` is a global **FP32** scaling factor applied to the entire tensor.
**FP8 Current Scaling quantization**
Let's take a closer look at how quantization to FP8 with scaling factor is implemented in
the FP8 Current Scaling recipe.
.. raw:: html
:file: img/fp8_scaling_concept.svg
*Figure 3: Quantization to FP8 consists of amax (absolute maximum) computation, scaling to fit the FP8 range and casting to the respective FP8 format.*
Quantization to FP8 consists of 3 steps:
1. Computation of the absolute maximum value of the tensor - we refer to it as ``amax``.
2. Applying the scaling factor of ``fp8_max / amax`` to the tensor, to fit it into the FP8 range.
3. Casting into the respective FP8 format using *Round To Nearest Even (RTNE)*: values are rounded to the nearest representable FP8 value, and a value exactly halfway between two representable values is rounded to the one with an even mantissa to minimize systematic bias.
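As an illustration, the three steps can be sketched in plain PyTorch (a conceptual example rather than Transformer Engine's fused implementation; it assumes a PyTorch build that exposes ``torch.float8_e4m3fn``):

.. code-block:: python

    import torch

    def quantize_current_scaling(x: torch.Tensor, fp8_max: float = 448.0):
        # Step 1: compute the absolute maximum of the tensor (amax),
        # guarding against an all-zero tensor
        amax = x.abs().max().clamp_min(1e-12)
        # Step 2: compute the scaling factor that maps amax onto the FP8 maximum
        scale = fp8_max / amax
        # Step 3: scale and cast to FP8 (E4M3 here); the cast rounds to nearest even
        x_fp8 = (x * scale).to(torch.float8_e4m3fn)
        # The dequantized representation is x_fp8 * s with s = 1 / scale
        return x_fp8, 1.0 / scale

    x = torch.randn(1024, 1024)
    x_fp8, s = quantize_current_scaling(x)
    x_approx = x_fp8.float() * s  # approximately recovers x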
**Performance analysis**
Quantization is a memory-bound operation that requires reading the tensor twice:
* First read: compute ``amax`` across all elements.
* Second read: apply the scaling factor and cast to FP8.
This is a significant overhead compared to other recipes, which typically require only a single memory read.
.. raw:: html
:file: img/fp8_cast_process.svg
*Figure 4: FP8 quantization with current scaling recipe - two tensor reads are needed, one to compute amax and one to apply the scaling factor and cast to FP8.*
Transpose handling
------------------
*Ada and Hopper*
On Ada and Hopper, the backward pass requires a transposed FP8 tensor.
The columnwise layout is physically different from the rowwise layout, so a transpose operation is needed.
All 3 options from :ref:`Performance Considerations Transpose handling section <handling_transposes>` are supported.
*Blackwell and later*
Blackwell hardware supports multiple GEMM layouts natively, eliminating the need for explicit transposes.
The rowwise and columnwise tensors share the same physical memory layout.