Commit 996ea169 authored by Przemek Tredak

Initial code drop

Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.
import os
from datetime import date

import sphinx_rtd_theme
te_path = os.path.dirname(os.path.realpath(__file__))

with open(te_path + "/../VERSION", "r") as f:
    te_version = f.readline().strip()

release_year = 2022
current_year = date.today().year
if current_year == release_year:
    copyright_year = release_year
else:
    copyright_year = str(release_year) + "-" + str(current_year)
project = u'Transformer Engine'
copyright = u'{}, NVIDIA CORPORATION & AFFILIATES. All rights reserved.'.format(copyright_year)
author = u'NVIDIA CORPORATION & AFFILIATES'
version = te_version
release = version
# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.mathjax',
    'sphinx.ext.napoleon',
    'nbsphinx',
    'breathe',
]
templates_path = ['_templates']
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
source_suffix = '.rst'
master_doc = 'index'
pygments_style = 'sphinx'
# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
html_theme = 'sphinx_rtd_theme'
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
html_static_path = ['_static']
html_theme_options = {
    'display_version': True,
    'collapse_navigation': False,
    'logo_only': False,
}

napoleon_custom_sections = [('Parallelism parameters', 'params_style'),
                            ('Optimization parameters', 'params_style'),
                            ('Values', 'params_style')]
breathe_projects = {"TransformerEngine": os.path.abspath("doxygen/xml/")}
breathe_default_project = "TransformerEngine"
{
"cells": [
{
"cell_type": "markdown",
"id": "7b3e6954",
"metadata": {},
"source": [
"# Using FP8 with Transformer Engine\n",
"\n",
"H100 GPU introduced support for a new datatype, FP8 (8-bit floating point), enabling higher throughput of matrix multiplies and convolutions. In this example we will introduce the FP8 datatype and show how to use it with Transformer Engine.\n",
"\n",
"## Introduction to FP8\n",
"\n",
"### Structure\n",
"\n",
"The FP8 datatype supported by H100 is actually 2 distinct datatypes, useful in different parts of the training of neural networks:\n",
"\n",
"* E4M3 - it consists of 1 sign bit, 4 exponent bits and 3 bits of mantissa. It can store values up to +/-448 and `nan`.\n",
"* E5M2 - it consists of 1 sign bit, 5 exponent bits and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf` and `nan`. The tradeoff of the increased dynamic range is lower precision of the stored values.\n",
"\n",
"<figure align=\"center\">\n",
"<img src=\"fp8_formats.png\" width=\"60%\">\n",
"<figcaption> Figure 1: Structure of the floating point datatypes. All of the values shown (in FP16, BF16, FP8 E4M3 and FP8 E5M2) are the closest representations of value 0.3952.</figcaption>\n",
"</figure>\n",
"\n",
"During training neural networks both of these types may be utilized. Typically forward activations and weights require more precision, so E4M3 datatype is best used during forward pass. In the backward pass, however, gradients flowing through the network typically are less susceptible to the loss of precision, but require higher dynamic range. Therefore they are best stored using E5M2 data format. H100 TensorCores provide support for any combination of these types as the inputs, enabling us to store each tensor using its preferred precision.\n",
"\n",
"### Mixed precision training - a quick introduction\n",
"\n",
"In order to understand how FP8 can be used for training Deep Learning models, it is useful to first remind ourselves how mixed precision works with other datatypes, especially FP16.\n",
"\n",
"Mixed precision recipe for FP16 training has 2 components: choosing which operations should be performed in FP16 and dynamic loss scaling.\n",
"\n",
"* Choosing the operations to be performed in FP16 precision requires analysis of the numerical behavior of the outputs with respect to inputs of the operation as well as the expected performance benefit. This enables marking operations like matrix multiplies, convolutions and normalization layers as safe, while leaving `norm` or `exp` operations as requiring high precision.\n",
"* Dynamic loss scaling enables avoiding both over- and underflows of the gradients during training. Those may happen since, while the dynamic range of FP16 is enough to store the distribution of the gradient values, this distribution may be centered around values too high or too low for FP16 to handle. Scaling the loss shifts those distributions (without affecting numerics by using only powers of 2) into the range representable in FP16. \n",
"\n",
"<figure align=\"center\">\n",
"<img src=\"loss_scaling.png\" width=\"50%\">\n",
"<figcaption> Figure 2: Scaling the loss enables shifting the gradient distribution into the representable range of FP16 datatype. </figcaption>\n",
"</figure>\n",
"\n",
"### Mixed precision training with FP8\n",
"\n",
"While the dynamic range provided by the FP8 types is sufficient to store any particular activation or gradient, it is not sufficient for all of them at the same time. This makes the single loss scaling factor strategy, which worked for FP16, infeasible for FP8 training and instead requires using distinct scaling factors for each FP8 tensor.\n",
"\n",
"There are multiple strategies for choosing a scaling factor that is appropriate for a given FP8 tensor:\n",
"\n",
"* just-in-time scaling. This strategy chooses the scaling factor based on the maximum of absolute values (amax) of the tensor being produced. In practice it is infeasible, as it requires multiple passes through data - the operator produces and writes out the output in higher precision, then the maximum absolute value of the output is found and applied to all values in order to obtain the final FP8 output. This results in a lot of overhead, severely diminishing gains from using FP8.\n",
"* delayed scaling. This strategy chooses the scaling factor based on the maximums of absolute values seen in some number of previous iterations. This enables full performance of FP8 computation, but requires storing the history of maximums as additional parameters of the FP8 operators. \n",
"\n",
"<figure align=\"center\">\n",
"<img src=\"delayed_scaling.png\" width=\"80%\">\n",
"<figcaption> Figure 3: Delayed scaling strategy. The FP8 operator uses scaling factor obtained using the history of amaxes (maximums of absolute values) seen in some number of previous iterations and produces both the FP8 output and the current amax, which gets stored in the history.</figcaption>\n",
"</figure>\n",
"\n",
"As one can see in Figure 3, delayed scaling strategy requires both storing the history of amaxes, but also choosing a recipe for converting that history into the scaling factor used in the next iteration."
]
},
{
"cell_type": "markdown",
"id": "cf5e0b0d",
"metadata": {},
"source": [
"## Using FP8 with Transformer Engine\n",
"\n",
"Transformer Engine library provides tools enabling easy to use training with FP8 datatype using delayed scaling strategy.\n",
"\n",
"### FP8 recipe\n",
"\n",
"[DelayedScaling](../api/common.rst#transformer_engine.common.recipe.DelayedScaling) recipe from `transformer_engine.common.recipe` module stores all of the required options for FP8 training - length of the amax history to use for scaling factor computation, FP8 data format etc."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "0c8fd0ef",
"metadata": {},
"outputs": [],
"source": [
"from transformer_engine.common.recipe import Format, DelayedScaling\n",
"\n",
"fp8_format = Format.HYBRID # E4M3 during forward pass, E5M2 during backward pass\n",
"fp8_recipe = DelayedScaling(fp8_format=fp8_format, amax_history_len=16, amax_compute_algo=\"max\")"
]
},
{
"cell_type": "markdown",
"id": "f9591eb5",
"metadata": {},
"source": [
"This recipe is then used to configure the FP8 training."
]
},
{
"cell_type": "markdown",
"id": "734d3934",
"metadata": {},
"source": [
"### FP8 autocasting\n",
"\n",
"Not every operation is safe to be performed using FP8. All of the modules provided by Transformer Engine library were designed to provide maximum performance benefit from FP8 datatype while maintaining accuracy. In order to enable FP8 operations, TE modules need to be wrapped inside the [fp8_autocast](../api/pytorch.rst#transformer_engine.pytorch.fp8_autocast) context manager."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "f8b1ff7f",
"metadata": {},
"outputs": [],
"source": [
"import transformer_engine.pytorch as te\n",
"import torch\n",
"\n",
"torch.manual_seed(12345)\n",
"\n",
"my_linear = te.Linear(768, 768, bias=True)\n",
"\n",
"inp = torch.rand((1024, 768)).cuda()\n",
"\n",
"with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):\n",
" out_fp8 = my_linear(inp)"
]
},
{
"cell_type": "markdown",
"id": "e41161f1",
"metadata": {},
"source": [
"The `fp8_autocast` context manager hides the complexity of handling FP8:\n",
"\n",
"- All FP8-safe operations have their inputs cast to FP8\n",
"- Amax history is updated\n",
"- New scaling factors are computed and ready for the next iteration\n",
"\n",
"<div class=\"alert alert-info\">\n",
"\n",
"<b>Note</b>\n",
"\n",
"Support for FP8 in the Linear layer of Transformer Engine is currently limited to tensors with shapes where both dimensions are divisible by 16. In terms of the input to the full Transformer network, this typically requires padding sequence length to be multiple of 16.\n",
"\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "f7bb2de9",
"metadata": {},
"source": [
"### Handling backward pass\n",
"\n",
"When a model is run inside the `fp8_autocast` region, especially in multi-GPU training, some communication is required in order to synchronize the scaling factors and amax history. In order to perform that communication without introducing much overhead, `fp8_autocast` context manager aggregates the tensors before performing the communication.\n",
"\n",
"Due to this aggregation the backward call needs to happen outside of the `fp8_autocast` context manager. It has no impact on the computation precision - the precision of the backward pass is determined by the precision of the forward pass."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "e012bc8d",
"metadata": {},
"outputs": [],
"source": [
"loss_fp8 = out_fp8.mean()\n",
"\n",
"loss_fp8.backward() # This backward pass uses FP8, since out_fp8 was calculated inside fp8_autocast\n",
"\n",
"out_fp32 = my_linear(inp)\n",
"loss_fp32 = out_fp32.mean()\n",
"loss_fp32.backward() # This backward pass does not use FP8, since out_fp32 was calculated outside fp8_autocast"
]
},
{
"cell_type": "markdown",
"id": "1a6723ca",
"metadata": {},
"source": [
"### Precision\n",
"\n",
"If we compare the results of the FP32 and FP8 execution, we will see that they are relatively close, but different:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "41e9a37b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[ 0.2276, 0.2627, 0.3001, ..., 0.0346, 0.2211, 0.1188],\n",
" [-0.0963, -0.3725, 0.1717, ..., 0.0901, 0.0522, -0.3472],\n",
" [ 0.4526, 0.3482, 0.5976, ..., -0.0687, -0.0382, 0.1566],\n",
" ...,\n",
" [ 0.1698, 0.6061, 0.0385, ..., -0.2875, -0.1152, -0.0260],\n",
" [ 0.0679, 0.2946, 0.2751, ..., -0.2284, 0.0517, -0.1441],\n",
" [ 0.1865, 0.2353, 0.9172, ..., 0.1085, 0.1135, 0.1438]],\n",
" device='cuda:0', grad_fn=<_LinearBackward>)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"out_fp8"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "b328ae0e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[ 0.2373, 0.2674, 0.2980, ..., 0.0233, 0.2498, 0.1131],\n",
" [-0.0767, -0.3778, 0.1862, ..., 0.0858, 0.0676, -0.3369],\n",
" [ 0.4615, 0.3593, 0.5813, ..., -0.0779, -0.0349, 0.1422],\n",
" ...,\n",
" [ 0.1914, 0.6038, 0.0382, ..., -0.2847, -0.0991, -0.0423],\n",
" [ 0.0864, 0.2895, 0.2719, ..., -0.2388, 0.0772, -0.1541],\n",
" [ 0.2019, 0.2275, 0.9027, ..., 0.1022, 0.1300, 0.1444]],\n",
" device='cuda:0', grad_fn=<_LinearBackward>)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"out_fp32"
]
},
{
"cell_type": "markdown",
"id": "a9413c0a",
"metadata": {},
"source": [
"That happens because in the FP8 case both the input and weights are cast to FP8 before the computation. We can see this if instead of the original inputs we use the inputs representable in FP8 (using a function defined in [quickstart_utils.py](quickstart_utils.py)):"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "ea939581",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensor([[ 0.2276, 0.2629, 0.3000, ..., 0.0346, 0.2211, 0.1188],\n",
" [-0.0963, -0.3724, 0.1717, ..., 0.0901, 0.0522, -0.3470],\n",
" [ 0.4526, 0.3479, 0.5976, ..., -0.0686, -0.0382, 0.1566],\n",
" ...,\n",
" [ 0.1698, 0.6062, 0.0385, ..., -0.2876, -0.1152, -0.0260],\n",
" [ 0.0679, 0.2947, 0.2750, ..., -0.2284, 0.0516, -0.1441],\n",
" [ 0.1865, 0.2353, 0.9170, ..., 0.1085, 0.1135, 0.1438]],\n",
" device='cuda:0', grad_fn=<_LinearBackward>)\n"
]
}
],
"source": [
"from quickstart_utils import cast_to_representable\n",
"\n",
"inp_representable = cast_to_representable(inp)\n",
"my_linear.weight.data = cast_to_representable(my_linear.weight.data)\n",
"\n",
"out_fp32_representable = my_linear(inp_representable)\n",
"\n",
"print(out_fp32_representable)"
]
},
{
"cell_type": "markdown",
"id": "03e703bd",
"metadata": {},
"source": [
"This time the difference is really small:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "78f1c2eb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[ 4.9591e-05, -1.9073e-04, 9.5367e-05, ..., -3.8147e-06,\n",
" 4.1962e-05, 2.2888e-05],\n",
" [ 2.2888e-05, -3.4332e-05, 2.2888e-05, ..., 2.6703e-05,\n",
" 5.3406e-05, -1.4114e-04],\n",
" [-3.8147e-05, 2.6703e-04, -3.8147e-06, ..., -5.7220e-05,\n",
" 4.1962e-05, -1.9073e-05],\n",
" ...,\n",
" [ 1.1444e-05, -7.2479e-05, -3.8147e-06, ..., 5.3406e-05,\n",
" -1.5259e-05, 2.2888e-05],\n",
" [ 4.9591e-05, -9.5367e-05, 6.8665e-05, ..., -1.5259e-05,\n",
" 7.6294e-05, 4.5776e-05],\n",
" [-1.5259e-05, -7.6294e-06, 1.8692e-04, ..., -3.0518e-05,\n",
" -4.5776e-05, 7.6294e-06]], device='cuda:0', grad_fn=<SubBackward0>)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"out_fp8 - out_fp32_representable"
]
},
{
"cell_type": "markdown",
"id": "63ff9b8c",
"metadata": {},
"source": [
"The differences in result coming from FP8 execution do not matter during the training process, but it is good to understand them, e.g. during debugging the model."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "markdown",
"id": "da9fd6a8",
"metadata": {},
"source": [
"# Getting Started\n",
"\n",
"## Overview\n",
"\n",
"Transformer Engine (TE) is a library for accelerating Transformer models on NVIDIA GPUs, providing better performance with lower memory utilization in both training and inference. It provides support for 8-bit floating point (FP8) precision on Hopper GPUs, implements a collection of highly optimized building blocks for popular Transformer architectures, and exposes an automatic-mixed-precision-like API that can be used seamlessy with your PyTorch code. It also includes a framework-agnostic C++ API that can be integrated with other deep learning libraries to enable FP8 support for Transformers.\n",
"\n",
"## Let's build a Transformer layer!\n",
"\n",
"<div class=\"alert alert-info\">\n",
"\n",
"<b>Summary</b>\n",
" \n",
"We build a basic Transformer layer using regular PyTorch modules. This will be our baseline for later comparisons with Transformer Engine.\n",
"\n",
"</div>\n",
"\n",
"Let's start with creating a GPT encoder layer using plain PyTorch. Figure 1 shows the overall structure.\n",
"\n",
"<figure align=\"center\">\n",
"<img src=\"transformer_layer.png\" width=\"20%\">\n",
"<figcaption> Figure 1: Structure of a GPT encoder layer.</figcaption>\n",
"</figure>\n",
"\n",
"We construct the components as follows:\n",
"\n",
"- `LayerNorm`: `torch.nn.LayerNorm`\n",
"- `QKV Projection`: `torch.nn.Linear` (conceptually three `Linear` layers for Q, K, and V separately, but we fuse into a single `Linear` layer that is three times larger)\n",
"- `DotProductAttention`: `DotProductAttention` from [quickstart_utils.py](quickstart_utils.py)\n",
"- `Projection`: `torch.nn.Linear`\n",
"- `Dropout`: `torch.nn.Dropout`\n",
"- `MLP`: `BasicMLP` from [quickstart_utils.py](quickstart_utils.py)\n",
"\n",
"Over the course of this tutorial we will use a few modules and helper functions defined in [quickstart_utils.py](quickstart_utils.py). Putting it all together:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "2be43d64",
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import quickstart_utils as utils\n",
"\n",
"class BasicTransformerLayer(torch.nn.Module):\n",
" def __init__(\n",
" self,\n",
" hidden_size: int,\n",
" ffn_hidden_size: int,\n",
" num_attention_heads: int,\n",
" layernorm_eps: int = 1e-5,\n",
" attention_dropout: float = 0.1,\n",
" hidden_dropout: float = 0.1,\n",
" ):\n",
" super().__init__()\n",
" self.num_attention_heads = num_attention_heads\n",
" self.kv_channels = hidden_size // num_attention_heads\n",
" self.ln1 = torch.nn.LayerNorm(hidden_size, eps=layernorm_eps)\n",
" self.qkv_projection = torch.nn.Linear(hidden_size, 3 * hidden_size, bias=True)\n",
" self.attention = utils.DotProductAttention(\n",
" num_attention_heads=num_attention_heads,\n",
" kv_channels=self.kv_channels,\n",
" attention_dropout=attention_dropout,\n",
" )\n",
" self.projection = torch.nn.Linear(hidden_size, hidden_size, bias=True)\n",
" self.dropout = torch.nn.Dropout(hidden_dropout)\n",
" self.ln2 = torch.nn.LayerNorm(hidden_size, eps=layernorm_eps)\n",
" self.mlp = utils.BasicMLP(\n",
" hidden_size=hidden_size,\n",
" ffn_hidden_size=ffn_hidden_size,\n",
" ) \n",
" \n",
" def forward(\n",
" self, \n",
" x: torch.Tensor, \n",
" attention_mask: torch.Tensor\n",
" ) -> torch.Tensor:\n",
" res = x\n",
" x = self.ln1(x)\n",
" \n",
" # Fused QKV projection\n",
" qkv = self.qkv_projection(x)\n",
" qkv = qkv.view(qkv.size(0), qkv.size(1), self.num_attention_heads, 3 * self.kv_channels)\n",
" q, k, v = torch.split(qkv, qkv.size(3) // 3, dim=3)\n",
" \n",
" x = self.attention(q, k, v, attention_mask)\n",
" x = self.projection(x)\n",
" x = self.dropout(x)\n",
" x = res + x\n",
" res = x\n",
" x = self.ln2(x)\n",
" x = self.mlp(x)\n",
" \n",
" return x + res"
]
},
{
"cell_type": "markdown",
"id": "40724d1d",
"metadata": {},
"source": [
"That's it! We now have a simple Transformer layer. We can test it:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a786f0ea",
"metadata": {},
"outputs": [],
"source": [
"# Layer configuration\n",
"hidden_size = 4096\n",
"sequence_length = 2048\n",
"batch_size = 4\n",
"ffn_hidden_size = 16384\n",
"num_attention_heads = 32\n",
"dtype = torch.float16\n",
"\n",
"# Synthetic data\n",
"x = torch.rand(sequence_length, batch_size, hidden_size).cuda().to(dtype=dtype)\n",
"dy = torch.rand(sequence_length, batch_size, hidden_size).cuda().to(dtype=dtype)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "ffdbfb7a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"BasicTransformerLayer(\n",
" (ln1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)\n",
" (qkv_projection): Linear(in_features=4096, out_features=12288, bias=True)\n",
" (attention): DotProductAttention(\n",
" (dropout): Dropout(p=0.1, inplace=False)\n",
" )\n",
" (projection): Linear(in_features=4096, out_features=4096, bias=True)\n",
" (dropout): Dropout(p=0.1, inplace=False)\n",
" (ln2): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)\n",
" (mlp): BasicMLP(\n",
" (linear1): Linear(in_features=4096, out_features=16384, bias=True)\n",
" (linear2): Linear(in_features=16384, out_features=4096, bias=True)\n",
" )\n",
")"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"basic_transformer = BasicTransformerLayer(\n",
" hidden_size, \n",
" ffn_hidden_size, \n",
" num_attention_heads\n",
")\n",
"basic_transformer.to(dtype=dtype).cuda()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "0162ad40",
"metadata": {},
"outputs": [],
"source": [
"torch.manual_seed(1234)\n",
"y = basic_transformer(x, attention_mask=None)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "65ae6dd6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean time: 41.4469287109375 ms\n"
]
}
],
"source": [
"utils.speedometer(basic_transformer, x, dy)"
]
},
{
"cell_type": "markdown",
"id": "43717e36",
"metadata": {},
"source": [
"## Meet Transformer Engine\n",
"\n",
"<div class=\"alert alert-info\">\n",
"\n",
"<b>Summary</b>\n",
" \n",
"We modify the example Transformer layer to include the simplest TE modules: `Linear` and `LayerNorm`.\n",
"\n",
"</div>\n",
"\n",
"Now that we have a basic Transformer layer, let's use Transformer Engine to speed up the training. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "004d3c92",
"metadata": {},
"outputs": [],
"source": [
"import transformer_engine.pytorch as te"
]
},
{
"cell_type": "markdown",
"id": "1931f911",
"metadata": {},
"source": [
"TE provides a set of PyTorch modules that can be used to build Transformer layers. The simplest of the provided modules are the `Linear` and `LayerNorm` layers, which we can use instead of `torch.nn.Linear` and `torch.nn.LayerNorm`. Let's modify `BasicTransformerLayer`:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "1f44db50",
"metadata": {},
"outputs": [],
"source": [
"class BasicTEMLP(torch.nn.Module):\n",
" def __init__(self,\n",
" hidden_size: int,\n",
" ffn_hidden_size: int) -> None:\n",
" super().__init__()\n",
" self.linear1 = te.Linear(hidden_size, ffn_hidden_size, bias=True)\n",
" self.linear2 = te.Linear(ffn_hidden_size, hidden_size, bias=True)\n",
"\n",
" def forward(self, x):\n",
" x = self.linear1(x)\n",
" x = torch.nn.functional.gelu(x, approximate='tanh')\n",
" x = self.linear2(x)\n",
" return x \n",
" \n",
"class BasicTETransformerLayer(torch.nn.Module):\n",
" def __init__(self,\n",
" hidden_size: int,\n",
" ffn_hidden_size: int,\n",
" num_attention_heads: int,\n",
" layernorm_eps: int = 1e-5,\n",
" attention_dropout: float = 0.1,\n",
" hidden_dropout: float = 0.1):\n",
" super().__init__()\n",
" self.num_attention_heads = num_attention_heads\n",
" self.kv_channels = hidden_size // num_attention_heads\n",
" self.ln1 = te.LayerNorm(hidden_size, eps=layernorm_eps)\n",
" self.qkv_projection = te.Linear(hidden_size, 3 * hidden_size, bias=True)\n",
" self.attention = utils.DotProductAttention(\n",
" num_attention_heads=num_attention_heads,\n",
" kv_channels=self.kv_channels,\n",
" attention_dropout=attention_dropout,\n",
" )\n",
" self.projection = te.Linear(hidden_size, hidden_size, bias=True)\n",
" self.dropout = torch.nn.Dropout(hidden_dropout)\n",
" self.ln2 = te.LayerNorm(hidden_size, eps=layernorm_eps)\n",
" self.mlp = BasicTEMLP(\n",
" hidden_size=hidden_size,\n",
" ffn_hidden_size=ffn_hidden_size,\n",
" )\n",
" \n",
" def forward(self, \n",
" x: torch.Tensor, \n",
" attention_mask: torch.Tensor):\n",
" res = x\n",
" x = self.ln1(x)\n",
" \n",
" # Fused QKV projection\n",
" qkv = self.qkv_projection(x)\n",
" qkv = qkv.view(qkv.size(0), qkv.size(1), self.num_attention_heads, 3 * self.kv_channels)\n",
" q, k, v = torch.split(qkv, qkv.size(3) // 3, dim=3)\n",
" \n",
" x = self.attention(q, k, v, attention_mask)\n",
" x = self.projection(x)\n",
" x = self.dropout(x)\n",
" x = res + x\n",
" res = x\n",
" x = self.ln2(x)\n",
" x = self.mlp(x)\n",
" \n",
" return x + res"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "916531e8",
"metadata": {},
"outputs": [],
"source": [
"basic_te_transformer = BasicTETransformerLayer(\n",
" hidden_size, \n",
" ffn_hidden_size, \n",
" num_attention_heads,\n",
")\n",
"basic_te_transformer.to(dtype=dtype).cuda()\n",
"utils.share_parameters_with_basic_te_model(basic_te_transformer, basic_transformer)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "3643fa54",
"metadata": {},
"outputs": [],
"source": [
"torch.manual_seed(1234)\n",
"y = basic_te_transformer(x, attention_mask=None)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "10b92894",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean time: 41.3155712890625 ms\n"
]
}
],
"source": [
"utils.speedometer(basic_te_transformer, x, dy)"
]
},
{
"cell_type": "markdown",
"id": "3f990226",
"metadata": {},
"source": [
"## Fused TE Modules\n",
"\n",
"<div class=\"alert alert-info\">\n",
"\n",
"<b>Summary</b>\n",
" \n",
"We optimize the example Transformer layer with TE modules for fused operations.\n",
"\n",
"</div>\n",
"\n",
"The `Linear` layer is enough to build any Transformer model and it enables usage of Transformer Engine even for very custom Transformers. However, having more knowledge about the model allows for additional optimizations like kernel fusion, increasing the achievable speedup.\n",
"\n",
"Transformer Engine therefore provides coarser modules that span multiple layers:\n",
"\n",
"* `LayerNormLinear`\n",
"* `LayerNormMLP`\n",
"* `TransformerLayer`\n",
"\n",
"Building a third iteration of our Transformer layer with `LayerNormLinear` and `LayerNormMLP`:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "c55eae1f",
"metadata": {},
"outputs": [],
"source": [
"class FusedTETransformerLayer(torch.nn.Module):\n",
" def __init__(self,\n",
" hidden_size: int,\n",
" ffn_hidden_size: int,\n",
" num_attention_heads: int,\n",
" layernorm_eps: int = 1e-5,\n",
" attention_dropout: float = 0.1,\n",
" hidden_dropout: float = 0.1):\n",
" super().__init__()\n",
" self.num_attention_heads = num_attention_heads\n",
" self.kv_channels = hidden_size // num_attention_heads\n",
" self.ln_qkv = te.LayerNormLinear(hidden_size, 3 * hidden_size, eps=layernorm_eps, bias=True)\n",
" self.attention = utils.DotProductAttention(\n",
" num_attention_heads=num_attention_heads,\n",
" kv_channels=self.kv_channels,\n",
" attention_dropout=attention_dropout,\n",
" )\n",
" self.projection = te.Linear(hidden_size, hidden_size, bias=True)\n",
" self.dropout = torch.nn.Dropout(hidden_dropout)\n",
" self.ln_mlp = te.LayerNormMLP(hidden_size, ffn_hidden_size, eps=layernorm_eps, bias=True)\n",
" \n",
" \n",
" def forward(self, \n",
" x: torch.Tensor, \n",
" attention_mask: torch.Tensor):\n",
" res = x\n",
" qkv = self.ln_qkv(x)\n",
" \n",
" # Split qkv into query, key and value\n",
" qkv = qkv.view(qkv.size(0), qkv.size(1), self.num_attention_heads, 3 * self.kv_channels)\n",
" q, k, v = torch.split(qkv, qkv.size(3) // 3, dim=3)\n",
" \n",
" x = self.attention(q, k, v, attention_mask)\n",
" x = self.projection(x)\n",
" x = self.dropout(x)\n",
" x = res + x\n",
" res = x\n",
" x = self.ln_mlp(x)\n",
" \n",
" return x + res"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "85949421",
"metadata": {},
"outputs": [],
"source": [
"fused_te_transformer = FusedTETransformerLayer(hidden_size, ffn_hidden_size, num_attention_heads)\n",
"fused_te_transformer.to(dtype=dtype).cuda()\n",
"utils.share_parameters_with_fused_te_model(fused_te_transformer, basic_transformer)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "2c263e71",
"metadata": {},
"outputs": [],
"source": [
"torch.manual_seed(1234)\n",
"y = fused_te_transformer(x, attention_mask=None)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "24e101bc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean time: 41.5097509765625 ms\n"
]
}
],
"source": [
"utils.speedometer(fused_te_transformer, x, dy)"
]
},
{
"cell_type": "markdown",
"id": "33f13c26",
"metadata": {},
"source": [
"Finally, the `TransformerLayer` module is convenient for creating standard Transformer architectures and it provides the highest degree of performance optimization:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "ec8c3685",
"metadata": {},
"outputs": [],
"source": [
"te_transformer = te.TransformerLayer(hidden_size, ffn_hidden_size, num_attention_heads)\n",
"te_transformer.to(dtype=dtype).cuda()\n",
"utils.share_parameters_with_transformerlayer_te_model(te_transformer, basic_transformer)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "e48cd590",
"metadata": {},
"outputs": [],
"source": [
"torch.manual_seed(1234)\n",
"y = te_transformer(x, attention_mask=None)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "3ec3707d-e63f-4899-8308-b11c55b5caa4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean time: 38.391796875 ms\n"
]
}
],
"source": [
"utils.speedometer(te_transformer, x, dy)"
]
},
{
"cell_type": "markdown",
"id": "4034c3eb-8958-49f2-85f6-30c94977d884",
"metadata": {},
"source": [
"## Enabling FP8\n",
"\n",
"<div class=\"alert alert-info\">\n",
"\n",
"<b>Summary</b>\n",
" \n",
"We configure a TE module to perform compute in FP8.\n",
"\n",
"</div>\n",
"\n",
"Enabling FP8 support is very simple in Transformer Engine. We just need to wrap the modules within an [fp8_autocast](../api/pytorch.rst#transformer_engine.pytorch.fp8_autocast) context manager. See the [FP8 tutorial](fp8_primer.ipynb) for a detailed explanation of FP8 recipes and the supported options."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "31256aa7-3d5e-425c-91ab-502b1326a748",
"metadata": {},
"outputs": [],
"source": [
"from transformer_engine.common.recipe import Format, DelayedScaling\n",
"\n",
"te_transformer = te.TransformerLayer(hidden_size, ffn_hidden_size, num_attention_heads)\n",
"te_transformer.to(dtype=dtype).cuda()\n",
"utils.share_parameters_with_transformerlayer_te_model(te_transformer, basic_transformer)\n",
"\n",
"fp8_format = Format.HYBRID\n",
"fp8_recipe = DelayedScaling(fp8_format=fp8_format, amax_history_len=16, amax_compute_algo=\"max\")\n",
"torch.manual_seed(1234)\n",
"with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):\n",
" y = te_transformer(x, attention_mask=None)\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "793ebd2d-b84b-47bc-811a-7991df8500aa",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean time: 27.991220703125 ms\n"
]
}
],
"source": [
"with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):\n",
" utils.speedometer(te_transformer, x, dy)"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.
import math
from typing import Callable, Optional
import torch
def speedometer(
    module: torch.nn.Module,
    input: torch.Tensor,
    output_grad: torch.Tensor,
    timing_iters: int = 50,
    warmup_iters: int = 50,
) -> None:
    """Measure average run time for a PyTorch module

    Performs forward and backward passes.
    """
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # Warmup runs
    torch.cuda.synchronize()
    for _ in range(warmup_iters):
        output = module(input, attention_mask=None)
        output.backward(output_grad)

    # Timing runs
    start.record()
    for _ in range(timing_iters):
        output = module(input, attention_mask=None)
        output.backward(output_grad)
    end.record()
    torch.cuda.synchronize()

    print(f"Mean time: {start.elapsed_time(end)/timing_iters} ms")
class DotProductAttention(torch.nn.Module):
    """Attention operation in Transformer layer

    Built with plain PyTorch modules.
    """
    def __init__(
        self,
        num_attention_heads: int,
        kv_channels: int,
        attention_dropout: float,
    ) -> None:
        super().__init__()
        self.projection_size = kv_channels * num_attention_heads
        self.hidden_size_per_attention_head = kv_channels
        self.norm_factor = math.sqrt(self.hidden_size_per_attention_head)
        self.dropout = torch.nn.Dropout(attention_dropout)

    def masked_softmax(
        self,
        inp: torch.Tensor,
        mask: Optional[torch.Tensor]
    ) -> torch.Tensor:
        if mask is not None:
            inp.masked_fill_(mask, -10000.0)
        return torch.nn.Softmax(dim=-1)(inp)

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        b = query.size(1)
        np = query.size(2)
        sq = query.size(0)
        sk = key.size(0)
        hn = value.size(3)

        # [sq, b, np, hn] -> [sq, b * np, hn]
        query = query.view(sq, b * np, -1)
        # [sk, b, np, hn] -> [sk, b * np, hn]
        key = key.view(sk, b * np, -1)

        bmm1 = torch.bmm(query.transpose(0, 1), key.transpose(0, 1).transpose(1, 2)) / self.norm_factor

        # change view to [b, np, sq, sk]
        attention_scores = bmm1.view(b, np, sq, sk)

        attention_probs = self.masked_softmax(attention_scores, attention_mask)
        attention_probs = self.dropout(attention_probs)

        # change view [sk, b * np, hn]
        value = value.view(sk, b * np, -1)
        # change view [b * np, sq, sk]
        attention_probs = attention_probs.view(b * np, sq, -1)

        # matmul: [b * np, sq, hn]
        context = torch.bmm(attention_probs, value.transpose(0, 1))

        # change view [b, np, sq, hn]
        context = context.view(b, np, sq, hn)
        # [b, np, sq, hn] --> [sq, b, np, hn]
        context = context.permute(2, 0, 1, 3).contiguous()
        # [sq, b, np, hn] --> [sq, b, hp]
        context = context.view(sq, b, self.projection_size)

        return context
class BasicMLP(torch.nn.Module):
    """Feed-forward network in Transformer layer

    Built with plain PyTorch modules.
    """
    def __init__(
        self,
        hidden_size: int,
        ffn_hidden_size: int,
    ) -> None:
        super().__init__()
        self.linear1 = torch.nn.Linear(hidden_size, ffn_hidden_size, bias=True)
        self.linear2 = torch.nn.Linear(ffn_hidden_size, hidden_size, bias=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.linear1(x)
        x = torch.nn.functional.gelu(x, approximate='tanh')
        x = self.linear2(x)
        return x
def share_parameters_with_basic_te_model(te_model, basic_model):
    """Initialize parameters for TE Transformer layer with basic modules

    Parameter values are copied from the pure PyTorch implementation.
    """
    te_model.ln1.weight = basic_model.ln1.weight
    te_model.ln1.bias = basic_model.ln1.bias
    te_model.qkv_projection.weight = basic_model.qkv_projection.weight
    te_model.qkv_projection.bias = basic_model.qkv_projection.bias
    te_model.projection.weight = basic_model.projection.weight
    te_model.projection.bias = basic_model.projection.bias
    te_model.ln2.weight = basic_model.ln2.weight
    te_model.ln2.bias = basic_model.ln2.bias
    te_model.mlp.linear1.weight = basic_model.mlp.linear1.weight
    te_model.mlp.linear1.bias = basic_model.mlp.linear1.bias
    te_model.mlp.linear2.weight = basic_model.mlp.linear2.weight
    te_model.mlp.linear2.bias = basic_model.mlp.linear2.bias
def share_parameters_with_fused_te_model(te_model, basic_model):
    """Initialize parameters for TE Transformer layer with fused modules

    Parameter values are copied from the pure PyTorch implementation.
    """
    te_model.ln_qkv.layer_norm_weight = basic_model.ln1.weight
    te_model.ln_qkv.layer_norm_bias = basic_model.ln1.bias
    te_model.ln_qkv.weight = basic_model.qkv_projection.weight
    te_model.ln_qkv.bias = basic_model.qkv_projection.bias
    te_model.projection.weight = basic_model.projection.weight
    te_model.projection.bias = basic_model.projection.bias
    te_model.ln_mlp.layer_norm_weight = basic_model.ln2.weight
    te_model.ln_mlp.layer_norm_bias = basic_model.ln2.bias
    te_model.ln_mlp.fc1_weight = basic_model.mlp.linear1.weight
    te_model.ln_mlp.fc1_bias = basic_model.mlp.linear1.bias
    te_model.ln_mlp.fc2_weight = basic_model.mlp.linear2.weight
    te_model.ln_mlp.fc2_bias = basic_model.mlp.linear2.bias
def share_parameters_with_transformerlayer_te_model(te_model, basic_model):
    """Initialize parameters for monolithic TE Transformer layer

    Parameter values are copied from the pure PyTorch implementation.
    """
    te_model.self_attention.layernorm_qkv.layer_norm_weight = basic_model.ln1.weight
    te_model.self_attention.layernorm_qkv.layer_norm_bias = basic_model.ln1.bias
    te_model.self_attention.layernorm_qkv.weight = basic_model.qkv_projection.weight
    te_model.self_attention.layernorm_qkv.bias = basic_model.qkv_projection.bias
    te_model.self_attention.proj.weight = basic_model.projection.weight
    te_model.self_attention.proj.bias = basic_model.projection.bias
    te_model.layernorm_mlp.layer_norm_weight = basic_model.ln2.weight
    te_model.layernorm_mlp.layer_norm_bias = basic_model.ln2.bias
    te_model.layernorm_mlp.fc1_weight = basic_model.mlp.linear1.weight
    te_model.layernorm_mlp.fc1_bias = basic_model.mlp.linear1.bias
    te_model.layernorm_mlp.fc2_weight = basic_model.mlp.linear2.weight
    te_model.layernorm_mlp.fc2_bias = basic_model.mlp.linear2.bias
def cast_to_representable(inp, scale=1., fp8_format='e4m3'):
    """Round a tensor to the nearest values exactly representable in FP8.

    Casts to FP8 and back to the input dtype using TE's C++ extensions.
    """
    import transformer_engine.pytorch.cpp_extensions as texcpp
    import transformer_engine_extensions as tex
    fp8_type = tex.DType.kFloat8E4M3 if fp8_format == 'e4m3' else tex.DType.kFloat8E5M2
    input_type = texcpp.TE_DType[inp.dtype]
    # Dummy FP8 metadata (scale, inverse scale and a single-entry amax history)
    meta = tex.FP8TensorMeta()
    meta.scale = torch.ones(1, dtype=torch.float32, device="cuda") * scale
    meta.scale_inv = torch.ones(1, dtype=torch.float32, device="cuda") / scale
    meta.amax_history = torch.zeros(1, 1, dtype=torch.float32, device="cuda")
    # Round-trip through FP8: quantize, then dequantize back to the input type
    ret = texcpp.cast_to_fp8(inp, meta, tex.FP8FwdTensors.GEMM1_INPUT, fp8_type)
    ret = texcpp.cast_from_fp8(ret, meta, tex.FP8FwdTensors.GEMM1_INPUT, fp8_type, input_type)
    return ret
..
    Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.

Transformer Engine documentation
==============================================

.. include:: ../README.rst
   :start-after: overview-begin-marker-do-not-remove
   :end-before: overview-end-marker-do-not-remove

.. toctree::
   :hidden:

   Home <self>

.. toctree::
   :hidden:
   :caption: Getting Started

   installation
   examples/quickstart.ipynb

.. toctree::
   :hidden:
   :caption: Python API documentation

   api/common
   api/framework

.. toctree::
   :hidden:
   :caption: Examples and Tutorials

   examples/fp8_primer.ipynb

.. toctree::
   :hidden:
   :caption: Advanced

   api/c/index
..
    Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.

Installation
============

Prerequisites
-------------
.. |driver link| replace:: NVIDIA Driver
.. _driver link: https://www.nvidia.com/drivers

1. Linux x86_64
2. `CUDA 11.8 <https://developer.nvidia.com/cuda-downloads>`__
3. |driver link|_ supporting CUDA 11.8 or later.

Transformer Engine in NGC Containers
------------------------------------

The Transformer Engine library is preinstalled in the PyTorch container in versions 22.09 and later
on `NVIDIA GPU Cloud <https://ngc.nvidia.com>`_.

pip - from GitHub
-----------------------

Additional Prerequisites
^^^^^^^^^^^^^^^^^^^^^^^^

1. `CMake <https://cmake.org/>`__ version 3.18 or later
2. `PyTorch <https://pytorch.org/>`__ with GPU support
3. `Ninja <https://ninja-build.org/>`__

Installation (stable release)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Execute the following command to install the latest stable version of Transformer Engine:

.. code-block:: bash

   pip install --upgrade git+https://github.com/NVIDIA/TransformerEngine.git@stable

Installation (development build)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. warning::

   While the development build of Transformer Engine could contain new features not available in
   the official build yet, it is not supported and its usage is not recommended for general use.

Execute the following command to install the latest development build of Transformer Engine:

.. code-block:: bash

   pip install --upgrade git+https://github.com/NVIDIA/TransformerEngine.git@main
# Basic MNIST Example with optional FP8
```bash
python main.py
python main.py --use-te # Linear layers from TransformerEngine
python main.py --use-fp8 # FP8 + TransformerEngine for Linear layers
```
# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR
from transformer_engine import pytorch as te
class Net(nn.Module):
    def __init__(self, use_te=False):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        if use_te:
            self.fc1 = te.Linear(9216, 128)
            self.fc2 = te.Linear(128, 16)
        else:
            self.fc1 = nn.Linear(9216, 128)
            self.fc2 = nn.Linear(128, 16)
        self.fc3 = nn.Linear(16, 10)

    def forward(self, x):
        """FWD"""
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        x = self.fc3(x)
        output = F.log_softmax(x, dim=1)
        return output
def train(args, model, device, train_loader, optimizer, epoch, use_fp8):
    """Training function."""
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        with te.fp8_autocast(enabled=use_fp8):
            output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print(
                f"Train Epoch: {epoch} "
                f"[{batch_idx * len(data)}/{len(train_loader.dataset)} "
                f"({100. * batch_idx / len(train_loader):.0f}%)]\t"
                f"Loss: {loss.item():.6f}"
            )
            if args.dry_run:
                break
def test(model, device, test_loader, use_fp8):
    """Testing function."""
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            with te.fp8_autocast(enabled=use_fp8):
                output = model(data)
            test_loss += F.nll_loss(
                output, target, reduction="sum"
            ).item()  # sum up batch loss
            pred = output.argmax(
                dim=1, keepdim=True
            )  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print(
        f"\nTest set: Average loss: {test_loss:.4f}, "
        f"Accuracy: {correct}/{len(test_loader.dataset)} "
        f"({100. * correct / len(test_loader.dataset):.0f}%)\n"
    )
def main():
    # Training settings
    parser = argparse.ArgumentParser(description="PyTorch MNIST Example")
    parser.add_argument(
        "--batch-size",
        type=int,
        default=64,
        metavar="N",
        help="input batch size for training (default: 64)",
    )
    parser.add_argument(
        "--test-batch-size",
        type=int,
        default=1000,
        metavar="N",
        help="input batch size for testing (default: 1000)",
    )
    parser.add_argument(
        "--epochs",
        type=int,
        default=14,
        metavar="N",
        help="number of epochs to train (default: 14)",
    )
    parser.add_argument(
        "--lr",
        type=float,
        default=1.0,
        metavar="LR",
        help="learning rate (default: 1.0)",
    )
    parser.add_argument(
        "--gamma",
        type=float,
        default=0.7,
        metavar="M",
        help="Learning rate step gamma (default: 0.7)",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        default=False,
        help="quickly check a single pass",
    )
    parser.add_argument(
        "--seed", type=int, default=1, metavar="S", help="random seed (default: 1)"
    )
    parser.add_argument(
        "--log-interval",
        type=int,
        default=10,
        metavar="N",
        help="how many batches to wait before logging training status",
    )
    parser.add_argument(
        "--save-model",
        action="store_true",
        default=False,
        help="For Saving the current Model",
    )
    parser.add_argument(
        "--use-fp8", action="store_true", default=False, help="Use FP8 training"
    )
    parser.add_argument(
        "--use-te", action="store_true", default=False, help="Use Transformer Engine"
    )
    args = parser.parse_args()
    use_cuda = torch.cuda.is_available()

    if args.use_fp8:
        assert use_cuda, "CUDA needed for FP8 execution."
        args.use_te = True

    torch.manual_seed(args.seed)

    device = torch.device("cuda" if use_cuda else "cpu")

    train_kwargs = {"batch_size": args.batch_size}
    test_kwargs = {"batch_size": args.test_batch_size}
    if use_cuda:
        cuda_kwargs = {"num_workers": 1, "pin_memory": True, "shuffle": True}
        train_kwargs.update(cuda_kwargs)
        test_kwargs.update(cuda_kwargs)

    transform = transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
    )
    dataset1 = datasets.MNIST("../data", train=True, download=True, transform=transform)
    dataset2 = datasets.MNIST("../data", train=False, transform=transform)
    train_loader = torch.utils.data.DataLoader(dataset1, **train_kwargs)
    test_loader = torch.utils.data.DataLoader(dataset2, **test_kwargs)

    model = Net(use_te=args.use_te).to(device)
    optimizer = optim.Adadelta(model.parameters(), lr=args.lr)

    scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
    for epoch in range(1, args.epochs + 1):
        train(args, model, device, train_loader, optimizer, epoch, args.use_fp8)
        test(model, device, test_loader, args.use_fp8)
        scheduler.step()

    if args.save_model:
        torch.save(model.state_dict(), "mnist_cnn.pt")


if __name__ == "__main__":
    main()
# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.
set -e
: ${TE_PATH:=/opt/transformerengine}
TE_LIB_PATH=`pip show transformer-engine | grep Location | cut -d ' ' -f 2`
export LD_LIBRARY_PATH=$TE_LIB_PATH:$LD_LIBRARY_PATH
cd $TE_PATH/tests/cpp
cmake -GNinja -Bbuild .
cmake --build build
cd build && ctest
{
    "initial_year": 2022,
    "copyright": "Copyright (c) <YEAR>, NVIDIA CORPORATION & AFFILIATES. All rights reserved.",
    "license": "See LICENSE for license information.",
    "exclude": [
        "3rdparty",
        "Dockerfile",
        "Dockerfile.base",
        "Dockerfile.qa",
        "Dockerfile.devel",
        "Dockerfile.docs",
        "docker-build.sh",
        ".png",
        ".ipynb",
        "docs/Makefile",
        "layout.html",
        "LICENSE",
        "VERSION",
        "Doxyfile",
        "pylintrc",
        ".json"
    ],
    "exclude_copyright": [],
    "copyright_only": false
}
# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.
set -e
: "${TE_PATH:=/opt/transformerengine}"
python $TE_PATH/qa/L0_license/copyright_checker.py $TE_PATH
# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.
# Stop searching for additional config files.
set noparent
# Limit line length.
linelength=100
# Ignore the following errors.
filter=-build/include_subdir
filter=-build/namespaces
filter=-readability/todo
filter=-build/header_guard
filter=-build/include
#!/bin/bash
# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.
python_files=`find transformer_engine tests setup.py examples -name '*.py'`
for f in $python_files
do
    black $f
done
[MASTER]

extension-pkg-whitelist=torch,
                        transformer_engine_extensions,
                        scaled_softmax_cuda,
                        scaled_masked_softmax_cuda,
                        scaled_upper_triang_masked_softmax_cuda

disable=too-many-locals,
        invalid-name,
        too-many-arguments,
        abstract-method,
        arguments-differ,
        too-many-instance-attributes,
        unsubscriptable-object,
        import-outside-toplevel,
        too-many-statements,
        import-error,
        too-many-lines,
        use-maxsplit-arg,
        protected-access,
        pointless-string-statement,
        cyclic-import,
        duplicate-code,
        no-member,
        attribute-defined-outside-init,
        global-statement,
        too-many-branches,
        global-variable-not-assigned

[TYPECHECK]

ignored-modules=torch
ignored-classes=torch
# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.
set -e
: "${TE_PATH:=/opt/transformerengine}"
pip install cpplint==1.6.0 pylint==2.13.5
echo "Checking common API headers"
cd $TE_PATH && \
cpplint --root transformer_engine/common/include --recursive transformer_engine/common/include
echo "Checking C++ files"
cd $TE_PATH && \
cpplint --recursive --exclude=transformer_engine/common/include transformer_engine
echo "Checking Python files"
cd $TE_PATH && \
pylint --recursive=y transformer_engine