"vscode:/vscode.git/clone" did not exist on "4ee52bb169d64691c3bfe7b1b2fff91300d49095"
Unverified Commit d6ff6f4d authored by Kirthi Shankar Sivamani, committed by GitHub

Docs: remove build warnings and add FP8 caching note (#44)



* docs: remove build warnings and add FP8 caching note

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* add comment about amax history

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
parent 64a8dc90
@@ -4,6 +4,6 @@
 See LICENSE for license information.
 softmax.h
-======
+=========
 .. doxygenfile:: softmax.h
...
@@ -83,7 +83,7 @@ pygments_style = 'sphinx'
 html_theme = 'sphinx_rtd_theme'
 html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
-html_static_path = ['_static']
+html_static_path = []
 html_theme_options = {
 'display_version': True,
...
@@ -250,6 +250,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "id": "add64bd5",
 "metadata": {},
@@ -264,7 +265,15 @@
 "\n",
 "</div>\n",
 "\n",
-"Since weights are typically trained in FP32, a type conversion is required before we can perform compute in FP8. By default, the [fp8_autocast](../api/pytorch.rst#transformer_engine.pytorch.fp8_autocast) context manager will handle this internally by casting non-FP8 tensors to FP8 as they are encountered. However, we can improve upon this in some cases. In particular, if our training iteration is split into multiple gradient accumulation steps, each micro-batch will encounter the same weight tensors. Thus, we only need to cast the weights to FP8 in the first gradient accumulation step and we can cache the resulting FP8 weights for the remaining gradient accumulation steps."
+"Since weights are typically trained in FP32, a type conversion is required before we can perform compute in FP8. By default, the [fp8_autocast](../api/pytorch.rst#transformer_engine.pytorch.fp8_autocast) context manager will handle this internally by casting non-FP8 tensors to FP8 as they are encountered. However, we can improve upon this in some cases. In particular, if our training iteration is split into multiple gradient accumulation steps, each micro-batch will encounter the same weight tensors. Thus, we only need to cast the weights to FP8 in the first gradient accumulation step and we can cache the resulting FP8 weights for the remaining gradient accumulation steps.\n",
+"\n",
+"<div class=\"alert alert-warning\">\n",
+"\n",
+"<b>Warning!</b> \n",
+"\n",
+"The precise numerical outputs with and without the FP8 weight caching optimization may not be bitwise identical. This is because while the weights remain frozen across a gradient accumulation cycle, the scaling factors and amaxes for the FP8 weights can change as they are updated at the end of every iteration. These changes in amax tensors are incorporated into the amax history, which is not frozen.\n",
+"\n",
+"</div>"
 ]
 },
 {
...
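The caching behavior described in the notebook cell above can be exercised roughly as follows. This is a minimal sketch, not part of the commit: it assumes an FP8-capable GPU and that the Transformer Engine PyTorch modules accept an is_first_microbatch forward argument for reusing cached FP8 weights (verify against your installed version). The comments on the recipe echo the new warning: scaling factors and the amax history keep updating every iteration even while the FP8 weights themselves are cached.

import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Hypothetical FP8 recipe; the amax history is not frozen by weight caching,
# so results with and without caching may differ slightly (see warning above).
fp8_recipe = recipe.DelayedScaling(
    fp8_format=recipe.Format.HYBRID,  # E4M3 for forward, E5M2 for gradients
    amax_history_len=16,
    amax_compute_algo="max",
)

model = te.Linear(1024, 1024, bias=True).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
num_microbatches = 4

for step in range(10):
    optimizer.zero_grad()
    for micro in range(num_microbatches):
        inp = torch.randn(32, 1024, device="cuda")
        with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
            # Cast (and cache) the FP8 weights only on the first micro-batch;
            # the remaining micro-batches reuse the cached FP8 copy.
            out = model(inp, is_first_microbatch=(micro == 0))
        out.sum().backward()
    # Weights, scaling factors, and amax history update once per iteration.
    optimizer.step()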