"vscode:/vscode.git/clone" did not exist on "4ee52bb169d64691c3bfe7b1b2fff91300d49095"
Unverified Commit d6ff6f4d authored by Kirthi Shankar Sivamani, committed by GitHub

Docs: remove build warnings and add FP8 caching note (#44)



* docs: remove build warnings and add FP8 caching note

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* add comment about amax history

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
parent 64a8dc90
@@ -4,6 +4,6 @@
 See LICENSE for license information.
 softmax.h
-======
+=========
 .. doxygenfile:: softmax.h
...
@@ -83,7 +83,7 @@ pygments_style = 'sphinx'
 html_theme = 'sphinx_rtd_theme'
 html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
-html_static_path = ['_static']
+html_static_path = []
 html_theme_options = {
 'display_version': True,
...
@@ -250,6 +250,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "id": "add64bd5",
 "metadata": {},
@@ -264,7 +265,15 @@
 "\n",
 "</div>\n",
 "\n",
-"Since weights are typically trained in FP32, a type conversion is required before we can perform compute in FP8. By default, the [fp8_autocast](../api/pytorch.rst#transformer_engine.pytorch.fp8_autocast) context manager will handle this internally by casting non-FP8 tensors to FP8 as they are encountered. However, we can improve upon this in some cases. In particular, if our training iteration is split into multiple gradient accumulation steps, each micro-batch will encounter the same weight tensors. Thus, we only need to cast the weights to FP8 in the first gradient accumulation step and we can cache the resulting FP8 weights for the remaining gradient accumulation steps."
+"Since weights are typically trained in FP32, a type conversion is required before we can perform compute in FP8. By default, the [fp8_autocast](../api/pytorch.rst#transformer_engine.pytorch.fp8_autocast) context manager will handle this internally by casting non-FP8 tensors to FP8 as they are encountered. However, we can improve upon this in some cases. In particular, if our training iteration is split into multiple gradient accumulation steps, each micro-batch will encounter the same weight tensors. Thus, we only need to cast the weights to FP8 in the first gradient accumulation step and we can cache the resulting FP8 weights for the remaining gradient accumulation steps.\n",
+"\n",
+"<div class=\"alert alert-warning\">\n",
+"\n",
+"<b>Warning!</b> \n",
+"\n",
+"The precise numerical outputs with and without the FP8 weight caching optimization may not be bitwise identical. This is because while the weights remain frozen across a gradient accumulation cycle, the scaling factors and amaxes for the FP8 weights can change as they are updated at the end of every iteration. These changes in amax tensors are incorporated into the amax history, which is not frozen.\n",
+"\n",
+"</div>"
 ]
 },
 {
...
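The caching behavior described in the notebook cell above can be exercised roughly as follows. This is a minimal sketch, not part of the commit: it assumes an FP8-capable GPU and that the Transformer Engine PyTorch modules accept an is_first_microbatch forward argument for reusing cached FP8 weights (verify against your installed version). The comments on the recipe echo the new warning: scaling factors and the amax history keep updating every iteration even while the FP8 weights themselves are cached.

import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Hypothetical FP8 recipe; the amax history is not frozen by weight caching,
# so results with and without caching may differ slightly (see warning above).
fp8_recipe = recipe.DelayedScaling(
    fp8_format=recipe.Format.HYBRID,  # E4M3 for forward, E5M2 for gradients
    amax_history_len=16,
    amax_compute_algo="max",
)

model = te.Linear(1024, 1024, bias=True).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
num_microbatches = 4

for step in range(10):
    optimizer.zero_grad()
    for micro in range(num_microbatches):
        inp = torch.randn(32, 1024, device="cuda")
        with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
            # Cast (and cache) the FP8 weights only on the first micro-batch;
            # the remaining micro-batches reuse the cached FP8 copy.
            out = model(inp, is_first_microbatch=(micro == 0))
        out.sum().backward()
    # Weights, scaling factors, and amax history update once per iteration.
    optimizer.step()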