Commit 6644c6e6 authored by Michael Carilli

Converging

parent 248d7b10
......@@ -205,6 +205,12 @@ def initialize(
Initialize your models, optimizers, and the Torch tensor and functional namespace according to the
chosen ``opt_level`` and overridden properties, if any.
``amp.initialize`` must be called **after** you have finished constructing your model(s) and
optimizer(s), but **before** you send your model through any DistributedDataParallel wrapper.
See `Distributed training`_ in the Imagenet example.
Any property keyword argument that is not ``None`` will be interpreted as a manual override.
To avoid having to rewrite anything else in your script, assign the returned models/optimizers
to the same names as the models/optimizers you passed in, as in the Usage below.
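As a point of reference, a minimal sketch of the intended call order (the ``torch.nn.Linear`` model
and SGD optimizer below are placeholders, not part of the documented API)::

    import torch
    from apex import amp

    model = torch.nn.Linear(512, 512).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    # Assign the returned objects back to the same names so nothing else
    # in the script needs to change.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    # Any DistributedDataParallel wrapping happens after amp.initialize, e.g.:
    # model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])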
......@@ -214,7 +220,7 @@ def initialize(
enabled (bool, optional, default=True): If False, renders all Amp calls no-ops, so your script
should run as if Amp were not present.
opt_level(str, required): Pure or mixed precision optimization level. Accepted values are
"O0", "O1", "O2", and "O3", which are explained in detail above.
"O0", "O1", "O2", and "O3", explained in detail above.
cast_model_type (``torch.dtype``, optional, default=None): Optional property override, see
above.
patch_torch_functions (bool, optional, default=None): Optional property override.
......@@ -254,6 +260,9 @@ def initialize(
The `Imagenet example`_ demonstrates live use of various opt_levels and overrides.
.. _`Distributed training`:
https://github.com/NVIDIA/apex/tree/master/examples/imagenet#distributed-training
.. _`Imagenet example`:
https://github.com/NVIDIA/apex/tree/master/examples/imagenet
"""
......
......@@ -43,9 +43,9 @@ def scale_loss(loss,
Args:
loss(Tensor): Typically a scalar Tensor. The ``scaled_loss`` that the context
manager yields is simply ``loss.float()*loss_scale``, so in principle
`loss` could have more than one element, as long as you call
``loss`` could have more than one element, as long as you call
``backward()`` on ``scaled_loss`` appropriately within the context manager body.
optimizer: **Must** be an optimizer returned from an earlier call to ``amp.initialize``.
optimizer: Must be an optimizer returned from an earlier call to ``amp.initialize``.
model(torch.nn.Module, optional, default=None): Currently unused, reserved to enable future
optimizations.
delay_unscale(bool, default=False): Don't unscale the gradients or perform model->master
......
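For orientation, a typical backward pass through ``scale_loss`` (assuming ``loss`` and ``optimizer``
come from the surrounding script and an earlier ``amp.initialize`` call) is sketched below::

    # optimizer must be one returned by amp.initialize
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()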
......@@ -4,21 +4,9 @@
apex.amp
===================================
Unified API
-----------
This page documents the updated API for Amp (Automatic Mixed Precision),
a tool to enable Tensor Core-accelerated training in only 3 lines of Python.
Amp allows users to easily experiment with different pure and mixed precision modes, including
pure FP16 training and pure FP32 training. Commonly-used default modes are chosen by
selecting an "optimization level" or ``opt_level``; each ``opt_level`` establishes a set of
properties that govern Amp's implementation of pure or mixed precision training.
Finer-grained control of how a given ``opt_level`` behaves can be achieved by passing values for
particular properties directly to ``amp.initialize``. These manually specified values will
override the defaults established by the ``opt_level``. If you attempt to override a property
that does not make sense for the current ``opt_level``, Amp will raise an error with an explanation.
Users **should not** manually cast their model or data to ``.half()``, regardless of what ``opt_level``
or properties are chosen. Amp intends that users start with an existing default (FP32) script,
add the three lines corresponding to the Amp API, and begin training with mixed precision.
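Concretely, the three added lines amount to something like the following sketch, where the model,
optimizer, and training loop are assumed to already exist in the FP32 script::

    from apex import amp                                                   # 1

    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")    # 2

    # inside the training loop, replacing loss.backward():
    with amp.scale_loss(loss, optimizer) as scaled_loss:                   # 3
        scaled_loss.backward()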
......@@ -45,15 +33,120 @@ on the Github page.
GANs are a tricky case that many people have requested. A `comprehensive DCGAN example`_
is under construction.
``opt_level``\ s and Properties
-------------------------------
.. _`runnable, comprehensive Imagenet example`:
https://github.com/NVIDIA/apex/tree/master/examples/imagenet
.. _`comprehensive DCGAN example`:
https://github.com/NVIDIA/apex/tree/master/examples/dcgan
``opt_level``\ s and Properties
-------------------------------
Amp allows users to easily experiment with different pure and mixed precision modes, including
pure FP16 training and pure FP32 training. Commonly-used default modes are chosen by
selecting an "optimization level" or ``opt_level``; each ``opt_level`` establishes a set of
properties that govern Amp's implementation of pure or mixed precision training.
Finer-grained control of how a given ``opt_level`` behaves can be achieved by passing values for
particular properties directly to ``amp.initialize``. These manually specified values will
override the defaults established by the ``opt_level``.
Properties
**********
Currently, the under-the-hood properties that govern pure or mixed precision training are the following:
- ``cast_model_type``: Casts your model's parameters and buffers to the desired type.
- ``patch_torch_functions``: Patch all Torch functions and Tensor methods to perform Tensor Core-friendly ops like GEMMs and convolutions in FP16, and any ops that benefit from FP32 precision in FP32.
- ``keep_batchnorm_fp32``: To enhance precision and enable cudnn batchnorm (which improves performance), it is often beneficial to keep batch normalization layers in FP32 even if the rest of the model is FP16.
- ``master_weights``: Maintain FP32 master weights to accompany any FP16 model weights. FP32 master weights are stepped by the optimizer to enhance precision and capture small gradients.
- ``loss_scale``: If ``loss_scale`` is a float value, use this value as the static (fixed) loss scale. If ``loss_scale`` is the string ``"dynamic"``, adaptively adjust the loss scale over time. Dynamic loss scale adjustments are performed by Amp automatically.
Again, you often don't need to specify these properties by hand. Instead, select an ``opt_level``,
which will set them up for you. After selecting an ``opt_level``, you can optionally pass property
kwargs as manual overrides.
If you attempt to override a property that does not make sense for the selected ``opt_level``,
Amp will raise an error with an explanation. For example, selecting ``opt_level="O1"`` combined with
the override ``master_weights=True`` does not make sense. ``O1`` inserts casts
around Torch functions rather than model weights. Data, activations, and weights are recast
out-of-place on the fly as they flow through patched functions. Therefore, the model weights themselves
can (and should) remain FP32, and there is no need to maintain separate FP32 master weights.
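For instance, a hedged sketch of a sensible override next to the nonsensical one described above::

    # Sensible: O1 with a manually chosen static loss scale instead of dynamic scaling
    model, optimizer = amp.initialize(model, optimizer,
                                      opt_level="O1", loss_scale=128.0)

    # Nonsensical: O1 keeps model weights in FP32, so requesting separate
    # FP32 master weights is contradictory, and Amp raises an error.
    model, optimizer = amp.initialize(model, optimizer,
                                      opt_level="O1", master_weights=True)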
``opt_level``\ s
****************
``O0`` and ``O3`` are not true mixed precision, but they are useful for establishing baselines.
``O0``: FP32 training
^^^^^^^^^^^^^^^^^^^^^^
Your incoming model should be FP32 already, so this is likely a no-op.
``O0`` can be useful to establish an accuracy baseline.
| Default properties set by ``O0``:
| ``cast_model_type=torch.float32``
| ``patch_torch_functions=False``
| ``keep_batchnorm_fp32=None`` (effectively, "not applicable")
| ``master_weights=False``
| ``loss_scale=1.0``
|
|
``O3``: FP16 training
^^^^^^^^^^^^^^^^^^^^^^
``O3`` may not achieve the stability of the true mixed precision options ``O1`` and ``O2``.
However, it can be useful to establish a speed baseline for your model, against which
the performance of ``O1`` and ``O2`` can be compared. If your model uses batch normalization,
to establish "speed of light" you can try ``O3`` with the additional property override
``keep_batchnorm_fp32=True`` (which enables cudnn batchnorm, as stated earlier).
| Default properties set by ``O3``:
| ``cast_model_type=torch.float16``
| ``patch_torch_functions=False``
| ``keep_batchnorm_fp32=False``
| ``master_weights=False``
| ``loss_scale=1.0``
|
|
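A sketch of the "speed of light" variant mentioned above (assuming the boolean form of the
``keep_batchnorm_fp32`` override is accepted alongside ``O3``)::

    # Pure FP16, except batchnorm weights and statistics stay in FP32
    # so that cudnn batchnorm can be used.
    model, optimizer = amp.initialize(model, optimizer,
                                      opt_level="O3", keep_batchnorm_fp32=True)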
``O1`` and ``O2`` are different implementations of mixed precision. Try both, and see
what gives the best speedup and accuracy for your model.
``O1``: Conservative Mixed Precision
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Patch all Torch functions and Tensor methods to cast their inputs according to a whitelist/blacklist
scheme. Whitelist ops (for example, Tensor Core-friendly ops like GEMMs and convolutions) are performed
in FP16. Blacklist ops that benefit from FP32 precision (for example, batchnorm and softmax)
are performed in FP32. ``O1`` also uses dynamic loss scaling, unless overridden.
| Default properties set by ``O1``:
| ``cast_model_type=None`` (not applicable)
| ``patch_torch_functions=True``
| ``keep_batchnorm_fp32=None`` (not necessary to specify True; batchnorm inputs are already cast to FP32)
| ``master_weights=None`` (not applicable, model weights remain FP32)
| ``loss_scale="dynamic"``
|
|
``O2``: Fast Mixed Precision
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``O2`` casts the model to FP16, keeps batchnorms in FP32, maintains master weights in FP32,
and implements dynamic loss scaling (unless overridden).
Unlike ``O1``, ``O2`` does not patch Torch functions or Tensor methods.
| Default properties set by ``O2``:
| ``cast_model_type=torch.float16``
| ``patch_torch_functions=False``
| ``keep_batchnorm_fp32=True``
| ``master_weights=True``
| ``loss_scale="dynamic"``
|
|
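One way to compare the two, sketched here with a hypothetical ``build_model_and_optimizer`` helper,
is simply to run a short training job under each level and measure speed and accuracy::

    for level in ("O1", "O2"):
        model, optimizer = build_model_and_optimizer()   # hypothetical helper
        model, optimizer = amp.initialize(model, optimizer, opt_level=level)
        # ...train for a fixed number of iterations, then compare throughput and accuracy...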
Unified API
-----------
.. automodule:: apex.amp
.. currentmodule:: apex.amp
......@@ -66,7 +159,7 @@ is under construction.
Advanced use cases
------------------
The new Amp API supports gradient accumulation across iterations,
The unified Amp API supports gradient accumulation across iterations,
multiple backward passes per iteration, multiple models/optimizers,
and custom/user-defined autograd functions. Gradient clipping and GANs also
require special treatment, but this treatment does not need to change
......@@ -82,10 +175,12 @@ Transition guide for old API users
We strongly encourage moving to the new Amp API, because it's more versatile, easier to use, and future-proof. The original :class:`FP16_Optimizer` and the old "Amp" API are deprecated, and subject to removal at any time.
**For users of the old "Amp" API**
For users of the old "Amp" API
******************************
In the new API, ``opt-level O1`` performs the same patching of the Torch namespace as the old Amp API.
However, the new API allows choosing static or dynamic loss scaling, while the old API only allowed dynamic loss scaling.
In the new API, ``opt-level O1`` performs the same patching of the Torch namespace as the old thing
called "Amp."
However, the new API allows static or dynamic loss scaling, while the old API only allowed dynamic loss scaling.
In the new API, the old call to ``amp_handle = amp.init()``, and the returned ``amp_handle``, are no
longer exposed or necessary. The new ``amp.initialize()`` does the duty of ``amp.init()`` (and more).
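A rough before/after sketch (the old calls are shown only for orientation and are deprecated)::

    # Old API (deprecated)
    amp_handle = amp.init()
    ...
    with amp_handle.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

    # New API
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
    ...
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()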
......@@ -111,7 +206,8 @@ with a particular precision are still honored by the new API.
https://github.com/NVIDIA/apex/tree/master/apex/amp#annotating-user-functions
**For users of the old FP16_Optimizer**
For users of the old FP16_Optimizer
***********************************
``opt-level O2`` is equivalent to :class:`FP16_Optimizer` with ``dynamic_loss_scale=True``.
Once again, the backward pass must be changed to the unified version::
......
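For reference, a hedged sketch of the unified backward pass that replaces the former
``optimizer.backward(loss)`` call::

    # optimizer.backward(loss) becomes:
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()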