Merge branch 'master' into deyuf/update_norm

2f0bf594 · Deyu Fu · GitHub · 99495376 · 40555b3a · 2f0bf594
Unverified Commit 2f0bf594 authored Mar 08, 2019 by Deyu Fu Committed by GitHub Mar 08, 2019
20 changed files
--- a/docs/source/advanced.rst
+++ b/docs/source/advanced.rst
+.. role:: hidden
+    :class: hidden-section
+Advanced Amp Usage
+===================================
+GANs
+----
+GANs are an interesting synthesis of several topics below.  A `comprehensive example`_
+is under construction.
+.. _`comprehensive example`:
+    https://github.com/NVIDIA/apex/tree/master/examples/dcgan
+Gradient clipping
+-----------------
+If Amp uses master params distinct from the model params,
+then the params ``step()``\ ed by the optimizer are the master params,
+and it is the master gradients (rather than the model gradients) that must be clipped.
+If Amp is not using master params distinct from the model params, then the optimizer
+directly steps the model params, and the model grads must be clipped.
+In both cases, correct practice is to clip the gradients of the params that are about to be stepped **by the optimizer** (which may be distinct from ``model.parameters()``).
+Also, if Amp uses loss scaling, gradients must be clipped after they have been unscaled.
+The following pattern accounts for all possibilities, and should be correct for
+any ``opt_level``::
+    with amp.scale_loss(loss, optimizer) as scaled_loss:
+        scaled_loss.backward()
+        # Gradients are unscaled during context manager exit.
+    # Now it's safe to clip:
+    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm)
+    # or
+    torch.nn.utils.clip_grad_value_(amp.master_params(optimizer), max_)
+Note the use of the utility function ``amp.master_params(optimizer)``,
+which returns a generator that iterates over the
+params that the optimizer steps (master params if enabled, otherwise model params).
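To see why the clip must come after the unscale, here is a minimal pure-Python sketch that does not use Amp or PyTorch. The helper names `global_norm` and `clip_by_norm`, the loss scale of 1024, and the gradient values are all hypothetical, chosen only for illustration.

```python
import math

def global_norm(grads):
    """L2 norm over a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grads))

def clip_by_norm(grads, max_norm):
    """Rescale grads so their global L2 norm is at most max_norm."""
    norm = global_norm(grads)
    if norm <= max_norm:
        return list(grads)
    return [g * (max_norm / norm) for g in grads]

loss_scale = 1024.0        # hypothetical static loss scale
true_grads = [0.3, -0.4]   # unscaled gradients, global norm 0.5
max_norm = 1.0

# backward() on a scaled loss produces scaled gradients.
scaled_grads = [g * loss_scale for g in true_grads]

# Correct order: unscale first, then clip. Norm 0.5 is under the threshold,
# so the gradients pass through unchanged.
right = clip_by_norm([g / loss_scale for g in scaled_grads], max_norm)

# Wrong order: clip the still-scaled gradients, then unscale.
# The threshold is effectively divided by the loss scale.
wrong = [g / loss_scale for g in clip_by_norm(scaled_grads, max_norm)]

print(global_norm(right), global_norm(wrong))
```

In the wrong order the gradient norm ends up near `max_norm / loss_scale` instead of the intended 0.5, which would silently shrink every update.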
+Custom/user-defined autograd functions
+--------------------------------------
+The old Amp API for `registering user functions`_ is still considered correct.  Functions must
+be registered before calling ``amp.initialize``.
+.. _`registering user functions`:
+    https://github.com/NVIDIA/apex/tree/master/apex/amp#annotating-user-functions
+Forcing particular layers/functions to a desired type
+-----------------------------------------------------
+I'm still working on a generalizable exposure for this that won't require user-side code divergence
+across different ``opt_level``\ s.
+Multiple models/optimizers
+--------------------------
+``amp.initialize``'s optimizer argument may be a single optimizer or a list of optimizers,
+as long as the output you accept has the same type.
+Similarly, the ``model`` argument may be a single model or a list of models, as long as the accepted
+output matches.  The following calls are all legal::
+    model, optim = amp.initialize(model, optim,...)
+    model, [optim1, optim2] = amp.initialize(model, [optim1, optim2],...)
+    [model1, model2], optim = amp.initialize([model1, model2], optim,...)
+    [model1, model2], [optim1, optim2] = amp.initialize([model1, model2], [optim1, optim2],...)
+Whenever you invoke a backward pass, the optimizer you should pass to ``amp.scale_loss`` is whatever
+optimizer owns the parameters for which this particular backward pass is creating gradients.
+Multiple backward passes per iteration
+--------------------------------------
+If you want to accumulate gradients from multiple losses for the params owned by a given optimizer,
+you must invoke ``with amp.scale_loss(..., delay_unscale=True)`` for all backward passes except
+the last::
+    # delay_unscale=True for the first two losses
+    with amp.scale_loss(loss1, optimizer, delay_unscale=True) as scaled_loss:
+        scaled_loss.backward()
+    with amp.scale_loss(loss2, optimizer, delay_unscale=True) as scaled_loss:
+        scaled_loss.backward()
+    # Don't delay_unscale for the final loss 
+    with amp.scale_loss(loss3, optimizer) as scaled_loss:
+        scaled_loss.backward()
+    optimizer.step()
+Gradient accumulation across iterations
+---------------------------------------
+Pass ``delay_unscale=True`` to ``amp.scale_loss`` until you're ready to ``step()``::
+    if iter%iters_to_accumulate == 0:
+        # Every iters_to_accumulate iterations, unscale and step
+        with amp.scale_loss(loss, optimizer) as scaled_loss:
+            scaled_loss.backward()
+        optimizer.step()
+        optimizer.zero_grad()
+    else:
+        # Otherwise, just accumulate gradients, don't unscale or step. 
+        with amp.scale_loss(loss, optimizer, delay_unscale=True) as scaled_loss:
+            scaled_loss.backward()
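The reason for delaying the unscale, both for multiple losses and for accumulation across iterations, can be illustrated with a toy model of a single gradient buffer. This is a sketch under simplifying assumptions, not Amp's implementation: the function `accumulate`, the loss scale of 128, and the gradient values are hypothetical, and a plain float stands in for a parameter's `.grad`.

```python
def accumulate(grads, loss_scale, delay_until_last):
    """Accumulate scaled gradients into one buffer.

    If delay_until_last is True, the buffer is unscaled only after the final
    backward pass; otherwise it is unscaled after every pass.
    """
    buf = 0.0
    for i, g in enumerate(grads):
        buf += g * loss_scale              # backward() adds a scaled gradient
        is_last = i == len(grads) - 1
        if is_last or not delay_until_last:
            buf /= loss_scale              # unscale whatever is in the buffer
    return buf

loss_scale = 128.0              # hypothetical static loss scale
grads = [0.25, -0.5, 1.0]       # true gradient from each backward pass

correct = accumulate(grads, loss_scale, delay_until_last=True)
wrong = accumulate(grads, loss_scale, delay_until_last=False)
print(correct, wrong)
```

With the delayed unscale the buffer holds the true sum of the gradients. Unscaling after every pass mixes already-unscaled values with freshly scaled ones, so earlier contributions are divided by the loss scale more than once.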
--- a/docs/source/amp.rst
+++ b/docs/source/amp.rst
@@ -4,8 +4,235 @@
 apex.amp
 ===================================
-Amp (Automatic Mixed Precision) is a tool designed for ease of use and maximum safety in FP16 training. All potentially unsafe ops are performed in FP32 under the hood, while safe ops are performed using faster, Tensor Core-friendly FP16 math. Amp also automatically implements dynamic loss scaling.
+This page documents the updated API for Amp (Automatic Mixed Precision),
+a tool to enable Tensor Core-accelerated training in only 3 lines of Python.
-The intention of Amp is to be the "on-ramp" to easy FP16 training: achieve all the numerical stability of full FP32 training, with most of the performance benefits of full FP16 training.
+A `runnable, comprehensive Imagenet example`_ demonstrating good practices can be found
+on the Github page.
-Currently, complete API documentation resides on the Github page: https://github.com/NVIDIA/apex/tree/master/apex/amp.
+GANs are a tricky case that many people have requested.  A `comprehensive DCGAN example`_
+is under construction.
+``opt_level``\ s and Properties
+-------------------------------
+Amp allows users to easily experiment with different pure and mixed precision modes.
+Commonly-used default modes are chosen by
+selecting an "optimization level" or ``opt_level``; each ``opt_level`` establishes a set of
+properties that govern Amp's implementation of pure or mixed precision training.
+Finer-grained control of how a given ``opt_level`` behaves can be achieved by passing values for
+particular properties directly to ``amp.initialize``.  These manually specified values
+override the defaults established by the ``opt_level``.
+Example::
+        # Declare model and optimizer as usual
+        model = torch.nn.Linear(D_in, D_out).cuda()
+        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
+        # Allow Amp to perform casts as required by the opt_level
+        model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
+        ...
+        # loss.backward() becomes:
+        with amp.scale_loss(loss, optimizer) as scaled_loss:
+            scaled_loss.backward()
+        ...
+Users **should not** manually cast their model or data to ``.half()``, regardless of what ``opt_level``
+or properties are chosen.  Amp intends that users start with an existing default (FP32) script,
+add the three lines corresponding to the Amp API, and begin training with mixed precision.
+Amp can also be disabled, in which case the original script will behave exactly as it used to.
+In this way, there's no risk in adhering to the Amp API, and a lot of potential performance benefit.
+.. note::
+    Because it's never necessary to manually cast your model (aside from the call to ``amp.initialize``)
+    or input data, a script that adheres to the new API
+    can switch between different ``opt_level``\ s without having to make any other changes.
+.. _`runnable, comprehensive Imagenet example`:
+    https://github.com/NVIDIA/apex/tree/master/examples/imagenet
+.. _`comprehensive DCGAN example`:
+    https://github.com/NVIDIA/apex/tree/master/examples/dcgan
+Properties
+**********
+Currently, the under-the-hood properties that govern pure or mixed precision training are the following:
+- ``cast_model_type``:  Casts your model's parameters and buffers to the desired type.
+- ``patch_torch_functions``: Patch all Torch functions and Tensor methods to perform Tensor Core-friendly ops like GEMMs and convolutions in FP16, and any ops that benefit from FP32 precision in FP32.
+- ``keep_batchnorm_fp32``:  To enhance precision and enable cudnn batchnorm (which improves performance), it's often beneficial to keep batchnorm weights in FP32 even if the rest of the model is FP16.
+- ``master_weights``:  Maintain FP32 master weights to accompany any FP16 model weights.  FP32 master weights are stepped by the optimizer to enhance precision and capture small gradients.
+- ``loss_scale``:  If ``loss_scale`` is a float value, use this value as the static (fixed) loss scale.  If ``loss_scale`` is the string ``"dynamic"``, adaptively adjust the loss scale over time.  Dynamic loss scale adjustments are performed by Amp automatically.
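The motivation behind ``master_weights`` and ``loss_scale`` can be checked with a few lines of standard-library Python. This is only an illustration of FP16 rounding, not Amp code: the helper `to_fp16` is a hypothetical name, Python floats stand in for FP32 values, and the gradient, learning-rate, and scale values are made up.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

# 1) Loss scaling: a tiny gradient underflows to zero in FP16,
#    but survives if it is scaled up before being stored.
tiny_grad = 1e-8
loss_scale = 1024.0                        # hypothetical loss scale
unscaled_fp16 = to_fp16(tiny_grad)         # flushed to zero
scaled_fp16 = to_fp16(tiny_grad * loss_scale)
recovered = scaled_fp16 / loss_scale       # unscale in higher precision

# 2) Master weights: repeated small updates are lost on an FP16 weight,
#    but accumulate on a higher-precision master copy.
update = 1e-4                              # hypothetical lr * grad
fp16_weight = 1.0
master_weight = 1.0
for _ in range(10):
    fp16_weight = to_fp16(fp16_weight + update)   # rounds back to 1.0 each time
    master_weight = master_weight + update        # keeps the small increments

print(unscaled_fp16, recovered, fp16_weight, to_fp16(master_weight))
```

The FP16 weight never moves, while the master copy drifts far enough that its FP16 cast finally changes.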
+Again, you often don't need to specify these properties by hand.  Instead, select an ``opt_level``,
+which will set them up for you.  After selecting an ``opt_level``, you can optionally pass property
+kwargs as manual overrides.
+If you attempt to override a property that does not make sense for the selected ``opt_level``,
+Amp will raise an error with an explanation.  For example, selecting ``opt_level="O1"`` combined with
+the override ``master_weights=True`` does not make sense.  ``O1`` inserts casts
+around Torch functions rather than model weights.  Data, activations, and weights are recast
+out-of-place on the fly as they flow through patched functions.  Therefore, the model weights themselves
+can (and should) remain FP32, and there is no need to maintain separate FP32 master weights.
+``opt_level``\ s
+****************
+Recognized ``opt_level``\ s are ``"O0"``, ``"O1"``, ``"O2"``, and ``"O3"``.
+``O0`` and ``O3`` are not true mixed precision, but they are useful for establishing accuracy and
+speed baselines, respectively.
+``O1`` and ``O2`` are different implementations of mixed precision.  Try both, and see
+what gives the best speedup and accuracy for your model.
+``O0``:  FP32 training
+^^^^^^^^^^^^^^^^^^^^^^
+Your incoming model should be FP32 already, so this is likely a no-op.
+``O0`` can be useful to establish an accuracy baseline.
+| Default properties set by ``O0``:
+| ``cast_model_type=torch.float32``
+| ``patch_torch_functions=False``
+| ``keep_batchnorm_fp32=None`` (effectively, "not applicable," everything is FP32)
+| ``master_weights=False``
+| ``loss_scale=1.0``
+|
+|
+``O1``:  Conservative Mixed Precision
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Patch all Torch functions and Tensor methods to cast their inputs according to a whitelist-blacklist
+model.  Whitelist ops (for example, Tensor Core-friendly ops like GEMMs and convolutions) are performed
+in FP16.  Blacklist ops that benefit from FP32 precision (for example, softmax)
+are performed in FP32.  ``O1`` also uses dynamic loss scaling, unless overridden.
+| Default properties set by ``O1``:
+| ``cast_model_type=None`` (not applicable)
+| ``patch_torch_functions=True``
+| ``keep_batchnorm_fp32=None`` (again, not applicable, all model weights remain FP32)
+| ``master_weights=None`` (not applicable, model weights remain FP32)
+| ``loss_scale="dynamic"``
+|
+|
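Dynamic loss scaling can be pictured as a small feedback loop: shrink the scale and skip the step when gradients overflow, and grow the scale again after a run of clean steps. The sketch below is a toy under stated assumptions. The class name `ToyDynamicLossScaler`, the initial scale, the factor of 2, and the growth interval are all hypothetical and are not taken from Amp's implementation.

```python
class ToyDynamicLossScaler:
    """Toy dynamic loss scale: back off on overflow, grow after a clean streak."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=3):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._clean_steps = 0

    def update(self, found_overflow):
        """Adjust the scale; return True if the optimizer step should be applied."""
        if found_overflow:
            # Gradients contained inf/nan: halve the scale and skip this step.
            self.scale /= 2.0
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps == self.growth_interval:
            # A full streak of clean steps: try a larger scale again.
            self.scale *= 2.0
            self._clean_steps = 0
        return True

scaler = ToyDynamicLossScaler()
overflow_per_step = [True, False, False, False, True]
applied = [scaler.update(overflow) for overflow in overflow_per_step]
print(applied, scaler.scale)
```

A fixed float passed as ``loss_scale`` would correspond to skipping this loop and keeping `scale` constant.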
+``O2``:  Fast Mixed Precision
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+``O2`` casts the model weights to FP16,
+patches the model's ``forward`` method to cast input
+data to FP16, keeps batchnorms in FP32, maintains FP32 master weights,
+and implements dynamic loss scaling (unless overridden).
+Unlike ``O1``, ``O2`` does not patch Torch functions or Tensor methods.
+| Default properties set by ``O2``:
+| ``cast_model_type=torch.float16``
+| ``patch_torch_functions=False``
+| ``keep_batchnorm_fp32=True``
+| ``master_weights=True``
+| ``loss_scale="dynamic"``
+|
+|
+``O3``:  FP16 training
+^^^^^^^^^^^^^^^^^^^^^^
+``O3`` may not achieve the stability of the true mixed precision options ``O1`` and ``O2``.
+However, it can be useful to establish a speed baseline for your model, against which
+the performance of ``O1`` and ``O2`` can be compared.  If your model uses batch normalization,
+to establish "speed of light" you can try ``O3`` with the additional property override
+``keep_batchnorm_fp32=True`` (which enables cudnn batchnorm, as stated earlier).
+| Default properties set by ``O3``:
+| ``cast_model_type=torch.float16``
+| ``patch_torch_functions=False``
+| ``keep_batchnorm_fp32=False``
+| ``master_weights=False``
+| ``loss_scale=1.0``
+|
+|
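The defaults listed for each ``opt_level`` above, together with the override rule, can be summarized as a lookup table plus a merge. This is a sketch for reading the tables, not Amp's code: `OPT_LEVEL_DEFAULTS` and `resolve_properties` are hypothetical names, dtypes are written as strings so the snippet needs no PyTorch install, and the only consistency check included is the ``O1`` with ``master_weights=True`` case described earlier.

```python
# Defaults transcribed from the per-opt_level property lists above.
OPT_LEVEL_DEFAULTS = {
    "O0": {"cast_model_type": "torch.float32", "patch_torch_functions": False,
           "keep_batchnorm_fp32": None, "master_weights": False, "loss_scale": 1.0},
    "O1": {"cast_model_type": None, "patch_torch_functions": True,
           "keep_batchnorm_fp32": None, "master_weights": None, "loss_scale": "dynamic"},
    "O2": {"cast_model_type": "torch.float16", "patch_torch_functions": False,
           "keep_batchnorm_fp32": True, "master_weights": True, "loss_scale": "dynamic"},
    "O3": {"cast_model_type": "torch.float16", "patch_torch_functions": False,
           "keep_batchnorm_fp32": False, "master_weights": False, "loss_scale": 1.0},
}

def resolve_properties(opt_level, **overrides):
    """Start from the opt_level defaults, then apply manual overrides."""
    if opt_level not in OPT_LEVEL_DEFAULTS:
        raise ValueError("unknown opt_level: %r" % (opt_level,))
    if opt_level == "O1" and overrides.get("master_weights"):
        # O1 keeps model weights in FP32, so separate master weights make no sense.
        raise ValueError("master_weights=True does not make sense with O1")
    props = dict(OPT_LEVEL_DEFAULTS[opt_level])
    for name, value in overrides.items():
        if name not in props:
            raise ValueError("unknown property: %r" % (name,))
        props[name] = value
    return props

# The "speed of light" configuration mentioned for O3.
speed_of_light = resolve_properties("O3", keep_batchnorm_fp32=True)
print(speed_of_light)
```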
+Unified API
+-----------
+.. automodule:: apex.amp
+.. currentmodule:: apex.amp
+.. autofunction:: initialize
+.. autofunction:: scale_loss
+.. autofunction:: master_params
+Advanced use cases
+------------------
+The unified Amp API supports gradient accumulation across iterations,
+multiple backward passes per iteration, multiple models/optimizers,
+and custom/user-defined autograd functions.  Gradient clipping and GANs also
+require special treatment, but this treatment does not need to change
+for different ``opt_level``\ s.  Further details can be found here:
+.. toctree::
+   :maxdepth: 1
+   advanced
+Transition guide for old API users
+----------------------------------
+We strongly encourage moving to the new Amp API, because it's more versatile, easier to use, and future-proof.  The original :class:`FP16_Optimizer` and the old "Amp" API are deprecated, and subject to removal at any time.
+For users of the old "Amp" API
+******************************
+In the new API, ``opt_level="O1"`` performs the same patching of the Torch namespace as the old API
+called "Amp."
+However, the new API allows static or dynamic loss scaling, while the old API only allowed dynamic loss scaling.
+In the new API, the old call to ``amp_handle = amp.init()``, and the returned ``amp_handle``, are no
+longer exposed or necessary.  The new ``amp.initialize()`` does the duty of ``amp.init()`` (and more).
+Therefore, any existing calls to ``amp_handle = amp.init()`` should be deleted.
+The functions formerly exposed through ``amp_handle`` are now free
+functions accessible through the ``amp`` module.
+The backward context manager must be changed accordingly::
+    # old API
+    with amp_handle.scale_loss(loss, optimizer) as scaled_loss:
+        scaled_loss.backward()
+    ->
+    # new API
+    with amp.scale_loss(loss, optimizer) as scaled_loss:
+        scaled_loss.backward()
+For now, the deprecated "Amp" API documentation can still be found on the Github README:  https://github.com/NVIDIA/apex/tree/master/apex/amp.  The old API calls that `annotate user functions`_ to run
+with a particular precision are still honored by the new API.
+.. _`annotate user functions`:
+    https://github.com/NVIDIA/apex/tree/master/apex/amp#annotating-user-functions
+For users of the old FP16_Optimizer
+***********************************
+``opt_level="O2"`` is equivalent to :class:`FP16_Optimizer` with ``dynamic_loss_scale=True``.
+Once again, the backward pass must be changed to the unified version::
+    optimizer.backward(loss)
+    ->
+    with amp.scale_loss(loss, optimizer) as scaled_loss:
+        scaled_loss.backward()
+One annoying aspect of FP16_Optimizer was that the user had to manually convert their model to half
+(either by calling ``.half()`` on it, or using a function or module wrapper from
+``apex.fp16_utils``), and also manually call ``.half()`` on input data.  **Neither of these is
+necessary in the new API.  No matter what opt_level
+you choose, you can and should simply build your model and pass input data in the default FP32 format.**
+The new Amp API will perform the right conversions during
+``model, optimizer = amp.initialize(model, optimizer, opt_level=....)`` based on the ``opt_level``
+and any overridden flags.  Floating point input data may be FP32 or FP16, but you may as well just
+let it be FP32, because the ``model`` returned by ``amp.initialize`` will have its ``forward``
+method patched to cast the input data appropriately.
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -11,14 +11,7 @@ Apex (A PyTorch Extension)
 This site contains the API documentation for Apex (https://github.com/nvidia/apex),
 a Pytorch extension with NVIDIA-maintained utilities to streamline mixed precision and distributed training.  Some of the code here will be included in upstream Pytorch eventually. The intention of Apex is to make up-to-date utilities available to users as quickly as possible.
-Installation requires CUDA 9 or later, PyTorch 0.4 or later, and Python 3. Install by running
+Installation instructions can be found here:  https://github.com/NVIDIA/apex#quick-start.
-::
-   git clone https://www.github.com/nvidia/apex
-   cd apex
-   python setup.py install [--cuda_ext] [--cpp_ext]
 .. toctree::
   :maxdepth: 1
@@ -26,12 +19,6 @@ Installation requires CUDA 9 or later, PyTorch 0.4 or later, and Python 3. Insta
   amp
-.. toctree::
-   :maxdepth: 1
-   :caption: FP16/Mixed Precision Utilities
-   fp16_utils
 .. toctree::
   :maxdepth: 1
   :caption: Distributed Training
@@ -50,6 +37,11 @@ Installation requires CUDA 9 or later, PyTorch 0.4 or later, and Python 3. Insta
   layernorm
+..   .. toctree::
+     :maxdepth: 1
+     :caption: Deprecated mixed precision API
+     fp16_util
 ..   reparameterization
 ..   RNN

--- a/examples/FP16_Optimizer_simple/distributed_apex_legacy_launcher/README.md
+++ b/examples/FP16_Optimizer_simple/distributed_apex_legacy_launcher/README.md
-**distributed_data_parallel.py** and **run.sh** show an example using `FP16_Optimizer` with
-`apex.parallel.DistributedDataParallel` in conjuction with the legacy Apex
-launcher script, `apex.parallel.multiproc`.  See 
-[FP16_Optimizer_simple/distributed_apex](https://github.com/NVIDIA/apex/tree/torch_launcher/examples/FP16_Optimizer_simple/distributed_apex) for a more up-to-date example that uses the Pytorch launcher
-script, `torch.distributed.launch`.
-The usage of `FP16_Optimizer` with distributed does not need to change from ordinary 
-single-process usage.  Test via
-```bash
-bash run.sh
-```
--- a/examples/FP16_Optimizer_simple/distributed_apex_legacy_launcher/distributed_data_parallel.py
+++ b/examples/FP16_Optimizer_simple/distributed_apex_legacy_launcher/distributed_data_parallel.py
-import torch
-import argparse
-from apex.parallel import DistributedDataParallel as DDP
-from apex.fp16_utils import FP16_Optimizer
-parser = argparse.ArgumentParser()
-parser.add_argument('--dist-url', default='tcp://224.66.41.62:23456', type=str,
-                    help='url used to set up distributed training')
-parser.add_argument('--world-size', default=2, type=int,
-                    help='Number of distributed processes.')
-parser.add_argument("--rank", type=int,
-                    help='Rank of this process')
-args = parser.parse_args()
-torch.cuda.set_device(args.rank)
-torch.distributed.init_process_group(backend='nccl',
-                                     init_method=args.dist_url,
-                                     world_size=args.world_size,
-                                     rank=args.rank)
-torch.backends.cudnn.benchmark = True
-N, D_in, D_out = 64, 1024, 16
-x = torch.randn(N, D_in, device='cuda', dtype=torch.half)
-y = torch.randn(N, D_out, device='cuda', dtype=torch.half)
-model = torch.nn.Linear(D_in, D_out).cuda().half()
-model = DDP(model)
-optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
-### Construct FP16_Optimizer ###
-optimizer = FP16_Optimizer(optimizer)
-###
-loss_fn = torch.nn.MSELoss()
-for t in range(500):
-    optimizer.zero_grad()
-    y_pred = model(x)
-    loss = loss_fn(y_pred.float(), y.float())
-    ### Change loss.backward() to: ###
-    optimizer.backward(loss)
-    ###
-    optimizer.step()
-print("final loss = ", loss)
--- a/examples/FP16_Optimizer_simple/distributed_apex_legacy_launcher/run.sh
+++ b/examples/FP16_Optimizer_simple/distributed_apex_legacy_launcher/run.sh
-#!/bin/bash
-# By default, apex.parallel.multiproc will attempt to use all available GPUs on the system.  
-# The number of GPUs to use can be limited by setting CUDA_VISIBLE_DEVICES:
-export CUDA_VISIBLE_DEVICES=0,1
-python -m apex.parallel.multiproc distributed_data_parallel.py
--- a/examples/README.md
+++ b/examples/README.md
-## Contents:
+This directory contains examples illustrating Apex mixed precision and distributed tools.
-**distributed**:  Walkthrough of apex distributed data parallel utilities.
+**Note for users of the pre-unification API**:
+`deprecated_api` contains examples illustrating the old (pre-unified) APIs.  These APIs will be removed soon, and users are strongly encouraged to switch.  The separate mixed precision tools called `Amp` and `FP16_Optimizer` in the old API are exposed via different flags/optimization levels in the new API.
-**FP16_Optimizer_simple**:  Simple examples demonstrating various use cases of `FP16_Optimizer` to automatically manage master parameters and static or dynamic loss scaling.
-**imagenet**:  Example based on [https://github.com/pytorch/examples/tree/master/imagenet](https://github.com/pytorch/examples/tree/master/imagenet) showing the use of `FP16_Optimizer`, as well as manual management of master parameters and loss scaling for illustration/comparison.
-**word_language_model**:  Example based on [https://github.com/pytorch/examples/tree/master/word_language_model](https://github.com/pytorch/examples/tree/master/word_language_model) showing the use of `FP16_Optimizer`, as well as manual management of master parameters and loss scaling for illustration/comparison.
-**docker**:  Example of a minimal Dockerfile that installs Apex on top of an existing container.
--- a/examples/dcgan/README.md
+++ b/examples/dcgan/README.md
+Under construction...
--- a/examples/FP16_Optimizer_simple/README.md
+++ b/examples/FP16_Optimizer_simple/README.md
--- a/examples/FP16_Optimizer_simple/closure.py
+++ b/examples/FP16_Optimizer_simple/closure.py
--- a/examples/FP16_Optimizer_simple/distributed_apex/README.md
+++ b/examples/FP16_Optimizer_simple/distributed_apex/README.md
--- a/examples/FP16_Optimizer_simple/distributed_apex/distributed_data_parallel.py
+++ b/examples/FP16_Optimizer_simple/distributed_apex/distributed_data_parallel.py
--- a/examples/FP16_Optimizer_simple/distributed_apex/run.sh
+++ b/examples/FP16_Optimizer_simple/distributed_apex/run.sh
--- a/examples/FP16_Optimizer_simple/distributed_pytorch/README.md
+++ b/examples/FP16_Optimizer_simple/distributed_pytorch/README.md
--- a/examples/FP16_Optimizer_simple/distributed_pytorch/distributed_data_parallel.py
+++ b/examples/FP16_Optimizer_simple/distributed_pytorch/distributed_data_parallel.py
--- a/examples/FP16_Optimizer_simple/distributed_pytorch/run.sh
+++ b/examples/FP16_Optimizer_simple/distributed_pytorch/run.sh
--- a/examples/FP16_Optimizer_simple/minimal.py
+++ b/examples/FP16_Optimizer_simple/minimal.py
--- a/examples/FP16_Optimizer_simple/save_load.py
+++ b/examples/FP16_Optimizer_simple/save_load.py
--- a/examples/deprecated_api/README.md
+++ b/examples/deprecated_api/README.md
+## Contents:
+**distributed**:  Walkthrough of apex distributed data parallel utilities.
+**FP16_Optimizer_simple**:  Simple examples demonstrating various use cases of `FP16_Optimizer` to automatically manage master parameters and static or dynamic loss scaling.
+**imagenet**:  Example based on [https://github.com/pytorch/examples/tree/master/imagenet](https://github.com/pytorch/examples/tree/master/imagenet) showing the use of `FP16_Optimizer`, as well as manual management of master parameters and loss scaling for illustration/comparison.
+**word_language_model**:  Example based on [https://github.com/pytorch/examples/tree/master/word_language_model](https://github.com/pytorch/examples/tree/master/word_language_model) showing the use of `FP16_Optimizer`, as well as manual management of master parameters and loss scaling for illustration/comparison.
+**docker**:  Example of a minimal Dockerfile that installs Apex on top of an existing container.
--- a/examples/distributed/README.md
+++ b/examples/distributed/README.md