
.. role:: hidden
    :class: hidden-section

Advanced Amp Usage
===================================

GANs
----

GANs are an interesting synthesis of several topics below.  A `comprehensive example`_
is under construction.

.. _`comprehensive example`:
    https://github.com/NVIDIA/apex/tree/master/examples/dcgan

Gradient clipping
-----------------
Amp calls the params owned directly by the optimizer's ``param_groups`` the "master params."

These master params may be fully or partially distinct from ``model.parameters()``.
For example, with `opt_level="O2"`_, ``amp.initialize`` casts most model params to FP16,
creates an FP32 master param outside the model for each newly-FP16 model param,
and updates the optimizer's ``param_groups`` to point to these FP32 params.

The master params owned by the optimizer's ``param_groups`` may also fully coincide with the
model params, which is typically true for ``opt_level``\s ``O0``, ``O1``, and ``O3``.

In all cases, correct practice is to clip the gradients of the params that are guaranteed to be
owned **by the optimizer's** ``param_groups``, instead of those retrieved via ``model.parameters()``.

Also, if Amp uses loss scaling, gradients must be clipped after they have been unscaled
(which occurs during exit from the ``amp.scale_loss`` context manager).
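
A small numeric sketch of why the order matters. The loss scale ``S``, the gradient values,
and the ``clip_by_norm`` helper are illustrative assumptions, not Amp internals; the helper
only mimics norm clipping on a plain Python list::

    import math

    def clip_by_norm(grads, max_norm):
        """Rescale grads so their L2 norm is at most max_norm (hypothetical helper)."""
        norm = math.sqrt(sum(g * g for g in grads))
        if norm <= max_norm:
            return list(grads)
        return [g * max_norm / norm for g in grads]

    S = 128.0                 # assumed loss scale
    true_grads = [3.0, 4.0]   # unscaled gradient, L2 norm 5.0
    max_norm = 1.0

    # What backward() on the scaled loss would leave behind.
    scaled_grads = [g * S for g in true_grads]

    # Correct order: unscale first, then clip.
    unscaled = [g / S for g in scaled_grads]
    clipped_after = clip_by_norm(unscaled, max_norm)

    # Wrong order: clip the still-scaled gradient, then unscale.
    clipped_before = [g / S for g in clip_by_norm(scaled_grads, max_norm)]

    norm_after = math.sqrt(sum(g * g for g in clipped_after))
    norm_before = math.sqrt(sum(g * g for g in clipped_before))

In this sketch, clipping before unscaling shrinks the effective threshold from ``max_norm``
to ``max_norm / S``, so the update ends up far smaller than intended.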

The following pattern should be correct for any ``opt_level``::

    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
        # Gradients are unscaled during context manager exit.
    # Now it's safe to clip.  Replace
    # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    # with
    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm)
    # or
    torch.nn.utils.clip_grad_value_(amp.master_params(optimizer), clip_value)

Note the use of the utility function ``amp.master_params(optimizer)``,
which returns a generator that iterates over the
params in the optimizer's ``param_groups``.
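
Conceptually, such a helper can be sketched as below. This is an illustration, not the Apex
source: ``master_params_sketch`` and the stand-in optimizer are hypothetical names, and the only
assumption is that the optimizer exposes ``param_groups`` as a list of dicts with a ``"params"``
entry, as PyTorch optimizers do::

    from types import SimpleNamespace

    def master_params_sketch(optimizer):
        """Yield every param held in the optimizer's param_groups, in order."""
        for group in optimizer.param_groups:
            for param in group["params"]:
                yield param

    # Stand-in for an optimizer with two param groups (strings stand in for tensors).
    fake_optimizer = SimpleNamespace(
        param_groups=[
            {"params": ["w1", "b1"], "lr": 0.1},
            {"params": ["w2"], "lr": 0.01},
        ]
    )

    params = list(master_params_sketch(fake_optimizer))

A generator can be consumed only once, so a helper of this shape has to be called again for
each clipping call rather than stored and reused.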

Also note that ``clip_grad_norm_(amp.master_params(optimizer), max_norm)`` is invoked
*instead of*, not *in addition to*, ``clip_grad_norm_(model.parameters(), max_norm)``.

.. _`opt_level="O2"`:
    https://nvidia.github.io/apex/amp.html#o2-fast-mixed-precision

Custom/user-defined autograd functions
--------------------------------------

The old Amp API for `registering user functions`_ is still considered correct.  Functions must
be registered before calling ``amp.initialize``.

.. _`registering user functions`:
    https://github.com/NVIDIA/apex/tree/master/apex/amp#annotating-user-functions

Forcing particular layers/functions to a desired type
-----------------------------------------------------

A generalizable interface for this, one that won't require user-side code to diverge
across different ``opt_level``\ s, is still under development.

Multiple models/optimizers
--------------------------

``amp.initialize``'s optimizer argument may be a single optimizer or a list of optimizers,
as long as the output you accept has the same type.
Similarly, the ``model`` argument may be a single model or a list of models, as long as the accepted
output matches.  The following calls are all legal::

    model, optim = amp.initialize(model, optim,...)
    model, [optim1, optim2] = amp.initialize(model, [optim1, optim2],...)
    [model1, model2], optim = amp.initialize([model1, model2], optim,...)
    [model1, model2], [optim1, optim2] = amp.initialize([model1, model2], [optim1, optim2],...)

Whenever you invoke a backward pass, the optimizer you should pass to ``amp.scale_loss`` is whatever
optimizer owns the parameters for which this particular backward pass is creating gradients.
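
One way to make "the optimizer that owns the parameters" concrete is to look a parameter up in
each optimizer's ``param_groups``. The helper ``owning_optimizer`` below is hypothetical (it is not
part of Amp), and the stand-in objects replace real parameters and optimizers so the sketch runs
without PyTorch::

    from types import SimpleNamespace

    def owning_optimizer(param, optimizers):
        """Return the first optimizer whose param_groups contain param (by identity)."""
        for opt in optimizers:
            for group in opt.param_groups:
                if any(p is param for p in group["params"]):
                    return opt
        raise ValueError("param is not owned by any of the given optimizers")

    # Stand-ins: plain objects play the role of parameter tensors.
    gen_w, disc_w = object(), object()
    optim_g = SimpleNamespace(name="optim_g", param_groups=[{"params": [gen_w]}])
    optim_d = SimpleNamespace(name="optim_d", param_groups=[{"params": [disc_w]}])

    # A loss whose backward pass creates gradients for disc_w would be scaled with optim_d.
    chosen = owning_optimizer(disc_w, [optim_g, optim_d])

With ``O2`` the optimizer holds FP32 master copies rather than the model's own params, so an
identity lookup like this is only meaningful before ``amp.initialize`` is called.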

Multiple backward passes per iteration
--------------------------------------

If you want to accumulate gradients from multiple losses for the params owned by a given optimizer,
you must invoke ``with amp.scale_loss(..., delay_unscale=True)`` for all backward passes except
the last::

    # delay_unscale=True for the first two losses
    with amp.scale_loss(loss1, optimizer, delay_unscale=True) as scaled_loss:
        scaled_loss.backward()
    with amp.scale_loss(loss2, optimizer, delay_unscale=True) as scaled_loss:
        scaled_loss.backward()
    # Don't delay_unscale for the final loss
    with amp.scale_loss(loss3, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
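
The arithmetic behind this can be sketched with plain numbers. The model below is a
simplification and an assumption about the mechanism, not Amp's implementation: each backward
pass adds ``S * g`` to an accumulated gradient, and a non-delayed exit divides the whole
accumulated value by ``S`` in place::

    S = 1024.0                   # assumed loss scale
    grads = [0.5, 0.25, 0.125]   # per-loss gradients for one param (made-up values)

    def accumulate(grads, delay_flags, scale):
        """Simulate scaled backward passes; unscale in place when not delayed."""
        acc = 0.0
        for g, delay in zip(grads, delay_flags):
            acc += scale * g     # backward of the scaled loss
            if not delay:
                acc /= scale     # unscale on context manager exit
        return acc

    # Delay for all passes except the last: the sum is unscaled exactly once.
    correct = accumulate(grads, [True, True, False], S)

    # Unscaling after every pass divides earlier contributions repeatedly.
    wrong = accumulate(grads, [False, False, False], S)

Under these assumptions ``correct`` equals the plain sum of the three gradients, while ``wrong``
is dominated by the last gradient alone because the earlier ones have been divided by ``S``
more than once.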


Gradient accumulation across iterations
---------------------------------------

Pass ``delay_unscale=True`` to ``amp.scale_loss`` until you're ready to ``step()``::

    if iter % iters_to_accumulate == 0:
        # Every iters_to_accumulate iterations, unscale and step
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    else:
        # Otherwise, just accumulate gradients, don't unscale or step.
        with amp.scale_loss(loss, optimizer, delay_unscale=True) as scaled_loss:
            scaled_loss.backward()
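
To see which iterations actually step under this condition, the branch can be mirrored in a
few lines of plain Python. The ``action`` helper and the value ``4`` are illustrative
assumptions, not part of Amp::

    iters_to_accumulate = 4   # assumed accumulation window

    def action(iteration, every):
        """Mirror the branch above: step when iteration is a multiple of every."""
        return "step" if iteration % every == 0 else "accumulate"

    schedule = [action(i, iters_to_accumulate) for i in range(9)]
    step_iters = [i for i, a in enumerate(schedule) if a == "step"]

With this condition the first step happens at iteration 0, after a single backward pass.
Testing ``(iteration + 1) % iters_to_accumulate == 0`` instead would make every step,
including the first, follow a full window of accumulated passes.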