.. role:: hidden
    :class: hidden-section

Advanced Amp Usage
===================================

GANs
----

GANs are an interesting synthesis of several topics below.  A `comprehensive example`_
is under construction.

.. _`comprehensive example`:
    https://github.com/NVIDIA/apex/tree/master/examples/dcgan

Gradient clipping
-----------------

Amp calls the params owned directly by the optimizer's ``param_groups`` the "master params."

These master params may be fully or partially distinct from ``model.parameters()``.
For example, with `opt_level="O2"`_, ``amp.initialize`` casts most model params to FP16,
creates an FP32 master param outside the model for each newly-FP16 model param,
and updates the optimizer's ``param_groups`` to point to these FP32 params.
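
For illustration, a quick way to see this distinction is to inspect dtypes after initialization
(a sketch only; ``model`` and ``optimizer`` are placeholders, and params Amp keeps in FP32,
such as batchnorm params, are an exception)::

    model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

    # Most model params are now FP16...
    print(next(model.parameters()).dtype)                # typically torch.float16
    # ...while the optimizer's param_groups hold the separate FP32 master params.
    print(optimizer.param_groups[0]["params"][0].dtype)  # torch.float32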

The master params owned by the optimizer's ``param_groups`` may also fully coincide with the
model params, which is typically true for ``opt_level``\s ``O0``, ``O1``, and ``O3``.

In all cases, correct practice is to clip the gradients of the params that are guaranteed to be
owned **by the optimizer's** ``param_groups``, instead of those retrieved via ``model.parameters()``.

Also, if Amp uses loss scaling, gradients must be clipped after they have been unscaled
(which occurs during exit from the ``amp.scale_loss`` context manager).

The following pattern should be correct for any ``opt_level``::

    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
        # Gradients are unscaled during context manager exit.

    # Now it's safe to clip.  Replace
    # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    # with
    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm)
    # or
    torch.nn.utils.clip_grad_value_(amp.master_params(optimizer), max_)

Note the use of the utility function ``amp.master_params(optimizer)``,
which returns a generator expression that iterates over the
params in the optimizer's ``param_groups``.

Also note that ``clip_grad_norm_(amp.master_params(optimizer), max_norm)`` is invoked
*instead of*, not *in addition to*, ``clip_grad_norm_(model.parameters(), max_norm)``.
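
Putting it together, a complete iteration with clipping might look like the following
(a minimal sketch; ``model``, ``optimizer``, ``loss_fn``, ``data``, ``target``, and ``max_norm``
are placeholders assumed to be defined elsewhere)::

    optimizer.zero_grad()
    output = model(data)
    loss = loss_fn(output, target)

    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

    # Gradients are unscaled at this point, so clipping the master params is safe.
    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm)
    optimizer.step()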

.. _`opt_level="O2"`:
    https://nvidia.github.io/apex/amp.html#o2-fast-mixed-precision

Custom/user-defined autograd functions
--------------------------------------

The old Amp API for `registering user functions`_ is still considered correct.  Functions must
be registered before calling ``amp.initialize``.
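
For example (a sketch only; ``my_extension`` and ``fused_op`` are hypothetical names standing in
for your own code), registration simply needs to precede initialization::

    from apex import amp
    import my_extension  # hypothetical module exposing a custom op

    # Ask Amp to always run the hypothetical op in FP32.  register_half_function
    # and register_promote_function work the same way.  This must happen before
    # amp.initialize.
    amp.register_float_function(my_extension, 'fused_op')

    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")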

.. _`registering user functions`:
    https://github.com/NVIDIA/apex/tree/master/apex/amp#annotating-user-functions

Forcing particular layers/functions to a desired type
-----------------------------------------------------

I'm still working on a generalizable exposure for this that won't require user-side code divergence
across different ``opt_level``\ s.

Multiple models/optimizers/losses
---------------------------------

Initialization with multiple models/optimizers
**********************************************

``amp.initialize``'s optimizer argument may be a single optimizer or a list of optimizers,
as long as the output you accept has the same type.
Similarly, the ``model`` argument may be a single model or a list of models, as long as the accepted
output matches.  The following calls are all legal::

    model, optim = amp.initialize(model, optim,...)
    model, [optim0, optim1] = amp.initialize(model, [optim0, optim1],...)
    [model0, model1], optim = amp.initialize([model0, model1], optim,...)
    [model0, model1], [optim0, optim1] = amp.initialize([model0, model1], [optim0, optim1],...)

Backward passes with multiple optimizers
****************************************

Whenever you invoke a backward pass, the ``amp.scale_loss`` context manager must receive
**all the optimizers that own any params for which the current backward pass is creating gradients.**
This is true even if each optimizer owns only some, but not all, of the params that are about to
receive gradients.

If, for a given backward pass, there's only one optimizer whose params are about to receive gradients,
you may pass that optimizer directly to ``amp.scale_loss``.  Otherwise, you must pass the
list of optimizers whose params are about to receive gradients::

    # loss0 accumulates gradients only into params owned by optim0:
    with amp.scale_loss(loss0, optim0) as scaled_loss:
        scaled_loss.backward()

    # loss1 accumulates gradients only into params owned by optim1:
    with amp.scale_loss(loss1, optim1) as scaled_loss:
        scaled_loss.backward()

    # loss2 accumulates gradients into some params owned by optim0
    # and some params owned by optim1
    with amp.scale_loss(loss2, [optim0, optim1]) as scaled_loss:
        scaled_loss.backward()

Optionally have Amp use a different loss scaler per-loss
********************************************************

By default, Amp maintains a single global loss scaler that will be used for all backward passes
(all invocations of ``with amp.scale_loss(...)``).  No additional arguments to ``amp.initialize``
or ``amp.scale_loss`` are required to use the global loss scaler.  The code snippets above with
multiple optimizers/backward passes use the single global loss scaler under the hood,
and they should "just work."

However, you can optionally tell Amp to maintain a loss scaler per-loss, which gives Amp increased
numerical flexibility.  This is accomplished by supplying the ``num_losses`` argument to
``amp.initialize`` (which tells Amp how many backward passes you plan to invoke, and therefore
how many loss scalers Amp should create), then supplying the ``loss_id`` argument to each of your
backward passes (which tells Amp the loss scaler to use for this particular backward pass)::

    model, [optim0, optim1] = amp.initialize(model, [optim0, optim1], ..., num_losses=3)

    with amp.scale_loss(loss0, optim0, loss_id=0) as scaled_loss:
        scaled_loss.backward()

    with amp.scale_loss(loss1, optim1, loss_id=1) as scaled_loss:
        scaled_loss.backward()

    with amp.scale_loss(loss2, [optim0, optim1], loss_id=2) as scaled_loss:
        scaled_loss.backward()

``num_losses`` and ``loss_id``\ s should be specified purely based on the set of
losses/backward passes.  The use of multiple optimizers, or association of single or
multiple optimizers with each backward pass, is unrelated.
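
For example, a single optimizer paired with two losses still uses ``num_losses=2`` and two
distinct ``loss_id``\ s (a sketch with placeholder names)::

    model, optimizer = amp.initialize(model, optimizer, opt_level="O1", num_losses=2)

    with amp.scale_loss(loss0, optimizer, loss_id=0) as scaled_loss:
        scaled_loss.backward()

    with amp.scale_loss(loss1, optimizer, loss_id=1) as scaled_loss:
        scaled_loss.backward()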

Gradient accumulation across iterations
---------------------------------------

The following should "just work," and properly accommodate multiple models/optimizers/losses, as well as
gradient clipping via the `instructions above`_::

    # If your intent is to simulate a larger batch size using gradient accumulation,
    # you can divide the loss by the number of accumulation iterations (so that gradients
    # will be averaged over that many iterations):
    loss = loss/iters_to_accumulate

    if iter%iters_to_accumulate == 0:
        # Every iters_to_accumulate iterations, unscale and step
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        # Gradient clipping if desired:
        # torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm)
        optimizer.step()
        optimizer.zero_grad()
    else:
        # Otherwise, accumulate gradients, don't unscale or step.
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()

As a minor performance optimization, you can pass ``delay_unscale=True``
to ``amp.scale_loss`` until you're ready to ``step()``.  You should only attempt ``delay_unscale=True``
if you're sure you know what you're doing, because the interaction with gradient clipping and
multiple models/optimizers/losses can become tricky::

    if iter%iters_to_accumulate == 0:
        # Every iters_to_accumulate iterations, unscale and step
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    else:
        # Otherwise, accumulate gradients, don't unscale or step.
        with amp.scale_loss(loss, optimizer, delay_unscale=True) as scaled_loss:
            scaled_loss.backward()

.. _`instructions above`:
    https://nvidia.github.io/apex/advanced.html#gradient-clipping

Custom data batch types
-----------------------

The intention of Amp is that you never need to cast your input data manually, regardless of
``opt_level``.  Amp accomplishes this by patching any models' ``forward`` methods to cast
incoming data appropriately for the ``opt_level``.  But to cast incoming data,
Amp needs to know how.  The patched ``forward`` will recognize and cast floating-point Tensors
(non-floating-point Tensors like IntTensors are not touched) and
Python containers of floating-point Tensors.  However, if you wrap your Tensors in a custom class,
the casting logic doesn't know how to drill
through the tough custom shell to access and cast the juicy Tensor meat within.  You need to tell
Amp how to cast your custom batch class, by assigning it a ``to`` method that accepts a ``torch.dtype``
(e.g., ``torch.float16`` or ``torch.float32``) and returns an instance of the custom batch cast to
``dtype``.  The patched ``forward`` checks for the presence of your ``to`` method, and will
invoke it with the correct type for the ``opt_level``.

Example::

    class CustomData(object):
        def __init__(self):
            self.tensor = torch.cuda.FloatTensor([1,2,3])

        def to(self, dtype):
            self.tensor = self.tensor.to(dtype)
            return self
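
A hypothetical usage sketch (``MyModel`` and the surrounding setup are placeholders, not part of
the Amp API); after ``amp.initialize``, the patched ``forward`` calls ``CustomData.to`` with the
appropriate dtype before your ``forward`` runs::

    class MyModel(torch.nn.Module):
        def __init__(self):
            super(MyModel, self).__init__()
            self.linear = torch.nn.Linear(3, 1)

        def forward(self, batch):
            # batch.tensor has already been cast to match the opt_level.
            return self.linear(batch.tensor)

    model = MyModel().cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

    output = model(CustomData())  # CustomData.to(torch.float16) is invoked for you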

.. warning::

    Amp also forwards numpy ndarrays without casting them.  If you send input data as a raw, unwrapped
    ndarray, then later use it to create a Tensor within your ``model.forward``, this Tensor's type will
    not depend on the ``opt_level``, and may or may not be correct.  Users are encouraged to pass
    castable data inputs (Tensors, collections of Tensors, or custom classes with a ``to`` method)
    wherever possible.

.. note::

    Amp does not call ``.cuda()`` on any Tensors for you.  Amp assumes that your original script
    is already set up to move Tensors from the host to the device as needed.