        # optional arg to control dynamic loss scaling behavior
        # dynamic_loss_args={'scale_window' : 500})
        # Usually, dynamic_loss_args is not necessary.
    Args:
        init_optimizer (torch.optim.optimizer): Existing optimizer created with the parameters to optimize. Internally, :class:`FP16_Optimizer` replaces the passed optimizer's fp16 parameters, if any, with fp32 master parameters copied from the original ones. :class:`FP16_Optimizer` also stores references to the original fp16 parameters, and updates these fp16 parameters from the master fp32 copy at the end of each :attr:`step`.
        static_loss_scale (float, optional, default=1.0): Loss scale used internally to scale gradients computed by the model. Any fp16 gradients will be copied to fp32, then downscaled before being applied to the fp32 master params, so ``static_loss_scale`` should not affect learning rate.
        dynamic_loss_scale (bool, optional, default=False): Use dynamic loss scaling. If True, this will override any ``static_loss_scale`` option.
        dynamic_loss_args (dict, optional, default=None): Dict of kwargs that will be forwarded to the internal :class:`DynamicLossScaler` instance's constructor. Keys of this dict must match kwargs accepted by :class:`DynamicLossScaler`'s constructor. If ``dynamic_loss_args`` is unspecified, :class:`DynamicLossScaler`'s defaults will be used.
        verbose (bool, optional, default=True): By default, FP16_Optimizer's constructor prints out the parameters and parameter groups it is ingesting, as a sanity check. If this becomes annoying (e.g. for large models), it can be disabled by passing ``verbose=False``. ``verbose=False`` will not disable printing when the loss scale is readjusted during dynamic loss scaling.
    ``init_optimizer`` is expected to have been constructed in the ordinary way.
    It is recommended (although not required) that the newly constructed :class:`FP16_Optimizer` instance be
    named to replace ``init_optimizer``, for two reasons:
    First, it means that references to the same name
    later in the file will not have to change.
    Second, :class:`FP16_Optimizer` reserves the right (as an implementation detail) to
    modify ``init_optimizer``. If you do choose a unique name for the new
    :class:`FP16_Optimizer` instance, you should only work with this new instance,
    because the preexisting optimizer might no longer behave as expected.

    ``init_optimizer`` may be any Pytorch optimizer.
    It may contain a mixture of fp16 and fp32 parameters organized into any number of
    ``param_groups`` with different hyperparameters. The :class:`FP16_Optimizer` constructor will
    ingest these ``param_groups`` and remember them.
    Calls to ::

        loss.backward()

    must be replaced with ::

        optimizer.backward(loss)

    because :class:`FP16_Optimizer` requires ownership of the backward pass to implement
    loss scaling and copies to master gradients.
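    For example, a single training iteration might look like the following (a minimal sketch;
    ``model``, ``loss_fn``, ``inputs``, and ``targets`` are placeholders, and ``zero_grad``/``step``
    are assumed to mirror the wrapped optimizer's interface)::

        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
        optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)

        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        optimizer.backward(loss)   # replaces loss.backward()
        optimizer.step()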
    .. note::
        Loss scaling, either static or dynamic, is orthogonal to learning rate, because gradients
        are downscaled before being applied. This means that adjusting the loss scale, or using
        dynamic loss scaling, should not require retuning the learning rate or any other
        hyperparameters.
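    For instance, with ``static_loss_scale=128.0``, the backward pass produces fp16 gradients that
    are 128x too large, and the fp32 master gradients are divided by 128 before the wrapped
    optimizer applies them, so the effective update for a given learning rate is unchanged.
    Schematically (an illustrative sketch only; ``fp16_grad`` is a placeholder name)::

        scaled_loss = loss * 128.0                # backward() sees a loss scaled by 128
        master_grad = fp16_grad.float() / 128.0   # downscaled again before the optimizer step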
...
@@ -152,7 +160,7 @@ class FP16_Optimizer(object):
    See docstring for :attr:`step`.

    **Gradient clipping**: Use :attr:`clip_master_grads`.

    **Multiple losses**: If your model accumulates gradients from multiple losses,
    this can be made more efficient by supplying ``update_master_grads=False``
    to :attr:`backward`. See docstring for :attr:`backward`.
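    A sketch of that pattern (assuming an ``update_master_grads()`` method that performs the
    deferred fp16-to-fp32 gradient copy; ``loss1`` and ``loss2`` are placeholders)::

        optimizer.zero_grad()
        optimizer.backward(loss1, update_master_grads=False)
        optimizer.backward(loss2, update_master_grads=False)
        optimizer.update_master_grads()  # one fp16-to-fp32 copy for both losses
        optimizer.step()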
...
@@ -163,19 +171,19 @@ class FP16_Optimizer(object):
        optimizer.loss_scale = new_loss_scale

    For static loss scaling, manually adjusting the loss scale over time is a reasonable
    thing to do. During later epochs, gradients may become smaller, and a
    higher loss scale may be required, analogous to scheduling the learning rate. Dynamic loss
    scaling is more subtle (see :class:`DynamicLossScaler`) and in this case, manually adjusting
    the loss scale is not recommended.
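    For example, a crude manual schedule might double the static scale once the gradients have
    shrunk (an illustrative sketch only; the epoch threshold is arbitrary, and ``loss_scale`` is
    assumed to be readable as well as assignable)::

        if epoch == 10:
            optimizer.loss_scale = optimizer.loss_scale * 2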
    **Multi-GPU training**: If the wrapped ``init_optimizer`` was created from a model wrapped in
    Pytorch DistributedDataParallel or Apex DistributedDataParallel, :class:`FP16_Optimizer`
    should still work as intended.
    """

    def __init__(self,
                 init_optimizer,
                 static_loss_scale=1.0,
                 dynamic_loss_scale=False,
                 dynamic_loss_args=None,
                 verbose=False):
...
@@ -212,7 +220,7 @@ class FP16_Optimizer(object):
        # Reset existing state dict key to the new master param.
        # We still need to recast per-param state tensors, if any, to FP32.
        master_params: List of FP32 master parameters created by :func:`prep_param_lists`. If ``master_params`` was created with ``flat_master=True``, ``flat_master=True`` should also be supplied to :func:`master_params_to_model_params`.
# item() is a recent addition, so this helps with backward compatibility.
def to_python_float(t):
    if hasattr(t, 'item'):
        return t.item()
    else:
        return t[0]

class LossScaler:
    """
    Class that manages a static loss scale. This class is intended to interact with
    :class:`FP16_Optimizer`, and should not be directly manipulated by the user.

    Use of :class:`LossScaler` is enabled via the ``static_loss_scale`` argument to
    :class:`FP16_Optimizer`'s constructor.
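    For example (an illustrative sketch; the value 128.0 is arbitrary)::

        optimizer = FP16_Optimizer(optimizer, static_loss_scale=128.0)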
    Args:
...
@@ -54,16 +61,22 @@ class LossScaler:
        return self.cur_scale

    def scale_gradient(self, module, grad_in, grad_out):
-        return tuple(self.loss_scale * g for g in grad_in)
+        _overflow_buf = torch.cuda.IntTensor([0])
+        multi_tensor_applier(amp_C.multi_tensor_scale,
+                             _overflow_buf,
+                             [grad_in, grad_in],
+                             self.loss_scale)
+        return grad_in

    def backward(self, loss, retain_graph=False):
        scaled_loss = loss * self.loss_scale
        scaled_loss.backward(retain_graph=retain_graph)

class DynamicLossScaler:
    """
    Class that manages dynamic loss scaling. It is recommended to use :class:`DynamicLossScaler`
    indirectly, by supplying ``dynamic_loss_scale=True`` to the constructor of
    :class:`FP16_Optimizer`. However, it's important to understand how :class:`DynamicLossScaler`
    operates, because the default options can be changed using the
    ``dynamic_loss_args`` argument to :class:`FP16_Optimizer`'s constructor.
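    For example, the ``scale_window`` default can be overridden at construction time
    (an illustrative sketch)::

        optimizer = FP16_Optimizer(optimizer,
                                   dynamic_loss_scale=True,
                                   dynamic_loss_args={'scale_window': 500})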
...
@@ -71,18 +84,18 @@ class DynamicLossScaler:
    Loss scaling is designed to combat the problem of underflowing gradients encountered at long
    times when training fp16 networks. Dynamic loss scaling begins by attempting a very high loss
    scale. Ironically, this may result in OVERflowing gradients. If overflowing gradients are
    encountered, :class:`DynamicLossScaler` informs :class:`FP16_Optimizer` that an overflow has
    occurred.

    :class:`FP16_Optimizer` then skips the update step for this particular iteration/minibatch,
    and :class:`DynamicLossScaler` adjusts the loss scale to a lower value.
    If a certain number of iterations occur without overflowing gradients detected,
    :class:`DynamicLossScaler` increases the loss scale once more.
    In this way :class:`DynamicLossScaler` attempts to "ride the edge" of
    always using the highest loss scale possible without incurring overflow.
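    Schematically, the adjustment rule resembles the following (a simplified sketch of the behavior
    described above, not the exact implementation; ``cur_scale``, ``cur_iter``, and
    ``last_overflow_iter`` are illustrative names)::

        if overflow:
            cur_scale = cur_scale / scale_factor
            last_overflow_iter = cur_iter
        elif (cur_iter - last_overflow_iter) % scale_window == 0:
            cur_scale = cur_scale * scale_factor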
    Args:
        init_scale (float, optional, default=2**32): Initial loss scale attempted by :class:`DynamicLossScaler`.
        scale_factor (float, optional, default=2.0): Factor used when adjusting the loss scale. If an overflow is encountered, the loss scale is readjusted to loss_scale/``scale_factor``. If ``scale_window`` consecutive iterations take place without an overflow, the loss scale is readjusted to loss_scale*``scale_factor``.
        scale_window (int, optional, default=1000): Number of consecutive iterations without an overflow to wait before increasing the loss scale.
    """
...
@@ -122,12 +135,12 @@ class DynamicLossScaler:
        overflow = overflow_gpu[0].item()
        return bool(overflow)

    # `x` is a torch.Tensor
    def _has_inf_or_nan(x):
        try:
            # if x is half, the .float() incurs an additional deep copy, but it's necessary if
            # Pytorch's .sum() creates a one-element tensor of the same type as x
            # (which is true for some recent version of pytorch).
            cpu_sum = float(x.float().sum())
            # More efficient version that can be used if .sum() returns a Python scalar