master_params: List of FP32 master parameters created by :func:`prep_param_lists`. If ``master_params`` was created with ``flat_master=True``, ``flat_master=True`` should also be supplied to :func:`master_params_to_model_params`.
# item() is a recent addition, so this helps with backward compatibility.
def to_python_float(t):
    if hasattr(t, 'item'):
        return t.item()
    else:
        return t[0]
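# For example (a sketch; ``loss`` here stands for any one-element tensor, such as a reduced loss):
#
#     loss_value = to_python_float(loss)
#
# On recent PyTorch this calls ``loss.item()``; on older versions it falls back to ``loss[0]``.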
class LossScaler:
    """
    Class that manages a static loss scale.  This class is intended to interact with
    :class:`FP16_Optimizer`, and should not be directly manipulated by the user.

    Use of :class:`LossScaler` is enabled via the ``static_loss_scale`` argument to
    :class:`FP16_Optimizer`'s constructor.

    Args:
...
@@ -57,13 +60,14 @@ class LossScaler:
        return tuple(self.loss_scale * g for g in grad_in)

    def backward(self, loss, retain_graph=False):
        scaled_loss = loss * self.loss_scale
        scaled_loss.backward(retain_graph=retain_graph)
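
# A minimal sketch of driving the static scaler by hand (normally :class:`FP16_Optimizer` does
# this for you).  The ``model_params``/``master_params`` arguments are illustrative names for the
# fp16 model parameters and their fp32 master copies; they are not part of this module.
def _static_loss_scale_sketch(loss, scaler, model_params, master_params):
    scaler.backward(loss)  # runs backward() on loss * scaler.loss_scale
    for model_param, master_param in zip(model_params, master_params):
        if model_param.grad is not None:
            # Unscale into the fp32 master gradients before the optimizer step.
            master_param.grad = model_param.grad.float() / scaler.loss_scale
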
class DynamicLossScaler:
    """
    Class that manages dynamic loss scaling.  It is recommended to use :class:`DynamicLossScaler`
    indirectly, by supplying ``dynamic_loss_scale=True`` to the constructor of
    :class:`FP16_Optimizer`.  However, it's important to understand how :class:`DynamicLossScaler`
    operates, because the default options can be changed using the
    ``dynamic_loss_args`` argument to :class:`FP16_Optimizer`'s constructor.
...
@@ -71,18 +75,18 @@ class DynamicLossScaler:
    Loss scaling is designed to combat the problem of underflowing gradients encountered after
    long stretches of fp16 training.  Dynamic loss scaling begins by attempting a very high loss
    scale.  Ironically, this may result in OVERflowing gradients.  If overflowing gradients are
    encountered, :class:`DynamicLossScaler` informs :class:`FP16_Optimizer` that an overflow has
    occurred.
    :class:`FP16_Optimizer` then skips the update step for this particular iteration/minibatch,
    and :class:`DynamicLossScaler` adjusts the loss scale to a lower value.
    If a certain number of iterations pass without overflowing gradients being detected,
    :class:`DynamicLossScaler` increases the loss scale once more.
    In this way :class:`DynamicLossScaler` attempts to "ride the edge" of
    always using the highest loss scale possible without incurring overflow.
    A usage sketch follows the argument list below.
    Args:
        init_scale (float, optional, default=2**32):  Initial loss scale attempted by :class:`DynamicLossScaler`.
        scale_factor (float, optional, default=2.0):  Factor used when adjusting the loss scale. If an overflow is encountered, the loss scale is readjusted to loss_scale/``scale_factor``. If ``scale_window`` consecutive iterations take place without an overflow, the loss scale is readjusted to loss_scale*``scale_factor``.
        scale_window (int, optional, default=1000):  Number of consecutive iterations without an overflow to wait before increasing the loss scale.
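
    A sketch of the intended indirect use, assuming :class:`FP16_Optimizer` accepts the
    ``dynamic_loss_scale``/``dynamic_loss_args`` constructor arguments described above and exposes
    ``backward``/``step`` methods (the particular values shown are illustrative, not
    recommendations)::

        optimizer = FP16_Optimizer(optimizer,
                                   dynamic_loss_scale=True,
                                   dynamic_loss_args={'init_scale': 2**16,
                                                      'scale_window': 500})
        optimizer.backward(loss)  # the loss is multiplied by the current loss scale internally
        optimizer.step()          # the parameter update is skipped if an overflow was detected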
"""
"""
...
@@ -122,12 +126,12 @@ class DynamicLossScaler:
        overflow = overflow_gpu[0].item()
        return bool(overflow)
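
    # A rough sketch of how the caller (normally :class:`FP16_Optimizer`) is expected to combine
    # the overflow check with the scale update.  ``scaler``, ``params``, and ``optimizer`` are
    # illustrative names, and ``backward``/``update_scale`` are assumed to be the scaler methods
    # not shown in this hunk:
    #
    #     scaler.backward(loss)              # backward pass on loss * scaler.loss_scale
    #     if scaler.has_overflow(params):
    #         scaler.update_scale(True)      # overflow: lower the loss scale and skip the step
    #     else:
    #         scaler.update_scale(False)     # after scale_window clean iterations, raise the scale
    #         optimizer.step()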
    # `x` is a torch.Tensor
    def _has_inf_or_nan(x):
        try:
            # if x is half, the .float() incurs an additional deep copy, but it's necessary if
            # Pytorch's .sum() creates a one-element tensor of the same type as x
            # (which is true for some recent version of pytorch).
            cpu_sum = float(x.float().sum())
            # More efficient version that can be used if .sum() returns a Python scalar