static_loss_scale (float, optional, default=1.0): Loss scale used internally to scale gradients computed by the model. Any fp16 gradients will be copied to fp32, then downscaled before being applied to the fp32 master params, so ``static_loss_scale`` should not affect learning rate.
dynamic_loss_scale (bool, optional, default=False): Use dynamic loss scaling. If True, this will override any ``static_loss_scale`` option.
dynamic_loss_args (dict, optional, default=None): Dict of kwargs that will be forwarded to the internal :class:`DynamicLossScaler` instance's constructor. Keys of this dict must match kwargs accepted by :class:`DynamicLossScaler`'s constructor. If ``dynamic_loss_args`` is unspecified, :class:`DynamicLossScaler`'s defaults will be used.
verbose (bool, optional, default=True): By default, FP16_Optimizer's constructor prints out the parameters and parameter groups it is ingesting, as a sanity check. If this becomes annoying (e.g. for large models), it can be disabled by passing ``verbose=False``. ``verbose=False`` will not disable printing when the loss scale is readjusted during dynamic loss scaling.
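For illustration, a minimal sketch of how these constructor options might be combined. The SGD settings, loss-scale values, ``dynamic_loss_args`` keys, and the assumption that ``FP16_Optimizer`` and ``model`` are already in scope are placeholders, not prescribed choices::

    import torch
    # The import path for FP16_Optimizer depends on where this class lives in your tree.

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Fixed loss scale: fp16 grads are copied to fp32 and unscaled before the
    # master-param update, so the value should not change the effective learning rate.
    optimizer = FP16_Optimizer(optimizer, static_loss_scale=128.0)

    # ...or, wrapping a fresh inner optimizer, dynamic loss scaling instead
    # (this overrides static_loss_scale). Keys in dynamic_loss_args must match
    # DynamicLossScaler's constructor; 'scale_window' here is illustrative.
    # verbose=False silences the constructor's parameter printout.
    optimizer = FP16_Optimizer(torch.optim.SGD(model.parameters(), lr=0.01),
                               dynamic_loss_scale=True,
                               dynamic_loss_args={'scale_window': 500},
                               verbose=False)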
``init_optimizer`` is expected to have been constructed in the ordinary way.
It is recommended (although not required) that the newly constructed :class:`FP16_Optimizer` instance be
...
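A short sketch of the construction pattern recommended above, reusing the wrapped optimizer's name so later references do not need to change. The ``backward``/``step`` calls assume the wrapper's usual loss-scaling interface, documented elsewhere in this class; ``model``, ``inputs``, and the hyperparameters are placeholders::

    # Construct the inner optimizer in the ordinary way...
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    # ...then give the wrapper the same name.
    optimizer = FP16_Optimizer(optimizer, static_loss_scale=128.0)

    optimizer.zero_grad()
    loss = model(inputs).sum()   # placeholder forward pass and loss
    optimizer.backward(loss)     # the wrapper applies and later undoes the loss scale
    optimizer.step()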
@@ -105,10 +106,13 @@ class FP16_Optimizer(object):
                  init_optimizer,
                  static_loss_scale=1.0,
                  dynamic_loss_scale=False,
-                 dynamic_loss_args=None):
+                 dynamic_loss_args=None,
+                 verbose=True):
         if not torch.cuda.is_available:
             raise SystemError("Cannot use fp16 without CUDA.")
+        self.verbose = verbose
         self.optimizer = init_optimizer
         # init_state_dict sets up an alternative way to cast per-param state tensors.
         # Stashing here in case https://github.com/pytorch/pytorch/issues/7733 makes it necessary.
...
@@ -118,14 +122,14 @@ class FP16_Optimizer(object):
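The second hunk (body elided here) replaces as many lines as it removes, which is consistent with the constructor's existing print statements being gated on the new flag rather than rewritten. If that is the intent, the gating pattern might look like the standalone sketch below; the class and method names are hypothetical, not taken from the diff::

    class _PrintGate(object):
        # Illustrative only: gate constructor chatter on a verbose flag.
        def __init__(self, verbose=True):
            self.verbose = verbose

        def maybe_print(self, msg):
            # Print only if verbose=True was requested at construction time.
            if self.verbose:
                print(msg)

    _PrintGate(verbose=False).maybe_print("ingesting param group 0")  # suppressed
    _PrintGate(verbose=True).maybe_print("ingesting param group 0")   # printed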