:class:`apex.parallel.Reducer` is a simple class that helps allreduce a module's parameters
...
@@ -93,13 +93,13 @@ class Reducer(object):
Unlike :class:`DistributedDataParallel`, :class:`Reducer` will not automatically allreduce
parameters during ``backward()``.
Instead, :class:`Reducer` waits for the user to call ``<reducer_instance>.reduce()`` manually.
This enables, for example, delaying the allreduce to be carried out every
several iterations instead of every single iteration.
Like :class:`DistributedDataParallel`, :class:`Reducer` averages any tensors it allreduces
over the number of participating processes.
:class:`Reducer` is designed to work with the upstream launch utility script
``torch.distributed.launch`` with ``--nproc_per_node <= number of gpus per node``.
When used with this launcher, :class:`Reducer` assumes 1:1 mapping of processes to GPUs.
It also assumes that your script calls ``torch.cuda.set_device(args.rank)`` before creating the model.
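
A minimal sketch of the manual-reduction pattern described above (not part of the diff): the
model, optimizer, and the choice to reduce every four iterations are illustrative, the script is
assumed to be started by ``torch.distributed.launch`` (which supplies ``--local_rank``), and
``reduce()`` is assumed to allreduce the wrapped module's gradients, consistent with the Args
description below::

    import argparse
    import torch
    from apex.parallel import Reducer

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)  # passed by torch.distributed.launch
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)
    torch.distributed.init_process_group(backend="nccl", init_method="env://")

    model = torch.nn.Linear(10, 10).cuda()   # placeholder model
    reducer = Reducer(model)                 # parameters are synced (broadcast from rank 0) here
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    reduce_every = 4                         # allreduce every few iterations, not every iteration
    for step in range(16):
        loss = model(torch.randn(8, 10).cuda()).sum()
        loss.backward()                      # no automatic allreduce happens here
        if (step + 1) % reduce_every == 0:
            reducer.reduce()                 # manual allreduce, averaged over processes
            optimizer.step()
            optimizer.zero_grad()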
...
@@ -109,7 +109,7 @@ class Reducer(object):
Args:
module_or_grads_list: Either a network definition (module) being run in multi-gpu/distributed mode, or an iterable of gradients to be reduced. If a module is passed in, the Reducer constructor will sync the parameters across processes (broadcasting from rank 0) to make sure they're all initialized with the same values. If a list of gradients (that came from some module) is passed in, the user is responsible for manually syncing that module's parameters at the beginning of training.
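
As a quick illustration of the two accepted argument types (reusing the placeholder ``model``
from the sketch above; the gradient list shown is hypothetical)::

    reducer = Reducer(model)    # module: parameters are broadcast from rank 0 at construction

    # -- or --

    grads = [p.grad for p in model.parameters() if p.grad is not None]
    reducer = Reducer(grads)    # gradient list: the user syncs the module's parameters manually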
:class:`apex.parallel.DistributedDataParallel` is a module wrapper that enables
easy multiprocess distributed data parallel training, similar to ``torch.nn.parallel.DistributedDataParallel``. Parameters are broadcast across participating processes on initialization, and gradients are
allreduced and averaged over processes during ``backward()``.
:class:`DistributedDataParallel` is optimized for use with NCCL. It achieves high performance by
overlapping communication with computation during ``backward()`` and bucketing smaller gradient
transfers to reduce the total number of transfers required.
:class:`DistributedDataParallel` is designed to work with the upstream launch utility script
``torch.distributed.launch`` with ``--nproc_per_node <= number of gpus per node``.
When used with this launcher, :class:`DistributedDataParallel` assumes 1:1 mapping of processes to GPUs.
It also assumes that your script calls ``torch.cuda.set_device(args.rank)`` before creating the model.
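
A minimal sketch of the launcher-driven usage described above (the model is a placeholder, and
``--local_rank`` is the flag ``torch.distributed.launch`` passes to each process)::

    import argparse
    import torch
    from apex.parallel import DistributedDataParallel

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)
    torch.distributed.init_process_group(backend="nccl", init_method="env://")

    net = torch.nn.Linear(10, 10).cuda()     # placeholder model, created after set_device
    net = DistributedDataParallel(net)       # parameters broadcast across processes here

    loss = net(torch.randn(4, 10).cuda()).sum()
    loss.backward()                          # gradients allreduced and averaged during backward()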
...
@@ -161,20 +161,22 @@ class DistributedDataParallel(Module):
"""
"""
def__init__(self,
def__init__(self,
module,
module,
message_size=10000000,
message_size=10000000,
delay_allreduce=False,
delay_allreduce=False,
shared_param=None,
shared_param=None,
allreduce_trigger_params=None,
allreduce_trigger_params=None,
retain_allreduce_buffers=False,
retain_allreduce_buffers=False,
allreduce_always_fp32=False,
allreduce_always_fp32=False,
allreduce_different_streams=False,
gradient_average=True,
gradient_average=True,
gradient_predivide_factor=1.0,
gradient_predivide_factor=1.0,
gradient_average_split_factor=None):
gradient_average_split_factor=None,
prof=False):
super(DistributedDataParallel,self).__init__()
super(DistributedDataParallel,self).__init__()
# Backward/forward compatibility around
# Backward/forward compatibility around
# https://github.com/pytorch/pytorch/commit/540ef9b1fc5506369a48491af8a285a686689b36 and
# https://github.com/pytorch/pytorch/commit/540ef9b1fc5506369a48491af8a285a686689b36 and
raiseValueError("shared_param is no longer supported as an option. It was misleadingly named from the start. It turns out overlapping communication with computation should work fine with shared parameters. If you still wish to delay communication to the end of the backward pass, use delay_allreduce=True|False instead.")
raiseValueError("shared_param is no longer supported as an option. It was misleadingly named from the start. It turns out overlapping communication with computation should work fine with shared parameters. If you still wish to delay communication to the end of the backward pass, use delay_allreduce=True|False instead.")
ifgradient_average_split_factorisnotNone:
ifgradient_average_split_factorisnotNone:
print("Warning: gradient_average_split_factor has been renamed to gradient_predivide_factor. For now, gradient_average_split_factor will also work, but please update to gradient_predivide_factor instead.")
print("Warning: gradient_average_split_factor has been renamed to gradient_predivide_factor. For now, gradient_average_split_factor will also work, but please update to gradient_predivide_factor instead.")
...
@@ -206,25 +215,25 @@ class DistributedDataParallel(Module):
        self.custom_allreduce_triggers = False
        if allreduce_trigger_params is not None:
            if delay_allreduce:
                raise ValueError("Setting allreduce_trigger_params is only valid if delay_allreduce=False.")