:class:`Reducer` is designed to work with the launch utility script
``apex.parallel.multiproc.py`` or the upstream launch utility script
``torch.distributed.launch`` with ``--nproc_per_node`` <= the number of GPUs per node.
For forward compatibility, ``torch.distributed.launch`` is recommended.
When used with these launchers, :class:`Reducer` assumes a 1:1 mapping of processes to GPUs.
It also assumes that your script calls ``torch.cuda.set_device(args.rank)`` before creating the model.

main_reducer.py in https://github.com/NVIDIA/apex/tree/master/examples/imagenet shows example usage.
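
For illustration, a minimal training-loop sketch of this pattern (not main_reducer.py itself; the
``--local_rank`` argument name is what ``torch.distributed.launch`` supplies, ``init_method='env://'``
is assumed so the process group reads the launcher's environment variables, and the ``reduce()``
call after ``backward()`` follows the example script)::

    import argparse
    import torch
    import torch.distributed as dist
    from apex.parallel import Reducer

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)  # passed by torch.distributed.launch
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)  # 1:1 process-to-GPU mapping; set before creating the model
    dist.init_process_group(backend="nccl", init_method="env://")

    model = torch.nn.Linear(10, 10).cuda()
    reducer = Reducer(model)  # helper that allreduces this module's gradients when asked
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(10):
        loss = model(torch.randn(32, 10).cuda()).sum()
        optimizer.zero_grad()
        loss.backward()
        reducer.reduce()   # manual gradient allreduce (assumed API; see main_reducer.py)
        optimizer.step()

Launched, e.g., with ``python -m torch.distributed.launch --nproc_per_node=NUM_GPUS train.py``
(``train.py`` being a placeholder script name).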
...
@@ -95,22 +96,20 @@ class Reducer(object):
class DistributedDataParallel(Module):
    """
    :class:`apex.parallel.DistributedDataParallel` is a module wrapper that enables
    easy multiprocess distributed data parallel training, similar to ``torch.nn.parallel.DistributedDataParallel``.
    Parameters are broadcast across participating processes on initialization, and gradients are
    allreduced and averaged over processes during ``backward()``.

    :class:`DistributedDataParallel` is optimized for use with NCCL. It achieves high performance by
    overlapping communication with computation during ``backward()`` and bucketing smaller gradient
    transfers to reduce the total number of transfers required.

    :class:`DistributedDataParallel` is designed to work with the launch utility script
    ``apex.parallel.multiproc.py`` or the upstream launch utility script
    ``torch.distributed.launch`` with ``--nproc_per_node`` <= the number of GPUs per node.
    For forward compatibility, ``torch.distributed.launch`` is recommended.
    When used with these launchers, :class:`DistributedDataParallel` assumes a 1:1 mapping of processes to GPUs.
    It also assumes that your script calls ``torch.cuda.set_device(args.rank)`` before creating the model.
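
    For illustration, a minimal sketch of that usage, launched the same way (the ``--local_rank``
    argument name follows what ``torch.distributed.launch`` supplies, and ``init_method='env://'``
    is assumed so the process group reads the launcher's environment variables)::

        import argparse
        import torch
        import torch.distributed as dist
        from apex.parallel import DistributedDataParallel as DDP

        parser = argparse.ArgumentParser()
        parser.add_argument("--local_rank", type=int, default=0)  # passed by torch.distributed.launch
        args = parser.parse_args()

        torch.cuda.set_device(args.local_rank)  # one process per GPU; must precede model creation
        dist.init_process_group(backend="nccl", init_method="env://")

        model = torch.nn.Linear(10, 10).cuda()
        model = DDP(model)  # broadcasts parameters now; allreduces and averages gradients in backward()

        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        loss = model(torch.randn(32, 10).cuda()).sum()
        loss.backward()   # gradient allreduce is overlapped with the backward computation
        optimizer.step()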