Commit 8124fba2 authored by Michael Carilli

Updating documentation for merged utilities

parent 1fa1a073
@@ -94,14 +94,16 @@ import apex
 ```
 ### CUDA/C++ extension
-To build Apex with CUDA/C++ extension, follow the Linux instruction with the
-`--cuda_ext` option enabled
+Apex contains optional CUDA/C++ extensions, installable via
 ```
-python setup.py install --cuda_ext
+python setup.py install [--cuda_ext] [--cpp_ext]
 ```
-CUDA/C++ extension provides customed synchronized Batch Normalization kernels
-that provides better performance and numerical accuracy.
+Currently, `--cuda_ext` enables
+- Fused kernels that improve the performance and numerical stability of `apex.parallel.SyncBatchNorm`.
+- Fused kernels required to use `apex.optimizers.FusedAdam`.
+`--cpp_ext` enables
+- C++-side flattening and unflattening utilities that reduce the CPU overhead of `apex.parallel.DistributedDataParallel`.
 ### Windows support
 Windows support is experimental, and Linux is recommended. However, since Apex can be installed Python-only, there is a good chance the Python-only features "just work" the same way as on Linux. If you installed PyTorch in a Conda environment, make sure to install Apex in that same environment.
...
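For context, a minimal sketch of how the `--cuda_ext` features surface in user code, with `apex.parallel.SyncBatchNorm` used as a drop-in replacement for `torch.nn.BatchNorm2d`. The layer sizes are arbitrary, and cross-process synchronization assumes a distributed process group has already been initialized.

```python
# Minimal sketch (assumes Apex was built with --cuda_ext and a distributed
# process group is already initialized): apex.parallel.SyncBatchNorm is a
# drop-in replacement for torch.nn.BatchNorm2d.
import torch
import torch.nn as nn
from apex.parallel import SyncBatchNorm

model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    SyncBatchNorm(64),   # fused kernels come from the --cuda_ext build
    nn.ReLU(),
).cuda()
```

`apex.parallel` also provides `convert_syncbn_model`, which walks an existing model and swaps its `torch.nn` batch norm layers for `SyncBatchNorm`.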
@@ -81,11 +81,9 @@ class Reducer(object):
 Like :class:`DistributedDataParallel`, :class:`Reducer` averages any tensors it allreduces
 over the number of participating processes.
-:class:`Reducer` is designed to work with the launch utility script
-``apex.parallel.multiproc.py`` or the upstream launch utility script
+:class:`Reducer` is designed to work with the upstream launch utility script
 ``torch.distributed.launch`` with ``--nproc_per_node <= number of gpus per node``.
-For forward compatibility, ``torch.distributed.launch`` is recommended.
-When used with these launchers, :class:`Reducer` assumes 1:1 mapping of processes to GPUs.
+When used with this launcher, :class:`Reducer` assumes 1:1 mapping of processes to GPUs.
 It also assumes that your script calls ``torch.cuda.set_device(args.rank)`` before creating the model.
 main_reducer.py in https://github.com/NVIDIA/apex/tree/master/examples/imagenet shows example usage.
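A minimal sketch of the manual pattern this docstring describes, assuming one process per GPU launched with `torch.distributed.launch` (which supplies `--local_rank`); the tiny `Linear` model stands in for a real network:

```python
# Sketch: apex.parallel.Reducer leaves the user in charge of when gradients
# are allreduced. Assumes launch via
#   python -m torch.distributed.launch --nproc_per_node=N script.py
import argparse
import torch
from apex.parallel import Reducer

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)  # 1:1 process-to-GPU mapping
torch.distributed.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(128, 128).cuda()  # placeholder model
reducer = Reducer(model)                  # allreduces are NOT automatic

loss = model(torch.randn(32, 128).cuda()).sum()
loss.backward()
reducer.reduce()   # explicitly allreduce (and average) the gradients
```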
@@ -122,11 +120,9 @@ class DistributedDataParallel(Module):
 overlapping communication with computation during ``backward()`` and bucketing smaller gradient
 transfers to reduce the total number of transfers required.
-:class:`DistributedDataParallel` is designed to work with the launch utility script
-``apex.parallel.multiproc.py`` or the upstream launch utility script
+:class:`DistributedDataParallel` is designed to work with the upstream launch utility script
 ``torch.distributed.launch`` with ``--nproc_per_node <= number of gpus per node``.
-For forward compatibility, ``torch.distributed.launch`` is recommended.
-When used with these launchers, :class:`DistributedDataParallel` assumes 1:1 mapping of processes to GPUs.
+When used with this launcher, :class:`DistributedDataParallel` assumes 1:1 mapping of processes to GPUs.
 It also assumes that your script calls ``torch.cuda.set_device(args.rank)`` before creating the model.
 https://github.com/NVIDIA/apex/tree/master/examples/distributed shows detailed usage.
@@ -135,8 +131,11 @@ class DistributedDataParallel(Module):
 Args:
     module: Network definition to be run in multi-gpu/distributed mode.
-    message_size (Default = 1e7): Minimum number of elements in a communication bucket.
-    delay_allreduce (Default = False): Delay all communication to the end of the backward pass. This disables overlapping communication with computation.
+    message_size (int, default=1e7): Minimum number of elements in a communication bucket.
+    delay_allreduce (bool, default=False): Delay all communication to the end of the backward pass. This disables overlapping communication with computation.
+    allreduce_trigger_params (list, optional, default=None): If supplied, should contain a list of parameters drawn from the model. Allreduces will be kicked off whenever one of these parameters receives its gradient (as opposed to when a bucket of size message_size is full). At the end of backward(), a cleanup allreduce to catch any remaining gradients will also be performed automatically. If allreduce_trigger_params is supplied, the message_size argument will be ignored.
+    allreduce_always_fp32 (bool, default=False): Convert any FP16 gradients to FP32 before allreducing. This can improve stability for widely scaled-out runs.
+    gradient_average_split_factor (float, default=1.0): Perform the averaging of gradients over processes partially before and partially after the allreduce. Before allreduce: grads.mul_(1.0/gradient_average_split_factor). After allreduce: grads.mul_(gradient_average_split_factor/world_size). This can reduce the stress on the dynamic range of FP16 allreduces for widely scaled-out runs.
 """
...
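Pulling the new arguments together, a minimal sketch of constructing `apex.parallel.DistributedDataParallel` under the same single-process-per-GPU launch; the values shown are just the documented defaults:

```python
# Sketch: apex.parallel.DistributedDataParallel with the arguments documented
# above (values shown are the defaults). Assumes launch via torch.distributed.launch.
import argparse
import torch
from apex.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(1024, 1024).cuda()
model = DistributedDataParallel(
    model,
    message_size=10000000,            # bucket size in elements
    delay_allreduce=False,            # keep communication overlapped with backward()
    allreduce_always_fp32=False,      # set True to allreduce FP16 grads in FP32
    gradient_average_split_factor=1.0,
)

out = model(torch.randn(32, 1024).cuda())
out.sum().backward()   # allreduces fire automatically as buckets fill
```

With `gradient_average_split_factor = f`, gradients are scaled by `1/f` before the allreduce and by `f/world_size` afterward, so the net effect is still the usual `1/world_size` average while keeping intermediate values in a friendlier FP16 range.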
@@ -17,7 +17,7 @@ Installation requires CUDA 9 or later, PyTorch 0.4 or later, and Python 3. Insta
 git clone https://www.github.com/nvidia/apex
 cd apex
-python setup.py install
+python setup.py install [--cuda_ext] [--cpp_ext]
 .. toctree::
@@ -38,6 +38,12 @@ Installation requires CUDA 9 or later, PyTorch 0.4 or later, and Python 3. Insta
    parallel
+.. toctree::
+   :maxdepth: 1
+   :caption: Fused Optimizers
+
+   optimizers
 .. reparameterization
 .. RNN
...
.. role:: hidden
    :class: hidden-section

apex.optimizers
===================================

.. automodule:: apex.optimizers
.. currentmodule:: apex.optimizers

.. FusedAdam
    ----------

.. autoclass:: FusedAdam
    :members:
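The class being documented is used like `torch.optim.Adam`; a minimal sketch, assuming Apex was built with `--cuda_ext` and the parameters live on the GPU:

```python
# Sketch: apex.optimizers.FusedAdam as a drop-in for torch.optim.Adam
# (requires the --cuda_ext build; parameters must be CUDA tensors).
import torch
from apex.optimizers import FusedAdam

model = torch.nn.Linear(128, 10).cuda()
optimizer = FusedAdam(model.parameters(), lr=1e-3)

loss = model(torch.randn(64, 128).cuda()).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```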