The [Imagenet with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/imagenet) and [word_language_model with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/word_language_model) directories also contain `main.py` files that demonstrate manual management of master parameters and static loss scaling. These examples illustrate what sort of operations `FP16_Optimizer` is performing automatically.
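As a rough illustration of those operations (a self-contained sketch, not code from the example scripts; the tiny `nn.Linear` model and random data are stand-ins), the loop below keeps FP16 model parameters, an FP32 master copy that the optimizer updates, and a static loss scale:

```
import torch
import torch.nn as nn

static_loss_scale = 128.0

model = nn.Linear(64, 10).cuda().half()               # FP16 model parameters
criterion = nn.CrossEntropyLoss()

# FP32 "master" copies of the parameters; the optimizer updates these.
master_params = [p.detach().clone().float().requires_grad_(True)
                 for p in model.parameters()]
optimizer = torch.optim.SGD(master_params, lr=0.1)

for step in range(3):
    inputs = torch.randn(32, 64, device="cuda").half()
    targets = torch.randint(0, 10, (32,), device="cuda")

    for p in model.parameters():
        p.grad = None                                  # clear FP16 grads from the previous step
    loss = criterion(model(inputs).float(), targets)
    (loss * static_loss_scale).backward()              # scale so small FP16 grads don't flush to zero

    # Copy FP16 grads into the FP32 masters, undoing the scale before the update.
    for mp, p in zip(master_params, model.parameters()):
        mp.grad = p.grad.detach().float() / static_loss_scale

    optimizer.step()                                   # FP32 weight update

    # Copy the updated FP32 masters back into the FP16 model parameters.
    with torch.no_grad():
        for p, mp in zip(model.parameters(), master_params):
            p.copy_(mp)
```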
When used with ``multiproc.py``, :class:`DistributedDataParallel`
assigns 1 process to each of the available (visible) GPUs on the node.
Parameters are broadcast across participating processes on initialization, and gradients are
allreduced and averaged over processes during ``backward()``.

:class:`DistributedDataParallel` is optimized for use with NCCL. It achieves high performance by
overlapping communication with computation during ``backward()`` and bucketing smaller gradient
transfers to reduce the total number of transfers required.

:class:`DistributedDataParallel` assumes that your script accepts the command line
arguments "rank" and "world-size." It also assumes that your script calls
``torch.cuda.set_device(args.rank)`` before creating the model.
Args:
    module: Network definition to be run in multi-gpu/distributed mode.
    message_size (Default = 1e7): Minimum number of elements in a communication bucket.
    shared_param (Default = False): If your model uses shared parameters this must be True. It will disable bucketing of parameters to avoid race conditions.
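A minimal usage sketch consistent with the description above. The argument names, the ``torch.cuda.set_device`` call, and the constructor arguments follow the docstring; the process-group initialization (NCCL backend, ``env://`` rendezvous) is an assumption here, so check the repository's distributed example for the exact calls it uses:

```
import argparse

import torch
import torch.nn as nn
import torch.distributed as dist

from apex.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument("--rank", type=int, default=0)
parser.add_argument("--world-size", type=int, default=1)
args = parser.parse_args()

# Select this process's GPU *before* creating the model.
torch.cuda.set_device(args.rank)

# Assumed initialization; the distributed example shows the exact calls.
dist.init_process_group(backend="nccl", init_method="env://",
                        world_size=args.world_size, rank=args.rank)

model = nn.Linear(64, 10).cuda()
model = DistributedDataParallel(model,
                                message_size=10000000,  # minimum elements per communication bucket
                                shared_param=False)      # set True if the model shares parameters
```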
This site contains the API documentation for Apex (https://github.com/nvidia/apex),
a PyTorch extension with NVIDIA-maintained utilities to streamline mixed precision and distributed training. Some of the code here will be included in upstream PyTorch eventually. The intention of Apex is to make up-to-date utilities available to users as quickly as possible.
Installation requires CUDA 9 or later, PyTorch 0.4 or later, and Python 3.
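The version requirements can be checked from a Python session; this snippet is only an illustration, not part of Apex:

```
import sys
import torch

print("Python:", sys.version.split()[0])             # needs Python 3
print("PyTorch:", torch.__version__)                  # needs 0.4 or later
print("CUDA toolkit:", torch.version.cuda)            # needs 9 or later
print("CUDA available:", torch.cuda.is_available())
```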
To launch a training script on all of the node's GPUs, run

```
python -m apex.parallel.multiproc main.py args...
```

adding any `args...` you like. The launch script `apex.parallel.multiproc` will
spawn one process for each of your system's available (visible) GPUs.
Each process will run `python main.py args... --world-size <worldsize> --rank <rank>`
(the `--world-size` and `--rank` arguments are determined and appended by `apex.parallel.multiproc`).
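Conceptually, the launcher behaves like the sketch below. This is only an illustration of the protocol just described, not the actual implementation of `apex.parallel.multiproc`, and the file name `launcher_sketch.py` is made up:

```
# launcher_sketch.py: spawn one training process per visible GPU,
# appending --world-size and --rank as described above.
import subprocess
import sys

import torch


def launch(script_and_args):
    world_size = torch.cuda.device_count()    # one process per visible GPU
    procs = []
    for rank in range(world_size):
        cmd = [sys.executable] + script_and_args + [
            "--world-size", str(world_size),
            "--rank", str(rank),
        ]
        procs.append(subprocess.Popen(cmd))
    for p in procs:
        p.wait()


if __name__ == "__main__":
    # e.g. python launcher_sketch.py main.py args...
    launch(sys.argv[1:])
```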
To understand how to convert your own model, please see all sections of `main.py` within the `#=====START: ADDED FOR DISTRIBUTED======` and `#=====END: ADDED FOR DISTRIBUTED======` flags.
[Example with Imagenet and mixed precision training](https://github.com/NVIDIA/apex/tree/master/examples/imagenet)
## Requirements

PyTorch master branch built from source. This is required in order to use NCCL as a distributed backend.
This example is based on [https://github.com/pytorch/examples/tree/master/imagenet](https://github.com/pytorch/examples/tree/master/imagenet).
It implements training of popular model architectures, such as ResNet, AlexNet, and VGG, on the ImageNet dataset.

`main.py` and `main_fp16_optimizer.py` have been modified to use the `DistributedDataParallel` module in Apex instead of the one in upstream PyTorch. For a description of how this works, please see the distributed example included in this repo.

`main.py` with the `--fp16` argument demonstrates mixed precision training with manual management of master parameters and loss scaling.
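For comparison, a rough sketch of the `FP16_Optimizer` route (which `main_fp16_optimizer.py` is named for): wrap an ordinary optimizer and let it manage master parameters and loss scaling. The import path and constructor arguments shown are assumptions about the `apex.fp16_utils` API, so check the `FP16_Optimizer` documentation for the exact signature:

```
import torch
import torch.nn as nn

from apex.fp16_utils import FP16_Optimizer   # assumed import path

model = nn.Linear(64, 10).cuda().half()
criterion = nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
optimizer = FP16_Optimizer(optimizer, static_loss_scale=128.0)   # assumed kwarg name

inputs = torch.randn(32, 64, device="cuda").half()
targets = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()
loss = criterion(model(inputs).float(), targets)
optimizer.backward(loss)   # replaces loss.backward(); applies the loss scale internally
optimizer.step()           # updates FP32 master weights, then copies them back to the FP16 model
```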
## Requirements

- Apex, which can be installed from https://www.github.com/nvidia/apex
- Install PyTorch from source, master branch of [pytorch on github](https://www.github.com/pytorch/pytorch).
- `pip install -r requirements.txt`
- Download the ImageNet dataset and move validation images to labeled subfolders
  - To do this, you can use the following script: https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh