Commit 82d7a3bf authored by Michael Carilli

Updating READMEs and examples

parent 2c175a5d
@@ -37,7 +37,8 @@ The intention of `FP16_Optimizer` is to be the "highway" for FP16 training: achi
[word_language_model with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/word_language_model)
The Imagenet and word_language_model directories also contain examples that show manual management of master parameters and static loss scaling.
These manual examples illustrate what sort of operations `amp` and `FP16_Optimizer` are performing automatically.
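For reference, a minimal sketch of the manual pattern those examples implement (illustrative only; it assumes an FP16 `model`, FP16 `inputs`, `targets`, and a `criterion` already exist, and uses a static loss scale):

```python
import torch

# FP32 "master" copies of the FP16 model parameters; the optimizer updates these
master_params = [p.detach().clone().float() for p in model.parameters()]
optimizer = torch.optim.SGD(master_params, lr=0.1)

loss_scale = 128.0  # static loss scale

output = model(inputs)
loss = criterion(output, targets)
(loss * loss_scale).backward()  # backward pass runs in FP16 on the scaled loss

# copy the scaled FP16 grads into the FP32 master grads and unscale them
for master, p in zip(master_params, model.parameters()):
    master.grad = p.grad.detach().float() / loss_scale

optimizer.step()  # the weight update happens in FP32

# copy the updated FP32 master weights back into the FP16 model
with torch.no_grad():
    for master, p in zip(master_params, model.parameters()):
        p.copy_(master)
```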
## 2. Distributed Training
@@ -49,7 +50,7 @@ optimized for NVIDIA's NCCL communication library.
[API Documentation](https://nvidia.github.io/apex/parallel.html)
[Python Source](https://github.com/NVIDIA/apex/tree/master/apex/parallel)
[Example/Walkthrough](https://github.com/NVIDIA/apex/tree/master/examples/distributed)
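A rough usage sketch (assumptions: NCCL backend, one process per GPU; `args.local_rank` and `MyModel` here are illustrative placeholders):

```python
import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP

# one process per GPU: bind this process to its device before building the model
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl', init_method='env://')

model = MyModel().cuda()
model = DDP(model)  # gradients are allreduced across processes during backward()
```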
@@ -43,7 +43,7 @@ class DistributedDataParallel(Module):
to the other processes on initialization of DistributedDataParallel, and will be
allreduced in buckets during the backward pass.
See https://github.com/NVIDIA/apex/tree/master/examples/distributed for detailed usage.
Args:
module: Network definition to be run in multi-gpu/distributed mode.
@@ -3,7 +3,7 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
:github_url: https://github.com/nvidia/apex
APEx (A PyTorch Extension)
===================================
# Simple examples of FP16_Optimizer functionality
#### Minimal Working Sample
`minimal.py` shows the basic usage of `FP16_Optimizer` with either static or dynamic loss scaling. Test via `python minimal.py`.
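The pattern it demonstrates is roughly the following (a sketch, assuming an FP16 `model`, `inputs`, `targets`, and `criterion` are already set up):

```python
import torch
from apex.fp16_utils import FP16_Optimizer

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# wrap the existing optimizer; choose either a static or a dynamic loss scale
optimizer = FP16_Optimizer(optimizer, static_loss_scale=128.0)
# optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
optimizer.backward(loss)  # replaces loss.backward(); handles loss scaling internally
optimizer.step()
```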
#### Closures
`FP16_Optimizer` supports closures with the same control flow as ordinary Pytorch optimizers.
`closure.py` shows an example. Test via `python closure.py`.
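A sketch of that control flow, under the same assumptions as the minimal sample above:

```python
def closure():
    optimizer.zero_grad()
    output = model(inputs)
    loss = criterion(output, targets)
    optimizer.backward(loss)
    return loss

loss = optimizer.step(closure)  # step accepts a closure, just like ordinary Pytorch optimizers
```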
See [the API documentation](https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.FP16_Optimizer.step) for more details.
#### Checkpointing
`FP16_Optimizer` also supports checkpointing with the same control flow as ordinary Pytorch optimizers.
`save_load.py` shows an example. Test via `python save_load.py`.
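A sketch of the save/load flow (the file name is illustrative):

```python
# save: FP16_Optimizer exposes state_dict() like an ordinary optimizer
checkpoint = {'model': model.state_dict(), 'optimizer': optimizer.state_dict()}
torch.save(checkpoint, 'checkpoint.pt')

# load
checkpoint = torch.load('checkpoint.pt')
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
```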
See [the API documentation](https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.FP16_Optimizer.load_state_dict) for more details.
#### Distributed
**distributed_pytorch** shows an example using `FP16_Optimizer` with Pytorch DistributedDataParallel.
The usage of `FP16_Optimizer` with distributed does not need to change from ordinary single-process
usage. Test via
```bash
cd distributed_pytorch
bash run.sh
@@ -33,7 +26,7 @@ bash run.sh
**distributed_apex** shows an example using `FP16_Optimizer` with Apex DistributedDataParallel.
Again, the usage of `FP16_Optimizer` with distributed does not need to change from ordinary
single-process usage. Test via
```bash
cd distributed_apex
bash run.sh
@@ -4,7 +4,7 @@ This example demonstrates how to modify a network to use a simple but effective
[API Documentation](https://nvidia.github.io/apex/parallel.html)
[Source Code](https://github.com/NVIDIA/apex/tree/master/apex/parallel)
## Getting started
Prior to running, please run
@@ -22,4 +22,3 @@ To understand how to convert your own model to use the distributed module includ
## Requirements
Pytorch master branch built from source. This is required to use NCCL as a distributed backend.
Apex installed from https://www.github.com/nvidia/apex
@@ -23,11 +23,7 @@ adding any normal arguments.
## Training
To train a model, run `main.py` with the desired model architecture and the path to the ImageNet dataset.
The default learning rate schedule starts at 0.1 and decays by a factor of 10 every 30 epochs. This is appropriate for ResNet and models with batch normalization, but too high for AlexNet and VGG. Use 0.01 as the initial learning rate for AlexNet or VGG:
@@ -36,7 +32,29 @@ python main.py -a alexnet --lr 0.01 /path/to/imagenet/folder
```
The directory at /path/to/imagenet/directory should contain two subdirectories called "train"
and "val" that contain the training and validation data respectively. Train images are expected to be 256x256 jpegs.
Example commands (note: batch size `--b 256` assumes your GPUs have >=16GB of onboard memory):
```bash
### Softlink training dataset into current directory
$ ln -sf /data/imagenet/train-jpeg-256x256/ train
### Softlink validation dataset into current directory
$ ln -sf /data/imagenet/val-jpeg/ val
### Single-process training
$ python main.py -a resnet50 --fp16 --b 256 --workers 4 ./
### Multi-process training (uses all visible GPUs on the node)
$ python -m apex.parallel.multiproc main.py -a resnet50 --fp16 --b 256 --workers 4 ./
### Multi-process training on GPUs 0 and 1 only
$ export CUDA_VISIBLE_DEVICES=0,1
$ python -m apex.parallel.multiproc main.py -a resnet50 --fp16 --b 256 --workers 4 ./
### Multi-process training with FP16_Optimizer, default loss scale 1.0 (still uses FP32 master params)
$ python -m apex.parallel.multiproc main_fp16_optimizer.py -a resnet50 --fp16 --b 256 --workers 4 ./
### Multi-process training with FP16_Optimizer, static loss scale
$ python -m apex.parallel.multiproc main_fp16_optimizer.py -a resnet50 --fp16 --b 256 --static-loss-scale 128.0 --workers 4 ./
### Multi-process training with FP16_Optimizer, dynamic loss scaling
$ python -m apex.parallel.multiproc main_fp16_optimizer.py -a resnet50 --fp16 --b 256 --dynamic-loss-scale --workers 4 ./
```
## Usage for `main.py` and `main_fp16_optimizer.py`
@@ -120,7 +120,8 @@ def main():
if args.fp16:
model = network_to_half(model)
if args.distributed:
# shared_param turns off bucketing in DDP; for lower-latency runs this can improve perf
model = DDP(model, shared_param=True)
# define loss function (criterion) and optimizer
criterion = nn.CrossEntropyLoss().cuda()
@@ -293,6 +294,7 @@ def train(train_loader, model, criterion, optimizer, epoch):
loss.backward()
optimizer.step()
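# wait for queued CUDA kernels to finish so the elapsed-time measurement below covers the full iteration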
torch.cuda.synchronize()
# measure elapsed time
batch_time.update(time.time() - end)