Commit 82d7a3bf authored by Michael Carilli

Updating READMEs and examples

parent 2c175a5d
@@ -37,7 +37,8 @@ The intention of `FP16_Optimizer` is to be the "highway" for FP16 training: achi
[word_language_model with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/word_language_model)
The Imagenet and word_language_model directories also contain examples that show manual management of master parameters and static loss scaling.
These manual examples illustrate what sort of operations `amp` and `FP16_Optimizer` are performing automatically.
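For reference, a minimal sketch of the manual pattern those examples implement (illustrative only; it assumes an FP16 `model`, FP16 `inputs`, `targets`, and a `criterion` already exist, and uses a static loss scale):

```python
import torch

# FP32 "master" copies of the FP16 model parameters; the optimizer updates these
master_params = [p.detach().clone().float() for p in model.parameters()]
optimizer = torch.optim.SGD(master_params, lr=0.1)

loss_scale = 128.0  # static loss scale

output = model(inputs)
loss = criterion(output, targets)
(loss * loss_scale).backward()  # backward pass runs in FP16 on the scaled loss

# copy the scaled FP16 grads into the FP32 master grads and unscale them
for master, p in zip(master_params, model.parameters()):
    master.grad = p.grad.detach().float() / loss_scale

optimizer.step()  # the weight update happens in FP32

# copy the updated FP32 master weights back into the FP16 model
with torch.no_grad():
    for master, p in zip(master_params, model.parameters()):
        p.copy_(master)
```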
## 2. Distributed Training
@@ -49,7 +50,7 @@ optimized for NVIDIA's NCCL communication library.
[API Documentation](https://nvidia.github.io/apex/parallel.html)
[Python Source](https://github.com/NVIDIA/apex/tree/master/apex/parallel)
[Example/Walkthrough](https://github.com/NVIDIA/apex/tree/master/examples/distributed)
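A rough usage sketch (assumptions: NCCL backend, one process per GPU; `args.local_rank` and `MyModel` here are illustrative placeholders):

```python
import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP

# one process per GPU: bind this process to its device before building the model
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl', init_method='env://')

model = MyModel().cuda()
model = DDP(model)  # gradients are allreduced across processes during backward()
```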
@@ -43,7 +43,7 @@ class DistributedDataParallel(Module):
to the other processes on initialization of DistributedDataParallel, and will be
allreduced in buckets during the backward pass.
See https://github.com/NVIDIA/apex/tree/master/examples/distributed for detailed usage.
Args:
module: Network definition to be run in multi-gpu/distributed mode.
@@ -3,7 +3,7 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
:github_url: https://github.com/nvidia/apex
APEx (A PyTorch Extension)
===================================
# Simple examples of FP16_Optimizer functionality
#### Minimal Working Sample
`minimal.py` shows the basic usage of `FP16_Optimizer` with either static or dynamic loss scaling. Test via `python minimal.py`.
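The pattern it demonstrates is roughly the following (a sketch, assuming an FP16 `model`, `inputs`, `targets`, and `criterion` are already set up):

```python
import torch
from apex.fp16_utils import FP16_Optimizer

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# wrap the existing optimizer; choose either a static or a dynamic loss scale
optimizer = FP16_Optimizer(optimizer, static_loss_scale=128.0)
# optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
optimizer.backward(loss)  # replaces loss.backward(); handles loss scaling internally
optimizer.step()
```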
#### Closures
`FP16_Optimizer` supports closures with the same control flow as ordinary Pytorch optimizers.
`closure.py` shows an example. Test via `python closure.py`.
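A sketch of that control flow, under the same assumptions as the minimal sample above:

```python
def closure():
    optimizer.zero_grad()
    output = model(inputs)
    loss = criterion(output, targets)
    optimizer.backward(loss)
    return loss

loss = optimizer.step(closure)  # step accepts a closure, just like ordinary Pytorch optimizers
```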
See [the API documentation](https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.FP16_Optimizer.step) for more details.
#### Checkpointing
`FP16_Optimizer` also supports checkpointing with the same control flow as ordinary Pytorch optimizers.
`save_load.py` shows an example. Test via `python save_load.py`.
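A sketch of the save/load flow (the file name is illustrative):

```python
# save: FP16_Optimizer exposes state_dict() like an ordinary optimizer
checkpoint = {'model': model.state_dict(), 'optimizer': optimizer.state_dict()}
torch.save(checkpoint, 'checkpoint.pt')

# load
checkpoint = torch.load('checkpoint.pt')
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
```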
See [the API documentation](https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.FP16_Optimizer.load_state_dict) for more details.
#### Distributed
**distributed_pytorch** shows an example using `FP16_Optimizer` with Pytorch DistributedDataParallel.
The usage of `FP16_Optimizer` with distributed does not need to change from ordinary single-process
usage. Test via
```bash
cd distributed_pytorch
bash run.sh
@@ -33,7 +26,7 @@ bash run.sh
**distributed_apex** shows an example using `FP16_Optimizer` with Apex DistributedDataParallel.
Again, the usage of `FP16_Optimizer` with distributed does not need to change from ordinary
single-process usage. Test via
```bash
cd distributed_apex
bash run.sh
@@ -4,7 +4,7 @@ This example demonstrates how to modify a network to use a simple but effective
[API Documentation](https://nvidia.github.io/apex/parallel.html)
[Source Code](https://github.com/NVIDIA/apex/tree/master/apex/parallel)
## Getting started
Prior to running, please run
@@ -22,4 +22,3 @@ To understand how to convert your own model to use the distributed module includ
## Requirements
Pytorch master branch built from source. This is required to use NCCL as a distributed backend.
Apex installed from https://www.github.com/nvidia/apex
@@ -23,11 +23,7 @@ adding any normal arguments.
## Training
To train a model, run `main.py` with the desired model architecture and the path to the ImageNet dataset.
The default learning rate schedule starts at 0.1 and decays by a factor of 10 every 30 epochs. This is appropriate for ResNet and models with batch normalization, but too high for AlexNet and VGG. Use 0.01 as the initial learning rate for AlexNet or VGG:
@@ -36,7 +32,29 @@ python main.py -a alexnet --lr 0.01 /path/to/imagenet/folder
```
The directory at /path/to/imagenet/directory should contain two subdirectories called "train"
and "val" that contain the training and validation data respectively. Train images are expected to be 256x256 jpegs.
Example commands (note: batch size `--b 256` assumes your GPUs have >=16GB of onboard memory):
```bash
### Softlink training dataset into current directory
$ ln -sf /data/imagenet/train-jpeg-256x256/ train
### Softlink validation dataset into current directory
$ ln -sf /data/imagenet/val-jpeg/ val
### Single-process training
$ python main.py -a resnet50 --fp16 --b 256 --workers 4 ./
### Multi-process training (uses all visible GPUs on the node)
$ python -m apex.parallel.multiproc main.py -a resnet50 --fp16 --b 256 --workers 4 ./
### Multi-process training on GPUs 0 and 1 only
$ export CUDA_VISIBLE_DEVICES=0,1
$ python -m apex.parallel.multiproc main.py -a resnet50 --fp16 --b 256 --workers 4 ./
### Multi-process training with FP16_Optimizer, default loss scale 1.0 (still uses FP32 master params)
$ python -m apex.parallel.multiproc main_fp16_optimizer.py -a resnet50 --fp16 --b 256 --workers 4 ./
### Multi-process training with FP16_Optimizer, static loss scale
$ python -m apex.parallel.multiproc main_fp16_optimizer.py -a resnet50 --fp16 --b 256 --static-loss-scale 128.0 --workers 4 ./
### Multi-process training with FP16_Optimizer, dynamic loss scaling
$ python -m apex.parallel.multiproc main_fp16_optimizer.py -a resnet50 --fp16 --b 256 --dynamic-loss-scale --workers 4 ./
```
## Usage for `main.py` and `main_fp16_optimizer.py`
@@ -120,7 +120,8 @@ def main():
if args.fp16:
model = network_to_half(model)
if args.distributed:
# shared_param turns off bucketing in DDP; for lower-latency runs this can improve perf
model = DDP(model, shared_param=True)
# define loss function (criterion) and optimizer
criterion = nn.CrossEntropyLoss().cuda()
@@ -293,6 +294,7 @@ def train(train_loader, model, criterion, optimizer, epoch):
loss.backward()
optimizer.step()
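# wait for queued CUDA kernels to finish so the elapsed-time measurement below covers the full iteration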
torch.cuda.synchronize()
# measure elapsed time
batch_time.update(time.time() - end)