Unverified Commit 3266e5c4 authored by Shruti Pulstya, committed by GitHub

Updated references README files to use torchrun instead of distributed.launch (#4451)

parent 85982ac6
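For context: `torchrun` always passes each worker's rank information through environment variables (`LOCAL_RANK`, `RANK`, `WORLD_SIZE`), which is why the `--use_env` flag needed by `torch.distributed.launch` disappears from every command below. A minimal sketch of the pattern a script launched by `torchrun` typically follows (illustrative only, not the repository's actual `train.py`; the script name in the comment is a placeholder):

```python
# Minimal sketch, assuming the script is launched by torchrun, e.g.:
#   torchrun --nproc_per_node=8 this_script.py
# torchrun exports LOCAL_RANK, RANK and WORLD_SIZE for every worker process.
import os

import torch
import torch.distributed as dist


def setup_distributed():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun; no --local_rank argument needed
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")     # reads RANK/WORLD_SIZE/MASTER_* from the environment
    return local_rank


if __name__ == "__main__":
    local_rank = setup_distributed()
    print(f"worker {dist.get_rank()} of {dist.get_world_size()} on GPU {local_rank}")
    dist.destroy_process_group()
```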
@@ -23,7 +23,7 @@ Since `AlexNet` and the original `VGG` architectures do not include batch
normalization, the default initial learning rate `--lr 0.1` is too high.
```
-python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
+torchrun --nproc_per_node=8 train.py\
--model $MODEL --lr 1e-2
```
@@ -33,7 +33,7 @@ normalization and thus are trained with the default parameters.
### ResNext-50 32x4d
```
-python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
+torchrun --nproc_per_node=8 train.py\
--model resnext50_32x4d --epochs 100
```
@@ -41,7 +41,7 @@ python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
### ResNext-101 32x8d
```
-python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
+torchrun --nproc_per_node=8 train.py\
--model resnext101_32x8d --epochs 100
```
@@ -54,7 +54,7 @@ which are respectively batch_size=32 and lr=0.1
### MobileNetV2
```
-python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
+torchrun --nproc_per_node=8 train.py\
--model mobilenet_v2 --epochs 300 --lr 0.045 --wd 0.00004\
--lr-step-size 1 --lr-gamma 0.98
```
@@ -62,7 +62,7 @@ python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
### MobileNetV3 Large & Small
```
-python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
+torchrun --nproc_per_node=8 train.py\
--model $MODEL --epochs 600 --opt rmsprop --batch-size 128 --lr 0.064\
--wd 0.00001 --lr-step-size 2 --lr-gamma 0.973 --auto-augment imagenet --random-erase 0.2
```
@@ -85,7 +85,7 @@ Automatic Mixed Precision (AMP) training on GPU for Pytorch can be enabled with
Mixed precision training makes use of both FP32 and FP16 precisions where appropriate. FP16 operations can leverage the Tensor cores on NVIDIA GPUs (Volta, Turing or newer architectures) for improved throughput, generally without loss in model accuracy. Mixed precision training also often allows larger batch sizes. GPU automatic mixed precision training for PyTorch Vision can be enabled via the flag value `--apex=True`.
```
-python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
+torchrun --nproc_per_node=8 train.py\
--model resnext50_32x4d --epochs 100 --apex
```
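For reference, the mixed precision pattern behind this flag is the standard `torch.cuda.amp` loop. A minimal, self-contained sketch is shown below; it is illustrative only — the real loop lives in the reference `train.py`, and the placeholder model and random data are not part of the recipe:

```python
# Minimal torch.cuda.amp sketch: autocast runs the forward pass in FP16 where
# safe, GradScaler scales the loss so FP16 gradients do not underflow.
import torch
import torch.nn as nn

device = torch.device("cuda")
model = nn.Linear(224, 10).to(device)                    # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):                                      # placeholder data loader
    images = torch.randn(32, 224, device=device)
    targets = torch.randint(0, 10, (32,), device=device)

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                      # FP16 where safe, FP32 elsewhere
        loss = criterion(model(images), targets)
    scaler.scale(loss).backward()                        # backward on the scaled loss
    scaler.step(optimizer)                               # unscales grads, then optimizer.step()
    scaler.update()                                      # adjust the scale for the next step
```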
@@ -120,7 +120,7 @@ For Mobilenet-v2, the model was trained with quantization aware training, the se
12. weight-decay: 0.0001
```
-python -m torch.distributed.launch --nproc_per_node=8 --use_env train_quantization.py --model='mobilenet_v2'
+torchrun --nproc_per_node=8 train_quantization.py --model='mobilenet_v2'
```
Training converges at about 10 epochs.
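For orientation, eager-mode quantization aware training with `torch.quantization` follows roughly the pattern below. This is a minimal sketch, not the reference implementation — `train_quantization.py` handles the full recipe, and exact argument names can differ across PyTorch/torchvision versions:

```python
# Minimal QAT sketch: fake-quantization observers are inserted for training,
# then the trained model is converted to a real int8 model for inference.
import torch
from torchvision.models.quantization import mobilenet_v2

model = mobilenet_v2(pretrained=True, quantize=False)  # float pretrained weights (kwarg name varies by version)
model.fuse_model()                                      # fuse conv+bn+relu blocks before quantization
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
model.train()
torch.quantization.prepare_qat(model, inplace=True)     # insert fake-quant modules

# ... run the usual training loop here (per the note above, ~10 epochs suffice) ...

model.eval()
quantized_model = torch.quantization.convert(model)     # int8 model ready for inference
```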
@@ -140,7 +140,7 @@ For Mobilenet-v3 Large, the model was trained with quantization aware training,
12. weight-decay: 0.00001
```
-python -m torch.distributed.launch --nproc_per_node=8 --use_env train_quantization.py --model='mobilenet_v3_large' \
+torchrun --nproc_per_node=8 train_quantization.py --model='mobilenet_v3_large' \
--wd 0.00001 --lr 0.001
```
@@ -22,35 +22,35 @@ Except otherwise noted, all models have been trained on 8x V100 GPUs.
### Faster R-CNN ResNet-50 FPN
```
-python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
+torchrun --nproc_per_node=8 train.py\
--dataset coco --model fasterrcnn_resnet50_fpn --epochs 26\
--lr-steps 16 22 --aspect-ratio-group-factor 3
```
### Faster R-CNN MobileNetV3-Large FPN
```
-python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
+torchrun --nproc_per_node=8 train.py\
--dataset coco --model fasterrcnn_mobilenet_v3_large_fpn --epochs 26\
--lr-steps 16 22 --aspect-ratio-group-factor 3
```
### Faster R-CNN MobileNetV3-Large 320 FPN
```
-python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
+torchrun --nproc_per_node=8 train.py\
--dataset coco --model fasterrcnn_mobilenet_v3_large_320_fpn --epochs 26\
--lr-steps 16 22 --aspect-ratio-group-factor 3
```
### RetinaNet
```
-python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
+torchrun --nproc_per_node=8 train.py\
--dataset coco --model retinanet_resnet50_fpn --epochs 26\
--lr-steps 16 22 --aspect-ratio-group-factor 3 --lr 0.01
```
### SSD300 VGG16
```
-python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
+torchrun --nproc_per_node=8 train.py\
--dataset coco --model ssd300_vgg16 --epochs 120\
--lr-steps 80 110 --aspect-ratio-group-factor 3 --lr 0.002 --batch-size 4\
--weight-decay 0.0005 --data-augmentation ssd
@@ -58,7 +58,7 @@ python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
### SSDlite320 MobileNetV3-Large
```
-python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
+torchrun --nproc_per_node=8 train.py\
--dataset coco --model ssdlite320_mobilenet_v3_large --epochs 660\
--aspect-ratio-group-factor 3 --lr-scheduler cosineannealinglr --lr 0.15 --batch-size 24\
--weight-decay 0.00004 --data-augmentation ssdlite
@@ -67,7 +67,7 @@ python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
### Mask R-CNN
```
-python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
+torchrun --nproc_per_node=8 train.py\
--dataset coco --model maskrcnn_resnet50_fpn --epochs 26\
--lr-steps 16 22 --aspect-ratio-group-factor 3
```
@@ -75,7 +75,7 @@ python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
### Keypoint R-CNN
```
-python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
+torchrun --nproc_per_node=8 train.py\
--dataset coco_kp --model keypointrcnn_resnet50_fpn --epochs 46\
--lr-steps 36 43 --aspect-ratio-group-factor 3
```
@@ -14,30 +14,30 @@ You must modify the following flags:
## fcn_resnet50
```
-python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --lr 0.02 --dataset coco -b 4 --model fcn_resnet50 --aux-loss
+torchrun --nproc_per_node=8 train.py --lr 0.02 --dataset coco -b 4 --model fcn_resnet50 --aux-loss
```
## fcn_resnet101
```
-python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --lr 0.02 --dataset coco -b 4 --model fcn_resnet101 --aux-loss
+torchrun --nproc_per_node=8 train.py --lr 0.02 --dataset coco -b 4 --model fcn_resnet101 --aux-loss
```
## deeplabv3_resnet50
```
-python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --lr 0.02 --dataset coco -b 4 --model deeplabv3_resnet50 --aux-loss
+torchrun --nproc_per_node=8 train.py --lr 0.02 --dataset coco -b 4 --model deeplabv3_resnet50 --aux-loss
```
## deeplabv3_resnet101
```
-python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --lr 0.02 --dataset coco -b 4 --model deeplabv3_resnet101 --aux-loss
+torchrun --nproc_per_node=8 train.py --lr 0.02 --dataset coco -b 4 --model deeplabv3_resnet101 --aux-loss
```
## deeplabv3_mobilenet_v3_large
```
-python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --dataset coco -b 4 --model deeplabv3_mobilenet_v3_large --aux-loss --wd 0.000001
+torchrun --nproc_per_node=8 train.py --dataset coco -b 4 --model deeplabv3_mobilenet_v3_large --aux-loss --wd 0.000001
```
## lraspp_mobilenet_v3_large
```
-python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --dataset coco -b 4 --model lraspp_mobilenet_v3_large --wd 0.000001
+torchrun --nproc_per_node=8 train.py --dataset coco -b 4 --model lraspp_mobilenet_v3_large --wd 0.000001
```
@@ -18,7 +18,7 @@ We assume the training and validation AVI videos are stored at `/data/kinectics4
Run the training on a single node with 8 GPUs:
```bash
-python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --data-path=/data/kinectics400 --train-dir=train --val-dir=val --batch-size=16 --cache-dataset --sync-bn --apex
+torchrun --nproc_per_node=8 train.py --data-path=/data/kinectics400 --train-dir=train --val-dir=val --batch-size=16 --cache-dataset --sync-bn --apex
```
**Note:** all our models were trained on 8 nodes with 8 V100 GPUs each for a total of 64 GPUs. Expected training time for 64 GPUs is 24 hours, depending on the storage solution.