Fix save_dir creation while training on multiple nodes (#626)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/626

While training a model on multiple GPUs, the current fairseq train workflow fails while creating the directory from which a checkpoint is loaded. This appears to happen because multiple nodes attempt to create the same directory at once, which seems to interact badly with os.makedirs even when exist_ok=True is set. The fix is to make sure only rank 0 creates this directory.

Reviewed By: myleott

Differential Revision: D14841304

fbshipit-source-id: c9b73ba804de97e2cb19a616189fefce476d8c74

Commit 94e9d77c authored Apr 09, 2019 by Kartikay Khandelwal Committed by Facebook Github Bot Apr 09, 2019

Showing with 5 additions and 1 deletion

train.py +5 -1

--- a/train.py
+++ b/train.py
@@ -342,7 +342,11 @@ def save_checkpoint(args, trainer, epoch_itr, val_loss):

 def load_checkpoint(args, trainer, epoch_itr):
    """Load a checkpoint and replay dataloader to match."""
+
+    # Only rank 0 should attempt to create the required dir
+    if args.distributed_rank == 0:
-    os.makedirs(args.save_dir, exist_ok=True)
+        os.makedirs(args.save_dir, exist_ok=True)
+
    if os.path.isabs(args.restore_file):
        checkpoint_path = args.restore_file
    else:
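
The guard in the hunk above can be tried in isolation with the minimal sketch below. It is an illustration under assumptions, not fairseq code: `ensure_save_dir` is a hypothetical helper, and the rank is passed in as a plain integer rather than read from `args.distributed_rank`.

```python
import os
import tempfile


def ensure_save_dir(save_dir, distributed_rank):
    """Create save_dir on rank 0 only (hypothetical helper, not part of fairseq).

    Returns True if this process was responsible for creating the directory,
    False if it skipped creation because it is not rank 0.
    """
    if distributed_rank != 0:
        # Non-zero ranks never touch the path, so they cannot race with rank 0.
        return False
    # exist_ok=True keeps repeated calls on rank 0 harmless.
    os.makedirs(save_dir, exist_ok=True)
    return True


if __name__ == "__main__":
    # Simulate two ranks against a throwaway directory.
    root = tempfile.mkdtemp()
    save_dir = os.path.join(root, "checkpoints")
    for rank in (1, 0):
        created = ensure_save_dir(save_dir, rank)
        print(f"rank {rank}: created={created}, exists={os.path.isdir(save_dir)}")
```

One consequence of this design is that non-zero ranks rely on the directory already existing, or on rank 0 creating it, before they read from it. Whether any extra synchronization is needed depends on the surrounding training code and is not something this commit states.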