Commit 94e9d77c authored by Kartikay Khandelwal, committed by Facebook Github Bot

Fix save_dir creation while training on multiple nodes (#626)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/626

When training a model on multiple GPUs, the current fairseq train workflow fails while creating the directory from which to load a checkpoint. This seems to happen because multiple nodes attempt to create the same directory at the same time, which interacts badly with the os.makedirs option "exist_ok=True". The fix is to make sure only rank 0 creates this directory.

Reviewed By: myleott

Differential Revision: D14841304

fbshipit-source-id: c9b73ba804de97e2cb19a616189fefce476d8c74
parent 34028c63
......@@ -342,7 +342,11 @@ def save_checkpoint(args, trainer, epoch_itr, val_loss):
def load_checkpoint(args, trainer, epoch_itr):
    """Load a checkpoint and replay dataloader to match."""
    # Only rank 0 should attempt to create the required dir
    if args.distributed_rank == 0:
        os.makedirs(args.save_dir, exist_ok=True)
    if os.path.isabs(args.restore_file):
        checkpoint_path = args.restore_file
    else:
......
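
The rank-0 guard in the hunk above can be tried in isolation. The following is a minimal sketch, not fairseq code: `ensure_save_dir` is a hypothetical helper name, and the rank is passed in as a plain integer instead of being read from `args.distributed_rank`. It does no synchronization between ranks; whether other ranks need to wait for the directory is outside what this commit shows.

```python
import os
import tempfile


def ensure_save_dir(save_dir: str, distributed_rank: int) -> bool:
    """Create save_dir on rank 0 only (hypothetical helper, not part of fairseq).

    Returns True if this process made the os.makedirs call, False otherwise.
    """
    if distributed_rank != 0:
        # Non-zero ranks never touch the filesystem here.
        return False
    os.makedirs(save_dir, exist_ok=True)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "checkpoints")
        # Simulate ranks 1..3: none of them should create the directory.
        for rank in (1, 2, 3):
            ensure_save_dir(target, rank)
        print(os.path.isdir(target))
        # Rank 0 creates it; calling again is harmless because of exist_ok=True.
        ensure_save_dir(target, 0)
        ensure_save_dir(target, 0)
        print(os.path.isdir(target))
```

Returning a boolean is a choice made only for this sketch, so that callers and tests can tell which process attempted the creation; the committed change simply wraps the existing `os.makedirs` call in the rank check.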