    Fix save_dir creation while training on multiple nodes (#626) · 94e9d77c
    Kartikay Khandelwal authored
    Summary:
    Pull Request resolved: https://github.com/pytorch/fairseq/pull/626
    
    While training a model on multiple GPUs, the current fairseq train workflow fails while creating the directory from which to load a checkpoint. This appears to happen because multiple nodes attempt to create the same directory at the same time, which interacts badly with the os.makedirs option "exist_ok=True". Fix this by making sure that only rank 0 creates the directory.
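
The rank-0-only approach described above can be sketched as follows. This is a minimal illustration, not the actual change to train.py: the helper name `ensure_save_dir` and its `rank` parameter are hypothetical, and the sketch assumes the caller already knows its distributed rank.

```python
import os


def ensure_save_dir(save_dir: str, rank: int) -> bool:
    """Create save_dir on rank 0 only (hypothetical helper, not fairseq's API).

    Returns True if this process issued the makedirs call, False otherwise.
    """
    if rank != 0:
        # Non-zero ranks skip creation, so they never race on the same path.
        return False
    # exist_ok=True keeps repeated runs on rank 0 from failing.
    os.makedirs(save_dir, exist_ok=True)
    return True
```

If other ranks read from the directory immediately afterwards, they may need a synchronization point before doing so; that is outside the scope of this sketch.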
    
    Reviewed By: myleott
    
    Differential Revision: D14841304
    
    fbshipit-source-id: c9b73ba804de97e2cb19a616189fefce476d8c74
train.py 15.4 KB