    Take a dummy train step under OOM to keep multiprocessing in sync · 6c006a34
    Halil Akin authored
    Summary: This is not a guaranteed fix: processes can still get out of sync if the OOM happens after an all_gather/all_reduce has already run. It should still make multiprocessing training more robust in practice, since OOMs usually seem to occur early enough in the step.
    
    Reviewed By: myleott
    
    Differential Revision: D13086018
    
    fbshipit-source-id: feb1b01c2eb8818797cfdabc0faac8056ba1b4ee
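
The idea in the summary can be sketched roughly as follows. This is a minimal, framework-free illustration and not the code from train.py: `OOMSafeTrainer`, `forward_backward`, `all_reduce`, and `dummy_batch` are hypothetical names, and zeroing the dummy gradients is an assumption about how a dummy step would avoid affecting the update. The point it shows is that a worker that hits an OOM still reaches the collective call, so the other workers are not left waiting.

```python
class OOMSafeTrainer:
    """Hypothetical sketch of a train step that stays in sync under OOM."""

    def __init__(self, forward_backward, all_reduce, dummy_batch):
        self.forward_backward = forward_backward  # returns a list of gradients
        self.all_reduce = all_reduce              # stand-in for the collective op
        self.dummy_batch = dummy_batch            # small batch assumed to fit in memory
        self.skipped = 0                          # number of real batches dropped

    def train_step(self, batch):
        try:
            grads = self.forward_backward(batch)
            weight = 1.0
        except RuntimeError as err:
            # Only swallow out-of-memory errors; re-raise anything else.
            if "out of memory" not in str(err):
                raise
            self.skipped += 1
            # Dummy step: run a batch that fits, but zero its contribution
            # (assumption) so it does not change the model update.
            grads = self.forward_backward(self.dummy_batch)
            weight = 0.0
        # Every worker reaches this call exactly once per step, OOM or not.
        return self.all_reduce([g * weight for g in grads])
```

As the summary notes, this only helps when the OOM is raised before the collective call; an OOM raised inside or after it is not covered by this sketch.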
train.py 13.3 KB