    Do distributed init after data loading · ec6f8ef9
    Myle Ott authored
    Summary:
    FACEBOOK
    
    This switches back to torch.multiprocessing.spawn, instead of launching fb_train.par directly in a separate subprocess. This has the advantage that exceptions raised in worker processes are propagated properly to the parent. It also moves the distributed_init step to after data loading, which gets around the timeout issue.
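
    The exception-propagation benefit can be illustrated with a stdlib analogue (a sketch, not the fairseq code): torch.multiprocessing.spawn re-raises worker exceptions in the parent, much like ProcessPoolExecutor surfaces them through Future.result(), whereas a detached subprocess would only report an exit code.

    ```python
    # Illustrative sketch using the stdlib, not the actual fairseq/fb_train code.
    # The worker and rank count here are hypothetical.
    import concurrent.futures

    def worker(rank):
        # Hypothetical worker: one "rank" fails to show propagation.
        if rank == 1:
            raise RuntimeError(f"worker {rank} failed")
        return rank

    def run_workers(nprocs=2):
        """Run workers; the first child exception is re-raised in the caller,
        mirroring the behavior torch.multiprocessing.spawn provides."""
        with concurrent.futures.ProcessPoolExecutor(max_workers=nprocs) as pool:
            futures = [pool.submit(worker, rank) for rank in range(nprocs)]
            # result() re-raises any exception raised in the child process,
            # so the parent sees a real traceback instead of an exit code.
            return [f.result() for f in futures]
    ```

    With a plain subprocess, the parent would have to parse stderr or inspect the return code to learn that a worker died; here the error object itself crosses the process boundary.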
    
    The downside of this approach is that it's not so easy to pipe stdout to multiple places, which was nice when using the sweep.py scripts. I'm still working on a fix for that.
    
    Reviewed By: rutyrinott, ngoyal2707
    
    Differential Revision: D13873224
    
    fbshipit-source-id: 08d593233b8d23590c01c723363630a79804a8b0