    Add reply files to d2go training processes · f0f55cdc
    Sudarshan Raghunathan authored
    Summary:
    This diff contains a minimal set of changes to support returning reply files to MAST.
    
    There are three parts:
    1. First, the main function is wrapped in a try/except that catches all "catchable" Python exceptions. Exceptions raised in C++ code and segfaults are not handled here.
    2. Each exception is then written to a per-process JSON reply file.
    3. At the end, all per-process files are stat-ed and the earliest file is copied to a location specified by MAST.
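
The three parts above can be sketched roughly as follows. This is a minimal illustration and not the code in this diff: the helper names (`write_reply_file`, `run_with_reply_file`, `copy_earliest_reply`), the `reply_<rank>.json` naming, the JSON fields, and the use of modification time to decide which file is "earliest" are all assumptions made for the example. The destination path stands in for whatever location MAST specifies.

```python
import json
import os
import shutil
import time
import traceback


def write_reply_file(reply_dir, rank, exc):
    """Write one exception to a per-process JSON file (hypothetical schema)."""
    path = os.path.join(reply_dir, f"reply_{rank}.json")
    payload = {
        "rank": rank,
        "type": type(exc).__name__,
        "message": str(exc),
        "traceback": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "timestamp": time.time(),
    }
    with open(path, "w") as f:
        json.dump(payload, f)
    return path


def run_with_reply_file(main_fn, reply_dir, rank):
    """Run main_fn; on a Python-level exception, record it and re-raise.

    Assumption: `except Exception` is used here, so KeyboardInterrupt and
    SystemExit pass through. Segfaults and C++ aborts never reach this handler.
    """
    try:
        return main_fn()
    except Exception as exc:
        write_reply_file(reply_dir, rank, exc)
        raise


def copy_earliest_reply(reply_dir, destination):
    """Stat every per-process file and copy the earliest one to destination.

    Assumption: "earliest" is taken to mean the smallest modification time.
    Returns the path that was copied, or None if no reply file exists.
    """
    candidates = [
        os.path.join(reply_dir, name)
        for name in os.listdir(reply_dir)
        if name.endswith(".json")
    ]
    if not candidates:
        return None
    earliest = min(candidates, key=lambda p: os.stat(p).st_mtime)
    shutil.copyfile(earliest, destination)
    return earliest
```

Picking the earliest file is a reasonable heuristic because the first process to fail usually holds the root cause, while later failures are often side effects of that first one.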
    
    # Limitations
    1. This only works when local processes are launched using multiprocessing (which is the default).
    2. If an error occurs in C++ code, it will likely not be caught in Python, and the reply file might not contain the correct logs.
    
    Differential Revision: D43097683
    
    fbshipit-source-id: 0eaf4f19f6199a9c77f2ce4c7d2bbc2a2078be99
train_net.py 7.27 KB