1. 02 Sep, 2022 1 commit
    • Enhance timeout cleanup to avoid possible hanging (#405) · 8afaa376
      Yifan Xiong authored
      
      __Major Revisions__
      * Skip postprocessing (mainly torch.distributed.barrier and process group destruction) when an exception occurs (e.g., timeout or GPU crash), so that subprocesses do not hang.
      * Add cleanup to kill sb exec processes when the Ansible run fails for a benchmark.
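
The first revision above can be illustrated with a minimal sketch. This is not the code from the commit: the class name `SketchBenchmark`, its methods, and the `calls` list are hypothetical, and the collective calls are only mentioned in comments so the example runs without PyTorch. The idea is that postprocessing, which may block on a collective operation, runs only when the benchmark body finished without an exception.

```python
class SketchBenchmark:
    """Hypothetical benchmark wrapper; all names here are illustrative."""

    def __init__(self, fail=False):
        self.fail = fail   # simulate a timeout inside the benchmark body
        self.calls = []    # record which stages actually ran

    def _preprocess(self):
        self.calls.append('preprocess')

    def _benchmark(self):
        self.calls.append('benchmark')
        if self.fail:
            raise TimeoutError('benchmark timed out')

    def _postprocess(self):
        # In a distributed run this is where a barrier and the process
        # group teardown would happen. Both can block forever if a peer
        # rank has already crashed or timed out.
        self.calls.append('postprocess')

    def run(self):
        try:
            self._preprocess()
            self._benchmark()
        except Exception:
            # Skip postprocessing on any failure to avoid hanging.
            return False
        self._postprocess()
        return True


if __name__ == '__main__':
    for fail in (False, True):
        bench = SketchBenchmark(fail=fail)
        print(bench.run(), bench.calls)
```

Keeping `_postprocess` outside the `try` block, rather than in a `finally` clause, is the point of the sketch: a `finally` clause would run the blocking teardown even after a failure.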
      
      __Minor Revisions__
      * Reduce the extra Ansible timeout from 300s to 60s.