Enhance timeout cleanup to avoid possible hanging (#405)
__Major Revisions__

* Skip postprocess (mainly `torch.distributed.barrier` and process group destroy) when an exception occurs (e.g., timeout or GPU crash), so that subprocesses do not hang.
* Add cleanup that kills `sb exec` processes when the Ansible run fails for a benchmark.

__Minor Revisions__

* Reduce the extra Ansible timeout from 300s to 60s.
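
The first major revision can be illustrated with a minimal sketch. It is not the code from this commit: `run_with_guarded_postprocess` and its two callback parameters are hypothetical names, and `postprocess` only stands in for collective cleanup such as a barrier followed by destroying the process group. The point is that collective cleanup runs only when the benchmark itself finished without raising.

```python
from typing import Callable


def run_with_guarded_postprocess(
    benchmark: Callable[[], None],
    postprocess: Callable[[], None],
) -> bool:
    """Run `benchmark`, then `postprocess` only if the benchmark succeeded.

    Hypothetical helper for illustration. `postprocess` represents
    collective cleanup (e.g. a barrier, then process group destroy).
    """
    try:
        benchmark()
    except Exception as err:
        # After a timeout or GPU crash, a peer rank may be dead or stuck,
        # so a barrier here could block forever. Skip collective cleanup
        # and report the failure instead.
        print(f"benchmark failed, skipping postprocess: {err}")
        return False
    postprocess()
    return True
```

Under these assumptions, a benchmark that raises `TimeoutError` returns `False` without ever calling `postprocess`, so the caller can exit or be killed without waiting on other ranks.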