Commit 7a3a4502 authored by Yuting Jiang, committed by GitHub

Bug: Fix Bug - Add barrier before 'destroy_process_group' in model benchmarks (#198)

**Description**
Add a barrier before 'destroy_process_group' to fix a bug that occurs when one model benchmark runs multiple models: some processes have not yet finished with the previous process group while others are already trying to initialize a new process group for the next model, and that initialization fails. The failure was observed on rocm4.x when running bert_models.

**Major Revision**
-  Add barrier before 'destroy_process_group'.
parent 1f9de77f
@@ -174,6 +174,7 @@ def _postprocess(self):
         try:
             if self._args.distributed_impl == DistributedImpl.DDP:
+                torch.distributed.barrier()
                 torch.distributed.destroy_process_group()
         except BaseException as e:
             self._result.set_return_code(ReturnCode.DISTRIBUTED_SETTING_DESTROY_FAILURE)
......
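
The pattern in this change can be sketched in isolation as follows. This is a minimal illustration, not the benchmark's actual code: the helper name `teardown_process_group` and its `dist` parameter are hypothetical, introduced only so the call order can be checked without launching multiple processes. It assumes the default process group was created earlier with `torch.distributed.init_process_group`.

```python
def teardown_process_group(dist=None):
    """Synchronize all ranks, then destroy the default process group.

    `dist` defaults to torch.distributed. It is a parameter (hypothetical,
    not part of the original change) so a stand-in object can be passed
    when checking the call order.
    """
    if dist is None:
        # Imported lazily so the module loads even without PyTorch installed.
        import torch.distributed as dist

    # Nothing to tear down if no process group was ever initialized.
    if not (dist.is_available() and dist.is_initialized()):
        return False

    # Wait until every rank reaches this point, so no rank destroys the
    # group (and moves on to the next model) while others still use it.
    dist.barrier()
    dist.destroy_process_group()
    return True
```

Each rank would call `teardown_process_group()` after finishing one model and before calling `init_process_group` for the next. The barrier only helps if every rank reaches it; a rank that crashed earlier would leave the others blocked until the backend's timeout.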