Unverified Commit 644b5395 authored by Yifan Xiong, committed by GitHub

Benchmark - Fix torch.dist init issue with multiple models (#495)

Fix a potential barrier timeout in init_process_group caused by a race
condition when the same port is reused. Switch to a different port for
each model when running multiple models sequentially in one process.
For example, when running vgg11/13/16/19, ports 29501~29504 will be used
respectively.
parent 5a88db16
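
For context, a minimal sketch of the pattern this commit adopts. The
run_benchmarks driver and the model list here are hypothetical, and a
launcher such as torchrun is assumed to have already set MASTER_ADDR,
RANK, and WORLD_SIZE in the environment:

    import datetime
    import os

    import torch.distributed as dist

    def run_benchmarks(models):
        """Run several models sequentially in one process (hypothetical driver)."""
        for name in models:
            # Advance MASTER_PORT before each rendezvous: the first model
            # gets 29501, the next 29502, and so on, so two consecutive
            # init_process_group calls never race on the same port.
            port = int(os.environ.get('MASTER_PORT', '29500')) + 1
            os.environ['MASTER_PORT'] = str(port)
            # Default init_method is env://, which reads MASTER_ADDR/MASTER_PORT.
            dist.init_process_group(backend='gloo', timeout=datetime.timedelta(seconds=300))
            try:
                pass  # benchmark `name` here
            finally:
                dist.destroy_process_group()

    run_benchmarks(['vgg11', 'vgg13', 'vgg16', 'vgg19'])  # uses ports 29501~29504
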
@@ -70,7 +70,8 @@ def _init_distributed_setting(self):
             )
             return False
         # torch >= 1.9.0a0 torch.distributed.elastic is used by default
-        port = int(os.environ['MASTER_PORT']) + 1
+        port = int(os.environ.get('MASTER_PORT', '29500')) + 1
+        os.environ['MASTER_PORT'] = str(port)
         addr = os.environ['MASTER_ADDR']
         self._global_rank = int(os.environ['RANK'])
         self._local_rank = int(os.environ['LOCAL_RANK'])
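
Falling back to the default 29500 when MASTER_PORT is unset, and writing
the incremented port back to the environment, means each subsequent
init_process_group call in the same process rendezvouses on a fresh port
instead of racing on the previous one.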