Unverified Commit 644b5395 authored by Yifan Xiong, committed by GitHub

Benchmark - Fix torch.dist init issue with multiple models (#495)

Fix a potential barrier timeout in init_process_group caused by a race
condition when the same port is reused. Switch to a different port for
each model when running multiple models sequentially in one process.
For example, when running vgg11/13/16/19, ports 29501~29504 will be used
respectively.
parent 5a88db16
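
For context, a minimal sketch of the pattern this commit adopts. The
run_benchmarks driver and the model list here are hypothetical, and a
launcher such as torchrun is assumed to have already set MASTER_ADDR,
RANK, and WORLD_SIZE in the environment:

    import datetime
    import os

    import torch.distributed as dist

    def run_benchmarks(models):
        """Run several models sequentially in one process (hypothetical driver)."""
        for name in models:
            # Advance MASTER_PORT before each rendezvous: the first model
            # gets 29501, the next 29502, and so on, so two consecutive
            # init_process_group calls never race on the same port.
            port = int(os.environ.get('MASTER_PORT', '29500')) + 1
            os.environ['MASTER_PORT'] = str(port)
            # Default init_method is env://, which reads MASTER_ADDR/MASTER_PORT.
            dist.init_process_group(backend='gloo', timeout=datetime.timedelta(seconds=300))
            try:
                pass  # benchmark `name` here
            finally:
                dist.destroy_process_group()

    run_benchmarks(['vgg11', 'vgg13', 'vgg16', 'vgg19'])  # uses ports 29501~29504
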
@@ -70,7 +70,8 @@ def _init_distributed_setting(self):
             )
             return False
         # torch >= 1.9.0a0 torch.distributed.elastic is used by default
-        port = int(os.environ['MASTER_PORT']) + 1
+        port = int(os.environ.get('MASTER_PORT', '29500')) + 1
+        os.environ['MASTER_PORT'] = str(port)
         addr = os.environ['MASTER_ADDR']
         self._global_rank = int(os.environ['RANK'])
         self._local_rank = int(os.environ['LOCAL_RANK'])
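
Falling back to the default 29500 when MASTER_PORT is unset, and writing
the incremented port back to the environment, means each subsequent
init_process_group call in the same process rendezvouses on a fresh port
instead of racing on the previous one.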