Commit 30d5ca55 authored by Yuxin Wu, committed by Facebook GitHub Bot

avoid warnings of NCCL

Summary:
Pull Request resolved: https://github.com/facebookresearch/detectron2/pull/3322

avoid warnings like the following:
```
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by
this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is
incorrect. Specify device_ids in barrier() to force use of a particular device.
```

This may also fix the hang reported in https://github.com/facebookresearch/detectron2/issues/3319
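
The warning above names one remedy itself: pass `device_ids` to `barrier()`. The following is a minimal sketch of that idea, not the code in this commit. `barrier_kwargs` and `synchronize_on_known_device` are hypothetical helper names and are not part of detectron2; the sketch assumes a PyTorch version whose `torch.distributed.barrier` accepts `device_ids`, and that the process group has already been initialized.

```python
from typing import Any, Dict


def barrier_kwargs(backend: str, local_rank: int) -> Dict[str, Any]:
    """Extra keyword arguments for torch.distributed.barrier (hypothetical helper)."""
    # device_ids is only meaningful for NCCL; other backends get no extra arguments.
    if backend.lower() == "nccl":
        return {"device_ids": [local_rank]}
    return {}


def synchronize_on_known_device(backend: str, local_rank: int) -> None:
    """Barrier across all ranks after telling NCCL which GPU this rank uses."""
    # Imported lazily so barrier_kwargs can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    if not (dist.is_available() and dist.is_initialized()):
        return  # nothing to synchronize in a single-process run
    if backend.lower() == "nccl":
        # Bind this process to its GPU before any collective runs.
        torch.cuda.set_device(local_rank)
    dist.barrier(**barrier_kwargs(backend, local_rank))
```

Keeping the keyword selection in a separate function means a non-NCCL backend such as Gloo never receives `device_ids`.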

Reviewed By: vaibhava0

Differential Revision: D30077957

fbshipit-source-id: b8827e66c5eecc06b650acde2e7ff44106327f69
parent b9209b69
@@ -136,12 +136,6 @@ def _distributed_worker(
     except Exception as e:
         logger.error("Process group URL: {}".format(dist_url))
         raise e
-    # synchronize is needed here to prevent a possible timeout after calling
-    # init_process_group
-    # See: https://github.com/facebookresearch/maskrcnn-benchmark/issues/172
-    comm.synchronize()
-    if backend in ["NCCL"]:
-        torch.cuda.set_device(local_rank)
     # Setup the local process group (which contains ranks within the same machine)
     assert comm._LOCAL_PROCESS_GROUP is None
@@ -154,6 +148,14 @@ def _distributed_worker(
         if i == machine_rank:
             comm._LOCAL_PROCESS_GROUP = pg
+    # synchronize is needed here to prevent a possible timeout after calling
+    # init_process_group
+    # See: https://github.com/facebookresearch/maskrcnn-benchmark/issues/172
+    comm.synchronize()
+    if backend in ["NCCL"]:
+        torch.cuda.set_device(local_rank)
     ret = main_func(*args)
     if global_rank == 0:
         logger.info(
......
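
The context comment in the diff mentions a local process group that "contains ranks within the same machine", but the lines that pick those ranks are not shown. As an illustration only, here is a sketch of one plausible partition, assuming global ranks are assigned contiguously per machine. The function name `local_group_ranks` and its parameters are hypothetical and not taken from this commit.

```python
from typing import List


def local_group_ranks(world_size: int, num_gpus_per_machine: int) -> List[List[int]]:
    """Partition global ranks into one contiguous group per machine."""
    if world_size % num_gpus_per_machine != 0:
        raise ValueError("world_size must be a multiple of num_gpus_per_machine")
    num_machines = world_size // num_gpus_per_machine
    return [
        list(range(i * num_gpus_per_machine, (i + 1) * num_gpus_per_machine))
        for i in range(num_machines)
    ]


groups = local_group_ranks(world_size=8, num_gpus_per_machine=4)
print(groups)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Under this assumption, every rank would build the same list of groups and keep only the one whose index equals its `machine_rank`.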