Commit 7ffea978 authored by Deepak Narayanan's avatar Deepak Narayanan

Use torch.cuda.synchronize() right after calling the batch_isend_irecv() communication API

parent 2096d356
@@ -351,6 +351,8 @@ def communicate(tensor_send_next, tensor_send_prev, recv_forward, recv_backward)
     reqs = torch.distributed.batch_isend_irecv(ops)
     for req in reqs:
         req.wait()
+    # Temporary workaround for batch_isend_irecv() race condition.
+    torch.cuda.synchronize()
     return tensor_recv_prev, tensor_recv_next
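For context, here is a minimal, self-contained sketch of the pattern this commit patches. The helper name exchange_with_neighbors, the rank arguments, and the tensor handling are illustrative assumptions, not part of the commit; only the batch_isend_irecv() / wait() / torch.cuda.synchronize() sequence comes from the diff above.

import torch
import torch.distributed as dist

def exchange_with_neighbors(tensor_send_next, tensor_recv_prev,
                            next_rank, prev_rank):
    # Hypothetical helper: batch the point-to-point sends/receives so
    # the backend can launch them together.
    ops = []
    if tensor_send_next is not None:
        ops.append(dist.P2POp(dist.isend, tensor_send_next, next_rank))
    if tensor_recv_prev is not None:
        ops.append(dist.P2POp(dist.irecv, tensor_recv_prev, prev_rank))
    reqs = dist.batch_isend_irecv(ops)
    for req in reqs:
        req.wait()
    # The commit's workaround: per its comment, wait() alone was not
    # sufficient due to a batch_isend_irecv() race condition, so the
    # device is synchronized explicitly.
    torch.cuda.synchronize()
    return tensor_recv_prev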