[PyTorch] Fixing hang in `initialize_ub()` for multi-node runs after PR901 removal of MPI-dependence (#986)

* Re-implementing PR901 (removing MPI-dependence in Userbuffers) with multi-node fixes
* passing data-parallel rank/size info from torch.distributed to userbuffers
* multi-node example working with UB_SKIPMC=1 but not with multicast
* fixed multi-node hang in initialize_ub(), updated comm+GEMM overlap example to support multi-node mixed tensor/data parallelism, added README
* fixed use case when Userbuffers is asked to allocate the TP overlap buffer with UB_SKIPMC=1
* corrected example problem to set device by local ordinal instead of global process rank
* double-free fix in userbuffers destructor
* removed unnecessary and incorrect torch.cuda.set_device(...)
* corrected inter-node ranks logic
* generalized node ID logic in initialize_ub to handle arbitrary world rank layouts within node
* added single-node comm+GEMM overlap unit tests
* LayerNormMLP example confirmed working with 2 nodes on Eos
* unit test cleanup
* corrected DP group ranks logic in LNMLP comm+GEMM overlap example
* corrected enums in unit test
* fixed incorrect Ubuf object init signature
* switched default backend for Userbuffer bootstrapping to Gloo with MPI and NCCL fallbacks, and initialize_ub option to manually select backend
* fixed all comm+GEMM overlap unit tests
* corrected all_gather use for Gloo backend
* changed userbuffers allgather callback to always use all_gather() instead of all_gather_into_tensor()
* restored and verified old MPI-based bootstrapping via NVTE_UB_WITH_MPI=1 option at compile time
* disabled scoped GIL release for comm+GEMM overlap algorithms
* avoid dist.init_device_mesh in comm+GEMM overlap example to support older PyTorch versions
* applied RS overlap FP8 fix from PR1004
* fixed segfault in Userbuffers destructor
* corrected comm+GEMM overlap unit test arguments
* fixed unit test run command for when Userbuffers is compiled with MPI
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Refactored torch.distributed collectives into pure C++ callbacks
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci

---------

Signed-off-by: Alp Dener <adener@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
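
Note on the "node ID logic" and "set device by local ordinal" items above: the idea can be illustrated with a small, framework-free sketch. This is not the code from this commit. The function name `node_layout` and the use of hostnames to identify nodes are assumptions made purely for illustration. Given the hostname reported by each world rank, it assigns a node ID in order of first appearance and a local ordinal within each node, without assuming that ranks on the same node are contiguous.

```python
def node_layout(hostnames):
    """Map each world rank to (node_id, local_rank).

    hostnames[i] is the hostname reported by world rank i (hypothetical input;
    the real initialize_ub() may derive node membership differently).
    """
    node_ids = {}      # hostname -> node ID, in order of first appearance
    local_counts = {}  # hostname -> number of ranks seen so far on that host
    layout = []
    for host in hostnames:
        # len(node_ids) is evaluated before insertion, so new hosts get 0, 1, 2, ...
        node_id = node_ids.setdefault(host, len(node_ids))
        local_rank = local_counts.get(host, 0)
        local_counts[host] = local_rank + 1
        layout.append((node_id, local_rank))
    return layout


# Interleaved layout: ranks 0 and 2 live on one node, ranks 1 and 3 on another.
layout = node_layout(["nodeA", "nodeB", "nodeA", "nodeB"])
```

In a real job the hostname list could be collected with a collective such as `torch.distributed.all_gather_object`, and the resulting local ordinal (rather than the global rank) would be the value passed to `torch.cuda.set_device`.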