Fix for data error and kernel hung because of inflight rdma channel head update

When dispatch/combine, neither sender nor receiver waits for the finish of the rdma channel head update, which may result in the remaining inflight head update wqes even after the kernel finished. Once the infight wqes arrive after the rdma channel head buffer cleaning for the next round of dispatch/combine, the rdma channel head buffer will be re- written to a none-zero value. The rdma sender can reuse the data buffer before the rdma receivers consume the date buffer because of the wrong rdma channel head, cauing date error and kernel hung. For performance considering, to overlap the inflight wqes' RTT, fix this issue by waiting for all previous inflight wqes to complete before cleaning rdma buffers in the next round of dispatch/combine. Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>

Fix for data error and kernel hung because of inflight rdma channel head update
When dispatch/combine, neither sender nor receiver waits for the finish of the rdma channel head update, which may result in the remaining inflight head update wqes even after the kernel finished. Once the infight wqes arrive after the rdma channel head buffer cleaning for the next round of dispatch/combine, the rdma channel head buffer will be re- written to a none-zero value. The rdma sender can reuse the data buffer before the rdma receivers consume the date buffer because of the wrong rdma channel head, cauing date error and kernel hung. For performance considering, to overlap the inflight wqes' RTT, fix this issue by waiting for all previous inflight wqes to complete before cleaning rdma buffers in the next round of dispatch/combine. Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
b65b22ed · Guangguan · Guangguan Wang · 0eee87b8 · b65b22ed
Commit b65b22ed authored Jul 15, 2025 by Guangguan Committed by Guangguan Wang Jul 17, 2025
Show whitespace changes
Inline Side-by-side

Showing with 20 additions and 0 deletions

csrc/kernels/internode.cu csrc/kernels/internode.cu +20 -0

No files found.
--- a/csrc/kernels/internode.cu
+++ b/csrc/kernels/internode.cu
@@ -102,6 +102,17 @@ notify_dispatch(const int* num_tokens_per_rank, int* moe_recv_counter_mapped, in
        // Global barrier: the first warp does intra-node sync, the second warp does internode sync
        EP_DEVICE_ASSERT(num_warps > 1);
        EP_DEVICE_ASSERT(kNumRDMARanks <= num_threads);
+
+        // waiting for all previous inflight wrs to complete,
+        // in case of rewriting cleared rdma_buffer
+        auto qps_per_rdma_rank = ibgda_get_state()->num_rc_per_pe * ibgda_get_state()->num_devices_initialized;
+        for (int i = thread_id; i < qps_per_rdma_rank * (kNumRDMARanks - 1); i += num_threads) {
+            auto dst_rdma_rank = (i / qps_per_rdma_rank + rdma_rank + 1) % kNumRDMARanks;
+            auto qp_id = i % qps_per_rdma_rank;
+            nvshmemi_ibgda_quiet(translate_dst_rdma_rank<kLowLatencyMode>(dst_rdma_rank, nvl_rank), qp_id);
+        }
+        __syncthreads();
+
        if (thread_id == 32)
            nvshmem_sync_with_same_gpu_idx<kLowLatencyMode>(rdma_team);
        barrier_block<NUM_MAX_NVL_PEERS, true>(barrier_signal_ptrs, nvl_rank);
@@ -1044,9 +1055,18 @@ __global__ void cached_notify(const int rdma_clean_offset, const int rdma_num_in

    auto nvl_rank = rank % NUM_MAX_NVL_PEERS;
    auto num_rdma_ranks = num_ranks / NUM_MAX_NVL_PEERS;
+    auto rdma_rank = rank / NUM_MAX_NVL_PEERS;

    // Using two SMs, which clean the RDMA/NVL buffer respectively
    if (sm_id == 0) {
+        auto qps_per_rdma_rank = ibgda_get_state()->num_rc_per_pe * ibgda_get_state()->num_devices_initialized;
+        for (int i = thread_id; i < qps_per_rdma_rank * (num_rdma_ranks - 1); i += num_threads) {
+            auto dst_rdma_rank = (i / qps_per_rdma_rank + rdma_rank + 1) % num_rdma_ranks;
+            auto qp_id = i % qps_per_rdma_rank;
+            nvshmemi_ibgda_quiet(translate_dst_rdma_rank<kLowLatencyMode>(dst_rdma_rank, nvl_rank), qp_id);
+        }
+        __syncthreads();
+
        // Barrier for RDMA
        if (thread_id == 32)
            nvshmem_sync_with_same_gpu_idx<kLowLatencyMode>(rdma_team);