Use TMA to optimize internode dispatch. (#276)

* Add TMA buffer allocation
* Use TMA for forwarders and NVL receivers
* Use lane 31 to operate TMA.
* Change rdma buffer layout.
* Use TMA to transfer scales also.
* Increase the NVL recv buffer size.
* Disable early stopping.
* Apply similar optimizations on receiver warps.
* Prevent warp divergence.
* Disable aggressive ptx by default.
* Revert using TMA to transfer scales.
* Format.
* Change the layout of dispatch NVL buffer.
* Move topk transformation to recv warps.
* Use TMA to transfer all data in forward warps
* Use TMA to store scales.
* Code lint

---------

Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

Commit a2fa3b73 (unverified), authored Jul 04, 2025 by Shangyan Zhou, committed by GitHub Jul 04, 2025

Showing with 118 additions and 90 deletions

csrc/kernels/internode.cu +116 -88
setup.py +1 -1
tests/test_internode.py +1 -1

--- a/csrc/kernels/internode.cu
+++ b/csrc/kernels/internode.cu
--- a/setup.py
+++ b/setup.py
@@ -55,7 +55,7 @@ if __name__ == '__main__':
        os.environ['DISABLE_AGGRESSIVE_PTX_INSTRS'] = '1'
    # Disable aggressive PTX instructions
-    if int(os.getenv('DISABLE_AGGRESSIVE_PTX_INSTRS', '0')):
+    if int(os.getenv('DISABLE_AGGRESSIVE_PTX_INSTRS', '1')):
        cxx_flags.append('-DDISABLE_AGGRESSIVE_PTX_INSTRS')
        nvcc_flags.append('-DDISABLE_AGGRESSIVE_PTX_INSTRS')
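
The effect of the default change above can be sketched in isolation. This is a minimal illustration, not part of setup.py: the helper `aggressive_ptx_disabled` and its `env` and `default` parameters are hypothetical names introduced here, and the sketch only mirrors the `int(os.getenv(...))` truthiness check shown in the hunk. With the new default of '1', the define is added unless the variable is explicitly set to '0'.

```python
import os


def aggressive_ptx_disabled(env=None, default='1'):
    """Hypothetical helper mirroring the setup.py check.

    The variable is parsed as an integer and tested for truthiness,
    so '0' means "keep aggressive PTX" and any non-zero integer
    means "disable it".
    """
    env = os.environ if env is None else env
    return bool(int(env.get('DISABLE_AGGRESSIVE_PTX_INSTRS', default)))


# Variable unset, new default '1': the -D flag would be appended.
print(aggressive_ptx_disabled({}))
# Explicit opt-in to aggressive PTX: the -D flag would be skipped.
print(aggressive_ptx_disabled({'DISABLE_AGGRESSIVE_PTX_INSTRS': '0'}))
# Variable unset, old default '0': the -D flag would be skipped.
print(aggressive_ptx_disabled({}, default='0'))
```

Under this reading, a build that still wants the aggressive instructions would need to export `DISABLE_AGGRESSIVE_PTX_INSTRS=0` before running setup.py, provided the preceding branch in setup.py (only partly visible in this hunk) does not override it.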

--- a/tests/test_internode.py
+++ b/tests/test_internode.py
@@ -234,7 +234,7 @@ def test_loop(local_rank: int, num_local_ranks: int, args: argparse.Namespace):
    num_sms = 24
    num_qps_per_rank = max(num_sms, ll_num_experts // num_ranks if args.test_ll_compatibility else 0)
-    buffer = deep_ep.Buffer(group, int(1e9), int(1e9), low_latency_mode=args.test_ll_compatibility,
+    buffer = deep_ep.Buffer(group, int(2e9), int(1e9), low_latency_mode=args.test_ll_compatibility,
                            num_qps_per_rank=num_qps_per_rank)
    assert num_local_ranks == 8 and num_ranks > 8
    torch.manual_seed(rank)