Commits · 3b1045db43ed232691d32dde0517dfed571b0a65 · OpenDAS / DeepEP

22 Apr, 2025 3 commits
- Several code lints · edbb1bc3
  Chenggang Zhao authored Apr 22, 2025
- Normal kernels always use IBGDA mode. · 3e54b78f
  Shangyan Zhou authored Apr 22, 2025
- Refactor some code. · 20b2aaaf
  Shangyan Zhou authored Apr 22, 2025
21 Apr, 2025 2 commits

- Revert `ibgda_device.cuh` and remove some comments. · e2c57848
  Shangyan Zhou authored Apr 21, 2025

- In the Internode Normal Kernel, when using nvshmem ibrc for RDMA data... · 5ab80c28
  moningchen authored Apr 21, 2025

In the Internode Normal Kernel, when NVSHMEM IBRC is used for RDMA data transmission, a single QP carries all data between two GPUs, which limits kernel performance on dual-port NICs and in RoCE networks.

In our optimized Internode Normal Kernel, we use multiple QPs for data transmission between two GPUs, assigning a different QP to each channel. Additionally, we changed the transmission method from IBRC to IBGDA.

With these optimizations, the Internode Normal Kernel achieves optimal performance in both H800 and H20 environments, with RDMA transmission performance approaching the physical network limit. Using the current default statistical method, RDMA performance reaches 60 GB/s+ in 4-node H800 and H20 environments.
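The difference between one shared QP and one QP per channel can be sketched as follows. This is only an illustration of the idea in the commit message, not the kernel code: `qp_index_for_channel`, the modulo mapping, and the channel count of 10 are assumptions chosen for the example.

```python
def qp_index_for_channel(channel_id: int, num_qps_per_peer: int) -> int:
    """Pick the QP a channel uses when sending to one peer GPU.

    Hypothetical helper: the modulo mapping is an assumption for
    illustration, not taken from the DeepEP sources.
    """
    if channel_id < 0 or num_qps_per_peer <= 0:
        raise ValueError("channel_id must be >= 0 and num_qps_per_peer > 0")
    return channel_id % num_qps_per_peer


num_channels = 10  # illustrative value

# One QP per peer: every channel is funneled through QP 0.
single_qp = [qp_index_for_channel(c, 1) for c in range(num_channels)]

# One QP per channel: each channel gets its own QP.
per_channel_qp = [qp_index_for_channel(c, num_channels) for c in range(num_channels)]

print("single QP:     ", single_qp)
print("QP per channel:", per_channel_qp)
```

With a single QP all channels share one send queue; with one QP per channel the traffic between the same pair of GPUs is spread over independent queues.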

07 Apr, 2025 1 commit
- Remove useless control metadata for low-latency combine · 42494864
  Chenggang Zhao authored Apr 07, 2025
28 Mar, 2025 2 commits

- Fix compilation · c4d12b4f
  Chenggang Zhao authored Mar 28, 2025

- For the SMs which calculate metadata in notify_dispatch, each warp in the SM... · 4dd1e68a
  songhexiang authored Mar 28, 2025

For the SMs that calculate metadata in notify_dispatch, each warp in the SM calculates the metadata of one channel. The default configuration is 8 warps for 10 channels, which requires two loop iterations. The number of warps could instead be set to the number of channels so that a single iteration is enough.
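The iteration count described above is a ceiling division of channels by warps. A minimal sketch of that arithmetic, assuming each warp takes one channel per iteration in a strided pattern (the function names and the strided assignment are hypothetical, not taken from the kernel):

```python
def metadata_rounds(num_channels: int, num_warps: int) -> int:
    """Loop iterations needed when each warp handles one channel per iteration."""
    if num_channels < 0 or num_warps <= 0:
        raise ValueError("num_channels must be >= 0 and num_warps > 0")
    return -(-num_channels // num_warps)  # ceiling division


def channels_per_round(num_channels: int, num_warps: int) -> list[list[int]]:
    """Channels covered in each iteration, assuming warp w takes channel
    round * num_warps + w (an assumed assignment, for illustration only)."""
    return [
        list(range(start, min(start + num_warps, num_channels)))
        for start in range(0, num_channels, num_warps)
    ]


print(metadata_rounds(10, 8))     # 8 warps, 10 channels
print(channels_per_round(10, 8))  # second iteration only has 2 channels left
print(metadata_rounds(10, 10))    # warps matched to channels
```

With 8 warps and 10 channels the second iteration keeps only 2 of the 8 warps busy, which is the inefficiency the commit message points at.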

27 Mar, 2025 1 commit
- Stronger acquire scope for low-latency kernels · ffc39ba0
  Chenggang Zhao authored Mar 27, 2025
18 Mar, 2025 1 commit
- Support zero-copy for low-latency combine · dcaf73e5
  Chenggang Zhao authored Mar 18, 2025
14 Mar, 2025 3 commits
- Fix bugs for intranode EP kernels · 82dcf48f
  Chenggang Zhao authored Mar 14, 2025
- Fix style. · 38cdaf39
  Shangyan Zhou authored Mar 14, 2025
- Low latency kernels use rdma atomic to support AR. · 2d0cf41d
  Shangyan Zhou authored Mar 14, 2025
10 Mar, 2025 1 commit
- Support BF16 for low-latency kernels · ed7487c1
  Chenggang Zhao authored Mar 10, 2025
06 Mar, 2025 1 commit
- Improve AR performance · 1fc40d50
  Chenggang Zhao authored Mar 06, 2025
05 Mar, 2025 2 commits
- Fix AR bugs for normal kernels · 458cdcb2
  Chenggang Zhao authored Mar 05, 2025
- Bugs fixed · 680e424b
  Chenggang Zhao authored Mar 05, 2025
04 Mar, 2025 1 commit
- Improve EP2/4 performance · 1553fc42
  Chenggang Zhao authored Mar 04, 2025
03 Mar, 2025 1 commit
- Remove all raw tensors for better P2P overlapping · 6cc3497d
  Chenggang Zhao authored Mar 03, 2025
27 Feb, 2025 1 commit
- Update some comments and docs · 77bb07aa
  Chenggang Zhao authored Feb 27, 2025
25 Feb, 2025 1 commit
- Initial commit · ebfe47e4
  Chenggang Zhao authored Feb 24, 2025