Commits · bc118b248a07316c6b1f61d8e5b2dde069dae7e6 · OpenDAS / DeepEP

24 Jun, 2025 1 commit

Add the transaction window data structure for RDMA senders (#245) · bc118b24

Chenggang Zhao authored Jun 24, 2025

* Add draft

* Add fast-debugging flags

* Fix several bugs

* Add sender timeout checks

* Fix stuck

* Fix bugs

* Fix bugs


13 Jun, 2025 1 commit
- Use one QP per SM for internode normal kernels (#181) · 05df5554
  Zhicheng Wu authored Jun 13, 2025
```
let the sender SM use the channel_id, and the receiver SM use channel_id + num_channels
```
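The mapping described in this commit message can be sketched as follows. This is a minimal illustration, assuming the two values are used as QP indices (as the commit title suggests); the helper name `qp_index` and the `is_sender` flag are hypothetical and do not come from the DeepEP source.

```python
def qp_index(channel_id: int, num_channels: int, is_sender: bool) -> int:
    """Hypothetical helper: pick a QP index for the SM serving a channel."""
    if not 0 <= channel_id < num_channels:
        raise ValueError("channel_id out of range")
    # Sender SMs take the lower half [0, num_channels);
    # receiver SMs take the upper half [num_channels, 2 * num_channels).
    return channel_id if is_sender else channel_id + num_channels


num_channels = 4
sender_qps = [qp_index(c, num_channels, True) for c in range(num_channels)]
receiver_qps = [qp_index(c, num_channels, False) for c in range(num_channels)]
print(sender_qps)    # [0, 1, 2, 3]
print(receiver_qps)  # [4, 5, 6, 7]
```

Under this assumption the two halves never overlap, so each SM ends up with its own QP and 2 * num_channels QPs are in use in total.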
12 Jun, 2025 1 commit
- Support UE8M0 data format. (#206) · 21efbe9b
  Shifang Xu authored Jun 12, 2025
  
22 Apr, 2025 1 commit
- Refactor some code. · 20b2aaaf
  Shangyan Zhou authored Apr 22, 2025
  
21 Apr, 2025 1 commit

In the Internode Normal Kernel, when using nvshmem ibrc for RDMA data... · 5ab80c28

moningchen authored Apr 21, 2025

In the Internode Normal Kernel, when NVSHMEM IBRC is used for RDMA data transmission, a single QP carries all traffic between two GPUs, which limits kernel performance on dual-port NICs and in RoCE networks.

In our optimized Internode Normal Kernel, we use multiple QPs for data transmission between two GPUs, assigning a different QP to each channel. We also changed the transport from IBRC to IBGDA.

With these optimizations, RDMA transmission in the Internode Normal Kernel comes close to the physical network limit in both H800 and H20 environments. Measured with the current default statistical method in 4-node H800 and H20 environments, RDMA performance can exceed 60 GB/s.
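The difference between a single shared QP and one QP per channel can be sketched as below. This is only an illustration of the assignment idea, not the NVSHMEM or DeepEP API; `assign_qps` and `qps_per_peer` are hypothetical names introduced here.

```python
from collections import Counter


def assign_qps(num_channels: int, qps_per_peer: int) -> dict[int, int]:
    """Hypothetical mapping from channel index to QP index for one GPU pair."""
    if num_channels < 1 or qps_per_peer < 1:
        raise ValueError("num_channels and qps_per_peer must be positive")
    # With qps_per_peer == 1 every channel shares QP 0;
    # with qps_per_peer == num_channels each channel gets its own QP.
    return {ch: ch % qps_per_peer for ch in range(num_channels)}


single_qp = assign_qps(num_channels=8, qps_per_peer=1)
per_channel_qp = assign_qps(num_channels=8, qps_per_peer=8)

# Number of channels that have to share the busiest QP in each setup.
max_shared_single = max(Counter(single_qp.values()).values())
max_shared_multi = max(Counter(per_channel_qp.values()).values())
print(max_shared_single, max_shared_multi)
```

In the single-QP case all eight channels queue behind one QP, while the per-channel case leaves no QP shared, which is the property the commit message relies on for dual-port and RoCE setups.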


11 Apr, 2025 1 commit
- Fix test combine args · 23c54150
  Hao Lin authored Apr 11, 2025
```
Signed-off-by: Hao Lin <linhaomails@gmail.com>
```
10 Apr, 2025 1 commit
- fix: results not output on some Linux systems · 0f80da84
  fujianhao.fjh authored Apr 10, 2025
  
25 Mar, 2025 1 commit
- Remove confusing comments · ae0eafd2
  Chenggang Zhao authored Mar 25, 2025
  
05 Mar, 2025 1 commit
- Fix AR bugs for normal kernels · 458cdcb2
  Chenggang Zhao authored Mar 05, 2025
  
25 Feb, 2025 1 commit
- Initial commit · ebfe47e4
  Chenggang Zhao authored Feb 24, 2025
  