Commits · d72817eb05c4e7f425972a2c364b11c13da83907 · OpenDAS / DeepEP

14 Jul, 2025 1 commit
- Strengthen the barrier in `cached_notify` (#304) · eb155da4
  Shangyan Zhou authored Jul 14, 2025
```
* Strengthen the barrier in `cached_notify`

* lint

* Change the timing method

* lint
```
11 Jul, 2025 1 commit

Explicitly destroy the C++ runtime and release resources. (#292) · 0c984e25

Shangyan Zhou authored Jul 11, 2025

* Explicitly destroy the C++ runtime and release resources.

* Small fix

* fix typo

* Add a flag to control whether explicit `destroy()` is required.

10 Jul, 2025 1 commit

Use TMA to optimize internode combine. (#287) · 06f417dc

Shangyan Zhou authored Jul 10, 2025



* Let forwarders use a dedicated SM

* Shuffle rdma idx

* Sender use TMA.

* Adjust the tuning chunk size.

* Modify NVL chunk layout.

* Update some combine config.

* Small lint

* Minor fix

* Overlap TMA store

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

04 Jul, 2025 2 commits

Update some dispatch configs. · e6d61fc6
Shangyan Zhou authored Jul 04, 2025

Use TMA to optimize internode dispatch. (#276) · a2fa3b73

Shangyan Zhou authored Jul 04, 2025



* Add TMA buffer allocation

* Use TMA for forwarders and NVL receivers

* Use lane 31 to operate TMA.

* Change rdma buffer layout.

* Use TMA to transfer scales also.

* Increase the NVL recv buffer size.

* Disable early stopping.

* Apply similar optimizations on receiver warps.

* Prevent warp divergence.

* Disable aggressive ptx by default.

* Revert using TMA to transfer scales.

* Format.

* Change the layout of dispatch NVL buffer.

* Move topk transformation to recv warps.

* Use TMA to transfer all data in forward warps

* Use TMA to store scales.

* Code lint

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

02 Jul, 2025 5 commits

Refactor testing arguments · 7705f533
Chenggang Zhao authored Jul 02, 2025

Use CLI args instead of envs (#273) · 6b17f4fa

youkaichao authored Jul 02, 2025



* use cli arg for num_processes
Signed-off-by: youkaichao <youkaichao@gmail.com>

* update low-latency
Signed-off-by: youkaichao <youkaichao@gmail.com>

* update intranode
Signed-off-by: youkaichao <youkaichao@gmail.com>

* update internode
Signed-off-by: youkaichao <youkaichao@gmail.com>

---------
Signed-off-by: youkaichao <youkaichao@gmail.com>

Unify testing envs' naming · 01f49071
Chenggang Zhao authored Jul 02, 2025

cherry pick (#251) · 8dcdd349
fzyzcjy authored Jul 02, 2025

more (#238) · 19fc0700
fzyzcjy authored Jul 02, 2025

25 Jun, 2025 1 commit
- Support bias. (#257) · bd429ffe
  Shangyan Zhou authored Jun 25, 2025
```
* Support bias.

* Fix.

* Fix style.
```
24 Jun, 2025 1 commit

Add the transaction window data structure for RDMA senders (#245) · bc118b24

Chenggang Zhao authored Jun 24, 2025

* Add draft

* Add fast-debugging flags

* Fix several bugs

* Add sender timeout checks

* Fix stuck

* Fix bugs

* Fix bugs

13 Jun, 2025 1 commit
- Use one qp per sm for internode normal kernels (#181) · 05df5554
  Zhicheng Wu authored Jun 13, 2025
```
let the sender SM use the channel_id, and the receiver SM use channel_id + num_channels
```
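The QP assignment described in this commit message can be sketched as follows. This is a minimal illustration, not DeepEP's actual code: the helper name `qp_index` and its signature are hypothetical, and it assumes `2 * num_channels` QPs are available per peer.

```python
def qp_index(channel_id: int, num_channels: int, is_sender: bool) -> int:
    """Return the QP index for one SM (hypothetical helper, not DeepEP's API)."""
    if not 0 <= channel_id < num_channels:
        raise ValueError(f"channel_id {channel_id} out of range [0, {num_channels})")
    # Sender SMs take the first num_channels QPs; receiver SMs take the next
    # num_channels, so the two roles never collide on a channel.
    return channel_id if is_sender else channel_id + num_channels


num_channels = 4  # assumed value, for illustration only
sender_qps = [qp_index(c, num_channels, True) for c in range(num_channels)]
receiver_qps = [qp_index(c, num_channels, False) for c in range(num_channels)]

# Every SM ends up with its own QP: 2 * num_channels distinct indices.
assert len(set(sender_qps + receiver_qps)) == 2 * num_channels
```

Under this mapping the sender and receiver SMs of one channel never share a QP, which matches the "one qp per sm" wording of the commit title.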
12 Jun, 2025 1 commit
- Support UE8M0 data format. (#206) · 21efbe9b
  Shifang Xu authored Jun 12, 2025
  
22 Apr, 2025 1 commit
- Refactor some code. · 20b2aaaf
  Shangyan Zhou authored Apr 22, 2025
  
21 Apr, 2025 1 commit

In the Internode Normal Kernel, when using nvshmem ibrc for RDMA data... · 5ab80c28

moningchen authored Apr 21, 2025

In the Internode Normal Kernel, NVSHMEM IBRC uses a single QP for RDMA data transfer between two GPUs, which limits kernel performance on dual-port network cards and in RoCE networks.

In our optimized Internode Normal Kernel, we use multiple QPs for data transfer between two GPUs, assigning a different QP to each channel. We also changed the transport from IBRC to IBGDA.

With these optimizations, the Internode Normal Kernel achieves optimal performance in both H800 and H20 environments, with RDMA throughput close to the physical network limit. Measured with the current default statistical method, RDMA performance can reach 60 GB/s or more in 4-node H800 and H20 environments.

11 Apr, 2025 1 commit
- Fix test combine args · 23c54150
  Hao Lin authored Apr 11, 2025
```
Signed-off-by: Hao Lin <linhaomails@gmail.com>
```
10 Apr, 2025 1 commit
- fix: not output result in some linux system · 0f80da84
  fujianhao.fjh authored Apr 10, 2025
  
25 Mar, 2025 1 commit
- Remove confusing comments · ae0eafd2
  Chenggang Zhao authored Mar 25, 2025
  
05 Mar, 2025 1 commit
- Fix AR bugs for normal kernels · 458cdcb2
  Chenggang Zhao authored Mar 05, 2025
  
25 Feb, 2025 1 commit
- Initial commit · ebfe47e4
  Chenggang Zhao authored Feb 24, 2025
  