Commits · 510719682c3d89ab51e25318ee378e793bce0e2f · OpenDAS / DeepEP

08 Aug, 2025 1 commit
- Fix low latency test · 51071968
  Shangyan Zhou authored Aug 08, 2025
  
07 Aug, 2025 1 commit

Support 10-bit LogFMT Combine (#345) · c5facf5c

Zhean Xu authored Aug 07, 2025



* independent logfmt_simulate function

* draft: logfmt low latency combine

* Minor bug fixes

* Fix non-logfmt bugs

* Fix logfmt bugs

* Fix logfmt bugs

* Minor fix

* Minor fix

* Clean code

* Clean code

* Use fewer regs

* Use two warp groups

* Correct shared memory size

* Minor fix

* Minor fix

* More rigorous tests

* Clean code

* Use more SMs

* Use different unroll factor for send & recv

* Update csrc/kernels/internode_ll.cu
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update csrc/kernels/internode_ll.cu
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Some renaming

* Some comments of tests

* Format `logfmt_encode`

* More lints

* Some refactors on sends

* Fix testing

* Fix bugs

* Renaming

* Use the full warp

* Unify combine recv

* Lint

* Lint

* Support 2560

* Fix meta buffer dtype

* Better encode calls

* Better amin/max writes

* Extra sync

* Read `topk_idx` at once

* Better specialization

* Read weights at once

* Rename

* Bug fixed

* Some renaming

* Fix local memory usage for sending

* Fix local memory usage for receiving

* Fewer writes

* Optimize performance

* Optimize performance

* Better performance

* Optimize performance

* Fix rounding

* Manually unroll

* Fix bench

---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>


01 Aug, 2025 1 commit
- more (#275) · ab0a3dd2
  fzyzcjy authored Aug 01, 2025
  
31 Jul, 2025 1 commit
- Remove the diagnosis part from tests · c7033854
  Chenggang Zhao authored Jul 31, 2025
  
30 Jul, 2025 1 commit

Add diagnosis module for efficient and precise location of slow rank (#311) · 4b67064d

sky authored Jul 30, 2025



* Add diagnosis module for precise identification of slow ranks
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Add tests for the slow diagnosis module
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Update some comments for diagnose
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Update test case for diagnose
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Strip the diagnose module, thx LyricZhao and sphish.
Signed-off-by: wangfakang <fakangwang@gmail.com>

* update variable name and cumulative wait recv cost, thx sphish.
Signed-off-by: wangfakang <fakangwang@gmail.com>

* remove invalid comments.
Signed-off-by: wangfakang <fakangwang@gmail.com>

---------
Signed-off-by: wangfakang <fakangwang@gmail.com>


22 Jul, 2025 1 commit
- Update combine config · bdd119f8
  Shangyan Zhou authored Jul 22, 2025
  
21 Jul, 2025 1 commit

Minor patches for deepep (#318) · 5b549c85

Guangguan Wang authored Jul 21, 2025



* Add arg --pressure-test for test_low_latency.py

Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>

* Export NVSHMEM_QP_DEPTH

Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>

---------
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>


14 Jul, 2025 1 commit
- Strengthen the barrier in `cached_notify` (#304) · eb155da4
  Shangyan Zhou authored Jul 14, 2025
```
* Strengthen the barrier in `cached_notify`

* lint

* Change the timing method

* lint
```
11 Jul, 2025 1 commit

Explicitly destroy the C++ runtime and release resources. (#292) · 0c984e25

Shangyan Zhou authored Jul 11, 2025

* Explicitly destroy the C++ runtime and release resources.

* Small fix

* fix typo

* Add a flag to control whether explicit `destroy()` is required.


10 Jul, 2025 2 commits

Use TMA to optimize internode combine. (#287) · 06f417dc

Shangyan Zhou authored Jul 10, 2025



* Let forwarders use a dedicated SM

* Shuffle rdma idx

* Senders use TMA.

* Adjust the tuning chunk size.

* Modify NVL chunk layout.

* Update some combine config.

* Small lint

* Minor fix

* Overlap TMA store

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>


Support 10-bit LogFMT (simulated version) (#284) · 1cf85fb2

Chenggang Zhao authored Jul 10, 2025



* Add LogFMT interface

* Update comments

* Add simulated code

* Fix comments

* Change to 128 channels

* Add notes

* Optimize performance

* optimize simulate logfmt 10bit

* Minor fix

* Stronger low latency tests

* Minor fix

* Stronger low latency tests for logfmt

* Optimize logfmt simulate: lg2/ex2 ptx, step_inv

* Minor fix

* Minor fix

* Add non-logfmt bench

* Fix value=0 for logfmt

* Optimize performance

* Refactor tests

---------
Co-authored-by: Zhean Xu <xza@deepseek.com>
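
The bullets above mention a simulated 10-bit log format with amin/max tracking, a special case for zero, and rounding fixes, but the log does not show the encoding itself. The following is only a generic sketch of a log-domain quantize/dequantize round trip, not the DeepEP kernel. The function name `simulate_log_quant`, the sign-bit plus reserved-zero-code layout, and the per-group min/max range are all assumptions made for illustration.

```python
import math


def simulate_log_quant(values, bits=10):
    """Quantize then dequantize `values` in the log2 domain (hypothetical layout)."""
    if bits < 3:
        raise ValueError("need at least 3 bits: sign, zero code, and 2+ levels")
    # Assumption: 1 sign bit; magnitude code 0 is reserved for exact zero,
    # leaving 2**(bits - 1) - 1 codes for nonzero magnitudes (511 for bits=10).
    num_codes = (1 << (bits - 1)) - 1
    mags = [abs(v) for v in values if v != 0.0]
    if not mags:
        return [0.0 for _ in values]
    # Assumption: the representable range is the group's own min/max magnitude.
    log_min = math.log2(min(mags))
    log_max = math.log2(max(mags))
    step = (log_max - log_min) / (num_codes - 1) if log_max > log_min else 0.0
    out = []
    for v in values:
        if v == 0.0:
            out.append(0.0)  # zero has no logarithm, so it is handled separately
            continue
        # Round to the nearest level on a uniform grid in the log domain.
        idx = round((math.log2(abs(v)) - log_min) / step) if step else 0
        out.append(math.copysign(2.0 ** (log_min + idx * step), v))
    return out
```

Because the grid is uniform in the log domain, the relative error of each nonzero value is bounded by roughly `2 ** (step / 2) - 1`, independent of its magnitude.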


04 Jul, 2025 2 commits

Update some dispatch configs. · e6d61fc6
Shangyan Zhou authored Jul 04, 2025


Use TMA to optimize internode dispatch. (#276) · a2fa3b73

Shangyan Zhou authored Jul 04, 2025



* Add TMA buffer allocation

* Use TMA for forwarders and NVL receivers

* Use lane 31 to operate TMA.

* Change rdma buffer layout.

* Use TMA to transfer scales also.

* Increase the NVL recv buffer size.

* Disable early stopping.

* Apply similar optimizations on receiver warps.

* Prevent warp divergence.

* Disable aggressive ptx by default.

* Revert using TMA to transfer scales.

* Format.

* Change the layout of dispatch NVL buffer.

* Move topk transformation to recv warps.

* Use TMA to transfer all data in forward warps

* Use TMA to store scales.

* Code lint

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>


02 Jul, 2025 9 commits
- Refactor testing arguments · 7705f533
  Chenggang Zhao authored Jul 02, 2025
  
- Use CLI args instead of envs (#273) · 6b17f4fa
  youkaichao authored Jul 02, 2025
```
* use cli arg for num_processes
Signed-off-by: youkaichao <youkaichao@gmail.com>

* update low-latency
Signed-off-by: youkaichao <youkaichao@gmail.com>

* update intranode
Signed-off-by: youkaichao <youkaichao@gmail.com>

* update internode
Signed-off-by: youkaichao <youkaichao@gmail.com>

---------
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
- Simplify · 341bb961
  Chenggang Zhao authored Jul 02, 2025
  
- Renaming · 63f79469
  Chenggang Zhao authored Jul 02, 2025
  
- Refactor the bench function · d79b3cd1
  Chenggang Zhao authored Jul 02, 2025
  
- Support displaying separate send and recv time (#239) · 85793dda
  fzyzcjy authored Jul 02, 2025
```
* more

* more

* more

* more

* more

* more
```
- Unify testing envs' naming · 01f49071
  Chenggang Zhao authored Jul 02, 2025
  
- cherry pick (#251) · 8dcdd349
  fzyzcjy authored Jul 02, 2025
  
- more (#238) · 19fc0700
  fzyzcjy authored Jul 02, 2025
  
25 Jun, 2025 1 commit
- Support bias. (#257) · bd429ffe
  Shangyan Zhou authored Jun 25, 2025
```
* Support bias.

* Fix.

* Fix style.
```
24 Jun, 2025 2 commits

Add the transaction window data structure for RDMA senders (#245) · bc118b24

Chenggang Zhao authored Jun 24, 2025

* Add draft

* Add fast-debugging flags

* Fix several bugs

* Add sender timeout checks

* Fix stuck

* Fix bugs

* Fix bugs


Optimize intranode combine. (#247) · 9eb2f84b

Shangyan Zhou authored Jun 24, 2025

* Increase the test round.

* Add warp synchronization.

* Shuffle the send warps.

* Add time elapsed into bench result.


18 Jun, 2025 4 commits
- Surpass type checks · 9d4f7ef8
  Chenggang Zhao authored Jun 18, 2025
  
- Adjust import order · b56f7c2c
  Chenggang Zhao authored Jun 18, 2025
  
- Move import. · cd371d31
  Shangyan Zhou authored Jun 18, 2025
  
- Set `device_id` to suppress pytorch warning. · bf4a4a21
  Shangyan Zhou authored Jun 18, 2025
  
16 Jun, 2025 1 commit
- Remove the low-latency usage flag (#214) · 8aaddf76
  Chenggang Zhao authored Jun 16, 2025
  
13 Jun, 2025 1 commit
- Use one qp per sm for internode normal kernels (#181) · 05df5554
  Zhicheng Wu authored Jun 13, 2025
```
let the sender SM use the channel_id, and the receiver SM use channel_id + num_channels
```
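
The one-line description above gives the sender and receiver SMs of a channel disjoint queue-pair slots. A minimal sketch of that index mapping follows. The function name `qp_index` and its parameters are hypothetical and do not come from the DeepEP source; only the `channel_id` and `channel_id + num_channels` rule is taken from the commit message.

```python
def qp_index(channel_id: int, num_channels: int, is_sender: bool) -> int:
    """Return the QP slot for one SM of a channel (illustrative helper only)."""
    if not 0 <= channel_id < num_channels:
        raise ValueError("channel_id must be in [0, num_channels)")
    # Senders occupy slots [0, num_channels); receivers occupy
    # [num_channels, 2 * num_channels), so no two SMs share a QP.
    return channel_id if is_sender else channel_id + num_channels


# Example: with 4 channels, senders map to 0..3 and receivers to 4..7.
sender_slots = [qp_index(c, 4, True) for c in range(4)]
receiver_slots = [qp_index(c, 4, False) for c in range(4)]
```

Under this scheme a run with `num_channels` channels needs `2 * num_channels` QP slots, one per participating SM.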
12 Jun, 2025 1 commit
- Support UE8M0 data format. (#206) · 21efbe9b
  Shifang Xu authored Jun 12, 2025
  
11 Jun, 2025 3 commits

Support Ampere architecture (#204) · b8d90fb7

Chenggang Zhao authored Jun 11, 2025

* Update README

* Update `setup.py`

* Fix headers

* Add `DISABLE_NVSHMEM` for APIs

* Fix launch

* Fix TMA settings

* Fix TMA usages

* Fix dlink

* Separate layout kernels

* Update version

* Add `is_sm90_compiled`

* Fix tests

* Add NVLink connection checks

* Update README

* Fix tests

* Add some comments

* Minor fix

* Minor fix

* Fix bugs


Check the empty list · dd13c714
Chenggang Zhao authored Jun 11, 2025

Support CUDA graph for intranode normal kernels (#203) · a8299ca7
Chenggang Zhao authored Jun 11, 2025


09 Jun, 2025 2 commits
- Support statistics tensor for low-latency kernels (#196) · 5a2e37fa
  Chenggang Zhao authored Jun 09, 2025
  
- Add low-latency kernel PCIe usage flag (#195) · 0d1a855d
  Chenggang Zhao authored Jun 09, 2025
```
* Add low-latency kernel usage flag

* Update comments
```
06 Jun, 2025 1 commit

Use TMA instead of LD/ST for intra-node normal kernels (#191) · c8dceba1

Chenggang Zhao authored Jun 06, 2025

* Update CMake files

* Use TMA instead of LD/ST for intranode dispatch

* Use TMA instead of LD/ST for intranode combine

* Adjust configs

* Test default configs as well

* More warps for combine

* Add inter-thread fence

* Enable more warps

* Do not use TMA for senders

* Update configs

* Remove useless wait


22 Apr, 2025 1 commit
- Refactor some code. · 20b2aaaf
  Shangyan Zhou authored Apr 22, 2025
  
21 Apr, 2025 1 commit

In the Internode Normal Kernel, when using nvshmem ibrc for RDMA data... · 5ab80c28

moningchen authored Apr 21, 2025

In the internode normal kernel, when NVSHMEM IBRC is used for RDMA data transmission, a single QP carries all traffic between two GPUs, which limits kernel performance with dual-port NICs and on RoCE networks.

In our optimized internode normal kernel, we use multiple QPs for data transmission between two GPUs, assigning a different QP to each channel. We also changed the transport from IBRC to IBGDA.

With these optimizations, the internode normal kernel achieves optimal performance in both H800 and H20 environments, with RDMA throughput close to the physical network limit. Measured with the current default statistical method in 4-node H800 and H20 environments, RDMA performance can exceed 60 GB/s.
