Commits · 1cf85fb2593208772ee4599070939a13a18b58eb · OpenDAS / DeepEP

10 Jul, 2025 1 commit

Support 10-bit LogFMT (simulated version) (#284) · 1cf85fb2

Chenggang Zhao authored Jul 10, 2025



* Add LogFMT interface

* Update comments

* Add simulated code

* Fix comments

* Change to 128 channels

* Add notes

* Optimize performance

* optimize simulate logfmt 10bit

* Minor fix

* Stronger low latency tests

* Minor fix

* Stronger low latency tests for logfmt

* Optimize logfmt simulate: lg2/ex2 ptx, step_inv

* Minor fix

* Minor fix

* Add non-logfmt bench

* Fix value=0 for logfmt

* Optimize performance

* Refactor tests

---------
Co-authored-by: Zhean Xu <xza@deepseek.com>


04 Jul, 2025 2 commits

Update some dispatch configs. · e6d61fc6
Shangyan Zhou authored Jul 04, 2025


Use TMA to optimize internode dispatch. (#276) · a2fa3b73

Shangyan Zhou authored Jul 04, 2025



* Add TMA buffer allocation

* Use TMA for forwarders and NVL receivers

* Use lane 31 to operate TMA.

* Change rdma buffer layout.

* Use TMA to transfer scales also.

* Increase the NVL recv buffer size.

* Disable early stopping.

* Apply similar optimizations on receiver warps.

* Prevent warp divergence.

* Disable aggressive ptx by default.

* Revert using TMA to transfer scales.

* Format.

* Change the layout of dispatch NVL buffer.

* Move topk transformation to recv warps.

* Use TMA to transfer all data in forward warps

* Use TMA to store scales.

* Code lint

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>


02 Jul, 2025 9 commits
- Refactor testing arguments · 7705f533
  Chenggang Zhao authored Jul 02, 2025
  
- Use CLI args instead of envs (#273) · 6b17f4fa
  youkaichao authored Jul 02, 2025
```
* use cli arg for num_processes
Signed-off-by: youkaichao <youkaichao@gmail.com>

* update low-latency
Signed-off-by: youkaichao <youkaichao@gmail.com>

* update intranode
Signed-off-by: youkaichao <youkaichao@gmail.com>

* update internode
Signed-off-by: youkaichao <youkaichao@gmail.com>

---------
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
- Simplify · 341bb961
  Chenggang Zhao authored Jul 02, 2025
  
- Renaming · 63f79469
  Chenggang Zhao authored Jul 02, 2025
  
- Refactor the bench function · d79b3cd1
  Chenggang Zhao authored Jul 02, 2025
  
- Support displaying separate send and recv time (#239) · 85793dda
  fzyzcjy authored Jul 02, 2025
```
* more

* more

* more

* more

* more

* more
```
- Unify testing envs' naming · 01f49071
  Chenggang Zhao authored Jul 02, 2025
  
- cherry pick (#251) · 8dcdd349
  fzyzcjy authored Jul 02, 2025
  
- more (#238) · 19fc0700
  fzyzcjy authored Jul 02, 2025
  
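
The "Use CLI args instead of envs (#273)" change in the list above moves `num_processes` from an environment variable to a command-line argument. A minimal sketch of that pattern follows. The flag name `--num-processes`, the default of 8, and the `parse_args` helper are assumptions made for illustration and are not taken from the commit.

```python
import argparse


def parse_args(argv=None):
    """Parse test options from the command line (hypothetical launcher)."""
    parser = argparse.ArgumentParser(description="Hypothetical test launcher")
    # Assumed flag name and default; the commit message only states that
    # num_processes is now passed as a CLI argument.
    parser.add_argument("--num-processes", type=int, default=8,
                        help="number of processes to spawn (default: 8)")
    return parser.parse_args(argv)


# Passing an explicit list keeps the example independent of sys.argv.
args = parse_args(["--num-processes", "4"])
print(f"spawning {args.num_processes} processes")
```

Compared with reading an environment variable, an explicit argument shows up in `--help` output and is type-checked when it is parsed.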
25 Jun, 2025 1 commit
- Support bias. (#257) · bd429ffe
  Shangyan Zhou authored Jun 25, 2025
```
* Support bias.

* Fix.

* Fix style.
```
24 Jun, 2025 2 commits

Add the transaction window data structure for RDMA senders (#245) · bc118b24

Chenggang Zhao authored Jun 24, 2025

* Add draft

* Add fast-debugging flags

* Fix several bugs

* Add sender timeout checks

* Fix stuck

* Fix bugs

* Fix bugs


Optimize intranode combine. (#247) · 9eb2f84b

Shangyan Zhou authored Jun 24, 2025

* Increase the test round.

* Add warp synchronization.

* Shuffle the send warps.

* Add time elapsed into bench result.


18 Jun, 2025 4 commits
- Suppress type checks · 9d4f7ef8
  Chenggang Zhao authored Jun 18, 2025
  
- Adjust import order · b56f7c2c
  Chenggang Zhao authored Jun 18, 2025
  
- Move import. · cd371d31
  Shangyan Zhou authored Jun 18, 2025
  
- Set `device_id` to suppress pytorch warning. · bf4a4a21
  Shangyan Zhou authored Jun 18, 2025
  
16 Jun, 2025 1 commit
- Remove the low-latency usage flag (#214) · 8aaddf76
  Chenggang Zhao authored Jun 16, 2025
  
13 Jun, 2025 1 commit
- Use one qp per sm for internode normal kernels (#181) · 05df5554
  Zhicheng Wu authored Jun 13, 2025
```
let the sender SM use the channel_id, and the receiver SM use channel_id + num_channels
```
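
The message of "Use one qp per sm for internode normal kernels (#181)" describes an index mapping: the sender SM uses `channel_id` and the receiver SM uses `channel_id + num_channels`. The sketch below reads those values as QP indices, which is an assumption based on the commit title. The function name `qp_index` and its `is_sender` parameter are hypothetical.

```python
def qp_index(channel_id: int, num_channels: int, is_sender: bool) -> int:
    """Return the index used by one SM of a channel (illustrative only).

    Senders take indices [0, num_channels); receivers take
    [num_channels, 2 * num_channels), so no two SMs share an index.
    """
    if not 0 <= channel_id < num_channels:
        raise ValueError("channel_id must be in [0, num_channels)")
    return channel_id if is_sender else channel_id + num_channels


# With 4 channels, the 8 SMs map to 8 distinct indices.
num_channels = 4
for channel_id in range(num_channels):
    send = qp_index(channel_id, num_channels, is_sender=True)
    recv = qp_index(channel_id, num_channels, is_sender=False)
    print(f"channel {channel_id}: sender -> {send}, receiver -> {recv}")
```

Offsetting the receiver by `num_channels` keeps the two ranges disjoint without a lookup table.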
12 Jun, 2025 1 commit
- Support UE8M0 data format. (#206) · 21efbe9b
  Shifang Xu authored Jun 12, 2025
  
11 Jun, 2025 3 commits

Support Ampere architecture (#204) · b8d90fb7

Chenggang Zhao authored Jun 11, 2025

* Update README

* Update `setup.py`

* Fix headers

* Add `DISABLE_NVSHMEM` for APIs

* Fix launch

* Fix TMA settings

* Fix TMA usages

* Fix dlink

* Separate layout kernels

* Update version

* Add `is_sm90_compiled`

* Fix tests

* Add NVLink connection checks

* Update README

* Fix tests

* Add some comments

* Minor fix

* Minor fix

* Fix bugs


Check the empty list · dd13c714
Chenggang Zhao authored Jun 11, 2025

Support CUDA graph for intranode normal kernels (#203) · a8299ca7
Chenggang Zhao authored Jun 11, 2025


09 Jun, 2025 2 commits
- Support statistics tensor for low-latency kernels (#196) · 5a2e37fa
  Chenggang Zhao authored Jun 09, 2025
  
- Add low-latency kernel PCIe usage flag (#195) · 0d1a855d
  Chenggang Zhao authored Jun 09, 2025
```
* Add low-latency kernel usage flag

* Update comments
```
06 Jun, 2025 1 commit

Use TMA instead of LD/ST for intra-node normal kernels (#191) · c8dceba1

Chenggang Zhao authored Jun 06, 2025

* Update CMake files

* Use TMA instead of LD/ST for intranode dispatch

* Use TMA instead of LD/ST for intranode combine

* Adjust configs

* Test default configs as well

* More warps for combine

* Add inter-thread fence

* Enable more warps

* Do not use TMA for senders

* Update configs

* Remove useless wait


22 Apr, 2025 1 commit
- Refactor some code. · 20b2aaaf
  Shangyan Zhou authored Apr 22, 2025
  
21 Apr, 2025 1 commit

In the Internode Normal Kernel, when using nvshmem ibrc for RDMA data... · 5ab80c28

moningchen authored Apr 21, 2025

In the internode normal kernel, when NVSHMEM IBRC is used for RDMA data transmission, a single QP carries all traffic between two GPUs, which limits kernel performance on dual-port NICs and in RoCE networks.

In our optimized internode normal kernel, we use multiple QPs for data transmission between two GPUs, assigning a different QP to each channel. We also changed the transport from IBRC to IBGDA.

With these optimizations, the internode normal kernel reaches its best performance on both H800 and H20, with RDMA throughput close to the physical network limit. Measured with the current default statistical method in 4-node H800 and H20 environments, RDMA throughput exceeds 60 GB/s.


11 Apr, 2025 1 commit
- Fix test combine args · 23c54150
  Hao Lin authored Apr 11, 2025
```
Signed-off-by: Hao Lin <linhaomails@gmail.com>
```
10 Apr, 2025 1 commit
- fix: results not printed on some Linux systems · 0f80da84
  fujianhao.fjh authored Apr 10, 2025
  
07 Apr, 2025 1 commit
- Remove useless control metadata for low-latency combine · 42494864
  Chenggang Zhao authored Apr 07, 2025
  
28 Mar, 2025 1 commit
- Fix zero-copy mode tests · 26fa72d8
  Chenggang Zhao authored Mar 28, 2025
  
25 Mar, 2025 1 commit
- Remove confusing comments · ae0eafd2
  Chenggang Zhao authored Mar 25, 2025
  
18 Mar, 2025 1 commit
- Support zero-copy for low-latency combine · dcaf73e5
  Chenggang Zhao authored Mar 18, 2025
  
10 Mar, 2025 2 commits
- Allow passing output tensor in low_latency_combine · b3b61ef5
  Dmytro Dzhulgakov authored Mar 10, 2025
  
- Support BF16 for low-latency kernels · ed7487c1
  Chenggang Zhao authored Mar 10, 2025
  
05 Mar, 2025 1 commit
- Fix AR bugs for normal kernels · 458cdcb2
  Chenggang Zhao authored Mar 05, 2025
  
04 Mar, 2025 1 commit
- Improve EP2/4 performance · 1553fc42
  Chenggang Zhao authored Mar 04, 2025
  