Commits · 4b67064d220f6f22619063a08a9fcec41f0c5614 · OpenDAS / DeepEP

30 Jul, 2025 1 commit

Add diagnosis module for efficient and precise location of slow rank (#311) · 4b67064d

sky authored Jul 30, 2025



* Add diagnosis module for precise identification of slow ranks
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Add tests for the slow diagnosis module
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Update some comments for diagnose
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Update test case for diagnose
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Strip the diagnose module, thx LyricZhao and sphish.
Signed-off-by: wangfakang <fakangwang@gmail.com>

* update variable name and cumulative wait recv cost, thx sphish.
Signed-off-by: wangfakang <fakangwang@gmail.com>

* remove invalid comments.
Signed-off-by: wangfakang <fakangwang@gmail.com>

---------
Signed-off-by: wangfakang <fakangwang@gmail.com>

4b67064d

22 Jul, 2025 1 commit
- Update combine config · bdd119f8
  Shangyan Zhou authored Jul 22, 2025
  
  bdd119f8
21 Jul, 2025 1 commit

Minor patches for deepep (#318) · 5b549c85

Guangguan Wang authored Jul 21, 2025



* Add arg --pressure-test for test_low_latency.py

Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>

* Export NVSHMEM_QP_DEPTH

Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>

---------
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>

5b549c85

15 Jul, 2025 1 commit
- buffer.py: Do not force the NIC handler to GPU. · e6b4f527
  Seth Howell authored Jul 14, 2025
```
This enables the CPU-Assisted data path.
Signed-off-by: Seth Howell <sethh@nvidia.com>
```
  e6b4f527
11 Jul, 2025 1 commit

Explicitly destroy the C++ runtime and release resources. (#292) · 0c984e25

Shangyan Zhou authored Jul 11, 2025

* Explicitly destroy the C++ runtime and release resources.

* Small fix

* fix typo

* Add a flag to control whether explicit `destroy()` is required.

0c984e25

10 Jul, 2025 2 commits

Use TMA to optimize internode combine. (#287) · 06f417dc

Shangyan Zhou authored Jul 10, 2025



* Let forwarders use a dedicated SM

* Shuffle rdma idx

* Sender use TMA.

* Adjust the tuning chunk size.

* Modify NVL chunk layout.

* Update some combine config.

* Small lint

* Minor fix

* Overlap TMA store

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

06f417dc

Support 10-bit LogFMT (simulated version) (#284) · 1cf85fb2

Chenggang Zhao authored Jul 10, 2025



* Add LogFMT interface

* Update comments

* Add simulated code

* Fix comments

* Change to 128 channels

* Add notes

* Optimize performance

* optimize simulate logfmt 10bit

* Minor fix

* Stronger low latency tests

* Minor fix

* Stronger low latency tests for logfmt

* Optimize logfmt simulate: lg2/ex2 ptx, step_inv

* Minor fix

* Minor fix

* Add non-logfmt bench

* Fix value=0 for logfmt

* Optimize performance

* Refactor tests

---------
Co-authored-by: Zhean Xu <xza@deepseek.com>

1cf85fb2

04 Jul, 2025 1 commit
- Update some dispatch configs. · e6d61fc6
  Shangyan Zhou authored Jul 04, 2025
  
  e6d61fc6
27 Jun, 2025 1 commit
- Change the default num of QPs. (#263) · 4e72eb39
  Shangyan Zhou authored Jun 27, 2025
  
  4e72eb39
25 Jun, 2025 2 commits
- Support bias. (#257) · bd429ffe
  Shangyan Zhou authored Jun 25, 2025
```
* Support bias.

* Fix.

* Fix style.
```
  bd429ffe
- Add `get_comm_stream`. (#256) · b80e55e2
  Shangyan Zhou authored Jun 25, 2025
```
* Add `get_comm_stream`.

* Fix style.
```
  b80e55e2
16 Jun, 2025 1 commit
- Remove the low-latency usage flag (#214) · 8aaddf76
  Chenggang Zhao authored Jun 16, 2025
  
  8aaddf76
12 Jun, 2025 1 commit
- Support UE8M0 data format. (#206) · 21efbe9b
  Shifang Xu authored Jun 12, 2025
  
  21efbe9b
11 Jun, 2025 3 commits

Support Ampere architecture (#204) · b8d90fb7

Chenggang Zhao authored Jun 11, 2025

* Update README

* Update `setup.py`

* Fix headers

* Add `DISABLE_NVSHMEM` for APIs

* Fix launch

* Fix TMA settings

* Fix TMA usages

* Fix dlink

* Separate layout kernels

* Update version

* Add `is_sm90_compiled`

* Fix tests

* Add NVLink connection checks

* Update README

* Fix tests

* Add some comments

* Minor fix

* Minor fix

* Fix bugs

b8d90fb7

Check the empty list · dd13c714
Chenggang Zhao authored Jun 11, 2025

dd13c714
Support CUDA graph for intranode normal kernels (#203) · a8299ca7
Chenggang Zhao authored Jun 11, 2025

a8299ca7

09 Jun, 2025 2 commits
- Support statistics tensor for low-latency kernels (#196) · 5a2e37fa
  Chenggang Zhao authored Jun 09, 2025
  
  5a2e37fa
- Add low-latency kernel PCIe usage flag (#195) · 0d1a855d
  Chenggang Zhao authored Jun 09, 2025
```
* Add low-latency kernel usage flag

* Update comments
```
  0d1a855d
07 Jun, 2025 1 commit
- more · 4cd95170
  fzyzcjy authored Jun 07, 2025
  
  4cd95170
06 Jun, 2025 2 commits

Use TMA instead of LD/ST for intra-node normal kernels (#191) · c8dceba1

Chenggang Zhao authored Jun 06, 2025

* Update CMake files

* Use TMA instead of LD/ST for intranode dispatch

* Use TMA instead of LD/ST for intranode combine

* Adjust configs

* Test default configs as well

* More warps for combine

* Add inter-thread fence

* Enable more warps

* Do not use TMA for senders

* Update configs

* Remove useless wait

c8dceba1

Reduce NVSHMEM gpu memory usage and disable MNNVL. (#190) · df4debe3
Shangyan Zhou authored Jun 06, 2025
```
Co-authored-by: Shangyan Zhou <sy.zhou@deepseek.com>
```
df4debe3

23 May, 2025 3 commits
- Allow NVLink traffic for low-latency kernels by default · aae9fa9a
  Chenggang Zhao authored May 23, 2025
  
  aae9fa9a
- Code cleanup and bug fixed · 92405ddf
  Chenggang Zhao authored May 23, 2025
  
  92405ddf
- Feature: LL nvlink p2p (#173) · 68ae8b3d
  cywork121 authored May 23, 2025
  
  68ae8b3d
22 Apr, 2025 2 commits
- Several code lints · edbb1bc3
  Chenggang Zhao authored Apr 22, 2025
  
  edbb1bc3
- Normal kernels always use IBGDA mode. · 3e54b78f
  Shangyan Zhou authored Apr 22, 2025
  
  3e54b78f
21 Apr, 2025 1 commit

In the Internode Normal Kernel, when using nvshmem ibrc for RDMA data... · 5ab80c28

moningchen authored Apr 21, 2025

In the Internode Normal Kernel, when NVSHMEM IBRC is used for RDMA data transmission, a single QP carries all data between two GPUs, which limits kernel performance in dual-port NIC and RoCE network scenarios.

In our optimized Internode Normal Kernel, we use multiple QPs for data transmission between two GPUs, assigning a different QP to each channel. We also changed the transmission method from IBRC to IBGDA.

With these optimizations, the Internode Normal Kernel achieves optimal performance in both H800 and H20 environments, with RDMA transmission performance close to the physical network limit. Measured with the current default statistical method in 4-node H800 and H20 environments, RDMA performance reaches 60 GB/s or more.
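The per-channel QP assignment described in this commit message can be illustrated with a minimal sketch. This is not DeepEP's code: the function name `qp_for_channel` and the parameter `num_qps_per_peer` are hypothetical, and the modulo fallback for the case of fewer QPs than channels is an assumption made only for illustration.

```python
def qp_for_channel(channel_id: int, num_qps_per_peer: int) -> int:
    """Return the index of the QP a channel uses toward one remote GPU.

    Hypothetical helper: when there are at least as many QPs as channels,
    each channel gets its own QP; otherwise channels wrap around (assumption).
    """
    if channel_id < 0:
        raise ValueError("channel_id must be non-negative")
    if num_qps_per_peer <= 0:
        raise ValueError("num_qps_per_peer must be positive")
    return channel_id % num_qps_per_peer


num_channels = 4

# One QP per channel: every channel maps to a distinct QP.
per_channel = {ch: qp_for_channel(ch, num_channels) for ch in range(num_channels)}

# A single QP per peer: every channel shares QP 0, the bottleneck the commit describes.
single_qp = {ch: qp_for_channel(ch, 1) for ch in range(num_channels)}

print(per_channel)
print(single_qp)
```

Spreading channels over distinct QPs lets traffic between the same pair of GPUs use more than one network path, which is the situation the message refers to for dual-port NICs and RoCE networks.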

5ab80c28

03 Apr, 2025 1 commit
- Update buffer.py · 218c5a1f
  fzyzcjy authored Apr 03, 2025
  
  218c5a1f
25 Mar, 2025 1 commit
- Update buffer.py · 36b5c279
  fzyzcjy authored Mar 25, 2025
  
  36b5c279
18 Mar, 2025 1 commit
- Support zero-copy for low-latency combine · dcaf73e5
  Chenggang Zhao authored Mar 18, 2025
  
  dcaf73e5
13 Mar, 2025 1 commit
- comments · 50ac280a
  Dmytro Dzhulgakov authored Mar 13, 2025
  
  50ac280a
10 Mar, 2025 2 commits
- Allow passing output tensor in low_latency_combine · b3b61ef5
  Dmytro Dzhulgakov authored Mar 10, 2025
  
  b3b61ef5
- Support BF16 for low-latency kernels · ed7487c1
  Chenggang Zhao authored Mar 10, 2025
  
  ed7487c1
05 Mar, 2025 1 commit
- Fix AR bugs for normal kernels · 458cdcb2
  Chenggang Zhao authored Mar 05, 2025
  
  458cdcb2
04 Mar, 2025 2 commits
- Improve EP2/4 performance · 1553fc42
  Chenggang Zhao authored Mar 04, 2025
  
  1553fc42
- Add some docs · 2a3cac90
  Chenggang Zhao authored Mar 04, 2025
  
  2a3cac90
26 Feb, 2025 1 commit
- Add `NVSHMEM_IB_ENABLE_RELAXED_ORDERING` · 3885404f
  Chenggang Zhao authored Feb 26, 2025
  
  3885404f
25 Feb, 2025 1 commit
- Initial commit · ebfe47e4
  Chenggang Zhao authored Feb 24, 2025
  
  ebfe47e4