Commits · 9ec061204e6763c12a9dd9f4cc5ca3b6c868b552 · OpenDAS / DeepEP

11 Jun, 2025 4 commits

Use `pynvml` to detect NVLink connections (#205) · 9ec06120
Chenggang Zhao authored Jun 11, 2025
```
* Use `pynvml` to detect NVLink connections

* Add a TODO

* Add shutdown

* Fix comments
```

Support Ampere architecture (#204) · b8d90fb7

Chenggang Zhao authored Jun 11, 2025

* Update README

* Update `setup.py`

* Fix headers

* Add `DISABLE_NVSHMEM` for APIs

* Fix launch

* Fix TMA settings

* Fix TMA usages

* Fix dlink

* Separate layout kernels

* Update version

* Add `is_sm90_compiled`

* Fix tests

* Add NVLink connection checks

* Update README

* Fix tests

* Add some comments

* Minor fix

* Minor fix

* Fix bugs

Check the empty list · dd13c714
Chenggang Zhao authored Jun 11, 2025

Support CUDA graph for intranode normal kernels (#203) · a8299ca7
Chenggang Zhao authored Jun 11, 2025

09 Jun, 2025 2 commits
- Support statistics tensor for low-latency kernels (#196) · 5a2e37fa
  Chenggang Zhao authored Jun 09, 2025
  
- Add low-latency kernel PCIe usage flag (#195) · 0d1a855d
  Chenggang Zhao authored Jun 09, 2025
  ```
  * Add low-latency kernel usage flag

  * Update comments
  ```
07 Jun, 2025 1 commit
- more · 4cd95170
  fzyzcjy authored Jun 07, 2025
  
06 Jun, 2025 2 commits

Use TMA instead of LD/ST for intra-node normal kernels (#191) · c8dceba1

Chenggang Zhao authored Jun 06, 2025

* Update CMake files

* Use TMA instead of LD/ST for intranode dispatch

* Use TMA instead of LD/ST for intranode combine

* Adjust configs

* Test default configs as well

* More warps for combine

* Add inter-thread fence

* Enable more warps

* Do not use TMA for senders

* Update configs

* Remove useless wait

Reduce NVSHMEM gpu memory usage and disable MNNVL. (#190) · df4debe3
Shangyan Zhou authored Jun 06, 2025
```
Co-authored-by: Shangyan Zhou <sy.zhou@deepseek.com>
```

23 May, 2025 3 commits
- Allow NVLink traffic for low-latency kernels by default · aae9fa9a
  Chenggang Zhao authored May 23, 2025
  
- Code cleanup and bug fixed · 92405ddf
  Chenggang Zhao authored May 23, 2025
  
- Feature: LL nvlink p2p (#173) · 68ae8b3d
  cywork121 authored May 23, 2025
  
22 Apr, 2025 2 commits
- Several code lints · edbb1bc3
  Chenggang Zhao authored Apr 22, 2025
  
- Normal kernels always use IBGDA mode. · 3e54b78f
  Shangyan Zhou authored Apr 22, 2025
  
21 Apr, 2025 1 commit

In the Internode Normal Kernel, when using nvshmem ibrc for RDMA data... · 5ab80c28

moningchen authored Apr 21, 2025

In the Internode Normal Kernel, when NVSHMEM IBRC is used for RDMA data transmission, a single QP carries all data between two GPUs, which limits kernel performance on dual-port NICs and in RoCE networks.

In our optimized Internode Normal Kernel, we use multiple QPs for data transmission between two GPUs, assigning a different QP to each channel. We also changed the transmission method from IBRC to IBGDA.

With these optimizations, the Internode Normal Kernel achieves its best performance in both H800 and H20 environments, with RDMA throughput close to the physical limit of the network. Measured with the current default statistical method in 4-node H800 and H20 environments, RDMA throughput exceeds 60 GB/s.
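
The per-channel QP assignment described above can be pictured with a small sketch. This is an illustration only, not code from this commit: the function names, the channel count, and the wrap-around rule for the case of fewer QPs than channels are all assumptions made for the example.

```python
# Hypothetical illustration of channel-to-QP assignment; none of these
# names come from DeepEP or NVSHMEM.

def qp_index_single(channel_id: int) -> int:
    """Baseline: every channel between two GPUs shares one QP."""
    return 0


def qp_index_per_channel(channel_id: int, num_qps: int) -> int:
    """Give each channel its own QP.

    Wrapping with modulo when num_qps < number of channels is an
    assumption of this sketch, not something the commit message states.
    """
    if num_qps <= 0:
        raise ValueError("num_qps must be positive")
    return channel_id % num_qps


num_channels = 8  # assumed value for the example

# Old behaviour: all channels funnel into QP 0.
shared = {c: qp_index_single(c) for c in range(num_channels)}

# New behaviour: one distinct QP per channel.
spread = {c: qp_index_per_channel(c, num_qps=num_channels)
          for c in range(num_channels)}
```

With one QP per channel, traffic from different channels no longer serializes on a single queue, which is the property the commit message credits for better use of dual-port NICs and RoCE networks.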

03 Apr, 2025 1 commit
- Update buffer.py · 218c5a1f
  fzyzcjy authored Apr 03, 2025
  
25 Mar, 2025 1 commit
- Update buffer.py · 36b5c279
  fzyzcjy authored Mar 25, 2025
  
18 Mar, 2025 1 commit
- Support zero-copy for low-latency combine · dcaf73e5
  Chenggang Zhao authored Mar 18, 2025
  
13 Mar, 2025 1 commit
- comments · 50ac280a
  Dmytro Dzhulgakov authored Mar 13, 2025
  
10 Mar, 2025 2 commits
- Allow passing output tensor in low_latency_combine · b3b61ef5
  Dmytro Dzhulgakov authored Mar 10, 2025
  
- Support BF16 for low-latency kernels · ed7487c1
  Chenggang Zhao authored Mar 10, 2025
  
05 Mar, 2025 1 commit
- Fix AR bugs for normal kernels · 458cdcb2
  Chenggang Zhao authored Mar 05, 2025
  
04 Mar, 2025 2 commits
- Improve EP2/4 performance · 1553fc42
  Chenggang Zhao authored Mar 04, 2025
  
- Add some docs · 2a3cac90
  Chenggang Zhao authored Mar 04, 2025
  
26 Feb, 2025 1 commit
- Add `NVSHMEM_IB_ENABLE_RELAXED_ORDERING` · 3885404f
  Chenggang Zhao authored Feb 26, 2025
  
25 Feb, 2025 1 commit
- Initial commit · ebfe47e4
  Chenggang Zhao authored Feb 24, 2025
  