Commits · 0d1a855d8177604a85170384477fa6043947c0a1 · OpenDAS / DeepEP

09 Jun, 2025 2 commits
- Add low-latency kernel PCIe usage flag (#195) · 0d1a855d
  Chenggang Zhao authored Jun 09, 2025
```
* Add low-latency kernel usage flag

* Update comments
```
  0d1a855d
- Fix `< PTX ISA 8.6` compatibility (#194) · 564e3752
  Chenggang Zhao authored Jun 09, 2025
  
  564e3752
08 Jun, 2025 1 commit
- Merge pull request #193 from fzyzcjy/feat/fix_mnnvl · 11a0b0e1
  Shangyan Zhou authored Jun 08, 2025
```
Allow using MNNVL
```
  11a0b0e1
07 Jun, 2025 1 commit
- more · 4cd95170
  fzyzcjy authored Jun 07, 2025
  
  4cd95170
06 Jun, 2025 2 commits

- Use TMA instead of LD/ST for intra-node normal kernels (#191) · c8dceba1
  Chenggang Zhao authored Jun 06, 2025
  * Update CMake files
  * Use TMA instead of LD/ST for intranode dispatch
  * Use TMA instead of LD/ST for intranode combine
  * Adjust configs
  * Test default configs as well
  * More warps for combine
  * Add inter-thread fence
  * Enable more warps
  * Do not use TMA for senders
  * Update configs
  * Remove useless wait

  c8dceba1

- Reduce NVSHMEM gpu memory usage and disable MNNVL. (#190) · df4debe3
  Shangyan Zhou authored Jun 06, 2025
```
Co-authored-by: Shangyan Zhou <sy.zhou@deepseek.com>
```
  df4debe3

05 Jun, 2025 3 commits
- Update README · d8dd185c
  Chenggang Zhao authored Jun 05, 2025
  
  d8dd185c
- Update readme. · de8cfca3
  Shangyan Zhou authored Jun 05, 2025
  
  de8cfca3
- Merge pull request #182 from wzc-wuzhicheng/fix-notify-dispatch · fc48a467
  Shangyan Zhou authored Jun 05, 2025
```
Fix notify_dispatch: using warp 0 to issue send
```
  fc48a467
03 Jun, 2025 1 commit
- Fix notify_dispatch: using warp 0 to issue send · d0225df2
  wzc.wuzhicheng authored Jun 03, 2025
```
Signed-off-by: wzc.wuzhicheng <wzc.wuzhicheng@linux.alibaba.com>
```
  d0225df2
28 May, 2025 1 commit
- Use IBGDA only (#177) · 9fe9021f
  Shangyan Zhou authored May 28, 2025
  
  9fe9021f
23 May, 2025 4 commits
- Allow NVLink traffic for low-latency kernels by default · aae9fa9a
  Chenggang Zhao authored May 23, 2025
  
  aae9fa9a
- Merge pull request #174 from deepseek-ai/p2p-refactor · 8da1b1f8
  Shangyan Zhou authored May 23, 2025
```
Low-latency P2P code cleanup and bug fixed
```
  8da1b1f8
- Code cleanup and bug fixed · 92405ddf
  Chenggang Zhao authored May 23, 2025
  
  92405ddf
- Feature: LL nvlink p2p (#173) · 68ae8b3d
  cywork121 authored May 23, 2025
  
  68ae8b3d
19 May, 2025 1 commit

- Make `TORCH_CUDA_ARCH_LIST` as an environment variable (#167) · d5ca4495
  guyueh1 authored May 18, 2025
  * Add 10.0 to TORCH_CUDA_ARCH_LIST
    Signed-off-by: Guyue Huang <guyueh@nvidia.com>
  * Revert csrc/CMakeLists change; in setup.py make TORCH_CUDA_ARCH_LIST configurable
    Signed-off-by: Guyue Huang <guyueh@nvidia.com>

  Signed-off-by: Guyue Huang <guyueh@nvidia.com>

  d5ca4495

12 May, 2025 4 commits
- Merge pull request #154 from sleepcoo/support-more-hidden · bb393e77
  Chenggang Zhao authored May 12, 2025
```
Support hidden size 4096
```
  bb393e77
- support hidden size 4096 · a107266a
  sleepcoo authored May 12, 2025
```
Co-authored-by: zhyncs <me@zhyncs.com>
Co-authored-by: yinfan98 <1106310035@qq.com>
```
  a107266a
- Merge pull request #151 from vicoooo26/feat/nvidia-peer-mem-detection · 05104029
  Shangyan Zhou authored May 12, 2025
```
Feat: enhance nvidia peer memory detection
```
  05104029
- Merge pull request #153 from wangfakang/opt-shuffled_dst · f0a9f106
  Chenggang Zhao authored May 12, 2025
```
Shuffling the starting index of target rank for different ranks and channels
```
  f0a9f106
10 May, 2025 1 commit

- To mitigate incast congestion, shuffle the starting index of target rank for... · 63c29d06
  wangfakang authored May 09, 2025

  To mitigate incast congestion, shuffle the starting index of target rank for different ranks and channels
  Signed-off-by: wangfakang <fakangwang@gmail.com>

  63c29d06
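
The commit message above describes staggering the order in which each sender visits its target ranks, so that senders do not all hit the same receiver at the same moment. A minimal Python sketch of that idea follows. The helper names and the `(rank + channel) % num_ranks` offset are illustrative assumptions, not taken from the DeepEP source.

```python
def target_rank_order(rank: int, channel: int, num_ranks: int) -> list[int]:
    """Return the order in which one sender visits its target ranks.

    Hypothetical helper: the start offset (rank + channel) is an assumed
    formula for illustration, not necessarily the one used in DeepEP.
    """
    start = (rank + channel) % num_ranks
    return [(start + step) % num_ranks for step in range(num_ranks)]


def first_targets(channel: int, num_ranks: int, shuffled: bool) -> list[int]:
    """First target chosen by every sender rank on the given channel."""
    if shuffled:
        return [target_rank_order(r, channel, num_ranks)[0] for r in range(num_ranks)]
    # Unshuffled baseline: every sender starts its loop at rank 0.
    return [0] * num_ranks


if __name__ == "__main__":
    # Without shuffling, all four senders target rank 0 first (incast).
    print(first_targets(channel=1, num_ranks=4, shuffled=False))  # [0, 0, 0, 0]
    # With shuffling, the first targets are spread across all ranks.
    print(first_targets(channel=1, num_ranks=4, shuffled=True))   # [1, 2, 3, 0]
```

Each sender still visits every target exactly once; only the starting point of the loop differs per rank and channel.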

09 May, 2025 1 commit
- Feat: enhance nvidia peer memory detection · c6051f38
  Vico Chu authored May 09, 2025
  
  c6051f38
08 May, 2025 2 commits
- Merge pull request #142 from fzyzcjy/patch-3 · 9056a6db
  Chenggang Zhao authored May 08, 2025
```
Fix DeepEP cannot be used together with code that needs GIL such as Mooncake transfer engine
```
  9056a6db
- Update deep_ep.cpp · adc6e24c
  fzyzcjy authored May 08, 2025
  
  adc6e24c
29 Apr, 2025 1 commit
- Update deep_ep.cpp · 23ded3bd
  fzyzcjy authored Apr 29, 2025
  
  23ded3bd
27 Apr, 2025 2 commits
- Merge pull request #135 from deepseek-ai/add-iw-fork · 65e2a700
  Shangyan Zhou authored Apr 27, 2025
```
Add Infrawaves' fork to README.
```
  65e2a700
- Add Infrawaves' fork to README. · 1a0c8f64
  Shangyan Zhou authored Apr 27, 2025
  
  1a0c8f64
22 Apr, 2025 6 commits
- Merge pull request #130 from deepseek-ai/trmt/internode_multi_qp · 007fcfcf
  Chenggang Zhao authored Apr 22, 2025
```
Support multi-QP for normal kernels
```
  007fcfcf
- Use `put_nbi_warp`. · e255d57b
  Shangyan Zhou authored Apr 22, 2025
  
  e255d57b
- Fix the performance data. · 3b1045db
  Shangyan Zhou authored Apr 22, 2025
  
  3b1045db
- Several code lints · edbb1bc3
  Chenggang Zhao authored Apr 22, 2025
  
  edbb1bc3
- Normal kernels always use IBGDA mode. · 3e54b78f
  Shangyan Zhou authored Apr 22, 2025
  
  3e54b78f
- Refactor some code. · 20b2aaaf
  Shangyan Zhou authored Apr 22, 2025
  
  20b2aaaf
21 Apr, 2025 4 commits

- Merge branch 'trmt/internode_multi_qp' of github.com:deepseek-ai/DeepEP into... · c07fdd19
  moningchen authored Apr 21, 2025
```
Merge branch 'trmt/internode_multi_qp' of github.com:deepseek-ai/DeepEP into trmt/internode_multi_qp
```
  c07fdd19
- Add the performance data after internode optimization in the Readme file · e0eaaf94
  moningchen authored Apr 21, 2025
  
  e0eaaf94
- Revert `ibgda_device.cuh` and remove some comments. · e2c57848
  Shangyan Zhou authored Apr 21, 2025
  
  e2c57848
- In the Internode Normal Kernel, when using nvshmem ibrc for RDMA data... · 5ab80c28
  moningchen authored Apr 21, 2025

  In the Internode Normal Kernel, when using nvshmem ibrc for RDMA data transmission, a single QP is used for data transfer between two GPUs, which limits kernel performance in network card dual-port and RoCE network scenarios.

  In our optimized Internode Normal Kernel, we implemented multiple QPs for data transmission between two GPUs, setting a different QP for each channel. Additionally, we modified the transmission method from IBRC to IBGDA.

  Through these optimizations, the Internode Normal Kernel achieves optimal performance in both H800 and H20 environments, with RDMA transmission performance nearly reaching the physical network performance limit. Using the current default statistical method, in 4-node H800 and H20 environments, RDMA performance can reach 60GB/s+.

  5ab80c28
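
The per-channel QP assignment described in the commit message above can be pictured with a short Python sketch. This is an illustration only: `qp_for_channel`, `channels_per_qp`, the modulo mapping, and the QP counts are hypothetical and are not taken from the DeepEP or NVSHMEM sources.

```python
def qp_for_channel(channel: int, num_qps_per_peer: int) -> int:
    """Pick the queue pair (QP) index a channel uses toward one peer GPU.

    Assumed mapping for illustration: channels wrap around the available QPs.
    """
    if num_qps_per_peer <= 0:
        raise ValueError("num_qps_per_peer must be positive")
    return channel % num_qps_per_peer


def channels_per_qp(num_channels: int, num_qps_per_peer: int) -> dict[int, list[int]]:
    """Group channel indices by the QP they would share under this mapping."""
    usage: dict[int, list[int]] = {}
    for channel in range(num_channels):
        qp = qp_for_channel(channel, num_qps_per_peer)
        usage.setdefault(qp, []).append(channel)
    return usage


if __name__ == "__main__":
    # Single QP between two GPUs: every channel shares QP 0.
    print(channels_per_qp(4, 1))  # {0: [0, 1, 2, 3]}
    # One QP per channel: each channel gets its own QP.
    print(channels_per_qp(4, 4))  # {0: [0], 1: [1], 2: [2], 3: [3]}
```

With a single QP, all channels between a GPU pair funnel through one connection, which is the limitation the commit message reports for dual-port NICs and RoCE networks. Giving each channel its own QP lets that traffic spread across more connections.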

16 Apr, 2025 2 commits
- Merge pull request #124 from wplf/patch-1 · a84a2480
  Shangyan Zhou authored Apr 16, 2025
```
Fix typo in nvshmem.patch
```
  a84a2480
- Fix typo in nvshmem.patch · a2ccc95d
  李金梁 authored Apr 16, 2025
  
  a2ccc95d
14 Apr, 2025 1 commit
- Merge pull request #118 from andylin-hao/main · a0c69317
  Chenggang Zhao authored Apr 14, 2025
```
Fix test combine args
```
  a0c69317