Commits · a16af40531f6c5dabfd7f0b22b835ecfe3e43464 · OpenDAS / DeepEP

10 Jun, 2025 3 commits
- Merge pull request #201 from youkaichao/no_gdrcopy · a16af405
  Shangyan Zhou authored Jun 10, 2025
```
remove the dependency of gdrcopy
```
  a16af405
- update readme · b9b7ce34
  youkaichao authored Jun 10, 2025
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
  b9b7ce34
- update the patch · 97be5a38
  youkaichao authored Jun 10, 2025
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
  97be5a38
09 Jun, 2025 4 commits
- Remove useless comments · 1157693c
  Chenggang Zhao authored Jun 09, 2025
  
  1157693c
- Support statistics tensor for low-latency kernels (#196) · 5a2e37fa
  Chenggang Zhao authored Jun 09, 2025
  
  5a2e37fa
- Add low-latency kernel PCIe usage flag (#195) · 0d1a855d
  Chenggang Zhao authored Jun 09, 2025
```
* Add low-latency kernel usage flag

* Update comments
```
  0d1a855d
- Fix `< PTX ISA 8.6` compatibility (#194) · 564e3752
  Chenggang Zhao authored Jun 09, 2025
  
  564e3752
08 Jun, 2025 1 commit
- Merge pull request #193 from fzyzcjy/feat/fix_mnnvl · 11a0b0e1
  Shangyan Zhou authored Jun 08, 2025
```
Allow using MNNVL
```
  11a0b0e1
07 Jun, 2025 1 commit
- more · 4cd95170
  fzyzcjy authored Jun 07, 2025
  
  4cd95170
06 Jun, 2025 2 commits

- Use TMA instead of LD/ST for intra-node normal kernels (#191) · c8dceba1
  Chenggang Zhao authored Jun 06, 2025
  
  * Update CMake files
  * Use TMA instead of LD/ST for intranode dispatch
  * Use TMA instead of LD/ST for intranode combine
  * Adjust configs
  * Test default configs as well
  * More warps for combine
  * Add inter-thread fence
  * Enable more warps
  * Do not use TMA for senders
  * Update configs
  * Remove useless wait
  
  c8dceba1

- Reduce NVSHMEM gpu memory usage and disable MNNVL. (#190) · df4debe3
  Shangyan Zhou authored Jun 06, 2025
```
Co-authored-by: Shangyan Zhou <sy.zhou@deepseek.com>
```
  df4debe3

05 Jun, 2025 3 commits
- Update README · d8dd185c
  Chenggang Zhao authored Jun 05, 2025
  
  d8dd185c
- Update readme. · de8cfca3
  Shangyan Zhou authored Jun 05, 2025
  
  de8cfca3
- Merge pull request #182 from wzc-wuzhicheng/fix-notify-dispatch · fc48a467
  Shangyan Zhou authored Jun 05, 2025
```
Fix notify_dispatch: using warp 0 to issue send
```
  fc48a467
03 Jun, 2025 1 commit
- Fix notify_dispatch: using warp 0 to issue send · d0225df2
  wzc.wuzhicheng authored Jun 03, 2025
```
Signed-off-by: wzc.wuzhicheng <wzc.wuzhicheng@linux.alibaba.com>
```
  d0225df2
28 May, 2025 1 commit
- Use IBGDA only (#177) · 9fe9021f
  Shangyan Zhou authored May 28, 2025
  
  9fe9021f
23 May, 2025 4 commits
- Allow NVLink traffic for low-latency kernels by default · aae9fa9a
  Chenggang Zhao authored May 23, 2025
  
  aae9fa9a
- Merge pull request #174 from deepseek-ai/p2p-refactor · 8da1b1f8
  Shangyan Zhou authored May 23, 2025
```
Low-latency P2P code cleanup and bug fixed
```
  8da1b1f8
- Code cleanup and bug fixed · 92405ddf
  Chenggang Zhao authored May 23, 2025
  
  92405ddf
- Feature: LL nvlink p2p (#173) · 68ae8b3d
  cywork121 authored May 23, 2025
  
  68ae8b3d
19 May, 2025 1 commit

- Make `TORCH_CUDA_ARCH_LIST` as an environment variable (#167) · d5ca4495
  guyueh1 authored May 18, 2025
  
  * Add 10.0 to TORCH_CUDA_ARCH_LIST
    Signed-off-by: Guyue Huang <guyueh@nvidia.com>
  * Revert csrc/CMakeLists change; in setup.py make TORCH_CUDA_ARCH_LIST configurable
    Signed-off-by: Guyue Huang <guyueh@nvidia.com>
  
  ---------
  Signed-off-by: Guyue Huang <guyueh@nvidia.com>
  
  d5ca4495
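
As an illustration of the `setup.py` change described in this commit, a build script can take the architecture list from the environment and fall back to a default only when the variable is unset. This is a minimal sketch, not the actual DeepEP code: the helper name `resolve_arch_list` and the fallback value `"9.0"` are assumptions made for the example.

```python
import os
from typing import Mapping


def resolve_arch_list(env: Mapping[str, str] = os.environ, default: str = "9.0") -> str:
    """Return the CUDA arch list, preferring the caller's environment.

    `default` is a hypothetical fallback, not a value taken from the DeepEP repository.
    """
    return env.get("TORCH_CUDA_ARCH_LIST", default)


if __name__ == "__main__":
    # Export the resolved value so that a later extension build can pick it up.
    os.environ["TORCH_CUDA_ARCH_LIST"] = resolve_arch_list()
    print(os.environ["TORCH_CUDA_ARCH_LIST"])
```

With this pattern, a user who needs an extra architecture (such as the 10.0 target named in the commit message) would set the variable before building instead of editing the build files.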

12 May, 2025 4 commits
- Merge pull request #154 from sleepcoo/support-more-hidden · bb393e77
  Chenggang Zhao authored May 12, 2025
```
Support hidden size 4096
```
  bb393e77
- support hidden size 4096 · a107266a
  sleepcoo authored May 12, 2025
```
Co-authored-by: zhyncs <me@zhyncs.com>
Co-authored-by: yinfan98 <1106310035@qq.com>
```
  a107266a
- Merge pull request #151 from vicoooo26/feat/nvidia-peer-mem-detection · 05104029
  Shangyan Zhou authored May 12, 2025
```
Feat: enhance nvidia peer memory detection
```
  05104029
- Merge pull request #153 from wangfakang/opt-shuffled_dst · f0a9f106
  Chenggang Zhao authored May 12, 2025
```
Shuffling the starting index of target rank for different ranks and channels
```
  f0a9f106
10 May, 2025 1 commit

- To mitigate incast congestion, shuffle the starting index of target rank for... · 63c29d06
  wangfakang authored May 09, 2025
  
  To mitigate incast congestion, shuffle the starting index of target rank for different ranks and channels
  Signed-off-by: wangfakang <fakangwang@gmail.com>
  
  63c29d06
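
The idea in this commit message can be sketched as follows. If every sender walks the target ranks in the same order, all of them hit the same receiver at the same step (incast). Offsetting each sender's starting index spreads the first sends across receivers. The function name `target_rank_order` and the offset formula `(rank + channel) % num_ranks` are assumptions for illustration; the commit's actual formula is not shown in this log.

```python
def target_rank_order(rank: int, channel: int, num_ranks: int) -> list[int]:
    """Return the order in which one (rank, channel) pair visits target ranks.

    The starting offset below is a hypothetical choice, not DeepEP's formula.
    """
    start = (rank + channel) % num_ranks
    # Rotate the plain 0..num_ranks-1 sequence so that it begins at `start`.
    return [(start + i) % num_ranks for i in range(num_ranks)]


if __name__ == "__main__":
    num_ranks = 4
    for rank in range(num_ranks):
        print(rank, target_rank_order(rank, channel=0, num_ranks=num_ranks))
```

In this sketch every sender still visits each target exactly once, but at any given step the senders on one channel address distinct targets.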

09 May, 2025 1 commit
- Feat: enhance nvidia peer memory detection · c6051f38
  Vico Chu authored May 09, 2025
  
  c6051f38
08 May, 2025 2 commits
- Merge pull request #142 from fzyzcjy/patch-3 · 9056a6db
  Chenggang Zhao authored May 08, 2025
```
Fix DeepEP cannot be used together with code that needs GIL such as Mooncake transfer engine
```
  9056a6db
- Update deep_ep.cpp · adc6e24c
  fzyzcjy authored May 08, 2025
  
  adc6e24c
29 Apr, 2025 1 commit
- Update deep_ep.cpp · 23ded3bd
  fzyzcjy authored Apr 29, 2025
  
  23ded3bd
27 Apr, 2025 2 commits
- Merge pull request #135 from deepseek-ai/add-iw-fork · 65e2a700
  Shangyan Zhou authored Apr 27, 2025
```
Add Infrawaves' fork to README.
```
  65e2a700
- Add Infrawaves' fork to README. · 1a0c8f64
  Shangyan Zhou authored Apr 27, 2025
  
  1a0c8f64
22 Apr, 2025 6 commits
- Merge pull request #130 from deepseek-ai/trmt/internode_multi_qp · 007fcfcf
  Chenggang Zhao authored Apr 22, 2025
```
Support multi-QP for normal kernels
```
  007fcfcf
- Use `put_nbi_warp`. · e255d57b
  Shangyan Zhou authored Apr 22, 2025
  
  e255d57b
- Fix the performance data. · 3b1045db
  Shangyan Zhou authored Apr 22, 2025
  
  3b1045db
- Several code lints · edbb1bc3
  Chenggang Zhao authored Apr 22, 2025
  
  edbb1bc3
- Normal kernels always use IBGDA mode. · 3e54b78f
  Shangyan Zhou authored Apr 22, 2025
  
  3e54b78f
- Refactor some code. · 20b2aaaf
  Shangyan Zhou authored Apr 22, 2025
  
  20b2aaaf
21 Apr, 2025 2 commits
- Merge branch 'trmt/internode_multi_qp' of github.com:deepseek-ai/DeepEP into... · c07fdd19
  moningchen authored Apr 21, 2025
```
Merge branch 'trmt/internode_multi_qp' of github.com:deepseek-ai/DeepEP into trmt/internode_multi_qp
```
  c07fdd19
- Add the performance data after internode optimization in the Readme file · e0eaaf94
  moningchen authored Apr 21, 2025
  
  e0eaaf94