Commits · 9d4f7ef8eeedd9970c2bb8efe998e07e7bb9df5b · OpenDAS / DeepEP

18 Jun, 2025 6 commits
- Surpass type checks · 9d4f7ef8
  Chenggang Zhao authored Jun 18, 2025
  
  9d4f7ef8
- Adjust import order · b56f7c2c
  Chenggang Zhao authored Jun 18, 2025
  
  b56f7c2c
- Merge pull request #222 from deepseek-ai/set_dev_id · a2d2354e
  Shangyan Zhou authored Jun 18, 2025
```
Set `device_id` to suppress pytorch warning.
```
  a2d2354e
- Move import. · cd371d31
  Shangyan Zhou authored Jun 18, 2025
  
  cd371d31
- Set `device_id` to suppress pytorch warning. · bf4a4a21
  Shangyan Zhou authored Jun 18, 2025
  
  bf4a4a21
- Fix the tail loading issue. (#219) · 77f97f79
  Shangyan Zhou authored Jun 18, 2025
```
* Fix the tail loading issue.

* Modify the sync offset.
```
  77f97f79
16 Jun, 2025 4 commits
- Fix warp synchronization. (#215) · dd133d39
  Shangyan Zhou authored Jun 16, 2025
```
* Fix warp synchronization.

* Another fix.
```
  dd133d39
- Remove the low-latency usage flag (#214) · 8aaddf76
  Chenggang Zhao authored Jun 16, 2025
  
  8aaddf76
- Add automatic warp count control for low-latency kernels (#213) · 1b92be8a
  Chenggang Zhao authored Jun 16, 2025
```
* Add automatic warp count control for low-latency dispatch

* Add automatic warp count control for low-latency combine

* More assertions
```
  1b92be8a
- Update intranode.cu (#210) · 4e923188
  fzyzcjy authored Jun 16, 2025
  
  4e923188
13 Jun, 2025 2 commits
- Update assertion of `num_rc_per_pe`. · 483f00af
  Shangyan Zhou authored Jun 13, 2025
  
  483f00af
- Use one qp per sm for internode normal kernels (#181) · 05df5554
  Zhicheng Wu authored Jun 13, 2025
```
let the sender SM use the channel_id, and the receiver SM use channel_id + num_channels
```
  05df5554
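The channel-to-QP rule stated in 05df5554 can be illustrated with a small index calculation. This is only a sketch of the rule as worded in the commit message, not code from the repository: the function name `qp_index` and its bounds check are hypothetical, and it assumes channel IDs run from 0 to `num_channels - 1`.

```python
def qp_index(channel_id: int, num_channels: int, is_sender: bool) -> int:
    """Map a channel to a queue-pair slot (hypothetical helper).

    Sender SMs use `channel_id`; receiver SMs use `channel_id + num_channels`,
    so the two roles never share a slot.
    """
    if not 0 <= channel_id < num_channels:
        raise ValueError("channel_id must be in [0, num_channels)")
    return channel_id if is_sender else channel_id + num_channels


# With 4 channels, senders occupy slots 0..3 and receivers slots 4..7.
sender_slots = [qp_index(c, 4, True) for c in range(4)]
receiver_slots = [qp_index(c, 4, False) for c in range(4)]
```

Under this assumption the two roles together need `2 * num_channels` distinct slots.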
12 Jun, 2025 1 commit
- Support UE8M0 data format. (#206) · 21efbe9b
  Shifang Xu authored Jun 12, 2025
  
  21efbe9b
11 Jun, 2025 4 commits
- Use `pynvml` to detect NVLink connections (#205) · 9ec06120
  Chenggang Zhao authored Jun 11, 2025
```
* Use `pynvml` to detect NVLink connections

* Add a TODO

* Add shutdown

* Fix comments
```
  9ec06120
- Support Ampere architecture (#204) · b8d90fb7
  Chenggang Zhao authored Jun 11, 2025
  * Update README
  * Update `setup.py`
  * Fix headers
  * Add `DISABLE_NVSHMEM` for APIs
  * Fix launch
  * Fix TMA settings
  * Fix TMA usages
  * Fix dlink
  * Separate layout kernels
  * Update version
  * Add `is_sm90_compiled`
  * Fix tests
  * Add NVLink connection checks
  * Update README
  * Fix tests
  * Add some comments
  * Minor fix
  * Minor fix
  * Fix bugs
  
  b8d90fb7
- Check the empty list · dd13c714
  Chenggang Zhao authored Jun 11, 2025
  
  dd13c714
- Support CUDA graph for intranode normal kernels (#203) · a8299ca7
  Chenggang Zhao authored Jun 11, 2025
  
  a8299ca7
10 Jun, 2025 4 commits
- Fully remove barrier FIFO designs (#200) · 8da2d7b3
  Chenggang Zhao authored Jun 10, 2025
```
* Fully remove FIFO slots

* Fully remove FIFO buffers

* Minor fix styles

* Fix some typos

* Bugs fixed

* Cleanup `ibgda_poll_cq`
```
  8da2d7b3
- Merge pull request #201 from youkaichao/no_gdrcopy · a16af405
  Shangyan Zhou authored Jun 10, 2025
```
remove the dependency of gdrcopy
```
  a16af405
- update readme · b9b7ce34
  youkaichao authored Jun 10, 2025
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
  b9b7ce34
- update the patch · 97be5a38
  youkaichao authored Jun 10, 2025
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
  97be5a38
09 Jun, 2025 4 commits
- Remove useless comments · 1157693c
  Chenggang Zhao authored Jun 09, 2025
  
  1157693c
- Support statistics tensor for low-latency kernels (#196) · 5a2e37fa
  Chenggang Zhao authored Jun 09, 2025
  
  5a2e37fa
- Add low-latency kernel PCIe usage flag (#195) · 0d1a855d
  Chenggang Zhao authored Jun 09, 2025
```
* Add low-latency kernel usage flag

* Update comments
```
  0d1a855d
- Fix `< PTX ISA 8.6` compatibility (#194) · 564e3752
  Chenggang Zhao authored Jun 09, 2025
  
  564e3752
08 Jun, 2025 1 commit
- Merge pull request #193 from fzyzcjy/feat/fix_mnnvl · 11a0b0e1
  Shangyan Zhou authored Jun 08, 2025
```
Allow using MNNVL
```
  11a0b0e1
07 Jun, 2025 1 commit
- more · 4cd95170
  fzyzcjy authored Jun 07, 2025
  
  4cd95170
06 Jun, 2025 2 commits
- Use TMA instead of LD/ST for intra-node normal kernels (#191) · c8dceba1
  Chenggang Zhao authored Jun 06, 2025
  * Update CMake files
  * Use TMA instead of LD/ST for intranode dispatch
  * Use TMA instead of LD/ST for intranode combine
  * Adjust configs
  * Test default configs as well
  * More warps for combine
  * Add inter-thread fence
  * Enable more warps
  * Do not use TMA for senders
  * Update configs
  * Remove useless wait
  
  c8dceba1
- Reduce NVSHMEM gpu memory usage and disable MNNVL. (#190) · df4debe3
  Shangyan Zhou authored Jun 06, 2025
```
Co-authored-by: Shangyan Zhou <sy.zhou@deepseek.com>
```
  df4debe3
05 Jun, 2025 3 commits
- Update README · d8dd185c
  Chenggang Zhao authored Jun 05, 2025
  
  d8dd185c
- Update readme. · de8cfca3
  Shangyan Zhou authored Jun 05, 2025
  
  de8cfca3
- Merge pull request #182 from wzc-wuzhicheng/fix-notify-dispatch · fc48a467
  Shangyan Zhou authored Jun 05, 2025
```
Fix notify_dispatch: using warp 0 to issue send
```
  fc48a467
03 Jun, 2025 1 commit
- Fix notify_dispatch: using warp 0 to issue send · d0225df2
  wzc.wuzhicheng authored Jun 03, 2025
```
Signed-off-by: wzc.wuzhicheng <wzc.wuzhicheng@linux.alibaba.com>
```
  d0225df2
28 May, 2025 1 commit
- Use IBGDA only (#177) · 9fe9021f
  Shangyan Zhou authored May 28, 2025
  
  9fe9021f
23 May, 2025 4 commits
- Allow NVLink traffic for low-latency kernels by default · aae9fa9a
  Chenggang Zhao authored May 23, 2025
  
  aae9fa9a
- Merge pull request #174 from deepseek-ai/p2p-refactor · 8da1b1f8
  Shangyan Zhou authored May 23, 2025
```
Low-latency P2P code cleanup and bug fixed
```
  8da1b1f8
- Code cleanup and bug fixed · 92405ddf
  Chenggang Zhao authored May 23, 2025
  
  92405ddf
- Feature: LL nvlink p2p (#173) · 68ae8b3d
  cywork121 authored May 23, 2025
  
  68ae8b3d
19 May, 2025 1 commit
- Make `TORCH_CUDA_ARCH_LIST` as an environment variable (#167) · d5ca4495
  guyueh1 authored May 18, 2025
  * Add 10.0 to TORCH_CUDA_ARCH_LIST
    Signed-off-by: Guyue Huang <guyueh@nvidia.com>
  * Revert csrc/CMakeLists change; in setup.py make TORCH_CUDA_ARCH_LIST configurable
    Signed-off-by: Guyue Huang <guyueh@nvidia.com>
  
  Signed-off-by: Guyue Huang <guyueh@nvidia.com>
  
  d5ca4495
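
A build script can make the architecture list configurable, in the spirit of d5ca4495, by reading the variable from the environment and falling back to a default. The sketch below is illustrative only: `resolve_arch_list` is a hypothetical helper, and the `"9.0"` fallback is an assumed value rather than one taken from the repository's `setup.py`.

```python
import os


def resolve_arch_list(default: str = "9.0") -> str:
    """Return TORCH_CUDA_ARCH_LIST from the environment, or `default`.

    An unset or blank variable falls back to the default, so the build
    behaves the same unless the user overrides it explicitly.
    """
    value = os.environ.get("TORCH_CUDA_ARCH_LIST", "").strip()
    return value or default


# Export the resolved value so later build steps see a single setting.
os.environ["TORCH_CUDA_ARCH_LIST"] = resolve_arch_list()
```

With a helper like this, a user could override the target architectures at build time by exporting `TORCH_CUDA_ARCH_LIST` before invoking the build, without editing the script.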

12 May, 2025 1 commit
- Merge pull request #154 from sleepcoo/support-more-hidden · bb393e77
  Chenggang Zhao authored May 12, 2025
```
Support hidden size 4096
```
  bb393e77