Commits · d9767ce05f67b4f8314d3f6ebd95846939256d0b · OpenDAS / DeepEP

11 Aug, 2025 1 commit
- Add EP144/160 · d9767ce0
  Chenggang Zhao authored Aug 11, 2025
  
08 Aug, 2025 2 commits
- fix when --num-tokens == 1 (#356) · 695b6347
  AlphaBaby authored Aug 08, 2025
```
Co-authored-by: fujianhao.fjh <fujianhao.fjh@antgroup.com>
```
- Fix low latency test · 51071968
  Shangyan Zhou authored Aug 08, 2025
  
07 Aug, 2025 3 commits

- Fix indent · d31c72a1
  Chenggang Zhao authored Aug 07, 2025
- Fix compilation · dd14b36d
  Chenggang Zhao authored Aug 07, 2025

Support 10-bit LogFMT Combine (#345) · c5facf5c

Zhean Xu authored Aug 07, 2025



* independent logfmt_simulate function

* draft: logfmt low latency combine

* Minor bug fixes

* Fix non-logfmt bugs

* Fix logfmt bugs

* Fix logfmt bugs

* Minor fix

* Minor fix

* Clean code

* Clean code

* Use fewer regs

* Use two warp groups

* Correct shared memory size

* Minor fix

* Minor fix

* More rigorous tests

* Clean code

* Use more SMs

* Use different unroll factor for send & recv

* Update csrc/kernels/internode_ll.cu
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update csrc/kernels/internode_ll.cu
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Some renaming

* Some comments of tests

* Format `logfmt_encode`

* More lints

* Some refactors on sends

* Fix testing

* Fix bugs

* Renaming

* Use the full warp

* Unify combine recv

* Lint

* Lint

* Support 2560

* Fix meta buffer dtype

* Better encode calls

* Better amin/max writes

* Extra sync

* Read `topk_idx` only once

* Better specialization

* Read weights only once

* Rename

* Bug fixed

* Some renaming

* Fix local memory usage for sending

* Fix local memory usage for receiving

* Fewer writes

* Optimize performance

* Optimize performance

* Better performance

* Optimize performance

* Fix rounding

* Manually unroll

* Fix bench

---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>


05 Aug, 2025 1 commit
- build(setuptools): fix nvshmem dynamic library name searching in Python 3.9 (#351) · 26cf250a
  windreamer authored Aug 05, 2025
  
01 Aug, 2025 1 commit
- more (#275) · ab0a3dd2
  fzyzcjy authored Aug 01, 2025
  
31 Jul, 2025 4 commits
- Code lint · c6faca45
  Chenggang Zhao authored Jul 31, 2025
  
- Remove the diagnosis part from tests · c7033854
  Chenggang Zhao authored Jul 31, 2025
  
- Fix SM80 compilation · be8053d6
  Chenggang Zhao authored Jul 31, 2025
  
- Update README · 227c3589
  Shangyan Zhou authored Jul 31, 2025
  
30 Jul, 2025 1 commit

Add diagnosis module for efficient and precise location of slow rank (#311) · 4b67064d

sky authored Jul 30, 2025



* Add diagnosis module for precise identification of slow ranks
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Add tests for the slow diagnosis module
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Update some comments for diagnose
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Update test case for diagnose
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Strip the diagnose module, thx LyricZhao and sphish.
Signed-off-by: wangfakang <fakangwang@gmail.com>

* update variable name and cumulative wait recv cost, thx sphish.
Signed-off-by: wangfakang <fakangwang@gmail.com>

* remove invalid comments.
Signed-off-by: wangfakang <fakangwang@gmail.com>

---------
Signed-off-by: wangfakang <fakangwang@gmail.com>


29 Jul, 2025 2 commits

Add hidden_size 6144 (#329) · b92d0d48

Jee Jee Li authored Jul 29, 2025



* Done
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* Add comment
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

---------
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>


Fix the address of dispatch_rdma_recv_count_buffer to avoid cleaning after... · 42253d08

Void authored Jul 29, 2025


Fix the address of dispatch_rdma_recv_count_buffer to avoid cleaning after each change in hidden_size/token_num. (#313)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>


22 Jul, 2025 1 commit
- Update combine config · bdd119f8
  Shangyan Zhou authored Jul 22, 2025
  
21 Jul, 2025 3 commits

fix hang due to small rdma_chunk_size (#317) · d72817eb
Zhiyi Hu authored Jul 21, 2025
```
Co-authored-by: zhiyi Hu <zhiyihu@U-NYQQMGK0-2250.local>
```

Minor patches for deepep (#318) · 5b549c85

Guangguan Wang authored Jul 21, 2025



* Add arg --pressure-test for test_low_latency.py

Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>

* Export NVSHMEM_QP_DEPTH

Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>

---------
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>


Use TMA to optimize combine forwarder. (#320) · f9c06bb0

Shangyan Zhou authored Jul 21, 2025



* Remove an outdated todo

* Increase the number of combine forward warps.

* forwarder use TMA.

* Small fix

* Code lint

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>


18 Jul, 2025 1 commit
- Fix for data error and kernel hang because of inflight rdma channel head update (#310) · e6012370
  Shangyan Zhou authored Jul 18, 2025
17 Jul, 2025 1 commit

Fix for data error and kernel hang because of inflight rdma channel head update · b65b22ed

Guangguan authored Jul 15, 2025



During dispatch/combine, neither the sender nor the receiver waits
for the rdma channel head update to finish, so inflight head update
wqes may remain even after the kernel has finished. If those inflight
wqes arrive after the rdma channel head buffer has been cleaned for
the next round of dispatch/combine, the rdma channel head buffer is
rewritten to a non-zero value. Because of the wrong rdma channel head,
the rdma sender can then reuse the data buffer before the rdma
receivers have consumed it, causing data errors and kernel hangs.
To preserve performance by overlapping the inflight wqes' RTT,
fix this issue by waiting for all previous inflight wqes to
complete before cleaning rdma buffers in the next round of
dispatch/combine.
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
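
The race described in this message can be illustrated with a toy, single-threaded model. This is only a sketch under stated assumptions, not DeepEP or NVSHMEM code: `HeadChannel` and `run_round_then_clean` are hypothetical names, a queue stands in for head update wqes that have been posted but not yet delivered, and the value `8` is an arbitrary head position.

```python
from collections import deque


class HeadChannel:
    """Hypothetical toy model of one rdma channel head slot."""

    def __init__(self):
        self.head = 0            # value the sender reads before reusing the data buffer
        self.inflight = deque()  # head updates posted but not yet delivered

    def post_head_update(self, value):
        # Non-blocking post: the write lands at some later point.
        self.inflight.append(value)

    def deliver_all(self):
        # Models waiting until every outstanding write has completed.
        while self.inflight:
            self.head = self.inflight.popleft()

    def clean(self):
        # Buffer cleaning done at the start of the next round.
        self.head = 0


def run_round_then_clean(wait_before_clean):
    ch = HeadChannel()
    ch.post_head_update(8)   # last update of round N, still in flight at kernel exit
    if wait_before_clean:
        ch.deliver_all()     # the fix: drain previous inflight writes first
    ch.clean()               # cleaning for round N + 1
    ch.deliver_all()         # any late write lands after the cleaning
    return ch.head


print(run_round_then_clean(False))  # stale non-zero head survives the cleaning
print(run_round_then_clean(True))   # head stays zero for the next round
```

In this model, skipping the wait leaves a stale non-zero head after cleaning, which corresponds to the condition under which a sender could reuse the data buffer too early; draining first keeps the head at zero.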


16 Jul, 2025 3 commits
- Update README · 0eee87b8
  Shangyan Zhou authored Jul 16, 2025
  
- third-party: Improvements to NVSHMEM Integration · b6ce310b
  Shangyan Zhou authored Jul 16, 2025
- Optimize `cached_notify` by TMA (#306) · 146b013d
  Shangyan Zhou authored Jul 16, 2025
```
* Fix rdma head movement

* Optimize `cached_notify` by using TMA.

* Fix

* Small fix
```
15 Jul, 2025 8 commits
- third-party: Add link to blog post on CPU-Assisted IBGDA. · c5d22023
  Seth Howell authored Jul 15, 2025
```
Signed-off-by: Seth Howell <sethh@nvidia.com>
```
- Correct the wqe_idx in rdma write wqe · 3073a2c6
  Shangyan Zhou authored Jul 15, 2025
- correct the wqe_idx in rdma write wqe · eaa2d0d2
  Guangguan authored Jul 15, 2025
```
correct the wqe_idx in rdma write wqe when num_wqes > 1 in nvshmemi_ibgda_put_nbi_warp.
Signed-off-by: Guangguan <guangguan.wang@linux.alibaba.com>
```
- buffer.py: Do not force the NIC handler to GPU. · e6b4f527
  Seth Howell authored Jul 14, 2025
```
This enables the CPU-Assisted data path.
Signed-off-by: Seth Howell <sethh@nvidia.com>
```
- setup.py: Remove nvcc_dlink specific gencode · 35e1cd1b
  Seth Howell authored Jul 14, 2025
```
Responding to review comments.
Signed-off-by: Seth Howell <sethh@nvidia.com>
```
- setup.py: Add logic for detecting library locations from NVSHMEM wheels. · 2a873392
  Seth Howell authored Jul 14, 2025
```
Signed-off-by: Seth Howell <sethh@nvidia.com>
```
- setup.py: Clean up some extra prints. · 903711c6
  Seth Howell authored Jul 14, 2025
```
Signed-off-by: Seth Howell <sethh@nvidia.com>
```
- third-party: Add back nvshmem.patch. · b79ca2bb
  Seth Howell authored Jul 14, 2025
```
This will give consumers an opportunity to update their builds.
Signed-off-by: Seth Howell <sethh@nvidia.com>
```
14 Jul, 2025 4 commits

- Fix · 079c5a4f
  Shangyan Zhou authored Jul 14, 2025
- Strengthen the barrier in `cached_notify` (#304) · eb155da4
  Shangyan Zhou authored Jul 14, 2025
```
* Strengthen the barrier in `cached_notify`

* lint

* Change the timing method

* lint
```
- Minor fix · ea152b57
  Chenggang Zhao authored Jul 14, 2025

Optimize low latency combine send with TMA (#299) · c874cb7a

Zhean Xu authored Jul 14, 2025



* feat: low latency combine inplace TMA

* optimize tma pointer with PatternVisitor

* Minor cleanup

* Add `elect_one_sync`

---------
Co-authored-by: Zhean Xu <xza@deepseek.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>


12 Jul, 2025 3 commits

third-party: Update readme to reflect new features. · 69f9dfe2
Seth Howell authored Jul 11, 2025
```
Signed-off-by: Seth Howell <sethh@nvidia.com>
```

third-party: Add CPU-assisted IBGDA support · aa3187ef

Seth Howell authored Jul 11, 2025



This allows users to use NVSHMEM without setting the driver regkey.
Signed-off-by: Seth Howell <sethh@nvidia.com>


third-party: Update tests to use upstream NVSHMEM · 441833d3

Seth Howell authored Jul 11, 2025



NVSHMEM 3.3 and above support the host-side features
in the patch.

Note: Removed recv queue support
Signed-off-by: Seth Howell <sethh@nvidia.com>


11 Jul, 2025 1 commit
- ibgda: support non-bond dual-port environments · 898269fa
  Shangyan Zhou authored Jul 11, 2025
```
ibgda: support non-bond dual-port environments via multi-port config
```