Commits · 24324cc5408aa7f70dbc28880ec4b68690cb18e7 · OpenDAS / DeepEP

10 Sep, 2025 2 commits
- Fix combine tma mbarrier · 24324cc5
  Shangyan Zhou authored Sep 10, 2025
  
  24324cc5
- Fix tma mbarrier (#399) · 0970c4c8
  Shangyan Zhou authored Sep 10, 2025
```
* Fix mbarrier

* Remove redundant store
```
  0970c4c8
01 Sep, 2025 3 commits
- Update intranode.cu (#381) · c18eabde
  Chongchong Tian authored Sep 01, 2025
```
Each thread is responsible for one target rank
```
  c18eabde
- Speed up dispatch send by refining loop unrolling (#385) · 6a5d323c
  fzyzcjy authored Sep 01, 2025
```
* nits

* hack unrolled warp copy

* Revert "nits"

This reverts commit 3e1b28d9b17f2c1cc46403d432ca576dbf15bd45.
```
  6a5d323c
- Update internode_ll.cu (#376) · c78b9ed7
  fzyzcjy authored Sep 01, 2025
  
  c78b9ed7
28 Aug, 2025 1 commit

Fix: avoid floating point exception (#379) · b7fef496

sky authored Aug 28, 2025



* Fix: avoid floating point exception.
Signed-off-by: wangfakang <fakangwang@gmail.com>

* simplify the code.
Signed-off-by: wangfakang <fakangwang@gmail.com>

---------
Signed-off-by: wangfakang <fakangwang@gmail.com>

b7fef496

26 Aug, 2025 1 commit

fix combine timeout due to delayed forwarder min head update (#353) · 1da73be0

Zhiyi Hu authored Aug 26, 2025



* fix combine timeout due to forwarder min head update

* Update head before and after combine_token; add assertion for nvl_buffer_size_per_rdma_rank

---------
Co-authored-by: zhiyi Hu <zhiyihu@U-NYQQMGK0-2250.local>

1da73be0

25 Aug, 2025 3 commits
- Update internode_ll.cu (#374) · ab484794
  fzyzcjy authored Aug 25, 2025
  
  ab484794
- fix potential hang in pollcq (#371) · e7044855
  Thunderbrook authored Aug 25, 2025
  
  e7044855
- style: remove trailing whitespace (#373) · 91bb69a8
  sky authored Aug 25, 2025
```
Signed-off-by: wangfakang <fakangwang@gmail.com>
```
  91bb69a8
11 Aug, 2025 1 commit
- Add EP144/160 · d9767ce0
  Chenggang Zhao authored Aug 11, 2025
  
  d9767ce0
07 Aug, 2025 3 commits

- Fix indent · d31c72a1
  Chenggang Zhao authored Aug 07, 2025
  
  d31c72a1
- Fix compilation · dd14b36d
  Chenggang Zhao authored Aug 07, 2025
  
  dd14b36d

Support 10-bit LogFMT Combine (#345) · c5facf5c

Zhean Xu authored Aug 07, 2025



* independent logfmt_simulate function

* draft: logfmt low latency combine

* Minor bug fixes

* Fix non-logfmt bugs

* Fix logfmt bugs

* Fix logfmt bugs

* Minor fix

* Minor fix

* Clean code

* Clean code

* Use fewer regs

* Use two warp groups

* Correct shared memory size

* Minor fix

* Minor fix

* More rigorous tests

* Clean code

* Use more SMs

* Use different unroll factor for send & recv

* Update csrc/kernels/internode_ll.cu
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update csrc/kernels/internode_ll.cu
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Some renaming

* Some comments of tests

* Format `logfmt_encode`

* More lints

* Some refactors on sends

* Fix testing

* Fix bugs

* Renaming

* Use the full warp

* Unify combine recv

* Lint

* Lint

* Support 2560

* Fix meta buffer dtype

* Better encode calls

* Better amin/max writes

* Extra sync

* Read `topk_idx` only once

* Better specialization

* Read weights only once

* Rename

* Bug fixed

* Some renaming

* Fix local memory usage for sending

* Fix local memory usage for receiving

* Less writes

* Optimize performance

* Optimize performance

* Better performance

* Optimize performance

* Fix rounding

* Manually unroll

* Fix bench

---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

c5facf5c

31 Jul, 2025 2 commits
- Code lint · c6faca45
  Chenggang Zhao authored Jul 31, 2025
  
  c6faca45
- Fix SM80 compilation · be8053d6
  Chenggang Zhao authored Jul 31, 2025
  
  be8053d6
30 Jul, 2025 1 commit

Add diagnosis module for efficient and precise location of slow rank (#311) · 4b67064d

sky authored Jul 30, 2025



* Add diagnosis module for precise identification of slow ranks
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Add tests for the slow diagnosis module
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Update some comments for diagnose
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Update test case for diagnose
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Strip the diagnose module, thx LyricZhao and sphish.
Signed-off-by: wangfakang <fakangwang@gmail.com>

* update variable name and cumulative wait recv cost, thx sphish.
Signed-off-by: wangfakang <fakangwang@gmail.com>

* remove invalid comments.
Signed-off-by: wangfakang <fakangwang@gmail.com>

---------
Signed-off-by: wangfakang <fakangwang@gmail.com>

4b67064d

29 Jul, 2025 1 commit

Add hidden_size 6144 (#329) · b92d0d48

Jee Jee Li authored Jul 29, 2025



* Done
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* Add comment
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

---------
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

b92d0d48

21 Jul, 2025 2 commits

- fix hang due to small rdma_chunk_size (#317) · d72817eb
  Zhiyi Hu authored Jul 21, 2025
```
Co-authored-by: zhiyi Hu <zhiyihu@U-NYQQMGK0-2250.local>
```
  d72817eb

Use TMA to optimize combine forwarder. (#320) · f9c06bb0

Shangyan Zhou authored Jul 21, 2025



* Remove an outdated todo

* Increase the number of combine forward warps.

* forwarder use TMA.

* Small fix

* Code lint

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

f9c06bb0

17 Jul, 2025 1 commit

Fix data error and kernel hang caused by in-flight rdma channel head update · b65b22ed

Guangguan authored Jul 15, 2025



During dispatch/combine, neither the sender nor the receiver waits
for the rdma channel head update to finish, so head update wqes may
still be in flight after the kernel has finished. If those in-flight
wqes arrive after the rdma channel head buffer has been cleaned for
the next round of dispatch/combine, the rdma channel head buffer is
rewritten to a non-zero value. With a wrong rdma channel head, the
rdma sender can reuse the data buffer before the rdma receivers have
consumed it, causing data errors and kernel hangs.
For performance, so that the in-flight wqes' RTT is overlapped,
fix this issue by waiting for all previous in-flight wqes to
complete before cleaning rdma buffers in the next round of
dispatch/combine.
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>

b65b22ed
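
The race described in b65b22ed can be illustrated with a small host-side toy model. This is only a sketch of the ordering problem, not the DeepEP kernel code: `HeadChannel`, `post_head_update`, `quiet`, `clean`, and `next_round_head` are hypothetical names, and a Python deque stands in for in-flight wqes.

```python
from collections import deque


class HeadChannel:
    """Toy model of one channel head slot plus its in-flight writes (hypothetical)."""

    def __init__(self):
        self.head = 0            # head value visible to the peer
        self.inflight = deque()  # head updates posted but not yet delivered

    def post_head_update(self, value):
        # Posting is asynchronous: the write lands only when delivered.
        self.inflight.append(value)

    def quiet(self):
        # Deliver every previously posted write, oldest first.
        while self.inflight:
            self.head = self.inflight.popleft()

    def clean(self):
        # Zero the head buffer for the next round.
        self.head = 0


def next_round_head(drain_before_clean):
    """Return the head value that the next round starts from."""
    ch = HeadChannel()
    ch.post_head_update(8)   # round N posts a head update, then the kernel exits
    if drain_before_clean:
        ch.quiet()           # the fix: wait for in-flight writes before cleaning
    ch.clean()               # round N+1 cleans the head buffer
    ch.quiet()               # any late write now lands after the clean
    return ch.head


print(next_round_head(drain_before_clean=False))  # stale non-zero head
print(next_round_head(drain_before_clean=True))   # head stays zero
```

Without the drain, the late write overwrites the cleaned head with a stale value; with the drain, the next round starts from zero. Deferring the wait until just before the clean, rather than waiting at the end of the previous kernel, is what lets the round trip of the in-flight writes overlap with other work.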

16 Jul, 2025 1 commit
- Optimize `cached_notify` by TMA (#306) · 146b013d
  Shangyan Zhou authored Jul 16, 2025
```
* Fix rdma head movement

* Optimize `cached_notify` by using TMA.

* Fix

* Small fix
```
  146b013d
15 Jul, 2025 1 commit

correct the wqe_idx in rdma write wqe · eaa2d0d2

Guangguan authored Jul 15, 2025



correct the wqe_idx in rdma write wqe when num_wqes > 1 in nvshmemi_ibgda_put_nbi_warp.
Signed-off-by: Guangguan <guangguan.wang@linux.alibaba.com>

eaa2d0d2

14 Jul, 2025 4 commits

- Fix · 079c5a4f
  Shangyan Zhou authored Jul 14, 2025
  
  079c5a4f
- Strengthen the barrier in `cached_notify` (#304) · eb155da4
  Shangyan Zhou authored Jul 14, 2025
```
* Strengthen the barrier in `cached_notify`

* lint

* Change the timing method

* lint
```
  eb155da4
- Minor fix · ea152b57
  Chenggang Zhao authored Jul 14, 2025
  
  ea152b57

Optimize low latency combine send with TMA (#299) · c874cb7a

Zhean Xu authored Jul 14, 2025



* feat: low latency combine inplace TMA

* optimize tma pointer with PatternVisitor

* Minor cleanup

* Add `elect_one_sync`

---------
Co-authored-by: Zhean Xu <xza@deepseek.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

c874cb7a

12 Jul, 2025 2 commits

third-party: Add CPU-assisted IBGDA support · aa3187ef

Seth Howell authored Jul 11, 2025



This allows users to use NVSHMEM without setting the driver regkey.
Signed-off-by: Seth Howell <sethh@nvidia.com>

aa3187ef

third-party: Update tests to use upstream NVSHMEM · 441833d3

Seth Howell authored Jul 11, 2025



NVSHMEM 3.3 and above support the host-side features
in the patch.

Note: Removed recv queue support
Signed-off-by: Seth Howell <sethh@nvidia.com>

441833d3

11 Jul, 2025 1 commit
- Increase MAX_NUM_HCAS from 16 to 32 to support more NICs in NVSHMEM · 1cd5eea6
  root authored Jul 09, 2025
```
fix format
```
  1cd5eea6
10 Jul, 2025 2 commits

Use TMA to optimize internode combine. (#287) · 06f417dc

Shangyan Zhou authored Jul 10, 2025



* Let forwarders use a dedicated SM

* Shuffle rdma idx

* Sender use TMA.

* Adjust the tuning chunk size.

* Modify NVL chunk layout.

* Update some combine config.

* Small lint

* Minor fix

* Overlap TMA store

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

06f417dc

Support 10-bit LogFMT (simulated version) (#284) · 1cf85fb2

Chenggang Zhao authored Jul 10, 2025



* Add LogFMT interface

* Update comments

* Add simulated code

* Fix comments

* Change to 128 channels

* Add notes

* Optimize performance

* optimize simulate logfmt 10bit

* Minor fix

* Stronger low latency tests

* Minor fix

* Stronger low latency tests for logfmt

* Optimize logfmt simulate: lg2/ex2 ptx, step_inv

* Minor fix

* Minor fix

* Add non-logfmt bench

* Fix value=0 for logfmt

* Optimize performance

* Refactor tests

---------
Co-authored-by: Zhean Xu <xza@deepseek.com>

1cf85fb2
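
For readers unfamiliar with log-domain formats, the following is a generic sketch of a 10-bit logarithmic quantizer. It is not the encoding used in 1cf85fb2: the split into one sign bit and nine magnitude bits, the reserved code 0 for exact zeros, the per-group min/max scaling, and the names `logfmt_encode` / `logfmt_decode` as used here are all assumptions made for illustration.

```python
import math


def logfmt_encode(values, bits=10):
    """Quantize floats to (sign, code) pairs; code 0 is reserved for zero (assumed layout)."""
    levels = (1 << (bits - 1)) - 1            # magnitude codes 1..levels
    logs = [math.log2(abs(v)) for v in values if v != 0.0]
    if not logs:
        return [(0, 0)] * len(values), 0.0, 0.0
    lo, hi = min(logs), max(logs)
    # Uniform step in the log2 domain across the group's dynamic range.
    step = (hi - lo) / (levels - 1) if hi > lo else 0.0
    codes = []
    for v in values:
        if v == 0.0:
            codes.append((0, 0))              # zero has no logarithm
            continue
        idx = round((math.log2(abs(v)) - lo) / step) if step else 0
        codes.append((1 if v < 0 else 0, idx + 1))
    return codes, lo, step


def logfmt_decode(codes, lo, step):
    """Invert logfmt_encode using the group's minimum log and step."""
    out = []
    for sign, code in codes:
        if code == 0:
            out.append(0.0)
        else:
            mag = 2.0 ** (lo + (code - 1) * step)
            out.append(-mag if sign else mag)
    return out


vals = [0.0, 1.0, -2.0, 0.5, 3.0]
codes, lo, step = logfmt_encode(vals)
print(logfmt_decode(codes, lo, step))
```

Because the step is uniform in the log domain, the relative error is roughly constant across magnitudes, and zero needs separate handling because its logarithm is undefined.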

09 Jul, 2025 1 commit
- add DeepEP_multi_port_nobond ibgda support · 3571a927
  liuhe authored Jul 09, 2025
  
  3571a927
08 Jul, 2025 1 commit
- Better debugging messages · c50f3d6f
  Chenggang Zhao authored Jul 08, 2025
  
  c50f3d6f
07 Jul, 2025 1 commit
- feat: support cluster size 2 (#283) · eef7ab50
  Zhean Xu authored Jul 07, 2025
```
Co-authored-by: Zhean Xu <xza@deepseek.com>
```
  eef7ab50
04 Jul, 2025 1 commit

Use TMA to optimize internode dispatch. (#276) · a2fa3b73

Shangyan Zhou authored Jul 04, 2025



* Add TMA buffer allocation

* Use TMA for forwarders and NVL receivers

* Use lane 31 to operate TMA.

* Change rdma buffer layout.

* Use TMA to transfer scales also.

* Increase the NVL recv buffer size.

* Disable early stopping.

* Apply similar optimizations on receiver warps.

* Prevent warp divergence.

* Disable aggressive ptx by default.

* Revert using TMA to transfer scales.

* Format.

* Change the layout of dispatch NVL buffer.

* Move topk transformation to recv warps.

* Use TMA to transfer all data in forward warps

* Use TMA to store scales.

* Code lint

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

a2fa3b73

02 Jul, 2025 4 commits
- Remove launch bound for layout kernels · 77ddb015
  Chenggang Zhao authored Jul 02, 2025
  
  77ddb015
- Improve layout kernel performance · d4f34978
  Chenggang Zhao authored Jul 02, 2025
  
  d4f34978
- Code cleanup · 0a47402f
  Chenggang Zhao authored Jul 02, 2025
  
  0a47402f
- support hidden size 8192 (#264) · b6516358
  ruizhang1230 authored Jul 02, 2025
```
* support hidden size 8192

* refactor code

* fix assert
```
  b6516358