Commits · ab484794b60c1f16812ee8363fbb89a7cb849823 · OpenDAS / DeepEP

25 Aug, 2025 3 commits
- Update internode_ll.cu (#374) · ab484794
  fzyzcjy authored Aug 25, 2025
  
  ab484794
- fix potential hang in pollcq (#371) · e7044855
  Thunderbrook authored Aug 25, 2025
  
  e7044855
- style: remove trailing whitespace (#373) · 91bb69a8
  sky authored Aug 25, 2025
```
Signed-off-by: wangfakang <fakangwang@gmail.com>
```
  91bb69a8
11 Aug, 2025 1 commit
- Add EP144/160 · d9767ce0
  Chenggang Zhao authored Aug 11, 2025
  
  d9767ce0
07 Aug, 2025 3 commits

Fix indent · d31c72a1
Chenggang Zhao authored Aug 07, 2025

d31c72a1
Fix compilation · dd14b36d
Chenggang Zhao authored Aug 07, 2025

dd14b36d

Support 10-bit LogFMT Combine (#345) · c5facf5c

Zhean Xu authored Aug 07, 2025



* independent logfmt_simulate function

* draft: logfmt low latency combine

* Minor bug fixes

* Fix non-logfmt bugs

* Fix logfmt bugs

* Fix logfmt bugs

* Minor fix

* Minor fix

* Clean code

* Clean code

* Use fewer regs

* Use two warp groups

* Correct shared memory size

* Minor fix

* Minor fix

* More rigorous tests

* Clean code

* Use more SMs

* Use different unroll factor for send & recv

* Update csrc/kernels/internode_ll.cu
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update csrc/kernels/internode_ll.cu
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Some renaming

* Some comments of tests

* Format `logfmt_encode`

* More lints

* Some refactors on sends

* Fix testing

* Fix bugs

* Renaming

* Use the full warp

* Unify combine recv

* Lint

* Lint

* Support 2560

* Fix meta buffer dtype

* Better encode calls

* Better amin/max writes

* Extra sync

* Read `topk_idx` by once

* Better specialization

* Read weights by once

* Rename

* Bug fixed

* Some renaming

* Fix local memory usage for sending

* Fix local memory usage for receiving

* Less writes

* Optimize performance

* Optimize performance

* Better performance

* Optimize performance

* Fix rounding

* Manually unroll

* Fix bench

---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

c5facf5c

31 Jul, 2025 2 commits
- Code lint · c6faca45
  Chenggang Zhao authored Jul 31, 2025
  
  c6faca45
- Fix SM80 compilation · be8053d6
  Chenggang Zhao authored Jul 31, 2025
  
  be8053d6
30 Jul, 2025 1 commit

Add diagnosis module for efficient and precise location of slow rank (#311) · 4b67064d

sky authored Jul 30, 2025



* Add diagnosis module for precise identification of slow ranks
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Add tests for the slow diagnosis module
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Update some comments for diagnose
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Update test case for diagnose
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Strip the diagnose module, thx LyricZhao and sphish.
Signed-off-by: wangfakang <fakangwang@gmail.com>

* update variable name and cumulative wait recv cost, thx sphish.
Signed-off-by: wangfakang <fakangwang@gmail.com>

* remove invalid comments.
Signed-off-by: wangfakang <fakangwang@gmail.com>

---------
Signed-off-by: wangfakang <fakangwang@gmail.com>

4b67064d

29 Jul, 2025 2 commits

Add hidden_size 6144 (#329) · b92d0d48

Jee Jee Li authored Jul 29, 2025



* Done
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* Add comment
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

---------
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

b92d0d48

Fix the address of dispatch_rdma_recv_count_buffer to avoid cleaning after... · 42253d08

Void authored Jul 29, 2025


Fix the address of dispatch_rdma_recv_count_buffer to avoid cleaning after each change in hidden_size/token_num. (#313)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>

42253d08

21 Jul, 2025 2 commits

fix hang due to small rdma_chunk_size (#317) · d72817eb
Zhiyi Hu authored Jul 21, 2025
```
Co-authored-by: zhiyi Hu <zhiyihu@U-NYQQMGK0-2250.local>
```
d72817eb

Use TMA to optimize combine forwarder. (#320) · f9c06bb0

Shangyan Zhou authored Jul 21, 2025



* Remove an outdated todo

* Increase the number of combine forward warps.

* forwarder use TMA.

* Small fix

* Code lint

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

f9c06bb0

17 Jul, 2025 1 commit

Fix data errors and kernel hangs caused by in-flight RDMA channel head updates · b65b22ed

Guangguan authored Jul 15, 2025



During dispatch/combine, neither the sender nor the receiver waits
for the RDMA channel head update to finish, so head-update WQEs may
still be in flight after the kernel has finished. If those in-flight
WQEs arrive after the RDMA channel head buffer has been cleaned for
the next round of dispatch/combine, the head buffer is overwritten
with a non-zero value. Because of the wrong RDMA channel head, the
RDMA sender can then reuse the data buffer before the RDMA receivers
have consumed it, causing data errors and kernel hangs.
For performance, so that the RTT of the in-flight WQEs is still
overlapped, this issue is fixed by waiting for all previous in-flight
WQEs to complete before cleaning the RDMA buffers in the next round
of dispatch/combine.
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>

b65b22ed
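
The race described in this commit message can be illustrated with a small host-side toy model. This is only a sketch under stated assumptions: `RdmaChannel`, `post_head_update`, `drain`, and `next_round_head` are hypothetical names invented for illustration, and the real fix lives in the CUDA kernels, not in Python. The model shows a head update that is still in flight when the buffer is cleaned, and how draining in-flight writes before cleaning avoids the stale overwrite.

```python
from collections import deque


class RdmaChannel:
    """Toy model (hypothetical): a remote head slot plus in-flight head writes."""

    def __init__(self):
        self.head = 0            # stands in for the RDMA channel head buffer
        self.inflight = deque()  # posted head updates that have not landed yet

    def post_head_update(self, value):
        # Non-blocking post: the write lands later, not now.
        self.inflight.append(value)

    def deliver_one(self):
        # The network delivers the oldest in-flight write, if any.
        if self.inflight:
            self.head = self.inflight.popleft()

    def drain(self):
        # Wait until every previously posted write has completed.
        while self.inflight:
            self.deliver_one()

    def clean(self):
        # Buffer cleaning at the start of the next dispatch/combine round.
        self.head = 0


def next_round_head(wait_before_clean):
    """Return the head value seen by round 2 after round 1 left a write in flight."""
    ch = RdmaChannel()
    ch.post_head_update(8)   # round 1 ends with an update still in flight
    if wait_before_clean:
        ch.drain()           # the fix: complete in-flight writes first
    ch.clean()               # round 2 starts by zeroing the head
    ch.deliver_one()         # a late write (if any) arrives after cleaning
    return ch.head
```

Without the wait, `next_round_head(False)` ends with a non-zero head left over from the previous round; with the wait, `next_round_head(True)` starts the next round from a clean head.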

16 Jul, 2025 1 commit
- Optimize `cached_notify` by TMA (#306) · 146b013d
  Shangyan Zhou authored Jul 16, 2025
```
* Fix rdma head movement

* Optimize `cached_notify` by using TMA.

* Fix

* Small fix
```
  146b013d
15 Jul, 2025 1 commit

correct the wqe_idx in rdma write wqe · eaa2d0d2

Guangguan authored Jul 15, 2025



correct the wqe_idx in rdma write wqe when num_wqes > 1 in nvshmemi_ibgda_put_nbi_warp.
Signed-off-by: Guangguan <guangguan.wang@linux.alibaba.com>

eaa2d0d2
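
As a rough illustration of why the index matters when `num_wqes > 1`, the sketch below assumes that each WQE in a batch must carry its own index, offset from a shared base, rather than all of them reusing the base value. That reading of the fix is an assumption, and `build_wqe_indices` is a hypothetical helper, not a function from NVSHMEM or DeepEP.

```python
def build_wqe_indices(base_wqe_idx, num_wqes, per_wqe_index=True):
    """Return the index stamped into each WQE of one batch (hypothetical helper).

    per_wqe_index=True  -> WQE i carries base_wqe_idx + i (assumed correct form)
    per_wqe_index=False -> every WQE reuses base_wqe_idx (assumed buggy form)
    """
    if per_wqe_index:
        return [base_wqe_idx + i for i in range(num_wqes)]
    return [base_wqe_idx] * num_wqes
```

With a single WQE both variants agree, which would explain why such a bug only shows up for `num_wqes > 1`.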

14 Jul, 2025 4 commits

Fix · 079c5a4f
Shangyan Zhou authored Jul 14, 2025

079c5a4f
Strengthen the barrier in `cached_notify` (#304) · eb155da4
Shangyan Zhou authored Jul 14, 2025
```
* Strengthen the barrier in `cached_notify`

* lint

* Change the timing method

* lint
```
eb155da4
Minor fix · ea152b57
Chenggang Zhao authored Jul 14, 2025

ea152b57

Optimize low latency combine send with TMA (#299) · c874cb7a

Zhean Xu authored Jul 14, 2025



* feat: low latency combine inplace TMA

* optimize tma pointer with PatternVisitor

* Minor cleanup

* Add `elect_one_sync`

---------
Co-authored-by: Zhean Xu <xza@deepseek.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

c874cb7a

12 Jul, 2025 2 commits

third-party: Add CPU-assisted IBGDA support · aa3187ef

Seth Howell authored Jul 11, 2025



This allows users to use NVSHMEM without setting the driver regkey.
Signed-off-by: Seth Howell <sethh@nvidia.com>

aa3187ef

third-party: Update tests to use upstream NVSHMEM · 441833d3

Seth Howell authored Jul 11, 2025



NVSHMEM 3.3 and above support the host-side features
in the patch.

Note: Removed recv queue support
Signed-off-by: Seth Howell <sethh@nvidia.com>

441833d3

11 Jul, 2025 2 commits
- Increase MAX_NUM_HCAS from 16 to 32 to support more NICs in NVSHMEM · 1cd5eea6
  root authored Jul 09, 2025
```
fix format
```
  1cd5eea6
- Explicitly destroy the C++ runtime and release resources. (#292) · 0c984e25
  Shangyan Zhou authored Jul 11, 2025
```
* Explicitly destroy the C++ runtime and release resources.

* Small fix

* fix typo

* Add a flag to control whether explicit `destroy()` is required.
```
  0c984e25
10 Jul, 2025 2 commits

Use TMA to optimize internode combine. (#287) · 06f417dc

Shangyan Zhou authored Jul 10, 2025



* Let forwarders use a dedicated SM

* Shuffle rdma idx

* Sender use TMA.

* Adjust the tuning chunk size.

* Modify NVL chunk layout.

* Update some combine config.

* Small lint

* Minor fix

* Overlap TMA store

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

06f417dc

Support 10-bit LogFMT (simulated version) (#284) · 1cf85fb2

Chenggang Zhao authored Jul 10, 2025



* Add LogFMT interface

* Update comments

* Add simulated code

* Fix comments

* Change to 128 channels

* Add notes

* Optimize performance

* optimize simulate logfmt 10bit

* Minor fix

* Stronger low latency tests

* Minor fix

* Stronger low latency tests for logfmt

* Optimize logfmt simulate: lg2/ex2 ptx, step_inv

* Minor fix

* Minor fix

* Add non-logfmt bench

* Fix value=0 for logfmt

* Optimize performance

* Refactor tests

---------
Co-authored-by: Zhean Xu <xza@deepseek.com>

1cf85fb2
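
For readers unfamiliar with log-domain formats, the following is a generic sketch of a 10-bit logarithmic quantizer: one sign bit plus a 9-bit code that indexes evenly spaced steps in log2 space, with code 0 reserved for exact zeros. It is not taken from this commit; the bit layout, the per-block min/max scaling, and the names `logfmt_encode` and `logfmt_decode` as written here are assumptions made for illustration only, and the CUDA implementation in the repository may differ.

```python
import math


def logfmt_encode(values, bits=10):
    """Quantize floats to sign + log-domain codes (illustrative layout only)."""
    levels = (1 << (bits - 1)) - 1          # non-zero codes 1..levels; 0 means zero
    mags = [abs(v) for v in values if v != 0.0]
    if not mags:
        return [0] * len(values), 0.0, 0.0
    log_min = math.log2(min(mags))
    log_max = math.log2(max(mags))
    # Step between adjacent codes in log2 space; its inverse is used when encoding.
    step = (log_max - log_min) / (levels - 1) if log_max > log_min else 0.0
    step_inv = 1.0 / step if step > 0.0 else 0.0
    codes = []
    for v in values:
        if v == 0.0:
            codes.append(0)                  # zero has no logarithm, so reserve a code
            continue
        q = 1 + round((math.log2(abs(v)) - log_min) * step_inv)
        sign = 1 if v < 0.0 else 0
        codes.append((sign << (bits - 1)) | q)
    return codes, log_min, step


def logfmt_decode(codes, log_min, step, bits=10):
    """Invert logfmt_encode up to quantization error."""
    mask = (1 << (bits - 1)) - 1
    out = []
    for c in codes:
        q = c & mask
        if q == 0:
            out.append(0.0)
            continue
        mag = 2.0 ** (log_min + (q - 1) * step)
        out.append(-mag if (c >> (bits - 1)) else mag)
    return out
```

Because the steps are uniform in log2 space, the relative error is roughly constant across magnitudes, which is the usual motivation for this family of formats.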

09 Jul, 2025 1 commit
- add DeepEP_multi_port_nobond ibgda support · 3571a927
  liuhe authored Jul 09, 2025
  
  3571a927
08 Jul, 2025 1 commit
- Better debugging messages · c50f3d6f
  Chenggang Zhao authored Jul 08, 2025
  
  c50f3d6f
07 Jul, 2025 1 commit
- feat: support cluster size 2 (#283) · eef7ab50
  Zhean Xu authored Jul 07, 2025
```
Co-authored-by: Zhean Xu <xza@deepseek.com>
```
  eef7ab50
04 Jul, 2025 1 commit

Use TMA to optimize internode dispatch. (#276) · a2fa3b73

Shangyan Zhou authored Jul 04, 2025



* Add TMA buffer allocation

* Use TMA for forwarders and NVL receivers

* Use lane 31 to operate TMA.

* Change rdma buffer layout.

* Use TMA to transfer scales also.

* Increase the NVL recv buffer size.

* Disable early stopping.

* Apply similar optimizations on receiver warps.

* Prevent warp divergence.

* Disable aggressive ptx by default.

* Revert using TMA to transfer scales.

* Format.

* Change the layout of dispatch NVL buffer.

* Move topk transformation to recv warps.

* Use TMA to transfer all data in forward warps

* Use TMA to store scales.

* Code lint

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

a2fa3b73

02 Jul, 2025 5 commits
- Remove launch bound for layout kernels · 77ddb015
  Chenggang Zhao authored Jul 02, 2025
  
  77ddb015
- Improve layout kernel performance · d4f34978
  Chenggang Zhao authored Jul 02, 2025
  
  d4f34978
- Code cleanup · 0a47402f
  Chenggang Zhao authored Jul 02, 2025
  
  0a47402f
- support hidden size 8192 (#264) · b6516358
  ruizhang1230 authored Jul 02, 2025
```
* support hidden size 8192

* refactor code

* fix assert
```
  b6516358
- remove redundant variable num_scales (#265) · 486dd1d9
  Zhiyi Hu authored Jul 02, 2025
```
Co-authored-by: zhiyi Hu <zhiyihu@U-NYQQMGK0-2250.local>
```
  486dd1d9
27 Jun, 2025 4 commits
- opt code · d24bbeba
  alpha-baby authored Jun 27, 2025
  
  d24bbeba
- enhance warp copy · b90320e2
  alpha-baby authored Jun 27, 2025
  
  b90320e2
- Stricter conditions for aggressive PTX instructions · 004d6f9b
  Chenggang Zhao authored Jun 27, 2025
  
  004d6f9b
- Remove memory fence in NVLink barrier. (#253) · 7de7464e
  Shangyan Zhou authored Jun 27, 2025
```
* Remove memory fence in NVLink barrier.

* Move `__syncthread` and fence into barrier.

* Fix bugs

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
```
  7de7464e