Commits · d72817eb05c4e7f425972a2c364b11c13da83907 · OpenDAS / DeepEP

21 Jul, 2025 3 commits

fix hang due to small rdma_chunk_size (#317) · d72817eb
Zhiyi Hu authored Jul 21, 2025
```
Co-authored-by: zhiyi Hu <zhiyihu@U-NYQQMGK0-2250.local>
```
d72817eb

Minor patches for deepep (#318) · 5b549c85

Guangguan Wang authored Jul 21, 2025



* Add arg --pressure-test for test_low_latency.py

Add arg --pressure-test for test_low_latency.py
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>

* Export NVSHMEM_QP_DEPTH

Export NVSHMEM_QP_DEPTH
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>

---------
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>

5b549c85

Use TMA to optimize combine forwarder. (#320) · f9c06bb0

Shangyan Zhou authored Jul 21, 2025



* Remove an outdated todo

* Increase the number of combine forward warps.

* forwarder use TMA.

* Small fix

* Code lint

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

f9c06bb0

18 Jul, 2025 1 commit
- Fix for data error and kernel hung because of inflight rdma channel head update (#310) · e6012370
  Shangyan Zhou authored Jul 18, 2025
  e6012370
17 Jul, 2025 1 commit

Fix for data error and kernel hung because of inflight rdma channel head update · b65b22ed

Guangguan authored Jul 15, 2025



During dispatch/combine, neither the sender nor the receiver waits
for the rdma channel head update to finish, which may
leave head update wqes in flight even after
the kernel has finished. If those inflight wqes arrive after the
rdma channel head buffer has been cleaned for the next round of
dispatch/combine, the rdma channel head buffer is rewritten
to a non-zero value. Because of the wrong rdma channel head,
the rdma sender can then reuse the data buffer before the rdma
receivers have consumed it, causing data errors and
kernel hangs.
For performance reasons (to overlap the inflight wqes' RTT),
this issue is fixed by waiting for all previous inflight wqes to
complete before cleaning the rdma buffers in the next round of
dispatch/combine.
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>

b65b22ed
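
The race described in the commit message above can be illustrated with a toy model. This is only a sketch in plain Python, not DeepEP or NVSHMEM code: the `Channel` class, its `quiet()` method, and the head value `8` are hypothetical stand-ins for the rdma channel head buffer, a wait on outstanding wqes, and a late head update.

```python
from collections import deque


class Channel:
    """Toy model of one rdma channel head slot with delayed remote writes."""

    def __init__(self):
        self.head = 0
        # Head updates that were posted but have not been delivered yet.
        self.inflight = deque()

    def post_head_update(self, value):
        # Posting returns immediately; delivery happens later.
        self.inflight.append(value)

    def quiet(self):
        # Hypothetical stand-in for waiting until all posted wqes complete.
        while self.inflight:
            self.head = self.inflight.popleft()

    def clean(self):
        # Buffer cleaning at the start of the next dispatch/combine round.
        self.head = 0


def head_after_next_round_clean(wait_before_clean):
    ch = Channel()
    # Round 1 ends while a head update is still in flight.
    ch.post_head_update(8)
    # Round 2 starts: optionally drain inflight updates, then clean.
    if wait_before_clean:
        ch.quiet()
    ch.clean()
    # Any straggler from round 1 lands now, after the cleaning.
    ch.quiet()
    return ch.head


print("without wait:", head_after_next_round_clean(False))
print("with wait:   ", head_after_next_round_clean(True))
```

Without the wait, the late update overwrites the cleaned slot with a stale non-zero head; with the wait placed just before cleaning, the slot stays zero while the round-trip time of the inflight updates still overlaps with the end of the previous round.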

16 Jul, 2025 3 commits
- Update README · 0eee87b8
  Shangyan Zhou authored Jul 16, 2025
  
  0eee87b8
- third-party: Improvements to NVSHMEM Integration · b6ce310b
  Shangyan Zhou authored Jul 16, 2025
  b6ce310b
- Optimize `cached_notify` by TMA (#306) · 146b013d
  Shangyan Zhou authored Jul 16, 2025
```
* Fix rdma head movement

* Optimize `cached_notify` by using TMA.

* Fix

* Small fix
```
  146b013d
15 Jul, 2025 8 commits
- third-party: Add link to blog post on CPU-Assisted IBGDA. · c5d22023
  Seth Howell authored Jul 15, 2025
```
Signed-off-by: Seth Howell <sethh@nvidia.com>
```
  c5d22023
- Correct the wqe_idx in rdma write wqe · 3073a2c6
  Shangyan Zhou authored Jul 15, 2025
  3073a2c6
- correct the wqe_idx in rdma write wqe · eaa2d0d2
  Guangguan authored Jul 15, 2025
```
correct the wqe_idx in rdma write wqe when num_wqes > 1 in nvshmemi_ibgda_put_nbi_warp.
Signed-off-by: Guangguan <guangguan.wang@linux.alibaba.com>
```
  eaa2d0d2
- buffer.py: Do not force the NIC handler to GPU. · e6b4f527
  Seth Howell authored Jul 14, 2025
```
This enables the CPU-Assisted data path.
Signed-off-by: Seth Howell <sethh@nvidia.com>
```
  e6b4f527
- setup.py: Remove nvcc_dlink specific gencode · 35e1cd1b
  Seth Howell authored Jul 14, 2025
```
Responding to review comments.
Signed-off-by: Seth Howell <sethh@nvidia.com>
```
  35e1cd1b
- setup.py: Add logic for detecting library locations from NVSHMEM wheels. · 2a873392
  Seth Howell authored Jul 14, 2025
```
Signed-off-by: Seth Howell <sethh@nvidia.com>
```
  2a873392
- setup.py: Clean up some extra prints. · 903711c6
  Seth Howell authored Jul 14, 2025
```
Signed-off-by: Seth Howell <sethh@nvidia.com>
```
  903711c6
- third-party: Add back nvshmem.patch. · b79ca2bb
  Seth Howell authored Jul 14, 2025
```
This will give consumers an opportunity to update their builds.
Signed-off-by: Seth Howell <sethh@nvidia.com>
```
  b79ca2bb
14 Jul, 2025 4 commits

Fix · 079c5a4f
Shangyan Zhou authored Jul 14, 2025

079c5a4f
Strengthen the barrier in `cached_notify` (#304) · eb155da4
Shangyan Zhou authored Jul 14, 2025
```
* Strengthen the barrier in `cached_notify`

* lint

* Change the timing method

* lint
```
eb155da4
Minor fix · ea152b57
Chenggang Zhao authored Jul 14, 2025

ea152b57

Optimize low latency combine send with TMA (#299) · c874cb7a

Zhean Xu authored Jul 14, 2025



* feat: low latency combine inplace TMA

* optimize tma pointer with PatternVisitor

* Minor cleanup

* Add `elect_one_sync`

---------
Co-authored-by: Zhean Xu <xza@deepseek.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

c874cb7a

12 Jul, 2025 3 commits

third-party: Update readme to reflect new features. · 69f9dfe2
Seth Howell authored Jul 11, 2025
```
Signed-off-by: Seth Howell <sethh@nvidia.com>
```
69f9dfe2

third-party: Add CPU-assisted IBGDA support · aa3187ef

Seth Howell authored Jul 11, 2025



This allows users to use NVSHMEM without setting the driver regkey.
Signed-off-by: Seth Howell <sethh@nvidia.com>

aa3187ef

third-party: Update tests to use upstream NVSHMEM · 441833d3

Seth Howell authored Jul 11, 2025



NVSHMEM 3.3 and above support the host-side features
in the patch.

Note: Removed recv queue support
Signed-off-by: Seth Howell <sethh@nvidia.com>

441833d3

11 Jul, 2025 4 commits
- ibgda: support non-bond dual-port environments · 898269fa
  Shangyan Zhou authored Jul 11, 2025
```
ibgda: support non-bond dual-port environments via multi-port config
```
  898269fa
- Increase MAX_NUM_HCAS from 16 to 32 to support more NICs in NVSHMEM · 1cd5eea6
  root authored Jul 09, 2025
```
fix format
```
  1cd5eea6
- Update NVSHMEM README · b0f13ef7
  Shangyan Zhou authored Jul 11, 2025
  
  b0f13ef7
- Explicitly destroy the C++ runtime and release resources. (#292) · 0c984e25
  Shangyan Zhou authored Jul 11, 2025
```
* Explicitly destroy the C++ runtime and release resources.

* Small fix

* fix typo

* Add a flag to control whether explicit `destroy()` is required.
```
  0c984e25
10 Jul, 2025 2 commits

Use TMA to optimize internode combine. (#287) · 06f417dc

Shangyan Zhou authored Jul 10, 2025



* Let forwarders use a dedicated SM

* Shuffle rdma idx

* Sender use TMA.

* Adjust the tuning chunk size.

* Modify NVL chunk layout.

* Update some combine config.

* Small lint

* Minor fix

* Overlap TMA store

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

06f417dc

Support 10-bit LogFMT (simulated version) (#284) · 1cf85fb2

Chenggang Zhao authored Jul 10, 2025



* Add LogFMT interface

* Update comments

* Add simulated code

* Fix comments

* Change to 128 channels

* Add notes

* Optimize performance

* optimize simulate logfmt 10bit

* Minor fix

* Stronger low latency tests

* Minor fix

* Stronger low latency tests for logfmt

* Optimize logfmt simulate: lg2/ex2 ptx, step_inv

* Minor fix

* Minor fix

* Add non-logfmt bench

* Fix value=0 for logfmt

* Optimize performance

* Refactor tests

---------
Co-authored-by: Zhean Xu <xza@deepseek.com>

1cf85fb2

09 Jul, 2025 1 commit
- add DeepEP_multi_port_nobond ibgda support · 3571a927
  liuhe authored Jul 09, 2025
  
  3571a927
08 Jul, 2025 1 commit
- Better debugging messages · c50f3d6f
  Chenggang Zhao authored Jul 08, 2025
  
  c50f3d6f
07 Jul, 2025 1 commit
- feat: support cluster size 2 (#283) · eef7ab50
  Zhean Xu authored Jul 07, 2025
```
Co-authored-by: Zhean Xu <xza@deepseek.com>
```
  eef7ab50
04 Jul, 2025 2 commits

Update some dispatch configs. · e6d61fc6
Shangyan Zhou authored Jul 04, 2025

e6d61fc6

Use TMA to optimize internode dispatch. (#276) · a2fa3b73

Shangyan Zhou authored Jul 04, 2025



* Add TMA buffer allocation

* Use TMA for forwarders and NVL receivers

* Use lane 31 to operate TMA.

* Change rdma buffer layout.

* Use TMA to transfer scales also.

* Increase the NVL recv buffer size.

* Disable early stopping.

* Apply similar optimizations on receiver warps.

* Prevent warp divergence.

* Disable aggressive ptx by default.

* Revert using TMA to transfer scales.

* Format.

* Change the layout of dispatch NVL buffer.

* Move topk transformation to recv warps.

* Use TMA to transfer all data in forward warps

* Use TMA to store scales.

* Code lint

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

a2fa3b73

02 Jul, 2025 6 commits
- Refactor testing arguments · 7705f533
  Chenggang Zhao authored Jul 02, 2025
  
  7705f533
- Use CLI args instead of envs (#273) · 6b17f4fa
  youkaichao authored Jul 02, 2025
```
* use cli arg for num_processes
Signed-off-by: youkaichao <youkaichao@gmail.com>

* update low-latency
Signed-off-by: youkaichao <youkaichao@gmail.com>

* update intranode
Signed-off-by: youkaichao <youkaichao@gmail.com>

* update internode
Signed-off-by: youkaichao <youkaichao@gmail.com>

---------
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
  6b17f4fa
- Simplify · 341bb961
  Chenggang Zhao authored Jul 02, 2025
  
  341bb961
- Renaming · 63f79469
  Chenggang Zhao authored Jul 02, 2025
  
  63f79469
- Refactor the bench function · d79b3cd1
  Chenggang Zhao authored Jul 02, 2025
  
  d79b3cd1
- Support displaying separate send and recv time (#239) · 85793dda
  fzyzcjy authored Jul 02, 2025
```
* more

* more

* more

* more

* more

* more
```
  85793dda
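
The change in "Use CLI args instead of envs (#273)" above replaces environment variables with command-line arguments in the test scripts. A minimal sketch of that pattern follows; the flag name `--num-processes` and its default of `8` are assumptions based on the commit's mention of `num_processes`, not a copy of the repository's code.

```python
import argparse


def parse_args(argv=None):
    """Parse test options from the command line instead of os.environ."""
    parser = argparse.ArgumentParser(description="Hypothetical test launcher")
    # Assumed flag name and default; the real scripts may differ.
    parser.add_argument(
        "--num-processes",
        type=int,
        default=8,
        help="Number of processes to spawn",
    )
    return parser.parse_args(argv)


# Passing an explicit list keeps the example independent of sys.argv.
args = parse_args(["--num-processes", "4"])
print(args.num_processes)
```

An explicit flag shows up in `--help` output and is validated by `argparse`, whereas a mistyped environment variable is silently ignored.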