Commits · 0eee87b8ca19e255b70b5631b8cc42ca1b88c8d7 · OpenDAS / DeepEP

16 Jul, 2025 3 commits
- Update README · 0eee87b8
  Shangyan Zhou authored Jul 16, 2025
  
  0eee87b8
- third-party: Improvements to NVSHMEM Integration · b6ce310b
  Shangyan Zhou authored Jul 16, 2025
  b6ce310b
- Optimize `cached_notify` by TMA (#306) · 146b013d
  Shangyan Zhou authored Jul 16, 2025
```
* Fix rdma head movement

* Optimize `cached_notify` by using TMA.

* Fix

* Small fix
```
  146b013d
15 Jul, 2025 8 commits
- third-party: Add link to blog post on CPU-Assisted IBGDA. · c5d22023
  Seth Howell authored Jul 15, 2025
```
Signed-off-by: Seth Howell <sethh@nvidia.com>
```
  c5d22023
- Correct the wqe_idx in rdma write wqe · 3073a2c6
  Shangyan Zhou authored Jul 15, 2025
```
correct the wqe_idx in rdma write wqe
```
  3073a2c6
- correct the wqe_idx in rdma write wqe · eaa2d0d2
  Guangguan authored Jul 15, 2025
```
correct the wqe_idx in rdma write wqe when num_wqes > 1 in nvshmemi_ibgda_put_nbi_warp.
Signed-off-by: Guangguan <guangguan.wang@linux.alibaba.com>
```
  eaa2d0d2
- buffer.py: Do not force the NIC handler to GPU. · e6b4f527
  Seth Howell authored Jul 14, 2025
```
This enables the CPU-Assisted data path.
Signed-off-by: Seth Howell <sethh@nvidia.com>
```
  e6b4f527
- setup.py: Remove nvcc_dlink specific gencode · 35e1cd1b
  Seth Howell authored Jul 14, 2025
```
Responding to review comments.
Signed-off-by: Seth Howell <sethh@nvidia.com>
```
  35e1cd1b
- setup.py: Add logic for detecting library locations from NVSHMEM wheels. · 2a873392
  Seth Howell authored Jul 14, 2025
```
Signed-off-by: Seth Howell <sethh@nvidia.com>
```
  2a873392
- setup.py: Clean up some extra prints. · 903711c6
  Seth Howell authored Jul 14, 2025
```
Signed-off-by: Seth Howell <sethh@nvidia.com>
```
  903711c6
- third-party: Add back nvshmem.patch. · b79ca2bb
  Seth Howell authored Jul 14, 2025
```
This will give consumers an opportunity to update their builds.
Signed-off-by: Seth Howell <sethh@nvidia.com>
```
  b79ca2bb
14 Jul, 2025 4 commits

- Fix · 079c5a4f
  Shangyan Zhou authored Jul 14, 2025
  
  079c5a4f
- Strengthen the barrier in `cached_notify` (#304) · eb155da4
  Shangyan Zhou authored Jul 14, 2025
```
* Strengthen the barrier in `cached_notify`

* lint

* Change the timing method

* lint
```
  eb155da4
- Minor fix · ea152b57
  Chenggang Zhao authored Jul 14, 2025
  
  ea152b57
- Optimize low latency combine send with TMA (#299) · c874cb7a
  Zhean Xu authored Jul 14, 2025

  * feat: low latency combine inplace TMA

  * optimize tma pointer with PatternVisitor

  * Minor cleanup

  * Add `elect_one_sync`

  ---------
  Co-authored-by: Zhean Xu <xza@deepseek.com>
  Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

  c874cb7a

12 Jul, 2025 3 commits

- third-party: Update readme to reflect new features. · 69f9dfe2
  Seth Howell authored Jul 11, 2025
```
Signed-off-by: Seth Howell <sethh@nvidia.com>
```
  69f9dfe2
- third-party: Add CPU-assisted IBGDA support · aa3187ef
  Seth Howell authored Jul 11, 2025

  This allows users to use NVSHMEM without setting the driver regkey.
  Signed-off-by: Seth Howell <sethh@nvidia.com>

  aa3187ef
- third-party: Update tests to use upstream NVSHMEM · 441833d3
  Seth Howell authored Jul 11, 2025

  NVSHMEM 3.3 and above support the host-side features
  in the patch.

  Note: Removed recv queue support
  Signed-off-by: Seth Howell <sethh@nvidia.com>

  441833d3

11 Jul, 2025 4 commits
- ibgda: support non-bond dual-port environments · 898269fa
  Shangyan Zhou authored Jul 11, 2025
```
ibgda: support non-bond dual-port environments via multi-port config
```
  898269fa
- Increase MAX_NUM_HCAS from 16 to 32 to support more NICs in NVSHMEM · 1cd5eea6
  root authored Jul 09, 2025
```
fix format
```
  1cd5eea6
- Update NVSHMEM README · b0f13ef7
  Shangyan Zhou authored Jul 11, 2025
  
  b0f13ef7
- Explicitly destroy the C++ runtime and release resources. (#292) · 0c984e25
  Shangyan Zhou authored Jul 11, 2025
```
* Explicitly destroy the C++ runtime and release resources.

* Small fix

* fix typo

* Add a flag to control whether explicit `destroy()` is required.
```
  0c984e25
10 Jul, 2025 2 commits

- Use TMA to optimize internode combine. (#287) · 06f417dc
  Shangyan Zhou authored Jul 10, 2025

  * Let forwarders use a dedicated SM

  * Shuffle rdma idx

  * Sender use TMA.

  * Adjust the tuning chunk size.

  * Modify NVL chunk layout.

  * Update some combine config.

  * Small lint

  * Minor fix

  * Overlap TMA store

  ---------
  Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

  06f417dc
- Support 10-bit LogFMT (simulated version) (#284) · 1cf85fb2
  Chenggang Zhao authored Jul 10, 2025

  * Add LogFMT interface

  * Update comments

  * Add simulated code

  * Fix comments

  * Change to 128 channels

  * Add notes

  * Optimize performance

  * optimize simulate logfmt 10bit

  * Minor fix

  * Stronger low latency tests

  * Minor fix

  * Stronger low latency tests for logfmt

  * Optimize logfmt simulate: lg2/ex2 ptx, step_inv

  * Minor fix

  * Minor fix

  * Add non-logfmt bench

  * Fix value=0 for logfmt

  * Optimize performance

  * Refactor tests

  ---------
  Co-authored-by: Zhean Xu <xza@deepseek.com>

  1cf85fb2

09 Jul, 2025 1 commit
- add DeepEP_multi_port_nobond ibgda support · 3571a927
  liuhe authored Jul 09, 2025
  
  3571a927
08 Jul, 2025 1 commit
- Better debugging messages · c50f3d6f
  Chenggang Zhao authored Jul 08, 2025
  
  c50f3d6f
07 Jul, 2025 1 commit
- feat: support cluster size 2 (#283) · eef7ab50
  Zhean Xu authored Jul 07, 2025
```
Co-authored-by: Zhean Xu <xza@deepseek.com>
```
  eef7ab50
04 Jul, 2025 2 commits

- Update some dispatch configs. · e6d61fc6
  Shangyan Zhou authored Jul 04, 2025
  
  e6d61fc6
- Use TMA to optimize internode dispatch. (#276) · a2fa3b73
  Shangyan Zhou authored Jul 04, 2025

  * Add TMA buffer allocation

  * Use TMA for forwarders and NVL receivers

  * Use lane 31 to operate TMA.

  * Change rdma buffer layout.

  * Use TMA to transfer scales also.

  * Increase the NVL recv buffer size.

  * Disable early stopping.

  * Apply similar optimizations on receiver warps.

  * Prevent warp divergence.

  * Disable aggressive ptx by default.

  * Revert using TMA to transfer scales.

  * Format.

  * Change the layout of dispatch NVL buffer.

  * Move topk transformation to recv warps.

  * Use TMA to transfer all data in foward warps

  * Use TMA to store scales.

  * Code lint

  ---------
  Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

  a2fa3b73

02 Jul, 2025 11 commits
- Refactor testing arguments · 7705f533
  Chenggang Zhao authored Jul 02, 2025
  
  7705f533
- Use CLI args instead of envs (#273) · 6b17f4fa
  youkaichao authored Jul 02, 2025
```
* use cli arg for num_processes
Signed-off-by: youkaichao <youkaichao@gmail.com>

* update low-latency
Signed-off-by: youkaichao <youkaichao@gmail.com>

* update intranode
Signed-off-by: youkaichao <youkaichao@gmail.com>

* update internode
Signed-off-by: youkaichao <youkaichao@gmail.com>

---------
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
  6b17f4fa
- Simplify · 341bb961
  Chenggang Zhao authored Jul 02, 2025
  
  341bb961
- Renaming · 63f79469
  Chenggang Zhao authored Jul 02, 2025
  
  63f79469
- Refactor the bench function · d79b3cd1
  Chenggang Zhao authored Jul 02, 2025
  
  d79b3cd1
- Support displaying separate send and recv time (#239) · 85793dda
  fzyzcjy authored Jul 02, 2025
```
* more

* more

* more

* more

* more

* more
```
  85793dda
- Remove launch bound for layout kernels · 77ddb015
  Chenggang Zhao authored Jul 02, 2025
  
  77ddb015
- Improve layout kernel performance · d4f34978
  Chenggang Zhao authored Jul 02, 2025
  
  d4f34978
- Unify testing envs' naming · 01f49071
  Chenggang Zhao authored Jul 02, 2025
  
  01f49071
- cherry pick (#251) · 8dcdd349
  fzyzcjy authored Jul 02, 2025
  
  8dcdd349
- more (#238) · 19fc0700
  fzyzcjy authored Jul 02, 2025
  
  19fc0700