Commits · b0f13ef7c410c4da3c62d03a1650603a005fbb45 · OpenDAS / DeepEP

11 Jul, 2025 2 commits
- Update NVSHMEM README · b0f13ef7
  Shangyan Zhou authored Jul 11, 2025
  
- Explicitly destroy the C++ runtime and release resources. (#292) · 0c984e25
  Shangyan Zhou authored Jul 11, 2025
```
* Explicitly destroy the C++ runtime and release resources.

* Small fix

* fix typo

* Add a flag to control whether explicit `destroy()` is required.
```
10 Jul, 2025 2 commits

- Use TMA to optimize internode combine. (#287) · 06f417dc
  Shangyan Zhou authored Jul 10, 2025

* Let forwarders use a dedicated SM

* Shuffle rdma idx

* Sender use TMA.

* Adjust the tuning chunk size.

* Modify NVL chunk layout.

* Update some combine config.

* Small lint

* Minor fix

* Overlap TMA store

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>


- Support 10-bit LogFMT (simulated version) (#284) · 1cf85fb2
  Chenggang Zhao authored Jul 10, 2025

* Add LogFMT interface

* Update comments

* Add simulated code

* Fix comments

* Change to 128 channels

* Add notes

* Optimize performance

* optimize simulate logfmt 10bit

* Minor fix

* Stronger low latency tests

* Minor fix

* Stronger low latency tests for logfmt

* Optimize logfmt simulate: lg2/ex2 ptx, step_inv

* Minor fix

* Minor fix

* Add non-logfmt bench

* Fix value=0 for logfmt

* Optimize performance

* Refactor tests

---------
Co-authored-by: Zhean Xu <xza@deepseek.com>


08 Jul, 2025 1 commit
- Better debugging messages · c50f3d6f
  Chenggang Zhao authored Jul 08, 2025
  
07 Jul, 2025 1 commit
- feat: support cluster size 2 (#283) · eef7ab50
  Zhean Xu authored Jul 07, 2025
```
Co-authored-by: Zhean Xu <xza@deepseek.com>
```
04 Jul, 2025 2 commits

- Update some dispatch configs. · e6d61fc6
  Shangyan Zhou authored Jul 04, 2025


- Use TMA to optimize internode dispatch. (#276) · a2fa3b73
  Shangyan Zhou authored Jul 04, 2025

* Add TMA buffer allocation

* Use TMA for forwarders and NVL receivers

* Use lane 31 to operate TMA.

* Change rdma buffer layout.

* Use TMA to transfer scales also.

* Increase the NVL recv buffer size.

* Disable early stopping.

* Apply similar optimizations on receiver warps.

* Prevent warp divergence.

* Disable aggressive ptx by default.

* Revert using TMA to transfer scales.

* Format.

* Change the layout of dispatch NVL buffer.

* Move topk transformation to recv warps.

* Use TMA to transfer all data in forward warps

* Use TMA to store scales.

* Code lint

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>


02 Jul, 2025 14 commits
- Refactor testing arguments · 7705f533
  Chenggang Zhao authored Jul 02, 2025
  
- Use CLI args instead of envs (#273) · 6b17f4fa
  youkaichao authored Jul 02, 2025
```
* use cli arg for num_processes
Signed-off-by: youkaichao <youkaichao@gmail.com>

* update low-latency
Signed-off-by: youkaichao <youkaichao@gmail.com>

* update intranode
Signed-off-by: youkaichao <youkaichao@gmail.com>

* update internode
Signed-off-by: youkaichao <youkaichao@gmail.com>

---------
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
- Simplify · 341bb961
  Chenggang Zhao authored Jul 02, 2025
  
- Renaming · 63f79469
  Chenggang Zhao authored Jul 02, 2025
  
- Refactor the bench function · d79b3cd1
  Chenggang Zhao authored Jul 02, 2025
  
- Support displaying separate send and recv time (#239) · 85793dda
  fzyzcjy authored Jul 02, 2025
```
* more

* more

* more

* more

* more

* more
```
- Remove launch bound for layout kernels · 77ddb015
  Chenggang Zhao authored Jul 02, 2025
  
- Improve layout kernel performance · d4f34978
  Chenggang Zhao authored Jul 02, 2025
  
- Unify testing envs' naming · 01f49071
  Chenggang Zhao authored Jul 02, 2025
  
- cherry pick (#251) · 8dcdd349
  fzyzcjy authored Jul 02, 2025
  
- more (#238) · 19fc0700
  fzyzcjy authored Jul 02, 2025
  
- Code cleanup · 0a47402f
  Chenggang Zhao authored Jul 02, 2025
  
- support hidden size 8192 (#264) · b6516358
  ruizhang1230 authored Jul 02, 2025
```
* support hidden size 8192

* refactor code

* fix assert
```
- remove redundant variable num_scales (#265) · 486dd1d9
  Zhiyi Hu authored Jul 02, 2025
```
Co-authored-by: zhiyi Hu <zhiyihu@U-NYQQMGK0-2250.local>
```
30 Jun, 2025 1 commit
- Merge pull request #266 from alpha-baby/fujh/enhance_warp_copy · 8b0c5944
  Shangyan Zhou authored Jun 30, 2025
```
enhance warp copy
```
27 Jun, 2025 7 commits
- opt code · d24bbeba
  alpha-baby authored Jun 27, 2025
  
- enhance warp copy · b90320e2
  alpha-baby authored Jun 27, 2025
  
- Stricter conditions for aggressive PTX instructions · 004d6f9b
  Chenggang Zhao authored Jun 27, 2025
  
- Remove memory fence in NVLink barrier. (#253) · 7de7464e
  Shangyan Zhou authored Jun 27, 2025
```
* Remove memory fence in NVLink barrier.

* Move `__syncthread` and fence into barrier.

* Fix bugs

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
```
- Change the default num of QPs. (#263) · 4e72eb39
  Shangyan Zhou authored Jun 27, 2025
  
- Minor fix. · 58c47942
  Shangyan Zhou authored Jun 27, 2025
  
- Minor fixes · 7ce8da4e
  Chenggang Zhao authored Jun 27, 2025
  
26 Jun, 2025 1 commit
- Fix transaction window. (#260) · ed3444bf
  Shangyan Zhou authored Jun 26, 2025
  
25 Jun, 2025 2 commits
- Support bias. (#257) · bd429ffe
  Shangyan Zhou authored Jun 25, 2025
```
* Support bias.

* Fix.

* Fix style.
```
- Add `get_comm_stream`. (#256) · b80e55e2
  Shangyan Zhou authored Jun 25, 2025
```
* Add `get_comm_stream`.

* Fix style.
```
24 Jun, 2025 3 commits
- Remove useless assertion · a15faa9f
  Chenggang Zhao authored Jun 24, 2025
  
- Add the transaction window data structure for RDMA senders (#245) · bc118b24
  Chenggang Zhao authored Jun 24, 2025
```
* Add draft

* Add fast-debugging flags

* Fix several bugs

* Add sender timeout checks

* Fix stuck

* Fix bugs

* Fix bugs
```
- Optimize intranode combine. (#247) · 9eb2f84b
  Shangyan Zhou authored Jun 24, 2025
```
* Increase the test round.

* Add warp synchronization.

* Shuffle the send warps.

* Add time elapsed into bench result.
```
23 Jun, 2025 2 commits
- Update internode_ll.cu (#246) · fbcf4300
  fzyzcjy authored Jun 23, 2025
  
- Update deep_ep.cpp (#242) · c95997f8
  fzyzcjy authored Jun 23, 2025
  
20 Jun, 2025 1 commit
- Support more hidden size · 7b0c25f8
  Chenggang Zhao authored Jun 20, 2025
  
18 Jun, 2025 1 commit
- Surpass type checks · 9d4f7ef8
  Chenggang Zhao authored Jun 18, 2025
  