Commits · 898269fa06b743353240f1fc17a4cca9bbf487a9 · OpenDAS / DeepEP

11 Jul, 2025 1 commit
- Increase MAX_NUM_HCAS from 16 to 32 to support more NICs in NVSHMEM · 1cd5eea6
  root authored Jul 09, 2025
```
fix format
```
10 Jul, 2025 2 commits

Use TMA to optimize internode combine. (#287) · 06f417dc

Shangyan Zhou authored Jul 10, 2025



* Let forwarders use a dedicated SM

* Shuffle rdma idx

* Sender use TMA.

* Adjust the tuning chunk size.

* Modify NVL chunk layout.

* Update some combine config.

* Small lint

* Minor fix

* Overlap TMA store

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

Support 10-bit LogFMT (simulated version) (#284) · 1cf85fb2

Chenggang Zhao authored Jul 10, 2025



* Add LogFMT interface

* Update comments

* Add simulated code

* Fix comments

* Change to 128 channels

* Add notes

* Optimize performance

* optimize simulate logfmt 10bit

* Minor fix

* Stronger low latency tests

* Minor fix

* Stronger low latency tests for logfmt

* Optimize logfmt simulate: lg2/ex2 ptx, step_inv

* Minor fix

* Minor fix

* Add non-logfmt bench

* Fix value=0 for logfmt

* Optimize performance

* Refactor tests

---------
Co-authored-by: Zhean Xu <xza@deepseek.com>
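
The bullets above mention a simulated 10-bit LogFMT with `step_inv` and a fix for `value=0`. As a rough sketch only, assuming the format stores one sign bit plus a 9-bit code that linearly quantizes `log2(|x|)` between the per-tile minimum and maximum, with code 0 reserved for zero, the idea can be written as below. The function names, the log base, and rounding in the log domain are assumptions for illustration and are not taken from the DeepEP kernels.

```python
import math


def logfmt_encode(tile, n_bits=10):
    """Quantize a tile of floats to (sign, code) pairs in the log2 domain.

    Hypothetical layout: 1 sign bit + (n_bits - 1) code bits, code 0 means zero.
    """
    levels = (1 << (n_bits - 1)) - 1  # non-zero codes 1..levels (511 for 10 bits)
    logs = [math.log2(abs(x)) for x in tile if x != 0.0]
    if not logs:
        # All-zero tile: every element takes the reserved zero code.
        return [(0, 0)] * len(tile), 0.0, 0.0
    log_min, log_max = min(logs), max(logs)
    # Spacing between adjacent codes; 0.0 when every magnitude is identical.
    step = (log_max - log_min) / (levels - 1) if log_max > log_min else 0.0
    # Precompute the inverse once so the per-element work is a multiply.
    step_inv = 1.0 / step if step > 0.0 else 0.0
    codes = []
    for x in tile:
        if x == 0.0:
            codes.append((0, 0))  # zero has no logarithm, so it gets code 0
        else:
            k = 1 + round((math.log2(abs(x)) - log_min) * step_inv)
            codes.append((1 if x < 0 else 0, k))
    return codes, log_min, step


def logfmt_decode(codes, log_min, step):
    """Invert logfmt_encode up to quantization error."""
    out = []
    for sign, k in codes:
        if k == 0:
            out.append(0.0)
        else:
            mag = 2.0 ** (log_min + step * (k - 1))
            out.append(-mag if sign else mag)
    return out
```

In this sketch the smallest non-zero magnitude in a tile maps to code 1 and the largest to code 511, so both endpoints round-trip almost exactly while values in between carry a bounded relative error.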

09 Jul, 2025 1 commit
- add DeepEP_multi_port_nobond ibgda support · 3571a927
  liuhe authored Jul 09, 2025
08 Jul, 2025 1 commit
- Better debugging messages · c50f3d6f
  Chenggang Zhao authored Jul 08, 2025
07 Jul, 2025 1 commit
- feat: support cluster size 2 (#283) · eef7ab50
  Zhean Xu authored Jul 07, 2025
```
Co-authored-by: Zhean Xu <xza@deepseek.com>
```
04 Jul, 2025 1 commit

Use TMA to optimize internode dispatch. (#276) · a2fa3b73

Shangyan Zhou authored Jul 04, 2025



* Add TMA buffer allocation

* Use TMA for forwarders and NVL receivers

* Use lane 31 to operate TMA.

* Change rdma buffer layout.

* Use TMA to transfer scales also.

* Increase the NVL recv buffer size.

* Disable early stopping.

* Apply similar optimizations on receiver warps.

* Prevent warp divergence.

* Disable aggressive ptx by default.

* Revert using TMA to transfer scales.

* Format.

* Change the layout of dispatch NVL buffer.

* Move topk transformation to recv warps.

* Use TMA to transfer all data in forward warps

* Use TMA to store scales.

* Code lint

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

02 Jul, 2025 4 commits
- Remove launch bound for layout kernels · 77ddb015
  Chenggang Zhao authored Jul 02, 2025
- Improve layout kernel performance · d4f34978
  Chenggang Zhao authored Jul 02, 2025
- Code cleanup · 0a47402f
  Chenggang Zhao authored Jul 02, 2025
- support hidden size 8192 (#264) · b6516358
  ruizhang1230 authored Jul 02, 2025
```
* support hidden size 8192

* refactor code

* fix assert
```
27 Jun, 2025 6 commits
- opt code · d24bbeba
  alpha-baby authored Jun 27, 2025
- enhance warp copy · b90320e2
  alpha-baby authored Jun 27, 2025
- Stricter conditions for aggressive PTX instructions · 004d6f9b
  Chenggang Zhao authored Jun 27, 2025
- Remove memory fence in NVLink barrier. (#253) · 7de7464e
  Shangyan Zhou authored Jun 27, 2025
```
* Remove memory fence in NVLink barrier.

* Move `__syncthreads` and fence into barrier.

* Fix bugs

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
```
- Minor fix. · 58c47942
  Shangyan Zhou authored Jun 27, 2025
- Minor fixes · 7ce8da4e
  Chenggang Zhao authored Jun 27, 2025
26 Jun, 2025 1 commit
- Fix transaction window. (#260) · ed3444bf
  Shangyan Zhou authored Jun 26, 2025
25 Jun, 2025 1 commit
- Support bias. (#257) · bd429ffe
  Shangyan Zhou authored Jun 25, 2025
```
* Support bias.

* Fix.

* Fix style.
```
24 Jun, 2025 3 commits
- Remove useless assertion · a15faa9f
  Chenggang Zhao authored Jun 24, 2025
- Add the transaction window data structure for RDMA senders (#245) · bc118b24
  Chenggang Zhao authored Jun 24, 2025
```
* Add draft

* Add fast-debugging flags

* Fix several bugs

* Add sender timeout checks

* Fix stuck

* Fix bugs

* Fix bugs
```
- Optimize intranode combine. (#247) · 9eb2f84b
  Shangyan Zhou authored Jun 24, 2025
```
* Increase the test round.

* Add warp synchronization.

* Shuffle the send warps.

* Add time elapsed into bench result.
```
23 Jun, 2025 1 commit
- Update internode_ll.cu (#246) · fbcf4300
  fzyzcjy authored Jun 23, 2025
20 Jun, 2025 1 commit
- Support more hidden size · 7b0c25f8
  Chenggang Zhao authored Jun 20, 2025
18 Jun, 2025 1 commit
- Fix the tail loading issue. (#219) · 77f97f79
  Shangyan Zhou authored Jun 18, 2025
```
* Fix the tail loading issue.

* Modify the sync offset.
```
16 Jun, 2025 4 commits
- Fix warp synchronization. (#215) · dd133d39
  Shangyan Zhou authored Jun 16, 2025
```
* Fix warp synchronization.

* Another fix.
```
- Remove the low-latency usage flag (#214) · 8aaddf76
  Chenggang Zhao authored Jun 16, 2025
- Add automatic warp count control for low-latency kernels (#213) · 1b92be8a
  Chenggang Zhao authored Jun 16, 2025
```
* Add automatic warp count control for low-latency dispatch

* Add automatic warp count control for low-latency combine

* More assertions
```
- Update intranode.cu (#210) · 4e923188
  fzyzcjy authored Jun 16, 2025
13 Jun, 2025 2 commits
- Update assertion of `num_rc_per_pe`. · 483f00af
  Shangyan Zhou authored Jun 13, 2025
- Use one qp per sm for internode normal kernels (#181) · 05df5554
  Zhicheng Wu authored Jun 13, 2025
```
let the sender SM use the channel_id, and the receiver SM use channel_id + num_channels
```
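
As a hedged illustration of the mapping in this commit message, the sketch below gives each channel two queue-pair indices: the sender SM takes `channel_id` and the receiver SM takes `channel_id + num_channels`. The function name `qp_index` and the `is_sender` flag are hypothetical and do not come from the DeepEP source.

```python
def qp_index(channel_id: int, num_channels: int, is_sender: bool) -> int:
    """Return the queue-pair index used by one SM of a channel.

    Senders occupy indices [0, num_channels) and receivers occupy
    [num_channels, 2 * num_channels), so the two roles never share a QP.
    """
    if not 0 <= channel_id < num_channels:
        raise ValueError("channel_id must be in [0, num_channels)")
    return channel_id if is_sender else channel_id + num_channels


def all_qp_indices(num_channels: int) -> list:
    """List every QP index touched by the senders and receivers of all channels."""
    senders = [qp_index(c, num_channels, True) for c in range(num_channels)]
    receivers = [qp_index(c, num_channels, False) for c in range(num_channels)]
    return senders + receivers
```

Under this reading, `num_channels` channels use `2 * num_channels` distinct indices, one per SM.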
12 Jun, 2025 1 commit
- Support UE8M0 data format. (#206) · 21efbe9b
  Shifang Xu authored Jun 12, 2025
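
For readers unfamiliar with the name, UE8M0 is commonly understood as an unsigned 8-bit exponent with no mantissa bits, so a byte `e` stands for the power of two `2^(e - 127)`. The sketch below shows that encoding for a positive scale factor. Rounding the scale up to the next power of two and clamping to the range 0..254 are assumptions made for illustration, and the function names are hypothetical rather than DeepEP APIs.

```python
import math

UE8M0_BIAS = 127  # exponent bias of the 8-bit power-of-two encoding


def ue8m0_encode(scale: float) -> int:
    """Round a positive finite scale up to a power of two and return its biased exponent."""
    if not scale > 0.0 or math.isinf(scale):
        raise ValueError("scale must be positive and finite")
    # frexp gives scale = m * 2**e with 0.5 <= m < 1, which is exact for floats.
    m, e = math.frexp(scale)
    exp = e - 1 if m == 0.5 else e  # exact powers of two keep their exponent
    # Clamp to 0..254; 255 is left unused in this sketch.
    return max(0, min(254, exp + UE8M0_BIAS))


def ue8m0_decode(byte: int) -> float:
    """Return the power of two represented by a biased exponent byte."""
    return 2.0 ** (byte - UE8M0_BIAS)
```

Rounding up keeps the decoded scale at least as large as the original, which is one way to avoid overflow when values are divided by the scale before quantization.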
  
11 Jun, 2025 2 commits

Support Ampere architecture (#204) · b8d90fb7

Chenggang Zhao authored Jun 11, 2025

* Update README

* Update `setup.py`

* Fix headers

* Add `DISABLE_NVSHMEM` for APIs

* Fix launch

* Fix TMA settings

* Fix TMA usages

* Fix dlink

* Separate layout kernels

* Update version

* Add `is_sm90_compiled`

* Fix tests

* Add NVLink connection checks

* Update README

* Fix tests

* Add some comments

* Minor fix

* Minor fix

* Fix bugs

Support CUDA graph for intranode normal kernels (#203) · a8299ca7
Chenggang Zhao authored Jun 11, 2025

10 Jun, 2025 1 commit

Fully remove barrier FIFO designs (#200) · 8da2d7b3

Chenggang Zhao authored Jun 10, 2025

* Fully remove FIFO slots

* Fully remove FIFO buffers

* Minor fix styles

* Fix some typos

* Bugs fixed

* Cleanup `ibgda_poll_cq`

09 Jun, 2025 3 commits
- Support statistics tensor for low-latency kernels (#196) · 5a2e37fa
  Chenggang Zhao authored Jun 09, 2025
- Add low-latency kernel PCIe usage flag (#195) · 0d1a855d
  Chenggang Zhao authored Jun 09, 2025
```
* Add low-latency kernel usage flag

* Update comments
```
- Fix `< PTX ISA 8.6` compatibility (#194) · 564e3752
  Chenggang Zhao authored Jun 09, 2025
06 Jun, 2025 1 commit

Use TMA instead of LD/ST for intra-node normal kernels (#191) · c8dceba1

Chenggang Zhao authored Jun 06, 2025

* Update CMake files

* Use TMA instead of LD/ST for intranode dispatch

* Use TMA instead of LD/ST for intranode combine

* Adjust configs

* Test default configs as well

* More warps for combine

* Add inter-thread fence

* Enable more warps

* Do not use TMA for senders

* Update configs

* Remove useless wait

03 Jun, 2025 1 commit
- Fix notify_dispatch: using warp 0 to issue send · d0225df2
  wzc.wuzhicheng authored Jun 03, 2025
```
Signed-off-by: wzc.wuzhicheng <wzc.wuzhicheng@linux.alibaba.com>
```