Commits · 61bc0affa699c42c626b3b7fec71a801e089f945 · OpenDAS / DeepEP

04 Mar, 2026 1 commit
- Modify tests launched via torchrun · 61bc0aff
  lishen authored Mar 04, 2026
  
09 Feb, 2026 1 commit
- Add 10-bit quantization code for low-latency combine · 17d9c844
  lishen authored Feb 09, 2026
  
05 Feb, 2026 1 commit
- Update the tests to match the quantization test code changes · e57e9270
  lishen authored Feb 05, 2026
  
04 Feb, 2026 1 commit
- modify quant test. · ace6e18e
  lijian6 authored Feb 04, 2026
```
Signed-off-by: lijian <lijian6@sugon.com>
```
02 Feb, 2026 1 commit
- Support more complex quantization, including fp8/int8/ue8m0, with per-group/per-channel support · 44ec8bed
  lishen authored Feb 03, 2026
  
29 Jan, 2026 1 commit
- feat(test): Add some test opt. · 340c3f01
  lijian6 authored Jan 29, 2026
```
Signed-off-by: lijian <lijian6@sugon.com>
```
23 Dec, 2025 1 commit
- Support zero_copy correctness · f4b3020e
  lishen authored Dec 23, 2025
  
15 Dec, 2025 1 commit
- Improve tests for the int8 type in low-latency mode · 969e30f8
  lishen authored Dec 15, 2025
  
25 Nov, 2025 1 commit
- Support kernel interfaces for the int8 type · 6dfe3bc2
  lishen authored Nov 25, 2025
  
06 Nov, 2025 1 commit

1. Fix ll mode 256 experts err. · e18f726a

lijian6 authored Nov 06, 2025


2. Add internode ll mode.
3. Add test internode ll mode.
Signed-off-by: lijian <lijian6@sugon.com>


05 Nov, 2025 1 commit
- Support int8 communication in the low-latency interface · ce671dd4
  lishen authored Nov 05, 2025
  
03 Nov, 2025 1 commit
- Complete the low-latency interface functionality · da13c63a
  lishen authored Nov 04, 2025
  
24 Oct, 2025 1 commit
- 1. Fix a bug in using functions to obtain the num_nvl_bytes and num_rdma_bytes variables. · 0b14d3b2
  lijian6 authored Oct 24, 2025
```
2. Modify the test script to reduce GPU memory usage, from 17G -> 8G.
Signed-off-by: lijian <lijian6@sugon.com>
```
20 Oct, 2025 2 commits
- Add build.sh for whl and bak file · 5b5a7909
  lijian6 authored Oct 20, 2025
```
Signed-off-by: lijian <lijian6@sugon.com>
```
- Fix sync mode error. · e1283972
  lijian6 authored Oct 20, 2025
```
Signed-off-by: lijian <lijian6@sugon.com>
```
17 Oct, 2025 1 commit
- Fitter for DCU. · 5563b6d0
  lijian6 authored Oct 17, 2025
```
Signed-off-by: lijian <lijian6@sugon.com>
```
24 Sep, 2025 1 commit
- Make dtype of topk_idx configurable (#422) · da6ca24e
  Tailing Yuan authored Sep 24, 2025
```
Co-authored-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
```
22 Sep, 2025 1 commit
- Support EP48/96 for internode kernels and update some config. (#421) · c939644c
  Shangyan Zhou authored Sep 22, 2025
  
17 Sep, 2025 1 commit

Fix hidden_size % 128 != 0 in intranode kernels (#413) · abba6add

Shangyan Zhou authored Sep 17, 2025

* Fix hidden_size % 128 != 0

* Add `align_down()` function

* Use the full warp to wait TMA store

* Support arbitrary hidden sizes in fp8 cast

* lint


16 Sep, 2025 1 commit

Canonicalize TMA usages (#410) · 2012e310

Chenggang Zhao authored Sep 16, 2025

* Remove redundant TMA flushes

* Less barrier initialization overhead

* Simplify `elect_one_sync`

* Use `elect_one_sync` instead of lanes

* Minor fix

* Polish testing prints

* Refactor for internode kernels

* Better performance


10 Sep, 2025 1 commit

Add pressure test mode for internode test (#400) · ef70b83e

Shangyan Zhou authored Sep 10, 2025

* Suppress kineto output

* Add pressure test mode

* Add `x_pure_rand_e4m3` test

* Add more results into hash value


09 Sep, 2025 1 commit
- Fix bench warmup counts · 174c209f
  Chenggang Zhao authored Sep 09, 2025
  
14 Aug, 2025 1 commit
- [fix] handle empty tensor in per_token_cast_back (#360) · e3908bf5
  Yizhi Wang authored Aug 14, 2025
  
08 Aug, 2025 2 commits
- fix when --num-tokens == 1 (#356) · 695b6347
  AlphaBaby authored Aug 08, 2025
```
Co-authored-by: fujianhao.fjh <fujianhao.fjh@antgroup.com>
```
- Fix low latency test · 51071968
  Shangyan Zhou authored Aug 08, 2025
  
07 Aug, 2025 1 commit

Support 10-bit LogFMT Combine (#345) · c5facf5c

Zhean Xu authored Aug 07, 2025



* independent logfmt_simulate function

* draft: logfmt low latency combine

* Minor bug fixes

* Fix non-logfmt bugs

* Fix logfmt bugs

* Fix logfmt bugs

* Minor fix

* Minor fix

* Clean code

* Clean code

* Use fewer regs

* Use two warp groups

* Correct shared memory size

* Minor fix

* Minor fix

* More rigorous tests

* Clean code

* Use more SMs

* Use different unroll factor for send & recv

* Update csrc/kernels/internode_ll.cu
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update csrc/kernels/internode_ll.cu
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Some renaming

* Some comments of tests

* Format `logfmt_encode`

* More lints

* Some refactors on sends

* Fix testing

* Fix bugs

* Renaming

* Use the full warp

* Unify combine recv

* Lint

* Lint

* Support 2560

* Fix meta buffer dtype

* Better encode calls

* Better amin/max writes

* Extra sync

* Read `topk_idx` only once

* Better specialization

* Read weights only once

* Rename

* Bug fixed

* Some renaming

* Fix local memory usage for sending

* Fix local memory usage for receiving

* Fewer writes

* Optimize performance

* Optimize performance

* Better performance

* Optimize performance

* Fix rounding

* Manually unroll

* Fix bench

---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>


01 Aug, 2025 1 commit
- more (#275) · ab0a3dd2
  fzyzcjy authored Aug 01, 2025
  
  ab0a3dd2
31 Jul, 2025 1 commit
- Remove the diagnosis part from tests · c7033854
  Chenggang Zhao authored Jul 31, 2025
  
  c7033854
30 Jul, 2025 1 commit

Add diagnosis module for efficient and precise location of slow rank (#311) · 4b67064d

sky authored Jul 30, 2025



* Add diagnosis module for precise identification of slow ranks
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Add tests for the slow diagnosis module
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Update some comments for diagnose
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Update test case for diagnose
Signed-off-by: wangfakang <fakangwang@gmail.com>

* Strip the diagnose module, thx LyricZhao and sphish.
Signed-off-by: wangfakang <fakangwang@gmail.com>

* update variable name and cumulative wait recv cost, thx sphish.
Signed-off-by: wangfakang <fakangwang@gmail.com>

* remove invalid comments.
Signed-off-by: wangfakang <fakangwang@gmail.com>

---------
Signed-off-by: wangfakang <fakangwang@gmail.com>


22 Jul, 2025 1 commit
- Update combine config · bdd119f8
  Shangyan Zhou authored Jul 22, 2025
  
21 Jul, 2025 1 commit

Minor patches for deepep (#318) · 5b549c85

Guangguan Wang authored Jul 21, 2025



* Add arg --pressure-test for test_low_latency.py

Add arg --pressure-test for test_low_latency.py
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>

* Export NVSHMEM_QP_DEPTH

Export NVSHMEM_QP_DEPTH
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>

---------
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>


14 Jul, 2025 1 commit
- Strengthen the barrier in `cached_notify` (#304) · eb155da4
  Shangyan Zhou authored Jul 14, 2025
```
* Strengthen the barrier in `cached_notify`

* lint

* Change the timing method

* lint
```
11 Jul, 2025 1 commit

Explicitly destroy the C++ runtime and release resources. (#292) · 0c984e25

Shangyan Zhou authored Jul 11, 2025

* Explicitly destroy the C++ runtime and release resources.

* Small fix

* fix typo

* Add a flag to control whether explicit `destroy()` is required.


10 Jul, 2025 2 commits

Use TMA to optimize internode combine. (#287) · 06f417dc

Shangyan Zhou authored Jul 10, 2025



* Let forwarders use a dedicated SM

* Shuffle rdma idx

* Senders use TMA.

* Adjust the tuning chunk size.

* Modify NVL chunk layout.

* Update some combine config.

* Small lint

* Minor fix

* Overlap TMA store

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>


Support 10-bit LogFMT (simulated version) (#284) · 1cf85fb2

Chenggang Zhao authored Jul 10, 2025



* Add LogFMT interface

* Update comments

* Add simulated code

* Fix comments

* Change to 128 channels

* Add notes

* Optimize performance

* optimize simulate logfmt 10bit

* Minor fix

* Stronger low latency tests

* Minor fix

* Stronger low latency tests for logfmt

* Optimize logfmt simulate: lg2/ex2 ptx, step_inv

* Minor fix

* Minor fix

* Add non-logfmt bench

* Fix value=0 for logfmt

* Optimize performance

* Refactor tests

---------
Co-authored-by: Zhean Xu <xza@deepseek.com>


04 Jul, 2025 2 commits

Update some dispatch configs. · e6d61fc6
Shangyan Zhou authored Jul 04, 2025


Use TMA to optimize internode dispatch. (#276) · a2fa3b73

Shangyan Zhou authored Jul 04, 2025



* Add TMA buffer allocation

* Use TMA for forwarders and NVL receivers

* Use lane 31 to operate TMA.

* Change rdma buffer layout.

* Use TMA to transfer scales also.

* Increase the NVL recv buffer size.

* Disable early stopping.

* Apply similar optimizations on receiver warps.

* Prevent warp divergence.

* Disable aggressive ptx by default.

* Revert using TMA to transfer scales.

* Format.

* Change the layout of dispatch NVL buffer.

* Move topk transformation to recv warps.

* Use TMA to transfer all data in forward warps

* Use TMA to store scales.

* Code lint

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>


02 Jul, 2025 3 commits

Refactor testing arguments · 7705f533
Chenggang Zhao authored Jul 02, 2025


Use CLI args instead of envs (#273) · 6b17f4fa

youkaichao authored Jul 02, 2025



* use cli arg for num_processes
Signed-off-by: youkaichao <youkaichao@gmail.com>

* update low-latency
Signed-off-by: youkaichao <youkaichao@gmail.com>

* update intranode
Signed-off-by: youkaichao <youkaichao@gmail.com>

* update internode
Signed-off-by: youkaichao <youkaichao@gmail.com>

---------
Signed-off-by: youkaichao <youkaichao@gmail.com>


Simplify · 341bb961
Chenggang Zhao authored Jul 02, 2025
