- 04 Mar, 2026 1 commit
-
-
lishen authored
-
- 09 Feb, 2026 1 commit
-
-
lishen authored
-
- 05 Feb, 2026 1 commit
-
-
lishen authored
-
- 04 Feb, 2026 1 commit
-
-
lijian6 authored
Signed-off-by: lijian <lijian6@sugon.com>
-
- 02 Feb, 2026 1 commit
-
-
lishen authored
-
- 29 Jan, 2026 1 commit
-
-
lijian6 authored
Signed-off-by: lijian <lijian6@sugon.com>
-
- 23 Dec, 2025 1 commit
-
-
lishen authored
-
- 15 Dec, 2025 1 commit
-
-
lishen authored
-
- 25 Nov, 2025 1 commit
-
-
lishen authored
-
- 06 Nov, 2025 1 commit
-
-
lijian6 authored
2. Add internode ll (low-latency) mode.
3. Add a test for internode ll mode.
Signed-off-by: lijian <lijian6@sugon.com>
-
- 05 Nov, 2025 1 commit
-
-
lishen authored
-
- 03 Nov, 2025 1 commit
-
-
lishen authored
-
- 24 Oct, 2025 1 commit
-
-
lijian6 authored
2. Modify the test script to reduce GPU memory usage, from 17G to 8G.
Signed-off-by: lijian <lijian6@sugon.com>
-
- 20 Oct, 2025 2 commits
-
-
lijian6 authored
Signed-off-by: lijian <lijian6@sugon.com>
-
lijian6 authored
Signed-off-by: lijian <lijian6@sugon.com>
-
- 17 Oct, 2025 1 commit
-
-
lijian6 authored
Signed-off-by: lijian <lijian6@sugon.com>
-
- 24 Sep, 2025 1 commit
-
-
Tailing Yuan authored
Co-authored-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
-
- 22 Sep, 2025 1 commit
-
-
Shangyan Zhou authored
-
- 17 Sep, 2025 1 commit
-
-
Shangyan Zhou authored
* Fix hidden_size % 128 != 0
* Add `align_down()` function
* Use the full warp to wait TMA store
* Support arbitrary hidden sizes in fp8 cast
* lint
-
- 16 Sep, 2025 1 commit
-
-
Chenggang Zhao authored
* Remove redundant TMA flushes
* Less barrier initialization overhead
* Simplify `elect_one_sync`
* Use `elect_one_sync` instead of lanes
* Minor fix
* Polish testing prints
* Refactor for internode kernels
* Better performance
-
- 10 Sep, 2025 1 commit
-
-
Shangyan Zhou authored
* Suppress kineto output
* Add pressure test mode
* Add `x_pure_rand_e4m3` test
* Add more results into hash value
-
- 09 Sep, 2025 1 commit
-
-
Chenggang Zhao authored
-
- 14 Aug, 2025 1 commit
-
-
Yizhi Wang authored
-
- 08 Aug, 2025 2 commits
-
-
AlphaBaby authored
Co-authored-by: fujianhao.fjh <fujianhao.fjh@antgroup.com>
-
Shangyan Zhou authored
-
- 07 Aug, 2025 1 commit
-
-
Zhean Xu authored
* independent logfmt_simulate function
* draft: logfmt low latency combine
* Minor bug fixes
* Fix non-logfmt bugs
* Fix logfmt bugs
* Fix logfmt bugs
* Minor fix
* Minor fix
* Clean code
* Clean code
* Use fewer regs
* Use two warp groups
* Correct shared memory size
* Minor fix
* Minor fix
* More rigorous tests
* Clean code
* Use more SMs
* Use different unroll factor for send & recv
* Update csrc/kernels/internode_ll.cu Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update csrc/kernels/internode_ll.cu Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Some renaming
* Some comments of tests
* Format `logfmt_encode`
* More lints
* Some refactors on sends
* Fix testing
* Fix bugs
* Renaming
* Use the full warp
* Unify combine recv
* Lint
* Lint
* Support 2560
* Fix meta buffer dtype
* Better encode calls
* Better amin/max writes
* Extra sync
* Read `topk_idx` by once
* Better specialization
* Read weights by once
* Rename
* Bug fixed
* Some renaming
* Fix local memory usage for sending
* Fix local memory usage for receiving
* Less writes
* Optimize performance
* Optimize performance
* Better performance
* Optimize performance
* Fix rounding
* Manually unroll
* Fix bench
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
-
- 01 Aug, 2025 1 commit
-
-
fzyzcjy authored
-
- 31 Jul, 2025 1 commit
-
-
Chenggang Zhao authored
-
- 30 Jul, 2025 1 commit
-
-
sky authored
* Add diagnosis module for precise identification of slow ranks Signed-off-by: wangfakang <fakangwang@gmail.com>
* Add tests for the slow diagnosis module Signed-off-by: wangfakang <fakangwang@gmail.com>
* Update some comments for diagnose Signed-off-by: wangfakang <fakangwang@gmail.com>
* Update test case for diagnose Signed-off-by: wangfakang <fakangwang@gmail.com>
* Strip the diagnose module, thx LyricZhao and sphish. Signed-off-by: wangfakang <fakangwang@gmail.com>
* update variable name and cumulative wait recv cost, thx sphish. Signed-off-by: wangfakang <fakangwang@gmail.com>
* remove invalid comments. Signed-off-by: wangfakang <fakangwang@gmail.com>
---------
Signed-off-by: wangfakang <fakangwang@gmail.com>
-
- 22 Jul, 2025 1 commit
-
-
Shangyan Zhou authored
-
- 21 Jul, 2025 1 commit
-
-
Guangguan Wang authored
* Add arg --pressure-test for test_low_latency.py Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
* Export NVSHMEM_QP_DEPTH Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
---------
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
-
- 14 Jul, 2025 1 commit
-
-
Shangyan Zhou authored
* Strengthen the barrier in `cached_notify`
* lint
* Change the timing method
* lint
-
- 11 Jul, 2025 1 commit
-
-
Shangyan Zhou authored
* Explicitly destroy the C++ runtime and release resources.
* Small fix
* fix typo
* Add a flag to control whether explicit `destroy()` is required.
-
- 10 Jul, 2025 2 commits
-
-
Shangyan Zhou authored
* Let forwarders use a dedicated SM
* Shuffle rdma idx
* Sender use TMA.
* Adjust the tuning chunk size.
* Modify NVL chunk layout.
* Update some combine config.
* Small lint
* Minor fix
* Overlap TMA store
---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
-
Chenggang Zhao authored
* Add LogFMT interface
* Update comments
* Add simulated code
* Fix comments
* Change to 128 channels
* Add notes
* Optimize performance
* optimize simulate logfmt 10bit
* Minor fix
* Stronger low latency tests
* Minor fix
* Stronger low latency tests for logfmt
* Optimize logfmt simulate: lg2/ex2 ptx, step_inv
* Minor fix
* Minor fix
* Add non-logfmt bench
* Fix value=0 for logfmt
* Optimize performance
* Refactor tests
---------
Co-authored-by: Zhean Xu <xza@deepseek.com>
-
- 04 Jul, 2025 2 commits
-
-
Shangyan Zhou authored
-
Shangyan Zhou authored
* Add TMA buffer allocation
* Use TMA for forwarders and NVL receivers
* Use lane 31 to operate TMA.
* Change rdma buffer layout.
* Use TMA to transfer scales also.
* Increase the NVL recv buffer size.
* Disable early stopping.
* Apply similar optimizations on receiver warps.
* Prevent warp divergence.
* Disable aggressive ptx by default.
* Revert using TMA to transfer scales.
* Format.
* Change the layout of dispatch NVL buffer.
* Move topk transformation to recv warps.
* Use TMA to transfer all data in forward warps
* Use TMA to store scales.
* Code lint
---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
-
- 02 Jul, 2025 3 commits
-
-
Chenggang Zhao authored
-
youkaichao authored
* use cli arg for num_processes Signed-off-by: youkaichao <youkaichao@gmail.com>
* update low-latency Signed-off-by: youkaichao <youkaichao@gmail.com>
* update intranode Signed-off-by: youkaichao <youkaichao@gmail.com>
* update internode Signed-off-by: youkaichao <youkaichao@gmail.com>
---------
Signed-off-by: youkaichao <youkaichao@gmail.com>
-
Chenggang Zhao authored
-