- 14 Jul, 2025 4 commits
-
-
Shangyan Zhou authored
-
Shangyan Zhou authored
* Strengthen the barrier in `cached_notify`
* lint
* Change the timing method
* lint
-
Chenggang Zhao authored
-
Zhean Xu authored
* feat: low latency combine inplace TMA
* optimize tma pointer with PatternVisitor
* Minor cleanup
* Add `elect_one_sync`
Co-authored-by: Zhean Xu <xza@deepseek.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
-
- 11 Jul, 2025 2 commits
-
-
root authored
fix format
-
Shangyan Zhou authored
* Explicitly destroy the C++ runtime and release resources.
* Small fix
* fix typo
* Add a flag to control whether explicit `destroy()` is required.
-
- 10 Jul, 2025 2 commits
-
-
Shangyan Zhou authored
* Let forwarders use a dedicated SM
* Shuffle rdma idx
* Sender uses TMA.
* Adjust the tuning chunk size.
* Modify NVL chunk layout.
* Update some combine config.
* Small lint
* Minor fix
* Overlap TMA store
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
-
Chenggang Zhao authored
* Add LogFMT interface
* Update comments
* Add simulated code
* Fix comments
* Change to 128 channels
* Add notes
* Optimize performance
* optimize simulate logfmt 10bit
* Minor fix
* Stronger low latency tests
* Minor fix
* Stronger low latency tests for logfmt
* Optimize logfmt simulate: lg2/ex2 ptx, step_inv
* Minor fix
* Minor fix
* Add non-logfmt bench
* Fix value=0 for logfmt
* Optimize performance
* Refactor tests
Co-authored-by: Zhean Xu <xza@deepseek.com>
-
- 09 Jul, 2025 1 commit
-
-
liuhe authored
-
- 08 Jul, 2025 1 commit
-
-
Chenggang Zhao authored
-
- 07 Jul, 2025 1 commit
-
-
Zhean Xu authored
Co-authored-by: Zhean Xu <xza@deepseek.com>
-
- 04 Jul, 2025 1 commit
-
-
Shangyan Zhou authored
* Add TMA buffer allocation
* Use TMA for forwarders and NVL receivers
* Use lane 31 to operate TMA.
* Change rdma buffer layout.
* Use TMA to transfer scales also.
* Increase the NVL recv buffer size.
* Disable early stopping.
* Apply similar optimizations on receiver warps.
* Prevent warp divergence.
* Disable aggressive ptx by default.
* Revert using TMA to transfer scales.
* Format.
* Change the layout of dispatch NVL buffer.
* Move topk transformation to recv warps.
* Use TMA to transfer all data in forward warps
* Use TMA to store scales.
* Code lint
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
-
- 02 Jul, 2025 5 commits
-
-
Chenggang Zhao authored
-
Chenggang Zhao authored
-
Chenggang Zhao authored
-
ruizhang1230 authored
* support hidden size 8192
* refactor code
* fix assert
-
Zhiyi Hu authored
Co-authored-by: zhiyi Hu <zhiyihu@U-NYQQMGK0-2250.local>
-
- 27 Jun, 2025 6 commits
-
-
alpha-baby authored
-
alpha-baby authored
-
Chenggang Zhao authored
-
Shangyan Zhou authored
* Remove memory fence in NVLink barrier.
* Move `__syncthreads` and fence into barrier.
* Fix bugs
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
-
Shangyan Zhou authored
-
Chenggang Zhao authored
-
- 26 Jun, 2025 1 commit
-
-
Shangyan Zhou authored
-
- 25 Jun, 2025 2 commits
-
-
Shangyan Zhou authored
* Support bias.
* Fix.
* Fix style.
-
Shangyan Zhou authored
* Add `get_comm_stream`.
* Fix style.
-
- 24 Jun, 2025 3 commits
-
-
Chenggang Zhao authored
-
Chenggang Zhao authored
* Add draft
* Add fast-debugging flags
* Fix several bugs
* Add sender timeout checks
* Fix stuck
* Fix bugs
* Fix bugs
-
Shangyan Zhou authored
* Increase the test round.
* Add warp synchronization.
* Shuffle the send warps.
* Add time elapsed into bench result.
-
- 23 Jun, 2025 2 commits
- 20 Jun, 2025 1 commit
-
-
Chenggang Zhao authored
-
- 18 Jun, 2025 1 commit
-
-
Shangyan Zhou authored
* Fix the tail loading issue.
* Modify the sync offset.
-
- 16 Jun, 2025 4 commits
-
-
Shangyan Zhou authored
* Fix warp synchronization.
* Another fix.
-
Chenggang Zhao authored
-
Chenggang Zhao authored
* Add automatic warp count control for low-latency dispatch
* Add automatic warp count control for low-latency combine
* More assertions
-
fzyzcjy authored
-
- 13 Jun, 2025 2 commits
-
-
Shangyan Zhou authored
-
Zhicheng Wu authored
Let the sender SM use `channel_id`, and the receiver SM use `channel_id + num_channels`
-
- 12 Jun, 2025 1 commit
-
-
Shifang Xu authored
-