- 07 Aug, 2025 1 commit
Zhean Xu authored
* independent logfmt_simulate function
* draft: logfmt low latency combine
* Minor bug fixes
* Fix non-logfmt bugs
* Fix logfmt bugs
* Fix logfmt bugs
* Minor fix
* Minor fix
* Clean code
* Clean code
* Use fewer regs
* Use two warp groups
* Correct shared memory size
* Minor fix
* Minor fix
* More rigorous tests
* Clean code
* Use more SMs
* Use different unroll factor for send & recv
* Update csrc/kernels/internode_ll.cu
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update csrc/kernels/internode_ll.cu
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Some renaming
* Some comments of tests
* Format `logfmt_encode`
* More lints
* Some refactors on sends
* Fix testing
* Fix bugs
* Renaming
* Use the full warp
* Unify combine recv
* Lint
* Lint
* Support 2560
* Fix meta buffer dtype
* Better encode calls
* Better amin/max writes
* Extra sync
* Read `topk_idx` by once
* Better specialization
* Read weights by once
* Rename
* Bug fixed
* Some renaming
* Fix local memory usage for sending
* Fix local memory usage for receiving
* Less writes
* Optimize performance
* Optimize performance
* Better performance
* Optimize performance
* Fix rounding
* Manually unroll
* Fix bench
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
- 31 Jul, 2025 1 commit
Chenggang Zhao authored
- 14 Jul, 2025 2 commits
Chenggang Zhao authored
Zhean Xu authored
* feat: low latency combine inplace TMA
* optimize tma pointer with PatternVisitor
* Minor cleanup
* Add `elect_one_sync`
---------
Co-authored-by: Zhean Xu <xza@deepseek.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
- 10 Jul, 2025 1 commit
Chenggang Zhao authored
* Add LogFMT interface
* Update comments
* Add simulated code
* Fix comments
* Change to 128 channels
* Add notes
* Optimize performance
* optimize simulate logfmt 10bit
* Minor fix
* Stronger low latency tests
* Minor fix
* Stronger low latency tests for logfmt
* Optimize logfmt simulate: lg2/ex2 ptx, step_inv
* Minor fix
* Minor fix
* Add non-logfmt bench
* Fix value=0 for logfmt
* Optimize performance
* Refactor tests
---------
Co-authored-by: Zhean Xu <xza@deepseek.com>
- 08 Jul, 2025 1 commit
Chenggang Zhao authored
- 27 Jun, 2025 2 commits
Chenggang Zhao authored
Shangyan Zhou authored
* Remove memory fence in NVLink barrier.
* Move `__syncthread` and fence into barrier.
* Fix bugs
---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
- 24 Jun, 2025 1 commit
Chenggang Zhao authored
* Add draft
* Add fast-debugging flags
* Fix several bugs
* Add sender timeout checks
* Fix stuck
* Fix bugs
* Fix bugs
- 16 Jun, 2025 1 commit
Chenggang Zhao authored
* Add automatic warp count control for low-latency dispatch
* Add automatic warp count control for low-latency combine
* More assertions
- 12 Jun, 2025 1 commit
Shifang Xu authored
- 11 Jun, 2025 1 commit
Chenggang Zhao authored
* Update README
* Update `setup.py`
* Fix headers
* Add `DISABLE_NVSHMEM` for APIs
* Fix launch
* Fix TMA settings
* Fix TMA usages
* Fix dlink
* Separate layout kernels
* Update version
* Add `is_sm90_compiled`
* Fix tests
* Add NVLink connection checks
* Update README
* Fix tests
* Add some comments
* Minor fix
* Minor fix
* Fix bugs
- 10 Jun, 2025 1 commit
Chenggang Zhao authored
* Fully remove FIFO slots
* Fully remove FIFO buffers
* Minor fix styles
* Fix some typos
* Bugs fixed
* Cleanup `ibgda_poll_cq`
- 09 Jun, 2025 1 commit
Chenggang Zhao authored
- 06 Jun, 2025 1 commit
Chenggang Zhao authored
* Update CMake files
* Use TMA instead of LD/ST for intranode dispatch
* Use TMA instead of LD/ST for intranode combine
* Adjust configs
* Test default configs as well
* More warps for combine
* Add inter-thread fence
* Enable more warps
* Do not use TMA for senders
* Update configs
* Remove useless wait
- 27 Feb, 2025 1 commit
Chenggang Zhao authored
- 25 Feb, 2025 1 commit
Chenggang Zhao authored