- 04 Feb, 2026 1 commit
-
-
lijian6 authored
Signed-off-by: lijian <lijian6@sugon.com>
-
- 25 Nov, 2025 1 commit
-
-
lishen authored
-
- 17 Sep, 2025 1 commit
-
-
Shangyan Zhou authored
* Fix hidden_size % 128 != 0
* Add `align_down()` function
* Use the full warp to wait TMA store
* Support arbitrary hidden sizes in fp8 cast
* lint
-
- 09 Sep, 2025 1 commit
-
-
Chenggang Zhao authored
-
- 14 Aug, 2025 1 commit
-
-
Yizhi Wang authored
-
- 07 Aug, 2025 1 commit
-
-
Zhean Xu authored
* independent logfmt_simulate function
* draft: logfmt low latency combine
* Minor bug fixes
* Fix non-logfmt bugs
* Fix logfmt bugs
* Fix logfmt bugs
* Minor fix
* Minor fix
* Clean code
* Clean code
* Use fewer regs
* Use two warp groups
* Correct shared memory size
* Minor fix
* Minor fix
* More rigorous tests
* Clean code
* Use more SMs
* Use different unroll factor for send & recv
* Update csrc/kernels/internode_ll.cu
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update csrc/kernels/internode_ll.cu
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Some renaming
* Some comments of tests
* Format `logfmt_encode`
* More lints
* Some refactors on sends
* Fix testing
* Fix bugs
* Renaming
* Use the full warp
* Unify combine recv
* Lint
* Lint
* Support 2560
* Fix meta buffer dtype
* Better encode calls
* Better amin/max writes
* Extra sync
* Read `topk_idx` by once
* Better specialization
* Read weights by once
* Rename
* Bug fixed
* Some renaming
* Fix local memory usage for sending
* Fix local memory usage for receiving
* Less writes
* Optimize performance
* Optimize performance
* Better performance
* Optimize performance
* Fix rounding
* Manually unroll
* Fix bench
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
-
- 02 Jul, 2025 5 commits
-
-
Chenggang Zhao authored
-
Chenggang Zhao authored
-
fzyzcjy authored
* more
* more
* more
* more
* more
* more
-
Chenggang Zhao authored
-
fzyzcjy authored
-
- 24 Jun, 2025 1 commit
-
-
Shangyan Zhou authored
* Increase the test round.
* Add warp synchronization.
* Shuffle the send warps.
* Add time elapsed into bench result.
-
- 18 Jun, 2025 4 commits
-
-
Chenggang Zhao authored
-
Chenggang Zhao authored
-
Shangyan Zhou authored
-
Shangyan Zhou authored
-
- 12 Jun, 2025 1 commit
-
-
Shifang Xu authored
-
- 25 Feb, 2025 1 commit
-
-
Chenggang Zhao authored
-