1. 07 Aug, 2025 1 commit
    • Support 10-bit LogFMT Combine (#345) · c5facf5c
      Zhean Xu authored
      
      
      * independent logfmt_simulate function
      
      * draft: logfmt low latency combine
      
      * Minor bug fixes
      
      * Fix non-logfmt bugs
      
      * Fix logfmt bugs
      
      * Fix logfmt bugs
      
      * Minor fix
      
      * Minor fix
      
      * Clean code
      
      * Clean code
      
      * Use fewer regs
      
      * Use two warp groups
      
      * Correct shared memory size
      
      * Minor fix
      
      * Minor fix
      
      * More rigorous tests
      
      * Clean code
      
      * Use more SMs
      
      * Use different unroll factor for send & recv
      
      * Update csrc/kernels/internode_ll.cu
      Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
      
      * Update csrc/kernels/internode_ll.cu
      Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
      
      * Some renaming
      
      * Some comments of tests
      
      * Format `logfmt_encode`
      
      * More lints
      
      * Some refactors on sends
      
      * Fix testing
      
      * Fix bugs
      
      * Renaming
      
      * Use the full warp
      
      * Unify combine recv
      
      * Lint
      
      * Lint
      
      * Support 2560
      
      * Fix meta buffer dtype
      
      * Better encode calls
      
      * Better amin/max writes
      
      * Extra sync
      
      * Read `topk_idx` only once
      
      * Better specialization
      
      * Read weights only once
      
      * Rename
      
      * Bug fixed
      
      * Some renaming
      
      * Fix local memory usage for sending
      
      * Fix local memory usage for receiving
      
      * Fewer writes
      
      * Optimize performance
      
      * Optimize performance
      
      * Better performance
      
      * Optimize performance
      
      * Fix rounding
      
      * Manually unroll
      
      * Fix bench
      
      ---------
      Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
      Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
  2. 05 Aug, 2025 1 commit
  3. 01 Aug, 2025 1 commit
  4. 31 Jul, 2025 4 commits
  5. 30 Jul, 2025 1 commit
  6. 29 Jul, 2025 2 commits
  7. 22 Jul, 2025 1 commit
  8. 21 Jul, 2025 3 commits
  9. 18 Jul, 2025 1 commit
  10. 17 Jul, 2025 1 commit
    • Fix data error and kernel hang caused by in-flight RDMA channel head updates · b65b22ed
      Guangguan authored
      
      
      During dispatch/combine, neither the sender nor the receiver
      waits for the RDMA channel head update to finish, so head-update
      WQEs may still be in flight after the kernel has finished. If
      those in-flight WQEs arrive after the RDMA channel head buffer
      has been cleaned for the next round of dispatch/combine, the
      head buffer is overwritten with a non-zero value. Because of the
      wrong RDMA channel head, the RDMA sender can then reuse the data
      buffer before the RDMA receivers have consumed it, causing data
      errors and a kernel hang.
      For performance, so that the in-flight WQEs' RTT can be
      overlapped, this fix waits for all previously in-flight WQEs to
      complete before cleaning the RDMA buffers in the next round of
      dispatch/combine.
      Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
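
The race described in this commit message can be illustrated with a small host-side model. This is only a sketch under stated assumptions, not the kernel code from the commit: `HeadChannel`, `post_head_update`, `quiet`, `clean`, and `next_round_starts_clean` are hypothetical names, and a simple queue stands in for in-flight WQEs on the network.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical model of one posted-but-not-yet-landed head update (a "WQE").
struct InflightWrite {
    int slot;
    uint64_t value;
};

// Hypothetical stand-in for the RDMA channel head buffer on the remote side.
struct HeadChannel {
    std::vector<uint64_t> head;          // head values, zero means "clean"
    std::deque<InflightWrite> inflight;  // writes still on the wire

    explicit HeadChannel(int slots) : head(slots, 0) {}

    // Post a head update; it only lands when the model delivers it.
    void post_head_update(int slot, uint64_t value) {
        assert(slot >= 0 && slot < static_cast<int>(head.size()));
        inflight.push_back({slot, value});
    }

    // Deliver all pending writes (models waiting for every WQE to complete).
    void quiet() {
        while (!inflight.empty()) {
            head[inflight.front().slot] = inflight.front().value;
            inflight.pop_front();
        }
    }

    // Zero the head buffer in preparation for the next round.
    void clean() { std::fill(head.begin(), head.end(), 0); }
};

// Returns true if the head buffer is still all zero after any late write
// from the previous round has landed.
bool next_round_starts_clean(bool wait_before_clean) {
    HeadChannel ch(4);
    ch.post_head_update(1, 7);  // round N: update still in flight at kernel exit
    if (wait_before_clean) {
        ch.quiet();             // the fix: drain old WQEs before cleaning
    }
    ch.clean();                 // round N+1: clean the head buffer
    ch.quiet();                 // a late write, if any, lands after the clean
    return std::all_of(ch.head.begin(), ch.head.end(),
                       [](uint64_t h) { return h == 0; });
}
```

In this model, `next_round_starts_clean(false)` reports a stale non-zero head, while `next_round_starts_clean(true)` reports a clean buffer. Placing the wait at the start of the next round rather than at the end of the previous kernel is what lets the round-trip time of the old WQEs overlap with other work.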
  11. 16 Jul, 2025 3 commits
  12. 15 Jul, 2025 8 commits
  13. 14 Jul, 2025 4 commits
  14. 12 Jul, 2025 3 commits
  15. 11 Jul, 2025 4 commits
  16. 10 Jul, 2025 2 commits
    • Use TMA to optimize internode combine. (#287) · 06f417dc
      Shangyan Zhou authored
      
      
      * Let forwarders use a dedicated SM
      
      * Shuffle rdma idx
      
      * Senders use TMA.
      
      * Adjust the tuning chunk size.
      
      * Modify NVL chunk layout.
      
      * Update some combine config.
      
      * Small lint
      
      * Minor fix
      
      * Overlap TMA store
      
      ---------
      Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
    • Support 10-bit LogFMT (simulated version) (#284) · 1cf85fb2
      Chenggang Zhao authored
      
      
      * Add LogFMT interface
      
      * Update comments
      
      * Add simulated code
      
      * Fix comments
      
      * Change to 128 channels
      
      * Add notes
      
      * Optimize performance
      
      * optimize simulate logfmt 10bit
      
      * Minor fix
      
      * Stronger low latency tests
      
      * Minor fix
      
      * Stronger low latency tests for logfmt
      
      * Optimize logfmt simulate: lg2/ex2 ptx, step_inv
      
      * Minor fix
      
      * Minor fix
      
      * Add non-logfmt bench
      
      * Fix value=0 for logfmt
      
      * Optimize performance
      
      * Refactor tests
      
      ---------
      Co-authored-by: Zhean Xu <xza@deepseek.com>