1. 25 Aug, 2025 3 commits
  2. 11 Aug, 2025 1 commit
  3. 07 Aug, 2025 3 commits
    • Fix indent · d31c72a1
      Chenggang Zhao authored
    • Fix compilation · dd14b36d
      Chenggang Zhao authored
    • Support 10-bit LogFMT Combine (#345) · c5facf5c
      Zhean Xu authored
      
      
      * independent logfmt_simulate function
      
      * draft: logfmt low latency combine
      
      * Minor bug fixes
      
      * Fix non-logfmt bugs
      
      * Fix logfmt bugs
      
      * Fix logfmt bugs
      
      * Minor fix
      
      * Minor fix
      
      * Clean code
      
      * Clean code
      
      * Use fewer regs
      
      * Use two warp groups
      
      * Correct shared memory size
      
      * Minor fix
      
      * Minor fix
      
      * More rigorous tests
      
      * Clean code
      
      * Use more SMs
      
      * Use different unroll factor for send & recv
      
      * Update csrc/kernels/internode_ll.cu
      Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
      
      * Update csrc/kernels/internode_ll.cu
      Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
      
      * Some renaming
      
      * Some comments of tests
      
      * Format `logfmt_encode`
      
      * More lints
      
      * Some refactors on sends
      
      * Fix testing
      
      * Fix bugs
      
      * Renaming
      
      * Use the full warp
      
      * Unify combine recv
      
      * Lint
      
      * Lint
      
      * Support 2560
      
      * Fix meta buffer dtype
      
      * Better encode calls
      
      * Better amin/max writes
      
      * Extra sync
      
      * Read `topk_idx` only once
      
      * Better specialization
      
      * Read weights only once
      
      * Rename
      
      * Bug fixed
      
      * Some renaming
      
      * Fix local memory usage for sending
      
      * Fix local memory usage for receiving
      
      * Fewer writes
      
      * Optimize performance
      
      * Optimize performance
      
      * Better performance
      
      * Optimize performance
      
      * Fix rounding
      
      * Manually unroll
      
      * Fix bench
      
      ---------
      Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
      Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
  4. 31 Jul, 2025 2 commits
  5. 30 Jul, 2025 1 commit
  6. 29 Jul, 2025 2 commits
  7. 21 Jul, 2025 2 commits
  8. 17 Jul, 2025 1 commit
    • Fix data errors and kernel hangs caused by in-flight RDMA channel head updates · b65b22ed
      Guangguan authored
      
      
      During dispatch/combine, neither the sender nor the receiver waits
      for RDMA channel head updates to complete, so head-update WQEs may
      still be in flight after the kernel has finished. If those in-flight
      WQEs arrive after the RDMA channel head buffer has been cleaned for
      the next round of dispatch/combine, they overwrite the head buffer
      with a non-zero value. Acting on the wrong RDMA channel head, the
      RDMA sender can then reuse the data buffer before the RDMA receivers
      have consumed it, causing data errors and kernel hangs.
      For performance, so that the in-flight WQEs' RTT can be overlapped,
      the fix waits for all previous in-flight WQEs to complete before
      cleaning the RDMA buffers in the next round of dispatch/combine.
      Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
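The ordering problem described in the b65b22ed message can be illustrated with a small host-side model. This is only a sketch, not the kernel code: `HeadBuffer`, `post_update`, `deliver_all`, `clean`, and `run_round_boundary` are hypothetical names invented for the illustration, and a plain Python queue stands in for head-update WQEs that have been posted but not yet delivered.

```python
from collections import deque


class HeadBuffer:
    """Hypothetical model of one RDMA channel head slot (not the real layout)."""

    def __init__(self):
        self.value = 0
        # Head updates that were posted but have not landed in memory yet.
        self.inflight = deque()

    def post_update(self, new_head):
        # Models posting a head-update WQE without waiting for completion.
        self.inflight.append(new_head)

    def deliver_all(self):
        # Models every outstanding head update finally arriving.
        while self.inflight:
            self.value = self.inflight.popleft()

    def clean(self, wait_for_inflight):
        # Models zeroing the head buffer at the start of the next round.
        if wait_for_inflight:
            self.deliver_all()  # the fix: drain in-flight updates first
        self.value = 0


def run_round_boundary(wait_for_inflight):
    buf = HeadBuffer()
    buf.post_update(8)            # round 1 posts a head update, then the kernel exits
    buf.clean(wait_for_inflight)  # round 2 begins by cleaning the head buffer
    buf.deliver_all()             # a late update (if any) lands after the clean
    return buf.value              # head value the round-2 sender would observe


stale_head = run_round_boundary(wait_for_inflight=False)
fixed_head = run_round_boundary(wait_for_inflight=True)
print(stale_head, fixed_head)
```

In this model, skipping the wait leaves the stale round-1 head in the buffer after cleaning, while draining the queue before the clean leaves it at zero. The wait sits at the start of the next round, which is where the commit message places it.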
  9. 16 Jul, 2025 1 commit
  10. 15 Jul, 2025 1 commit
  11. 14 Jul, 2025 4 commits
  12. 12 Jul, 2025 2 commits
  13. 11 Jul, 2025 2 commits
  14. 10 Jul, 2025 2 commits
    • Use TMA to optimize internode combine. (#287) · 06f417dc
      Shangyan Zhou authored
      
      
      * Let forwarders use a dedicated SM
      
      * Shuffle rdma idx
      
      * Senders use TMA.
      
      * Adjust the tuning chunk size.
      
      * Modify NVL chunk layout.
      
      * Update some combine config.
      
      * Small lint
      
      * Minor fix
      
      * Overlap TMA store
      
      ---------
      Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
    • Support 10-bit LogFMT (simulated version) (#284) · 1cf85fb2
      Chenggang Zhao authored
      
      
      * Add LogFMT interface
      
      * Update comments
      
      * Add simulated code
      
      * Fix comments
      
      * Change to 128 channels
      
      * Add notes
      
      * Optimize performance
      
      * optimize simulate logfmt 10bit
      
      * Minor fix
      
      * Stronger low latency tests
      
      * Minor fix
      
      * Stronger low latency tests for logfmt
      
      * Optimize logfmt simulate: lg2/ex2 ptx, step_inv
      
      * Minor fix
      
      * Minor fix
      
      * Add non-logfmt bench
      
      * Fix value=0 for logfmt
      
      * Optimize performance
      
      * Refactor tests
      
      ---------
      Co-authored-by: Zhean Xu <xza@deepseek.com>
  15. 09 Jul, 2025 1 commit
  16. 08 Jul, 2025 1 commit
  17. 07 Jul, 2025 1 commit
  18. 04 Jul, 2025 1 commit
    • Use TMA to optimize internode dispatch. (#276) · a2fa3b73
      Shangyan Zhou authored
      
      
      * Add TMA buffer allocation
      
      * Use TMA for forwarders and NVL receivers
      
      * Use lane 31 to operate TMA.
      
      * Change rdma buffer layout.
      
      * Use TMA to transfer scales also.
      
      * Increase the NVL recv buffer size.
      
      * Disable early stopping.
      
      * Apply similar optimizations on receiver warps.
      
      * Prevent warp divergence.
      
      * Disable aggressive ptx by default.
      
      * Revert using TMA to transfer scales.
      
      * Format.
      
      * Change the layout of dispatch NVL buffer.
      
      * Move topk transformation to recv warps.
      
      * Use TMA to transfer all data in forward warps
      
      * Use TMA to store scales.
      
      * Code lint
      
      ---------
      Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
  19. 02 Jul, 2025 5 commits
  20. 27 Jun, 2025 4 commits