1. 03 Nov, 2025 1 commit
  2. 30 Oct, 2025 1 commit
  3. 21 Oct, 2025 2 commits
  4. 20 Oct, 2025 2 commits
  5. 17 Oct, 2025 1 commit
  6. 24 Sep, 2025 1 commit
  7. 22 Sep, 2025 1 commit
  8. 17 Sep, 2025 1 commit
  9. 16 Sep, 2025 1 commit
    • Canonicalize TMA usages (#410) · 2012e310
      Chenggang Zhao authored
      * Remove redundant TMA flushes
      
      * Less barrier initialization overhead
      
      * Simplify `elect_one_sync`
      
      * Use `elect_one_sync` instead of lanes
      
      * Minor fix
      
      * Polish testing prints
      
      * Refactor for internode kernels
      
      * Better performance
  10. 15 Sep, 2025 1 commit
  11. 11 Sep, 2025 1 commit
  12. 10 Sep, 2025 2 commits
  13. 01 Sep, 2025 3 commits
  14. 28 Aug, 2025 1 commit
  15. 26 Aug, 2025 1 commit
  16. 25 Aug, 2025 3 commits
  17. 11 Aug, 2025 1 commit
  18. 07 Aug, 2025 3 commits
    • Fix indent · d31c72a1
      Chenggang Zhao authored
    • Fix compilation · dd14b36d
      Chenggang Zhao authored
    • Support 10-bit LogFMT Combine (#345) · c5facf5c
      Zhean Xu authored
      
      
      * independent logfmt_simulate function
      
      * draft: logfmt low latency combine
      
      * Minor bug fixes
      
      * Fix non-logfmt bugs
      
      * Fix logfmt bugs
      
      * Fix logfmt bugs
      
      * Minor fix
      
      * Minor fix
      
      * Clean code
      
      * Clean code
      
      * Use fewer regs
      
      * Use two warp groups
      
      * Correct shared memory size
      
      * Minor fix
      
      * Minor fix
      
      * More rigorous tests
      
      * Clean code
      
      * Use more SMs
      
      * Use different unroll factor for send & recv
      
      * Update csrc/kernels/internode_ll.cu
      Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
      
      * Update csrc/kernels/internode_ll.cu
      Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
      
      * Some renaming
      
      * Some comments of tests
      
      * Format `logfmt_encode`
      
      * More lints
      
      * Some refactors on sends
      
      * Fix testing
      
      * Fix bugs
      
      * Renaming
      
      * Use the full warp
      
      * Unify combine recv
      
      * Lint
      
      * Lint
      
      * Support 2560
      
      * Fix meta buffer dtype
      
      * Better encode calls
      
      * Better amin/max writes
      
      * Extra sync
      
      * Read `topk_idx` only once
      
      * Better specialization
      
      * Read weights only once
      
      * Rename
      
      * Bug fixed
      
      * Some renaming
      
      * Fix local memory usage for sending
      
      * Fix local memory usage for receiving
      
      * Fewer writes
      
      * Optimize performance
      
      * Optimize performance
      
      * Better performance
      
      * Optimize performance
      
      * Fix rounding
      
      * Manually unroll
      
      * Fix bench
      
      ---------
      Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
      Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
  19. 31 Jul, 2025 2 commits
  20. 30 Jul, 2025 1 commit
  21. 29 Jul, 2025 1 commit
  22. 21 Jul, 2025 2 commits
  23. 17 Jul, 2025 1 commit
    • Fix for data error and kernel hang because of inflight rdma channel head update · b65b22ed
      Guangguan authored
      
      
      During dispatch/combine, neither the sender nor the receiver waits
      for the RDMA channel head update to finish, so head-update WQEs may
      still be in flight after the kernel has finished. If those in-flight
      WQEs arrive after the RDMA channel head buffer has been cleaned for
      the next round of dispatch/combine, the head buffer is rewritten to
      a non-zero value. Because of the wrong RDMA channel head, the RDMA
      sender can then reuse the data buffer before the RDMA receivers have
      consumed it, causing data errors and kernel hangs.
      For performance, to overlap the in-flight WQEs' RTT, fix this issue
      by waiting for all previous in-flight WQEs to complete before
      cleaning the RDMA buffers in the next round of dispatch/combine.
      Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
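The race described in the b65b22ed commit message can be illustrated with a small host-side toy model. This is only a sketch under stated assumptions, not the kernel code: the class `HeadChannel` and the functions `start_next_round` and `run` are hypothetical names, and modeling a head update as a delayed increment is an assumption made for illustration.

```python
from collections import deque


class HeadChannel:
    """Toy model of a remote head counter updated by delayed writes."""

    def __init__(self):
        self.head = 0             # value the sender reads
        self.in_flight = deque()  # posted but not yet delivered increments

    def post_update(self, delta):
        # The receiver posts an update; it is not visible yet.
        self.in_flight.append(delta)

    def deliver_one(self):
        # One in-flight update lands in the head buffer.
        self.head += self.in_flight.popleft()

    def quiet(self):
        # Wait until every previously posted update has landed.
        while self.in_flight:
            self.deliver_one()

    def clean(self):
        # Reset the head buffer for the next round.
        self.head = 0


def start_next_round(channel, wait_for_in_flight):
    """Clean the head buffer, optionally draining in-flight updates first.

    Returns the head value the sender would observe in the new round
    once any late updates have arrived.
    """
    if wait_for_in_flight:
        channel.quiet()
    channel.clean()
    # Anything still in flight now arrives after the clean.
    while channel.in_flight:
        channel.deliver_one()
    return channel.head


def run(wait_for_in_flight):
    channel = HeadChannel()
    channel.post_update(4)  # previous round: an update is still in flight
    return start_next_round(channel, wait_for_in_flight)


buggy_head = run(wait_for_in_flight=False)
fixed_head = run(wait_for_in_flight=True)
print(buggy_head, fixed_head)  # 4 0
```

Without the wait, the stale update lands after the clean and the new round starts with a non-zero head, which is the condition the commit message says lets the sender reuse the data buffer too early. Draining before the clean leaves the head at zero.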
  24. 16 Jul, 2025 1 commit
  25. 15 Jul, 2025 1 commit
  26. 14 Jul, 2025 4 commits