1. 24 Sep, 2025 1 commit
  2. 17 Sep, 2025 1 commit
  3. 16 Sep, 2025 1 commit
    • Canonicalize TMA usages (#410) · 2012e310
      Chenggang Zhao authored
      * Remove redundant TMA flushes
      
      * Less barrier initialization overhead
      
      * Simplify `elect_one_sync`
      
      * Use `elect_one_sync` instead of lanes
      
      * Minor fix
      
      * Polish testing prints
      
      * Refactor for internode kernels
      
      * Better performance
  4. 11 Sep, 2025 1 commit
  5. 10 Sep, 2025 2 commits
  6. 26 Aug, 2025 1 commit
  7. 25 Aug, 2025 1 commit
  8. 07 Aug, 2025 1 commit
  9. 21 Jul, 2025 2 commits
  10. 17 Jul, 2025 1 commit
    • Fix data errors and kernel hangs caused by in-flight RDMA channel head updates · b65b22ed
      Guangguan authored

      During dispatch/combine, neither the sender nor the receiver waits
      for RDMA channel head updates to finish, so head-update WQEs may
      still be in flight after the kernel has finished. If these in-flight
      WQEs arrive after the RDMA channel head buffer has been cleaned for
      the next round of dispatch/combine, the head buffer is overwritten
      with a non-zero value. With a wrong RDMA channel head, the RDMA
      sender can reuse the data buffer before the RDMA receivers have
      consumed it, causing data errors and kernel hangs.
      For performance, the wait is deferred so that it overlaps with the
      in-flight WQEs' RTT: the fix waits for all previous in-flight WQEs
      to complete before cleaning the RDMA buffers in the next round of
      dispatch/combine.
      Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
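The ordering problem described in commit b65b22ed can be illustrated with a toy model. This is only a sketch under stated assumptions, not the kernel code: `HeadBuffer`, `post_head_update`, `drain`, and `run_round_boundary` are hypothetical names, a head update is modeled as an additive increment, and "waiting for in-flight WQEs" is modeled as draining a queue.

```python
from collections import deque


class HeadBuffer:
    """Toy model of one RDMA channel head slot plus its in-flight updates."""

    def __init__(self):
        self.head = 0            # value the sender reads to decide on buffer reuse
        self.inflight = deque()  # head updates posted but not yet delivered

    def post_head_update(self, delta):
        # Posting is asynchronous: the write lands only when deliver_one() runs.
        self.inflight.append(delta)

    def deliver_one(self):
        if self.inflight:
            self.head += self.inflight.popleft()

    def drain(self):
        # Stand-in for "wait for all previous in-flight WQEs to complete".
        while self.inflight:
            self.deliver_one()

    def clean(self):
        # Stand-in for zeroing the head buffer before the next round.
        self.head = 0


def run_round_boundary(wait_before_clean):
    """Return the head value the next round starts with."""
    buf = HeadBuffer()
    buf.post_head_update(4)  # posted late in round N, still in flight at kernel exit
    if wait_before_clean:
        buf.drain()          # the fix: complete old updates first
    buf.clean()              # start of round N+1
    buf.drain()              # any leftover update arrives after cleaning
    return buf.head


print(run_round_boundary(False))  # 4: stale update overwrote the cleaned head
print(run_round_boundary(True))   # 0: head is clean when the next round starts
```

Without the wait, the next round starts from a non-zero head, which is the condition the commit message identifies as letting the sender reuse the data buffer too early.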
  11. 16 Jul, 2025 1 commit
  12. 14 Jul, 2025 2 commits
  13. 10 Jul, 2025 1 commit
  14. 04 Jul, 2025 1 commit
    • Use TMA to optimize internode dispatch. (#276) · a2fa3b73
      Shangyan Zhou authored

      * Add TMA buffer allocation
      
      * Use TMA for forwarders and NVL receivers
      
      * Use lane 31 to operate TMA.
      
      * Change rdma buffer layout.
      
      * Use TMA to transfer scales also.
      
      * Increase the NVL recv buffer size.
      
      * Disable early stopping.
      
      * Apply similar optimizations on receiver warps.
      
      * Prevent warp divergence.
      
      * Disable aggressive ptx by default.
      
      * Revert using TMA to transfer scales.
      
      * Format.
      
      * Change the layout of dispatch NVL buffer.
      
      * Move topk transformation to recv warps.
      
      * Use TMA to transfer all data in forward warps
      
      * Use TMA to store scales.
      
      * Code lint
      
      ---------
      Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
  15. 02 Jul, 2025 1 commit
  16. 27 Jun, 2025 5 commits
  17. 26 Jun, 2025 1 commit
  18. 25 Jun, 2025 1 commit
  19. 24 Jun, 2025 2 commits
  20. 18 Jun, 2025 1 commit
  21. 16 Jun, 2025 1 commit
  22. 13 Jun, 2025 2 commits
  23. 12 Jun, 2025 1 commit
  24. 11 Jun, 2025 1 commit
    • Support Ampere architecture (#204) · b8d90fb7
      Chenggang Zhao authored
      * Update README
      
      * Update `setup.py`
      
      * Fix headers
      
      * Add `DISABLE_NVSHMEM` for APIs
      
      * Fix launch
      
      * Fix TMA settings
      
      * Fix TMA usages
      
      * Fix dlink
      
      * Separate layout kernels
      
      * Update version
      
      * Add `is_sm90_compiled`
      
      * Fix tests
      
      * Add NVLink connection checks
      
      * Update README
      
      * Fix tests
      
      * Add some comments
      
      * Minor fix
      
      * Minor fix
      
      * Fix bugs
  25. 10 Jun, 2025 1 commit
  26. 03 Jun, 2025 1 commit
  27. 28 May, 2025 1 commit
  28. 10 May, 2025 1 commit
  29. 22 Apr, 2025 3 commits