1. 20 Oct, 2025 2 commits
  2. 17 Oct, 2025 1 commit
  3. 24 Sep, 2025 1 commit
  4. 22 Sep, 2025 1 commit
  5. 17 Sep, 2025 1 commit
  6. 16 Sep, 2025 1 commit
    • Canonicalize TMA usages (#410) · 2012e310
      Chenggang Zhao authored
      * Remove redundant TMA flushes
      
      * Less barrier initialization overhead
      
      * Simplify `elect_one_sync`
      
      * Use `elect_one_sync` instead of lanes
      
      * Minor fix
      
      * Polish testing prints
      
      * Refactor for internode kernels
      
      * Better performance
  7. 10 Sep, 2025 1 commit
  8. 09 Sep, 2025 1 commit
  9. 14 Aug, 2025 1 commit
  10. 08 Aug, 2025 2 commits
  11. 07 Aug, 2025 1 commit
    • Support 10-bit LogFMT Combine (#345) · c5facf5c
      Zhean Xu authored
      
      
      * independent logfmt_simulate function
      
      * draft: logfmt low latency combine
      
      * Minor bug fixes
      
      * Fix non-logfmt bugs
      
      * Fix logfmt bugs
      
      * Fix logfmt bugs
      
      * Minor fix
      
      * Minor fix
      
      * Clean code
      
      * Clean code
      
      * Use fewer regs
      
      * Use two warp groups
      
      * Correct shared memory size
      
      * Minor fix
      
      * Minor fix
      
      * More rigorous tests
      
      * Clean code
      
      * Use more SMs
      
      * Use different unroll factor for send & recv
      
      * Update csrc/kernels/internode_ll.cu
      Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
      
      * Update csrc/kernels/internode_ll.cu
      Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
      
      * Some renaming
      
      * Some comments of tests
      
      * Format `logfmt_encode`
      
      * More lints
      
      * Some refactors on sends
      
      * Fix testing
      
      * Fix bugs
      
      * Renaming
      
      * Use the full warp
      
      * Unify combine recv
      
      * Lint
      
      * Lint
      
      * Support 2560
      
      * Fix meta buffer dtype
      
      * Better encode calls
      
      * Better amin/max writes
      
      * Extra sync
      
      * Read `topk_idx` once
      
      * Better specialization
      
      * Read weights once
      
      * Rename
      
      * Bug fixed
      
      * Some renaming
      
      * Fix local memory usage for sending
      
      * Fix local memory usage for receiving
      
      * Less writes
      
      * Optimize performance
      
      * Optimize performance
      
      * Better performance
      
      * Optimize performance
      
      * Fix rounding
      
      * Manually unroll
      
      * Fix bench
      
      ---------
      Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
      Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
  12. 01 Aug, 2025 1 commit
  13. 31 Jul, 2025 1 commit
  14. 30 Jul, 2025 1 commit
  15. 22 Jul, 2025 1 commit
  16. 21 Jul, 2025 1 commit
  17. 14 Jul, 2025 1 commit
  18. 11 Jul, 2025 1 commit
  19. 10 Jul, 2025 2 commits
    • Use TMA to optimize internode combine. (#287) · 06f417dc
      Shangyan Zhou authored
      
      
      * Let forwarders use a dedicated SM
      
      * Shuffle rdma idx
      
      * Sender use TMA.
      
      * Adjust the tuning chunk size.
      
      * Modify NVL chunk layout.
      
      * Update some combine config.
      
      * Small lint
      
      * Minor fix
      
      * Overlap TMA store
      
      ---------
      Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
    • Support 10-bit LogFMT (simulated version) (#284) · 1cf85fb2
      Chenggang Zhao authored
      
      
      * Add LogFMT interface
      
      * Update comments
      
      * Add simulated code
      
      * Fix comments
      
      * Change to 128 channels
      
      * Add notes
      
      * Optimize performance
      
      * optimize simulate logfmt 10bit
      
      * Minor fix
      
      * Stronger low latency tests
      
      * Minor fix
      
      * Stronger low latency tests for logfmt
      
      * Optimize logfmt simulate: lg2/ex2 ptx, step_inv
      
      * Minor fix
      
      * Minor fix
      
      * Add non-logfmt bench
      
      * Fix value=0 for logfmt
      
      * Optimize performance
      
      * Refactor tests
      
      ---------
      Co-authored-by: Zhean Xu <xza@deepseek.com>
  20. 04 Jul, 2025 2 commits
    • Update some dispatch configs. · e6d61fc6
      Shangyan Zhou authored
    • Use TMA to optimize internode dispatch. (#276) · a2fa3b73
      Shangyan Zhou authored
      
      
      * Add TMA buffer allocation
      
      * Use TMA for forwarders and NVL receivers
      
      * Use lane 31 to operate TMA.
      
      * Change rdma buffer layout.
      
      * Use TMA to transfer scales also.
      
      * Increase the NVL recv buffer size.
      
      * Disable early stopping.
      
      * Apply similar optimizations on receiver warps.
      
      * Prevent warp divergence.
      
      * Disable aggressive ptx by default.
      
      * Revert using TMA to transfer scales.
      
      * Format.
      
      * Change the layout of dispatch NVL buffer.
      
      * Move topk transformation to recv warps.
      
      * Use TMA to transfer all data in forward warps
      
      * Use TMA to store scales.
      
      * Code lint
      
      ---------
      Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
  21. 02 Jul, 2025 9 commits
  22. 25 Jun, 2025 1 commit
  23. 24 Jun, 2025 2 commits
  24. 18 Jun, 2025 4 commits