1. 04 Jul, 2025 1 commit
    • Use TMA to optimize internode dispatch. (#276) · a2fa3b73
      Shangyan Zhou authored
      
      
      * Add TMA buffer allocation
      
      * Use TMA for forwarders and NVL receivers
      
      * Use lane 31 to operate TMA.
      
      * Change RDMA buffer layout.
      
      * Use TMA to transfer scales also.
      
      * Increase the NVL recv buffer size.
      
      * Disable early stopping.
      
      * Apply similar optimizations on receiver warps.
      
      * Prevent warp divergence.
      
      * Disable aggressive PTX by default.
      
      * Revert using TMA to transfer scales.
      
      * Format.
      
      * Change the layout of dispatch NVL buffer.
      
      * Move topk transformation to recv warps.
      
      * Use TMA to transfer all data in forward warps
      
      * Use TMA to store scales.
      
      * Code lint
      
      ---------
      Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
  2. 02 Jul, 2025 14 commits
  3. 30 Jun, 2025 1 commit
  4. 27 Jun, 2025 7 commits
  5. 26 Jun, 2025 1 commit
  6. 25 Jun, 2025 2 commits
  7. 24 Jun, 2025 3 commits
  8. 23 Jun, 2025 2 commits
  9. 20 Jun, 2025 1 commit
  10. 18 Jun, 2025 6 commits
  11. 16 Jun, 2025 2 commits