1. 26 Feb, 2026 1 commit
  2. 05 Feb, 2026 1 commit
  3. 29 Jan, 2026 1 commit
  4. 17 Oct, 2025 1 commit
  5. 24 Sep, 2025 1 commit
  6. 16 Sep, 2025 1 commit
    • Canonicalize TMA usages (#410) · 2012e310
      Chenggang Zhao authored
      * Remove redundant TMA flushes
      
      * Less barrier initialization overhead
      
      * Simplify `elect_one_sync`
      
      * Use `elect_one_sync` instead of lanes
      
      * Minor fix
      
      * Polish testing prints
      
      * Refactor for internode kernels
      
      * Better performance
  7. 11 Jul, 2025 1 commit
  8. 02 Jul, 2025 5 commits
  9. 24 Jun, 2025 1 commit
  10. 12 Jun, 2025 1 commit
  11. 11 Jun, 2025 3 commits
  12. 06 Jun, 2025 1 commit
    • Use TMA instead of LD/ST for intra-node normal kernels (#191) · c8dceba1
      Chenggang Zhao authored
      * Update CMake files
      
      * Use TMA instead of LD/ST for intranode dispatch
      
      * Use TMA instead of LD/ST for intranode combine
      
      * Adjust configs
      
      * Test default configs as well
      
      * More warps for combine
      
      * Add inter-thread fence
      
      * Enable more warps
      
      * Do not use TMA for senders
      
      * Update configs
      
      * Remove useless wait
  13. 11 Apr, 2025 1 commit
  14. 10 Apr, 2025 1 commit
  15. 25 Mar, 2025 1 commit
  16. 04 Mar, 2025 1 commit
  17. 03 Mar, 2025 1 commit
  18. 25 Feb, 2025 1 commit