1. 30 Oct, 2025 1 commit
  2. 21 Oct, 2025 1 commit
  3. 17 Oct, 2025 1 commit
  4. 24 Sep, 2025 1 commit
  5. 17 Sep, 2025 1 commit
  6. 07 Aug, 2025 1 commit
    • Support 10-bit LogFMT Combine (#345) · c5facf5c
      Zhean Xu authored
      
      * independent logfmt_simulate function
      
      * draft: logfmt low latency combine
      
      * Minor bug fixes
      
      * Fix non-logfmt bugs
      
      * Fix logfmt bugs
      
      * Fix logfmt bugs
      
      * Minor fix
      
      * Minor fix
      
      * Clean code
      
      * Clean code
      
      * Use fewer regs
      
      * Use two warp groups
      
      * Correct shared memory size
      
      * Minor fix
      
      * Minor fix
      
      * More rigorous tests
      
      * Clean code
      
      * Use more SMs
      
      * Use different unroll factor for send & recv
      
      * Update csrc/kernels/internode_ll.cu
      Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
      
      * Update csrc/kernels/internode_ll.cu
      Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
      
      * Some renaming
      
      * Some comments of tests
      
      * Format `logfmt_encode`
      
      * More lints
      
      * Some refactors on sends
      
      * Fix testing
      
      * Fix bugs
      
      * Renaming
      
      * Use the full warp
      
      * Unify combine recv
      
      * Lint
      
      * Lint
      
      * Support 2560
      
      * Fix meta buffer dtype
      
      * Better encode calls
      
      * Better amin/max writes
      
      * Extra sync
      
      * Read `topk_idx` by once
      
      * Better specialization
      
      * Read weights by once
      
      * Rename
      
      * Bug fixed
      
      * Some renaming
      
      * Fix local memory usage for sending
      
      * Fix local memory usage for receiving
      
      * Less writes
      
      * Optimize performance
      
      * Optimize performance
      
      * Better performance
      
      * Optimize performance
      
      * Fix rounding
      
      * Manually unroll
      
      * Fix bench
      
      ---------
      Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
      Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
  7. 29 Jul, 2025 1 commit
  8. 16 Jun, 2025 1 commit
  9. 11 Jun, 2025 1 commit
    • Support Ampere architecture (#204) · b8d90fb7
      Chenggang Zhao authored
      * Update README
      
      * Update `setup.py`
      
      * Fix headers
      
      * Add `DISABLE_NVSHMEM` for APIs
      
      * Fix launch
      
      * Fix TMA settings
      
      * Fix TMA usages
      
      * Fix dlink
      
      * Separate layout kernels
      
      * Update version
      
      * Add `is_sm90_compiled`
      
      * Fix tests
      
      * Add NVLink connection checks
      
      * Update README
      
      * Fix tests
      
      * Add some comments
      
      * Minor fix
      
      * Minor fix
      
      * Fix bugs
  10. 07 Apr, 2025 1 commit
  11. 18 Mar, 2025 2 commits
  12. 10 Mar, 2025 1 commit
  13. 06 Mar, 2025 1 commit
  14. 03 Mar, 2025 1 commit
  15. 25 Feb, 2025 1 commit