    Support 10-bit LogFMT Combine (#345) · c5facf5c
    Zhean Xu authored
    
    
    * independent logfmt_simulate function
    
    * draft: logfmt low latency combine
    
    * Minor bug fixes
    
    * Fix non-logfmt bugs
    
    * Fix logfmt bugs
    
    * Fix logfmt bugs
    
    * Minor fix
    
    * Minor fix
    
    * Clean code
    
    * Clean code
    
    * Use fewer regs
    
    * Use two warp groups
    
    * Correct shared memory size
    
    * Minor fix
    
    * Minor fix
    
    * More rigorous tests
    
    * Clean code
    
    * Use more SMs
    
    * Use different unroll factor for send & recv
    
    * Update csrc/kernels/internode_ll.cu
    Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
    
    * Update csrc/kernels/internode_ll.cu
    Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
    
    * Some renaming
    
    * Some comments of tests
    
    * Format `logfmt_encode`
    
    * More lints
    
    * Some refactors on sends
    
    * Fix testing
    
    * Fix bugs
    
    * Renaming
    
    * Use the full warp
    
    * Unify combine recv
    
    * Lint
    
    * Lint
    
    * Support 2560
    
    * Fix meta buffer dtype
    
    * Better encode calls
    
    * Better amin/max writes
    
    * Extra sync
    
    * Read `topk_idx` only once
    
    * Better specialization
    
    * Read weights only once
    
    * Rename
    
    * Bug fixed
    
    * Some renaming
    
    * Fix local memory usage for sending
    
    * Fix local memory usage for receiving
    
    * Fewer writes
    
    * Optimize performance
    
    * Optimize performance
    
    * Better performance
    
    * Optimize performance
    
    * Fix rounding
    
    * Manually unroll
    
    * Fix bench
    
    ---------
    Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
    Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
test_low_latency.py 13.8 KB