• Use TMA to optimize internode dispatch. (#276) · a2fa3b73
    Shangyan Zhou authored
    
    
    * Add TMA buffer allocation
    
    * Use TMA for forwarders and NVL receivers
    
    * Use lane 31 to operate TMA.
    
    * Change rdma buffer layout.
    
    * Use TMA to transfer scales also.
    
    * Increase the NVL recv buffer size.
    
    * Disable early stopping.
    
    * Apply similar optimizations on receiver warps.
    
    * Prevent warp divergence.
    
    * Disable aggressive ptx by default.
    
    * Revert using TMA to transfer scales.
    
    * Format.
    
    * Change the layout of dispatch NVL buffer.
    
    * Move topk transformation to recv warps.
    
    * Use TMA to transfer all data in forward warps
    
    * Use TMA to store scales.
    
    * Code lint
    
    ---------
    Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
setup.py 3.86 KB