    Canonicalize TMA usages (#410) · 2012e310
    Chenggang Zhao authored
    * Remove redundant TMA flushes
    
    * Less barrier initialization overhead
    
    * Simplify `elect_one_sync`
    
    * Use `elect_one_sync` instead of lanes
    
    * Minor fix
    
    * Polish testing prints
    
    * Refactor for internode kernels
    
    * Better performance
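
For readers unfamiliar with the `elect_one_sync` items above, the following is a minimal sketch of what electing one lane per warp can look like, compared with an explicit lane check. It is not the repository's implementation: the helper name `elect_one_sync_sketch`, the kernel `count_elected`, and the launch configuration are hypothetical, and the inline PTX assumes a target of `sm_90` or newer (for example, compiled with `nvcc -arch=sm_90`).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper for illustration only (not the code from this commit).
// Returns 1 on exactly one lane of a fully converged warp and 0 on the others.
__device__ __forceinline__ int elect_one_sync_sketch() {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
    int pred = 0;
    // `elect.sync` picks one leader among the lanes named in the member mask
    // and sets the predicate on that lane only.
    asm volatile(
        "{\n"
        ".reg .b32 %%rx;\n"
        ".reg .pred %%px;\n"
        "elect.sync %%rx|%%px, %1;\n"
        "@%%px mov.s32 %0, 1;\n"
        "}\n"
        : "+r"(pred)
        : "r"(0xffffffffu));
    return pred;
#else
    // Fallback for older architectures: pick lane 0 explicitly
    // (assumes a 1D thread block).
    return (threadIdx.x % 32) == 0;
#endif
}

// Each warp's elected lane bumps the counter once.
__global__ void count_elected(int* counter) {
    if (elect_one_sync_sketch())
        atomicAdd(counter, 1);
}

int main() {
    int* counter = nullptr;
    cudaMallocManaged(&counter, sizeof(int));
    *counter = 0;
    count_elected<<<1, 128>>>(counter);  // one block of four warps
    cudaDeviceSynchronize();
    printf("elected lanes: %d\n", *counter);  // one per warp is expected
    cudaFree(counter);
    return 0;
}
```

A helper of this shape is typically used to guard work that only one thread per warp should issue, such as a barrier initialization or a TMA copy, without hard-coding a particular lane index.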
internode.cu 105 KB