    [CK_TILE] Optimize fmha splitkv & splitkv combine kernels (#1577) · 95e722a3
    Po Yen Chen authored
    * Use smaller width for lse_accum dist tensor
    
    * Update pipeline comment
    
    * Fix wrong distribution for lse_accum
    
    * Remove duplicate dim in lse_accum dist encoding
    
    * Decide fmha splitkv combine kernel kBlockSize by kM0
    
    * Remove assumption of MPerThread=1
    
    * Add log<4> & log<8> specialization
    
    * Enlarge occupancy array
    
    * Fix vector size for small tile
    
    * Add support for kMaxSplits=8
    
    * Re-format gemm.hpp
    
    * Use 16x16x16 warp gemm for fwd_splitkv
    
    * Centralize policy code changes
    
    * Leave fp8/bf8 tile settings unchanged
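
For context on the lse_accum items above: a split-KV forward pass produces one partial output and one log-sum-exp (LSE) value per split, and the combine step merges them. The following is a minimal host-side sketch of that merge for a single query row, assuming each split's output is already normalized within its own split. The function name combine_splits and its parameters are hypothetical and do not come from CK_TILE; the actual kernels work on distributed tensors and tile windows and are not reproduced here.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not a CK_TILE API): merge per-split attention results
// for one query row.
//   lse_acc[s] : log-sum-exp of the attention scores covered by split s
//   o_acc[s]   : output row of split s, normalized within that split
// Returns the combined output row and stores the combined LSE in lse_out.
std::vector<float> combine_splits(const std::vector<float>& lse_acc,
                                  const std::vector<std::vector<float>>& o_acc,
                                  float& lse_out)
{
    const std::size_t num_splits = lse_acc.size();
    const std::size_t hdim       = o_acc.empty() ? 0 : o_acc[0].size();
    const float neg_inf          = -std::numeric_limits<float>::infinity();

    // Take the maximum LSE so the exponentials below cannot overflow.
    float lse_max = neg_inf;
    for(float v : lse_acc)
        lse_max = std::fmax(lse_max, v);

    std::vector<float> out(hdim, 0.0f);
    if(num_splits == 0 || lse_max == neg_inf)
    {
        // No split saw any key: report an empty row.
        lse_out = neg_inf;
        return out;
    }

    // Combined LSE = max + log(sum_s exp(lse_s - max)).
    float sum = 0.0f;
    for(float v : lse_acc)
        sum += std::exp(v - lse_max);
    lse_out = lse_max + std::log(sum);

    // Each split contributes with weight exp(lse_s - lse_out); weights sum to 1.
    for(std::size_t s = 0; s < num_splits; ++s)
    {
        const float w = std::exp(lse_acc[s] - lse_out);
        for(std::size_t d = 0; d < hdim; ++d)
            out[d] += w * o_acc[s][d];
    }
    return out;
}
```

With two splits of equal LSE, the weights are both 0.5, so the combined row is the plain average of the two partial rows. The number of splits is a runtime vector length in this sketch, whereas the commit treats the maximum split count (kMaxSplits) as a compile-time kernel parameter.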
gemm.hpp 2.47 KB