    Input/output permutation for fused attention (#460) · de37550f
    Anthony Chang authored
    
    
    * re-enable masking attention instance now that the CI has been upgraded
    
    * re-enable instances that previously failed on 9110
    
    * enable ksize-kpadding pair validity test
    
    * add non-masked attention+permute test; expose masking boolean to attention kernel handles
    
    * disable bench
    
    * fix test
    
    * move files
    
    * bulk rename batched_gemm_masking_scale_softmax_gemm_permute to batched_gemm_softmax_gemm_permute
    
    * format
    
    * amend rename
    
    * disable bench in test
    
    * add mask/no-mask test for non-permute attention kernels
    
    * disable broken kernel instance
    
    * example working
    
    add non-permuted problem statement
    
    evaluating whether the overhead comes from the permutation or from the extra kernel argument
    
    * interface for bias addition without implementing it
    
    * test and profiler running
    
    * tidy
    
    * mask type determined by enum class
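
      A minimal sketch of the idea, with assumed names (MaskingSpecialization and
      C0MatrixMask here are illustrative, not quoted from the actual header):

          // Select mask behaviour at compile time via an enum class
          // rather than a raw boolean template parameter.
          enum struct MaskingSpecialization
          {
              MaskDisabled,        // plain attention: nothing is masked
              MaskOutUpperTriangle // causal attention: mask out col > row
          };

          template <MaskingSpecialization MaskSpec>
          struct C0MatrixMask
          {
              static constexpr bool IsMaskedElement(int row, int col)
              {
                  if constexpr(MaskSpec == MaskingSpecialization::MaskOutUpperTriangle)
                      return col > row;
                  return false;
              }
          };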
    
    * unify example code
    
    * move masking specialization to its own header
    
    * align formats
    
    * extract helper functions
    
    * experiment with merging dims for attention w/ permute; shows perf parity with attention w/o permute
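
      The merge is only legal when the strides line up; a tiny hypothetical
      helper (CanMergeDims is not from the repo) captures the check:

          #include <cstdint>

          // Two adjacent dims [outer, inner] can be merged into one iteration
          // dim iff the pair is contiguous in memory.
          inline bool CanMergeDims(std::int64_t outer_stride,
                                   std::int64_t inner_stride,
                                   std::int64_t inner_length)
          {
              return outer_stride == inner_stride * inner_length;
          }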
    
    * add tensor specialization to template args
    
    since the 'packed' tensor specialization shows perf parity when permutation isn't needed
    
    remove redundant template args
    
    comment on 'packed' tensor specialization
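
      A sketch of what the added template knob might look like, modeled on the
      commit description rather than quoted from the header:

          // 'Packed' promises a dense, contiguous tensor, so all dims can be
          // merged into one linear index and the per-dim stride math required
          // for permuted layouts is skipped; 'Default' keeps arbitrary strides.
          enum struct TensorSpecialization
          {
              Default,
              Packed
          };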
    
    * grouped attention with input/output permute example
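
      The permutation itself is carried entirely by strides; a host-side sketch
      with hypothetical names and sizes, describing Q stored as [G0, M, G1, K]
      while the kernel consumes the logical [G0, G1, M, K] view:

          #include <array>
          #include <cstdint>

          int main()
          {
              const std::int64_t G0 = 4, G1 = 8, M = 256, K = 64;

              // Lengths in the logical [G0, G1, M, K] (batch, head, seq, dim) order...
              std::array<std::int64_t, 4> q_lengths{G0, G1, M, K};

              // ...while strides encode the physical [G0, M, G1, K] storage,
              // so no copy or explicit transpose kernel is needed:
              std::array<std::int64_t, 4> q_strides{M * G1 * K, // G0
                                                    K,          // G1
                                                    G1 * K,     // M
                                                    1};         // K
              (void)q_lengths;
              (void)q_strides;
              return 0;
          }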
    
    * format
    
    * clean up
    
    * refactor acc0 tile visitor
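
      Conceptually, the visitor folds the scale and the mask into a single pass
      over the S = Q*K^T accumulator tile; a hedged sketch (VisitAcc0Tile and
      the row-major indexing are illustrative only):

          #include <limits>

          // Apply a per-element functor to an acc0 tile stored row-major;
          // 'visit' receives the value plus its row/col within the tile.
          template <typename Acc, typename F>
          void VisitAcc0Tile(Acc& acc, int rows, int cols, F visit)
          {
              for(int i = 0; i < rows; ++i)
                  for(int j = 0; j < cols; ++j)
                      acc[i * cols + j] = visit(acc[i * cols + j], i, j);
          }

          // Usage: scale first, then push masked elements to -inf so the
          // following softmax assigns them zero weight:
          // VisitAcc0Tile(acc0, rows, cols, [scale](float v, int i, int j) {
          //     v *= scale;
          //     return j > i ? -std::numeric_limits<float>::infinity() : v;
          // });
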
    Co-authored-by: shaojiewang <wsjmessi@163.com>
    Co-authored-by: Chao Liu <chao.liu2@amd.com>