1. 07 Jan, 2025 1 commit
    • [CK_TILE] fmha fwd splitkv optimization for decode (seqlen_q=1) (#1789) · 24b12d04
      Po Yen Chen authored
      
      
      * Update license year
      
      * Add initial code to override decode problem
      
      * Fix splitkv traits/args overriding error
      
      * Reshape and transpose lse for decode
      
      * Remove debug code
      
      * Prettify example code
      
      * Use better function name
      
      * Add kMergeNumHeadGroupsSeqLenQ flag
      
      Kernel users can use this switch to turn the optimization on or off for
      some problem sizes
      
      * Add missing flag declarations
      
      * Turn off kMergeNumHeadGroupsSeqLenQ by default in codegen
      
      * Group similar statements together
      
      * Remove assumption of seqlen_q=1
      
      * Remove kMergeNumHeadGroupsSeqLenQ from splitkv combine kernel
      
      * Support kMergeNumHeadGroupsSeqLenQ=true in fmha splitkv kernel
      
      * Run kMergeNumHeadGroupsSeqLenQ=true kernels when needed
      
      * Fix group mode block skip logic
      
      * Undo changes of normal fwd kernel
      
      * Update GridSize() and use GridSize() for splitkv kernel (#1799)
      
      ---------
      Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>
  2. 01 Nov, 2024 1 commit
    • [Ck_tile] smoothquant (#1617) · fbd65454
      rocking authored
      
      
      * fix compile error
      
      * fix typo of padding
      
      * Add smoothquant op
      
      * Add smoothquant instance library
      
      * refine type
      
      * add test script
      
      * Re-generate smoothquant.hpp
      
      * Always use 'current year' in copyright
      
      * use Generic2dBlockShape instead
      
      * Add vector = 8 instance back
      
      * Find exe path automatically
      
      * Simplify the api condition
      
      * Remove debugging code
      
      * update year
      
      * Add blank line between function declarations
      
      * explicitly cast return value to dim3
      
      * refine return value
      
      * Fix default warmup and repeat value
      
      * Add comment
      
      * refactor smoothquant cmake
      
      * Add README
      
      * Fix typo
      
      ---------
      Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
  3. 31 Oct, 2024 1 commit
    • [CK_TILE] layernorm support fused-quant/fused-add (#1604) · c3a4800c
      carlushuang authored
      * add prenorm/postnorm support, refactor using generate.py
      
      * update README
      
      * update README
      
      * fix format
      
      * update some description and fix format
      
      * update format
      
      * format
      
      * use non-raw for loading
      
      * format and update n4096
      
      * dynamic-quant ready
      
      * update readme
      
      * support fused dynamic-quant
      
      * update fused-quant, with smooth
      
      * update README
      
      * update args
      
      * update some based on comment
  4. 30 Oct, 2024 1 commit
    • [Ck tile] support rmsnorm and related fusion (#1605) · 3d609534
      rocking authored
      * Add reduce2d new api
      
      * Prevent users from using cross-warp reduction
      
      * Fix bug of std calculation
      
      * Add rmsnorm2d
      
      * Add rmsnorm small example
      
      * Remove static assert to prevent compile fail
      
      * Add script to test performance and correctness
      
      * Add missing cmake change
      
      * refine naming
      
      * refine example of rmsnorm
      
      * Fix bug of rmsnorm
      
      * Refine naming
      
      * Fix cmake
      
      * clang format
      
      * Refine pipeline name
      
      * Add add_rmsnorm2d_rdquant kernel
      
      * Add reduce op
      
      * host verification
      
      * Fix bug of one pass pipeline
      
      * Refine tile size
      
      * Add two pass pipeline
      
      * Rename two pass to three pass
      
      * Fix bug of kSaveX == false
      
      * Add instance library
      
      * Add test script
      
      * Fix bug of x verification
      
      * Add save_x to trait
      
      * Add README
      
      * Move reduce2d into reduce folder
      
      * Fix bug of welford when number of m warp > 1
      
      * remove redundant comment
      
      * 1. move 06_rmsnorm2d to 10_rmsnorm2d
      2. move 07_add_rmsnorm2d_rdquant to 11_add_rmsnorm2d_rdquant
      
      * clang format and add missing header
      
      * Add host validation of add + layernorm2d + rsquant
      
      * Revert "Add host validation of add + layernorm2d + rsquant"
      
      This reverts commit 936cb457978b928b90eff89a08fcdb7dc8bbed67.
      
      * Remove deprecated flag