1. 07 Jan, 2025 1 commit
    • [CK_TILE] fmha fwd splitkv optimization for decode (seqlen_q=1) (#1789) · 24b12d04
      Po Yen Chen authored
      
      
      * Update license year
      
      * Add initial code to override decode problem
      
      * Fix splitkv traits/args overriding error
      
      * Reshape and transpose lse for decode
      
      * Remove debug code
      
      * Prettify example code
      
      * Use better function name
      
      * Add kMergeNumHeadGroupsSeqLenQ flag
      
      Kernel users can use this switch to turn the optimization on or off for
      some problem sizes
      
      * Add missing flag declarations
      
      * Turn off kMergeNumHeadGroupsSeqLenQ by default in codegen
      
      * Group similar statements together
      
      * Remove assumption of seqlen_q=1
      
      * Remove kMergeNumHeadGroupsSeqLenQ from splitkv combine kernel
      
      * Support kMergeNumHeadGroupsSeqLenQ=true in fmha splitkv kernel
      
      * Run kMergeNumHeadGroupsSeqLenQ=true kernels when needed
      
      * Fix group mode block skip logic
      
      * Undo changes of normal fwd kernel
      
      * Update GridSize() and use GridSize() for the splitkv kernel (#1799)
      
      ---------
      Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>
  2. 03 Jan, 2025 1 commit
  3. 29 Dec, 2024 1 commit
    • Remove using partitioner for all fmha kernels (#1778) · 4e076909
      Qianfeng authored
      * Remove using tile partitioner for fmha_fwd_kernel
      
      * Remove using tile partitioner for fmha_fwd_splitkv and splitkv-combine kernels
      
      * Remove using tile partitioner for fmha_fwd_appendkv kernel
      
      * Unify the format of GetTileIndex
  4. 28 Dec, 2024 1 commit
  5. 23 Dec, 2024 1 commit
  6. 20 Dec, 2024 2 commits
    • hot-fix (#1768) · 1c45ca35
      carlushuang authored
    • [CK_TILE] Add fmha fwd N-Warp S-Shuffle pipeline (fmha fwd splitkv pipeline variant) (#1705) · 37cdbf4f
      Po Yen Chen authored
      
      
      * Add check for zero values
      
      * Add static assertions
      
      * Remove invalid option '-e' in smoke_test.sh
      
      * Use correct path of smoke_test.sh
      
      * Avoid zero-sized shared memory array
      
      * Add warning comment
      
      * Replace expr by integer_divide_ceil() call
      
      * Use more readable constant names
      
      * Write down assumption as static assertion
      
      * Add more diagnostic error messages
      
      * Fix wrong BlockWarps when using default pipeline policy
      
      * Add more static assertions for A LDS desc
      
      * Allow using vector size < 8 for data type fp16/bf16
      
      * Align vector size between DRAM dist & LDS desc
      
      * Remove no-longer used func decl
      
      * Fix wrongly displayed pipeline name
      
      * Undo policy template changes for tile_example_gemm_basic
      
      * Add missing space and make error message stand out
      
      * Unify print precision
      
      * Add missing include directive <iomanip>
      
      * Replace constant 64 by get_warp_size() call
      
      * Replace constant 128 by named variable: BankLength
      
      * Add kAMBlock/kBNBlock attributes
      
      * Allow using different A/B warp dist for multiple blocks
      
      * Add helper function to get warp dist encodings
      
      * Add 4x64x4 fp16 warp gemm attribute impl
      
      * Complete the A/B warp dist encoding logic
      
      * Fix wrong thread mapping for C matrix
      
      * Use smaller vector size for small tile
      
      * Add static assert to block unsupported warp gemm impl
      
      * Extract common code out as helper method
      
      * Add 4x64x16 fp16 warp gemm type alias
      
      * Add comment to warn developers
      
      * Undo WarpGemmAtrributeMfma<> changes
      
      * Use clearer static assertion error message
      
      * Add trivial wrapper to get warp dstr encodings
      
      * Only transpose warp gemm result if it's square
      
      * Fix compilation error
      
      * Support multi-block warp gemm (on N direction)
      
      * Remove duplicated code
      
      * Fix output encoding of warp gemm
      
      * Fix wrong shape of WarpGemmAtrributeMfmaIterateK<>
      
      * Remove unused code
      
      * Fix wrong shape of WarpGemmAttributeMfmaImplF16F16F32M4N64K4
      
      * Add type config for bf16_t
      
      * Add 4x64x16 bf16 warp gemm
      
      * Update WarpGemmAtrributeMfmaIterateKAndTransposedCDistribution
      
      * Add 64x4x4 fp16/bf16 warp gemm impl
      
      * Add 64x4x16 fp16/bf16 warp gemm
      
      * Add static assertion for better error diagnostics
      
      * Get Q dram dstr directly from block gemm
      
      * Add missing header: fused_moe.hpp
      
      * Allow specifying different warp-gemm for gemm0 & gemm1
      
      * Store P matrix into LDS before gemm1
      
      * Fix inconsistent kernel name
      
      * Remove constraint on gemm0 & gemm1 block warps
      
      * Remove unsupported vector size from checking list
      
      * Allow using 4x64x16 warp gemm for gemm0
      
      * Finish policy customization
      
      * Finish pipeline modification
      
      * Use block warps in codegen
      
      * Fix wrong rank of m_lds_window origin
      
      * Use better distributed tensor
      
      * Make P-store earlier
      
      * Remove duplicated expressions
      
      * Remove unnecessary tile window
      
      * Create new files for new splitkv pipeline
      
      * Separate old/new pipeline codegen logic
      
      * Sync changes from develop
      
      * Undo gemm kernel/pipeline changes
      
      * Undo gemm example changes
      
      * Remove blank lines
      
      * Fix typo
      
      * Use new warp gemm interface
      
      * Fix link error
      
      * Fix wrong pipeline tag
      
      * Fix more link error
      
      * Avoid unnecessary padding
      
      * Always use vector load for K
      
      * Padding on fastest dimension when necessary
      
      * Force padding Q on hdim_q
      
      * Set high dimension padding flag to false
      
      * Re-format headers
      
      * Use warps=<1, 4, 1> for both gemm0 & gemm1
      
      * Fix compilation errors
      
      * Remove m/l shuffle logic
      
      * Ignore duplicate data when write lse_acc
      
      * Use gemm0 block warps as lds tile width
      
      * Remove hard-coded numbers
      
      * Fix wrong distribution width
      
      * Remove unnecessary code
      
      * Add s_barrier before writing to LDS
      
      * Store Q into LDS before gemm0
      
      * Fix wrong Q tile size
      
      * Use simple Q lds descriptor for debugging
      
      * Use more realistic Q lds descriptor
      
      * Add comment & use better variable name
      
      * Make Q lds space not overlapped with others
      
      * Remove unnecessary block_tile_reduce_sync() call
      
      * Move Q load statements
      
      * Move block_sync_lds() right before use
      
      * Re-order instructions
      
      * Remove unnecessary lambda expression
      
      * Use 8 threads on kMaxSplits direction while doing reduction
      
      * Tiny correction for using 8 threads on kMaxSplits direction for combine kernel
      
      * Padding num_split direction of o_acc tile window to 4x
      
      * Update splitkv combine pipeline design
      
      * Add kN1 back to splitkv combine pipeline problem
      
      * Fix compilation errors
      
      * Add missing template parameter
      
      * Fix wrong splitkv combine kernel name
      
      * Fix wrong origin
      
      * Fix wrong LDS descriptor shape
      
      * Fix sync & reduction logic
      
      * Remove unnecessary static assertions
      
      * Extract tile size computation logic
      
      * Make sure we can reuse padding flags in combine kernels
      
      * Rename variables
      
      * Use OaccDataType in BlockFmhaSplitKVCombinePipelineTileSizes<>
      
      * Remove unnecessary static assertion
      
      * Fix function name typo
      
      * Add constraint on kN1 template parameter
      
      * Hide K tile loading latency in earlier iteration
      
      * Fix wrong splitkv kernel name
      
      * Use s_shuffling to replace p_shuffling, which removes the need for cross-warp reduction
      
      * Rename pipeline
      
      * Fix wrong pipeline name attribute
      
      * Add GetAlignmentQ() for NWarpSShuffle pipeline
      
      * Separate Q tile into dram tile & register tile concepts
      
      * Remove non-square warp gemm transpose c type alias
      
      * Fallback tile size changes for fmha fwd splitkv
      
      * Remove redundant change
      
      * Refine naming for the S tile
      
      * Use better naming of the S tile dstr (read from lds)
      
      * Share Q lds with K lds
      
      * Tiny change
      
      * Fix by using static_for to pass CI checks
      
      ---------
      Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
  7. 18 Dec, 2024 1 commit
    • [CK TILE] Refactor GemmKernel to be reused by other GEMM related operators (#1730) · 453ca373
      aledudek authored
      * Gemm Kernel Refactor part1
      
      * Gemm Kernel Refactor common gemm pipeline part2
      
      * [CK TILE] Refactor batched gemm to reuse GemmKernel
      
      * [CK TILE] Refactor GemmKernel - review changes part1
      
      * [CK TILE] Refactor GemmKernel - references fix
      
      * [CK TILE] Refactor GemmKernel - naming changes, add problem
      
      * [CK_TILE] Refactor GemmKernel - update tests
      
      * [CK_TILE] Refactor GemmKernel - review changes
      
      * [CK_TILE] Refactor GemmKernel - update test
      
      * [CK_TILE] Refactor GemmKernel - constness fixes
      
      * [CK_TILE] Refactor GemmKernel - update tests
  8. 17 Dec, 2024 1 commit
  9. 15 Dec, 2024 1 commit
  10. 13 Dec, 2024 1 commit
  11. 12 Dec, 2024 1 commit
    • [CK_TILE] naive attn (#1708) · 77a38e02
      carlushuang authored
      * add reference attention fwd
      
      * refactor addresser
      
      * update
      
      * paged, and i8 reflect-quant
      
      * let's call it forward-quant
      
      * fix error in decode variation
      
      * update naive-attn
      
      * fix page table
      
      * fix build err
  12. 06 Dec, 2024 1 commit
  13. 05 Dec, 2024 1 commit
  14. 04 Dec, 2024 2 commits
  15. 30 Nov, 2024 1 commit
  16. 29 Nov, 2024 1 commit
    • Ck tile batched gemm example (#1615) · 78f0fea0
      aledudek authored
      * [CK Tile] Batched GEMM Example
      
      * [CK Tile] Batched GEMM Example - minor refactor
      
      * [CK Tile] Batched GEMM Example - README update
      
      * [CK Tile] Batched Gemm Example - review changes
      
      - Added tensor data layouts as input parameters
      - Changed structure of Host and Kernel args
      - Fixed invalid vector read on non-contiguous memory
      
      * [CK Tile] Batched Gemm Example - remove comment
      
      * [CK Tile] Batched Gemm Example - Add GTests part1
      
      * [CK Tile] Batched Gemm Example - GTests part2 + review changes
      
      * [CK TILE] Batched GEMM post merge fixes
      
      * [CK Tile] Batched GEMM Example - fix pad views
  17. 28 Nov, 2024 1 commit
  18. 27 Nov, 2024 1 commit
  19. 26 Nov, 2024 4 commits
    • support max3 in smoothquant and add+ rmsnorm + rdquant (#1654) · abae2afc
      rocking authored
      * Fix cmake example build
      
      * Support max3 in smoothquant one pass
      
      * support max3 in two pass
      
      * support max3 in add_rmsnorm_rdquant
    • [CK_TILE] Fix incorrect computation of group mode PagedAttention (#1688) · cf2d635e
      Po Yen Chen authored
      
      
      * Allow getting batch size from splitkv tile partitioner
      
      * Fix wrong paged-kvcache impl for group mode
      
      * Fix wrong example code for paged-kvcache
      
      * Undo changes in fmha_fwd.cpp
      
      * Always use 2D block table
      
      * Add is_gappy kernel argument for paged-kvcache
      
      The is_gappy argument is used for differentiating seqstart_k_ptr usage
      in flash-attention & xformers
      
      * Remove out-of-date comments
      
      * Remove no-longer used method
      
      * Fix wrong # page-block calculation
      
      * Fix wrong comment
      
      ---------
      Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>
    • CK-Tile first draft of universal block gemm with interwave & intrawave scheduler (#1676) · b6bcd76d
      Adam Osewski authored
      * Block universal gemm.
      
      * Universal block gemm with interwave scheduler - draft.
      
      * Refactoring
      
      * Move a/b_warp_tiles into BlockGemmImpl
      * set BlockGemmImpl as a class member
      
      * Change tile size to better suit memory-bound cases.
      
      * Introduce kKPerThread to WarpGemm
      
      * Add documentation comment.
      
      * Fix Interwave scheduler block gemm.
      
      * Add compute/memory friendly tile configuration.
      
      * Clean
      
      * New tile configurations in gemm mem example.
      
      * Add more static checks and fix loop order in block gemm.
      
      * Add more static checks and use warp gemm mfma dispatcher.
      
      * Add default scheduler block gemm.
      
      * Remove logging in example.
    • [CK_TILE] fused-moe first version (#1634) · 440e28b0
      carlushuang authored
      
      
      * moe pipeline
      
      * update code
      
      * compile OK
      
      * update
      
      * update cpu reference
      
      * update pipeline_gemm0
      
      * compiler ok
      
      * update pipeline
      
      * rename to ex pipeline
      
      * block-asm
      
      * update
      
      * update
      
      * update first gemm ok
      
      * compute correct
      
      * update file structure
      
      * update README
      
      * update
      
      * update
      
      * update code
      
      * update API
      
      * return unsupported case
      
      * add comment
      
      * update readme
      
      * update
      
      * uncomment
      
      * update
      
      * fix build err
      
      ---------
      Co-authored-by: valarLip <340077269@qq.com>
  20. 25 Nov, 2024 3 commits
  21. 22 Nov, 2024 1 commit
    • [CK_TILE] MakeKargs overloads for backward compatibility (#1681) · ff92222f
      schung-amd authored
      
      
      * Add overloads for MakeKargs
      
      Overload MakeKargs to accept std::tuple<uint64_t, uint64_t> and std::tuple<void*, void*> to preserve functionality of code currently passing in list initializers or tuples.
      
      * Re-format files using ck_tile remod.py
      
      ---------
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
  22. 21 Nov, 2024 1 commit
  23. 14 Nov, 2024 1 commit
  24. 13 Nov, 2024 1 commit
  25. 12 Nov, 2024 1 commit
  26. 11 Nov, 2024 2 commits
  27. 09 Nov, 2024 1 commit
    • Ck tile/moe sorting (#1624) · bec6fbc6
      dummycoderfe authored
      
      
      * add moe_sorting & check ok
      
      * fix comments & typo
      
      * Run remod.py under include/ck_tile & example/ck_tile directories
      
      * format codes
      
      * fix output ci check bug
      
      * fix moe sorting readme and error commit file
      
      * use magic div to accelerate compute
      
      * add a loop unroll for moe lds ops
      
      * add extblocksnel to set zeros for moebufs
      
      * [Ck_tile] moe set zero run ok, add size check and fix ref check
      
      * [Ck_tile]fix moe_sorting fuse set_zero remod
      
      * [Ck_tile] change name style, fix zero buffer size err, change folder
      
      * [Ck_tile] moe_sorting: fix name style
      
      * [Ck_tile] moe_sorting, remove useless params in traits
      
      * [Ck_tile] change outputtile cnt * unit_size; change output buf alloc
      
      ---------
      Co-authored-by: dummycoderfe <noplydummmycoder@163.com>
      Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
      Co-authored-by: carlushuang <carlus.huang@amd.com>
  28. 08 Nov, 2024 1 commit
  29. 02 Nov, 2024 1 commit
  30. 01 Nov, 2024 2 commits
    • [Ck_tile] smoothquant (#1617) · fbd65454
      rocking authored
      
      
      * fix compile error
      
      * fix typo of padding
      
      * Add smoothquant op
      
      * Add smoothquant instance library
      
      * refine type
      
      * add test script
      
      * Re-generate smoothquant.hpp
      
      * Always use 'current year' in copyright
      
      * use Generic2dBlockShape instead
      
      * Add vector = 8 instance back
      
      * Find exe path automatically
      
      * Simplify the api condition
      
      * Remove debugging code
      
      * update year
      
      * Add blank line between function declarations
      
      * explicitly cast return value to dim3
      
      * refine return value
      
      * Fix default warmup and repeat value
      
      * Add comment
      
      * refactor smoothquant cmake
      
      * Add README
      
      * Fix typo
      
      ---------
      Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
    • [layernorm] hot fix (#1620) · 550248de
      carlushuang authored
      * hot fix ln
      
      * some rename
  31. 31 Oct, 2024 1 commit
    • [CK_TILE] layernorm support fused-quant/fused-add (#1604) · c3a4800c
      carlushuang authored
      * add prenorm/postnorm support, refactor using generate.py
      
      * update README
      
      * update README
      
      * fix format
      
      * update some description and fix format
      
      * update format
      
      * format
      
      * use non-raw for loading
      
      * format and update n4096
      
      * dynamic-quant ready
      
      * update readme
      
      * support fused dynamic-quant
      
      * update fused-quant, with smooth
      
      * update README
      
      * update args
      
      * update some based on comment