1. 29 Jan, 2025 2 commits
  2. 27 Jan, 2025 2 commits
    • Add OCP FP8 support in CK_TILE (#1829) · 35aebe59
      Andriy Roshchenko authored
      * Add OCP FP8 to CK_TILE
      
      * Validate OCP FP8 in FMHA FWD under VALID=1
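      For context, OCP FP8 refers to the two 8-bit float formats in the OCP 8-bit floating point specification. The sketch below is only an illustrative summary of those formats; the struct and its names are assumptions, not CK_TILE's actual fp8 types.

      #include <cstdio>

      // Illustrative summary of the OCP FP8 formats (not CK_TILE's real fp8 types).
      struct Fp8FormatInfo
      {
          int   exponent_bits;
          int   mantissa_bits;
          float max_normal;
          bool  has_infinity;
      };

      constexpr Fp8FormatInfo ocp_e4m3{4, 3, 448.0f, false};  // E4M3: NaN only, no +/-inf
      constexpr Fp8FormatInfo ocp_e5m2{5, 2, 57344.0f, true}; // E5M2: has +/-inf and NaN

      int main()
      {
          std::printf("e4m3 max %.0f, e5m2 max %.0f\n", ocp_e4m3.max_normal, ocp_e5m2.max_normal);
      }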
    • [CK-Tile] Enable vectorized reads on all layouts & improve perf. (#1835) · 39dc25a9
      Adam Osewski authored
      
      
      * Refactor universal gemm policy.
      
      * Adapt example to refactor changes.
      
      * Introduce static encoding pattern
      
      * Adding shuffled encoding patterns.
      
      * Fix err in reverse tuple.
      
      * Add transpose_tile2d
      
      * Small refactoring + doc
      
      * Enable reading on contiguous dimension in all layouts.
      
      * Transpose A/B register tile if needed for comp v3 pipeline.
      
      * Take contiguous dim size when calculating dram vector load size (see the sketch after this entry).
      
      * A/B smem pack size taken from WarpGemm attributes
      
      * Update B LDS layout and setup tile distribution pattern at class level.
      
      * Fix static assert.
      
      * Fix errors in examples.
      
      * Formatting & fix IsTranspose
      
      * Fix VectorSize & refactor.
      
      * Add error logging messages.
      
      * Fix VecLoadSize and TransposeC for mem pipeline.
      
      * Update unit-tests & disable mem pipeline.
      
      * Clang format
      
      * Update include/ck_tile/core/tensor/tile_window.hpp
      Co-authored-by: jakpiase <jakub.piasecki@amd.com>
      
      * Fix compilation errors and address reviewers' comments.
      
      * Refactor unit-test. Fall back to non-universal gemm.
      
      Need to use GemmPipelineAGmemBGmemCRegV1 for now,
      since GemmKernel now also supports non-K-major vector reads.
      
      ---------
      Co-authored-by: jakpiase <jakub.piasecki@amd.com>
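      The bullet above about taking the contiguous dim size when calculating the DRAM vector load size boils down to picking the widest load that still divides that dimension. A minimal sketch of that idea, assuming a 16-byte maximum access and hypothetical names (this is not the ck_tile policy code):

      #include <cstdio>

      // Hypothetical helper: widest vector load (in elements) that fits a 16-byte access
      // and evenly divides the tile's contiguous dimension.
      template <typename T>
      constexpr int vector_load_size(int contiguous_dim_size)
      {
          int vec = 16 / static_cast<int>(sizeof(T)); // e.g. 8 elements for a 2-byte fp16
          while(vec > 1 && contiguous_dim_size % vec != 0)
              vec /= 2;
          return vec;
      }

      int main()
      {
          // With 2-byte elements, a 96-wide contiguous dim allows 8-wide loads,
          // while a 100-wide one falls back to 4-wide loads.
          std::printf("%d %d\n", vector_load_size<short>(96), vector_load_size<short>(100));
      }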
  3. 24 Jan, 2025 2 commits
  4. 22 Jan, 2025 1 commit
  5. 21 Jan, 2025 1 commit
    • CK-Tile Grouped GEMM refactor and post PR fixes (#1756) · 3c93d3c4
      Mateusz Ozga authored
      * Grouped gemm simple code refactor
      
      * Offset invoker
      
      * Invoke generic Run, and rename the partitioner variable
      
      * Fix type in tests
      
      * Removed namespaces
      
      * Add template param to avoid implicit cast
      
      * Remove generic function
      
      * Constant value
      
      * Set enum underlying type to int16_t
      
      * Generalize partitioner function
      
      * Remove whitespaces
      
      * Rename function
      
      * Using support
      
      * Clang-format
      
      * Clang-format
      
      * Fn-partitioner description fn
      
      * Typo
      
      * Typo 2
      
      * Better description
      
      * Better description
      
      * Refactor after review
      
      * Use ctor instead of setter fn
      
      * Invoke ctor and fix typo
      
      * Comments
      
      * Remove unnecessary comment
      
      * Review, remove modulo
  6. 18 Jan, 2025 1 commit
  7. 17 Jan, 2025 1 commit
    • Implementing Test Filters for Smoke and Regression Tests (#1819) · 54de3e55
      Aviral Goel authored
      * smoke and regression targets working with tests
      
      * test filters work for both examples and test
      
      * removed unnecessary comments
      
      * added a missing comment
      
      * added a missing comment
      
      * fixed typo in the comments
      
      * updated README
      
      * Update PULL_REQUEST_TEMPLATE.md
      
      updating the template for future addition of test cases
      
      * Update PULL_REQUEST_TEMPLATE.md
  8. 16 Jan, 2025 1 commit
  9. 15 Jan, 2025 2 commits
    • Add rounding for float to bf16 conversion as default (#1812) · 7790e8c3
      Bartłomiej Kocot authored
      * Add rounding for float to bf16 conversion
      
      * Add bhalf test
      
      * Add inf test bhalf
      
      * Refactor
      
      * update cmake
      
      * Fixes
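      The entry above makes rounding, rather than truncation, the default for float-to-bf16 conversion. The snippet below is a generic round-to-nearest-even sketch of such a conversion and is not CK's actual implementation:

      #include <cstdint>
      #include <cstdio>
      #include <cstring>

      // Illustrative float -> bf16 conversion with round-to-nearest-even.
      static uint16_t float_to_bf16_rtne(float f)
      {
          uint32_t bits;
          std::memcpy(&bits, &f, sizeof(bits));
          // Keep NaNs as NaNs instead of letting the rounding bias turn them into inf.
          if((bits & 0x7fffffffu) > 0x7f800000u)
              return static_cast<uint16_t>((bits >> 16) | 0x0040u);
          // Round to nearest, ties to even, on the 16 mantissa bits that get dropped.
          const uint32_t rounding_bias = 0x00007fffu + ((bits >> 16) & 1u);
          return static_cast<uint16_t>((bits + rounding_bias) >> 16);
      }

      int main()
      {
          // 1.005859375f truncates to 0x3f80 (1.0) but rounds to the closer 0x3f81.
          std::printf("0x%04x\n", float_to_bf16_rtne(1.005859375f));
      }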
    • [CK_TILE] Add Various Fusion Functions to RMSNorm (#1802) · 04dd3148
      ruanjm authored
      
      
      * Add shortcut to RMSNorm (see the sketch after this entry)
      
      * Modify test for adding shortcut for RMSNorm
      
      * Add fused parameter into tests
      
      * 1. Add YDataType. 2. Move rmsnorm2d_fwd_traits_ from rmsnorm2d_fwd.hpp to rmsnorm2d_fwd_api.cpp and rmsnorm2d_fwd_instance_common.hpp
      
      * 1. Support various strides and precisions.
      
      * Add support of Epilogue
      
      * Add fuse and epilogue support to rmsnorm ref
      
      * Modify rmsnorm example
      
      * Refactor tests/examples
      
      * Bug fix for newly added tests/examples
      
      * Bug fix for new tests 2
      
      * Modify smoke test scripts
      
      remove dbg code
      
      * Support non-smooth dynamic quant
      
      * Update Rmsnorm2dFwd::GetName()
      
      * rename xscale and prec_sx to smoothscale and prec_sm
      
      Bug fix after rename
      
      Remove files
      
      * change example_rmsnorm2d_fwd.cpp
      
      * update performance calculator
      
      * Fix issue in two-pass when fuse add is enabled
      
      * Remove comment of beta
      
      ---------
      Co-authored-by: rocking <ChunYu.Lai@amd.com>
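      As a rough reference for the fused "shortcut" add mentioned above, the sketch below shows RMSNorm with a residual add folded into the same pass. It is reference math with hypothetical names, not the ck_tile kernel or its rmsnorm2d_fwd API:

      #include <cmath>
      #include <cstdio>
      #include <vector>

      // RMSNorm with a fused residual ("shortcut") add:
      // y = (x + res) * rsqrt(mean((x + res)^2) + eps) * gamma
      void rmsnorm2d_fused_add_ref(const std::vector<float>& x,        // [rows x cols]
                                   const std::vector<float>& residual, // [rows x cols]
                                   const std::vector<float>& gamma,    // [cols]
                                   std::vector<float>& y,              // [rows x cols]
                                   int rows, int cols, float eps = 1e-6f)
      {
          for(int r = 0; r < rows; ++r)
          {
              float sum_sq = 0.0f;
              for(int c = 0; c < cols; ++c)
              {
                  const float v = x[r * cols + c] + residual[r * cols + c]; // fused shortcut add
                  sum_sq += v * v;
              }
              const float inv_rms = 1.0f / std::sqrt(sum_sq / cols + eps);
              for(int c = 0; c < cols; ++c)
              {
                  const float v = x[r * cols + c] + residual[r * cols + c];
                  y[r * cols + c] = v * inv_rms * gamma[c];
              }
          }
      }

      int main()
      {
          const int rows = 1, cols = 4;
          std::vector<float> x(cols, 1.0f), res(cols, 1.0f), gamma(cols, 1.0f), y(cols);
          rmsnorm2d_fused_add_ref(x, res, gamma, y, rows, cols);
          std::printf("%f\n", y[0]); // ~1.0: (1 + 1) normalized by the rms of {2, 2, 2, 2}
      }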
  10. 13 Jan, 2025 2 commits
  11. 10 Jan, 2025 1 commit
    • Ck tile/gemm perf measure (#1750) · 73a076ee
      Thomas Ning authored
      
      
      * Finished adding the performance benchmark for ck tile gemm
      
      * Fix the executable rename problem
      
      * fix the executable name error
      
      * delete the unsupported layout combinations
      
      * Update run_full_test.sh
      
      * Update benchmark_mem_pipeline.sh
      
      * Update benchmark_basic.sh
      
      * change the executable of gemm_universal
      
      * change ck_tile_gemm script permissions
      
      * Addressed the comment
      
      * Addressed the comment
      
      * Fixed the comments
      
      * Fixed Comment
      
      * roll back the malfunctioning change
      
      * Fix the Typo
      
      * finalize the tile_gemm_fp16 performance monitoring
      
      * fix the stash names for ck_tile gemm logs
      
      * change the stashing logic
      
      * change stashing syntax
      
      ---------
      Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
      Co-authored-by: illsilin <Illia.Silin@amd.com>
  12. 08 Jan, 2025 2 commits
  13. 07 Jan, 2025 1 commit
    • [CK_TILE] fmha fwd splitkv optimization for decode (seqlen_q=1) (#1789) · 24b12d04
      Po Yen Chen authored
      
      
      * Update license year
      
      * Add initial code to override decode problem
      
      * Fix splitkv traits/args overriding error
      
      * Reshape and transpose lse for decode
      
      * Remove debug code
      
      * Prettify example code
      
      * Use better function name
      
      * Add kMergeNumHeadGroupsSeqLenQ flag
      
      Kernel users can use this switch to turn the optimization on/off for
      some problem sizes (see the sketch after this entry).
      
      * Add missing flag declarations
      
      * Default turn off kMergeNumHeadGroupsSeqLenQ in codegen
      
      * Group similar statements together
      
      * Remove assumption of seqlen_q=1
      
      * Remove kMergeNumHeadGroupsSeqLenQ from splitkv combine kernel
      
      * Support kMergeNumHeadGroupsSeqLenQ=true in fmha splitkv kernel
      
      * Run kMergeNumHeadGroupsSeqLenQ=true kernels when needed
      
      * Fix group mode block skip logic
      
      * Undo changes of normal fwd kernel
      
      * Update in GridSize() and using GridSize() for splitkv kernel (#1799)
      
      ---------
      Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>
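      The kMergeNumHeadGroupsSeqLenQ switch above targets MQA/GQA decode, where several Q heads share one KV head but seqlen_q is only 1, so per-head tiles are tiny. A rough, hypothetical sketch of the index-remapping idea (names and shapes are assumptions, not the kernel's real interface):

      #include <cstdio>

      // Fold the q heads that share one kv head into extra "query rows", so one decode
      // tile covers the whole head group instead of a single row.
      struct MergedDecodeIndex
      {
          int num_q_heads;  // e.g. 32
          int num_kv_heads; // e.g. 8 -> head-group size 4

          void map(int q_head, int& kv_head, int& merged_row) const
          {
              const int group = num_q_heads / num_kv_heads;
              kv_head    = q_head / group; // which shared K/V head to attend against
              merged_row = q_head % group; // row inside the merged seqlen_q dimension
          }
      };

      int main()
      {
          const MergedDecodeIndex idx{32, 8};
          int kv = 0, row = 0;
          idx.map(13, kv, row);
          std::printf("q head 13 -> kv head %d, merged row %d\n", kv, row); // 3, 1
      }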
  14. 03 Jan, 2025 3 commits
  15. 02 Jan, 2025 2 commits
  16. 29 Dec, 2024 1 commit
    • Remove using partitioner for all fmha kernels (#1778) · 4e076909
      Qianfeng authored
      * Remove using tile partitioner for fmha_fwd_kernel
      
      * Remove using tile partitioner for fmha_fwd_splitkv and splitkv-combine kernels
      
      * Remove using tile partitioner for fmha_fwd_appendkv kernel
      
      * Unify the format of GetTileIndex
  17. 28 Dec, 2024 1 commit
  18. 25 Dec, 2024 1 commit
  19. 23 Dec, 2024 1 commit
  20. 20 Dec, 2024 1 commit
    • [CK_TILE] Add fmha fwd N-Warp S-Shuffle pipeline (fmha fwd splitkv pipeline variant) (#1705) · 37cdbf4f
      Po Yen Chen authored
      
      
      * Add check for zero values
      
      * Add static assertions
      
      * Remove invalid option '-e' in smoke_test.sh
      
      * Use correct path of smoke_test.sh
      
      * Avoid zero-sized shared memory array
      
      * Add warning comment
      
      * Replace expr by integer_divide_ceil() call
      
      * Use more readable constant names
      
      * Write down assumption as static assertion
      
      * Add more diagnostic error messages
      
      * Fix wrong BlockWarps when using default pipeline policy
      
      * Add more static assertions for A LDS desc
      
      * Allow using vector size < 8 for data type fp16/bf16
      
      * Align vector size between DRAM dist & LDS desc
      
      * Remove no-longer used func decl
      
      * Fix wrongly displayed pipeline name
      
      * Undo policy template changes for tile_example_gemm_basic
      
      * Add missing space and make error message stand out
      
      * Unify print precision
      
      * Add missing include directive <iomanip>
      
      * Replace constant 64 by get_warp_size() call
      
      * Replace constant 128 by named variable: BankLength
      
      * Add kAMBlock/kBNBlock attributes
      
      * Allow using different A/B warp dist for multiple blocks
      
      * Add helper function to get warp dist encodings
      
      * Add 4x64x4 fp16 warp gemm attribute impl
      
      * Complete the A/B warp dist encoding logic
      
      * Fix wrong thread mapping for C matrix
      
      * Use smaller vector size for small tile
      
      * Add static assert to block unsupported warp gemm impl
      
      * Extract common code out as helper method
      
      * Add 4x64x16 fp16 warp gemm type alias
      
      * Add comment to warn developers
      
      * Undo WarpGemmAtrributeMfma<> changes
      
      * Use more clear static assertion error message
      
      * Add trivial wrapper to get warp dstr encodings
      
      * Only transpose warp gemm result if it's square
      
      * Fix compilation error
      
      * Support multi-block warp gemm (on N direction)
      
      * Remove duplicated code
      
      * Fix output encoding of warp gemm
      
      * Fix wrong shape of WarpGemmAtrributeMfmaIterateK<>
      
      * Remove unused code
      
      * Fix wrong shape of WarpGemmAttributeMfmaImplF16F16F32M4N64K4
      
      * Add type config for bf16_t
      
      * Add 4x64x16 bf16 warp gemm
      
      * Update WarpGemmAtrributeMfmaIterateKAndTransposedCDistribution
      
      * Add 64x4x4 fp16/bf16 warp gemm impl
      
      * Add 64x4x16 fp16/bf16 warp gemm
      
      * Add static assertion for better error diagnostic
      
      * Get Q dram dstr directly from block gemm
      
      * Add missing header: fused_moe.hpp
      
      * Allow specifying different warp-gemm for gemm0 & gemm1
      
      * Store P matrix into LDS before gemm1
      
      * Fix inconsistent kernel name
      
      * Remove constraint on gemm0 & gemm1 block warps
      
      * Remove unsupported vector size from checking list
      
      * Allow using 4x64x16 warp gemm for gemm0
      
      * Finish policy customization
      
      * Finish pipeline modification
      
      * Use block warps in codegen
      
      * Fix wrong rank of m_lds_window origin
      
      * Use better distributed tensor
      
      * Make P-store earlier
      
      * Remove duplicated expressions
      
      * Remove unnecessary tile window
      
      * Create new files for new splitkv pipeline
      
      * Separate old/new pipeline codegen logic
      
      * Sync changes from develop
      
      * Undo gemm kernel/pipeline changes
      
      * Undo gemm example changes
      
      * Remove blank lines
      
      * Fix typo
      
      * Use new warp gemm interface
      
      * Fix link error
      
      * Fix wrong pipeline tag
      
      * Fix more link error
      
      * Avoid unnecessary padding
      
      * Always use vector load for K
      
      * Padding on fastest dimension when necessary
      
      * Force padding Q on hdim_q
      
      * Set high dimension padding flag to false
      
      * Re-format headers
      
      * Use warps=<1, 4, 1> for both gemm0 & gemm1
      
      * Fix compilation errors
      
      * Remove m/l shuffle logic
      
      * Ignore duplicate data when writing lse_acc
      
      * Use gemm0 block warps as lds tile width
      
      * Remove hard-coded numbers
      
      * Fix wrong distribution width
      
      * Remove unnecessary code
      
      * Add s_barrier before writing to LDS
      
      * Store Q into LDS before gemm0
      
      * Fix wrong Q tile size
      
      * Use simple Q lds descriptor for debugging
      
      * Use more realistic Q lds descriptor
      
      * Add comment & use better variable name
      
      * Make Q lds space not overlapped with others
      
      * Remove unnecessary block_tile_reduce_sync() call
      
      * Move Q load statements
      
      * Move block_sync_lds() right before use
      
      * Re-order instructions
      
      * Remove unnecessary lambda expression
      
      * Use 8 threads on kMaxSplits direction while doing reduction
      
      * Tiny correction for using 8 threads on kMaxSplits direction for combine kernel
      
      * Padding num_split direction of o_acc tile window to 4x
      
      * Update splitkv combine pipeline design
      
      * Add kN1 back to splitkv combine pipeline problem
      
      * Fix compilation errors
      
      * Add missing template parameter
      
      * Fix wrong splitkv combine kernel name
      
      * Fix wrong origin
      
      * Fix wrong LDS descriptor shape
      
      * Fix sync & reduction logic
      
      * Remove unnecessary static assertions
      
      * Extract tile size computation logic
      
      * Make sure we can reuse padding flags in combine kernels
      
      * Rename variables
      
      * Use OaccDataType in BlockFmhaSplitKVCombinePipelineTileSizes<>
      
      * Remove unnecessary static assertion
      
      * Fix function name typo
      
      * Add constraint on kN1 template parameter
      
      * Hide K tile loading latency in earlier iteration
      
      * Fix wrong splitkv kernel name
      
      * Use s_shuffling to replace p_shuffling, which removes the need for cross-warp reduction
      
      * Rename pipeline
      
      * Fix wrong pipeline name attribute
      
      * Add GetAlignmentQ() for NWarpSShuffle pipeline
      
      * Separate Q tile into dram tile & register tile concepts
      
      * Remove non-square warp gemm transpose c type alias
      
      * Fallback tile size changes for fmha fwd splitkv
      
      * Remove redundant change
      
      * Refine naming for the S tile
      
      * Use better naming of the S tile dstr (read from lds)
      
      * Share Q lds with K lds
      
      * Tiny change
      
      * Fix by using static_for to pass CI checks
      
      ---------
      Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
  21. 19 Dec, 2024 1 commit
  22. 18 Dec, 2024 2 commits
    • [CK TILE] Refactor GemmKernel to be reused by other GEMM related operators (#1730) · 453ca373
      aledudek authored
      * Gemm Kernel Refactor part1
      
      * Gemm Kernel Refactor common gemm pipeline part2
      
      * [CK TILE] Refactor batched gemm to reuse GemmKernel
      
      * [CK TILE] Refactor GemmKernel - review changes part1
      
      * [CK TILE] Refactor GemmKernel - references fix
      
      * [CK TILE] Refactor GemmKernel - naming changes, add problem
      
      * [CK_TILE] Refactor GemmKernel - update tests
      
      * [CK_TILE] Refactor GemmKernel - review changes
      
      * [CK_TILE] Refactor GemmKernel - update test
      
      * [CK_TILE] Refactor GemmKernel - constness fixes
      
      * [CK_TILE] Refactor GemmKernel - update tests
    • [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm (#1743) · f6c4d614
      aledudek authored
      * [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm
      
      * [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm - review changes
      
      * [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm - review fix
  23. 13 Dec, 2024 2 commits
    • Add SplitK support into Batched GEMM V3 (#1729) · 4d8fce33
      Bartłomiej Kocot authored
      
      
      * add bmm api
      
      * add bf16 multi_d
      
      * add ckProfiler for bf16
      
      * add ckProfiler files
      
      * add more instances; fix 64-bit index issue
      
      * fixed naming
      
      * enabled batched Ds
      
      * use long_index for ds offsets
      
      * clean
      
      * add bmm fp8 ckProfiler
      
      * Update example/24_batched_gemm/batched_gemm_xdl_bf16_v3.cpp
      Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
      
      * Update example/24_batched_gemm/batched_gemm_xdl_fp8_rowwise_v3.cpp
      Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
      
      * Update example/24_batched_gemm/run_batched_gemm_example_rowwise.inc
      Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
      
      * Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn.hpp
      Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
      
      * Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_mem_v1_default_instance.cpp
      Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
      
      * Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_mem_v2_default_instance.cpp
      Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
      
      * Update profiler/src/profile_gemm_universal_batched.cpp
      Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
      
      * Update profiler/include/profiler/profile_gemm_universal_batched_impl.hpp
      Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
      
      * clean
      
      * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp
      
      * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp
      
      * Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_comp_default_instance.cpp
      
      * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp
      
      * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp
      
      * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp
      
      * refactor batch offset func
      
      * add splitk support into bmm_v3 (see the sketch after this entry)
      
      * clean
      
      * clean
      
      * format
      
      * fixed
      
      * fix
      
      ---------
      Co-authored-by: Jing Zhang <jizhan@fb.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
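      SplitK, as added above for batched GEMM v3, splits the K dimension across extra workgroups and reduces their partial results. The host-side sketch below is only a conceptual illustration of that decomposition, not the device kernel (which reduces via atomics or a separate pass):

      #include <algorithm>
      #include <cstdio>
      #include <vector>

      // Conceptual SplitK GEMM: each split computes a partial C over its K-chunk,
      // and the partials are accumulated into the final C.
      void gemm_splitk_ref(const std::vector<float>& A, // [M x K], row-major
                           const std::vector<float>& B, // [K x N], row-major
                           std::vector<float>& C,       // [M x N]
                           int M, int N, int K, int num_splits)
      {
          std::fill(C.begin(), C.end(), 0.0f);
          const int k_per_split = (K + num_splits - 1) / num_splits;
          for(int s = 0; s < num_splits; ++s) // each split maps to its own workgroup
          {
              const int k_begin = s * k_per_split;
              const int k_end   = std::min(K, k_begin + k_per_split);
              for(int m = 0; m < M; ++m)
                  for(int n = 0; n < N; ++n)
                  {
                      float partial = 0.0f;
                      for(int k = k_begin; k < k_end; ++k)
                          partial += A[m * K + k] * B[k * N + n];
                      C[m * N + n] += partial; // reduction over splits
                  }
          }
      }

      int main()
      {
          const int M = 2, N = 2, K = 8;
          std::vector<float> A(M * K, 1.0f), B(K * N, 1.0f), C(M * N);
          gemm_splitk_ref(A, B, C, M, N, K, /*num_splits=*/4);
          std::printf("%f\n", C[0]); // 8.0, identical to a single-split GEMM
      }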
    • Ck tile/smoothquant out stride (#1742) · 4e731776
      chenjun authored
      * add ck_tile/smoothquant out stride parameter
      
      * Remove the default stride value
      
      ---------
      
      Co-authored-by: so <a.com>
  24. 12 Dec, 2024 1 commit
    • [CK_TILE] naive attn (#1708) · 77a38e02
      carlushuang authored
      * add reference attention fwd
      
      * refactor addresser
      
      * update
      
      * paged, and i8 reflect-quant
      
      * lets call it forward-quant
      
      * fix error in decode variation
      
      * update naive-attn
      
      * fix page table
      
      * fix build err
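      The entry above adds a reference (naive) attention forward path. The sketch below shows the plain single-head math, O = softmax(Q K^T / sqrt(d)) V, as a generic illustration rather than the CK_TILE reference code (which also covers paged KV and int8 quantization):

      #include <algorithm>
      #include <cmath>
      #include <cstdio>
      #include <vector>

      // Naive single-head attention forward: O = softmax(Q K^T / sqrt(d)) V.
      void naive_attention_fwd(const std::vector<float>& Q, // [seqlen_q  x d]
                               const std::vector<float>& K, // [seqlen_kv x d]
                               const std::vector<float>& V, // [seqlen_kv x d]
                               std::vector<float>& O,       // [seqlen_q  x d]
                               int seqlen_q, int seqlen_kv, int d)
      {
          const float scale = 1.0f / std::sqrt(static_cast<float>(d));
          for(int i = 0; i < seqlen_q; ++i)
          {
              std::vector<float> s(seqlen_kv);
              float row_max = -INFINITY, row_sum = 0.0f;
              for(int j = 0; j < seqlen_kv; ++j)
              {
                  float acc = 0.0f;
                  for(int k = 0; k < d; ++k)
                      acc += Q[i * d + k] * K[j * d + k];
                  s[j]    = acc * scale;
                  row_max = std::max(row_max, s[j]);
              }
              for(int j = 0; j < seqlen_kv; ++j)
              {
                  s[j] = std::exp(s[j] - row_max); // numerically stable softmax
                  row_sum += s[j];
              }
              for(int k = 0; k < d; ++k)
              {
                  float acc = 0.0f;
                  for(int j = 0; j < seqlen_kv; ++j)
                      acc += (s[j] / row_sum) * V[j * d + k];
                  O[i * d + k] = acc;
              }
          }
      }

      int main()
      {
          const int sq = 2, skv = 3, d = 4;
          std::vector<float> Q(sq * d, 0.1f), K(skv * d, 0.2f), V(skv * d, 0.3f), O(sq * d);
          naive_attention_fwd(Q, K, V, O, sq, skv, d);
          std::printf("%f\n", O[0]); // uniform inputs -> every output equals 0.3
      }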
  25. 10 Dec, 2024 1 commit
  26. 06 Dec, 2024 1 commit
  27. 05 Dec, 2024 1 commit
  28. 04 Dec, 2024 1 commit
  29. 03 Dec, 2024 1 commit