Commits · 86923c19f79b755eb6f0499f310cde218eb00f08 · gaoqiong / composable_kernel_ROCM

10 Feb, 2025 1 commit
- update · 86923c19
  Jim authored Feb 11, 2025
  
  86923c19
08 Feb, 2025 1 commit
- codegen template · 8b745f2c
  Jim authored Feb 08, 2025
  
  8b745f2c
07 Feb, 2025 1 commit
- template bwd v3 api · f4489897
  Jim authored Feb 07, 2025
  
  f4489897
05 Feb, 2025 2 commits
- smoke test update · f88ba67e
  danyao12 authored Feb 05, 2025
  
  f88ba67e
- add hd64 fp16 kernels · 8a8dc7f6
  danyao12 authored Feb 05, 2025
  
  8a8dc7f6
27 Jan, 2025 1 commit
- add layout restrictions · 008c91c9
  danyao12 authored Jan 27, 2025
  
  008c91c9
24 Jan, 2025 1 commit
- separate hd pad/unpad kernels · 92494a8a
  danyao12 authored Jan 24, 2025
  
  92494a8a
23 Jan, 2025 1 commit
- hd padding(hd % 8 == 0) support from 64 to 128 · b16fa5f0
  danyao12 authored Jan 23, 2025
  
  b16fa5f0
21 Jan, 2025 1 commit
- remove v3 spec · 248fd588
  danyao12 authored Jan 21, 2025
  
  248fd588
13 Jan, 2025 1 commit
- fix hd64 seqlen64 memory fault · d61f4b83
  danyao12 authored Jan 13, 2025
  
  d61f4b83
07 Jan, 2025 2 commits
- add data type config to FAv3 · 466b82a5
  danyao12 authored Jan 07, 2025
  
  466b82a5
- CMakeLists update · 21d12bb7
  danyao12 authored Jan 07, 2025
  
  21d12bb7
03 Jan, 2025 2 commits

[CK_TILE]naive attn support FP8 KVCache quant (#1747) · 6df5fe2a

carlushuang authored Jan 03, 2025



* quant

* fix bug

* simple smoothquant after softmax

* update kv-quant

* update stride

* fix fp8-pertoken-kvcache

* update int8/fp8 quant support

---------

Co-authored-by: so <a.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

6df5fe2a

qdo/kv strides split · 0c126ffc
danyao12 authored Jan 03, 2025

0c126ffc

29 Dec, 2024 1 commit

Remove using partitioner for all fmha kernels (#1778) · 4e076909

Qianfeng authored Dec 29, 2024

* Remove using tile partitioner for fmha_fwd_kernel

* Remove using tile partitioner for fmha_fwd_splitkv and splitkv-combine kernels

* Remove using tile partitioner for fmha_fwd_appendkv kernel

* Unify the format of GetTileIndex

4e076909

28 Dec, 2024 2 commits
- add templates · 43d81903
  danyao12 authored Dec 28, 2024
  
  43d81903
- enable hd64 bf16 atomic32 · 2defe2f6
  danyao12 authored Dec 28, 2024
  
  2defe2f6
25 Dec, 2024 2 commits
- Correct the dtype checking logics (#1775) · 4c2eff02
  Po Yen Chen authored Dec 25, 2024
  
  4c2eff02
- enable hd64 bf16 causal · 64bf2f36
  danyao12 authored Dec 25, 2024
  
  64bf2f36
20 Dec, 2024 1 commit

[CK_TILE] Add fmha fwd N-Warp S-Shuffle pipeline (fmha fwd splitkv pipeline variant) (#1705) · 37cdbf4f

Po Yen Chen authored Dec 20, 2024



* Add check for zero values

* Add static assertions

* Remove invalid option '-e' in smoke_test.sh

* Use correct path of smoke_test.sh

* Avoid zero-sized shared memory array

* Add warning comment

* Replace expr by integer_divide_ceil() call

* Use more readable constant names

* Write down assumption as static assertion

* Add more diagnostic error messages

* Fix wrong BlockWarps when using default pipeline policy

* Add more static assertions for A LDS desc

* Allow using vector size < 8 for data type fp16/bf16

* Align vector size between DRAM dist & LDS desc

* Remove no-longer used func decl

* Fix wrong displayed pipeline name

* Undo policy template changes for tile_example_gemm_basic

* Add missing space and make error message stand out

* Unify print precision

* Add missing include directive <iomanip>

* Replace constant 64 by get_warp_size() call

* Replace constant 128 by named variable: BankLength

* Add kAMBlock/kBNBlock attributes

* Allow using different A/B warp dist for multiple blocks

* Add helper function to get warp dist encodings

* Add 4x64x4 fp16 warp gemm attribute impl

* Complete the A/B warp dist encoding logic

* Fix wrong thread mapping for C matrix

* Use smaller vector size for small tile

* Add static assert to block unsupported warp gemm impl

* Extract common code out as helper method

* Add 4x64x16 fp16 warp gemm type alias

* Add comment to warn developers

* Undo WarpGemmAtrributeMfma<> changes

* Use more clear static assertion error message

* Add trivial wrapper to get warp dstr encodings

* Only transpose warp gemm result if it's square

* Fix compilation error

* Support multi-block warp gemm (on N direction)

* Remove duplicated code

* Fix output encoding of warp gemm

* Fix wrong shape of WarpGemmAtrributeMfmaIterateK<>

* Remove unused code

* Fix wrong shape of WarpGemmAttributeMfmaImplF16F16F32M4N64K4

* Add type config for bf16_t

* Add 4x64x16 bf16 warp gemm

* Update WarpGemmAtrributeMfmaIterateKAndTransposedCDistribution

* Add 64x4x4 fp16/bf16 warp gemm impl

* Add 64x4x16 fp16/bf16 warp gemm

* Add static assertion for better error diagnostic

* Get Q dram dstr directly from block gemm

* Add missing header: fused_moe.hpp

* Allow specifying different warp-gemm for gemm0 & gemm1

* Store P matrix into LDS before gemm1

* Fix inconsistent kernel name

* Remove constraint on gemm0 & gemm1 block warps

* Remove unsupported vector size from checking list

* Allow using 4x64x16 warp gemm for gemm0

* Finish policy customization

* Finish pipeline modification

* Use block warps in codegen

* Fix wrong rank of m_lds_window origin

* Use better distributed tensor

* Make P-store earlier

* Remove duplicated expressions

* Remove unnecessary tile window

* Create new files for new splitkv pipeline

* Separate old/new pipeline codegen logic

* Sync changes from develop

* Undo gemm kernel/pipeline changes

* Undo gemm example changes

* Remove blank lines

* Fix typo

* Use new warp gemm interface

* Fix link error

* Fix wrong pipeline tag

* Fix more link error

* Avoid unnecessary padding

* Always use vector load for K

* Padding on fastest dimension when necessary

* Force padding Q on hdim_q

* Set high dimension padding flag to false

* Re-format headers

* Use warps=<1, 4, 1> for both gemm0 & gemm1

* Fix compilation errors

* Remove m/l shuffle logics

* Ignore duplicate data when write lse_acc

* Use gemm0 block warps as lds tile width

* Remove hard-coded numbers

* Fix wrong distribution width

* Remove unnecessary code

* Add s_barrier before writing to LDS

* Store Q into LDS before gemm0

* Fix wrong Q tile size

* Use simple Q lds descriptor for debugging

* Use more realistic Q lds descriptor

* Add comment & use better variable name

* Make Q lds space not overlapped with others

* Remove unnecessary block_tile_reduce_sync() call

* Move Q load statements

* Move block_sync_lds() right before use

* Re-order instructions

* Remove unnecessary lambda expression

* Use 8 threads on kMaxSplits direction while doing reduction

* Tiny correction for using 8 threads on kMaxSplits direction for combine kernel

* Padding num_split direction of o_acc tile window to 4x

* Update splitkv combine pipeline design

* Add kN1 back to splitkv combine pipeline problem

* Fix compilation errors

* Add missing template parameter

* Fix wrong splitkv combine kernel name

* Fix wrong origin

* Fix wrong LDS descriptor shape

* Fix sync & reduction logics

* Remove unnecessary static assertions

* Extract tile size computation logics

* Make sure we can reuse padding flags in combine kernels

* Rename variables

* Use OaccDataType in BlockFmhaSplitKVCombinePipelineTileSizes<>

* Remove unnecessary static assertion

* Fix function name typo

* Add constraint on kN1 template parameter

* Hide K tile loading latency in earlier iteration

* Fix wrong splitkv kernel name

* Use s_shuffling to replace p_shuffling, which removes the need for cross-warp reduction

* Rename pipeline

* Fix wrong pipeline name attribute

* Add GetAlignmentQ() for NWarpSShuffle pipeline

* Separate Q tile into dram tile & register tile concepts

* Remove non-square warp gemm transpose c type alias

* Fallback tile size changes for fmha fwd splitkv

* Remove redundant change

* Refine naming for the S tile

* Use better naming of the S tile dstr (read from lds)

* Share Q lds with K lds

* Tiny change

* Fix with using static_for for passing CI checking

---------
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>

37cdbf4f

17 Dec, 2024 1 commit
- fav3 bwd hd64 bf16 a16 verification passed · 66cbdd6c
  danyao12 authored Dec 17, 2024
  
  66cbdd6c
12 Dec, 2024 1 commit

[CK_TILE] naive attn (#1708) · 77a38e02

carlushuang authored Dec 12, 2024

* add reference attention fwd

* refactor addresser

* update

* paged, and i8 reflect-quant

* lets call it forward-quant

* fix error in decode variation

* update naive-attn

* fix page table

* fix build err

77a38e02

10 Dec, 2024 1 commit
- [CK TILE] Use config name instead of data type in FmhaFwdTypeConfig<config> (#1731) · 94ae7113
  rocking authored Dec 10, 2024
  * Add data type config, Prepare to add mix precision in the future

  * Fix compile error
  94ae7113
26 Nov, 2024 1 commit

[CK_TILE] Fix incorrect computation of group mode PagedAttention (#1688) · cf2d635e

Po Yen Chen authored Nov 26, 2024



* Allow getting batch size from splitkv tile partitioner

* Fix wrong paged-kvcache impl for group mode

* Fix wrong example code for page-kvcache

* Undo changes in fmha_fwd.cpp

* Always use 2D block table

* Add is_gappy kernel argument for paged-kvcache

The is_gappy argument is used for differentiating seqstart_k_ptr usage
in flash-attention & xformers

* Remove out-of-date comments

* Remove no-longer used method

* Fix wrong # page-block calculation

* Fix wrong comment

---------
Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>

cf2d635e

25 Nov, 2024 2 commits

[CK_TILE] Fix fMHA fwd MakeKargs() compilation errors (#1689) · 645fe812

Po Yen Chen authored Nov 25, 2024



* Fix mis-matched tuple<> elem types

* Rename MakeKargs() as MakeKargsImpl()

---------
Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>

645fe812

Change in fwd-splitkv kernel to support num_splits=1 case (#1690) · ce2bdf42

Qianfeng authored Nov 25, 2024



* Change in fwd-splitkv kernel to support num_splits=1 case

* Update in codegen fwd-splitkv to make num_splits > 1 cases pass

* Specify instance traits in dispatch

* Fix link error for fp8 kernels

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

ce2bdf42

21 Nov, 2024 1 commit

[CK_TILE] Add paged-kvcache support in group mode fmha fwd splitkv kernels (#1678) · fb1ccfa9

Po Yen Chen authored Nov 21, 2024

* Generate group mode paged-attn kernel

* Enable paged-kvcache + group mode support

* Add missing header: fused_moe.hpp

* Add comment to explain kernel arg usage

* Make error message more clear

* Add comment for confusing data member names

* Add more comment for confusing variable names

* Fix typo in option description

fb1ccfa9

09 Nov, 2024 1 commit
- Fix 'sh' command compatibility of smoke_test_fwd.sh (#1553) · af9546d9
  Po Yen Chen authored Nov 09, 2024
  
  af9546d9
05 Nov, 2024 1 commit

[generate.py] Override blob list if it already exists (#1635) · 464abd23

Juan Manuel Martinez Caamaño authored Nov 05, 2024

Before, generate.py appended the list at the end of the output file.
When running the cmake configuration steps multiple times on the
examples, the blob list (such as fwd_blob_list.txt) would grow at every
configuration.
`library/src/tensor_operation_instance/gpu/mha/CMakeLists.txt` worked around
this issue by removing the output file if it exists.

Now, generate.py overrides the content of the output file.
There is no need for the workaround in the CMakeLists.txt;
and the issue is solved for the example projects too.
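
The difference between the old appending behaviour and the new overriding behaviour can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual generate.py code: the helper names `write_blob_list` and `simulate_configure`, the `append` flag, and the blob file names are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Sequence


def write_blob_list(output: Path, blobs: Sequence[str], append: bool) -> None:
    """Write one blob name per line (hypothetical helper, not generate.py's API).

    append=True mimics the old behaviour (mode "a"): repeated runs grow the file.
    append=False mimics the new behaviour (mode "w"): each run replaces the content.
    """
    mode = "a" if append else "w"
    with output.open(mode) as f:
        for name in blobs:
            f.write(name + "\n")


def simulate_configure(runs: int, append: bool) -> int:
    """Run the list generation `runs` times and return the final line count."""
    blobs = ["fmha_fwd_a.cpp", "fmha_fwd_b.cpp"]  # made-up blob names
    with tempfile.TemporaryDirectory() as tmp:
        output = Path(tmp) / "fwd_blob_list.txt"
        for _ in range(runs):
            write_blob_list(output, blobs, append)
        return len(output.read_text().splitlines())


if __name__ == "__main__":
    print("append:  ", simulate_configure(3, append=True))
    print("override:", simulate_configure(3, append=False))
```

With appending, every extra configuration run adds another copy of the list. With overriding, the file always holds exactly one copy, so the caller no longer has to delete it first.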

464abd23

01 Nov, 2024 1 commit
- add benchmark_bwd_v3.sh · 55d982c3
  danyao12 authored Nov 01, 2024
  
  55d982c3
30 Oct, 2024 1 commit

[CK_TILE] Add fmha fwd headdim96 support (#1608) · 86322218

Qianfeng authored Oct 30, 2024



* Add ceil_to_qualified_tile_length()

* Rename kK0BlockLength to kQKHeaddim

* Add kSubQKHeaddim concept to support headdim96

* Fix in math.hpp to avoid using __half interfaces

* Add LdsBufferSequence instance for headdim96

* Update in fmha_fwd/fmha_fwd_splitkv codegen to support hd96 testing

* Disable hd96 instance generation in codegen fmha_fwd and fmha_fwd_splitkv to save compiling time

* Reformat one file

* Fix text alignment in fmha_fwd_splitkv.py

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

86322218

26 Oct, 2024 1 commit

[CK_TILE] More fmha splitkv optimizations (#1588) · 54f0e6f4

Po Yen Chen authored Oct 26, 2024

* Use pre-defined constants for readability

* Use vector write for o_acc tensor

* Remove no-longer used policy method

* Deprecate no-longer used policy/pipeline

* Specify gemm0/gemm1 block warps separately in codegen

* Fix wrong ps_idx creation logic

* Add single-warp block gemm

* Support single-warp gemm0

* Make MakeCBlockTile() as static method

* Use MakeCBlockTile() to get underlying tile distribution

* Use kNumGemm1Warps to compute # threads for gemm1

* Put normal case in the if clause

* Refine fmha splitkv block mapping

* Refine & fix the lse_acc/o_acc layout

* Fix wrong LDS size for K tile

* Use kK0=64 for hdim=128,256 fmha splitkv kernels

* Use kK1=64 for hdim=32,64,128 fmha splitkv kernels

* Undo kK0/kK1 changes

* Use more reasonable GetAlignmentV() computation

* Using store_tile() in fmha splitkv kernel epilogue

54f0e6f4

21 Oct, 2024 1 commit

[CK_TILE] Optimize fmha splitkv & splitkv combine kernels (#1577) · 95e722a3

Po Yen Chen authored Oct 21, 2024

* Use smaller width for lse_accum dist tensor

* Update pipeline comment

* Fix wrong distribution for lse_accum

* Remove duplicate dim in lse_accum dist encoding

* Decide fmha splitkv combine kernel kBlockSize by kM0

* Remove assumption of MPerThread=1

* Add log<4> & log<8> specialization

* Enlarge occupancy array

* Fix vector size for small tile

* Add support for kMaxSplits=8

* Re-format gemm.hpp

* Use 16x16x16 warp gemm for fwd_splitkv

* Centralize policy code changes

* Leave fp8/bf8 tile settings unchanged

95e722a3

12 Oct, 2024 2 commits
- code revert · ae2d7d2b
  danyao12 authored Oct 12, 2024
  
  ae2d7d2b
- add bf16 rtne kernels · e2ea64d9
  danyao12 authored Oct 12, 2024
  
  e2ea64d9
11 Oct, 2024 2 commits
- bf16 rtz update · ee9706ab
  danyao12 authored Oct 11, 2024
  
  ee9706ab
- some kernels and related api update · 7b12d9b7
  danyao12 authored Oct 11, 2024
  
  7b12d9b7
08 Oct, 2024 3 commits

rename & ensure thread safety · d4de8495
danyao12 authored Oct 08, 2024

d4de8495

[CK_TILE] Update example README files & fix script compatibility issue (#1548) · 0c094daa

Po Yen Chen authored Oct 08, 2024

* Fix text alignment of ArgParser::print()

* Update example README files

* Clarify make-ck-dev.sh <arch> usage

* Only keep some of the argument from '-?' output

* Undo command line output changes in README

* Only keep existing argument on doc and update description

* Fix text alignment

* Make cmake-ck-*.sh compatible with 'sh' command

0c094daa

[CK_TILE] Simplify the codes in splitkv_combine pipeline (#1549) · 74d68e3b

Qianfeng authored Oct 08, 2024



* Simplify the codes in splitkv_combine pipeline

* Always set kPadSeqLenK=true for fmha splitkv kernels

* Change in Oacc Alignment and TileDistribution to be more adaptable to tile sizes

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

74d68e3b