Commits · ck_tile/support-vllm-kcache-layout · gaoqiong / composable_kernel_ROCM

01 Jan, 2025 1 commit
- Merge branch 'develop' into ck_tile/support-vllm-kcache-layout · c881136b
  Po Yen Chen authored Jan 01, 2025
  
  c881136b
31 Dec, 2024 1 commit
- Add kStoreLSE=true fp8 fmha fwd splitkv instances · c5e8e14f
  Po Yen, Chen authored Dec 31, 2024
  
  c5e8e14f
29 Dec, 2024 12 commits
- Move splitkv partitioner logics into splitkv kernel · ca1a816d
  Po Yen Chen authored Dec 29, 2024
  
  ca1a816d
- Revert "Use async splitkv pipeline for hdim<256 problems" · f31fad7d
  Po Yen Chen authored Dec 29, 2024
```
This reverts commit 658350b3.
```
  f31fad7d
- Use async splitkv pipeline for hdim<256 problems · 658350b3
  Po Yen Chen authored Dec 29, 2024
  
  658350b3
- Re-arrange move_tile_window() statements (async) · 76b31460
  Po Yen Chen authored Dec 29, 2024
  
  76b31460
- Fix alignment for o_acc · 12871dd4
  Po Yen Chen authored Dec 29, 2024
  
  12871dd4
- Re-arrange K tile move_tile_window() statement · 9bf87a90
  Po Yen Chen authored Dec 29, 2024
  
  9bf87a90
- Revert "Re-arrange move_tile_window() statements" · 212e9006
  Po Yen Chen authored Dec 29, 2024
```
This reverts commit 09486ebf.
```
  212e9006
- Re-arrange move_tile_window() statements · 09486ebf
  Po Yen Chen authored Dec 29, 2024
  
  09486ebf
- Only update tensor view attributes if change page-block · 73a4d827
  Po Yen Chen authored Dec 29, 2024
  
  73a4d827
- Correct the dtype checking logics · 3d167b4b
  Po Yen Chen authored Dec 29, 2024
  
  3d167b4b
- Use vector load if paged-vcache is in column major (async pipeline) · 36a1c7c9
  Po Yen Chen authored Dec 29, 2024
  
  36a1c7c9
- Remove using partitioner for all fmha kernels (#1778) · 4e076909
  Qianfeng authored Dec 29, 2024
```
* Remove using tile partitioner for fmha_fwd_kernel

* Remove using tile partitioner for fmha_fwd_splitkv and splitkv-combine kernels

* Remove using tile partitioner for fmha_fwd_appendkv kernel

* Unify the format of GetTileIndex
```
  4e076909
28 Dec, 2024 1 commit
- [CK TILE] GEMM and Batched GEMM SplitK support (#1724) · af664948
  Bartłomiej Kocot authored Dec 28, 2024

* [CK TILE] Add split K support in GEMM

* Updates

* Fixes

* rebase

* fix

* Fix

* fixes

* support for batched gemm

  af664948
25 Dec, 2024 1 commit
- Correct the dtype checking logics (#1775) · 4c2eff02
  Po Yen Chen authored Dec 25, 2024
  
  4c2eff02
24 Dec, 2024 3 commits
- Use vector load if paged-vcache is in column major · 65bbe6ea
  Po Yen Chen authored Dec 24, 2024
  
  65bbe6ea
- Fix wrong vlayout assumption in example · 1fef9106
  Po Yen Chen authored Dec 23, 2024
  
  1fef9106
- Use vlayout=col for chunked prefill · 4633f073
  Po Yen Chen authored Dec 23, 2024
  
  4633f073
23 Dec, 2024 8 commits
- Only generate V rowmajor kernels · bb093470
  Po Yen Chen authored Dec 23, 2024
  
  bb093470
- Merge branch 'feature/fmha-fwd-async-splitkv' into... · 377e1289
  Po Yen Chen authored Dec 23, 2024
```
Merge branch 'feature/fmha-fwd-async-splitkv' into feature/support-vllm-kcache-layout-add-splitkv-instance
```
  377e1289
- Only check incomplete split in first&last iterations · 6c4e10da
  Po Yen Chen authored Dec 23, 2024
  
  6c4e10da
- Merge branch 'feature/add-splitkv-instance' into... · c5083c0f
  Po Yen Chen authored Dec 23, 2024
```
Merge branch 'feature/add-splitkv-instance' into feature/support-vllm-kcache-layout-add-splitkv-instance
```
  c5083c0f
- Only check incomplete split in first&last iterations · 3f29f232
  Po Yen Chen authored Dec 23, 2024
  
  3f29f232
- Set kHasUnevenSplits=false if num_splits = 1 · 4b3474e4
  Po Yen Chen authored Dec 23, 2024
  
  4b3474e4
- Do not force kHasUnevenSplits=true in group mode · 346ba760
  Po Yen Chen authored Dec 23, 2024
  
  346ba760
- [CK_TILE] optimize moe-sorting kernel (#1771) · 3d15f364
  carlushuang authored Dec 23, 2024
```
* opt moe sorting

* remove commented code
```
  3d15f364
20 Dec, 2024 6 commits

- Use kv_perm to control key/value layout · 0739bc5a
  Po Yen Chen authored Dec 20, 2024
  
  0739bc5a
- fix typo for CK_USE_OCP_FP8 (#1769) · 07339c73
  Illia Silin authored Dec 20, 2024
  
  07339c73
- hot-fix (#1768) · 1c45ca35
  carlushuang authored Dec 20, 2024
  
  1c45ca35
- [CK_TILE] Add fmha fwd N-Warp S-Shuffle pipeline (fmha fwd splitkv pipeline variant) (#1705) · 37cdbf4f
  Po Yen Chen authored Dec 20, 2024

* Add check for zero values

* Add static assertions

* Remove invalid option '-e' in smoke_test.sh

* Use correct path of smoke_test.sh

* Avoid zero-sized shared memory array

* Add warning comment

* Replace expr by integer_divide_ceil() call

* Use more readable constant names

* Write down assumption as static assertion

* Add more diagnostic error messages

* Fix wrong BlockWarps when using default pipeline policy

* Add more static assertions for A LDS desc

* Allow using vector size < 8 for data type fp16/bf16

* Align vector size between DRAM dist & LDS desc

* Remove no-longer used func decl

* Fix wrong displayed pipeline name

* Undo policy template changes for tile_example_gemm_basic

* Add missing space and make error message stand out

* Unify print precision

* Add missing include directive <iomanip>

* Replace constant 64 by get_warp_size() call

* Replace constant 128 by named variable: BankLength

* Add kAMBlock/kBNBlock attributes

* Allow using different A/B warp dist for multiple blocks

* Add helper function to get warp dist encodings

* Add 4x64x4 fp16 warp gemm attribute impl

* Complete the A/B warp dist encoding logic

* Fix wrong thread mapping for C matrix

* Use smaller vector size for small tile

* Add static assert to block unsupported warp gemm impl

* Extract common code out as helper method

* Add 4x64x16 fp16 warp gemm type alias

* Add comment to warn developers

* Undo WarpGemmAtrributeMfma<> changes

* Use more clear static assertion error message

* Add trivial wrapper to get warp dstr encodings

* Only transpose warp gemm result if it's square

* Fix compilation error

* Support multi-block warp gemm (on N direction)

* Remove duplicated code

* Fix output encoding of warp gemm

* Fix wrong shape of WarpGemmAtrributeMfmaIterateK<>

* Remove unused code

* Fix wrong shape of WarpGemmAttributeMfmaImplF16F16F32M4N64K4

* Add type config for bf16_t

* Add 4x64x16 bf16 warp gemm

* Update WarpGemmAtrributeMfmaIterateKAndTransposedCDistribution

* Add 64x4x4 fp16/bf16 warp gemm impl

* Add 64x4x16 fp16/bf16 warp gemm

* Add static assertion for better error diagnostic

* Get Q dram dstr directly from block gemm

* Add missing header: fused_moe.hpp

* Allow specifying different warp-gemm for gemm0 & gemm1

* Store P matrix into LDS before gemm1

* Fix inconsistent kernel name

* Remove constraint on gemm0 & gemm1 block warps

* Remove unsupported vector size from checking list

* Allow using 4x64x16 warp gemm for gemm0

* Finish policy customization

* Finish pipeline modification

* Use block warps in codegen

* Fix wrong rank of m_lds_window origin

* Use better distributed tensor

* Make P-store earlier

* Remove duplicated expressions

* Remove unnecessary tile window

* Create new files for new splitkv pipeline

* Separate old/new pipeline codegen logic

* Sync changes from develop

* Undo gemm kernel/pipeline changes

* Undo gemm example changes

* Remove blank lines

* Fix typo

* Use new warp gemm interface

* Fix link error

* Fix wrong pipeline tag

* Fix more link error

* Avoid unnecessary padding

* Always use vector load for K

* Padding on fastest dimension when necessary

* Force padding Q on hdim_q

* Set high dimension padding flag to false

* Re-format headers

* Use warps=<1, 4, 1> for both gemm0 & gemm1

* Fix compilation errors

* Remove m/l shuffle logics

* Ignore duplicate data when write lse_acc

* Use gemm0 block warps as lds tile width

* Remove hard-coded numbers

* Fix wrong distribution width

* Remove unnecessary code

* Add s_barrier before writing to LDS

* Store Q into LDS before gemm0

* Fix wrong Q tile size

* Use simple Q lds descriptor for debugging

* Use more realistic Q lds descriptor

* Add comment & use better variable name

* Make Q lds space not overlapped with others

* Remove unnecessary block_tile_reduce_sync() call

* Move Q load statements

* Move block_sync_lds() right before use

* Re-order instructions

* Remove unnecessary lambda expression

* Use 8 threads on kMaxSplits direction while doing reduction

* Tiny correction for using 8 threads on kMaxSplits direction for combine kernel

* Padding num_split direction of o_acc tile window to 4x

* Update splitkv combine pipeline design

* Add kN1 back to splitkv combine pipeline problem

* Fix compilation errors

* Add missing template parameter

* Fix wrong splitkv combine kernel name

* Fix wrong origin

* Fix wrong LDS descriptor shape

* Fix sync & reduction logics

* Remove unnecessary static assertions

* Extract tile size computation logics

* Make sure we can reuse padding flags in combine kernels

* Rename variables

* Use OaccDataType in BlockFmhaSplitKVCombinePipelineTileSizes<>

* Remove unnecessary static assertion

* Fix function name typo

* Add constraint on kN1 template parameter

* Hide K tile loading latency in earlier iteration

* Fix wrong splitkv kernel name

* Use s_shuffling to replace p_shuffling, which removes the need for cross-warp reduction

* Rename pipeline

* Fix wrong pipeline name attribute

* Add GetAlignmentQ() for NWarpSShuffle pipeline

* Separate Q tile into dram tile & register tile concepts

* Remove non-square warp gemm transpose c type alias

* Fallback tile size changes for fmha fwd splitkv

* Remove redundant change

* Refine naming for the S tile

* Use better naming of the S tile dstr (read from lds)

* Share Q lds with K lds

* Tiny change

* Fix by using static_for to pass CI checks

---------
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>

  37cdbf4f
- Merge branch 'feature/fmha-fwd-async-splitkv' into... · 7c0e5822
  Po Yen Chen authored Dec 20, 2024

  Merge branch 'feature/fmha-fwd-async-splitkv' into feature/support-vllm-kcache-layout-add-splitkv-instance

  7c0e5822
- fix profiler_grouped_gemm (#1766) · 2944c508
  Illia Silin authored Dec 19, 2024
  
  2944c508
19 Dec, 2024 7 commits
- Apply Ck-tile argument parser for vectors [I/O] (#1758) · e758d006
  Mateusz Ozga authored Dec 19, 2024
```
* Parser for a vector was added. Additionally, we validate the correctness of numbers

* Remove unnecessary comments

* Review part 1

* Review part 2

* Add const to variadic lambda

* Rename C->K
```
  e758d006
- Fix wrong origin calculation · e86da0e9
  Po Yen Chen authored Dec 19, 2024
  
  e86da0e9
- Enable splitkv async pipeline · 60356c90
  Po Yen Chen authored Dec 19, 2024
  
  60356c90
- Fix compilation errors · de6dd79f
  Po Yen Chen authored Dec 19, 2024
  
  de6dd79f
- Complete splitkv async default policy · 232864b4
  Po Yen Chen authored Dec 19, 2024
  
  232864b4
- Update license year · 8f9f4ae5
  Po Yen Chen authored Dec 19, 2024
  
  8f9f4ae5
- Update comments in default policy source files · 2609ea69
  Po Yen Chen authored Dec 19, 2024
  
  2609ea69