Commits · e7b6286441aae59d3a87db67f42369d3cc2636a4 · gaoqiong / composable_kernel_ROCM

27 Nov, 2024 3 commits

Add interwave scheduler for gemm mem pipeline (#1647) · e7b62864

jakpiase authored Nov 27, 2024



* add interwave scheduler for gemm mem pipeline

* Fix merge artifacts.

* Refactor unit tests.

* Switch to interwave scheduler for mem example

---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>

e7b62864

move utility headers from library/include to include path (#1697) · fe6b185b

Illia Silin authored Nov 27, 2024

fe6b185b

Polished Grouped GEMM APIs and new BF16 instances (#1600) · 061ac064

Adam Osewski authored Nov 27, 2024

* Few small fixes.

* New GroupedGemm instances (BF16)

* Unify and refactor GroupedGEMM device API.

* Adapt changes to new API.

* Adapt grouped gemm profiler.

* Accept multiple kbatches for grouped gemm profiler.

- delete obsolete two stage as it is now covered by grouped gemm

* Update unit test for grouped gemm.

* Fix thresholds for BF16 and F8. Unblock tests.

* Fix few instances.

* Multiple small fixes.

* Adapt to new API, check dynamic casting.

* Uncomment few data types in grouped gemm profiler.

* Fix call to SetDeviceArgs.

* Fix profile grouped gemm multiply tile loop.

* Fix grouped gemm tile loop kernel args in client examples.

* Review comments.

061ac064

26 Nov, 2024 6 commits

support max3 in smoothquant and add + rmsnorm + rdquant (#1654) · abae2afc

rocking authored Nov 27, 2024

* Fix cmake example build

* Support max3 in smoothquant one pass

* support max3 in two pass

* support max3 in add_rmsnorm_rdquant

abae2afc

Change block gemm pipeline local prefill loop order. (#1692) · bfe983a1

Adam Osewski authored Nov 26, 2024

* Fix loop order.

* Fix loop order in pipeline v4

bfe983a1

Add check for bf16 splitk support for grouped gemm splitk (#1673) · b70f367f

jakpiase authored Nov 26, 2024



* add check for bf16 splitk support for grouped gemm splitk

* Update if condition

---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

b70f367f

[CK_TILE] Fix incorrect computation of group mode PagedAttention (#1688) · cf2d635e

Po Yen Chen authored Nov 26, 2024



* Allow getting batch size from splitkv tile partitioner

* Fix wrong paged-kvcache impl for group mode

* Fix wrong example code for paged-kvcache

* Undo changes in fmha_fwd.cpp

* Always use 2D block table

* Add is_gappy kernel argument for paged-kvcache

The is_gappy argument is used for differentiating seqstart_k_ptr usage
in flash-attention & xformers

* Remove out-of-date comments

* Remove no-longer used method

* Fix wrong # page-block calculation

* Fix wrong comment

---------
Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>

cf2d635e

CK-Tile first draft of universal block gemm with interwave & intrawave scheduler (#1676) · b6bcd76d

Adam Osewski authored Nov 26, 2024

* Block universal gemm.

* Universal block gemm with interwave scheduler - draft.

* Refactoring

* Move a/b_warp_tiles into BlockGemmImpl
* set BlockGemmImpl as a class member

* Change tile size to better suit memory-bound cases.

* Introduce kKPerThread to WarpGemm

* Add documentation comment.

* Fix Interwave scheduler block gemm.

* Add compute/memory friendly tile configuration.

* Clean

* New tile configurations in gemm mem example.

* Add more static checks and fix loop order in block gemm.

* Add more static checks and use warp gemm mfma dispatcher.

* Add default scheduler block gemm.

* Remove logging in example.

b6bcd76d

[CK_TILE] fused-moe first version (#1634) · 440e28b0

carlushuang authored Nov 26, 2024



* moe pipeline

* update code

* compile OK

* update

* update cpu reference

* update pipeline_gemm0

* compiler ok

* update pipeline

* rename to ex pipeline

* block-asm

* update

* update

* update first gemm ok

* compute correct

* update file structure

* update README

* update

* update

* update code

* update API

* return unsupported case

* add comment

* update readme

* update

* uncomment

* update

* fix build err

---------
Co-authored-by: valarLip <340077269@qq.com>

440e28b0

25 Nov, 2024 3 commits

[CK_TILE] Fix fMHA fwd MakeKargs() compilation errors (#1689) · 645fe812

Po Yen Chen authored Nov 25, 2024



* Fix mis-matched tuple<> elem types

* Rename MakeKargs() as MakeKargsImpl()

---------
Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>

645fe812

[CK_TILE]Moe update index (#1672) · 36c7ce4e

carlushuang authored Nov 25, 2024



* update MOCK_ID for moe-sorting

* add moe-smoothquant

* update a comment

* fix format

* hot fix

* update topk in overflow case

* update comments

* update bf16 cvt

---------
Co-authored-by: valarLip <340077269@qq.com>

36c7ce4e

Change in fwd-splitkv kernel to support num_splits=1 case (#1690) · ce2bdf42

Qianfeng authored Nov 25, 2024



* Change in fwd-splitkv kernel to support num_splits=1 case

* Update in codegen fwd-splitkv to make num_splits > 1 cases pass

* Specify instance traits in dispatch

* Fix link error for fp8 kernels

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

ce2bdf42

22 Nov, 2024 1 commit

[CK_TILE] MakeKargs overloads for backward compatibility (#1681) · ff92222f

schung-amd authored Nov 22, 2024



* Add overloads for MakeKargs

Overload MakeKargs to accept std::tuple<uint64_t, uint64_t> and std::tuple<void*, void*> to preserve functionality of code currently passing in list initializers or tuples.

* Re-format files using ck_tile remod.py

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

ff92222f
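
The overloads described in this commit can be pictured with a minimal C++ sketch. Everything below is an assumption made for illustration: the `Kargs` struct, its field names, the overload bodies, and the `CheckMakeKargsOverloads` helper are hypothetical and are not the ck_tile definitions. Only the two parameter types, `std::tuple<uint64_t, uint64_t>` and `std::tuple<void*, void*>`, come from the commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Hypothetical stand-in for a kernel-argument struct (not the ck_tile type).
struct Kargs
{
    bool from_pointers;     // true when built from the void* overload
    std::uint64_t first;    // value form, element 0
    std::uint64_t second;   // value form, element 1
    const void* first_ptr;  // pointer form, element 0
    const void* second_ptr; // pointer form, element 1
};

// Overload taking a pair of 64-bit values.
inline Kargs MakeKargs(const std::tuple<std::uint64_t, std::uint64_t>& values)
{
    return Kargs{false, std::get<0>(values), std::get<1>(values), nullptr, nullptr};
}

// Overload taking a pair of raw pointers.
inline Kargs MakeKargs(const std::tuple<void*, void*>& pointers)
{
    return Kargs{true, 0, 0, std::get<0>(pointers), std::get<1>(pointers)};
}

// Small self-check: list initializers and ready-made tuples both resolve.
inline void CheckMakeKargsOverloads()
{
    std::uint64_t a = 7, b = 9;
    int x = 0, y = 0;
    void* px = &x;
    void* py = &y;

    const Kargs by_value = MakeKargs({a, b});
    assert(!by_value.from_pointers && by_value.first == 7 && by_value.second == 9);

    const Kargs by_pointer = MakeKargs({px, py});
    assert(by_pointer.from_pointers && by_pointer.first_ptr == px);

    const Kargs from_tuple = MakeKargs(std::make_tuple(a, b));
    assert(!from_tuple.from_pointers && from_tuple.first == 7);
}
```

In this sketch a call such as `MakeKargs({a, b})` with two `uint64_t` values can only match the first overload, and a call with two `void*` values can only match the second, because neither element type converts implicitly to the other.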

21 Nov, 2024 2 commits

universal streamk fp8 changes (#1665) · d6d4c278

Harisankar Sadasivan authored Nov 21, 2024



* universal streamk fp8 changes & ckprofiler instances

* revert strides to -1 and verification options

* fp8 exclusion on pre-gfx94 for universal_streamk

* PR review based revisions: permissions reverted, removed hip err checks


---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

d6d4c278

[CK_TILE] Add paged-kvcache support in group mode fmha fwd splitkv kernels (#1678) · fb1ccfa9

Po Yen Chen authored Nov 21, 2024

* Generate group mode paged-attn kernel

* Enable paged-kvcache + group mode support

* Add missing header: fused_moe.hpp

* Add comment to explain kernel arg usage

* Make error message more clear

* Add comment for confusing data member names

* Add more comment for confusing variable names

* Fix typo in option description

fb1ccfa9

18 Nov, 2024 2 commits

Add bf16 and int8 wmma gemms for Navi3x and Navi4x. (#1671) · 8aba2724

Illia Silin authored Nov 18, 2024

* add bf16 gemms for gfx11/gfx12

* reduce the input values in test_gemm

* add int8 wmma gemm instances for gfx11/gfx12

* add example gemm_wmma_int8

* fix bug in gemm_wmma_int8 test

* increase bf16 gemm test tolerance

* update the dates and clean-up commented-out instances

8aba2724

Batched GEMM Multiple D based on Universal GEMM (#1655) · 754adc70

Bartłomiej Kocot authored Nov 18, 2024



* Batched GEMM Multiple D based on Universal GEMM
Co-authored-by: Jing Zhang <jizhan@fb.com>

* CI fixes
Co-authored-by: Jing Zhang <jizhan@fb.com>

---------
Co-authored-by: Jing Zhang <jizhan@fb.com>

754adc70

14 Nov, 2024 1 commit

[Ck_tile] hot fix, fix rpcf param setting err (#1657) · c1f8d53c

feli authored Nov 14, 2024

Co-authored-by: dummycoderfe <noplydummmycoder@163.com>

c1f8d53c

13 Nov, 2024 3 commits

fix clang format (#1662) · efd92615

Illia Silin authored Nov 13, 2024

efd92615

Move checks for compatibility from Argument() to IsSupportedArgument() (#1653) · 73f02a10

Taylor Ding authored Nov 13, 2024

73f02a10

[CK TILE] Update gemm universal pipeline (#1644) · d2073569

Bartłomiej Kocot authored Nov 13, 2024

* [CK TILE] Update gemm universal pipeline

* Fixes

* fix

* Rebase

d2073569

12 Nov, 2024 1 commit

[CK Tile] Improve the Layout, Padding, and Alignment features of CK Tile GEMM (#1651) · 2b6458dd

Thomas Ning authored Nov 12, 2024

* Finished the feature

* Modified the test file

* Test case update

* address comment

* Addressed the review comment

* Fixed the CI error

2b6458dd

11 Nov, 2024 2 commits

[CK_TILE] add more stride for layernorm to support non-contiguous Tensor (#1650) · 8ef8a994

valarLip authored Nov 11, 2024

* [CK_TILE] add more stride for layernorm to support non-contiguous Tensor

* align CK coding style

* extend strides to layernorm example

* clang-format...

8ef8a994

Return nullptr when block index is invalid (#1649) · 13332998

Po Yen Chen authored Nov 11, 2024

13332998

09 Nov, 2024 1 commit

Ck tile/moe sorting (#1624) · bec6fbc6

dummycoderfe authored Nov 09, 2024



* add moe_sorting & check ok

* fix comments & typo

* Run remod.py under include/ck_tile & example/ck_tile directories

* format codes

* fix output ci check bug

* fix moe sorting readme and error commit file

* use magic div to accelerate compute

* add a loop unroll for moe lds ops

* add extblocksnel to set zeros for moebufs

* [Ck_tile] moe set zero run ok, add size check and fix ref check

* [Ck_tile]fix moe_sorting fuse set_zero remod

* [Ck_tile] change name style, fix zero buffer size err, change folder

* [Ck_tile] moe_sorting: fix name style

* [Ck_tile] moe_sorting, remove useless params in traits

* [Ck_tile] change outputtile cnt * unit_size; change output buf alloc

---------
Co-authored-by: dummycoderfe <noplydummmycoder@163.com>
Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>

bec6fbc6

08 Nov, 2024 1 commit

[Ck tile] layernorm2d fwd optimize (#1637) · 686a58a9

dummycoderfe authored Nov 08, 2024



* optimize small N case using vec io and using rcp div

* [Ck_tile] layernorm, add param to control fastdiv; change generate codes and test pass

* [Ck_tile] fix blockSize compute in Generic2dBlockShape

* [Ck_tile]fix kfastfdiv template style

* [Ck_tile] layernorm, fix style in review

---------
Co-authored-by: dummycoderfe <noplydummmycoder@163.com>

686a58a9

07 Nov, 2024 1 commit

enable compilation for generic navi targets (#1645) · 75c5bfa3

Illia Silin authored Nov 07, 2024

75c5bfa3

05 Nov, 2024 1 commit

Statically Cast Pointer Offset (#1631) · d0e3a70a

darren-amd authored Nov 05, 2024

* explicit cast ptr offset

* formatting change

d0e3a70a

02 Nov, 2024 1 commit

[CK_TILE] layernorm have more accurate residual (#1623) · cb6c5d39

carlushuang authored Nov 02, 2024



* more accurate residual

* modify comment

* Fix literal case in README.md

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

cb6c5d39

01 Nov, 2024 2 commits

[Ck_tile] smoothquant (#1617) · fbd65454

rocking authored Nov 01, 2024



* fix compile error

* fix typo of padding

* Add smoothquant op

* Add smoothquant instance library

* refine type

* add test script

* Re-generate smoothquant.hpp

* Always use 'current year' in copyright

* use Generic2dBlockShape instead

* Add vector = 8 instance back

* Find exe path automatically

* Simplify the api condition

* Remove debugging code

* update year

* Add blank line between function declaration

* explicitly cast return value to dim3

* refine return value

* Fix default warmup and repeat value

* Add comment

* refactor smoothquant cmake

* Add README

* Fix typo

---------
Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>

fbd65454

[layernorm] hot fix (#1620) · 550248de

carlushuang authored Nov 01, 2024

* hot fix ln

* some rename

550248de

31 Oct, 2024 1 commit

[CK_TILE] layernorm support fused-quant/fused-add (#1604) · c3a4800c

carlushuang authored Oct 31, 2024

* add prenorm/postnorm support, refactor using generate.py

* update README

* update README

* fix format

* update some description and fix format

* update format

* format

* use non-raw for loading

* format and update n4096

* dynamic-quant ready

* update readme

* support fused dynamic-quant

* update fused-quant, with smooth

* update README

* update args

* update some based on comment

c3a4800c

30 Oct, 2024 5 commits

Remove virtual destructors from unary ops (#1610) · 9a8a5213

Bartłomiej Kocot authored Oct 30, 2024

* Remove virtual destructors from unary ops

* Fixes

* Fixes

* clang format fixes

9a8a5213

clang-format (#1612) · 7d911154

rocking authored Oct 30, 2024

7d911154

[CK-Tile] Universal gemm memory bound pipeline (#1558) · 24d996aa

Adam Osewski authored Oct 30, 2024

* CK-Tile GEMM with memory bound pipeline.

* Memory bound gemm pipeline.

* Fix not closed namespace.

* Block gemm mem pipeline draft.

* Do not use ck_tile:: within ck_tile namespace.

* Refactoring & Move Layout info to pipeline problem.

* Get hot loop and TailNum information before launching kernel.

* Fixes in pipeline.

* Add comment to load_tile_raw and change variable naming style.

* Few small changes & formatting.

* Do not use macro.

* Add gtests.

* Use AccDataType for Output of MFMA instruction.

* Formatting.

* Refactor gemm examples.

* Switch over to current block gemm.

* Use currently available pipeline policy.

* Refactoring and review comments.

* Fixes after merge.

* Add missing include.

* Add load tile overload which accepts output tensor as parameter.

* This gives 8% perf boost at the cost of using more registers.

* Rename example.

* Small changes.

* Fix compilation err and lower K.

* Support different layouts for A/B

* Fix vector size for different layouts.

* Rename Alignment into VectorSize

* Unblock tests.

24d996aa

[Ck tile] support rmsnorm and related fusion (#1605) · 3d609534

rocking authored Oct 30, 2024

* Add reduce2d new api

* Prevent users from using cross-warp reduction

* Fix bug of std calculation

* Add rmsnorm2d

* Add rmsnorm small example

* Remove static assert to prevent compile fail

* Add script to test performance and correctness

* Add missing cmake change

* refine naming

* refine example of rmsnorm

* Fix bug of rmsnorm

* Refine naming

* Fix cmake

* clang format

* Refine pipeline name

* Add add_rmsnorm2d_rdquant kernel

* Add reduce op

* host verification

* Fix bug of one pass pipeline

* Refine tile size

* Add two pass pipeline

* Rename two pass to three pass

* Fix bug of kSaveX == false

* Add instance library

* Add test script

* Fix bug of x verification

* Add save_x to trait

* Add README

* Move reduce2d into reduce folder

* Fix bug of welford when number of m warp > 1

* remove redundant comment

* 1. move 06_rmsnorm2d to 10_rmsnorm2d
2. move 07_add_rmsnorm2d_rdquant to 11_add_rmsnorm2d_rdquant

* clang format and add missing header

* Add host validation of add + layernorm2d + rsquant

* Revert "Add host validation of add + layernorm2d + rsquant"

This reverts commit 936cb457978b928b90eff89a08fcdb7dc8bbed67.

* Remove deprecated flag

3d609534

[CK_TILE] Add fmha fwd headdim96 support (#1608) · 86322218

Qianfeng authored Oct 30, 2024



* Add ceil_to_qualified_tile_length()

* Rename kK0BlockLength to kQKHeaddim

* Add kSubQKHeaddim concept to support headdim96

* Fix in math.hpp to avoid using __half interfaces

* Add LdsBufferSequence instance for headdim96

* Update in fmha_fwd/fmha_fwd_splitkv codegen to support hd96 testing

* Disable hd96 instance generation in codegen fmha_fwd and fmha_fwd_splitkv to save compiling time

* Reformat one file

* Fix text alignment in fmha_fwd_splitkv.py

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

86322218

29 Oct, 2024 3 commits

[CK_TILE] add scatter_gather (#1609) · 4d7e063a

valarLip authored Oct 29, 2024

4d7e063a

[CK_TILE] add generic_permute (#1607) · 9fbd72e9

valarLip authored Oct 29, 2024

9fbd72e9

fix compilation errors for gfx12 with clang20 (#1606) · 922e42a0

Illia Silin authored Oct 28, 2024

922e42a0