Commits · 63b152d6d391c90effe435f0bc956824dadbb81f · gaoqiong / composable_kernel_ROCM

15 Oct, 2024 1 commit
- [CK_TILE] Add block universal gemm pipeline policy (#1557) · d02a92cc
  Bartłomiej Kocot authored Oct 15, 2024
```
* [CK_TILE] Add block universal gemm pipeline policy

* Fixes

* fixes2

* Fixes3

* fixeS
```
  d02a92cc
14 Oct, 2024 1 commit
- Add transpose scale amax example (#1547) · f21cda25
  Bartłomiej Kocot authored Oct 14, 2024
```
* Add transpose scale amax example

* fixes

* Tune reduce instance
```
  f21cda25
12 Oct, 2024 2 commits
- code revert · ae2d7d2b
  danyao12 authored Oct 12, 2024
  
  ae2d7d2b
- add bf16 rtne kernels · e2ea64d9
  danyao12 authored Oct 12, 2024
  
  e2ea64d9
11 Oct, 2024 2 commits
- bf16 rtz update · ee9706ab
  danyao12 authored Oct 11, 2024
  
  ee9706ab
- some kernels and related api update · 7b12d9b7
  danyao12 authored Oct 11, 2024
  
  7b12d9b7
10 Oct, 2024 2 commits

Fix default stride value (#1559) · d18fc079
Rostyslav Geyyer authored Oct 10, 2024

d18fc079

Ck tile gemm cshuffle & CK Tile GEMM restructure (#1535) · 6f27bc98

Thomas Ning authored Oct 10, 2024



* Make the cshuffle compilable

* modify the reference on gpu and cpu. Correct access of cshuffle

* fix the cpu reference code

* Complete the in tile shuffle logic

* restructure the kernel template input

* change the naming pattern of ck_tile gemm pipeline

* Re-format files using remod.py

* Solve the fmha conflict with gemm

* Comment Addressed from Carlus

---------
Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>

6f27bc98

08 Oct, 2024 4 commits

Add a gpu gemm reference kernel (#1528) · aa932445

Rostyslav Geyyer authored Oct 08, 2024



* Add a gpu gemm reference kernel

* Switch to gpu reference in gemm examples

* Remove redundant arguments

* Update all related examples

* Update more examples

* Try less threads per block

* Try even less threads per block

* Add support for all matrix layouts

* Increase block size

* Clean up

* Remove hardcoded strides

* Clean up

* Try a column-major case

* Revert back to row-major

* Run both CPU and GPU verification

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

aa932445

rename & ensure thread safety · d4de8495
danyao12 authored Oct 08, 2024

d4de8495

[CK_TILE] Update example README files & fix script compatibility issue (#1548) · 0c094daa

Po Yen Chen authored Oct 08, 2024

* Fix text alignment of ArgParser::print()

* Update example README files

* Clarify make-ck-dev.sh <arch> usage

* Only keep some of the argument from '-?' output

* Undo command line output changes in README

* Only keep existing argument on doc and update description

* Fix text alignment

* Make cmake-ck-*.sh compatible with 'sh' command

0c094daa

[CK_TILE] Simplify the codes in splitkv_combine pipeline (#1549) · 74d68e3b

Qianfeng authored Oct 08, 2024



* Simplify the codes in splitkv_combine pipeline

* Always set kPadSeqLenK=true for fmha splitkv kernels

* Change in Oacc Alignment and TileDistribution to be more adaptable to tile sizes

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

74d68e3b

07 Oct, 2024 2 commits

Fix build logic using GPU_ARCHS. (#1536) · 7d8ea5f0

Illia Silin authored Oct 07, 2024

* update build logic with GPU_ARCHS

* fix the GPU_ARCHS build for codegen

* unset GPU_TARGETS when GPU_ARCHS are set

7d8ea5f0

[Ck tile] Support layernorm one pass (#1512) · 0023f01a

rocking authored Oct 07, 2024



* Fix compile error

* Add one pass pipeline

* Extract creating tile_window to operator()

* clang format

* reduce duplicated code

* do not hardcode

* Support padding in layernorm

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

0023f01a

04 Oct, 2024 1 commit

Adding seed and offset pointer support to the philox random number generator. (#1523) · c24fae23

kylasa authored Oct 04, 2024



* Adding seed and offset pointer support to the philox random number generator.

* Separating seed and offset pointer checks with different condition statements.

* Changes include, adding support for device seed and offset pointers, union is used to store seed/offset values and device pointers to minimize device SGPRs.

* Correcting a typo in the readme file

* Re-format files using remod.py

* Use STL type for API parameters

* Use simpler struct design for drop_seed & drop_offset

* Undo unnecessary changes

* Sync kargs style for fmha_fwd.hpp/.cpp

* Use templated union to reduce code

* Use structured binding to make code more readable

---------
Co-authored-by: Sudhir Kylasa <sukylasa@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

c24fae23

01 Oct, 2024 2 commits

[CK_TILE] Change output accum tensor layout of fmha fwd split-kv & combine kernels (#1527) · a1c07e8d

Po Yen Chen authored Oct 01, 2024

* Use same layout for o_acc and o tensor

* Use better param names in partitioner

* Remove redundant kargs 'max_seqlen_q'

* Use better param names in splitkv kernel

* Add comment for additional kernel arguments

* Sync empty loop early return logics between pipelines

* Pass more arguments to cmake in scripts

* Align backslashes

* Fix wrong o_acc tensor view strides

* Change o_acc layout if o_perm=0

* Handle whole row masked via attn_bias

* Use vector width = 1 for o_acc

* Use more even split sizes

a1c07e8d

Complex Contraction CK Bilinear Example (#1061) · 4cd1dc7f

M.Emin Ozturk authored Sep 30, 2024



* complex type contraction

* bug fix

* update

* Tensor Contraction Complex Data Type is working

* 4D Kernel

* some change

* validation check in progress

* validation issue

* fp32 verification error is fixed

* fp32 and fp64 are done

* remove old files

* remove cmake files

* remove cmake files

* Readme

* img verification

* CMakeList

* number changed

---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Emin Ozturk <emin.ozturk@utah.edu>

4cd1dc7f

29 Sep, 2024 1 commit
- add bf16+a16 rtz · 871c7556
  danyao12 authored Sep 29, 2024
  
  871c7556
27 Sep, 2024 2 commits
- [CK_TILE] Image to Column kernel (#1532) · de3e3b64
  Bartłomiej Kocot authored Sep 27, 2024
```
* [CK_TILE] Image to Column kernel

* Fixes

* Vector loads and stores

* Fixes

* Fixes

* change test dir name
```
  de3e3b64
- mqa/gqa support for atomic f16 cases · 2dafca1f
  danyao12 authored Sep 27, 2024
  
  2dafca1f
26 Sep, 2024 1 commit
- [CK_TILE] Fix compiler related FA bwd issues (#1530) · 9d69a099
  Dan Yao authored Sep 27, 2024
```
* add barriers

* tail bias barriers

* adjust bf16/hd256 tol

* continue adjust bf16/hd256 tol
```
  9d69a099
23 Sep, 2024 2 commits
- add benchmark_bwd_ext · 1e01ee09
  danyao12 authored Sep 23, 2024
  
  1e01ee09
- clang-format · 36e65bdc
  danyao12 authored Sep 23, 2024
  
  36e65bdc
21 Sep, 2024 2 commits
- code revert · 2463a221
  danyao12 authored Sep 21, 2024
  
  2463a221
- no_coex update · 78f33529
  danyao12 authored Sep 21, 2024
  
  78f33529
20 Sep, 2024 1 commit
- asm code update · 8ac3eb39
  danyao12 authored Sep 20, 2024
  
  8ac3eb39
19 Sep, 2024 3 commits
- enable bwd_fp16_a16 · 67b160c5
  danyao12 authored Sep 19, 2024
  
  67b160c5
- clang-format · c3b406d6
  danyao12 authored Sep 19, 2024
  
  c3b406d6
- add traits · 5ab137f4
  danyao12 authored Sep 19, 2024
  
  5ab137f4
18 Sep, 2024 3 commits
- Ck tile gemm padding dim (#1516) · 694c3001
  Thomas Ning authored Sep 18, 2024
```
* Support the N dimension padding

* Finished the padding feature for different dimension of K
```
  694c3001
- code cleanup · a0491b67
  danyao12 authored Sep 18, 2024
  
  a0491b67
- tmp save · 3efb8621
  danyao12 authored Sep 18, 2024
  
  3efb8621
14 Sep, 2024 1 commit

Ck tile GPU verification sample develop & Add the CK TILE GEMM to the CI/CD test (#1505) · 844f5a17

Thomas Ning authored Sep 14, 2024



* Finished the feature of gpu verification

* Add the ck_tile_gemm test in the CI CD

* add the include of tensor_layout in reference_gemm

* Comment Addressed

* split ck_tile fmha and gemm tests into separate stages

* restructure the reference gemm

* restructure a new reference_gemm api that could read the device mem

---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>

844f5a17

13 Sep, 2024 1 commit

Customize filesystem in CK for legacy systems (#1509) · 81bc1496

Jun Liu authored Sep 13, 2024



* Legacy support: customized filesystem

* Update cmakefile for python alternative path

* fix build issues

* CK has no boost dependency

* More fixes to issues found on legacy systems

* fix clang format issue

* Check if blob is correctly generated in cmake

* fix the python issues

* add a compiler flag for codegen when using alternative python

* use target_link_options instead of target_compile_options

---------
Co-authored-by: illsilin <Illia.Silin@amd.com>

81bc1496

09 Sep, 2024 1 commit
- fix the unsupported scenario of Ali TestGemmUniversal (#1501) · cf08df6b
  Thomas Ning authored Sep 09, 2024
  
  cf08df6b
07 Sep, 2024 1 commit

Ck tile gemm example (#1488) · caacd388

Thomas Ning authored Sep 07, 2024



* Checkpoint: Finished with the tile example & kernel verification, working on the different matrix layout

* Finished the Matrix Layout feature set up. Note: Need to modify the inner block to solve the shuffle problem in the future.

* Fix: Clang Format, API fixed from fmha

* fix with better naming convention

* revert back the pipeline code of fmha

* Fixed: Addressed the comments and merge the GEMM shape of GEMM Operator and FMHA Operator to one.

* clang format with the reference_gemm file

* convert the clang format with the remod.py

* Changed the format and variable name of the kernel gemm_shape and partitioner

---------
Co-authored-by: thomasning <thomasning@banff-cyxtera-s70-4.ctr.dcgpu>

caacd388

05 Sep, 2024 4 commits
- add fmha asm api: fmha_bwd_ext · d4139c8b
  Fang.Che authored Sep 05, 2024
  
  d4139c8b
- hsaco reorder · 933ac7c7
  danyao12 authored Sep 05, 2024
  
  933ac7c7
- hsaco rename · d356c4d0
  danyao12 authored Sep 05, 2024
  
  d356c4d0
- Add gemm universal bf16 instances (#1484) · 5b10dae6
  Haocong WANG authored Sep 05, 2024
```
* revert ckprofiler change

* temp save

* Add test and test pass

* test pass

* Fix bug inside rotating buffer when tensor is not packed

* bug fix

* clang format

---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
```
  5b10dae6