1. 05 Jul, 2024 4 commits
  2. 04 Jul, 2024 1 commit
  3. 03 Jul, 2024 2 commits
  4. 27 Jun, 2024 2 commits
  5. 26 Jun, 2024 2 commits
      [CK_TILE] fmha forward split-kv + combine kernels (#1338) · 0cb2e06d
      Po Yen Chen authored
      
      
      * FA fwd dropout
      
      * FA bwd
      
      * epilogue reuse
      
      * CMakeLists update
      
      * [CK_TILE] support alibi (#1269)
      
      * add alibi support
      
      * fix code
      
      * update code based on comment
      
      * Support more hdim
      
      * fix fp8 bias
      
      * support seqlen_k=0 case
      
      * remove unused printf
      
      * fix format
      
      ---------
Co-authored-by: rocking <ChunYu.Lai@amd.com>
      
      * now fwd/bwd can build
      
      * bwd alibi
      
      * add bwd validation stream_config
      
      * update generated filenames
      
      * update bwd kernel launch
      
      * CK_TILE_HOST_DEVICE in philox
      
      * Transpose -> transpose
      
      * format
      
      * format
      
      * format
      
      * Generate the instance for FA required
      
      * format
      
      * fix error in WarpGemm
      
      * Add num_splits option and dummy split-kv api method
      
      * Generate fmha_fwd_splitkv()
      
      * Add SplitKV kernel codegen logics
      
      * Add SplitKV combine kernel codegen logics
      
      * Fix mismatched return type
      
      * Clean-up code
      
      * Replace sentinel value before storing
      
      * Fix wrong layout of LSE/LSEacc/Oacc
      
      * Format codes
      
      * Fix o_acc memory error
      
      * Fix wrong kBlockSize used in policy
      
      * Reduce # of combine kernels
      
      * Fix split-kv combine kernel name
      
      * Fix wrong LDS indexing logics
      
      * Fix wrong loop counter step logic
      
      * Undo vector size changes
      
      * Remove no-longer used field
      
* Remove inconsistent comment
      
      * Remove debug statements in example
      
      * Remove more debug statements
      
      * Add constness to local variables
      
* Clean up generate.py
      
      * Fix unstable clang-format comment
      
      * Remove unused include directive
      
      * Use shorter template parameter name
      
      * Enable non-split-kv blobs
      
      * Update license date
      
      * Print num_splits conditionally
      
      * Undo disabling data types
      
* Remove unnecessary tile size for fp8
      
      * Fix wrong pipeline args for fp8
      
      * Fix example output format
      
      * Remove more debug code in combine pipeline
      
      * Add stride kernel arguments for LSE/O acc workspace
      
      * Re-order split-kv pipeline call operator arguments
      
      * Pass LSE/O strides in kernel argument
      
      * Re-order pipeline call operator arguments
      
      * Use tensor_descriptor to locate LSEacc elements
      
      * Support providing invalid element for tensor view
      
      * Set invalid element value for LSEacc tensor view
      
      * Remove hand-written store_tile() code
      
      * Remove necessary value-overwrite logic
      
      * Add transposed lds descriptor
      
      * Support load_tile() for tile_window_with_static_lengths<>
      
      * Undo removing necessary value-overwrite logic
      
      * Use read descriptor to locate lds elements
      
      * Simplify pipeline source code
      
      * Add constraint to kMaxSplits
      
      * Default use kMaxSplits=64 in generate.py
      
      * Revert "Add constraint to kMaxSplits"
      
      This reverts commit 0a2132d758042e6fb0292f4e354909b8a4d1c118.
      
      * Revert "Default use kMaxSplits=64 in generate.py"
      
      This reverts commit c7d9c80b77320aec6559222bed7d47adcaefe4e3.
      
      * Decide alignment by the padding parameter
      
      * Remove no-longer used utility functions
      
      * Remove not-working code
      
      * Add comment & remove no-longer used code
      
      * Fix computation errors
      
      * Add heuristic to override num_splits option
      
      * Add constraint to kMaxSplits
      
      * Fix compilation error
      
      * Clean up pipeline code
      
      * Wrap pointer access as lambda function
      
      * Rename confusing methods
      
* Use kLogMaxSplits as template parameter
      
      * Finish splitkv combine kernel codegen
      
      * Update kMaxSplits limit
      
      * Use smaller kM0 for splitkv combine kernel
      
* Ignore dropout flag in splitkv pipeline
      
      * Unify flag usage
      
      * Add back flag kStoreLSE
      
      * Merge lambda calls in pipeline
      
      * Fix compilation errors
      
      * Avoid all empty splits
      
      * Always check for empty loop in splitkv pipelines
      
      * Re-order parameters
      
      * Remove redundant p_drop option check
      
      * Add traits/problem for fwd splitkv kernel
      
      * Conditionally enable uneven split boundary checks
      
      * Add comment for the splitkv traits field
      
      * Change even split criteria
      
      * Re-order statements
      
      * Refine occupancy value for hdim=128&256
      
      * Refine occupancy value for hdim=32&64
      
      * Remove redundant kernel argument
      
      * Separate fmha bwd codegen logics
      
      * Separate fmha fwd codegen logics
      
      * Remove redundant direction parameter in fwd&bwd codegen logics
      
* Support generating multiple APIs for an example
      
* Make 'api' an alias of the 'direction' option
      
      * Remove choices for the 'direction' option
      
* Use a dictionary to configure all the functions
      
      * Move fmha splitkv codegen logics to other file
      
      * Add fwd_splitkv api for tile_example_fmha_fwd
      
      ---------
      
      Co-authored-by: danyao12 <danyao12>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
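The split-kv + combine scheme this commit series implements (each split attends to one chunk of K/V and writes a partial output plus its log-sum-exp to workspace, then a combine kernel merges them) can be sketched in NumPy. This is a minimal model of the idea for a single query vector, not the CK_TILE kernels; all function and variable names are illustrative:

```python
import numpy as np

def attention_ref(q, k, v):
    """Single-pass softmax attention for one query vector (reference)."""
    s = k @ q                                  # scores, shape (seqlen_k,)
    p = np.exp(s - s.max())
    return (p / p.sum()) @ v

def fmha_fwd_splitkv(q, k, v, num_splits):
    """Phase 1: each split attends to one K/V chunk, producing a partial
    output (the O accumulator) and its log-sum-exp (the LSE accumulator)."""
    o_acc, lse_acc = [], []
    for idx in np.array_split(np.arange(len(k)), num_splits):
        s = k[idx] @ q
        m = s.max()                            # chunk-local max for stability
        p = np.exp(s - m)
        l = p.sum()
        o_acc.append((p @ v[idx]) / l)         # chunk-normalized partial output
        lse_acc.append(m + np.log(l))          # chunk log-sum-exp
    return np.stack(o_acc), np.array(lse_acc)

def fmha_fwd_splitkv_combine(o_acc, lse_acc):
    """Phase 2: weight each partial output by its share of the global
    softmax mass, exp(lse_i - lse_global), and sum."""
    m = lse_acc.max()
    w = np.exp(lse_acc - m)
    w /= w.sum()                               # equals exp(lse_i - lse_global)
    return w @ o_acc
```

Because the combine weight for split i is exactly exp(lse_i − lse), the splits can be computed independently and merged without approximation, which is what makes the two-kernel decomposition exact.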
    • Harisankar Sadasivan · 66e0e909
  6. 25 Jun, 2024 4 commits
  7. 24 Jun, 2024 2 commits
  8. 22 Jun, 2024 1 commit
  9. 20 Jun, 2024 1 commit
  10. 18 Jun, 2024 1 commit
  11. 12 Jun, 2024 1 commit
  12. 10 Jun, 2024 1 commit
  13. 05 Jun, 2024 1 commit
      Add a scale op, related instances and examples (#1242) · cb0645be
      Rostyslav Geyyer authored
      
      
      * Add a scale op
      
      * Update the element op
      
      * Add instances
      
      * Add an example
      
      * Add a client example
      
      * Add a flag check
      
      * Revert flag check addition
      
      * Fix flag check
      
      * Update d strides in example
      
      * Update d strides in client example
      
      * Apply suggestions from code review
      
      Update copyright header
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Move the example
      
      * Move the client example
      
      * Update element op
      
      * Update example with the new element op
      
      * Add scalar layout
      
      * Update example
      
      * Update kernel for scalar Ds
      
      * Revert kernel changes
      
      * Update element op
      
      * Update example to use scales' pointers
      
      * Format
      
      * Update instances
      
      * Update client example
      
      * Move element op to unary elements
      
      * Update element op to work with values instead of pointers
      
      * Update instances to take element op as an argument
      
      * Update examples to use random scale values
      
      ---------
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
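The scale op this commit adds ends up taking its scale by value and being passed into the instance as an epilogue element op. The pattern can be sketched as below; this is only the shape of the interface, not CK's actual functor signature:

```python
import numpy as np

def scale_op(scale):
    """Elementwise epilogue op: multiply each output value by a scale
    captured by value (illustrative stand-in for the CK element op)."""
    def op(c):
        return scale * c
    return op

def gemm_with_element_op(a, b, element_op):
    """Run a GEMM, then apply the element op to the fp32 accumulator."""
    acc = a.astype(np.float32) @ b.astype(np.float32)
    return element_op(acc)
```

Passing the value rather than a pointer (as the later commits in the series switch to) keeps the op self-contained, so instances can take it as a plain argument.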
  14. 04 Jun, 2024 1 commit
      CK Tile FA Training kernels (#1286) · 2cab8d39
      Dan Yao authored
      
      
      * FA fwd dropout
      
      * FA bwd
      
      * epilogue reuse
      
      * CMakeLists update
      
      * [CK_TILE] support alibi (#1269)
      
      * add alibi support
      
      * fix code
      
      * update code based on comment
      
      * Support more hdim
      
      * fix fp8 bias
      
      * support seqlen_k=0 case
      
      * remove unused printf
      
      * fix format
      
      ---------
Co-authored-by: rocking <ChunYu.Lai@amd.com>
      
      * now fwd/bwd can build
      
      * bwd alibi
      
      * add bwd validation stream_config
      
      * update generated filenames
      
      * update bwd kernel launch
      
      * CK_TILE_HOST_DEVICE in philox
      
      * Transpose -> transpose
      
      * format
      
      * format
      
      * format
      
      * Generate the instance for FA required
      
      * format
      
      * fix error in WarpGemm
      
      ---------
      
      Co-authored-by: danyao12 <danyao12>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
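The alibi support folded into these training kernels adds a per-head linear bias to the attention scores. One common formulation (the standard ALiBi recipe for power-of-two head counts; FA-style kernels compute the bias on the fly per score tile rather than materializing it) looks like this:

```python
import numpy as np

def alibi_slopes(num_heads):
    """Per-head geometric slopes 2^(-8/H), 2^(-16/H), ..., for a
    power-of-two head count H."""
    return np.array([2.0 ** (-8.0 * (h + 1) / num_heads)
                     for h in range(num_heads)])

def alibi_bias(slope, seqlen_q, seqlen_k):
    """Linear attention bias -slope * |i - j|, added to the score
    matrix before softmax (one common variant)."""
    i = np.arange(seqlen_q)[:, None]
    j = np.arange(seqlen_k)[None, :]
    return -slope * np.abs(i - j)
```

Since the bias depends only on the (i, j) position and a scalar slope, supporting it inside a tiled kernel costs no extra global memory traffic.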
  15. 01 Jun, 2024 1 commit
      Post-merge fix of PR 1300 (#1313) · 6fb1f4e0
      zjing14 authored
      * add f8 gemm with multiD for both row/col wise
      
      * change compute_type to fp8
      
      * changed tuning parameters in the example
      
      * add rcr example
      
      * post-merge fix
      
      * fix
      
      * reduce init range
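The f8 gemm with row/col-wise scales that PR 1300 introduced (and this commit fixes up) follows a standard pattern: quantize A with a per-row scale and B with a per-column scale, run the GEMM on the quantized operands, and apply both scales in the epilogue as extra "D" inputs. A hedged NumPy sketch, with no actual fp8 rounding (so the round trip is exact) and illustrative names:

```python
import numpy as np

def quant_rowwise(a, fp8_max=448.0):
    """Scale each row of A into an e4m3-style range; returns the
    scaled tensor and its per-row scale (illustration only)."""
    scale = np.abs(a).max(axis=1, keepdims=True) / fp8_max
    return a / scale, scale

def quant_colwise(b, fp8_max=448.0):
    """Per-column counterpart for the B operand."""
    scale = np.abs(b).max(axis=0, keepdims=True) / fp8_max
    return b / scale, scale

def f8_gemm_multid(aq, bq, a_scale, b_scale):
    """GEMM on quantized operands; the row and column scales come back
    in through the multi-D epilogue: E = (Aq @ Bq) * a_scale * b_scale."""
    return (aq @ bq) * a_scale * b_scale
```

Broadcasting does the bookkeeping: `a_scale` has shape (M, 1) and `b_scale` shape (1, N), so each output element picks up exactly its row and column scale.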
  16. 28 May, 2024 2 commits
      add f8 gemm multiD with both row/col wise scale (#1300) · 80db62f0
      zjing14 authored
      * add f8 gemm with multiD for both row/col wise
      
      * change compute_type to fp8
      
      * changed tuning parameters in the example
      
      * add rcr example
      [CK_TILE] support group from cmdline (#1295) · 5055b3bd
      carlushuang authored
      * support cmdline seqlen decode
      
      * silent print
      
      * update readme
      
      * update kernel launch 3d
      
      * update tile partitioner
      
      * fix spill for bf16
      
      * modify based on comment
      
      * modify payload_t
      
      * fix bug for alibi mode
      
      * fix alibi test err
      
      * refactor kernel launch, support select timer
      
      * add missing file
      
      * remove useless code
      
      * add some comments
  17. 22 May, 2024 1 commit
  18. 11 May, 2024 1 commit
  19. 10 May, 2024 2 commits
  20. 09 May, 2024 1 commit
  21. 07 May, 2024 1 commit
  22. 30 Apr, 2024 1 commit
  23. 26 Apr, 2024 2 commits
      [GEMM] UniversalGemm update (#1262) · 764164b4
      Haocong WANG authored
      
      
      * Add bf16 instances
      
      * Add bf16 gemm universal example
      
      * tempsave
      
      * Add guard to navi compilation
      
* workaround for a specific mixed gemm instance (bring it back once the compiler fix is uploaded)
      
* fix formatting issue in condition statement
      
      * solve conflict
      
      ---------
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
      bf16A_Int8B with fastgelu/bias (#1264) · 0d0150db
      zjing14 authored
      * changed the copy function to v7r2
      
      * adding multi_abd
      
      * in-progress
      
      * add post-load oob check
      
      * debugging
      
      * adjust instances
      
      * add run_lds
      
* add elementwise_op
      
      * replace multi_abd_device with v3
      
      * clean up
      
      * clean
      
      * clean
      
      * Added LDSType
      
      * profiling
      
      * adjust oobcheck
      
      * add missing file
      
      * refactor
      
      * clean
      
      * add examples
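The bf16A/int8B kernel above is a weight-only-quantization GEMM with a fused bias + FastGelu epilogue. A minimal sketch of the data flow, assuming a per-column dequantization scale for B and the common tanh-approximated GELU (names are illustrative, not the CK operator names):

```python
import numpy as np

def fast_gelu(x):
    """tanh-approximated GELU, one common 'FastGelu' formulation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x ** 3)))

def bf16a_i8b_gemm(a, b_i8, b_scale, bias):
    """Dequantize the int8 B operand with a per-column scale, GEMM in
    fp32, then apply the bias + FastGelu epilogue in one pass."""
    b = b_i8.astype(np.float32) * b_scale[None, :]
    return fast_gelu(a.astype(np.float32) @ b + bias[None, :])
```

Fusing the dequantization and activation into the epilogue is what the multi-ABD machinery referenced in these bullets enables: the int8 weights stay packed in memory and are expanded only in registers.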
  24. 25 Apr, 2024 1 commit
      Grouped GEMM Multiple D tile loop. (#1247) · b4032629
      Adam Osewski authored
* Overload output stream operator for LoopScheduler and PipelineVersion
      
      * Add Run overload accepting grid descriptors MK.
      
      * Add __device__ keyword for CalculateGridSize
      
      * Create device op GroupedGemmMultipleD
      
      * Add GroupedGemm MultipleD Tile Loop implementation.
      
      * Add an example for GroupedGemm MultipleD tile loop.
      
      * Device Op GroupedGEMMTileLoop.
      
* Bunch of small changes in example.
      
      * CkProfiler
      
      * Remove unused tparam.
      
      * Fix include statement.
      
      * Fix output stream overloads.
      
* Do not make descriptors and check validity until we find the group.
      
      * Fix gemm desc initialization.
      
      * Revert device op
      
      * Fix compilation for DTYPES=FP16
      
* Validate tensor transfer parameters.
      
      * Validate on host only NK dims if M is not known.
      
      * Fix bug.
      
      * A convenient debug func for selecting threads.
      
      * Fix has main k block loop bug.
      
      * Make sure that b2c has up to date tile offset.
      
      * Output stream operator for Sequence type.
      
      * Cmake file formatting.
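The "tile loop" in this grouped GEMM flattens the output tiles of every group into one index space; each persistent workgroup strides over flat tile ids and, as the commit above puts it, builds descriptors only once it finds the owning group. A small sketch of that scheduling logic (pure host-side illustration; names are made up):

```python
def group_tile_counts(problems, tile_m=128, tile_n=128):
    """Number of output tiles each (M, N) problem contributes."""
    return [((m + tile_m - 1) // tile_m) * ((n + tile_n - 1) // tile_n)
            for m, n in problems]

def find_group(flat_tile_id, counts):
    """Walk the per-group tile counts until the flat id falls inside a
    group; only then would a kernel build that group's descriptors and
    validate them. Returns (group index, tile index within group)."""
    for g, c in enumerate(counts):
        if flat_tile_id < c:
            return g, flat_tile_id
        flat_tile_id -= c
    raise IndexError("tile id out of range")
```

A persistent workgroup `wg` out of `num_wgs` would then loop `for tid in range(wg, sum(counts), num_wgs)`, calling `find_group(tid, counts)` for each tile it processes.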
  25. 19 Apr, 2024 1 commit
      Refactor elementwise kernels (#1222) · ad1597c4
      Bartłomiej Kocot authored
      * Refactor elementwise kernels
      
      * Instances fixes
      
      * Fix cmake
      
      * Fix max pool bwd test
      
      * Update two stage gemm split k
      
* Restore elementwise scale for hiptensor backward compatibility
      
      * Fix Acc data type check in conv fwd multiple abd
      
      * Disable conv fp64 fwd example
      
      * Update grouped conv weight multi d
  26. 18 Apr, 2024 1 commit
  27. 16 Apr, 2024 1 commit
      Added Multi_ABD support into Gemm and GroupedGemmFixedNK (#978) · 12865fbf
      zjing14 authored
      
      
      * added an example grouped_gemm_multi_abd
      
      * fixed ci
      
      * add setElementwiseOp
      
      * changed API
      
      * clean code: add multiA into example
      
      * fixed v7r2 copy
      
      * add transpose
      
      * clean
      
      * fixed vector_load check
      
      * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * add reduce
      
      * testing
      
      * add example_b16_i8
      
      * refactor example
      
      * clean
      
* add MPadding
      
      * disable reduce for kbatch = 1
      
* separate reduce device op
      
      * add reduce op
      
      * add guard for workspace_size
      
      * add instances
      
      * format
      
      * fixed
      
      * add client example
      
      * add a colmajor
      
      * add instances
      
      * Update cmake-ck-dev.sh
      
      * Update profile_gemm_splitk.cpp
      
      * Update gridwise_gemm_xdlops_v2r4r2.hpp
      
      * format
      
      * Update profile_gemm_splitk.cpp
      
      * fixed
      
      * fixed
      
      * adjust test
      
      * adjust precision loss
      
      * adjust test
      
      * fixed
      
      * add bf16_i8 scale bias
      
      * fixed scale
      
      * fixed scale elementwise_op
      
      * revert contraction deviceop changes
      
      * fixed
      
      * Add AddFastGelu
      
      * Revert "Merge branch 'jizhan/gemm_splitk_reduce' into grouped_gemm_multi_abd_fixed_nk_example"
      
      This reverts commit 3b5d001efd74335b38dcb7d8c8877580b49d23a4, reversing
      changes made to 943199a99191661c5597c51ca8371a90bf57837e.
      
      * add Scales into elementwise
      
      * add gemm_multi_abd client example
      
      * add client examples
      
      * add rcr and crr
      
      * add grouped gemm client example
      
      * add grouped gemm client example
      
      * add instance for rcr crr
      
      * format
      
      * fixed
      
      * fixed cmake
      
      * fixed
      
      * fixed client_example
      
      * format
      
      * fixed contraction isSupport
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update device_reduce_threadwise.hpp
      
      * clean
      
      * Fixes
      
      * Fix example
      
      ---------
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
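The Multi_ABD interface this PR adds generalizes the GEMM to take several A, B and D tensors, each combined by a user-supplied element op: E = cde_op(a_op(A0, A1, ...) @ b_op(B0, ...), D0, ...). A hedged sketch of that call shape, with made-up op names matching the bf16/int8 scale-bias use case the bullets mention:

```python
import numpy as np

def gemm_multi_abd(as_, bs, ds, a_op, b_op, cde_op):
    """Fuse elementwise combination of several A, B and D tensors
    around one GEMM (interface shape only, not the CK device op)."""
    return cde_op(a_op(*as_) @ b_op(*bs), *ds)

def b_scale(b_i8, scale):
    """Example B op: dequantize int8 weights with a per-column scale."""
    return b_i8.astype(np.float32) * scale[None, :]

def add_bias(c, bias):
    """Example CDE op: add a bias row (the single D tensor)."""
    return c + bias[None, :]
```

With `a_op` as the identity, this reproduces the bf16A/int8B scale-bias GEMM from the client examples; swapping `add_bias` for a gelu-then-bias op covers the fused-activation variants.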