Commits · 66736edb95fb9e0250a2fd23ce75001c968caa73 · gaoqiong / composable_kernel_ROCM

20 Feb, 2024 1 commit

Extend permute scale support up to 6D (#1168) · 66736edb

Bartłomiej Kocot authored Feb 20, 2024



* Extend permute scale support up to 6D

* Fixes

* Fixes

* Update profiler/README.md
Co-authored-by: Lisa <lisajdelaney@gmail.com>

* Update profiler/README.md
Co-authored-by: Lisa <lisajdelaney@gmail.com>

* Update profiler/README.md
Co-authored-by: Lisa <lisajdelaney@gmail.com>

* Update profiler/README.md
Co-authored-by: Lisa <lisajdelaney@gmail.com>

* Update profiler/README.md
Co-authored-by: Lisa <lisajdelaney@gmail.com>

* Update profiler/README.md
Co-authored-by: Lisa <lisajdelaney@gmail.com>

* Update profiler/README.md
Co-authored-by: Lisa <lisajdelaney@gmail.com>

---------
Co-authored-by: Lisa <lisajdelaney@gmail.com>

66736edb

07 Feb, 2024 1 commit

Add support for mixed-precision f16bf16_int8 gemm (#1127) · ba86eadc

jakpiase authored Feb 07, 2024

ba86eadc

25 Jan, 2024 1 commit

layernorm & groupnorm bwd gamma beta (#1133) · 28f68a5a

rocking authored Jan 25, 2024

* Add layernorm bwd gamma beta external api

* Add groupnorm external api

* Add layernorm bwd gamma beta profiler

* Add groupnorm bwd gamma beta ckProfiler

* Add layernorm & groupnorm bwd gamma beta test

* Fix groupnorm bwd gamma beta profiler bug

* Layernorm bwd weight client example

* Groupnorm bwd weight client example

* clang format

* Remove useless header

* Let inv_std be positive

* Rename to num_bytes and move this calculation outside the loop

28f68a5a

24 Jan, 2024 1 commit

Fixing most of the cppcheck errors. (#1142) · 180e5720

Illia Silin authored Jan 24, 2024

* fix cppcheck errors, first pass

* fix format

* fix returned value in examples

* add macro definitions for cppcheck

* fix the profile_gemm logic

* update the gemm profiler logic

* add more definitions to cppcheck, fix a couple more errors

* replace runtime error with message in device function

* fix a couple of int4 issues

* no return for fill function

* fix errors in data_types.hpp

* fix format

* fix few remaining errors

* fix errors in data_types.hpp

* fix last couple of errors in data_types.hpp

180e5720

22 Jan, 2024 1 commit

fixed return (#1138) · 1be47063

zjing14 authored Jan 22, 2024

1be47063

19 Jan, 2024 1 commit

[GEMM] Optimization for MI200/300. (#1135) · bb63b973

Haocong WANG authored Jan 19, 2024

* Optimize GEMM on MI200/300:
1. Add new blockwise gemm pipeline
2. Add irregular splitk instances

* clang format + typo fix

* Fix a bug

bb63b973

09 Jan, 2024 1 commit

Add an option to change the number of warm-up cycles and iterations. (#1124) · 886d9eeb

Illia Silin authored Jan 09, 2024

* allow setting the number of warmup cycles and iterations for profiler

* fix the gemm_splitk and grouped_gemm examples

886d9eeb

04 Jan, 2024 1 commit

Transpose profiler fix (#1114) · aa3e2d79

arai713 authored Jan 04, 2024



* added working example for 5D input using 1D kernel

* example with 5D input tensor and 2d kernel - not working: issues with arguments

* added updated version of 3d device op - changed descriptors/dims

* added example file to check kernel

* fixed descriptor and isSupportedArgument stride problem

* added and modified kernel for 3d - updated tids/loop

* adding some more 5d example files

* fixed some issues

* changes made for testing

* working version: fixed error in stride for A, still a bit inefficient

* cleaned up formatting/comments

* updating formatting

* more formatting fixes

* fixing cmake, adding back gpu targets in cmake script

* adding client example

* added instances for client example

* fixed errors in client example

* implemented client ex with device_elementwise.hpp and device_elementwise_3d_impl.hpp

* removed extra files

* minor formatting and naming fixes

* adding test files and profiler

* fixing minor error

* minor fix

* removed unnecessary comments, renamed files

* updated instance list for client example, added different layout example

* removing instances

* fixed error in instance generation

* remove comments

* update profiler and client example tensor layouts

* fixed errors in test/profiler

* updated vector dim access to enable vector load

* updated test/profiler files

* updated example with 1d kernel

* updating profiler

* renamed files

* disabled device op for MI300

* skip elementwise_permute_2d on gfx94x

* Update CMakeLists.txt

* fixing CMake - disabling some GPU targets

* added transpose profiler to CMake

* fixed transpose profiler errors

* fixed instances for tests/profiler

* cleaned up code in transpose profiler source code

* added some comments, updated copyright

* made function arguments const where possible

---------
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>

aa3e2d79

20 Dec, 2023 1 commit

enable compilation of INSTANCES_ONLY for Windows (#1082) · fb5bd51b

Artur Wojcik authored Dec 20, 2023



* enable compilation of INSTANCES_ONLY for Windows

* suppress ROCMChecks warnings on GoogleTests

* suppress -Wfloat-equal warning on GoogleTests

---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

fb5bd51b

18 Dec, 2023 1 commit

layernorm and groupnorm backward data (#1083) · a69aa2a1

rocking authored Dec 19, 2023

* rename folder

* Add type string

* Remove typo

* Add deviceOp to backward x

* Add comment to describe the behavior of backward normalization

* Add kernel function, prepare to implement

* implement generic kernel

* Check vector size

* Add sweep once pipeline for small reduce size

* Fix bug of KRaw_ error

* Fix bug of dx stride

* sanity check for mean and rstd

* backward x for groupnorm

* Add bwd x instance

* add layernorm 2d bwd gamma beta instances

* Change save mean var type from f32 to f16 in f16 mode

* Change the example to f16

* Add groupnorm bwd gamma beta instance

* Add groupnorm bwd x instance

* Fix naming

* Add layernorm bwd x ckprofiler

* Add groupnorm bwd x profiler

* clang format

* Rename bwd x to bwd data

* Fix bug of verification in profiler

* Add test of layernorm and groupnorm bwd data

* Add missing cmake

* Add layernorm2d bwd data

* rename fwd example

* Add groupnorm client example

* Fix typo. replace Invarient with Invariant

* Add checking before running the best instance

a69aa2a1

07 Dec, 2023 1 commit

remove incomplete transpose profiler (#1088) · 33600202

zjing14 authored Dec 07, 2023


Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

33600202

29 Nov, 2023 1 commit

Disable transpose device op for MI300 (#1050) · a2969aa8

arai713 authored Nov 29, 2023



* added working example for 5D input using 1D kernel

* example with 5D input tensor and 2d kernel - not working: issues with arguments

* added updated version of 3d device op - changed descriptors/dims

* added example file to check kernel

* fixed descriptor and isSupportedArgument stride problem

* added and modified kernel for 3d - updated tids/loop

* adding some more 5d example files

* fixed some issues

* changes made for testing

* working version: fixed error in stride for A, still a bit inefficient

* cleaned up formatting/comments

* updating formatting

* more formatting fixes

* fixing cmake, adding back gpu targets in cmake script

* adding client example

* added instances for client example

* fixed errors in client example

* implemented client ex with device_elementwise.hpp and device_elementwise_3d_impl.hpp

* removed extra files

* minor formatting and naming fixes

* adding test files and profiler

* fixing minor error

* minor fix

* removed unnecessary comments, renamed files

* updated instance list for client example, added different layout example

* removing instances

* fixed error in instance generation

* remove comments

* update profiler and client example tensor layouts

* fixed errors in test/profiler

* updated vector dim access to enable vector load

* updated test/profiler files

* updated example with 1d kernel

* updating profiler

* renamed files

* disabled device op for MI300

* skip elementwise_permute_2d on gfx94x

* Update CMakeLists.txt

* fixing CMake - disabling some GPU targets

---------
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>

a2969aa8

28 Nov, 2023 1 commit

recover default niter (#1064) · ae5e5181

zjing14 authored Nov 28, 2023

ae5e5181

17 Nov, 2023 1 commit

Improve 4k gemm perf (#1047) · e8cddfdc

zjing14 authored Nov 17, 2023



* improve 4k gemm perf

* add f8 instances

* format

---------
Co-authored-by: Jing Zhang <jizha@amd.com>

e8cddfdc

16 Nov, 2023 1 commit

[Hotfix] Remove unused profile_transpose.cpp (#1046) · e1fa0091

Chao Liu authored Nov 16, 2023

e1fa0091

14 Nov, 2023 1 commit

Introduce multiABD api and deprecate multiD (#1035) · f2398f61

Bartłomiej Kocot authored Nov 14, 2023

* Introduce multiABD api and deprecate multiD

* Replace multiD with multiABD

* Mark structures as deprecated

* Change doxygen deprecated to note to avoid warnings

f2398f61

11 Nov, 2023 1 commit

add more instances for bfp16 gemm (#1036) · 600fc000

zjing14 authored Nov 11, 2023



* add more instances for bfp16

* reduce the gemm input values to prevent round-off errors

---------
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>

600fc000

09 Nov, 2023 2 commits

Transpose 3d (#984) · 3af8c81a

arai713 authored Nov 08, 2023



* added working example for 5D input using 1D kernel

* example with 5D input tensor and 2d kernel - not working: issues with arguments

* added updated version of 3d device op - changed descriptors/dims

* added example file to check kernel

* fixed descriptor and isSupportedArgument stride problem

* added and modified kernel for 3d - updated tids/loop

* adding some more 5d example files

* fixed some issues

* changes made for testing

* working version: fixed error in stride for A, still a bit inefficient

* cleaned up formatting/comments

* updating formatting

* more formatting fixes

* fixing cmake, adding back gpu targets in cmake script

* adding client example

* added instances for client example

* fixed errors in client example

* implemented client ex with device_elementwise.hpp and device_elementwise_3d_impl.hpp

* removed extra files

* minor formatting and naming fixes

* adding test files and profiler

* fixing minor error

* minor fix

* removed unnecessary comments, renamed files

* updated instance list for client example, added different layout example

* removing instances

* fixed error in instance generation

* remove comments

* update profiler and client example tensor layouts

* fixed errors in test/profiler

* updated vector dim access to enable vector load

* updated test/profiler files

* updated example with 1d kernel

* updating profiler

* renamed files

---------
Co-authored-by: Jing Zhang <jizha@amd.com>

3af8c81a

Layernorm4d (#1022) · a3d9a2cd

rocking authored Nov 09, 2023



* Rename folder

* Add layernorm 4d fwd example

* Rename original layernorm example

* Add layernorm 4d f16 test

* Add layernorm4d_fwd client example

* Support layernorm4D in ckProfiler

* Rename groupnorm to groupnorm fwd in example

* Rename layernorm and group fwd in test

* Rename normalization to normalization_fwd (instances)

* Add fwd to DeviceNormalization

* Rename external api header

* Rename folder, because we can also add bwd in this folder

* Add fwd in layernorm and groupnorm (profiler)

* Fix compile error

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

a3d9a2cd

07 Nov, 2023 1 commit

Add Gemm instances for performance improvement (#1018) · 98fd41f5

zjing14 authored Nov 07, 2023



* improve kpad

* more tuning parameters

* f16_f8_fp16

* cut test time

* add f16_f8_fp16

* add f16_f8_f16

* testing instances for skinny cases

* format

* clean

* add fp16_f8_fp16

* clang-format

* add grouped gemm instances

* fixed profile grouped_gemm

* clean

* clean

* clean

* clean

* clean

* add missing instance func

* fixed interface

---------
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: root <root@sh5-1e707-rc06-38.mkm.dcgpu>

98fd41f5

02 Nov, 2023 1 commit

Add support for mixed precision in contraction scale and bilinear (#973) · 4ef704d8

Bartlomiej Wroblewski authored Nov 02, 2023



* Add support for mixed precision in contraction scale and bilinear (#936)

* Extract common functionality to separate files

* Reference contraction: Remove incorrect consts from type_converts

* Reference contraction: Add missing type_convert for dst value

* Reference contraction: Fix incorrect order of B matrix dimensions

* Add support for mixed precision in contraction scale and bilinear

* Move using statements from instances to a common file

* Move using statements from examples to a common file

* Fix the order of B matrix dimensions across examples and profiler

* Fix the computation of error threshold

* Make ComputeDataType an optional argument

* Include possible DataType -> ComputeDataType casting error in the threshold

* Remove commented code

* Make the ComputeDataType an optional argument in instance

---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

4ef704d8

31 Oct, 2023 1 commit

Add support for groups in Img2Col/Col2Img (#1007) · 2e824c6d

Bartłomiej Kocot authored Oct 31, 2023

* Add support for groups in Img2Col/Col2Img

* Fix interface test

* Fix interface test G to N

* Improve performance

* Change gemm layout to 3d

* Fixes

2e824c6d

28 Oct, 2023 1 commit

Fix the fp8 gemm for large tensors on MI300. (#1011) · f46a6ffa

Illia Silin authored Oct 27, 2023



* Fix the fp8 conversion

* Try clipping value before conversion

* Fix return

* Simplify with a const

* reduce the gemm input tensor values to reduce round-off error

* replace if-else with lambda

* fix syntax

---------
Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>

f46a6ffa

19 Oct, 2023 1 commit

Change 1d,2d,... to 1D,2D,... (#997) · 0abc0f87

Bartlomiej Wroblewski authored Oct 19, 2023

0abc0f87

18 Oct, 2023 2 commits

Layernorm and groupnorm support to save mean and inverse std in forward (#929) · 3696fe1c

rocking authored Oct 19, 2023

* save mean and inverse std in normalization

* Save mean and inverse std in splitK

* Vector save mean and inv std

* Modify instance for save mean and std

* simplify the layernorm example

* Save mean and std in groupnorm example

* Save mean and inv std in ckProfiler and test

* Remove compute data type from base class

* Save mean and inv std in client example

* Add changelog

* clang format

* Fix compile error

* Refine naming

* Avoid error in bf16

* revert changelog

3696fe1c

Clean DTYPES conditions in CMake (#974) · bf435140

zjing14 authored Oct 18, 2023



* Add a condition to build fp8 instances

* simplified buffer_load/store

* add bfp8/fp8

* fixed

* remove all f8/bf8 condition include folder

* fixed cmake conditions

* fixed DTYPES=fp16/bfp16

* fix

* fixed buffer_load

* fixed buffer_store

* fix

* clean example cmake files

* fixed ci

* fixed ci

---------
Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Jing Zhang <jizha@amd.com>

bf435140

17 Oct, 2023 1 commit

Add grouped conv bwd weight wmma (#985) · 16d7c4d2

Bartłomiej Kocot authored Oct 17, 2023

* Add grouped conv bwd weight wmma

* Update README, changelog, profiler

* Minor fixes

* Fix grouped conv bwd wei dl kernel

* Minor fixes

* Minor stylistic fixes

16d7c4d2

13 Oct, 2023 1 commit

Add splitk gemm fp16 @ fp16 with fp8 compute instances (#983) · fa753f27

Rostyslav Geyyer authored Oct 13, 2023

* Add ComputeType

* Update for compatibility

* Add instances

* Update profiler api

fa753f27

05 Oct, 2023 1 commit

Revert "Add support for mixed precision in contraction scale and bilinear" (#967) · 4daedf8c

Illia Silin authored Oct 05, 2023

* Revert "Add support for mixed precision in contraction scale and bilinear (#936)"

This reverts commit f0748506.

* revert commits #957 and #960

4daedf8c

04 Oct, 2023 1 commit

Add conv bwd weight fp16 comp bf8 fp8 op, instances and example (#945) · 42facfc6

Rostyslav Geyyer authored Oct 04, 2023



* Add f8 bf8 gemm example

* Add element-wise ops

* Add intrinsics

* Update reference calculation

* Add an additional type option for xdlops gemm

* Fix build process

* Add bf8 to buffer addressing

* Update blockwise op, split typeA and typeB

* Update for compatibility

* Update naming to f8->fp8

* Update naming

* Format

* Update naming (#937)

* Add a client example

* Add computetypes to device and gridwise ops

* Add instances, update instance factory

* Format

* Fix a flag

* Add ckProfiler mode

* Fix typos

* Add an example

* Add bf8 generator

* add bf8 mfma; fixed type_convert for bf8

* move verification ahead of timing

* Update reference calculation

* Fix reference

* Narrow down float init range

* Fix bf8 bf8 mfma

* Add bf8 @ fp8 mfma

* Update example

* Update instances

* Update profiler api

* Update for compatibility

* Format

* Remove extra example

* Clean up

* workaround convert

---------
Co-authored-by: Jing Zhang <jizha@amd.com>

42facfc6

29 Sep, 2023 1 commit

Add support for mixed precision in contraction scale and bilinear (#936) · f0748506

Bartlomiej Wroblewski authored Sep 29, 2023

* Extract common functionality to separate files

* Reference contraction: Remove incorrect consts from type_converts

* Reference contraction: Add missing type_convert for dst value

* Reference contraction: Fix incorrect order of B matrix dimensions

* Add support for mixed precision in contraction scale and bilinear

* Move using statements from instances to a common file

* Move using statements from examples to a common file

* Fix the order of B matrix dimensions across examples and profiler

* Fix the computation of error threshold

* Make ComputeDataType an optional argument

* Include possible DataType -> ComputeDataType casting error in the threshold

* Remove commented code

f0748506

27 Sep, 2023 1 commit

Add column to image kernel (#930) · e2243a4d

Bartłomiej Kocot authored Sep 27, 2023

* Add column to image kernel

* Minor fixes for dtypes and client examples

* Disable tests for disabled dtypes

* Disable add instances functions for disabled data types

* Minor stylistic fixes

* Revert "Disable add instances functions for disabled data types"

This reverts commit 728b8695.

* Instances reduction

* Add comments in device_column_to_image_impl

* Update changelog and Copyrights

* Improve changelog

e2243a4d

26 Sep, 2023 1 commit

Add fp8 gemm instances (#920) · 94bfa502

Rostyslav Geyyer authored Sep 26, 2023

* Add fp8 gemm instances

* Update instance naming

94bfa502

13 Sep, 2023 1 commit

fixed fp8 issues (#894) · a66d14ed

zjing14 authored Sep 12, 2023



* fixed fp8 init; and reference gemm

* Update host_tensor_generator.hpp

* fixed convert

* fixed reference gemm

* fixed comments

* fixed comments

* fixed ci

* fixed computeType

---------
Co-authored-by: Jing Zhang <jizha@amd.com>

a66d14ed

12 Sep, 2023 1 commit

Refactor f8_t, add bf8_t (#792) · 62d4af74

Rostyslav Geyyer authored Sep 12, 2023

* Refactor f8_t to add bf8_t

* Add check_err impl for f8_t

* Update fp8 test

* Format

* Revert the fix

* Update vector_type implementation

* Add bf8 test

* Add bf8, use BitInt types

* Add bf8 conversion methods

* Update type_convert for fp8/bf8

* Add check_err fp8/bf8 support

* Add subnorm fp8 tests

* Add subnorm bf8 tests

* Fix conversion

* Add bf8 cmake bindings

* Add macros to enable build with disabled fp8/bf8

* Remove is_native method

* Update flag combination for mixed precision instances

* Add more flag checks

* Add another flag to a client example

* Add type traits, decouple f8/bf8 casting

* Clean up

* Decouple fp8 and bf8 flags

* Remove more redundant flags

* Remove leftover comments

62d4af74

08 Sep, 2023 1 commit

[Navi3x] Add fp16/int8 wmma conv forward instances (#746) · 562b4cec

Haocong WANG authored Sep 08, 2023



* fix wmma gemm int8; add grouped conv int8 example

* Add int8 gemm-bilinear instances

* compile sanity check unknown

* Sanity pass + clang-format

* add int8 conv profiler instances

* solve merge conflict

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

562b4cec

05 Sep, 2023 1 commit

Add image to column kernel (#867) · 0077eeb3

Bartłomiej Kocot authored Sep 05, 2023

* Add image to column kernel

* Add instances, tests, profiler, example

* Add client example

* Several fixes of image to column

* Fix variable name in device_image_to_column_impl

* Several fixes of image to column profiler

* Fix num_btype calculation

* Make new measurements for correct bytes calculation

0077eeb3

31 Aug, 2023 1 commit

MaxPool & AvgPool bwd instances, test, ckProfiler, client example (#861) · 866377de

rocking authored Aug 31, 2023

* Add maxpool instances

* Rename index pool to max pool.

* Add maxpool bwd bf16 instances

* Add avg pool bwd instances

* Rename avgpool and maxpool to avg_pool3d and max_pool

* Add bf16 pool fwd instances

* Add max pool bwd to ckProfiler

* Add avg pool3d bwd to ckProfiler

* Add avg pool bwd test

* Fix bug of reference pool fwd (dilation)

* Fix bug of max pool bwd (dilation and initZero)

* Support bf16 compute data type

* Force compute type to be f32, because atomicAdd only supports f32

* Add max pool bwd test

* Rename folder

* Rename pool

* Add max pool bwd client example

* Add avg pool bwd client example

* Add missing workspace

* clang format

* Rename macro

* remove useless header

* remove useless layout

866377de

28 Aug, 2023 1 commit

Fp16/fp8 mixed-precision Gemm with multiply+add fusion (#865) · 31ea132a

zjing14 authored Aug 28, 2023



* add compute_type

* add multiply_add ckProfiler

* add f8_fp16 support

* clean

* clean

* fixed lds size calc

* format

---------
Co-authored-by: Jing Zhang <jizha@amd.com>

31ea132a

23 Aug, 2023 1 commit

[HotFix] add config and version files to pass on build info (#856) · c8a8385f

Jun Liu authored Aug 23, 2023

* experiment with config file

* experiment with version.h config

* add more info to version.h

* minor updates

* minor updates

* fix case where DTYPE is not used

* large amount of files but minor changes

* remove white space

* minor changes to add more MACROs

* fix cmakedefine01

* fix issue with CK internal conflict

* fix define and define value

* fix clang-format

* fix formatting issue

* experiment with cmake

* clang format v12 to be consistent with miopen

* avoid clang-format for config file

c8a8385f