Commits · bfecc19352140d107130d518428a0c51595cbd0c · gaoqiong / composable_kernel

27 Nov, 2023 1 commit
- Fix cluster length arrange order in fp16 GEMM example (#1055) · bfecc193
  Bartlomiej Wroblewski authored Nov 27, 2023
25 Nov, 2023 1 commit

Add basic support for direct loads from global to LDS (#999) · 627054b9

Bartlomiej Wroblewski authored Nov 25, 2023

* Add basic support for direct loads from global to LDS

* Clean the code and comments

* Add support for fp16

* Add comments

* Add check for thread cluster lengths

* Align non-direct-load fp16 example

* Small fixes

* Extend IsSupported to check for supported GPU gens

* Build examples only on the supported HW

* Do not throw when instance not supported in 04 example

* Review: Apply review suggestions

* Review: small fix

* Review: small fix

17 Nov, 2023 1 commit

Improve 4k gemm perf (#1047) · e8cddfdc

zjing14 authored Nov 17, 2023

* improve 4k gemm perf

* add f8 instances

* format

---------
Co-authored-by: Jing Zhang <jizha@amd.com>

14 Nov, 2023 1 commit

Introduce multiABD api and deprecate multiD (#1035) · f2398f61

Bartłomiej Kocot authored Nov 14, 2023

* Introduce multiABD api and deprecate multiD

* Replace multiD with multiABD

* Mark structures as deprecated

* Change doxygen deprecated to note to avoid warnings

13 Nov, 2023 1 commit

Hip tensor permute (#1002) · 454cf7bd

arai713 authored Nov 13, 2023

* adding files for F32 example

* adding functioning implementation with scalar multiplication and unary operator support

* added fp16 type check in unary square

* updating scalar multiplication as an operator

* functioning version with scalar operator

* changing strides for col major

* updated column major implementation

* working column major implementation

* cleaned up comments, rearranged/renamed files

10 Nov, 2023 2 commits

Support multi AB for grouped conv fwd xdl (#1027) · 49e52bb3

Bartłomiej Kocot authored Nov 10, 2023

* Support multi AB for grouped conv fwd xdl

* Add instances

* Add client example

* Add example

* Add interface test

* Minor fixes

* Comment fixes

* Fixes

* Reference fix

* Test xdl fixes

* Improve multi_ab interface test

Backward of gamma and beta for layernorm and groupnorm (#1013) · 1db75603

rocking authored Nov 10, 2023

* Add layernorm backward reference code

* Add groupnorm backward reference code

* Add example

* clang format

* Fix bug of reference layernorm and groupnorm

* Fix naming

* Refine naming

* Add device op for normalization bwd gamma and beta

* Refine template parameter

* Add bwd gamma & beta of kernel

* 1. Add groupnorm example
2. Refine layernorm naming

* Narrow down the static check for performance

* Refine variable name

09 Nov, 2023 2 commits

Transpose 3d (#984) · 3af8c81a

arai713 authored Nov 08, 2023

* added working example for 5D input using 1D kernel

* example with 5D input tensor and 2d kernel - not working: issues with arguments

* added updated version of 3d device op - changed descriptors/dims

* added example file to check kernel

* fixed descriptor and isSupportedArgument stride problem

* added and modified kernel for 3d - updated tids/loop

* adding some more 5d example files

* fixed some issues

* changes made for testing

* working version: fixed error in stride for A, still a bit inefficient

* cleaned up formatting/comments

* updating formatting

* more formatting fixes

* fixing cmake, adding back gpu targets in cmake script

* adding client example

* added instances for client example

* fixed errors in client example

* implemented client ex with device_elementwise.hpp and device_elementwise_3d_impl.hpp

* removed extra files

* minor formatting and naming fixes

* adding test files and profiler

* fixing minor error

* minor fix

* removed unnecessary comments, renamed files

* updated instance list for client example, added different layout example

* removing instances

* fixed error in instance generation

* remove comments

* update profiler and client example tensor layouts

* fixed errors in test/profiler

* updated vector dim access to enable vector load

* updated test/profiler files

* updated example with 1d kernel

* updating profiler

* renamed files

---------
Co-authored-by: Jing Zhang <jizha@amd.com>

Layernorm4d (#1022) · a3d9a2cd

rocking authored Nov 09, 2023

* Rename folder

* Add layernorm 4d fwd example

* Rename original layernorm example

* Add layernorm 4d f16 test

* Add layernorm4d_fwd client example

* Support layernorm4D in ckProfiler

* Rename groupnorm to groupnorm fwd in example

* Rename layernorm and group fwd in test

* Rename normalization to normalization_fwd (instances)

* Add fwd to DeviceNormalization

* Rename external api header

* Rename folder, because we can also add bwd in this folder

* Add fwd in layernorm and groupnorm (profiler)

* Fix compile error

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

03 Nov, 2023 1 commit
- Add missing ComputeDatatype in contraction_multi_ABD_xdl_fp16 (#1024) · 16eb824c
  Bartlomiej Wroblewski authored Nov 03, 2023
02 Nov, 2023 1 commit

Add support for mixed precision in contraction scale and bilinear (#973) · 4ef704d8

Bartlomiej Wroblewski authored Nov 02, 2023

* Add support for mixed precision in contraction scale and bilinear (#936)

* Extract common functionality to separate files

* Reference contraction: Remove incorrect consts from type_converts

* Reference contraction: Add missing type_convert for dst value

* Reference contraction: Fix incorrect order of B matrix dimensions

* Add support for mixed precision in contraction scale and bilinear

* Move using statements from instances to a common file

* Move using statements from examples to a common file

* Fix the order of B matrix dimensions across examples and profiler

* Fix the computation of error threshold

* Make ComputeDataType an optional argument

* Include possible DataType -> ComputeDataType casting error in the threshold

* Remove commented code

* Make the ComputeDataType an optional argument in instance

---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

01 Nov, 2023 1 commit
- Add ScaleAddScaleAddRelu post op for conv fwd (#1006) · f27ea94e
  Bartłomiej Kocot authored Nov 02, 2023
```
* Add ScaleAddScaleAddRelu post op for conv fwd

* Fixes

* Fix instance file name

* Minor fix
```
31 Oct, 2023 1 commit

Add support for groups in Img2Col/Col2Img (#1007) · 2e824c6d

Bartłomiej Kocot authored Oct 31, 2023

* Add support for groups in Img2Col/Col2Img

* Fix interface test

* Fix interface test G to N

* Improve performance

* Change gemm layout to 3d

* Fixes

28 Oct, 2023 1 commit

Fix the fp8 gemm for large tensors on MI300. (#1011) · f46a6ffa

Illia Silin authored Oct 27, 2023

* Fix the fp8 conversion

* Try clipping value before conversion

* Fix return

* Simplify with a const

* reduce the gemm input tensor values to reduce round-off error

* replace if-else with lambda

* fix syntax

---------
Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>

21 Oct, 2023 1 commit

Fix cmake dtype check (#989) · ac0e0067

Bartłomiej Kocot authored Oct 21, 2023

* Fix instances dtype check

* Fix source dtypes selector for examples and tests

* Sync with new cmakefile changes

* Remove not needed ifdefs

* Remove not needed ifdefs

20 Oct, 2023 1 commit
- Fix bf8 conversion issues (#1003) · 1fd27d52
  Rostyslav Geyyer authored Oct 20, 2023
```
* Fix the conversion

* Add bf8 functionality

* Enable example on MI200 as well
```
19 Oct, 2023 2 commits
- Extend available elementwise operations with conv examples (#995) · 82f3a835
  Bartłomiej Kocot authored Oct 19, 2023
```
* Extend available elementwise operations with conv examples

* Fixes

* Remove not needed convert

* Update CMakeFile and dir name
```
- Change 1d,2d,... to 1D,2D,... (#997) · 0abc0f87
  Bartlomiej Wroblewski authored Oct 19, 2023
18 Oct, 2023 3 commits

Layernorm and groupnorm support to save mean and inverse std in forward (#929) · 3696fe1c

rocking authored Oct 19, 2023

* save mean and inverse std in normalization

* Save mean and inverse std in splitK

* Vector save mean and inv std

* Modify instance for save mean and std

* simplify the layernorm example

* Save mean and std in groupnorm example

* Save mean and inv std in ckProfiler and test

* Remove compute data type from base class

* Save mean and inv std in client example

* Add changelog

* clang format

* Fix compile error

* Refine naming

* Avoid error in bf16

* revert changelog

Clean DTYPES conditions in CMake (#974) · bf435140

zjing14 authored Oct 18, 2023

* Add a condition to build fp8 instances

* simplified buffer_load/store

* add bfp8/fp8

* fixed

* remove all f8/bf8 condition include folder

* fixed cmake conditions

* fixed DTYPES=fp16/bfp16

* fix

* fixed buffer_load

* fixed buffer_store

* fix

* clean example cmake files

* fixed ci

* fixed ci

---------
Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Jing Zhang <jizha@amd.com>

Add contraction_multi_abd (#972) · 1cc36ba5

zjing14 authored Oct 17, 2023

* add gridwise_multi_abd

* move element_op into RunRead

* merge element_wise op with data read

* add multiABD example

* allow packed elementwise_op

* changed example

* clean

* clean

* add is_detected

* fix

* minor fix

* add scaleAdd_vec4 example

* init commit for contraction_multi_ABD

* add examples

* add examples of multiA and broadcast

* update example

* fixed comments

* Update cmake-ck-dev.sh

* Update cmake-ck-dev.sh

* Add comments into the example

* Update CMakeLists.txt

---------
Co-authored-by: Jing Zhang <jizha@amd.com>

17 Oct, 2023 1 commit

Add grouped conv bwd weight wmma (#985) · 16d7c4d2

Bartłomiej Kocot authored Oct 17, 2023

* Add grouped conv bwd weight wmma

* Update README, changelog, profiler

* Minor fixes

* Fix grouped conv bwd wei dl kernel

* Minor fixes

* Minor stylistic fixes

13 Oct, 2023 1 commit

add vector_type support into thread_copy_v3r1 (#969) · 2ce9b56c

zjing14 authored Oct 13, 2023

* add vector_type support into thread_copy_v3r1

* remove unnecessary type_convert

* fixed datatype

* fixed dataType

* changed API with is_packx_invocable

* changed example

* add missing cmake file

* fixed ci

* fixed cmake

---------
Co-authored-by: Jing Zhang <jizha@amd.com>

10 Oct, 2023 1 commit

Fixed f8_gemm NaN (#975) · ac9595a9

zjing14 authored Oct 10, 2023

* workaround nan problem by changing output to fp16

* enable f8/bf8 gemm tests on MI200

* workaround f16 to f8 conversion

---------
Co-authored-by: Jing Zhang <jizha@amd.com>

05 Oct, 2023 3 commits
- Replace CMake `return` from later CMake (#970) · 59136091
  Lauren Wrubleski authored Oct 05, 2023
- Revert "Add support for mixed precision in contraction scale and bilinear" (#967) · 4daedf8c
  Illia Silin authored Oct 05, 2023
```
* Revert "Add support for mixed precision in contraction scale and bilinear (#936)"

This reverts commit f0748506.

* revert commits #957 and #960
```
- remove example 60 (#963) · 570ff3dd
  zjing14 authored Oct 05, 2023
```
Co-authored-by: Jing Zhang <jizha@amd.com>
```
04 Oct, 2023 1 commit

Add conv bwd weight fp16 comp bf8 fp8 op, instances and example (#945) · 42facfc6

Rostyslav Geyyer authored Oct 04, 2023

* Add f8 bf8 gemm example

* Add element-wise ops

* Add intrinsics

* Update reference calculation

* Add an additional type option for xdlops gemm

* Fix build process

* Add bf8 to buffer addressing

* Update blockwise op, split typeA and typeB

* Update for compatibility

* Update naming to f8->fp8

* Update naming

* Format

* Update naming (#937)

* Add a client example

* Add computetypes to device and gridwise ops

* Add instances, update instance factory

* Format

* Fix a flag

* Add ckProfiler mode

* Fix typos

* Add an example

* Add bf8 generator

* add bf8 mfma; fixed type_convert for bf8

* move verification ahead of timing

* Update reference calculation

* Fix reference

* Narrow down float init range

* Fix bf8 bf8 mfma

* Add bf8 @ fp8 mfma

* Update example

* Update instances

* Update profiler api

* Update for compatibility

* Format

* Remove extra example

* Clean up

* workaround convert

---------
Co-authored-by: Jing Zhang <jizha@amd.com>

03 Oct, 2023 2 commits
- changed test for grouped_gemm to be random (#959) · 5311d1b3
  zjing14 authored Oct 03, 2023
```
Co-authored-by: Jing Zhang <jizha@amd.com>
```
- Fixed contraction issues (#960) · aa46039f
  zjing14 authored Oct 03, 2023
```
* add missing ComputeType

* fixed

* Update cmake-ck-dev.sh

---------
Co-authored-by: Jing Zhang <jizha@amd.com>
```
02 Oct, 2023 3 commits

Add fp8 @ bf8 gemm support and example (#933) · bd09b5c5

Rostyslav Geyyer authored Oct 02, 2023

* Add f8 bf8 gemm example

* Add element-wise ops

* Add intrinsics

* Update reference calculation

* Add an additional type option for xdlops gemm

* Fix build process

* Add bf8 to buffer addressing

* Update blockwise op, split typeA and typeB

* Update for compatibility

* Update naming to f8->fp8

* Update naming

* Format

get rid of gfx900/906, set rocm5.7 as default (#958) · 59dbb01f
Illia Silin authored Oct 02, 2023

Contraction multi abd (#957) · 9d58c421

zjing14 authored Oct 02, 2023

* add gridwise_multi_abd

* move element_op into RunRead

* merge element_wise op with data read

* add multiABD example

* allow packed elementwise_op

* changed example

* clean

* clean

* add is_detected

* fix

* minor fix

* add scaleAdd_vec4 example

* init commit for contraction_multi_ABD

* add examples

* add examples of multiA and broadcast

* update example

* fixed comments

* Update cmake-ck-dev.sh

* Update cmake-ck-dev.sh

* Add comments into the example

---------
Co-authored-by: Jing Zhang <jizha@amd.com>

29 Sep, 2023 1 commit

Add support for mixed precision in contraction scale and bilinear (#936) · f0748506

Bartlomiej Wroblewski authored Sep 29, 2023

* Extract common functionality to separate files

* Reference contraction: Remove incorrect consts from type_converts

* Reference contraction: Add missing type_convert for dst value

* Reference contraction: Fix incorrect order of B matrix dimensions

* Add support for mixed precision in contraction scale and bilinear

* Move using statements from instances to a common file

* Move using statements from examples to a common file

* Fix the order of B matrix dimensions across examples and profiler

* Fix the computation of error threshold

* Make ComputeDataType an optional argument

* Include possible DataType -> ComputeDataType casting error in the threshold

* Remove commented code

28 Sep, 2023 1 commit

Add grouped conv bwd data wmma (#950) · cb538740

Bartłomiej Kocot authored Sep 28, 2023

* Add grouped conv bwd data wmma

* Fix copyrights

* Add instances with smaller NPerBlock

* Update interface test

* Minor stylistic fixes

* Minor stylistic fixes

27 Sep, 2023 2 commits

Add column to image kernel (#930) · e2243a4d

Bartłomiej Kocot authored Sep 27, 2023

* Add column to image kernel

* Minor fixes for dtypes and client examples

* Disable tests for disabled dtypes

* Disable add instances functions for disabled data types

* Minor stylistic fixes

* Revert "Disable add instances functions for disabled data types"

This reverts commit 728b8695.

* Instances reduction

* Add comments in device_column_to_image_impl

* Update changelog and Copyrights

* Improve changelog

Add multiple A/B support (#906) · 11676c7e

zjing14 authored Sep 26, 2023

* add gridwise_multi_abd

* move element_op into RunRead

* merge element_wise op with data read

* add multiABD example

* allow packed elementwise_op

* changed example

* clean

* clean

* add is_detected

* fix

* minor fix

* add scaleAdd_vec4 example

---------
Co-authored-by: Jing Zhang <jizha@amd.com>

21 Sep, 2023 1 commit

Refactoring cmake files to build data types separately. (#932) · bba085d2

Illia Silin authored Sep 20, 2023

* refactor cmake files for the tests

* refactor cmake files for examples

* fix cmake for gemm example

* fix the cmake file for all examples

* add splitting by data types in gemm_splitk instance header

* rename test to reflect only dl instances are used

* clean up CI workspace, update cmake for instances

* change the jenkinsfile syntax

* build all instances except DL on gfx11

* move workspace cleanup after stages

* clean up workspace after every stage

* isolate data types in grouped_conv_fwd header

* isolate dl instances for grouped_conv2d_fwd

* fix syntax

* fix cmake and batchnorm instances

* fix typo

* fix reduction instances

* fix grouped_conv headers

* fix syntax

* replace parsing logic for instances, replace bfp16 with bf16

* fix the client examples build

* clean up DTYPES from instances cmake files

* update the parsing logic in cmake files

* make an exception for reduction kernels

* update few remaining cmake files to handle DTYPES

* fix syntax

* fix cmake conflicts

* replace f8 with fp8 test name

* resolve conflicts for dpp instances

15 Sep, 2023 1 commit

Add fp16/fp8 support into Grouped gemm FixedNK (#874) · f9d0eddb

zjing14 authored Sep 14, 2023

* move all arguments into device

* add b2c_tile_map

* add examples

* add SetDeviceKernelArgs

* dedicated fixed_nk solution

* init client api

* add grouped_gemm_bias example

* add an instance

* add instances

* formatting

* fixed cmake

* Update EnableCompilerWarnings.cmake

* Update cmake-ck-dev.sh

* clean; fixed comments

* fixed comment

* add instances for fp32 output

* add instances for fp32 output

* add fp32 out client example

* fixed CI

* init commit for kbatch

* add splitk gridwise

* format

* fixed

* clean deviceop

* clean code

* finish splitk

* fixed instances

* change m_loops to tile_loops

* add setkbatch

* clean code

* add splitK+bias

* add instances

* opt mk_nk instances

* clean examples

* fixed CI

* remove zero

* finished non-zero

* clean

* clean code

* optimized global_barrier

* fixed ci

* fixed CI

* instance and client

* removed AddBias

* format

* fixed CI

* fixed CI

* move 20_grouped_gemm to 21_grouped_gemm

* clean

* formatting

* clean

* clean

* fixed computeType

---------
Co-authored-by: Jing Zhang <jizha@amd.com>

13 Sep, 2023 1 commit

Add grouped conv bwd weight dl instances and new layout (#897) · 475188ca

Bartłomiej Kocot authored Sep 13, 2023

* Add grouped conv bwd weight dl instances and new layout

* Add M and N padding

* Remove todo comment

* Enable grouped conv fwd dl k,c=1 generic instance

* Comment fixes
