1. 16 Jul, 2024 1 commit
  2. 11 Jul, 2024 1 commit
  3. 04 Jul, 2024 1 commit
  4. 22 Jun, 2024 1 commit
  5. 18 Jun, 2024 1 commit
  6. 12 Jun, 2024 1 commit
  7. 10 Jun, 2024 1 commit
  8. 05 Jun, 2024 1 commit
      Add a scale op, related instances and examples (#1242) · cb0645be
      Rostyslav Geyyer authored
      
      
      * Add a scale op
      
      * Update the element op
      
      * Add instances
      
      * Add an example
      
      * Add a client example
      
      * Add a flag check
      
      * Revert flag check addition
      
      * Fix flag check
      
      * Update d strides in example
      
      * Update d strides in client example
      
      * Apply suggestions from code review
      
      Update copyright header
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Move the example
      
      * Move the client example
      
      * Update element op
      
      * Update example with the new element op
      
      * Add scalar layout
      
      * Update example
      
      * Update kernel for scalar Ds
      
      * Revert kernel changes
      
      * Update element op
      
      * Update example to use scales' pointers
      
      * Format
      
      * Update instances
      
      * Update client example
      
      * Move element op to unary elements
      
      * Update element op to work with values instead of pointers
      
      * Update instances to take element op as an argument
      
      * Update examples to use random scale values
      
      ---------
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
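
      For context on the element-op change above (the scale is now taken by value instead of
      by pointer), here is a minimal host-side sketch of a value-based scale functor applied
      as an elementwise D-tensor op. The names ScaleOp and apply_scale are illustrative
      assumptions, not the composable_kernel API:

          #include <cstddef>
          #include <iostream>
          #include <vector>

          // Illustrative only: a unary elementwise "scale" functor that carries the
          // scale by value (rather than dereferencing a device pointer on each call).
          struct ScaleOp
          {
              float scale_; // captured once when the op is constructed

              explicit ScaleOp(float scale) : scale_(scale) {}

              // y = scale * x
              void operator()(float& y, const float& x) const { y = scale_ * x; }
          };

          // Host-side stand-in for an elementwise pass over an output tile.
          void apply_scale(std::vector<float>& y, const std::vector<float>& x, ScaleOp op)
          {
              for(std::size_t i = 0; i < x.size(); ++i)
                  op(y[i], x[i]);
          }

          int main()
          {
              std::vector<float> x{1.f, 2.f, 3.f};
              std::vector<float> y(x.size());
              apply_scale(y, x, ScaleOp{0.5f}); // the examples use random scale values
              for(float v : y)
                  std::cout << v << ' ';
              std::cout << '\n';
          }
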
  9. 21 May, 2024 1 commit
  10. 10 May, 2024 1 commit
  11. 08 May, 2024 1 commit
  12. 26 Apr, 2024 2 commits
      ggemm tile_loop multD bf16 int8 (#1258) · 5ae893c0
      zjing14 authored
      
      
      * Overload output stream operator for LoopScheduler and PipelineVersion
      
      * Add Run overload accepting grid descriptors MK.
      
      * Add __device__ keyword for CalculateGridSize
      
      * Create device op GroupedGemmMultipleD
      
      * Add GroupedGemm MultipleD Tile Loop implementation.
      
      * Add an example for GroupedGemm MultipleD tile loop.
      
      * Device Op GroupedGEMMTileLoop.
      
      * A bunch of small changes in the example.
      
      * CkProfiler
      
      * Remove unused tparam.
      
      * changed the copy function to v7r2
      
      * adding multi_abd
      
      * in-progress
      
      * add post-load oob check
      
      * Fix include statement.
      
      * Fix output stream overloads.
      
      * Do not make descriptors and check validity until we find a group.
      
      * Fix gemm desc initialization.
      
      * debugging
      
      * adjust instances
      
      * add run_lds
      
      * add elementwise_op
      
      * replace multi_abd_device with v3
      
      * clean up
      
      * clean
      
      * clean
      
      * Revert device op
      
      * Fix compilation for DTYPES=FP16
      
      * Validate tensor transfer parameters.
      
      * Added LDSType
      
      * profiling
      
      * adjust oobcheck
      
      * add missing file
      
      * Validate only NK dims on host if M is not known.
      
      * add
      
      * clean
      
      * refactor
      
      * clean
      
      * add examples
      
      * add fuse
      
      * add fusion and client example
      
      * Fix bug.
      
      * A convenient debug func for selecting threads.
      
      * Fix has main k block loop bug.
      
      * Make sure that b2c has an up-to-date tile offset.
      
      * Output stream operator for Sequence type.
      
      * Cmake file formatting.
      
      * clean
      
      ---------
      Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
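
      A rough host-side sketch of the "tile loop" idea behind the grouped-GEMM device op above:
      a fixed grid of workgroups repeatedly grabs the next output tile across all groups,
      instead of launching one block per tile. The group/tile bookkeeping below is a simplified
      assumption for illustration, not the actual kernel logic:

          #include <cstdio>
          #include <vector>

          struct GemmDesc { int M, N; }; // per-group problem sizes (K omitted for brevity)

          int main()
          {
              const int MPerTile = 128, NPerTile = 128;
              std::vector<GemmDesc> groups{{256, 256}, {384, 128}, {128, 512}};

              // Total number of output tiles over all groups.
              std::vector<int> tiles_per_group;
              int total_tiles = 0;
              for(const auto& g : groups)
              {
                  int t = ((g.M + MPerTile - 1) / MPerTile) * ((g.N + NPerTile - 1) / NPerTile);
                  tiles_per_group.push_back(t);
                  total_tiles += t;
              }

              // Pretend we launched a fixed grid of this many "workgroups".
              const int grid_size = 4;
              for(int wg = 0; wg < grid_size; ++wg)
              {
                  // Each workgroup strides through the flattened tile index space.
                  for(int tile = wg; tile < total_tiles; tile += grid_size)
                  {
                      // Map the flat tile id back to (group, local tile).
                      int g = 0, local = tile;
                      while(local >= tiles_per_group[g]) { local -= tiles_per_group[g]; ++g; }
                      std::printf("workgroup %d -> group %d, tile %d\n", wg, g, local);
                  }
              }
          }
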
      bf16A_Int8B with fastgelu/bias (#1264) · 0d0150db
      zjing14 authored
      * changed the copy function to v7r2
      
      * adding multi_abd
      
      * in-progress
      
      * add post-load oob check
      
      * debugging
      
      * adjust instances
      
      * add run_lds
      
      * add elementwise_op
      
      * replace multi_abd_device with v3
      
      * clean up
      
      * clean
      
      * clean
      
      * Added LDSType
      
      * profiling
      
      * adjust oobcheck
      
      * add missing file
      
      * refactor
      
      * clean
      
      * add examples
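
      The bias/fastgelu fusion named in the title corresponds to an elementwise op applied to
      the bf16 x int8 GEMM output. Below is a minimal host-side sketch of an add-plus-fast-GELU
      functor using the common tanh approximation; the functor name and float-only signature
      are assumptions for illustration, not the library's definition:

          #include <cmath>
          #include <cstdio>

          // Illustrative "AddFastGelu": y = fast_gelu(acc + bias), with the
          // tanh-based GELU approximation.
          struct AddFastGelu
          {
              void operator()(float& y, const float& acc, const float& bias) const
              {
                  const float x = acc + bias;
                  const float u = 0.7978845608f * (x + 0.044715f * x * x * x); // sqrt(2/pi)
                  y = 0.5f * x * (1.f + std::tanh(u));
              }
          };

          int main()
          {
              AddFastGelu op;
              float y = 0.f;
              op(y, 1.25f, -0.5f); // e.g. GEMM accumulator + bias
              std::printf("%f\n", y);
          }
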
  13. 19 Apr, 2024 1 commit
      Refactor elementwise kernels (#1222) · ad1597c4
      Bartłomiej Kocot authored
      * Refactor elementwise kernels
      
      * Instances fixes
      
      * Fix cmake
      
      * Fix max pool bwd test
      
      * Update two stage gemm split k
      
      * Restore elementwise scale for hiptensor backward compatibility
      
      * Fix Acc data type check in conv fwd multiple abd
      
      * Disable conv fp64 fwd example
      
      * Update grouped conv weight multi d
  14. 18 Apr, 2024 1 commit
  15. 16 Apr, 2024 1 commit
      Added Multi_ABD support into Gemm and GroupedGemmFixedNK (#978) · 12865fbf
      zjing14 authored
      
      
      * added an example grouped_gemm_multi_abd
      
      * fixed ci
      
      * add setElementwiseOp
      
      * changed API
      
      * clean code: add multiA into example
      
      * fixed v7r2 copy
      
      * add transpose
      
      * clean
      
      * fixed vector_load check
      
      * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * add reduce
      
      * testing
      
      * add example_b16_i8
      
      * refactor example
      
      * clean
      
      * add M padding
      
      * disable reduce for kbatch = 1
      
      * separate reduce device op
      
      * add reduce op
      
      * add guard for workspace_size
      
      * add instances
      
      * format
      
      * fixed
      
      * add client example
      
      * add a colmajor
      
      * add instances
      
      * Update cmake-ck-dev.sh
      
      * Update profile_gemm_splitk.cpp
      
      * Update gridwise_gemm_xdlops_v2r4r2.hpp
      
      * format
      
      * Update profile_gemm_splitk.cpp
      
      * fixed
      
      * fixed
      
      * adjust test
      
      * adjust precision loss
      
      * adjust test
      
      * fixed
      
      * add bf16_i8 scale bias
      
      * fixed scale
      
      * fixed scale elementwise_op
      
      * revert contraction deviceop changes
      
      * fixed
      
      * Add AddFastGelu
      
      * Revert "Merge branch 'jizhan/gemm_splitk_reduce' into grouped_gemm_multi_abd_fixed_nk_example"
      
      This reverts commit 3b5d001efd74335b38dcb7d8c8877580b49d23a4, reversing
      changes made to 943199a99191661c5597c51ca8371a90bf57837e.
      
      * add Scales into elementwise
      
      * add gemm_multi_abd client example
      
      * add client examples
      
      * add rcr and crr
      
      * add grouped gemm client example
      
      * add grouped gemm client example
      
      * add instance for rcr crr
      
      * format
      
      * fixed
      
      * fixed cmake
      
      * fixed
      
      * fixed client_example
      
      * format
      
      * fixed contraction isSupport
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update device_reduce_threadwise.hpp
      
      * clean
      
      * Fixes
      
      * Fix example
      
      ---------
      Co-authored-by: Jing Zhang <jizha@amd.com>
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
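
      Several bullets above concern the split-K path ("disable reduce for kbatch = 1",
      "separate reduce device op"): when the K dimension is split into kbatch partial GEMMs,
      each partial result lands in a workspace and must be summed by a follow-up reduce step,
      which is unnecessary when kbatch == 1. A simplified host sketch of that reduction; the
      flat workspace layout is an assumption for illustration:

          #include <cstdio>
          #include <vector>

          int main()
          {
              const int kbatch = 4;  // number of K splits
              const int c_elems = 8; // elements of the C tile (flattened M*N)

              // Workspace holding one partial C per K split, as a split-K GEMM would produce.
              std::vector<float> workspace(kbatch * c_elems, 1.0f);
              std::vector<float> c(c_elems, 0.0f);

              if(kbatch == 1)
              {
                  // No reduction needed: the single partial result already is C.
                  c.assign(workspace.begin(), workspace.begin() + c_elems);
              }
              else
              {
                  // Reduce partial results across the kbatch dimension.
                  for(int i = 0; i < c_elems; ++i)
                      for(int kb = 0; kb < kbatch; ++kb)
                          c[i] += workspace[kb * c_elems + i];
              }

              std::printf("c[0] = %f\n", c[0]); // 4.0 for kbatch = 4
          }
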
  16. 15 Apr, 2024 1 commit
  17. 11 Apr, 2024 1 commit
  18. 03 Apr, 2024 1 commit
  19. 02 Apr, 2024 1 commit
      Split the instances by architecture. (#1223) · ae57e593
      Illia Silin authored
      * parse examples inside the add_example_executable function
      
      * fix the example 64 cmake file
      
      * add xdl flag to the gemm_bias_softmax_gemm_permute example
      
      * add filtering of tests based on architecture type
      
      * enable test_grouped_gemm for gfx9 only
      
      * enable test_transpose only for gfx9
      
      * only link test_transpose if it gets built
      
      * split the gemm instances by architectures
      
      * split gemm_bilinear,grouped_conv_bwd_weight instances by targets
      
      * split instances by architecture
      
      * split grouped_conv instances by architecture
      
      * fix clang format
      
      * fix the if-else logic in group_conv headers
      
      * small fix for grouped convolution instances
      
      * fix the grouped conv bwd weight dl instances
      
      * fix client examples
      
      * only enable client examples 3 and 4 on gfx9
      
      * set the gfx9 macro
      
      * make sure the architecture macros are set by cmake
      
      * use separate set of xdl/wmma flags for host code
      
      * simplify the main cmake file
      
      * add conv_fwd_bf8 instance declaration
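
      The gist of splitting instances by architecture is that XDL (gfx9) and WMMA (gfx11)
      instance files are only compiled, and only registered, when the build targets that
      hardware. A simplified sketch of such guarding; CK_USE_XDL, CK_USE_WMMA, and the
      instance names are placeholders, not the macros the library actually defines:

          #include <cstdio>
          #include <string>
          #include <vector>

          // Hypothetical per-architecture registration, guarded by macros that the CMake
          // build would define only for the matching GPU targets.
          void add_gemm_instances(std::vector<std::string>& instances)
          {
          #if defined(CK_USE_XDL)
              instances.push_back("gemm_xdl_fp16_instance");  // gfx9-only path
          #endif
          #if defined(CK_USE_WMMA)
              instances.push_back("gemm_wmma_fp16_instance"); // gfx11-only path
          #endif
              instances.push_back("gemm_dl_fp16_instance");   // fallback usable everywhere
          }

          int main()
          {
              std::vector<std::string> instances;
              add_gemm_instances(instances);
              for(const auto& name : instances)
                  std::printf("%s\n", name.c_str());
          }
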
  20. 21 Mar, 2024 1 commit
  21. 15 Mar, 2024 1 commit
  22. 13 Mar, 2024 1 commit
  23. 29 Feb, 2024 1 commit
  24. 26 Feb, 2024 1 commit
  25. 21 Feb, 2024 1 commit
  26. 13 Feb, 2024 2 commits
  27. 25 Jan, 2024 1 commit
      layernorm & groupnorm bwd gamma beta (#1133) · 28f68a5a
      rocking authored
      * Add layernorm bwd gamma beta external api
      
      * Add groupnorm external api
      
      * Add layernorm bwd gamma beta profiler
      
      * Add groupnorm bwd gamma beta ckProfiler
      
      * Add layernorm & groupnorm bwd gamma beta test
      
      * Fix groupnorm bwd gamma beta profiler bug
      
      * Layernorm bwd weight client example
      
      * Groupnorm bwd weight client example
      
      * clang format
      
      * Remove useless header
      
      * Let inv_std be positive
      
      * Rename to num_bytes and move this calculation outside the loop
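
      As a reference for what the "bwd gamma beta" kernels compute: with x_hat the saved
      normalized input and dy the upstream gradient, dgamma reduces dy * x_hat and dbeta
      reduces dy over every dimension except the feature one. A naive host sketch for a 2D
      [M, N] layernorm; the shapes and names are illustrative, not the library's kernel:

          #include <cstdio>
          #include <vector>

          int main()
          {
              const int M = 4, N = 3; // each row is normalized over the N (feature) dimension

              std::vector<float> dy(M * N, 1.0f);    // upstream gradient
              std::vector<float> x_hat(M * N, 0.5f); // saved (x - mean) * inv_std
              std::vector<float> dgamma(N, 0.0f), dbeta(N, 0.0f);

              // dgamma[n] = sum_m dy[m, n] * x_hat[m, n]
              // dbeta[n]  = sum_m dy[m, n]
              for(int m = 0; m < M; ++m)
                  for(int n = 0; n < N; ++n)
                  {
                      dgamma[n] += dy[m * N + n] * x_hat[m * N + n];
                      dbeta[n]  += dy[m * N + n];
                  }

              std::printf("dgamma[0] = %f, dbeta[0] = %f\n", dgamma[0], dbeta[0]);
          }
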
  28. 19 Jan, 2024 1 commit
  29. 19 Dec, 2023 1 commit
  30. 18 Dec, 2023 1 commit
      layernorm and groupnorm backward data (#1083) · a69aa2a1
      rocking authored
      * rename folder
      
      * Add type string
      
      * Remove typo
      
      * Add deviceOp to backward x
      
      * Add comment to describe the behavior of backward normalization
      
      * Add kernel function, prepare to implement
      
      * implement generic kernel
      
      * Check vector size
      
      * Add sweep once pipeline for small reduce size
      
      * Fix bug of KRaw_ error
      
      * Fix bug of dx stride
      
      * sanity check for mean and rstd
      
      * backward x for groupnorm
      
      * Add bwd x instance
      
      * add layernorm 2d bwd gamma beta instances
      
      * Change saved mean/var type from f32 to f16 in f16 mode
      
      * Change the example to f16
      
      * Add groupnorm bwd gamma beta instance
      
      * Add groupnorm bwd x instance
      
      * Fix naming
      
      * Add layernorm bwd x ckprofiler
      
      * Add groupnorm bwd x profiler
      
      * clang format
      
      * Rename bwd x to bwd data
      
      * Fix bug of verification in profiler
      
      * Add test of layernorm and groupnorm bwd data
      
      * Add missing cmake
      
      * Add layernorm2d bwd data
      
      * rename fwd example
      
      * Add groupnorm client example
      
      * Fix typo. replace Invarient with Invariant
      
      * Add checking before running the best instance
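
      For reference, the "bwd data" kernels compute dx from dy, gamma, and the saved mean and
      inverse standard deviation (the rstd mentioned above). For layernorm over N features per
      row, dx = inv_std * (dy*gamma - mean(dy*gamma) - x_hat * mean(dy*gamma*x_hat)). A naive
      single-row host sketch, assuming those saved statistics; names are illustrative:

          #include <cstdio>
          #include <vector>

          int main()
          {
              const int N = 4;                              // features per row
              std::vector<float> x{1.f, 2.f, 3.f, 4.f};     // one row of the input
              std::vector<float> dy{0.1f, -0.2f, 0.3f, 0.f};
              std::vector<float> gamma{1.f, 1.f, 1.f, 1.f};
              const float mean = 2.5f, inv_std = 0.894427f; // saved forward statistics
              std::vector<float> dx(N);

              // ds = mean(dy * gamma * x_hat), db = mean(dy * gamma)
              float ds = 0.f, db = 0.f;
              for(int n = 0; n < N; ++n)
              {
                  const float x_hat = (x[n] - mean) * inv_std;
                  ds += dy[n] * gamma[n] * x_hat;
                  db += dy[n] * gamma[n];
              }
              ds /= N;
              db /= N;

              // dx = inv_std * (dy * gamma - db - x_hat * ds)
              for(int n = 0; n < N; ++n)
              {
                  const float x_hat = (x[n] - mean) * inv_std;
                  dx[n] = inv_std * (dy[n] * gamma[n] - db - x_hat * ds);
              }

              std::printf("dx[0] = %f\n", dx[0]);
          }
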
  31. 08 Dec, 2023 1 commit
  32. 06 Dec, 2023 1 commit
      Introduce wrapper library (#1071) · 836b7e55
      Bartłomiej Kocot authored
      * Introduce wrapper library
      
      * Update cmake files
      
      * Revert "Update cmake files"
      
      This reverts commit c27f88b56590c11a88e26d5d0df7aca51a08133d.
      
      * Fix comments
  33. 28 Nov, 2023 1 commit
      Split the static library into several files. (#1044) · 7965d66a
      Illia Silin authored
      * split the static library into several
      
      * update lib paths and fix client example
      
      * do not use device_mha_operations for client examples
      
      * use appropriate libs to link to client examples
      
      * remove the gpu/transpose path from the list
      
      * try fixing client examples 3, 4, 9
      
      * add necessary libs for client examples
      
      * fix the layernorm client example
      
      * fix the client examples 23 and 24
      
      * fix typo
      
      * add interface library and refresh clang format
  34. 14 Nov, 2023 1 commit
  35. 13 Nov, 2023 1 commit
  36. 10 Nov, 2023 1 commit
      Support multi AB for grouped conv fwd xdl (#1027) · 49e52bb3
      Bartłomiej Kocot authored
      * Support multi AB for grouped conv fwd xdl
      
      * Add instances
      
      * Add client example
      
      * Add example
      
      * Add interface test
      
      * Minor fixes
      
      Minor fixes
      
      Minor fixes
      
      * Comment fixes
      
      * Fixes
      
      * Reference fix
      
      * Test xdl fixes
      
      * Improve multi_ab interface test
  37. 09 Nov, 2023 2 commits
      Transpose 3d (#984) · 3af8c81a
      arai713 authored
      
      
      * added working example for 5D input using 1D kernel
      
      * example with 5D input tensor and 2d kernel - not working: issues with arguments
      
      * added updated version of 3d device op - changed descriptors/dims
      
      * added example file to check kernel
      
      * fixed descriptor and isSupportedArgument stride problem
      
      * added and modified kernel for 3d - updated tids/loop
      
      * adding some more 5d example files
      
      * fixed some issues
      
      * changes made for testing
      
      * working version: fixed error in stride for A, still a bit inefficient
      
      * cleaned up formatting/comments
      
      * updating formatting
      
      * more formatting fixes
      
      * fixing cmake, adding back gpu targets in cmake script
      
      * adding client example
      
      * added instances for client example
      
      * fixed errors in client example
      
      * implemented client ex with device_elementwise.hpp and device_elementwise_3d_impl.hpp
      
      * removed extra files
      
      * minor formatting and naming fixes
      
      * adding test files and profiler
      
      * fixing minor error
      
      * minor fix
      
      * removed unnecessary comments, renamed files
      
      * updated instance list for client example, added different layout example
      
      * removing instances
      
      * fixed error in instance generation
      
      * remove comments
      
      * update profiler and client example tensor layouts
      
      * fixed errors in test/profiler
      
      * updated vector dim access to enable vector load
      
      * updated test/profiler files
      
      * updated example with 1d kernel
      
      * updating profiler
      
      * renamed files
      
      ---------
      Co-authored-by: Jing Zhang <jizha@amd.com>
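
      The transpose work above maps an N-dimensional permutation onto the elementwise kernels;
      conceptually, every output element is a copy from the input at permuted coordinates. A
      naive host sketch of a 5D NCDHW -> NDHWC permute follows; the layout pair is just an
      example, not necessarily the one the instances use:

          #include <cstdio>
          #include <vector>

          int main()
          {
              // Input in NCDHW layout, output in NDHWC layout.
              const int N = 1, C = 2, D = 2, H = 2, W = 2;
              std::vector<float> in(N * C * D * H * W);
              for(int i = 0; i < static_cast<int>(in.size()); ++i)
                  in[i] = static_cast<float>(i);
              std::vector<float> out(in.size());

              for(int n = 0; n < N; ++n)
                  for(int c = 0; c < C; ++c)
                      for(int d = 0; d < D; ++d)
                          for(int h = 0; h < H; ++h)
                              for(int w = 0; w < W; ++w)
                              {
                                  const int src = (((n * C + c) * D + d) * H + h) * W + w; // NCDHW
                                  const int dst = (((n * D + d) * H + h) * W + w) * C + c; // NDHWC
                                  out[dst] = in[src];
                              }

              std::printf("out[0] = %f, out[1] = %f\n", out[0], out[1]);
          }
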
      Layernorm4d (#1022) · a3d9a2cd
      rocking authored
      
      
      * Rename folder
      
      * Add layernorm 4d fwd example
      
      * Rename original layernorm example
      
      * Add layernorm 4d f16 test
      
      * Add layernorm4d_fwd client example
      
      * Support layernorm4D in ckProfiler
      
      * Rename groupnorm to groupnorm fwd in example
      
      * Rename layernorm and group fwd in test
      
      * Rename normalization to normalization_fwd (instances)
      
      * Add fwd to DeviceNormalization
      
      * Rename external api header
      
      * Rename folder, because we can also add bwd in this folder
      
      * Add fwd in layernorm and groupnorm (profiler)
      
      * Fix compile error
      
      ---------
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>