1. 21 Aug, 2024 1 commit
    • Adding Instances and Examples for FP8-based Scaled Convolution and AMAX Reduction. (#1473) · c3515f27
      Andriy Roshchenko authored
      * Enable CMakePresets build
      
      * Verify Convolution, Scaling and ReLU algorithms.
      
      * Add tensor element-wise scale and type cast operation.
      
      * Reduction implemented but does not work.
      
      * Exploration of Reduction functionality.
      
      * Completed example for Convolution scaled with ReLU activation and AMAX reduction.
      
      * WIP: Add required instances for convolution.
      
      * WIP: Create client example. Implement convolution stage.
      
      * Add elementwise instances.
      
      * Add elementwise scale + convert example.
      
      * Add reduction instances.
      
      * WIP: Client example for AMAX reduction.
      
      * WIP: Add instances for multistage reduction.
      
      * WIP: Implementation of multistage reduction.
      
      * Refactoring.
      
      * Clean up.
      
      * Add CMakePresets.json
      
      * Guard off FP8 instances when the data type is not available.
      
      * Add example for Scaled FP8 Convolution with AMAX reduction.
      
      * Refactor CombConvScaleRelu instances.
      
      * Add CombConvScale instances.
      
      * Add client example for Scaled FP8 Convolution with AMAX reduction (the AMAX-to-scale step is sketched below).
      
      * Cleanup.
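      The AMAX reduction added here is the usual ingredient for FP8 scaling: reduce max(|x|) over the convolution output, then derive a scale so the values fit the FP8 range. A minimal host-side sketch of that idea (editor's illustration; the names and the scale = fp8_max / amax recipe are assumptions, not code from this PR; 448 is the OCP e4m3 maximum normal value):
      ```cpp
      #include <algorithm>
      #include <cmath>
      #include <cstdio>
      #include <vector>

      int main() {
          // Stand-in for the convolution output tensor (float reference values).
          std::vector<float> out = {-0.5f, 3.25f, -7.75f, 1.0f};

          // AMAX reduction: maximum absolute value over the whole tensor.
          float amax = 0.f;
          for (float v : out) amax = std::max(amax, std::fabs(v));

          // Derive an FP8 scale so the scaled tensor fits the e4m3 range.
          const float fp8_max = 448.f;
          const float scale   = amax > 0.f ? fp8_max / amax : 1.f;

          std::printf("amax = %f, scale = %f\n", amax, scale);
      }
      ```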
  2. 20 Aug, 2024 1 commit
    • Adding Instances and Examples for FP8-based Scaled Convolution with ReLU Activation and AMAX Reduction. (#1469) · a94113a9
      Andriy Roshchenko authored
      
      * Enable CMakePresets build
      
      * Verify Convolution, Scaling and ReLU algorithms.
      
      * Add tensor element-wise scale and type cast operation.
      
      * Reduction implemented but does not work.
      
      * Exploration of Reduction functionality.
      
      * Completed example for Convolution scaled with ReLU activation and AMAX reduction.
      
      * WIP: Add required instances for convolution.
      
      * WIP: Create client example. Implement convolution stage.
      
      * Add elementwise instances.
      
      * Add elementwise scale + convert example.
      
      * Add reduction instances.
      
      * WIP: Client example for AMAX reduction.
      
      * WIP: Add instances for multistage reduction.
      
      * WIP: Implementation of multistage reduction.
      
      * Refactoring.
      
      * Clean up.
      
      * Guard off FP8 instances when the data type is not available.
      
      * Improve output readability.
      
      * Addressing reviewer's comments.
  3. 16 Aug, 2024 1 commit
    • Re-enable fp8 types for all architectures. (#1470) · c8b6b642
      Illia Silin authored
      * re-enable fp8 and bf8 for all targets
      
      * restore the fp8 gemm instances
      
      * re-enable conv_3d fp8 on all architectures
      
      * disable several fp8 gemm instances on all architectures except gfx94
      
      * clang format fix
  4. 14 Aug, 2024 1 commit
    • [GEMM] gemm_universal related optimization (#1453) · 3049b546
      Haocong WANG authored
      
      
      * replace buffer_atomic with global_atomic
      
      * fixed global_atomic_add
      
      * added bf16 atomic_add (a generic CAS-loop emulation is sketched below)
      
      * format
      
      * clang-format-12
      
      * clean
      
      * clean
      
      * add guards
      
      * Update gtest.cmake
      
      * enabled splitk_gemm_multi_d
      
      * format
      
      * add ckProfiler
      
      * format
      
      * fixed naming
      
      * format
      
      * clean
      
      * clean
      
      * add guards
      
      * fix clang format
      
      * format
      
      * add kbatch printout
      
      * clean
      
      * Add rocm6.2 related gemm optimization
      
      * Limit bf16 atomic usage
      
      * remove redundant RCR gemm_universal instance
      
      * Add RRR fp8 gemm universal instance
      
      * Bug fix
      
      * Add GPU_TARGET guard to FP8/BF8 target
      
      * bug fix
      
      * update cmake
      
      * remove all fp8/bf8 examples if the arch is not supported
      
      * Enable fp8 RRR support in ckProfiler
      
      * limit greedy-reverse flag to gemm_universal in ckProfiler
      
      ---------
      Co-authored-by: Jing Zhang <jizhan@fb.com>
      Co-authored-by: Jing Zhang <jizhan@meta.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
      Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
      Co-authored-by: illsilin <Illia.Silin@amd.com>
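      Native bf16 atomic adds are not available on every target, which is why their use is guarded and limited in the commits above. Below is a generic CAS-loop emulation of a bf16 atomic add, written as plain host C++ with std::atomic purely to illustrate the technique; it is not CK's device code, and NaN handling is omitted.
      ```cpp
      #include <atomic>
      #include <cstdint>
      #include <cstdio>
      #include <cstring>

      // bf16 is the upper 16 bits of an IEEE binary32 float.
      static float bf16_to_float(std::uint16_t h) {
          std::uint32_t bits = std::uint32_t(h) << 16;
          float f;
          std::memcpy(&f, &bits, sizeof(f));
          return f;
      }
      static std::uint16_t float_to_bf16(float f) { // round-to-nearest-even, NaN ignored
          std::uint32_t bits;
          std::memcpy(&bits, &f, sizeof(bits));
          bits += 0x7FFFu + ((bits >> 16) & 1u);
          return std::uint16_t(bits >> 16);
      }

      // Emulated bf16 atomic add: CAS loop on the 32-bit word holding two bf16 lanes.
      void atomic_add_bf16(std::atomic<std::uint32_t>* word, int lane, float v) {
          std::uint32_t old_w = word->load(std::memory_order_relaxed);
          for (;;) {
              std::uint16_t old_h = std::uint16_t(lane ? (old_w >> 16) : (old_w & 0xFFFFu));
              std::uint16_t new_h = float_to_bf16(bf16_to_float(old_h) + v);
              std::uint32_t new_w = lane ? ((old_w & 0x0000FFFFu) | (std::uint32_t(new_h) << 16))
                                         : ((old_w & 0xFFFF0000u) | new_h);
              // On failure, old_w is refreshed with the current value and we retry.
              if (word->compare_exchange_weak(old_w, new_w, std::memory_order_relaxed)) return;
          }
      }

      int main() {
          std::atomic<std::uint32_t> word{0}; // two bf16 lanes, both 0.0
          atomic_add_bf16(&word, 0, 1.5f);
          atomic_add_bf16(&word, 0, 2.5f);
          std::printf("lane0 = %f\n", bf16_to_float(std::uint16_t(word.load() & 0xFFFFu))); // 4.0
      }
      ```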
  5. 12 Aug, 2024 1 commit
  6. 09 Aug, 2024 1 commit
  7. 06 Aug, 2024 2 commits
    • adding mha as static lib (#1366) · 840c5397
      bibek authored
      
      
      * adding mha as static lib
      
      * add fmha fwd compile options
      
      * typo
      
      * fix python version
      
      * python version to 3
      
      * increase path length
      
      * add max path flag in mha cmake
      
      * fix long path issue
      
      * mha currently only runs in gfx94x
      
      * only build mha in mi300
      
      * populate gpu_list
      
      * add mha compile flags
      
      * avoid building mha on GPUs other than gfx94x
      
      * some comments and include ck_tile in rocm
      
      * use rocm_install
      
      * place ck_tile in include
      
      * correct ck_tile path
      
      ---------
      Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
    • Add Grouped Conv Fwd Large Tensor kernel (#1432) · 4ec5c52a
      Bartłomiej Kocot authored
      * Support 64 bit indexing
      
      * Add new grouped conv fwd kernel for large tensors
      
      * Add instances large tensor
      
      * Fixes for transform conv to gemm
      
      * Fixes
      
      * fixes
      
      * Remove unneeded instances
      
      * example fixes
      
      * Remove unneeded Ds arrays
      
      * Fix tests
      
      * Add 2GB check in gridwise dl (see the 2GB-check sketch below)
      
      * Fixes
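      The 2GB check relates to 32-bit offset addressing: a tensor larger than 2^31 bytes cannot be addressed with 32-bit offsets and has to take the 64-bit-indexing / large-tensor path. A minimal sketch of such a check (the helper name and the example shape are made up for illustration, not taken from this PR):
      ```cpp
      #include <cstdint>
      #include <cstdio>

      // A tensor addressed with 32-bit offsets must stay below 2 GiB.
      constexpr std::int64_t k2GiB = std::int64_t{1} << 31;

      bool needs_large_tensor_kernel(std::int64_t num_elements, std::int64_t bytes_per_element) {
          return num_elements * bytes_per_element >= k2GiB;
      }

      int main() {
          // Hypothetical NDHWGC fp16 tensor: 2 x 64 x 128 x 128 x 4 x 192 elements (~3 GiB).
          const std::int64_t elems = 2LL * 64 * 128 * 128 * 4 * 192;
          std::printf("large-tensor path needed: %s\n",
                      needs_large_tensor_kernel(elems, /*fp16*/ 2) ? "yes" : "no");
      }
      ```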
  8. 05 Aug, 2024 1 commit
  9. 30 Jul, 2024 1 commit
  10. 24 Jul, 2024 1 commit
    • Adding more instances of grouped convolution 3d forward for FP8 with ConvScale+Bias element-wise operation. (#1412) · 4a8a1bef
      Andriy Roshchenko authored
      
      * Add CMakePresets configurations.
      
      * Add binary elementwise ConvScaleAdd and an example.
      
      * Numerical verification of results.
      
      Observed significant irregularities in F8 to F32 type conversions (see the FP8 rounding sketch below):
      ```log
      ConvScaleAdd: float=145.000000   f8_t=160.000000    e=144.000000
      ConvScaleAdd: float=97.000000   f8_t=96.000000    e=104.000000
      ConvScaleAdd: float=65.000000   f8_t=64.000000    e=72.000000
      ```
      
      * Implemented ConvScaleAdd + Example.
      
      * Add ConvScale+Bias Instances
      
      * Add Client Example for ConvScale+Bias
      
      * Fix number of bytes in an example.
      
      * Cleanup.
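      The gaps in the log above are on the order of one FP8 ULP: e4m3 has 3 mantissa bits, so representable values in [128, 256) are 16 apart and in [64, 128) are 8 apart. Hence 145 can only land on 144 or 160, 97 on 96 or 104, and 65 on 64 or 72 — exactly the pairs seen in the log. A small sketch that prints those neighbours (illustrative quantizer math, not the library's conversion code):
      ```cpp
      #include <cmath>
      #include <cstdio>

      // Spacing (ULP) of e4m3 around a positive normal value x: 2^(floor(log2 x) - 3),
      // because e4m3 carries 3 mantissa bits.
      float e4m3_ulp(float x) {
          const int e = static_cast<int>(std::floor(std::log2(x)));
          return std::ldexp(1.0f, e - 3);
      }

      int main() {
          const float xs[] = {145.f, 97.f, 65.f};
          for (float x : xs) {
              const float ulp = e4m3_ulp(x);
              const float lo  = std::floor(x / ulp) * ulp; // nearest representable below
              const float hi  = lo + ulp;                  // nearest representable above
              std::printf("x = %6.1f  e4m3 step = %4.1f  neighbours: %.1f / %.1f\n", x, ulp, lo, hi);
          }
      }
      ```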
  11. 23 Jul, 2024 1 commit
  12. 22 Jul, 2024 1 commit
  13. 19 Jul, 2024 2 commits
    • [GEMM] F8 GEMM, performance optimized. (#1384) · 8c90f25b
      Haocong WANG authored
      
      
      * add ab_scale init support
      
      * enabled interwave
      
      * add scale type; update isSupport
      
      * adjust example
      
      * clean
      
      * enable f8 pure gemm rcr ckprofiler
      
      * Add gemm_multiply_multiply instances
      
      * clang format
      
      * Optimize for ScaleBlockMNK=128 (block-wise AB scaling is sketched below)
      
      * enable abscale f8 gemm ck profiler
      
      * Add pure f8 gemm test suite
      
      * Reverting to the state of the project at f60fd77
      
      * update copyright
      
      * clang format
      
      * update copyright
      
      ---------
      Co-authored-by: root <jizhan@amd.com>
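      A host-side reference of the AB-scaled GEMM idea: A and B carry one scale per block of elements (with ScaleBlockMNK as the block granularity), and each product is multiplied by the corresponding block scales. The block size, scale layout and values below are assumptions chosen for a tiny illustration, not the instances' actual configuration:
      ```cpp
      #include <cstdio>
      #include <vector>

      int main() {
          const int M = 4, N = 4, K = 8, ScaleBlock = 4; // tiny stand-in for 128
          std::vector<float> A(M * K, 2.0f), B(K * N, 3.0f);
          // One scale per (row-block, k-block) of A and per (k-block, col-block) of B.
          std::vector<float> a_scale((M / ScaleBlock) * (K / ScaleBlock), 0.5f);
          std::vector<float> b_scale((K / ScaleBlock) * (N / ScaleBlock), 0.25f);

          std::vector<float> C(M * N, 0.0f);
          for (int m = 0; m < M; ++m)
              for (int n = 0; n < N; ++n)
                  for (int k = 0; k < K; ++k) {
                      const float sa = a_scale[(m / ScaleBlock) * (K / ScaleBlock) + k / ScaleBlock];
                      const float sb = b_scale[(k / ScaleBlock) * (N / ScaleBlock) + n / ScaleBlock];
                      C[m * N + n] += (A[m * K + k] * sa) * (B[k * N + n] * sb);
                  }
          std::printf("C[0][0] = %f (expected %f)\n", C[0], 2.0f * 0.5f * 3.0f * 0.25f * K);
      }
      ```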
    • Universal gemm splitk using reduce (with multi-d) (#1341) · c544eb4d
      ltqin authored
      
      
      * init for reduce_threadwise multi_d
      
      * add reduce_threadwise_multi_d
      
      * add reduce_multi_d
      
      * clean
      
      * start adding another splitk device op (the split-K-plus-reduction scheme is sketched below)
      
      * add reduce template parameter to SplitKBatchOffset
      
      * add reduce c matrix
      
      * clean up code
      
      * change example data type to bf16
      
      * add bf16Ai8B example
      
      * remove reduce template parameter
      
      * add splitk atomic status to v4
      
      * example add multi d parameters
      
      * device op add multi-d parameters
      
      * add multi-d to reduce
      
      * fix kbatch=1 bug
      
      * change B layout to col in bf16Ai8B example
      
      * remove float adding struct
      
      * change  multi-d interface
      
      * change file and class name
      
      * remove multi-d of bf16Ai8B example
      
      * change IsReduce function to IsReduceAdd
      
      * change example layout to RRR from RCR
      
      * set Ds stride according to layout
      
      * reset parameter layout
      
      * add gemm universal reduce instance
      
      * add reduce factory
      
      * add profile_gemm_universal_reduce
      
      * add reduce to profiler
      
      * fix reduce instance
      
      * fix profiler reduce compiling bug
      
      * format
      
      * format library instance code
      
      * add mem instance for reduce library
      
      * fix call instance names
      
      * add workspace for reduce in ckProfiler
      
      * format
      
      * add mnpadding to reduce library instance
      
      * add fp16 instance to reduce of profiler
      
      * change copyright time
      
      * restore profiler cmake file
      
      * add reduce text to instances
      
      * add DsLayout and DsDataType to instances template parameter
      
      * fixed gemm_reduce_multi_d
      
      * add an example without multi_d
      
      * Update common.hpp
      
      * Update gtest.cmake
      
      * Update gemm_xdl_splitk_reduce_bf16.cpp
      
      * clean
      
      * Update gtest.cmake
      
      * format
      
      * fix api
      
      * format
      
      * default parameter change to RRR
      
      * add vector_len for multi_d
      
      * format
      
      * Update gtest.cmake
      
      * fix bf16Ai8B elementwise op
      
      * add ReduceDataType
      
      * move ReduceDataType to end position
      
      * format
      
      * remove googletest git method address
      
      * fix copyright time
      
      * update init data
      
      ---------
      Co-authored-by: root <jizhan@amd.com>
      Co-authored-by: letaoqin <letaoqin@amd.com>
      Co-authored-by: Jing Zhang <jizhan@meta.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
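      A host-side reference of the split-K-with-reduction scheme this PR builds: each k-batch writes its partial C tile into its own workspace slice, and a second pass reduces (adds) the partials, which is also where the multi-D elementwise fusion would go. Sizes and the workspace layout are illustrative only:
      ```cpp
      #include <cstdio>
      #include <vector>

      int main() {
          const int M = 2, N = 2, K = 8, KBatch = 4, KPerBatch = K / KBatch;
          std::vector<float> A(M * K, 1.0f), B(K * N, 2.0f);

          // Stage 1: each k-batch writes its partial C tile into a workspace slice.
          std::vector<float> workspace(KBatch * M * N, 0.0f);
          for (int kb = 0; kb < KBatch; ++kb)
              for (int m = 0; m < M; ++m)
                  for (int n = 0; n < N; ++n) {
                      float acc = 0.f;
                      for (int k = kb * KPerBatch; k < (kb + 1) * KPerBatch; ++k)
                          acc += A[m * K + k] * B[k * N + n];
                      workspace[(kb * M + m) * N + n] = acc;
                  }

          // Stage 2: reduce (add) the partials; multi-D tensors / elementwise ops
          // would be fused into this pass.
          std::vector<float> C(M * N, 0.0f);
          for (int kb = 0; kb < KBatch; ++kb)
              for (int i = 0; i < M * N; ++i)
                  C[i] += workspace[kb * M * N + i];

          std::printf("C[0][0] = %f (expected %f)\n", C[0], 2.0f * K);
      }
      ```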
  14. 16 Jul, 2024 2 commits
  15. 12 Jul, 2024 1 commit
  16. 11 Jul, 2024 1 commit
  17. 09 Jul, 2024 1 commit
  18. 06 Jul, 2024 1 commit
    • Universal streamk with atomics (#1360) · 75e622f0
      Harisankar Sadasivan authored
      * Universal streamk with atomics, with ckProfiler support. grid_size and the streamk strategy are tunable: a grid_size of -1 sets #WGs = maximum occupancy x num_CUs (see the grid-size sketch below). The implementation supports several streamk policies (1-tile, 2-tile, 3-tile and 4-tile); a streamk strategy of -1 selects the default policy (4-tile).
      
      * Update README.md
      
      * fixing clang-format issues
      
      * removed conflicts in struct members between streamk and universal streamk
      
      * corrected arg parsing for streamk and universal streamk
      
      * added stream-k policies for 3 tile and 4 tile
      
      * fixed argument type issue with parsing cmd args
      
      * apply changes suggested in PR review: remove comments and correct copyright
      
      * file permissions updated
      
      * added default value support for grid_size and streamk-policy selection set to -1
      
      * print messages for arguments
      
      * print messages for arguments
      
      * print messages for arguments1
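      A minimal sketch of the grid-size rule described above, with placeholder values where a real implementation would query the runtime for the CU count and the achievable occupancy (the 304-CU / 2-workgroup numbers below are hypothetical):
      ```cpp
      #include <cstdio>

      // grid_size == -1 picks (max occupancy per CU) x (number of CUs);
      // any other value is used as requested.
      int resolve_grid_size(int requested_grid_size, int num_cu, int max_occupancy_per_cu) {
          return requested_grid_size == -1 ? num_cu * max_occupancy_per_cu : requested_grid_size;
      }

      int main() {
          // Hypothetical device: 304 CUs, 2 workgroups resident per CU.
          std::printf("grid_size = %d\n", resolve_grid_size(-1, 304, 2));
      }
      ```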
  19. 27 Jun, 2024 1 commit
  20. 22 Jun, 2024 1 commit
  21. 18 Jun, 2024 1 commit
  22. 12 Jun, 2024 1 commit
  23. 10 Jun, 2024 1 commit
  24. 05 Jun, 2024 2 commits
    • Integrate universal gemm with conv forward (#1320) · ac58cc5d
      Bartłomiej Kocot authored
      * Integrate universal gemm with conv fwd
      
      * Fix conv fwd wmma test
      
      * Fix instances
      
      * Remove direct load check
    • Add a scale op, related instances and examples (#1242) · cb0645be
      Rostyslav Geyyer authored
      
      
      * Add a scale op
      
      * Update the element op
      
      * Add instances
      
      * Add an example
      
      * Add a client example
      
      * Add a flag check
      
      * Revert flag check addition
      
      * Fix flag check
      
      * Update d strides in example
      
      * Update d strides in client example
      
      * Apply suggestions from code review
      
      Update copyright header
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Move the example
      
      * Move the client example
      
      * Update element op
      
      * Update example with the new element op
      
      * Add scalar layout
      
      * Update example
      
      * Update kernel for scalar Ds
      
      * Revert kernel changes
      
      * Update element op
      
      * Update example to use scales' pointers
      
      * Format
      
      * Update instances
      
      * Update client example
      
      * Move element op to unary elements
      
      * Update element op to work with values instead of pointers (see the functor sketch below)
      
      * Update instances to take element op as an argument
      
      * Update examples to use random scale values
      
      ---------
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
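      A hedged sketch of what an elementwise scale functor along these lines can look like once the scale is carried by value rather than by pointer; the struct name and signature are made up for illustration and are not this PR's exact definition:
      ```cpp
      #include <cstdio>

      // Illustrative elementwise scale op: the scale is stored by value and
      // applied in operator(), converting through float.
      struct Scale {
          float scale_;
          explicit Scale(float s) : scale_(s) {}

          template <typename Y, typename X>
          void operator()(Y& y, const X& x) const {
              y = static_cast<Y>(static_cast<float>(x) * scale_);
          }
      };

      int main() {
          Scale op{0.25f};
          float y = 0.f;
          op(y, 8.0f);
          std::printf("y = %f\n", y); // 2.0
      }
      ```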
  25. 23 May, 2024 1 commit
  26. 22 May, 2024 2 commits
  27. 08 May, 2024 1 commit
  28. 01 May, 2024 1 commit
  29. 29 Apr, 2024 1 commit
  30. 26 Apr, 2024 3 commits
    • [GEMM] UniversalGemm update (#1262) · 764164b4
      Haocong WANG authored
      
      
      * Add bf16 instances
      
      * Add bf16 gemm universal example
      
      * tempsave
      
      * Add guard to navi compilation
      
      * workaround for a specific mixed gemm instance (bring it back when the compiler fix is available)
      
      * fix formatting condition statement issue
      
      * solve conflict
      
      ---------
      Co-authored-by: Jun Liu <Liu.Jun@amd.com>
    • ggemm tile_loop multD bf16 int8 (#1258) · 5ae893c0
      zjing14 authored
      
      
      * Overload output stream operator for LoopScheduler and PipelineVersion
      
      * Add Run overload accepting grid descriptors MK.
      
      * Add __device__ keyword for CalculateGridSize
      
      * Create device op GroupedGemmMultipleD
      
      * Add GroupedGemm MultipleD Tile Loop implementation.
      
      * Add an example for GroupedGemm MultipleD tile loop.
      
      * Device Op GroupedGEMMTileLoop.
      
      * Bunch of small changes in example.
      
      * CkProfiler
      
      * Remove unused tparam.
      
      * changed the copy function to v7r2
      
      * adding multi_abd
      
      * in-progress
      
      * add post-load oob check
      
      * Fix include statement.
      
      * Fix output stream overloads.
      
      * Do not make descriptors and check validity until we find the group.
      
      * Fix gemm desc initialization.
      
      * debugging
      
      * adjust instances
      
      * add run_lds
      
      * add elementwise_op
      
      * replace multi_abd_device with v3
      
      * clean up
      
      * clean
      
      * clean
      
      * Revert device op
      
      * Fix compilation for DTYPES=FP16
      
      * Validate tensor transfer parameters.
      
      * Added LDSType
      
      * profiling
      
      * adjust oobcheck
      
      * add missing file
      
      * On host, validate only the NK dims if M is not known.
      
      * add
      
      * clean
      
      * refactor
      
      * clean
      
      * add examples
      
      * add fuse
      
      * add fusion and client example
      
      * Fix bug.
      
      * A convenient debug func for selecting threads.
      
      * Fix has main k block loop bug.
      
      * Make sure that b2c has up to date tile offset.
      
      * Output stream operator for Sequence type.
      
      * Cmake file formatting.
      
      * clean
      
      ---------
      Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
    • bf16A_Int8B with fastgelu/bias (#1264) · 0d0150db
      zjing14 authored
      * changed the copy function to v7r2
      
      * adding multi_abd
      
      * in-progress
      
      * add post-load oob check
      
      * debugging
      
      * adjust instances
      
      * add run_lds
      
      * add elementwise_op
      
      * replace multi_abd_device with v3
      
      * clean up
      
      * clean
      
      * clean
      
      * Added LDSType
      
      * profiling
      
      * adjust oobcheck
      
      * add missing file
      
      * refactor
      
      * clean
      
      * add examples
  31. 25 Apr, 2024 1 commit
    • Grouped GEMM Multiple D tile loop. (#1247) · b4032629
      Adam Osewski authored
      * Overload output stream operator for LoopScheduler and PipelineVersion
      
      * Add Run overload accepting grid descriptors MK.
      
      * Add __device__ keyword for CalculateGridSize
      
      * Create device op GroupedGemmMultipleD
      
      * Add GroupedGemm MultipleD Tile Loop implementation (the tile-loop scheduling is sketched below).
      
      * Add an example for GroupedGemm MultipleD tile loop.
      
      * Device Op GroupedGEMMTileLoop.
      
      * Bunch of small changes in example.
      
      * CkProfiler
      
      * Remove unused tparam.
      
      * Fix include statement.
      
      * Fix output stream overloads.
      
      * Do not make descriptors and check validity until we find the group.
      
      * Fix gemm desc initialization.
      
      * Revert device op
      
      * Fix compilation for DTYPES=FP16
      
      * Validate tensor transfer parameters.
      
      * On host, validate only the NK dims if M is not known.
      
      * Fix bug.
      
      * A convenient debug func for selecting threads.
      
      * Fix has main k block loop bug.
      
      * Make sure that b2c has up to date tile offset.
      
      * Output stream operator for Sequence type.
      
      * Cmake file formatting.
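      A conceptual sketch of the tile-loop scheduling: instead of dedicating a grid per group, a fixed set of workgroups strides over the flattened list of output tiles of all groups and maps each flat index back to (group, m-tile, n-tile). Group shapes and the grid size below are illustrative, not taken from the instances:
      ```cpp
      #include <cstdio>
      #include <vector>

      struct GemmDesc { int MTiles, NTiles; }; // per-group output tile grid

      int main() {
          const std::vector<GemmDesc> groups = {{4, 2}, {1, 8}, {3, 3}};
          int total_tiles = 0;
          for (const auto& g : groups) total_tiles += g.MTiles * g.NTiles;

          const int grid_size = 4; // number of persistent workgroups
          for (int wg = 0; wg < grid_size; ++wg) {
              for (int tile = wg; tile < total_tiles; tile += grid_size) {
                  // Map the flat tile index back to (group, m_tile, n_tile).
                  int t = tile, group = 0;
                  while (t >= groups[group].MTiles * groups[group].NTiles) {
                      t -= groups[group].MTiles * groups[group].NTiles;
                      ++group;
                  }
                  const int m = t / groups[group].NTiles;
                  const int n = t % groups[group].NTiles;
                  std::printf("wg %d -> group %d tile (%d, %d)\n", wg, group, m, n);
              }
          }
      }
      ```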
  32. 19 Apr, 2024 2 commits