Commits · 9d171dc57ca5e9eef218ca1db3b9958a858b7011 · gaoqiong / composable_kernel

28 Nov, 2023 1 commit

small edits to 3d transpose profiler · 9d171dc5

Astha Rai authored Nov 28, 2023

9d171dc5

11 Nov, 2023 1 commit

add more instances for bfp16 gemm (#1036) · 600fc000

zjing14 authored Nov 11, 2023



* add more instances for bfp16

* reduce the gemm input values to prevent round-off errors

---------
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>

600fc000

09 Nov, 2023 2 commits

Transpose 3d (#984) · 3af8c81a

arai713 authored Nov 08, 2023



* added working example for 5D input using 1D kernel

* example with 5D input tensor and 2d kernel - not working: issues with arguments

* added updated version of 3d device op - changed descriptors/dims

* added example file to check kernel

* fixed descriptor and isSupportedArgument stride problem

* added and modified kernel for 3d - updated tids/loop

* adding some more 5d example files

* fixed some issues

* changes made for testing

* working version: fixed error in stride for A, still a bit inefficient

* cleaned up formatting/comments

* updating formatting

* more formatting fixes

* fixing cmake, adding back gpu targets in cmake script

* adding client example

* added instances for client example

* fixed errors in client example

* implemented client ex with device_elementwise.hpp and device_elementwise_3d_impl.hpp

* removed extra files

* minor formatting and naming fixes

* adding test files and profiler

* fixing minor error

* minor fix

* removed unnecessary comments, renamed files

* updated instance list for client example, added different layout example

* removing instances

* fixed error in instance generation

* remove comments

* update profiler and client example tensor layouts

* fixed errors in test/profiler

* updated vector dim access to enable vector load

* updated test/profiler files

* updated example with 1d kernel

* updating profiler

* renamed files

---------
Co-authored-by: Jing Zhang <jizha@amd.com>

3af8c81a

Layernorm4d (#1022) · a3d9a2cd

rocking authored Nov 09, 2023



* Rename folder

* Add layernorm 4d fwd example

* Rename original layernorm example

* Add layernorm 4d f16 test

* Add layernorm4d_fwd client example

* Support layernorm4D in ckProfiler

* Rename groupnorm to groupnorm fwd in example

* Rename layernorm and group fwd in test

* Rename normalization to normalization_fwd (instances)

* Add fwd to DeviceNormalization

* Rename external api header

* Rename folder, because we can also add bwd in this folder

* Add fwd in layernorm and groupnorm (profiler)

* Fix compile error

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

a3d9a2cd

07 Nov, 2023 1 commit

Add Gemm instances for performance improvement (#1018) · 98fd41f5

zjing14 authored Nov 07, 2023



* improve kpad

* more tuning parameters

* f16_f8_fp16

* cut test time

* add f16_f8_fp16

* add f16_f8_f16

* testing instances for skinny cases

* format

* clean

* add fp16_f8_fp16

* clang-format

* add grouped gemm instances

* fixed profile grouped_gemm

* clean

* clean

* clean

* clean

* clean

* add missing instance func

* fixed interface

---------
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: root <root@sh5-1e707-rc06-38.mkm.dcgpu>

98fd41f5

02 Nov, 2023 1 commit

Add support for mixed precision in contraction scale and bilinear (#973) · 4ef704d8

Bartlomiej Wroblewski authored Nov 02, 2023



* Add support for mixed precision in contraction scale and bilinear (#936)

* Extract common functionality to separate files

* Reference contraction: Remove incorrect consts from type_converts

* Reference contraction: Add missing type_convert for dst value

* Reference contraction: Fix incorrect order of B matrix dimensions

* Add support for mixed precision in contraction scale and bilinear

* Move using statements from instances to a common file

* Move using statements from examples to a common file

* Fix the order of B matrix dimensions across examples and profiler

* Fix the computation of error threshold

* Make ComputeDataType an optional argument

* Include possible DataType -> ComputeDataType casting error in the threshold

* Remove commented code

* Make the ComputeDataType an optional argument in instance

---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

4ef704d8

31 Oct, 2023 1 commit

Add support for groups in Img2Col/Col2Img (#1007) · 2e824c6d

Bartłomiej Kocot authored Oct 31, 2023

* Add support for groups in Img2Col/Col2Img

* Fix interface test

* Fix interface test G to N

* Improve performance

* Change gemm layout to 3d

* Fixes

2e824c6d

28 Oct, 2023 1 commit

Fix the fp8 gemm for large tensors on MI300. (#1011) · f46a6ffa

Illia Silin authored Oct 27, 2023



* Fix the fp8 conversion

* Try clipping value before conversion

* Fix return

* Simplify with a const

* reduce the gemm input tensor values to reduce round-off error

* replace if-else with lambda

* fix syntax

---------
Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>

f46a6ffa
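
The "Try clipping value before conversion" step above can be illustrated with a minimal sketch. This is not the code from the commit: the function name is hypothetical, and the default bound of 240.0 is an assumption (the largest finite value of an E4M3 FNUZ-style fp8 format). Pass the bound that matches the target type.

```python
def clip_before_convert(value, max_representable=240.0):
    """Saturate a float to [-max_representable, max_representable].

    Clamping first means an out-of-range input maps to the largest finite
    value of the narrow type instead of overflowing during conversion.
    The default bound is an assumption, not taken from the commit.
    """
    return max(-max_representable, min(value, max_representable))


if __name__ == "__main__":
    for sample in (1000.0, -1000.0, 1.5):
        print(sample, "->", clip_before_convert(sample))
```

In-range values pass through unchanged, so the clamp only affects inputs that would otherwise overflow.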

19 Oct, 2023 1 commit

Change 1d,2d,... to 1D,2D,... (#997) · 0abc0f87

Bartlomiej Wroblewski authored Oct 19, 2023

0abc0f87

18 Oct, 2023 2 commits

Layernorm and groupnorm support to save mean and inverse std in forward (#929) · 3696fe1c

rocking authored Oct 19, 2023

* save mean and inverse std in normalization

* Save mean and inverse std in splitK

* Vector save mean and inv std

* Modify instance for save mean and std

* simplify the layernorm example

* Save mean and std in groupnorm example

* Save mean and inv std in ckProfiler and test

* Remove compute data type from base class

* Save mean and inv std in client example

* Add changelog

* clang format

* Fix compile error

* Refine naming

* Avoid error in bf16

* revert changelog

3696fe1c

Clean DTYPES conditions in CMake (#974) · bf435140

zjing14 authored Oct 18, 2023



* Add a condition to build fp8 instances

* simplified buffer_load/store

* add bfp8/fp8

* fixed

* remove all f8/bf8 condition include folder

* fixed cmake conditions

* fixed DTYPES=fp16/bfp16

* fix

* fixed buffer_load

* fixed buffer_store

* fix

* clean example cmake files

* fixed ci

* fixed ci

---------
Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Jing Zhang <jizha@amd.com>

bf435140

17 Oct, 2023 1 commit

Add grouped conv bwd weight wmma (#985) · 16d7c4d2

Bartłomiej Kocot authored Oct 17, 2023

* Add grouped conv bwd weight wmma

* Update README, changelog, profiler

* Minor fixes

* Fix grouped conv bwd wei dl kernel

* Minor fixes

* Minor stylistic fixes

16d7c4d2

13 Oct, 2023 1 commit

Add splitk gemm fp16 @ fp16 with fp8 compute instances (#983) · fa753f27

Rostyslav Geyyer authored Oct 13, 2023

* Add ComputeType

* Update for compatibility

* Add instances

* Update profiler api

fa753f27

05 Oct, 2023 1 commit

Revert "Add support for mixed precision in contraction scale and bilinear" (#967) · 4daedf8c

Illia Silin authored Oct 05, 2023

* Revert "Add support for mixed precision in contraction scale and bilinear (#936)"

This reverts commit f0748506.

* revert commits #957 and #960

4daedf8c

04 Oct, 2023 1 commit

Add conv bwd weight fp16 comp bf8 fp8 op, instances and example (#945) · 42facfc6

Rostyslav Geyyer authored Oct 04, 2023



* Add f8 bf8 gemm example

* Add element-wise ops

* Add intrinsics

* Update reference calculation

* Add an additional type option for xdlops gemm

* Fix build process

* Add bf8 to buffer addressing

* Update blockwise op, split typeA and typeB

* Update for compatibility

* Update naming to f8->fp8

* Update naming

* Format

* Update naming (#937)

* Add a client example

* Add computetypes to device and gridwise ops

* Add instances, update instance factory

* Format

* Fix a flag

* Add ckProfiler mode

* Fix typos

* Add an example

* Add bf8 generator

* add bf8 mfma; fixed type_convert for bf8

* move verification ahead of timing

* Update reference calculation

* Fix reference

* Narrow down float init range

* Fix bf8 bf8 mfma

* Add bf8 @ fp8 mfma

* Update example

* Update instances

* Update profiler api

* Update for compatibility

* Format

* Remove extra example

* Clean up

* workaround convert

---------
Co-authored-by: Jing Zhang <jizha@amd.com>

42facfc6

29 Sep, 2023 1 commit

Add support for mixed precision in contraction scale and bilinear (#936) · f0748506

Bartlomiej Wroblewski authored Sep 29, 2023

* Extract common functionality to separate files

* Reference contraction: Remove incorrect consts from type_converts

* Reference contraction: Add missing type_convert for dst value

* Reference contraction: Fix incorrect order of B matrix dimensions

* Add support for mixed precision in contraction scale and bilinear

* Move using statements from instances to a common file

* Move using statements from examples to a common file

* Fix the order of B matrix dimensions across examples and profiler

* Fix the computation of error threshold

* Make ComputeDataType an optional argument

* Include possible DataType -> ComputeDataType casting error in the threshold

* Remove commented code

f0748506

27 Sep, 2023 1 commit

Add column to image kernel (#930) · e2243a4d

Bartłomiej Kocot authored Sep 27, 2023

* Add column to image kernel

* Minor fixes for dtypes and client examples

* Disable tests for disabled dtypes

* Disable add instances functions for disabled data types

* Minor stylistic fixes

* Revert "Disable add instances functions for disabled data types"

This reverts commit 728b8695.

* Instances reduction

* Add comments in device_column_to_image_impl

* Update changelog and Copyrights

* Improve changelog

e2243a4d

26 Sep, 2023 1 commit

Add fp8 gemm instances (#920) · 94bfa502

Rostyslav Geyyer authored Sep 26, 2023

* Add fp8 gemm instances

* Update instance naming

94bfa502

13 Sep, 2023 1 commit

fixed fp8 issues (#894) · a66d14ed

zjing14 authored Sep 12, 2023



* fixed fp8 init; and reference gemm

* Update host_tensor_generator.hpp

* fixed convert

* fixed reference gemm

* fixed comments

* fixed comments

* fixed ci

* fixed computeType

---------
Co-authored-by: Jing Zhang <jizha@amd.com>

a66d14ed

12 Sep, 2023 1 commit

Refactor f8_t, add bf8_t (#792) · 62d4af74

Rostyslav Geyyer authored Sep 12, 2023

* Refactor f8_t to add bf8_t

* Add check_err impl for f8_t

* Update fp8 test

* Format

* Revert the fix

* Update vector_type implementation

* Add bf8 test

* Add bf8, use BitInt types

* Add bf8 conversion methods

* Update type_convert for fp8/bf8

* Add check_err fp8/bf8 support

* Add subnorm fp8 tests

* Add subnorm bf8 tests

* Fix conversion

* Add bf8 cmake bindings

* Add macros to enable build with disabled fp8/bf8

* Remove is_native method

* Update flag combination for mixed precision instances

* Add more flag checks

* Add another flag to a client example

* Add type traits, decouple f8/bf8 casting

* Clean up

* Decouple fp8 and bf8 flags

* Remove more redundant flags

* Remove leftover comments

62d4af74

08 Sep, 2023 1 commit

[Navi3x] Add fp16/int8 wmma conv forward instances (#746) · 562b4cec

Haocong WANG authored Sep 08, 2023



* fix wmma gemm int8; add grouped conv int8 example

* Add int8 gemm-bilinear instances

* compile sanity check unknown

* Sanity pass + clang-format

* add int8 conv profiler instances

* solve merge conflict

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

562b4cec

05 Sep, 2023 1 commit

Add image to column kernel (#867) · 0077eeb3

Bartłomiej Kocot authored Sep 05, 2023

* Add image to column kernel

* Add instances, tests, profiler, example

* Add client example

* Several fixes of image to column

* Fix variable name in device_image_to_column_impl

* Several fixes of image to column profiler

* Fix num_btype calculation

* Make new measurements for correct bytes calculation

0077eeb3
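
For readers unfamiliar with the operation this commit adds, the following is a minimal pure-Python sketch of image-to-column for a single-channel 2D image without padding or dilation. It only illustrates the data rearrangement. The function name and argument layout are hypothetical and do not reflect the interface of `device_image_to_column_impl`.

```python
def image_to_column(image, kernel_h, kernel_w, stride=1):
    """Rearrange sliding windows of a 2D image into rows of a matrix.

    Each output row holds one kernel_h x kernel_w patch, flattened in
    row-major order, so a convolution becomes a matrix product.
    """
    height, width = len(image), len(image[0])
    out_h = (height - kernel_h) // stride + 1
    out_w = (width - kernel_w) // stride + 1
    columns = []
    for oh in range(out_h):
        for ow in range(out_w):
            patch = [
                image[oh * stride + kh][ow * stride + kw]
                for kh in range(kernel_h)
                for kw in range(kernel_w)
            ]
            columns.append(patch)
    return columns


if __name__ == "__main__":
    img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    for row in image_to_column(img, 2, 2):
        print(row)
```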

31 Aug, 2023 1 commit

MaxPool & AvgPool bwd instances, test, ckProfiler, client example (#861) · 866377de

rocking authored Aug 31, 2023

* Add maxpool instances

* Rename index pool to max pool.

* Add maxpool bwd bf16 instances

* Add avg pool bwd instances

* Rename avgpool and maxpool to avg_pool3d and max_pool

* Add bf16 pool fwd instances

* Add max pool bwd to ckProfiler

* Add avg pool3d bwd to ckProfiler

* Add avg pool bwd test

* Fix bug of reference pool fwd (dilation)

* Fix bug of max pool bwd (dilation and initZero)

* Support bf16 compute data type

* Force compute type to be f32, because atomicAdd only supports f32

* Add max pool bwd test

* Rename folder

* Rename pool

* Add max pool bwd client example

* Add avg pool bwd client example

* Add missing workspace

* clang format

* Rename macro

* remove useless header

* remove useless layout

866377de
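
The notes above on zero initialization and atomicAdd can be illustrated with a small 1D sketch of max pool backward. It assumes the forward pass saved the index of each window's maximum. Overlapping windows can select the same input element, so gradients are accumulated into a zero-initialized buffer. On a GPU that accumulation is what would need an atomic add. The function names are hypothetical and this is not the library's kernel.

```python
def max_pool1d_fwd(x, window, stride):
    """Return pooled values and the input index of each window's maximum."""
    out, indices = [], []
    for start in range(0, len(x) - window + 1, stride):
        best = max(range(start, start + window), key=lambda i: x[i])
        out.append(x[best])
        indices.append(best)
    return out, indices


def max_pool1d_bwd(grad_out, indices, input_len):
    """Scatter-add output gradients back to the selected input positions."""
    grad_in = [0.0] * input_len  # must start at zero before accumulation
    for grad, index in zip(grad_out, indices):
        grad_in[index] += grad  # would be an atomic add in a parallel kernel
    return grad_in


if __name__ == "__main__":
    values, idx = max_pool1d_fwd([1.0, 5.0, 2.0, 3.0], window=2, stride=1)
    print(values, idx)
    print(max_pool1d_bwd([1.0, 1.0, 1.0], idx, 4))
```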

28 Aug, 2023 1 commit

Fp16/fp8 mixed-precision Gemm with multiply+add fusion (#865) · 31ea132a

zjing14 authored Aug 28, 2023



* add compute_type

* add multiply_add ckProfiler

* add f8_fp16 support

* clean

* clean

* fixed lds size calc

* format

---------
Co-authored-by: Jing Zhang <jizha@amd.com>

31ea132a

23 Aug, 2023 2 commits

[HotFix] add config and version files to pass on build info (#856) · c8a8385f

Jun Liu authored Aug 23, 2023

* experiment with config file

* experiment with version.h config

* add more info to version.h

* minor updates

* minor updates

* fix case where DTYPE is not used

* large amount of files but minor changes

* remove white space

* minor changes to add more MACROs

* fix cmakedefine01

* fix issue with CK internal conflict

* fix define and define value

* fix clang-format

* fix formatting issue

* experiment with cmake

* clang format v12 to be consistent with miopen

* avoid clang-format for config file

c8a8385f

add generic instances (#858) · 8ebea3a5

zjing14 authored Aug 23, 2023

Co-authored-by: Jing Zhang <jizha@amd.com>

8ebea3a5

22 Aug, 2023 2 commits

Ck profiler splitk (#857) · ca3115e7

zjing14 authored Aug 22, 2023



* updated regular gemm

* update ckProfiler

* fixed gtests

---------
Co-authored-by: Jing Zhang <jizha@amd.com>

ca3115e7

Add instances/ckProfiler/client example for fp8/fp16 mixed precision Gemm (#853) · eac50708

Rostyslav Geyyer authored Aug 22, 2023



* Add ComputeType arg to splitk device and gridwise ops

* Update for gridwise op compatibility

* Update bf16 and int8 splitk gemm examples with ComputeType

* Add instances

* Update ckProfiler for mixed precision cases

* Add a mixed precision splitK gemm client example

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

eac50708
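
As background for the split-K instances mentioned above, here is a minimal sketch of the split-K idea: the reduction dimension K is divided into `k_batch` chunks, each chunk yields a partial product, and the partial products are summed into the output. The function name and the list-of-lists matrix layout are hypothetical. The actual device and gridwise ops are not shown.

```python
def split_k_matmul(a, b, k_batch):
    """Compute C = A @ B by summing partial products over K chunks."""
    m, k, n = len(a), len(b), len(b[0])
    chunk = -(-k // k_batch)  # ceiling division: K elements per batch
    c = [[0.0] * n for _ in range(m)]
    for batch in range(k_batch):
        k_lo = batch * chunk
        k_hi = min(k, k_lo + chunk)
        for i in range(m):
            for j in range(n):
                partial = sum(a[i][kk] * b[kk][j] for kk in range(k_lo, k_hi))
                c[i][j] += partial  # accumulation across K batches
    return c


if __name__ == "__main__":
    print(split_k_matmul([[1, 2, 3]], [[1], [1], [1]], k_batch=2))
```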

14 Aug, 2023 1 commit

Refactor pool fwd (#815) · f60f0a5e

rocking authored Aug 15, 2023

* Do not hardcode stride

* devicePool2DFwd inherits devicePool3DFwd

* Move instance declaration out of common

* Add dilation

* use the pool3d rank, because pool2d inherits pool3d

* calculate Do Ho Wo for the dilation

* Fix header name

* Modify ckProfiler

* Remove pool2d instance

* Remove pool2d in profiler

* Remove pool2d and add dilation

* In the client example, this commit revises the following:
1. Add dilation.
2. Use pool3d to implement pool2d

* Refine naming and IsSupportedArgument()

* Add dilation to maxpool bwd example

* clang format

* 1. Remove useless header
2. Fix copyright
3. Refine naming

* Add layout parameter to pool fwd

* clang format

* Fix merge error

* Fix compile error

* Remove layout parameter in derived class

* Refine changelog

* Fix compile error

* Fix compiler error

* Add layout to external api and profiler

f60f0a5e
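
The "calculate Do Ho Wo for the dilation" step can be sketched with the commonly used output-size formula for dilated pooling. This is the standard convention and is assumed here, not copied from the commit. The helper names are hypothetical. The second function also shows how a 2D pool can be expressed as a 3D pool with a depth of 1.

```python
def pool_output_size(in_size, window, stride, dilation=1, pad_left=0, pad_right=0):
    """Output length along one spatial axis for a dilated pooling window."""
    effective_window = dilation * (window - 1) + 1
    return (in_size + pad_left + pad_right - effective_window) // stride + 1


def pool3d_output_sizes(in_dhw, window_dhw, stride_dhw, dilation_dhw):
    """Return (Do, Ho, Wo) for unpadded 3D pooling."""
    return tuple(
        pool_output_size(size, window, stride, dilation)
        for size, window, stride, dilation in zip(
            in_dhw, window_dhw, stride_dhw, dilation_dhw
        )
    )


if __name__ == "__main__":
    # A 2D pool over an 8x8 input, written as a 3D pool with depth 1.
    print(pool3d_output_sizes((1, 8, 8), (1, 2, 2), (1, 2, 2), (1, 1, 1)))
```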

11 Aug, 2023 1 commit

Add Normalization splitk instances (#829) · 03b8119e

rocking authored Aug 12, 2023

* Add normalization splitK to layernorm and groupnorm instances

* Fix bug of GetKPerThread()

* Refine naming

* clang format

03b8119e

07 Aug, 2023 2 commits

Allow building CK for specific data types and split off last remaining DL instances. (#830) · 08eb1769

Illia Silin authored Aug 07, 2023

* properly split conv_nd_bwd_data instances

* split conv2d_fwd instance data types

* split the gemm, conv2d_fwd and batched_gemm_softmax_gemm

* split the tests by data types where possible

* filter examples by DTYPES

* split few remaining examples by DTYPES

* filter most instances by DTYPES

* add new lines at end of headers, fix grouped_gemm profiler

* fix syntax

* split the ckprofiler instances by DTYPES

* split the conv2d and quantization DL and XDL instances

* fix the splitting of conv2d DL instances

* split softmax and pool_fwd tests for fp16 and fp32 types

* fix syntax

* fix the dl_int8 quantization instances isolation

08eb1769

Add wei_strides to grouped conv3d wei to keep consistency (#817) · 22443f7a

Bartłomiej Kocot authored Aug 07, 2023



* Add wei_strides to grouped conv3d wei to keep consistency

* Fix strides in client examples

* Unify backward weight api with forward

* Fix for example

* Fixes for examples

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

22443f7a

02 Aug, 2023 1 commit

Update tuning parameter & compilation options of DeviceGemmXdl<> instance (layout=TT) (#819) · f7cc8c3b

Po Yen Chen authored Aug 02, 2023

* Enable pipeline v2 opt for layout=TT instance

* Use better thread mapping for reading A tile

* Conditionally enable pipeline v2 opt

* Allow enabling only fp16 gemm instances in profiler

* Fix formatting error

* Fix compilation error if we enable fp32 in profiler

f7cc8c3b

26 Jul, 2023 2 commits

initial stream-k implementation with example (#699) · e7dca79d

carlushuang authored Jul 27, 2023



* initial stream-k implementation with example

* fix unexpected change in err

* improve performance a little by reorganizing the pipeline

* improve perf a little by swizzling the block idx

* add profiler

* update example

* fix spelling

* shrink karg for streamk

* support dynamic buffer using memory coherence glc_slc bit from template

* control memory coherence while constructing dynamic buffer

* update reduction for streamk(not ready yet)

* Add template parameter to make_dynamic_buffer to support amd_buffer coherence setting

* fix build issue

* fix several bugs

* now result is correct, everything works (but has scratch)

* remove scratch by manually reset coordinate

* update device code

* fix a bug in final reduce

* fix something in example

* update async memset

* fix enum as camel case

* modify coherence enum name

* clean code and use atomic streamk by default

* remove unused var

* throw exception if there is an empty pointer

* fix format

* fix CI warning

* fix type in init

* modify CI error

* filter out on gfx10+

* restore changed example code

---------
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>

e7dca79d
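
As background for the stream-k commit above, the following sketch shows the general stream-K partitioning idea: the total number of K-loop iterations over all output tiles is divided as evenly as possible among workgroups, so one workgroup's range may cross tile boundaries, and tiles shared between workgroups need a reduction step afterwards. The function and its return layout are hypothetical and are not taken from the commit.

```python
def stream_k_partition(num_tiles, iters_per_tile, num_workgroups):
    """Assign each workgroup a contiguous range of global K iterations.

    Returns, per workgroup, a list of (tile, iter_begin, iter_end) segments,
    where the iteration bounds are local to that tile.
    """
    total = num_tiles * iters_per_tile
    base, extra = divmod(total, num_workgroups)
    assignments = []
    start = 0
    for wg in range(num_workgroups):
        end = start + base + (1 if wg < extra else 0)
        segments = []
        it = start
        while it < end:
            tile = it // iters_per_tile
            tile_first = tile * iters_per_tile
            seg_end = min(end, tile_first + iters_per_tile)
            segments.append((tile, it - tile_first, seg_end - tile_first))
            it = seg_end
        assignments.append(segments)
        start = end
    return assignments


if __name__ == "__main__":
    for wg, segs in enumerate(stream_k_partition(3, 4, 2)):
        print("workgroup", wg, segs)
```

In the usage example, tile 1 is split between both workgroups, which is the case where partial results have to be combined.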

Disable DL kernels by default. (#816) · 9195435c

Illia Silin authored Jul 26, 2023

9195435c

21 Jul, 2023 1 commit

Grouped conv bwd wei NDHWGC/NDHWGK (#804) · 10732847

Bartłomiej Kocot authored Jul 21, 2023

10732847

18 Jul, 2023 2 commits

Grouped 3d conv backward data support (#799) · 49180fd6

Bartłomiej Kocot authored Jul 18, 2023

* Grouped 3d conv backward data support

* Fix comments

49180fd6

Add mechanism to build CK for select data types, add Navi3x CI. (#790) · 189ea3b9

Illia Silin authored Jul 17, 2023

* allow building CK for specific data types

* add CI build and test stage on Navi3x without some int8 instances

* add missing gemm fp16 instances

* add the changes to the missed cmake file

* add empty lines at end of source files

* Do not build quantization client example on navi3 in CI

* disable batched_gemm_multi_d_int8 instances with DTYPES

* disable device_conv2d_bwd_data_instance with DTYPES

* fix ckprofiler for conv_bwd_data for int8

* properly isolate the conv_bwd_data int8 instances

* remove empty line

189ea3b9

12 Jul, 2023 1 commit

Support NHWGC conv2d_bwd_weight (#769) · 1ee99dca

Bartłomiej Kocot authored Jul 12, 2023



* Support NHWGC conv2d_bwd_weight

* Fix client example

* Fix client example

* Fix comments

* Redesign grouped_conv_bwd_weight instances

* Clang format fix

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

1ee99dca

06 Jul, 2023 1 commit

Batchnorm splitk single kernel (#771) · 8f5cafaf

Qianfeng authored Jul 06, 2023

* Use dim 0 as faster dim for writing mean/var/count workspace in batchnorm multiblock method [performance]

* Add CountDataType as template parameter in blockwise_welford

* Add utility/get_shift.hpp

* Add BatchNorm multiblock single-kernel implementation

* Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a

* Renaming in device_batchnorm_forward_impl.hpp

* Tiny fix in the batchnorm_fwd profiler

* Revert "Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a"

This reverts commit d16d00919c43f10759e7b4e4d112125221ed9064.

* Use the old two-kernel batchnorm multiblock method for gfx1030

* Use the old two-kernel batchnorm multiblock method for gfx908

* use the single-kernel batchnorm multiblock method only for gfx90a

* Remove get_wave_id() from utility/get_id.hpp since it is not used

* Set true for testing running mean/variance and saving mean/invvariance in the examples

* Fix copyright wording

* Remove unneeded include in utility/get_id.hpp

* Add comments to workgroup_synchronization.hpp

* Remove unused code in gridwise_multiblock_batchnorm_forward.hpp

* Renaming in the kernels

* Remove un-used kernel file

8f5cafaf
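
The multiblock method above stores per-block mean/var/count values that must later be combined. A minimal sketch of that combination step, using the standard parallel form of Welford's algorithm, is shown below. It assumes population variance and a `(mean, var, count)` tuple layout. Both are assumptions for illustration and are not taken from `blockwise_welford`.

```python
def welford_merge(a, b):
    """Merge two partial (mean, population variance, count) results."""
    mean_a, var_a, n_a = a
    mean_b, var_b, n_b = b
    n = n_a + n_b
    if n == 0:
        return (0.0, 0.0, 0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # Sum of squared deviations of the union, from the two partial sums
    # plus a correction term for the difference between the means.
    m2 = var_a * n_a + var_b * n_b + delta * delta * n_a * n_b / n
    return (mean, m2 / n, n)


if __name__ == "__main__":
    # Two blocks that each saw a single value: 1.0 and 3.0.
    print(welford_merge((1.0, 0.0, 1), (3.0, 0.0, 1)))
```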