Commits · 600a987098151bb9e883c02a634b7fe9653b77ca · gaoqiong / composable_kernel

24 Apr, 2023 3 commits

Clean up · 600a9870
Alan Turner authored Apr 24, 2023

600a9870
Add jit library · 17acbbf4
Alan Turner authored Apr 24, 2023

17acbbf4

Revise layout of group convolution (#675) · 3eecbfb6

rocking authored Apr 24, 2023

* [What] Remove pure conv int8 instance
[Why] We will never use pure int8 conv in AI, use int8 quantization instead

* Change layout

* Share the kernel parameter

* Support more type of NHWGC for group conv

* Revise client example of conv 2d, use NHWGC layout

* Add instance to cmake

* Revise layout of group conv quantization instance

* Revise layout of external api of group conv quantization

* Revise layout of group conv quantization client example

* Fix clang format

* Add comment to describe meaning of each parameter

3eecbfb6

22 Apr, 2023 1 commit

Put back the split-k gemm code. (#684) · 903cd19c

Illia Silin authored Apr 21, 2023



* simplify karg in device/grid split-k op

* fix mk_kn_mn instances

* add more instances

* use name from tensor layout

---------
Co-authored-by: carlushuang <carlus.huang@amd.com>

903cd19c

17 Apr, 2023 1 commit
- Add (#677) · fd11a4a1
  rocking5566 authored Apr 17, 2023
  
  fd11a4a1
10 Apr, 2023 1 commit

Groupnorm + swish external api (#668) · ed3a2e52

rocking5566 authored Apr 10, 2023

* Rename to proper naming

* Add example of groupnorm + swish

* Extract duplicate code in example

* Add groupnorm + swish instances

* Ractor instance generation, split into multiple cpp file

* Add external api and client example

* Refine profiler message

* Use ck math version of exp

* Refine problem size in example

* Add host version of exp

ed3a2e52

07 Apr, 2023 1 commit
- Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665) · 3248387b
  Jun Liu authored Apr 06, 2023
```
This reverts commit bb5530af.
```
  3248387b
30 Mar, 2023 2 commits

add fp64 instances (#658) · fde6d274
zjing14 authored Mar 30, 2023
```
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
```
fde6d274

simplify karg in device/grid of split-k op (#644) · bb5530af

carlushuang authored Mar 30, 2023

* simplify karg in device/grid split-k op

* fix mk_kn_mn instances

* add more instances

* use name from tensor layout

bb5530af

29 Mar, 2023 1 commit

Conv + quantization + tanh (#645) · 389e84a8

rocking5566 authored Mar 30, 2023



* Rename file. Prepare to support another activation

* Add comment for quantization

* Extract out_elementop

* Add tanh example

* Add conv + bias + tanh quantization instance

* Add missing parameter

* Refine cmake

* Add external api and client example

* Extract variable in example

* Fix the comment

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

389e84a8

20 Mar, 2023 1 commit

workaround 637 (#640) · 6ae12434

ltqin authored Mar 21, 2023



* add workaround 637

* format

* change id

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

6ae12434

15 Mar, 2023 1 commit

gemm/Conv xdlops + dlops quantization (#625) · 16dc18e0

rocking5566 authored Mar 16, 2023

* Add conv perlayer quantization

* Add gemm_dlops quantization

* Support int8 for innerproduct

* Refine gemm dlops int8 kernel parameter

* Support gfx908(MI100) and gfx90a(MI200)

* clang-format

* Rename example number

* Support different layout for d tensor

* Add conv dlops perchannel quantization example

* Move to example 40

* Extract the common code for different platform (dlops and xdlops)

* Move ot subfolder. Prepare to add other op of quantization

* Refine the quantization instance library

* Add conv dl instances and client example

* Remove unnecessary type

* Add gemm quantization instance

* Add external api and client example

* Refine num_bytes

* Separete different layout to different cpp

* Add more xdl instances

* Revert "Remove unnecessary type"

This reverts commit 82086918

.

* Remove CShuffleDataType in dlops
Let acc and CShuffleDataType be the same in xdlops

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

16dc18e0

08 Mar, 2023 1 commit

GroupedGEMM + Gelu client example/instances/profiler (#614) · 9096b1c7

Adam Osewski authored Mar 08, 2023



* Grouped gemm + Gelu instances.

* Device Instance Factory for GroupedGemm+Gelu

* Client example

* Rangify fill helper functions.

* Fix name clash.

* Profiler for grouped_gemm+gelu

* No need to use full namespace name.

* Add check for MRaw divisible by vector load.

* Ugly fix for big errors.

* Add grouped_gemm+gelu to profiler CMakelists.

* Store in argument additional info.

* Information about Mraw, Nraw, Kraw values.

* Use FastGelu instead of Gelu.

* Change client ex to use FastGelu

* Remove relaxed error precision.

* Remove duplicate output elementwise-op

---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>

9096b1c7

06 Mar, 2023 1 commit

Generate output using Doxygen / Breathe (#598) · e4bf6d42

pmaybank authored Mar 06, 2023



* Modify Doxygen config to pick up include directories recursively

* Add DeviceMem struct to API Reference guide

* Add classes that are used in Flash Attention kernel

* Add a reference and config for generating bibliography
Co-authored-by: Philip Maybank <Philip.Maybank@amd.com>

e4bf6d42

15 Feb, 2023 2 commits

Improve normalization (#580) · 6a6163a3

rocking5566 authored Feb 16, 2023

* Sync the order of type string with template parameter

* Add more instances

* Check the vector size and remove redundant var

* Extract var to static, prepare to separate sweep once kernel

* Separate sweeponce flow and optimize the flow

* 1. Rename AccDatatype in normalization to computeData
2. Rename AccElementwiseOperation to YElementwiseOperation in normalization

* Remove useless code

* Update naive variance kernel

* Refine string

* Fix typo

* Support naive variance for device_normalization

* Check the blocksize

* Share the VGPR of x and y

* Share the VGPR of gamma and beta

* Add more instances

* Support fp16 sqrt for experiment

* Add CHANGELOG

* Fix typo

* clang-format

6a6163a3

Conv3D FWD BWD WRW fp16 fp32 client examples (#559) · e9fd1228

Adam Osewski authored Feb 15, 2023



* Conv3d bwd weight client example.

* Update year in license

* Convolution bwd data 3D fp16/fp32 client example.

* Client example for convnd fwd fp16 fp32

* clang-format

* Review remarks.

* Fix compiler err.

* Update data layout to standard one.

* Add conv 3d fwd NDHWGC instances

* clang-format

* Conv3d fwd NDHWGC instances.

---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>

e9fd1228

13 Feb, 2023 1 commit

GroupedGEMM more bigger tiles. (#577) · 8f42780f

Adam Osewski authored Feb 13, 2023



* Adding more bigger tiles.

* Remove failing instance.

* Remove instances which that don't improve perf.

---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>

8f42780f

09 Feb, 2023 2 commits

Gemm+layernorm instance, ckProfiler, client example (#568) · f7d28f3e

rocking5566 authored Feb 10, 2023

* Add gemm + layernorm instance

* Add ckProfiler

* Add test

* Add client example

* Detect if user forger to set the workrspace

* Use literal in the example

* [What] use builtin function for sqrt
[Why] compiler will not use v_sqrt_f64_e64 if we use ::sqrt()

* check gemm vaildity in IsSupportedArgument

* Add more testcases

* Merge duplicated folder in client example

* Print more infomation

* Use better kernel parameter for MS problem size

* clang format

* Add constexpr for if condition and remove redundant include

* Remove cstdlib and add constexpr

f7d28f3e

Add instance for elementwise normlization (#573) · 76d144fa

guangzlu authored Feb 10, 2023

* added instances for large N

* add instance for elementwise normlization

* added supported restrict in device_elementwise_normalization_impl.hpp

76d144fa

08 Feb, 2023 1 commit

Add GemmAddSoftmaxGemm support for MSFT ORT (instances and client API) (#576) · 332ccc33

ltqin authored Feb 09, 2023

* add instance for gemm bias softmax gemm

* add client example

* change CGridDesc_G_M_N to CGridDesc_G_M_O

* add gridwise

* change c grid name

* device add d0s data

* fix 08 client_example

* add example 47_fused_attention

* example output correct

* add d0 to example

* add d0 element op

* rechange instance code

* change Acc0ElementwiseOperation to C0DEElementwiseOperation

* change example name

* update instance for cdeelementwiseop

* add bhalf_t ScaleAdd

* add test

* not surport geem1 bias

* remove some ignore

* fix test bug

332ccc33

26 Jan, 2023 1 commit
- Add more instances for irregular GEMM sizes. (#560) · 7494c1c6
  Adam Osewski authored Jan 26, 2023
```
Co-authored-by: Adam Osewski <aosewski@amd.com>
```
  7494c1c6
25 Jan, 2023 1 commit

Batchnorm inference instances, external API, client examples and gtests (#531) · a1b2441f

Qianfeng authored Jan 26, 2023

* File renaming and class renaming for device element-wise operation

* Add batchnorm-infer instances, external API and client example

* Add batchnorm-infer profiler module and gtests

* Remove file device_elementwise_extension.hpp and move NormalizeInInfer operation to element_wise_operation.hpp

* Remove the using of class aliasing for DeviceElementwiseForBatchNormInfer

* Rename class and file due to conflict from device_elementwise_2d.hpp

* Fix namespace in batcnnorm_infer_nhwc client example

a1b2441f

18 Jan, 2023 4 commits

Use double for all scaling values and float-point constant values at the Device Op API (#557) · 52abc2f3

Qianfeng authored Jan 19, 2023

* Use double as alpha/beta values type in reduce device op api

* Use double as alpha/beta values type in softmax device op api

* Use double as alpha/beta values type in multiple-reduce device op api

* Use double as epsilon value type in normalization/elementwise-normalization device op api

52abc2f3

Add multiD Gemm client APIs (#534) · d66421fe

ltqin authored Jan 19, 2023



* start add example

* fix config

* fix showinfo bug

* add an elementop

* change to padding

* add xdl example

* change elementwiseop

* add instance

* add instance to profiler

* change file name

* fix deive not support issue

* add client example

* fix client gemm_add_multiply name

* change AddMultiply elementwiseop

* fix elementwiseop

* fix client example

* fix addmultiply op

* fix comments and fun name
Co-authored-by: letaoqin <letaoqin@amd.com>

d66421fe

fix a bug for 6-dim kernels (#555) · 00ff30af
Illia Silin authored Jan 18, 2023

00ff30af

Add client API/examples for 3xGemm+Bias+Add+Permute{0, 2, 3, 1} (#550) · 55236709

ltqin authored Jan 19, 2023

* add example

* fix example

* add instance for gemm permute

* add to client example

* change configs

* change instance file name

* formate

* change client example file name and remove example

55236709

17 Jan, 2023 2 commits

Reduction external API and client examples (#493) · 80e05267

Qianfeng authored Jan 17, 2023



* Change to the DeviceReduce base class template to include all problem description information

* Add external api for reduction

* Add client example to test the reduction external api

* Spelling correction

* Re-implement the host_reduction to follow the DeviceReduce base API format

* Change the reduce profiler to call the external API for collecting device instances

* Rename reduce client example directory from 08_reduce to 12_reduce

* Remove (void) before the functional call

* Tiny update in reduce client example

* Tiny update in profile_reduce_impl.hpp

* Rename the reduce client example directory
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

80e05267

Gemm layernorm welford (#413) · 7829d729

rocking5566 authored Jan 17, 2023



* Add device op of gemm layernorm

* [What] Rename F to H
[Why] F and G prepare for welford tensor

* Add gridwise gemm + welford

* Extract template parameter

* Rename kernel. Prepare to add second half kernel

* Extract var

* Add second kernel for gemm+layernorm

* Move to the gemm_layernorm folder

* Rename F and G to mean and var

* Do not use snakeCurved, it makes determination of padding  for welford difficult

* Rewrite the device interface and rename some var

* Add welford count

* Update interface

* Sync code, prepare to test on MI200

* Clean the code

* Implement layernorm

* Add comment to mension hipFree

* Wrtie out the e for debug.
This could be remove and use h for instead

* 1. Allocate mean, var and count into by SetWorkSpacePointer.
2. Add GetWorkSpaceSize to calculate the space size

* Add gemm layernorm host code

* use reference layernorm

* Fix bug of blockwise welford for first kernel

* Fix bug of mean var padding for layernorm

* Use sgpr for shuffleM_index

* padding for GemmMeanVarCountGridDescriptor_M_NBlock

* Add layout parameter

* Check argument for gemm

* calculate max count for tail block

* Share E and H memory in device op

* Hard code the vector dim

* Refine the MakeDescriptor

* 1. Remove E parameter, because E is inside of device op
2. Check vector size

* [What] Rename MakeMeanVarDescriptor_M_N
[Why] Prepare to add count version of make descriptor

* Use 1D global memory for count

* Prevent redundant IO

* Update parameter

* Add pipeline v1/v2 selector

* Rename the example name

* Add base class for gemm layernorm

* Refine naming to distinguish naive and welford

* Add comment to explan in detail

* We don't need to pad in N dimension in gemm for mean/var/count. Set NPerTile 1

* Rewrite the 2st kernel, use multiple block along N dimension in layernorm kernel

* Share the vector size

* Refine var name

* [What] Force LayernormThreadSliceSize_N = vector size.
[Why] Memory coalesce

* Add comment

* Extract divisor out of the loop in reference layernorm

* Pad different size for E and H in layernorm kernel according to different block tile

* Refine naming

* Refine naming

* Prevent implicit cast

* [What] use ck::math::sqrt instead of __builtin_amdgcn_sqrtf
[Why] __builtin_amdgcn_sqrtf is only support float, double will cause casting

* Cast only constant

* Change of post shuffle thread descriptor

* Add EMeanVarDataType parameter.

* Merge the mean and var threadwise copy

* Add missing index

* Fix Typo

* Sync the variable with previous if

* 1. Declare e inside the host_gemm_layernorm()
2. Prevent implicit cast in reference code
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

7829d729

15 Dec, 2022 2 commits

Add MNK padding, M = 0 support into grouped_gemm (#539) · 0345963e

zjing14 authored Dec 15, 2022



* add mnk padding, support m=0

* clean code

* clean code
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>

0345963e

Add padding device_gemm_add_add_fastgelu_xdl_c_shuffle instances to enable... · 9a1f2475

Rostyslav Geyyer authored Dec 14, 2022

Add padding device_gemm_add_add_fastgelu_xdl_c_shuffle instances to enable arbitrary problem size (#535)

* Add padding device_gemm_add_add_fastgelu_xdl_c_shuffle instances

* Add padding device_gemm_add_fastgelu_xdl_c_shuffle instances

* Add gemm_add_fastgelu profiler impl

* Add padding device_gemm_fastgelu_xdl_c_shuffle instances

* Add gemm_fastgelu profiler impl

9a1f2475

07 Dec, 2022 1 commit

Add padding device_gemm_xdl instances (#529) · c7a4d361

Rostyslav Geyyer authored Dec 07, 2022


Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

c7a4d361

02 Dec, 2022 1 commit

Add multiple d gridwise gemm on Navi21 for ResNet50 (#517) · 23ecf0fa

ltqin authored Dec 03, 2022



* start add example

* add multiple d fp16 example

* device transfer elementwiseop to gridwise

* gridwise add multiple d

* change example for multiple d

* fix spill registers

* fix for passthrough element op

* fix int8 overflow

* change example file name

* add instance for dl multiple d

* example add DsDataType

* remove grouped_convolution_forward_dl.hpp

* add head file(was deleted before)

* fix not support device issue

* format

* remove passthrough check
Co-authored-by: letaoqin <letaoqin@amd.com>

23ecf0fa

30 Nov, 2022 2 commits

gemm, conv perchannel quantization (#503) · ad541ad6

rocking5566 authored Dec 01, 2022

* Use gemm_multiple_D instead

* Add gemm bias relu quantization example

* Add pure gemm quantization example

* Add quantization of perchannel conv + bias + relu example

* Refine the code

* Rename multiplier to requant_scale

* Rename the folder

* Remove redundant comment

* Rename the file. Prepare to add perchannel

* Add conv perchannel instance

* Move to quantization folder

* Add conv perchannel client example

* Apply Rangify constructor of HostTensorDescriptor & Tensor<>

* Fix merge error

ad541ad6

BatchNorm backward instance/external API/profiler/tests (#519) · 63af525c

Qianfeng authored Dec 01, 2022

* Refine the device batchnorm-backward base API templates and data type assignments

* Remove duplicated kernel file

* Add batchnorm backward instances and external API

* Add batchnorm-backward profiler and tests

* Add client example which uses batchnorm backward external API

* Merge test/batchnorm_fwd and test/batchnorm_bwd into one directory

* Loose the threshold for batchnorm-backward check_err()

63af525c

29 Nov, 2022 1 commit

BatchNorm backward implementation (#461) · 44789d99

Qianfeng authored Nov 29, 2022

* Implemented batchnorm-backward Blockwise and Multiblock kernels

* Add batchnorm-backward device op

* Add batchnorm-backward host-reference op

* Add batchnorm-backward example

* Parameters renaming in batchnorm backward kernels and device op

* Change in the example to loose the threshold for ScaleDiff checking

* Add comments to explain the implementation of batchnorm-backward

* Parameters renaming again in batchnorm backward kernels

* Improve the expression calculation for performance

* Add batchnorm backward to README

* Add comments to explain inv-variance in batchnorm forward and backward

* Renaming the batchnorm forward training and inferring examples

* Add/update the comments for batchnorm-backward kernels

* Renaming again

* Add block_sync_lds between two consecutive blockwise reductions

* Move common expression 1/N out of the static_for loops

* Add dy_elementwise_op

* Renaming in backward example again

* Add checking for reduceDims in reference_batchnorm_backward

* Update to comments and codes format

* Rename in the comments

* Remove common expression out of the loop in reference_batchnorm_backward_nhwc_c

* Add block_sync_lds() between blockwise reduction again

* Fix comments again

* Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test

44789d99

28 Nov, 2022 1 commit
- Remove int8 from batchnorm-forward instances since it is not needed for... · 5bf0475a
  Qianfeng authored Nov 29, 2022
```
Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test (#516)
```
  5bf0475a
25 Nov, 2022 1 commit

BatchNorm forward instance/external api/profiler/tests/client example (#511) · 4e6a5575

Qianfeng authored Nov 25, 2022



* Update to device_batchnorm_forward base class to include all template parameters for problem description

* Add batchnorm forward instances and external api

* Add batchnorm forward profiler module which uses the external api

* Add some comments in batchnorm_forward example to explain the dimensions in lengths[]

* Replace the reference_batchnorm_forward_nhwc_c by generic reference_batchnorm_forward

* Improvement to the batchnorm infer base API

* Add batchnorm forward client example which shows using the batchnorm forward external API

* Add test for batchnorm forward

* Tuning the batchnorm profiler initialized values and error threshold

* Add support for bhalf_t in instances/external api/tests

* Add support for int8_t in instances/external api/tests

* Add support for double in instances/external api/tests

* Let ScaleDataType and BiasDataType be same as XDataType and YDataType when creating instances

* Checking before running best instance in batchnorm_fwd_nhwc client example

* Add checking for YElementwiseOp in batchnorm_forward external API

* Add more types in batchnorm forward profiler

* Add more test lengths
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>

4e6a5575

20 Nov, 2022 1 commit

Client examples AddFastGelu and FastGelu + instances. (#509) · 43a889b7

Adam Osewski authored Nov 20, 2022



* FastGelu support for more data types.

* AddFastGelu & FastGelu instances.

* Client example.

* clang-format

* Remove unused stride variable.

* Add new line at EOF.
Co-authored-by: Adam Osewski <aosewski@amd.com>

43a889b7

15 Nov, 2022 2 commits

Add BF16 tests for batched_gemm_softmax_gemm_permute (#504) · 4c4c7328

guangzlu authored Nov 16, 2022



* fixed bug in softmax reference & add bf16 examples for batched_gemm_scale_softmax_gemm

* added bf16 tests for batched_gemm_softmax_gemm_permute

* changed format of device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp

* changed format device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp

* aligned annotations

* modified CMakeLists for examples

* add common example code of fp16/bf16 version for batched_gemm_scale_softmax_gemm_xdl

* use macro to control the instances

* added macro control into instances

* clang-format some files

* changed error tolerance for bf16

* changed index for 10_elementwise_normalization

* fixed xdlops code bug in amd_xdlops.hpp
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

4c4c7328

Add Conv Backward Data on Navi21 for ResNet50 (#499) · db0eb1ea

ltqin authored Nov 16, 2022



* start add example

* add device dl

* change launch kernel

* change init data method

* change example config

* add config valid check

* add instance for dl bwd

* add instance to ckProfiler

* reserver to profiler and cmakelist

* add instance to ckProfiler2

* change instance f32 config

* fix example return value
Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

db0eb1ea