Commits · 24c9ee1d22e02579d2e5db255722debb020e133b · yangql / composable_kernel-1

15 Feb, 2023 5 commits

Add contraction_fp64 example (#570) · 24c9ee1d

zjing14 authored Feb 15, 2023



* add contraction_bilinear

* add contraction_scale_xdl_fp64

* reduce tile size to avoid register spill

---------
Co-authored-by: root <root@ctr-ubbsmc16.amd.com>

24c9ee1d

Improve normalization (#580) · 6a6163a3

rocking5566 authored Feb 16, 2023

* Sync the order of type string with template parameter

* Add more instances

* Check the vector size and remove redundant var

* Extract var to static, prepare to separate sweep once kernel

* Separate sweeponce flow and optimize the flow

* 1. Rename AccDatatype in normalization to computeData
2. Rename AccElementwiseOperation to YElementwiseOperation in normalization

* Remove useless code

* Update naive variance kernel

* Refine string

* Fix typo

* Support naive variance for device_normalization

* Check the blocksize

* Share the VGPR of x and y

* Share the VGPR of gamma and beta

* Add more instances

* Support fp16 sqrt for experiment

* Add CHANGELOG

* Fix typo

* clang-format

6a6163a3

[Navi3x] Add Device Operations (#567) · 0cfda84d

Haocong WANG authored Feb 16, 2023

* wmma_op + unit test

* add arch limitation to wmma test

* change arch limitation

* Refactor + Add all type unit test(int4 compile failed)

* Add f32_16x16x16_bf16 unit test

* tempsave

* tempsave

* tempsave

* runtime bug, cannot find symbol

* workaround for incorrect HIP warpSize return value

* debugging

* tempsave

* Correctness OK, waiting for optimization

* Tidy up + format

* temp save

* temp save, reproduce the v_bfi_b32 issue

* add inline asm for wmmaop test

* tidy up

* clean some debug purpose code

* discard some codes

* clang format

* clang format

* compiler issue fixed + increase tile size

* navi3x_multipleD+example

* temp save

* workable

* batchedgemm[OK], groupconv[debug]

* groupconv: Sanity check[OK], Performance[Bad]

* navi3x_groupconv_need_optimization

* format

* Add arch limitation to all wmma examples

* fix bug: example30 input conv args

0cfda84d

Conv3D FWD BWD WRW fp16 fp32 client examples (#559) · e9fd1228

Adam Osewski authored Feb 15, 2023



* Conv3d bwd weight client example.

* Update year in license

* Convolution bwd data 3D fp16/fp32 client example.

* Client example for convnd fwd fp16 fp32

* clang-format

* Review remarks.

* Fix compiler err.

* Update data layout to standard one.

* Add conv 3d fwd NDHWGC instances

* clang-format

* Conv3d fwd NDHWGC instances.

---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>

e9fd1228

Remove the workaround for bf16 attention tests. (#586) · 06f1fc86

Illia Silin authored Feb 14, 2023

* remove workaround in bf16 attention test

* clean up another workaround

06f1fc86

13 Feb, 2023 1 commit

GroupedGEMM more bigger tiles. (#577) · 8f42780f

Adam Osewski authored Feb 13, 2023



* Adding more bigger tiles.

* Remove failing instance.

* Remove instances that don't improve perf.

---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>

8f42780f

10 Feb, 2023 1 commit

enable batched_gemm_softmax_bf16 tests (#582) · 0ac0f51a

Illia Silin authored Feb 10, 2023

0ac0f51a

09 Feb, 2023 2 commits

Gemm+layernorm instance, ckProfiler, client example (#568) · f7d28f3e

rocking5566 authored Feb 10, 2023

* Add gemm + layernorm instance

* Add ckProfiler

* Add test

* Add client example

* Detect if user forgot to set the workspace

* Use literal in the example

* [What] use builtin function for sqrt
[Why] compiler will not use v_sqrt_f64_e64 if we use ::sqrt()

* check gemm validity in IsSupportedArgument

* Add more testcases

* Merge duplicated folder in client example

* Print more information

* Use better kernel parameter for MS problem size

* clang format

* Add constexpr for if condition and remove redundant include

* Remove cstdlib and add constexpr

f7d28f3e

Add instance for elementwise normalization (#573) · 76d144fa

guangzlu authored Feb 10, 2023

* added instances for large N

* add instance for elementwise normalization

* added supported restrict in device_elementwise_normalization_impl.hpp

76d144fa

08 Feb, 2023 3 commits

adding the first draft of changelog (#571) · b63accee

Illia Silin authored Feb 08, 2023

* adding the first draft of changelog

* second draft of changelog

b63accee

Add GemmAddSoftmaxGemm support for MSFT ORT (instances and client API) (#576) · 332ccc33

ltqin authored Feb 09, 2023

* add instance for gemm bias softmax gemm

* add client example

* change CGridDesc_G_M_N to CGridDesc_G_M_O

* add gridwise

* change c grid name

* device add d0s data

* fix 08 client_example

* add example 47_fused_attention

* example output correct

* add d0 to example

* add d0 element op

* rechange instance code

* change Acc0ElementwiseOperation to C0DEElementwiseOperation

* change example name

* update instance for cdeelementwiseop

* add bhalf_t ScaleAdd

* add test

* do not support gemm1 bias

* remove some ignore

* fix test bug

332ccc33

Fix a couple more CI issues. (#578) · bb3d9546

Illia Silin authored Feb 08, 2023

* test the QA cron parameter for compiler commit

* create separate dockers for latest and fixed amd-stg-open compiler versions

* change groovy syntax

* apply cron timers back to develop branch

bb3d9546

06 Feb, 2023 1 commit

Fix CI issues. (#572) · f73574ff

Illia Silin authored Feb 06, 2023

* switch to recent staging compiler as default for CI

* fix the baseline query

* roll back sqlalchemy to version 1.4.46

f73574ff

01 Feb, 2023 1 commit

Add the markdown tutorial hello world (#563) · afdfef74

Rostyslav Geyyer authored Feb 01, 2023



* Add the markdown tutorial

* Clean up

---------
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>

afdfef74

31 Jan, 2023 1 commit

remove unused variable (#564) · ba40c2ce

who who who authored Jan 31, 2023

* remove unused variable

* format code

ba40c2ce

30 Jan, 2023 1 commit

Use defined seed for deterministic test runs. (#562) · 274108d6

Adam Osewski authored Jan 30, 2023

Co-authored-by: Adam Osewski <aosewski@amd.com>

274108d6

26 Jan, 2023 1 commit

Add more instances for irregular GEMM sizes. (#560) · 7494c1c6

Adam Osewski authored Jan 26, 2023

Co-authored-by: Adam Osewski <aosewski@amd.com>

7494c1c6

25 Jan, 2023 1 commit

Batchnorm inference instances, external API, client examples and gtests (#531) · a1b2441f

Qianfeng authored Jan 26, 2023

* File renaming and class renaming for device element-wise operation

* Add batchnorm-infer instances, external API and client example

* Add batchnorm-infer profiler module and gtests

* Remove file device_elementwise_extension.hpp and move NormalizeInInfer operation to element_wise_operation.hpp

* Remove the using of class aliasing for DeviceElementwiseForBatchNormInfer

* Rename class and file due to conflict from device_elementwise_2d.hpp

* Fix namespace in batchnorm_infer_nhwc client example

a1b2441f

18 Jan, 2023 6 commits

Use double for all scaling values and floating-point constant values at the Device Op API (#557) · 52abc2f3

Qianfeng authored Jan 19, 2023

* Use double as alpha/beta values type in reduce device op api

* Use double as alpha/beta values type in softmax device op api

* Use double as alpha/beta values type in multiple-reduce device op api

* Use double as epsilon value type in normalization/elementwise-normalization device op api

52abc2f3

Wavelet (inter-wave consumer-producer) GEMM (#310) · 1cfa8760

Raman R jana authored Jan 18, 2023



* wavelet gemm programming model support for CK

* GEMM pipeline update for wavelet programming model

* Updated wavelet programming pipeline

* fixes for global-write for math-wave

* fixed bug in global writes

* Updated comments for better readability

* fixed clang format errors

* added block_lds without barrier sync

* clean

* clean

* clean

* clean

* refactor

* prototype

4 layouts

fix default stride

all problem sizes

tidy

move file

update build script

restore old file

fix build

* refactor standalone test to use gemm test harness

* simplify gemm test

* update build script

* remove redundant

* early return when cmd arg doesn't match

* tidy

* report failure when result not validated

* tidy

* Add comment depicting B2C mapping pattern.

* Formatting & comments.

* Comparison with custom B2C mapping pattern.

* Example for wavelet gemm.

* Add wavelet to Gemm standalone test.

* Remove debug code.

* Remove dangling #endif directive.

Co-authored-by: root <Raman Jana>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Anthony Chang <ac.chang@outlook.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

1cfa8760

Add multiD Gemm client APIs (#534) · d66421fe

ltqin authored Jan 19, 2023



* start add example

* fix config

* fix showinfo bug

* add an elementop

* change to padding

* add xdl example

* change elementwiseop

* add instance

* add instance to profiler

* change file name

* fix device-not-supported issue

* add client example

* fix client gemm_add_multiply name

* change AddMultiply elementwiseop

* fix elementwiseop

* fix client example

* fix addmultiply op

* fix comments and function name

Co-authored-by: letaoqin <letaoqin@amd.com>

d66421fe

fix a bug for 6-dim kernels (#555) · 00ff30af

Illia Silin authored Jan 18, 2023

00ff30af

add multi embeddings support (#542) · 147b7db5

who who who authored Jan 19, 2023

* add multi embeddings support

* fix format

* optimize sqrt

* add reduce operation

* change to elementwise op

* fix name

* rename

* run ci cd

* format example

* format code

* format code

147b7db5

Add client API/examples for 3xGemm+Bias+Add+Permute{0, 2, 3, 1} (#550) · 55236709

ltqin authored Jan 19, 2023

* add example

* fix example

* add instance for gemm permute

* add to client example

* change configs

* change instance file name

* format

* change client example file name and remove example

55236709

17 Jan, 2023 3 commits

Reduction external API and client examples (#493) · 80e05267

Qianfeng authored Jan 17, 2023



* Change to the DeviceReduce base class template to include all problem description information

* Add external api for reduction

* Add client example to test the reduction external api

* Spelling correction

* Re-implement the host_reduction to follow the DeviceReduce base API format

* Change the reduce profiler to call the external API for collecting device instances

* Rename reduce client example directory from 08_reduce to 12_reduce

* Remove (void) before the function call

* Tiny update in reduce client example

* Tiny update in profile_reduce_impl.hpp

* Rename the reduce client example directory
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

80e05267

Gemm layernorm welford (#413) · 7829d729

rocking5566 authored Jan 17, 2023



* Add device op of gemm layernorm

* [What] Rename F to H
[Why] F and G prepare for welford tensor

* Add gridwise gemm + welford

* Extract template parameter

* Rename kernel. Prepare to add second half kernel

* Extract var

* Add second kernel for gemm+layernorm

* Move to the gemm_layernorm folder

* Rename F and G to mean and var

* Do not use snakeCurved, it makes determination of padding for welford difficult

* Rewrite the device interface and rename some var

* Add welford count

* Update interface

* Sync code, prepare to test on MI200

* Clean the code

* Implement layernorm

* Add comment to mention hipFree

* Write out the e for debug.
This could be removed, using h instead

* 1. Allocate mean, var and count by SetWorkSpacePointer.
2. Add GetWorkSpaceSize to calculate the space size

* Add gemm layernorm host code

* use reference layernorm

* Fix bug of blockwise welford for first kernel

* Fix bug of mean var padding for layernorm

* Use sgpr for shuffleM_index

* padding for GemmMeanVarCountGridDescriptor_M_NBlock

* Add layout parameter

* Check argument for gemm

* calculate max count for tail block

* Share E and H memory in device op

* Hard code the vector dim

* Refine the MakeDescriptor

* 1. Remove E parameter, because E is inside of device op
2. Check vector size

* [What] Rename MakeMeanVarDescriptor_M_N
[Why] Prepare to add count version of make descriptor

* Use 1D global memory for count

* Prevent redundant IO

* Update parameter

* Add pipeline v1/v2 selector

* Rename the example name

* Add base class for gemm layernorm

* Refine naming to distinguish naive and welford

* Add comment to explain in detail

* We don't need to pad in N dimension in gemm for mean/var/count. Set NPerTile 1

* Rewrite the 2nd kernel, use multiple blocks along N dimension in layernorm kernel

* Share the vector size

* Refine var name

* [What] Force LayernormThreadSliceSize_N = vector size.
[Why] Memory coalesce

* Add comment

* Extract divisor out of the loop in reference layernorm

* Pad different size for E and H in layernorm kernel according to different block tile

* Refine naming

* Refine naming

* Prevent implicit cast

* [What] use ck::math::sqrt instead of __builtin_amdgcn_sqrtf
[Why] __builtin_amdgcn_sqrtf only supports float; double will cause casting

* Cast only constant

* Change of post shuffle thread descriptor

* Add EMeanVarDataType parameter.

* Merge the mean and var threadwise copy

* Add missing index

* Fix Typo

* Sync the variable with previous if

* 1. Declare e inside the host_gemm_layernorm()
2. Prevent implicit cast in reference code
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

7829d729

[Navi3x-LWPCK-545] Block-wise GEMM + Real GEMM_WMMA_FP16 (#541) · 919aeb1f

Haocong WANG authored Jan 17, 2023

* wmma_op + unit test

* add arch limitation to wmma test

* change arch limitation

* Refactor + Add all type unit test(int4 compile failed)

* Add f32_16x16x16_bf16 unit test

* tempsave

* tempsave

* tempsave

* runtime bug, cannot find symbol

* workaround for incorrect HIP warpSize return value

* debugging

* tempsave

* Correctness OK, waiting for optimization

* Tidy up + format

* temp save

* temp save, reproduce the v_bfi_b32 issue

* add inline asm for wmmaop test

* tidy up

* clean some debug purpose code

* discard some codes

* clang format

* clang format

* compiler issue fixed + increase tile size

919aeb1f

12 Jan, 2023 2 commits

Add a flag to enable/disable debug output in many kernels. (#549) · 715e8dd2

Illia Silin authored Jan 11, 2023

* add DEBUG_LOG macro to enable/disable debug output

* fix syntax

* fix syntax again

* fix syntax one more time

* remove blank spaces

* use ifdefs

* add the Print argument

* move the definition of DEBUG_LOG to ck.hpp

* add the missing argument to Print()

715e8dd2

Remove including of cmath (#551) · a17b0414

Qianfeng authored Jan 12, 2023

* Let cmath be included when compiling host code in math_v2.hpp

* Remove including of cmath in device_base.hpp and device_permute.hpp

a17b0414

15 Dec, 2022 4 commits

Add MNK padding, M = 0 support into grouped_gemm (#539) · 0345963e

zjing14 authored Dec 15, 2022



* add mnk padding, support m=0

* clean code

* clean code
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>

0345963e

disable the attention test that fails on MI100 (#540) · 11151175

Illia Silin authored Dec 15, 2022

11151175

Add interface GetTypeIdName() and GetTypeIdHashCode() for Device Op (#533) · 10c72ace

Qianfeng authored Dec 15, 2022

10c72ace

Add padding device_gemm_add_add_fastgelu_xdl_c_shuffle instances to enable... · 9a1f2475

Rostyslav Geyyer authored Dec 14, 2022

Add padding device_gemm_add_add_fastgelu_xdl_c_shuffle instances to enable arbitrary problem size (#535)

* Add padding device_gemm_add_add_fastgelu_xdl_c_shuffle instances

* Add padding device_gemm_add_fastgelu_xdl_c_shuffle instances

* Add gemm_add_fastgelu profiler impl

* Add padding device_gemm_fastgelu_xdl_c_shuffle instances

* Add gemm_fastgelu profiler impl

9a1f2475

14 Dec, 2022 1 commit

Add a docker hub doc file (#538) · 74744cab

Rostyslav Geyyer authored Dec 14, 2022

74744cab

12 Dec, 2022 1 commit

Gridwise elementwise 2d (#466) · 0e5c264c

arai713 authored Dec 12, 2022



* added 2d gridwise elementwise

* added 2d version of device elementwise

* added example file with updated device elementwise call

* added Cmake file

* changed NumDim into 2D

* fixed compiler issues

* fixed indexing for loop step

* fixed NumDim dimension error

* changed blockID to 2D

* updated Grid Desc

* updated kernel call

* fixed 2d thread indexing

* added dimensions for example file

* commented out unused code

* changed vector load

* removed extra code

* temporarily removing vector load on 2nd dim

* changed vector load back, still causing errors

* altered indexing

* changed isSupportedArgument for 2D

* changed indexing + do/while

* fixed isSupportedArgument

* changed dimension for debugging

* fixed

* added testing printouts

* testing change

* added variables to distribute threads through both dimensions

* testing changes

* integrated variable for thread distribution into device elementwise and added as parameter for gridwise elementwise

* removed most of the extraneous code, testing with different dimensions

* testing

* removed debugging print statements

* moved 2d elementwise permute into elementwise permute directory

* fixed formatting

* removed debugging comments from threadwise transfer
Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

0e5c264c

08 Dec, 2022 1 commit

Make sure that GEMM sizes in K dimension are supported. (#527) · d58b7f51

Illia Silin authored Dec 08, 2022

* apply new K-dimension check in gemm_xdl_cshuffle

* add K-dim check to gemm_xdl and batched_gemm_xdl

* fix syntax

* fix syntax

* clean-up the debug output

d58b7f51

07 Dec, 2022 3 commits

Fix Grouped ConvBwdWeight test case failure (#524) · 614a7b1b

Po Yen Chen authored Dec 08, 2022

* Use smaller tensor size in test

* Use even smaller tensor size

* Touch only failing test case inputs

614a7b1b

Add padding device_gemm_xdl instances (#529) · c7a4d361

Rostyslav Geyyer authored Dec 07, 2022

Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

c7a4d361

modified half function in math_v2.hpp (#528) · ce87b4f7

guangzlu authored Dec 08, 2022

Co-authored-by: Chao Liu <chao.liu2@amd.com>

ce87b4f7

06 Dec, 2022 1 commit

Fix CI error. (#530) · d072790f

Illia Silin authored Dec 06, 2022

* ignore .git folder when doing clang-format

* fix syntax

* add backslashes before quotes

* add path filter for several extensions

d072790f