Commits · 2778e99758e149a6cb5309ca307bf7c1e61a562f · yangql / composable_kernel-1

19 Feb, 2022 1 commit

JD authored Feb 18, 2022



* add docker file and make default target buildable

* add Jenkinsfile

* remove empty env block

* fix package stage

* remove render group from docker run

* clean up Jenkins file

* add cppcheck as dev dependency

* update cmake file

* Add profiler build stage

* add hip_version config file for reduction operator

* correct jenkins var name

* Build release instead of debug

* clean up
Co-authored-by: Chao Liu <chao.liu2@amd.com>

2778e997

12 Feb, 2022 1 commit

NHWC conv 2d: fwd bfp16/int8, Device level tuning and host API (#73) · 880fbee9

ltqin authored Feb 12, 2022



* add fwd bf16 conv

* change tuning parameter

* add int8 for conv fwd

* remove comments

* change tuning parameter for int8

* change init int8 example

* add test for conv2d fwd

* change device operation file pos because merge develop

* fwd int8 use reference

* test_conv_fwd use reference

* add bracket for if statement

* rename fwd example name

* remove StaticBufferOfVectorTypeV2

* tweak example
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

880fbee9

11 Feb, 2022 4 commits

Add small tile size for fp16/fp32 and NN layout (#80) · 20a672d0

zjing14 authored Feb 11, 2022



* add DeviceGemmSplitKXdl

* add file device_gemm_splitk_xdl.hpp

* set c matrix zero

* using atomic

* add all tuning parameter to f32 mkkn

* grid size change to 720

* add tuning parameter for NT

* add tuning parameter for TN

* add tuning parameter for TT

* add m=96 tuning parameter

* add lost config

* debug

* fix sweep

* add failed tuning params

* fixed sweep logic

* clean

* add padding to M/N for irr tile size

* clean code

* add element wise operation

* fixed MPerBlock=96

* remove macro for split-k switch

* add test

* add new line at the end of device_gemm_xdl_instance.hpp

* remove step hack

* separate split-k instance files

* add tuning parameters

* change desired grid size to parameters

* remove slice length

* add desiredgridsize parameter to ckProfiler

* add missing file device_gemm_xdl_splitk_instance.hpp

* change desired grid size to kbatch

* format

* format

* clean up

* add selection of device_instances

* clean code

* clean code

* add small tile size in fp16 nn

* test for rocm 4.5

* merge develop

* clean

* clean

* clean

* remove unused code

* add padding switch to device_gemm_xdl

* add padding switch for ksplit fp32

* clean

* clean

* add files

* rename

* Update profiler.cpp

* format
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: ltqin <letao.qin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

20a672d0

Batched GEMM for fp16 (#79) · b53e9d08

zjing14 authored Feb 11, 2022

* prepare host for batched_gemm

* init commit of batched kernels

* fixed

* refine transform with freeze

* m/n padding

* fixed a bug; clean

* add small tiles

* clean

* clean code

* clean code

* add nt, tn, tt layout

* add missing file

* use StaticBufferTupleOfVector instead

* add reference_batched_gemm

* fixed a macro

b53e9d08

Support alpha beta scaling for GEMM (#78) · 6f928a08

rocking5566 authored Feb 11, 2022



* [What] Add 2d version of bias, prepare to implement alpha / beta scaling

* Add alpha / beta functor

* Refine parameter of example

* [What] Use real type instead of template
[Why] Prevent implicit cast

* Rename parameter for general operator

* Remove redundant comment

* Fix compile error
Co-authored-by: rocking <chunylai@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

6f928a08

fix build breaks (#81) · 904cbe2a

Anthony Chang authored Feb 11, 2022



- device_gemm_xdl_c_shuffle function signature matches split-k
- retire host_driver since it is no longer maintained
- linter error (unused variable)
Co-authored-by: Chao Liu <chao.liu2@amd.com>

904cbe2a

07 Feb, 2022 1 commit

GEMM+Bias+ReLU+Add (#76) · 823657ed

Chao Liu authored Feb 06, 2022

* tweak conv for odd C

* update script

* clean up elementwise op

* fix build

* clean up

* added example for gemm+bias+relu+add

* added example for gemm+bias+relu

* add profiler for gemm_s_shuffle; re-org files

* add profiler

* fix build

* clean up

* clean up

* clean up

* fix build

823657ed

04 Feb, 2022 1 commit

References for conv2d fwd bias relu and add (#75) · 690c75a7

ltqin authored Feb 04, 2022



* add reference

* clean up

* add reference for conv

* rename
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

690c75a7

03 Feb, 2022 2 commits

Replace llvm intrinsics with clang builtins (#65) · 6d92959a

zjing14 authored Feb 02, 2022

* test mfma builtins

* add fp16 builtins

* add int8 builtins

* add bf16 builtins

* simplify host conv forward

* clean

* clean

6d92959a

add split-k GEMM (#59) · 4be7f019

ltqin authored Feb 03, 2022



* add DeviceGemmSplitKXdl

* add file device_gemm_splitk_xdl.hpp

* set c matrix zero

* using atomic

* add all tuning parameter to f32 mkkn

* grid size change to 720

* add tuning parameter for NT

* add tuning parameter for TN

* add tuning parameter for TT

* add m=96 tuning parameter

* add lost config

* add element wise operation

* fixed MPerBlock=96

* remove macro for split-k switch

* add test

* add new line at the end of device_gemm_xdl_instance.hpp

* remove step hack

* separate split-k instance files

* add tuning parameters

* change desired grid size to parameters

* remove slice length

* add desiredgridsize parameter to ckProfiler

* add missing file device_gemm_xdl_splitk_instance.hpp

* change desired grid size to kbatch

* format

* format

* clean up

* add selection of device_instances

* clean code

* fix build issue
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>

4be7f019
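
The split-k bullets above mention zeroing the C matrix, using atomics, and a kbatch parameter. As a rough host-side illustration of that idea (not the code from this commit), the sketch below splits the K dimension into `k_batch` slices and accumulates each slice's partial product into a C matrix that was zeroed first. The function name `split_k_gemm_sketch`, the row-major layout, and the ceil-division slicing are all assumptions made for the example; on a GPU the accumulation would typically be an atomic add because slices run concurrently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side sketch of split-K GEMM: C = A * B, row-major,
// with the K dimension divided into k_batch slices.
std::vector<float> split_k_gemm_sketch(const std::vector<float>& a, // M x K
                                       const std::vector<float>& b, // K x N
                                       std::size_t M,
                                       std::size_t N,
                                       std::size_t K,
                                       std::size_t k_batch)
{
    assert(k_batch > 0);
    assert(a.size() == M * K && b.size() == K * N);

    // Slice length, rounded up so that the slices cover all of K.
    const std::size_t k_per_batch = (K + k_batch - 1) / k_batch;

    // C must start at zero because every slice adds into it.
    std::vector<float> c(M * N, 0.0f);

    for(std::size_t batch = 0; batch < k_batch; ++batch)
    {
        // Clamp the slice to [0, K]; trailing slices may be empty.
        const std::size_t k_begin = std::min(K, batch * k_per_batch);
        const std::size_t k_end   = std::min(K, k_begin + k_per_batch);

        for(std::size_t m = 0; m < M; ++m)
        {
            for(std::size_t n = 0; n < N; ++n)
            {
                float partial = 0.0f;
                for(std::size_t k = k_begin; k < k_end; ++k)
                {
                    partial += a[m * K + k] * b[k * N + n];
                }
                // Sequential here; a device kernel would need an atomic add.
                c[m * N + n] += partial;
            }
        }
    }
    return c;
}
```

With `k_batch = 1` this reduces to an ordinary GEMM, which makes it easy to compare against a plain reference.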

25 Jan, 2022 1 commit

Do not hardcode the function parameter, use template instead. (#72) · ca47a6cf

rocking5566 authored Jan 25, 2022

* Do not hardcode the function parameter, use template instead.

* [What] Remove AThreadTransferSrcResetCoordinateAfterRun and BThreadTransferSrcResetCoordinateAfterRun in host API
[Why] "C_Shuffle" version is supposed to be similar to the vanilla one

* Fix typo
Let DeviceGemmXdl_C_Shuffle use kernel_gemm_xdlops_v3r1

ca47a6cf

21 Jan, 2022 1 commit

Add gemm_shuffle host api (#71) · 4d40b197

rocking5566 authored Jan 21, 2022

* [What]
1. Add DeviceGemmXdl_C_Shuffle
2. Revise example of gemm_xdl
[Why] Prepare to add shuffle version of D = alpha * (A * B) + beta * C
[How] Imitate DeviceGemmXdl and device_conv2d_fwd_xdl_c_shuffle_nhwc_kyxc_nhwk.hpp

4d40b197
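
For readers unfamiliar with the formula in the [Why] line above, D = alpha * (A * B) + beta * C can be written as a small host-side reference. This is only an illustrative sketch under assumed conventions (row-major storage, `float` data, and the hypothetical name `reference_gemm_alpha_beta`); it is not the DeviceGemmXdl_C_Shuffle implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host reference for D = alpha * (A * B) + beta * C.
// All matrices are assumed to be row-major.
std::vector<float> reference_gemm_alpha_beta(const std::vector<float>& a, // M x K
                                             const std::vector<float>& b, // K x N
                                             const std::vector<float>& c, // M x N
                                             std::size_t M,
                                             std::size_t N,
                                             std::size_t K,
                                             float alpha,
                                             float beta)
{
    assert(a.size() == M * K && b.size() == K * N && c.size() == M * N);

    std::vector<float> d(M * N);
    for(std::size_t m = 0; m < M; ++m)
    {
        for(std::size_t n = 0; n < N; ++n)
        {
            // Plain dot product of row m of A with column n of B.
            float acc = 0.0f;
            for(std::size_t k = 0; k < K; ++k)
            {
                acc += a[m * K + k] * b[k * N + n];
            }
            // Scale the product and add the scaled C element.
            d[m * N + n] = alpha * acc + beta * c[m * N + n];
        }
    }
    return d;
}
```

Setting `alpha = 1` and `beta = 0` gives a plain GEMM, so the same reference can check both the scaled and unscaled paths.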

18 Jan, 2022 1 commit
- Fix building issue for examples (#66) · 6260ced2
  Chao Liu authored Jan 17, 2022
```
* fix build issue
```
  6260ced2
26 Dec, 2021 1 commit

Fusion Conv+Bias+ReLU(+Add) (#62) · acbd7bd7

Chao Liu authored Dec 26, 2021

* fix relu

* clean up

* clean up

* adding 1x1 conv

* adding 1x1 conv

* added 1x1 conv

* refactor

* refactor

* refactor

* added profiler for conv+bias+relu+add

* clean up

* adding conv+bias+relu

* adding conv+bias+relu

* added conv+bias+relu

* Update README.md

* update cpu verification

* adding c shuffle

* update static_tensor for dealing with invalid element

* adding c shuffle

* debugging

* fix bug

* convert to fp16 before shuffle

* shuffle more than one M/NRepeat

* clean up

* remove coordinate step hack from GridwiseGemm_k0mk1_k0nk1_mn_xdlops_v3r1

* clean up

* remove coordinate step hack from all gridwise gemm xdl

* clean up coordinate step hack

* clean up coordinate step hack

* ThreadwiseTensorSliceTransfer_v3r2 support pointwise op on both src and dst

* adding output shuffle in conv+bias+relu+add

* update

* added conv+bias+relu+add with c shuffle

* added conv+bias+relu+add with c shuffle

* fix forward_sweep bugs in threadwise copy

* clean up

* refactor

* clean up

* clean up

* added conv_c_shuffle+bias_relu

* clean up

* added conv+bias+relu+atomic_add

* clean up

* clean up

* clean up

* clean up

* clean up

* clean up

* misc fixes; add 1x1 specialization

* clean up

* delete unused device op

* clean up

* add support for odd C value

acbd7bd7

13 Dec, 2021 1 commit

manually apply bug fix changes in pr #63 (#64) · a4f24233

Chao Liu authored Dec 12, 2021

* Bug in BlockwiseGemmXdlops_k0mk1_k0nk1_m0n0m1n1m2m3m4n2_v1::MakeCGridDescriptor_M0_N0_M1_N1_M2_M3_M4_N2()
* Bug in ThreadwiseTensorSliceTransfer_v1r3 logic for calculating "forward_sweep"

a4f24233

04 Dec, 2021 1 commit
- fix ReLU formula (#61) · fd3d907a
  Chao Liu authored Dec 04, 2021
```
* fix relu

* clean up

* clean up
```
  fd3d907a
03 Dec, 2021 1 commit

GEMM/Conv+BiasAdd+ReLU+Add (#55) · 41cdd380

Chao Liu authored Dec 02, 2021

* gemm+activation

* move C pointwise operation into threadwise copy

* add pointwise operation to A/B matrix

* update ckProfiler

* adding bias add

* adding bias add

* adding bias add

* added bias add; worked around compiler issues

* clean up

* clean up

* Update README.md

* Update README.md

* Update README.md

* clean up

* add conv_xdl example

* adding conv_xdl_bias_relu_add example

* add conv+bias+relu+add, but has register spill issue

* tweak

* tweak

* refactor

* Update README.md

update readme for example/2_gemm_xdl_bias_relu_add

* clean up

* Update README.md

update readme for example/3_conv_xdl

* Update README.md

41cdd380

02 Dec, 2021 3 commits
- renaming/comments · d7a0a3f9
  Jing Zhang authored Dec 02, 2021
  
  d7a0a3f9
- add static_buffer_v2 zero out · 2cbb8976
  Jing Zhang authored Dec 02, 2021
  
  2cbb8976
- fixed c_buffer alloc · d798c9b8
  Jing Zhang authored Dec 02, 2021
  
  d798c9b8
30 Nov, 2021 2 commits
- fix layout naming convention (#56) · 4041850f
  Chao Liu authored Nov 30, 2021
  
  4041850f
- added test for magic number division (#58) · 237d4ca0
  Chao Liu authored Nov 30, 2021
  
  237d4ca0
24 Nov, 2021 1 commit
- add args for packed gemm (#54) · 567f5e9c
  zjing14 authored Nov 24, 2021
  
  567f5e9c
18 Nov, 2021 3 commits

Use __builtin_memcpy to implement bit_cast and for accessing vector from pointer of scalars (#53) · 64350aff
Chao Liu authored Nov 18, 2021
```
* reworking vector_type

* use __builtin_memcpy for bit_cast and vector access of scalar pointer

* clean up
```
64350aff
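
The technique named in this commit title can be sketched as follows. `__builtin_memcpy` is a GCC/Clang builtin; copying bytes with it reinterprets an object's bits without the aliasing problems of a pointer cast. The names `bit_cast_sketch`, `float4_sketch`, and `load_float4` are hypothetical and chosen for this example only; the library's actual `bit_cast` and vector types may look different.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical bit_cast: reinterpret the bytes of src as type To.
template <typename To, typename From>
To bit_cast_sketch(const From& src)
{
    static_assert(sizeof(To) == sizeof(From), "sizes must match");
    static_assert(std::is_trivially_copyable<To>::value &&
                      std::is_trivially_copyable<From>::value,
                  "types must be trivially copyable");
    To dst;
    __builtin_memcpy(&dst, &src, sizeof(To));
    return dst;
}

// Hypothetical stand-in for a 4-wide vector type.
struct float4_sketch
{
    float x, y, z, w;
};

// Read four consecutive scalars through a scalar pointer as one vector.
inline float4_sketch load_float4(const float* p)
{
    assert(p != nullptr);
    float4_sketch v;
    __builtin_memcpy(&v, p, sizeof(v));
    return v;
}
```

Compilers generally lower a fixed-size `__builtin_memcpy` like this to a plain register move or vector load, which is why it is a common substitute for `reinterpret_cast`-based tricks.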

v5r1 fusion kernels for inference (#49) · 970fa3e9

zjing14 authored Nov 18, 2021



* init

* refactor for 1x1

* rename e0_e1

* add e1 with bugs

* debug

* fixed

* fixed e1

* add timer

* improve threadwise gemm with dot2

* add e2

* tuning

* separate c2

* add nhwc

* restore nchwc

* clean

* opt

* fixed; tuning

* add BGlobalMoveSliceWindowStepHacks{}

* tuning

* repeat running

* adjust

* merge v5r1 nchwc

* add adaptors

* split k0 k1 in c_thread_grid

* split h and w

* remove v5r1 nhwc

* clean for pr

* remove host_conv_add

* clean code

* clean

* add dynamic support

* static mode

* test static

* add conv+add fusion

* fixed validation

* naming fix

* use activ_enum

* make static

* refactor conv_add for InMem::add

* add bias

* add conv_out

* add configurable makeddesc

* add maxpool fusion

* add maxpool host for validation

* enable static desc

* conv-only use v5r1_add

* test

* test

* for binary dumps

* fixed incorrect results due to typo

* clean

* debugging maxpool

* workaround with offset trick

* clean code

* modularize ops of fusion

* add gridwise_gemm_v3

* create separate fusion fun

* enable dynamic mode of conv and conv+resize_add

* add dynamic mode of maxpool

* add pass by point

* add activ_type as arguments

* merge develop

* clean

* reset config to old default
Co-authored-by: Chao Liu <chao.liu2@amd.com>

970fa3e9

Fixed bfp16 host_conv_fwd (#52) · a651ea4f

zjing14 authored Nov 18, 2021



* fixed bfloat16 issues

* refactor type_convert

* fixed host_convolution_forward for ushort
Co-authored-by: Chao Liu <chao.liu2@amd.com>

a651ea4f

16 Nov, 2021 2 commits
- fixed multiple definition issue of bfp16/fp32 conversion function when building ckProfiler (#51) · 0a66c54e
  zjing14 authored Nov 16, 2021
```
* fixed bfloat16 issues

* refactor type_convert
Co-authored-by: Chao Liu <chao.liu2@amd.com>
```
  0a66c54e
- updated bfloat16_to_float · 89e1ebd4
  Jing Zhang authored Nov 16, 2021
  
  89e1ebd4
15 Nov, 2021 2 commits

Add bfp16/int8 support into XDL GEMM operator (#50) · 3737bb03

zjing14 authored Nov 15, 2021



* init StaticBufferV2

* clean

* adopt old output stage for staticBufferV2

* clean

* remove hack

* clean

* clean

* add parameters

* clean code

* move c_buffer alloc into blockwise gemm

* add adaptors for m/n_thread_data_on_grid

* tweak gemm

* adjust blockwise_gemm_xdlops

* tweak

* update conv

* update script

* adding bwd 1x1

* update script

* adding 1x1 bwd

* debugging bwd 1x1 failure

* update script

* update script

* test

* test v100

* add bf16_1k

* clang-format

* clean

* add bfp16 for gfx908

* add verification

* clean up

* clean code

* restore bf16

* clean

* add bfp16 support into gemm_driver

* apply new generator to other drivers

* add int8 support

* clean

* clean

* clean

* clean
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: root <root@hayabusa6111.amd.com>

3737bb03

FP16 data in-register transpose (#41) · b491ebf3

Chao Liu authored Nov 15, 2021

* start fixing 16bit data packing

* adding StaticTensor

* adding StaticTensor

* adding StaticTensor

* add missing constexpr

* adding static tensor

* adding static tensor

* adding transpose

* add inline asm for transpose 2x2 of half_t

* add general transpose_vectors(), but have unnecessary register initialization using v_mov

* fix unnecessary register initialization in transpose_vector by using more pass-by-reference

* add hardcoded logic for NHWC wrw

* improve asm for v_pack

* make ThreadwiseTensorSliceTransfer_v3r2 support any tensor

* tweak

* reorganize file

b491ebf3

14 Nov, 2021 1 commit

ckProfiler and device-level XDL GEMM operator (#48) · e823d518

Chao Liu authored Nov 14, 2021

* add DeviceGemmXdl

* update script

* fix naming issue

* fix comment

* output HostTensorDescriptor

* rename

* padded GEMM for fwd v4r4r4 nhwc

* refactor

* refactor

* refactor

* adding ckProfiler

* adding ckProfiler

* refactor

* fix tuning parameter bug

* add more gemm instances

* add more fp16 GEMM instances

* fix profiler driver

* fix bug in tuning parameter

* add fp32 gemm instances

* small fix

* refactor

* rename

* refactor gemm profiler; adding DeviceConv and conv profiler

* refactor

* fix

* add conv profiler

* refactor

* adding more GEMM and Conv instance

* Create README.md

Add build instruction for ckProfiler

* Create README.md

Add Readme for gemm_xdl example

* Update README.md

Remove build instruction from top most folder

* Update README.md

* clean up

e823d518

27 Oct, 2021 3 commits

[Bug Fix] GridwiseGemm_bk0mk1_bk0nk1_mn_xdlops_v2r4 loop issue (#44) · 6014185a

ltqin authored Oct 27, 2021



* change method of computing kpad

* remove unused variable: batchlen

* change KPerBlock to K0PerBlock

* fix bug for k0 == k0perblock

* fix bug for get k0 index

* use math::integer_divide_ceil
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

6014185a

Merge pull request #46 from ROCmSoftwarePlatform/miopen_downstream_all · 3e911370
Chao Liu authored Oct 27, 2021
```
update ck from miopen ck_upstream
```
3e911370
Merge branch 'develop' into miopen_downstream_all · 211dae82
ltqin authored Oct 27, 2021

211dae82

26 Oct, 2021 1 commit
- [Composable Kernel] update develop branch code to ck_upstream · 5890e300
  Jun Liu authored Oct 25, 2021
```
Merge pull request #1236 from ROCmSoftwarePlatform/develop
```
  5890e300
21 Oct, 2021 1 commit
- fix bug in gridwise gemm xdlops v2r3 (#45) · d5297aba
  Chao Liu authored Oct 21, 2021
  
  d5297aba
19 Oct, 2021 2 commits

bug fix (#39) · c3018794
Chao Liu authored Oct 19, 2021

c3018794

add nchw atomic, nhwc and nhwc atomic methods for backward weight (#30) · fd49ff80

ltqin authored Oct 20, 2021



* add new algorithm from v4r4r2

* program once issue

* add split k function

* redefine code

* add a matrix unmerge

* add b matrix unmerge k0

* trans a and b to gridwise gemm

* nhwc init

* no hacks and vector load

* add hacks

* modify some parameter

* fix tuning parameter for fp32

* fix tuning parameter for fp16

* start change gridwise k split

* init ok

* remove a b matrix k0mk1 desc in grid

* rewrite calculate gridsize

* add kbatch to CalculateBottomIndex

* remove some unused function

* add clear data function before call kernel

* out hacks

* in hacks

* rename device convolution file and function name

* modify kBatch value

* fix some tuning code

* start from v4r4 nhwc

* nhwc atomic is able to run

* just for fp32

* enable nchw atomic

* tweak

* tweak

* re-arrange gridwise gemm hot loop for wrw

* add wrw v4r5

* v4r4r5 fp16

* v4r4r4 fp16

* v4r4r2 fp16

* V4R4R4XDLNHWC fp16

* V4R4R2XDLATOMICNCHW fp16

* adjust for fp16

* input gridsize

* change kbatch to gridsize

* testing wrw

* clean up

* k_batch to gridsize

* fix bug

* wrw v4r4r4 kbatch change to grid size

* wrw v4r4r2 kbatch change to grid size

* after merge, change gridwise gemm v2r4

* change MakeCBlockClusterAdaptor

* other method use new gridwise gemm

* clean up

* change pad method to make_right_pad_transform

* kbatch out from transform function

* clean up and fix bug

* fix bug

* use function type to reduce template parameters

* use auto to replace defined function type

* clean up
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>

fd49ff80

06 Oct, 2021 2 commits

[MIOpen Downstream] Fix Reduction Kernel (#34) · b2dc55f8

Qianfeng authored Oct 07, 2021



* Tiny fix in using data type template parameters in blockwise and direct_threadwise kernel

* Fix with regard to implementing GetZeroVal() in both kernel and host

* Avoid converting to compType from dstDataType before writing the output value

* Add half_t support to NumericLimits and make constexpr GetZeroVal() of binary operator

* Add CONSTANT decorator for descriptor read buffer

* Use get_thread_local_1d_id() for thread local Id

* Rename GetZeroVal() to GetReductionZeroVal() in the kernels

* Remove constexpr from initialized zeroVal and tiny fix in reduction_operator.hpp

* Occasional tiny simplification and update in the kernel files

* Update to re-order tensor dimensions on the host, split second_call kernel wrapper files and simplify reduce_all kernel wrappers

* Update to remove OpenCL tidy checking failures

* Update for better readability

* Remove unused code and unneeded template parameters in the kernel wrappers
Co-authored-by: Chao Liu <chao.liu2@amd.com>

b2dc55f8

Tweak GEMM kernel (#38) · b3e8d57d

Chao Liu authored Oct 06, 2021

* add parameters

* tweak gemm

* tweak

* update conv

* update script

* adding bwd 1x1

* update script

* adding 1x1 bwd

* debugging bwd 1x1 failure

* update script

* update script

* test

* test v100

* clean up

b3e8d57d