Commits · e17c0d8008148f254f044124e55118163b1a1701 · gaoqiong / composable_kernel

05 Mar, 2022 1 commit

Reduction in Composable Kernel (#82) · e17c0d80

Qianfeng authored Mar 06, 2022



* Initial adding of generic reduction

* Initial adding of generic reduction ...

* Updates to make compiling done

* clang-format all files

* clang-format some files again

* Renaming in profiler/include/profile_reduce.hpp

* Updates and make BlockWise cases passed

* Updates and make ThreadWise and MultiBlockTwoCall cases passed

* Remove the support for MUL and NORM1 reduceOp from the profiler and the device instances

* Change to replace the dim0_max_vector_size/dim1_max_vector_size template argument in the device reduce classes

* format

* adding pooling

* added max and average pooling

* comment out cout and kernel timing

* Tiny simplification in profiler/reduce_profiler.cpp

* Add example for reduce_blockwise

* Tiny updates

* Change to pass the ElementWiseOp from device layer to kernel

* Fix the vectorDim and vectorSize in Device layer

* Enable vector load on both dim0 and dim1 for Threadwise method

* Tiny updates

* Change to let the user to pass the preUnaryOp and posUnaryOp

* Make pooling example work

* split device_reduce_instance into two libraries

* Tiny update

* Replace nanPropaOpt enum by boolean propagate_nan

* Simplification in DeviceReduce layer codes

* update build

* Change to clarify the difference between ck::half_t and half_float::half

* Renaming in all the reduction codes

* Add VectorSize as template parameter for device layer

* Add BetaIsZero as kernel template and as AccDataType for alpha

* print

* Small updates for pooling

* Updates for host_generic_reduction for reference

* Update to make AVG pooling pass

* Update to make MAX pooling with indices output pass

* fix

* add OutDst vector store to threadwise reduction and pooling

* tweak

* turn off check_indices that caused build issue

* refactor pooling

* clean up

* turn off check_indices for building issue for php-compiler

* add more tile size for odd C

* tweak conv for odd C

* update script

* clean up elementwise op

* add hack in reduction_operator.hpp to avoid compile error. To fix it, need to use element_wise_op in reduction op

* Add OutVectorSize as device and kernel tunable, also update to Elementwise Operations

* Move reduce operator mapping to host layer file reduction_operator_mapping.hpp from reduction_operator.hpp

* Change to the unary operators

* Move the definitions of unary operations to element_wise_operation.hpp

* re-org files

* Refine in device interfaces and multiblock kernels

* Split the reduction configurations into instances for specific methods

* Update in getTypeString() of device pool2d

* Renaming in host and kernel

* Tiny update in profiler/src/profiler.cpp

* Uncomment in device_operation/CMakeLists.txt to enable the building of all operations

* Make check_indices a templated function to remove some linking issue

* Renaming in the profiler reduce module

* Add support for double Reduction (but disable MultiblockAtomicAdd for double)

* Tiny correction of literal string

* Rename DevicePoolFwd to DevicePool2dFwd

* Split device_reduce_instance_xxx.cpp files according to the data types to speed up compiling

* Add comments for lists of configurations, lists of instances and references of add_reduce_instances_xxx

* Remove un-used header file gridwise_generic_reduction_wrapper_common.hpp

* Renaming and refining in the Reduction codes

* Tiny change in the unary operators

* Renaming symbols and files

* Renaming symbols in the kernels

* Move kernel kernel_set_buffer_value to separate file

* Add IndexDataType template parameter for kernels and use int32_t as index data type in device layer

* Tiny update in the kernels

* Remove definition of sqrtf()/isnan()/abs() for half_t due to some ADL issue

* Simplify a helper function in device layer

* Tiny adjustment in testing data initialization

* Renaming in kernel/device/host

* Add two testing scripts for reduction

* Refine the Unary operators in element_wise_operation.hpp

* Update in the reduce profiler module

* Update to the reduction testing scripts

* reduce compile parallelism

* change CI docker to rocm5.0

* remove unused variables

* fix build
Co-authored-by: Chao Liu <chao.liu2@amd.com>

e17c0d80

04 Mar, 2022 1 commit

[Bf16 & int8] [example & ckprofiler] (#100) · 7e9a9d32

rocking5566 authored Mar 05, 2022



* Add int8 of mk_nk_mn to the ckProfiler

* Add example of int8 gemm

* Fix typo, use ushort instead of half_t for bfloat16

* replace ushortXXX_t to bhalfXXX_t

* rename ushort to bhalf_t

* Add bf16 example

* Add bf16 gemm to ckProfiler

* Fix alignment

* Fix typo

* Add unit test for gemm_xdl int8

* Add gemm_xdl fp32 unit test

* Add gemm_xdl bf16 unit test

* fix build

* fix build issue due to merge conflict

* Fix build

* Fix build error
Co-authored-by: rocking <chunylai@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

7e9a9d32

28 Feb, 2022 1 commit

Allow distinct K0/K1 values for A/B block descriptor (#98) · 6d4450ef

Anthony Chang authored Feb 28, 2022



* add gitignore

* host tensor: allow generating sequentially increasing value in a given dimension

* gridwise gemm v3r1: allow distinct K0/K1 values for A/B block descriptor

- remove dangling header include
- modify example gemm_xdl accordingly
- infer KPack value from M/NPerXdl
- device conv2d fwd: update parameters accordingly for the underlying gridwise gemm v3r1
(API for conv2d fwd stays the same for now until we decide to expose individual K0s for activation and weight)

* add LDS data dump utility

* profiler: reflect API change for distinct K0/K1 for A/B matrices

* profiler: add conflict-free LDS write FP16 kernel instances

* fix accidental perf regression

* address feedback; cosmetic changes

* clang-format for new files

* format
Co-authored-by: Chao Liu <chao.liu2@amd.com>

6d4450ef

23 Feb, 2022 1 commit

Conv3d new (#94) · 6dfb92bb

Jianfeng Yan authored Feb 22, 2022



* conv3d compiles but has memory error

* conv3d works

* fix performance issue by using __builtin_amdgc_readfirstlane

* change MakeBlock2CTileMap to MakeDefaultBlock2CTileMap; change c_blockid_to* to cblockid_to*

* clang-format

* remove CK_EXPERIMENTAL_PASS_TENSOR_DECRIPTOR_BY_*; moved wrapper into DeviceConv3d

* format

* remove useless marc

* add comment
Co-authored-by: Chao Liu <chao.liu2@amd.com>

6dfb92bb

12 Feb, 2022 1 commit

NHWC conv 2d: fwd bfp16/int8, Device level tuning and host API (#73) · 880fbee9

ltqin authored Feb 12, 2022



* add fwd bf16 conv

* change tunning parametor

* add int8 for conv fwd

* remove comments

* change tunning parametor for int8

* change init int8 example

* add test for conv2d fwd

* change device operation file pos because merge develop

* fwd int8 use reference

* test_conv_fwd use reference

* add braket for if statement

* rename fwd example name

* remove StaticBufferOfVectorTypeV2

* tweak example
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

880fbee9

07 Feb, 2022 1 commit

GEMM+Bias+ReLU+Add (#76) · 823657ed

Chao Liu authored Feb 06, 2022

* tweak conv for odd C

* update script

* clean up elementwise op

* fix build

* clean up

* added example for gemm+bias+relu+add

* added example for gemm+bias+relu

* add profiler for gemm_s_shuffle; re-org files

* add profiler

* fix build

* clean up

* clean up

* clean up

* fix build

823657ed

03 Dec, 2021 1 commit

GEMM/Conv+BiasAdd+ReLU+Add (#55) · 41cdd380

Chao Liu authored Dec 02, 2021

* gemm+activation

* move C pointwise operation into threadwise copy

* add pointwise operation to A/B matrix

* update ckProfiler

* adding bias add

* adding bias add

* adding bias add

* added bias add; worked around compiler issues

* clean up

* clean up

* Update README.md

* Update README.md

* Update README.md

* clean up

* add conv_xdl example

* adding conv_xdl_bias_relu_add example

* add conv+bias+relu+add, but has register spill issue

* tweak

* tweak

* refactor

* Update README.md

update readme for example/2_gemm_xdl_bias_relu_add

* clean up

* Update README.md

update readme for example/3_conv_xdl

* Update README.md

41cdd380

18 Nov, 2021 1 commit

v5r1 fusion kernels for inference (#49) · 970fa3e9

zjing14 authored Nov 18, 2021



* init

* refactor for 1x1

* rename e0_e1

* add e1 with bugs

* debug

* fixed

* fixed e1

* add timer

* imprve threadwise gemm with dot2

* add e2

* tuning

* seperate c2

* add nhwc

* restore nchwc

* clean

* opt

* fixed; tuning

* add BGlobalMoveSliceWindowStepHacks{}

* tuning

* repeat running

* adjust

* merge v5r1 nchwc

* add adaptors

* split k0 k1 in c_thread_grid

* split h and w

* remove v5r1 nhwc

* clean for pr

* remove host_conv_add

* clean code

* clean

* add dynamic support

* static mode

* test static

* add conv+add fusion

* fixed validation

* naming fix

* use activ_enum

* make static

* refactor conv_add for InMem::add

* add bias

* add conv_out

* add configurable makeddesc

* add maxpool fusion

* add maxpool host for validation

* enable static desc

* conv-only use v5r1_add

* test

* test

* for binary dumps

* fixed incorrect results due to typo

* clean

* debugging maxpool

* workaround with offset trick

* clean code

* modularize ops of fusion

* add gridwise_gemm_v3

* create seperate fusion fun

* enable dynamic mode of conv and conv+resize_add

* add dynamic mode of maxpool

* add pass by point

* add activ_type as arguments

* merge develop

* clean

* reset config to old default
Co-authored-by: Chao Liu <chao.liu2@amd.com>

970fa3e9

16 Nov, 2021 2 commits
- fixed multiple definition issue of bfp16/fp32 conversion function when building ckProfiler (#51) · 0a66c54e
  zjing14 authored Nov 16, 2021
```
* fixed bfloat16 issues

* refactor type_convert
Co-authored-by: Chao Liu <chao.liu2@amd.com>
```
  0a66c54e
- updated bfloat16_to_float · 89e1ebd4
  Jing Zhang authored Nov 16, 2021
  
  89e1ebd4
15 Nov, 2021 2 commits

Add bfp16/int8 support into XDL GEMM operator (#50) · 3737bb03

zjing14 authored Nov 15, 2021



* init StaticBufferV2

* clean

* adopt old output stage for staticBufferV2

* clean

* remove hack

* clean

* clean

* add parameters

* clean code

* move c_buffer alloc into blockwise gemm

* add adaptors for m/n_thread_data_on_grid

* tweak gemm

* adjust blockwise_gemm_xdlops

* tweak

* update conv

* update script

* adding bwd 1x1

* update script

* adding 1x1 bwd

* debugging bwd 1x1 failure

* update script

* update script

* test

* test v100

* add bf16_1k

* clang-format

* clean

* add bfp16 for gfx908

* add verification

* clean up

* clean code

* restore bfl16

* clean

* add bfp16 support into gemm_driver

* apply new generator to other drivers

* add int8 support

* cleanb

* clean

* clean

* clean
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: root <root@hayabusa6111.amd.com>

3737bb03

FP16 data in-register transpose (#41) · b491ebf3

Chao Liu authored Nov 15, 2021

* start fixing 16bit data packing

* adding StaticTensor

* adding StaticTensor

* adding StaticTensor

* add missing constexpr

* adding static tensor

* adding static tensor

* adding transpose

* add inline asm for transpose 2x2 of half_t

* add general transpose_vectors(), but have unnecessary register initialization using v_mov

* fix unnecessary register initialization in transpose_vector by using more pass-by-reference

* add hardcoded logic for NHWC wrw

* improve asm for v_pack

* make ThreadwiseTensorSliceTransfer_v3r2 support any tensor

* tweak

* reorganize file

b491ebf3

14 Nov, 2021 1 commit

ckProfiler and device-level XDL GEMM operator (#48) · e823d518

Chao Liu authored Nov 14, 2021

* add DeviceGemmXdl

* update script

* fix naming issue

* fix comment

* output HostTensorDescriptor

* rename

* padded GEMM for fwd v4r4r4 nhwc

* refactor

* refactor

* refactor

* adding ckProfiler

* adding ckProfiler

* refactor

* fix tuning parameter bug

* add more gemm instances

* add more fp16 GEMM instances

* fix profiler driver

* fix bug in tuning parameter

* add fp32 gemm instances

* small fix

* refactor

* rename

* refactor gemm profiler; adding DeviceConv and conv profiler

* refactor

* fix

* add conv profiler

* refactor

* adding more GEMM and Conv instance

* Create README.md

Add build instruction for ckProfiler

* Create README.md

Add Readme for gemm_xdl example

* Update README.md

Remove build instruction from top most folder

* Update README.md

* clean up

e823d518

19 Oct, 2021 1 commit

add nchw atomic , nhwc and nhwc atomic method for backward weight (#30) · fd49ff80

ltqin authored Oct 20, 2021



* add add new algorithm from v4r4r2

* program once issue

* add split k functiion

* redefine code

* add a matrix unmerge

* add b matrix unmerge k0

* trans a and b to gridegemm

* nhwc init

* no hacks and vector load

* add hacks

* modify some parameter

* fix tuning prometer for fp32

* fix tuning prometer for fp16

* start change gridwise k split

* init ok

* revome a b matrix k0mk1 desc in grid

* carewrite lculate gridsize

* add kbatch to CalculateBottomIndex

* remove some unused funtion

* add clear data function before call kernel

* out hacks

* in hacks

* rename device convolution file and function name

* modify kBatch value

* fix some tuning code

* start from v4r4 nhwc

* nhwc atomic is able to run

* just for fp32

* enable nchw atomic

* tweak

* tweak

* re-arrange gridwise gemm hot loop for wrw

* add wrw v4r5

* v4r4r5 fp16

* v4r4r4 fp16

* v4r4r2 fp16

* V4R4R4XDLNHWC fp16

* V4R4R2XDLATOMICNCHW fp16

* adjust for fp16

* input gridsize

* change kbatch to gridsize

* testing wrw

* clean up

* k_batch to gridsize

* fix bug

* wrw v4r4r4 kbatch change to gride size

* wrw v4r4r2 kbatch change to gride size

* after merge , change gridwise gemm v2r4

* change MakeCBlockClusterAdaptor

* other method use new gridwise gemm

* clean up

* chapad method nge to make_right_pad_transform

* kbatch out from transform function

* clean up and fix bug

* fix bug

* using function type reduce template parameters

* using auto replace define fuction type

* clean up
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>

fd49ff80

06 Oct, 2021 1 commit

Tweak GEMM kernel (#38) · b3e8d57d

Chao Liu authored Oct 06, 2021

* add parameters

* tweak gemm

* tweak

* update conv

* update script

* adding bwd 1x1

* update script

* adding 1x1 bwd

* debugging bwd 1x1 failure

* update script

* update script

* test

* test v100

* clean up

b3e8d57d

05 Sep, 2021 1 commit

GEMM driver and kernel (#29) · 19613902

Chao Liu authored Sep 05, 2021

* add gemm driver

* tweak

* add gemm kernel: mk_kn_mn and km_kn_mn

* tweak

* add GEMM km_nk_mn

* fix comment

19613902

31 Aug, 2021 1 commit

Backward weight v4r4r2 with xdlops (#18) · 627d8ef3

ltqin authored Aug 31, 2021



* start

* modify transformat

* modify device convolutiion

* modify host

* added host conv bwd and wrw

* remove bwd, seperate wrw

* clean

* hacall k to zero

* out log

* fixed

* fixed

* change to (out in wei)

* input hack

* hack to out

* format

* fix by comments

* change wei hacks(wei transform has not merge)

* fix program once issue

* fix review comment

* fix vector load issue

* tweak
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

627d8ef3

19 Aug, 2021 2 commits

Composable kernel init integration v3 (#1097) · 6fe3627a

Chao Liu authored Aug 19, 2021

* Squashed 'src/composable_kernel/' content from commit f6edda61

git-subtree-dir: src/composable_kernel
git-subtree-split: f6edda61

* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

* Squashed 'src/composable_kernel/' changes from f6edda61..5781adf5

5781adf5 Update develop (#5) (#6)
97e6d514 Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile
7b1ec41e refactor
49c33aae refactor
54b3e73d rename

git-subtree-dir: src/composable_kernel
git-subtree-split: 5781adf5



* fix

* refactor

* remove online compilation from CK

* refactor

* fix

* add ctest

* add c-style pointer cast

* vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast

* fix clang warning suppression

* tidy

* suppress cppcheck

* fix enum issue

* revert chagnes to hip build

* fix kernel filename

* update CK build script

* rename

* rename

* make innner product compatiable on gfx900

* Update src/include/miopen/solver/ck_utility_common.hpp
Co-authored-by: JD <Jehandad.Khan@amd.com>

* compiler parameter use stream

* use int instead of index_t in kernel wrapper

* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element

* refactor

* refactor

* change cmakelist

* change ck common utility

* fix
Co-authored-by: JD <Jehandad.Khan@amd.com>

6fe3627a

Added host_conv_wrw for verification (#15) · ba6f79a7
zjing14 authored Aug 19, 2021
```
* added host conv wrw
```
ba6f79a7

10 Aug, 2021 2 commits
- rename · c03045ce
  Chao Liu authored Aug 10, 2021
  
  c03045ce
- vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast · 4f566c62
  Chao Liu authored Aug 10, 2021
  
  4f566c62
09 Aug, 2021 3 commits
- tidy · d1842890
  Chao Liu authored Aug 09, 2021
  
  d1842890
- tidy · f885c131
  Chao Liu authored Aug 09, 2021
  
  f885c131
- tidy · 56fc0842
  Chao Liu authored Aug 09, 2021
  
  56fc0842
30 Jul, 2021 1 commit
- update to clang-format-10 · 82fae390
  Chao Liu authored Jul 30, 2021
  
  82fae390
18 Jul, 2021 1 commit

reorganize files to prepare for MIOpen integration (#51) · 12649254

Chao Liu authored Jul 18, 2021

* change olc cmake

* adding online compile to fwd-v4r5r2

* update scripts

* remane fwd-v4r5r2 to fwd-v6r1

* clean up

12649254