Commits · 63eee2d9991b08ca286f6895dd8f90da12a62da3 · yangql / composable_kernel-1

24 May, 2022 2 commits

Overhaul to Reduction and its dependants (#237) · 63eee2d9

Qianfeng authored May 25, 2022

* Tiny fix in dynamic_buffer.hpp to support vectorized AtomicAdd for double type

* Update to host layer and host reduction

* Merge and remove reduction kernels

* Merge and remove reduction device interfaces and update pooling device interface

* Merge and remove useless reduction device instances

* Update to reduction profiler and reduction ctests

* Update to reduction and pooling examples and add one reduction example

* Change to reduction examples to let them testable by ctest

* Add explicit pass checking for reduction and pooling examples

* Explicit assignment of tensor shapes in example reduce_blockwise_two_call

* Use atomic_add to replace atomicAdd and add atomic_add for double type

* Add reduce ctest support for double data type

* Replace to_int_vector() by using c++ std::vector::assign()

* Keep DeviceReduceThreadWise separated from DeviceReduceBlockWise

* Merge DeviceReduceBlockWise and DeviceReduceMultiBlockAtomicAdd into DeviceReduceMultiBlock

* Add GetAtomicOperationZeroValue() support for AtomicMax

* Tiny change to reduce example README.md

* Fix some tiny issues due to branch merging

* Revoke previous change in dynamic_buffer.hpp and add atomic_add for double2_t

* Add reduce multiblock_atomic_add instances for fp64 to verify vectorized atomic_add on fp64

* Renaming

* Clean up the header includes in device_reduce instances header files

63eee2d9
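
The `to_int_vector()` replacement listed in #237 can be illustrated with a minimal sketch. It assumes the removed helper converted a `std::vector<size_t>` of tensor lengths into a `std::vector<int>`; the function name `lengths_as_int` is hypothetical and is not part of composable_kernel.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, for illustration only: copies size_t lengths into an
// int vector using std::vector::assign() with an iterator range.
std::vector<int> lengths_as_int(const std::vector<std::size_t>& lengths)
{
    std::vector<int> out;
    // assign() replaces the contents of `out`; each element is converted
    // from std::size_t to int (values are assumed to fit in an int).
    out.assign(lengths.begin(), lengths.end());
    return out;
}
```

The iterator-range form of `assign()` performs the element-wise conversion itself, so a dedicated conversion helper is not needed for this case.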

add GetWorkSpaceSize to base arg (#253) · 0d08cf18

Shaojie WANG authored May 25, 2022

* add GetWorkSpaceSize to base arg and make an example on convnd_bwd_weight

* remove redundant compute

* use datatype and split k to check whether a workspace is used

* remove unused computation for work space size

0d08cf18

23 May, 2022 1 commit

fix build (#246) · ba58a93f

Chao Liu authored May 23, 2022

* fix build

* Revert "fix build"

This reverts commit d7310238.

* post PR #235 merge fix

* amend
Co-authored-by: Anthony Chang <ac.chang@outlook.com>

ba58a93f

20 May, 2022 7 commits

example of conv bwd weight 1d/2d/3d fp32/fp16/bf16 xdl (#244) · ac543313

Shaojie WANG authored May 21, 2022



* enable example of conv 1d/3d for bwd weight

* make bf16 kernel not use atomic add

* using new gridwise gemm for bwd weight on convnd bwd weight
Co-authored-by: Chao Liu <chao.liu2@amd.com>

ac543313

remove options.hpp.in (#240) · 44943e0e
Chao Liu authored May 20, 2022

44943e0e

Refactor block to C tile map (#235) · a054f7d6

Anthony Chang authored May 21, 2022

* refactor block-to-ctile-map

* gridwise gemm block2ctile generic validity check

* format

* amend split-k gemm block2ctile map refactor

* add test

* format

* amend

* revert to calculating batch index in kernel instead of passing as block_id_z

* move file

* add valid ctile index check to gridwise v2r4

a054f7d6

[conv bwd-weight]Binding gemm k1 to conv n (#202) · 070619fb

Shaojie WANG authored May 21, 2022



* add some instance to develop

* avoid bank conflicts for wrw for all instance

* add small K1 test

* delete some unused instance

* binding gemm k1 to conv n

* try using half_4 to do ds_read

* reset buffer load oob and ds memcpy to default option

* remove useless instances

* remove redundant space

* remove printf code

* clang-format-10 change

* use fastest config

* fix clang format for the other files

* remove gemmk0 pad for output

* add gemmk padding macro

* add bank length computation

* add template to distinguish the instance that need lds padding for wrw

* use rocm5.1 as docker

* use integer value for GEMM test

* add Right padding macro

* add 2 test asm code

* using 256x256x32 tile size

* 1. move dedicated transform into gridwisegemm's header file. 2. make lds tensor params a struct template. 3. remove useless code

* using small vec

* 256*128 kernel size for example

* remove asm files

* use a new gridwise gemm header for bwd-weight

* revert gridwise gemm v2r4r2

* change format

* reset gridwise gemm v2r4r2

* remove unused code

* revert instance file

* revert example instance

* format file

* remove macros

* resolve compile error

* rename wrw kernel invoker

* use gridwisegemm pipeline struct instead of implementing the run function in the same header
Co-authored-by: Chao Liu <chao.liu2@amd.com>

070619fb

[Perf][Bwd-weights]Lds re-layout to avoid ds read/write bank conflict and... · b9b9c3b8

Shaojie WANG authored May 20, 2022


[Perf][Bwd-weights]Lds re-layout to avoid ds read/write bank conflict and balance ds ops with address calculations (#190)

* add some instance to develop

* avoid bank conflicts for wrw for all instance

* add small K1 test

* delete some unused instance

* reset buffer load oob and ds memcpy to default option

* remove useless instances

* remove redundant space

* remove printf code

* clang-format-10 change

* fix clang format for the other files

* add bank length computation

* add template to distinguish the instance that need lds padding for wrw

* use rocm5.1 as docker

* use integer value for GEMM test

* 1. move dedicated transform into gridwisegemm's header file. 2. make lds tensor params a struct template. 3. remove useless code

* use a new gridwise gemm header for bwd-weight

* revert gridwise gemm v2r4r2

* change format

* rename kernel invoker
Co-authored-by: Chao Liu <chao.liu2@amd.com>

b9b9c3b8

Hotfix eltiwseop (#242) · bb4b82a9

rocking5566 authored May 20, 2022



* Use vector constructor instead

* Fix typo

* Move blockSize to the MakeArgumentPointer

* Fix naming

* Fix clang format

* remove blockSize from DeviceBinaryElementwise::Argument()
Co-authored-by: rocking <chunylai@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

bb4b82a9

Gemm reduce max (#209) · 0ffe956a

rocking5566 authored May 20, 2022



* [What] Rename the example
[Why] Prepare to add unary reduction

* Add global operation to the parameter

* Add atomicmax

* Fix compile error

* Support atomicMax (hip library)

* Rename the reduction example

* Fix target name

* use p_d1_grid as the indicator directly

* Prevent performance issue. Let passthrough handle it.

* Implement the function template, then specialize it for float2

* No need to separate into two lines

* Remove empty line

* add comment

* Fix compile error due to merge from develop

* make the implementation of atomic_max / atomic_add explicit for each datatype

* Refine typo

* For future CI test

* Fix compiler error in ckProfiler

* Merge commit 'de2769e3a6695b38a20529261273ddc5cdaab2fe'

* simply use remove_pointer

* Rename type and var

* Refine example

* Modify reducemax example

* Fix bug in reduction

* Change initialize range

* Implement F64 version of atomicMax

* Move reduction  code together

* Add buffer atomic_max

* Fix coding style by clang-format

* Integrate new api of DeviceGemmReduce_Xdl_CShuffle

* Integrate Batch gemm reduction

* Fix example

* fix example

* clean up

* Fix batch gemm tensor operation

* Fix coding style

* Fix template argument

* Fix clang format

* Keep the flexibility of a different stride for each D tensor

* Fix compile error for ckProfiler

* Fix typo

* [What] Fix naming
[Why] Prepare to add out elementop

* Add DoutElementOp
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: rocking <chunylai@amd.com>

0ffe956a
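
The "F64 version of atomicMax" item in #209 refers to device code that is not shown in this log. As a host-side analogy only, an atomic maximum for `double` can be sketched with a compare-and-swap loop on `std::atomic<double>`. The function name `atomic_max_f64` is hypothetical, and this sketch is not the HIP implementation used in the repository.

```cpp
#include <atomic>

// Host-side illustration (assumption: not the CK/HIP device code).
// Atomically stores max(target, value) and returns the value held before.
double atomic_max_f64(std::atomic<double>& target, double value)
{
    double observed = target.load();
    // Retry only while the stored value is still smaller than `value`.
    // On failure, compare_exchange_weak refreshes `observed` with the
    // current contents, so the loop condition is re-evaluated.
    while(observed < value && !target.compare_exchange_weak(observed, value))
    {
    }
    return observed;
}
```

The loop exits either because another thread already stored a value at least as large, or because the exchange succeeded; in both cases `observed` holds the value seen before this call took effect.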

19 May, 2022 1 commit

elementwise op (#238) · aafc3ac2

rocking5566 authored May 19, 2022



* Add elementwise operation kernel and example

* Add comment

* Add template argument of dim. Prepare to support multiple dimensions

* Rename example

* Support 1 dimension

* Add static assert

* Add comment

* Extract pad

* Remove redundant argument

* Support any dimension for elementwise operation

* Remove line

* Let it be the multiple number of CU

* Move thread per block to the parameter of constructor

* rename threadPerBlock with blockSize

* Support double

* rename kernel function name

* remove redundant include header

* Refine type

* Need to the final dimension

* Refine variable name

* Refine type

* Use index_t instead of int in API
Co-authored-by: rocking <chunylai@amd.com>

aafc3ac2

12 May, 2022 1 commit

Add host API (#220) · cec69bc3

JD authored May 12, 2022



* Add host API

* manually rebase on develop

* clean

* manually rebase on develop

* exclude tests from all target

* address review comments

* update client app name

* fix missing lib name

* clang-format update

* refactor

* refactor

* refactor

* refactor

* refactor

* fix test issue

* refactor

* refactor

* refactor

* update cmake and readme
Co-authored-by: Chao Liu <chao.liu2@amd.com>

cec69bc3

11 May, 2022 1 commit

Manual control of MAC cluster for improved interwave performance (#184) · 76764d8c

Anthony Chang authored May 11, 2022

* manual control of MAC cluster for improved 2-wave performance

ensure setprio's order; ensure inner loop size >= local read size

synchronize when single mac cluster

* format

* use value field from ck::integral_constant

* roll out inter-wave loop scheduler to c-shuffle gemm variants

will gradually roll out to other applicable device ops when occasional reg spill is resolved

* additional comments

* format

* fix mismatch between inter-wave pipeline and interwave blockwise gemm

* address review feedback

* amend

76764d8c

10 May, 2022 1 commit

Post PR183 review fixes. (#224) · 712e464c

Adam Osewski authored May 10, 2022



* Suppress additional warnings for googletest.

* Rename file conv_fwd_util to conv_util.

* Update includes and ConvParams member access.

* Formatting.

* Change conv_fwd_util target to conv_util

* Fix compiler errors.

* Fix leftovers.
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

712e464c

09 May, 2022 2 commits

Resolution of issue #153: Add compiler warning on comparing int and size_t (#212) · f03a1738

myamlak authored May 09, 2022



* Turning compare warnings on

* Cleaning part I

* Cleaning part II

* Explicit static_cast to ck::type_convert

* Resolving large tensor size issue.

* format

* revert change to tensor descriptor; promote ElementSpaceSize to 64bit

* use integer value for GEMM test

* Review remarks

* Review remarks + issues with (un)signed arithmetic

* Format fix

* Format

* Clang-format.

* fix 2gb limit issue
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Adam Osewski <aosewski@amd.com>

f03a1738
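
The motivation behind #212 (warning on comparisons between `int` and `size_t`) can be shown with a small sketch. The exact compiler flag the commit enabled is not stated in this log; GCC and Clang report this class of issue under `-Wsign-compare`. Both function names below are hypothetical.

```cpp
#include <cstddef>

// Mirrors what the expression `i < n` does for int i and std::size_t n:
// i is converted to an unsigned type first, so -1 becomes a huge value.
bool naive_less(int i, std::size_t n)
{
    return static_cast<std::size_t>(i) < n;
}

// Handles the negative case explicitly before converting.
bool safe_less(int i, std::size_t n)
{
    return i < 0 || static_cast<std::size_t>(i) < n;
}
```

With `i = -1` and `n = 1`, `naive_less` returns `false` although -1 is mathematically smaller than 1, while `safe_less` returns `true`.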

Code refactor (#175) · ec7c2e91

Chao Liu authored May 09, 2022

* format

* improving pipeline

* fix typo

* format

* adding thread group

* adding thread group

* adding thread group

* adding gemm pipeline

* tweak

* refactor

* refactor

* add missing type convert

* refactor

* refactor

* refactor

* clean

* fix build

* refactor

* format

* clean up

* use remove_cvref_t

* clean

* clean up

* clean up

* clean up

ec7c2e91

29 Apr, 2022 2 commits

Update to gemm_reduce and batched_gemm_reduce (#213) · c77ae65d

Qianfeng authored Apr 30, 2022

* [Experimental] Change to gemm+reduce and batched-gemm+reduce

* Use threadwise-reduce function to improve the gridwise_gemm_reduce_xdl_cshuffle kernel

* Tiny fix in device_batched_gemm_xdl.hpp

* clang-format library/src/utility/conv_fwd_util.cpp

c77ae65d

Add gfx90a CI stage for tests (#208) · 97d8c504

JD authored Apr 29, 2022

* Add gfx90a CI stage

* upgrade to ROCm 5.1 and fix formatting

97d8c504

25 Apr, 2022 1 commit

add comments to batched_gemm (#186) · 3956085d

Jianfeng Yan authored Apr 25, 2022

* add comments to batched_gemm

* formatting

* fix a typo in batched_gemm_documentation

* fix naming

3956085d

22 Apr, 2022 1 commit

use inline asm for 4x4 int8 transposition (#187) · 08a979f1

Anthony Chang authored Apr 23, 2022

08a979f1

21 Apr, 2022 2 commits

removed unused lds loads (#196) · 860e291c
zjing14 authored Apr 20, 2022

860e291c

Use ck::half_t for Host Reduction (#195) · c1ef7319

Qianfeng authored Apr 21, 2022

* Add math functions for host

* Change to host reduction to use ck::math:

* Remove the use of half_float::half and half.hpp from reduction example/profiler/ctest

c1ef7319

15 Apr, 2022 1 commit

Compile CK for all targets (#188) · 4221505d

Illia Silin authored Apr 15, 2022



* compile ck for all targets

* update the target criteria

* change the target condition

* fixed some typos

* fixed missed file

* revert changes in README

* revert device_conv3d_fwd_xdl_...

* update device_conv3d_fwd_xdl_...

* update device_batched_gemm_reduce...

* test the unused arguments fix

* test the warning suppression

* try suppress warnings in device_batched_gemm_reduce_xdl...

* fix the last warnings

* replace UNUSED with std::ignore

* fix a typo

* replaced std::ignore with ignore

* add ignore header to common_header

* refactor atomicAdd
Co-authored-by: Chao Liu <chao.liu2@amd.com>

4221505d

05 Apr, 2022 4 commits

Common forward convolution utility refactor. (#141) · abf4bdb9

Adam Osewski authored Apr 05, 2022



* Convolution ND

* Code unification across dimensions for generating tensor descriptors.
* Example
* Instances

* Move convnd f32 instance file to comply with repo structure.

* Conv 1D tensor layouts.

* Formatting and use ReferenceConv

* Reference ConvFwd supporting 1D and 2D convolution.

* Debug printing TensorLayout name.

* Conv fwd 1D instance f32

* Refactor conv ND example.

Needed to support various conv dimensions

* Rename conv nd example directory to prevent conflicts.

* Refactor some common utility to single file.

Plus some tests.

* Refactor GetHostTensorDescriptor + UT.

* Add 1D test case.

* Test reference convolution 1d/2d

* Remove some leftovers.

* Fix convolution example error for 1D

* Refactor test check errors utility function.

* Test Conv2D Fwd XDL

* More UT for 1D case.

* Parameterize input & weight initializers.

* Rename example to prevent conflicts.

* Split convnd instance into separate files for 1d/2d

* Address review comments.

* Fix data type for flops/gbytes calculations.

* Assign example number 11.

* 3D cases for convolution utility functions.

* 3D reference convolution.

* Add support for 3D convolution.

* Check for inputs bigger than 2GB.

* Formatting

* Support for bf16/f16/f32/i8 - conv instances + UT.

* Use check_err from test_util.hpp.

* Split convnd test into separate files for each dim.

* Fix data generation and use proper instances.

* Formatting

* Skip tensor initialization if not necessary.

* Fix CMakefiles.

* Remove redundant conv2d_fwd test.

* Lower problem size for conv3D UT.

* 3D case for convnd example.

* Remove leftovers after merge.

* Add Conv Specialization string to GetTypeString

* Skip instance causing numerical errors.

* Small fixes.

* Remove redundant includes.

* Fix namespace name error.

* Script for automatic testing and logging convolution fwd UTs

* Comment out numactl cmd.

* Refine weights initialization and relax rtol for fp16

* Move test_util.hpp to check_err.hpp

* Refine weights initialization and relax rtol for fp16

* Refactor common part of test conv utils.

* Move utility function to single common place.

* Add additional common functions to utility.

* Refactor convnd_fwd_xdl examples.

* Remove redundant files.
* Unify structure.

* Add constructor to ConvParams.

* And add input parameters validation.

* Modify conv examples to use single utility file.

* Remove check_error from host_tensor.hpp

* Get rid of check_indices function.

* Remove bf16_to_f32 function overload for scalars.

* Fix namespace.

* Add half_float::half for check_err.

* Fix conv params size in UT.

* Fix weights initialization for int8.

* Fix weights initialization for int8.

* Add type_convert when store output in ref conv 1D.

* Get back old conv2d_fwd_xdl operation.

* Silence conv debug print.

* format

* clean

* clean

* Fix merge.

* Fix namespace for check_err

* Formatting.

* Fix merge artifacts.

* Remove deleted header.

* Fix some includes and use ck::utils::check_err.

* Remove unused check_indices restored by previous merge.

* Fix namespaces after merge.

* Fix compilation error.

* Small fixes.

* Use common functions.
* Fix filename
* Fix namespaces.

* Fix merge artifact - restore a function removed by accident.

* Fix ConvForwardSpecialization.

* Adhere to coding style rules.

* Fix merge artifacts.
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

abf4bdb9

Patch for bwd data comments (#174) · 6717168c

ltqin authored Apr 05, 2022

* change function name and way to set input zero

* change enable if

6717168c

NHWC Conv2d Bwd weight fp16 ckprofiler and test (#166) · 781cacd2

ltqin authored Apr 05, 2022

* change backward weight name

* start add bwd weight lib and profiler

* change tuning parameter

* change output info

* add bwd weight test

* change test info

* using conv_util

* change wgt to weight

* add }

* add fp32

781cacd2

Improve Reduction kernel api (#152) · 82c8b9f8

Qianfeng authored Apr 05, 2022

* Add ThreadwiseReduction functor as per-thread reduction api

* Using ThreadwiseReduce api and some change in using PartitionedBlockwiseReduction api to simplify the kernels

* Add comments and remove useless declarations in the kernels

* Tiny updates

82c8b9f8

31 Mar, 2022 4 commits

Tune & add conflict-free LDS gemm kernels (#159) · 7db48f90

Anthony Chang authored Apr 01, 2022

* retune & add conflict-free bf16/fp16 c-shuffle gemm instances

amend wrong K1 value in some fp16/bf16 kernel instances

* make gemm cshuffle's timing behavior consistent with all other functions

* clang-format

* retune & add conflict-free fp32 c-shuffle gemm instances

* retune & add conflict-free int8 c-shuffle gemm instances

* update the underlying gridwise gemm of all c-shuffle gemm kernels

* typo

7db48f90

Compile for gfx908 and gfx90a (#130) · cd167e49

Chao Liu authored Mar 31, 2022

* adding compilation for multiple targets

* fix build

* clean

* update Jenkinsfile

* update readme

* update Jenkins

* use ck::half_t instead of ushort for bf16

* rename enum classes

* clean

* rename

* clean

cd167e49

fixed issue164 (#165) · ecf337ba

Jianfeng Yan authored Mar 31, 2022

* fixed issue164

* removed prints

ecf337ba

batched_gemm: use profiler in ctest (#163) · c8f3acf9

Jianfeng Yan authored Mar 30, 2022

c8f3acf9

30 Mar, 2022 1 commit

Batched gemm and reduction (#156) · 34c661e7

Jianfeng Yan authored Mar 30, 2022

* adding batched_gemm_and_reduction

* batched_gemm_reduce works with batch_count=1

* fix a bug in grid_size; batched_gemm_reduce works for batch_count > 1

* adding profiler for batched_gemm_fp16

* fixed a bug in declaration of d1 and d0; both example and profiler work

* clang-format

* cleanup

* batched_gemm_reduce: add test

* minor change

* fixed some typo in function names

34c661e7

29 Mar, 2022 1 commit

Unified implementation of 1d/2d/3d conv bwd-data. fp32/fp16/bfp16/int8 (#134) · 0536f2b3

ltqin authored Mar 29, 2022



* start convnd bwd data

* add 3d layout name

* add conv1d reference

* add conv3d reference

* finished example client code

* conv1d kernel finished

* fix input error

* add conv3d

* add 3d layout in conv_utils.hpp

* fix special check

* add convnd lib

* add test for bwd data

* finished test

* add check slice length

* convnd bwd data start

* profiler can be compiled

* fix some bug

* set input to zero

* modify readme for example

* fix test_convnd_bwd_data bug

* test_convnd_bwd_data parameter desc

* workaround for 1d

* workaround for 2d

* change init value

* workaround for 3d int8

* fix init value bug

* remove workaround

* fix acc data type

* add int32

* change select function to template

* tilda to tilde

* remove int32 instance

* fix commit for device hpp

* fix comments for profiler

* using profile imp to test

* add pass verification

* fix conv2d reference

* fix conflict

* remove double batched_gemm

* fix example conv2d data and test convnd

* format

* change conv2d_bwd_data return value

* remove repeat = 1

* remove conv bwd data
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

0536f2b3

24 Mar, 2022 1 commit

Gemm+Reduce Fusion (#128) · f95267f1

Chao Liu authored Mar 23, 2022

* add gridwise gemm v4r1

* rename

* adding gemm+reduce

* adding gemm+reduce

* adding gemm+reduce

* adding gemm+reduce

* use sfc in shuffling

* remove hardcode

* remove hardcode

* refactor

* fix build

* adding gemm+reduce

* adding gemm+reduce

* adding gemm+reduce

* adding gemm+reduce

* adding gemm+reduce

* format

* clean

* adding gemm+reduce

* adding profiler for gemm+reduce

* adding gemm+reduce profiler

* fix build

* clean up

* gemm+reduce

* fix build

* update DeviceGemm_Xdl_CShuffle; update enum to enum class

* clean up

* add test for gemm+reduce

* clean up

* refactor

* fix build

* fix build

f95267f1

23 Mar, 2022 2 commits

Unified conv3D API + support for all data types. (#133) · f91579aa

Adam Osewski authored Mar 23, 2022



* Convolution ND

* Code unification across dimensions for generating tensor descriptors.
* Example
* Instances

* Move convnd f32 instance file to comply with repo structure.

* Conv 1D tensor layouts.

* Formatting and use ReferenceConv

* Reference ConvFwd supporting 1D and 2D convolution.

* Debug printing TensorLayout name.

* Conv fwd 1D instance f32

* Refactor conv ND example.

Needed to support various conv dimensions

* Rename conv nd example directory to prevent conflicts.

* Refactor some common utility to single file.

Plus some tests.

* Refactor GetHostTensorDescriptor + UT.

* Add 1D test case.

* Test reference convolution 1d/2d

* Remove some leftovers.

* Fix convolution example error for 1D

* Refactor test check errors utility function.

* Test Conv2D Fwd XDL

* More UT for 1D case.

* Parameterize input & weight initializers.

* Rename example to prevent conflicts.

* Split convnd instance into separate files for 1d/2d

* Address review comments.

* Fix data type for flops/gbytes calculations.

* Assign example number 11.

* 3D cases for convolution utility functions.

* 3D reference convolution.

* Add support for 3D convolution.

* Check for inputs bigger than 2GB.

* Formatting

* Support for bf16/f16/f32/i8 - conv instances + UT.

* Use check_err from test_util.hpp.

* Split convnd test into separate files for each dim.

* Fix data generation and use proper instances.

* Formatting

* Skip tensor initialization if not necessary.

* Fix CMakefiles.

* Remove redundant conv2d_fwd test.

* Lower problem size for conv3D UT.

* 3D case for convnd example.

* Remove leftovers after merge.

* Add Conv Specialization string to GetTypeString

* Skip instance causing numerical errors.

* Small fixes.

* Remove redundant includes.

* Fix namespace name error.

* Script for automatic testing and logging convolution fwd UTs

* Comment out numactl cmd.

* Refine weights initialization and relax rtol for fp16

* Fix weights initialization for int8.

* Add type_convert when store output in ref conv 1D.

* Get back old conv2d_fwd_xdl operation.

* Silence conv debug print.

* format

* clean

* clean

* Fix merge.

* Fix namespace for check_err
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

f91579aa

clean (#143) · 22061366
Chao Liu authored Mar 22, 2022

22061366

22 Mar, 2022 2 commits

Grouped GEMM for fp16 (#126) · 716f1c7f

zjing14 authored Mar 22, 2022

* init of grouped_gemm

* 2 gemm test

* perf test

* clean

* wrap desc into a struct

* test cast static_arr to pointer

* add ptr to GemmDesc

* add grouped gemm profiler

* fixed mem issue with unique_ptr

* clean

* clean

* finished ckprofiler

* Update README.md

* readme

* fixed readme

* add example

* improve code

* fixed comments: reserve, separate ptr and gemm_shapes

* merge group and non-group

* fixed comments: replace push_back with emplace_back to avoid copy constructor

* fixed comments: unified blk2ctile; add test

* ci fix

* fixed ci

* fixed ci

* fixed ci

716f1c7f
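
The review item in #126 about replacing `push_back` with `emplace_back` can be sketched as follows. `GemmShape` and both functions are hypothetical stand-ins, not the types used in composable_kernel; a global counter records how often the copy constructor runs.

```cpp
#include <cstddef>
#include <vector>

int g_copies = 0; // number of copy-constructor calls observed

// Hypothetical stand-in for a GEMM problem description.
struct GemmShape
{
    int M, N, K;
    GemmShape(int m, int n, int k) : M(m), N(n), K(k) {}
    GemmShape(const GemmShape& other) : M(other.M), N(other.N), K(other.K) { ++g_copies; }
};

// Builds `count` shapes by pushing a named object; returns copies made.
int copies_with_push_back(std::size_t count)
{
    g_copies = 0;
    std::vector<GemmShape> shapes;
    shapes.reserve(count); // rule out copies caused by reallocation
    for(std::size_t i = 0; i < count; ++i)
    {
        GemmShape s(256, 128, 64);
        shapes.push_back(s); // copies s into the vector
    }
    return g_copies;
}

// Builds `count` shapes in place; returns copies made.
int copies_with_emplace_back(std::size_t count)
{
    g_copies = 0;
    std::vector<GemmShape> shapes;
    shapes.reserve(count);
    for(std::size_t i = 0; i < count; ++i)
    {
        shapes.emplace_back(256, 128, 64); // forwards arguments to the constructor
    }
    return g_copies;
}
```

With `reserve` ruling out reallocation, the `push_back` variant copies each element once, while the `emplace_back` variant constructs each element directly in the vector's storage.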

Reduction for int8 and bfloat16 (#125) · 9a8ee8a3

Qianfeng authored Mar 23, 2022



* Use thread cluster descriptor and explicit M_K 2d descriptor to simplify Blockwise Reduction

* Change by replacing ReduceDims by NumReduceDims as Device Reduce interface template parameter

* Rename the folder name for the pool2d and reduce examples

* Update to reduction test scripts

* Add Readme for pool2d_fwd and reduce_blockwise examples

* Add support for int8_t reduction (ADD/AVG, MIN/MAX/AMAX)

* Tiny fix in reduce profiler and tiny update in reduce testing scripts

* Tiny fix in testing script profile_reduce_no_index.sh

* Tiny fix in testing script profile_reduce_no_index.sh

* Add support for bfp16 reduction (using bhalf_t = ushort)

* Tiny fix in amd_buffer_addressing.hpp

* Tiny change in script/profile_reduce_with_index.sh

* Use AccDataType for Beta value and use element_wise::PassThrough

* Use type_convert for type converting in host layer reduction

* Renaming and refining in Reduction profiler/device layer/examples

* Renaming and refining in Reduction profiler/device layer/examples

* Renaming all NumReduceDims to NumReduceDim

* Fix the leaked type_convert in ThreadwiseTensorSliceTransfer_v2

* Update to testing scripts to add bf16 support

* added more static_assert

* Remove buggy tunable configurations defined in device_reduce_instance_xxx.hpp

* Add static_assert to give compile-time warning for incorrect thread slice-size/vector-size configurations

* minor change

* Refine and fix (in GetWorkspaceSizeInBytes of MultiBlockPartialReduce) to make int8 completely pass

* Tiny renaming in gridwise_2d_reduction_multiblock_partial_reduce.hpp

* Tiny fix in script/profile_reduce_no_index.sh

* Refine in DeviceReduce layer with regard to using NumInvariantDim/NumReduceDim or InvariantDims/ReduceDims

* Generic renaming in host reduction and DeviceReduce layer

* Add support for 4-d all dimension reduction in the profiler and add_device_reduce_xxx instances

* Use multi-thread and simplification for host Reduction implementation

* Add ctest for reduction

* Update to clarify the using of data init method in produce_reduce/example_reduce/test_reduce/

* Update to the reduce CTest executables to enable default testing behavior when no command argument

* Renaming
Co-authored-by: Jianfeng yan <jfyan008@gmail.com>

9a8ee8a3

21 Mar, 2022 2 commits

refactored deviceBatchedGemm; removed GridwiseBatchedGemm; added fp32 and int8 to profiler (#120) · cb87b049

Jianfeng Yan authored Mar 21, 2022

changed long_index_t to index_t when computing memory offset

uncomment other ops in profiler

added test for batched_gemm

cb87b049

Fix conv2d bwd data bug when filter is 1x1 and stride = 2 (#132) · b51808d7

ltqin authored Mar 21, 2022

* fix bwd data filter1strid2 bug

* change short to ck::bhalf_t

* reset input to zero
Co-authored-by: ltqin <letaoqin@amd.com>

b51808d7