Commits · 880fbee95782a30fb16654f830502d03dd92fae2 · yangql / composable_kernel-1

12 Feb, 2022 1 commit

NHWC conv 2d: fwd bfp16/int8, Device level tuning and host API (#73) · 880fbee9

ltqin authored Feb 12, 2022



* add fwd bf16 conv

* change tuning parameter

* add int8 for conv fwd

* remove comments

* change tuning parameter for int8

* change init int8 example

* add test for conv2d fwd

* change device operation file pos because merge develop

* fwd int8 use reference

* test_conv_fwd use reference

* add bracket for if statement

* rename fwd example name

* remove StaticBufferOfVectorTypeV2

* tweak example

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

880fbee9

11 Feb, 2022 1 commit

Batched GEMM for fp16 (#79) · b53e9d08

zjing14 authored Feb 11, 2022

* prepare host for batched_gemm

* init commit of batched kernels

* fixed

* refine transform with freeze

* m/n padding

* fixed a bug; clean

* add small tiles

* clean

* clean code

* clean code

* add nt, tn, tt layout

* add missing file

* use StaticBufferTupleOfVector instead

* add reference_batched_gemm

* fixed a macro

b53e9d08

03 Feb, 2022 1 commit

Replace llvm intrinsics with clang builtins (#65) · 6d92959a

zjing14 authored Feb 02, 2022

* test mfma builtins

* add fp16 builtins

* add int8 builtins

* add bfl16 builtins

* simplify host conv forward

* clean

* clean

6d92959a

26 Dec, 2021 1 commit

Fusion Conv+Bias+ReLU(+Add) (#62) · acbd7bd7

Chao Liu authored Dec 26, 2021

* fix relu

* clean up

* clean up

* adding 1x1 conv

* adding 1x1 conv

* added 1x1 conv

* refactor

* refactor

* refactor

* added profiler for conv+bias+relu+add

* clean up

* adding conv+bias+relu

* adding conv+bias+relu

* added conv+bias+relu

* Update README.md

* update cpu verification

* adding c shuffle

* update static_tensor for dealing with invalid element

* adding c shuffle

* debugging

* fix bug

* convert to fp16 before shuffle

* shuffle more than one M/NRepeat

* clean up

* remove coordinate step hack from GridwiseGemm_k0mk1_k0nk1_mn_xdlops_v3r1

* clean up

* remove coordinate step hack from all gridwise gemm xdl

* clean up coordinate step hack

* clean up coordinate step hack

* ThreadwiseTensorSliceTransfer_v3r2 support pointwise op on both src and dst

* adding output shuffle in conv+bias+relu+add

* update

* added conv+bias+relu+add with c shuffle

* added conv+bias+relu+add with c shuffle

* fix forward_sweep bugs in threadwise copy

* clean up

* refactor

* clean up

* clean up

* added conv_c_shuffle+bias_relu

* clean up

* added conv+bias+relu+atomic_add

* clean up

* clean up

* clean up

* clean up

* clean up

* clean up

* misc fixes; add 1x1 specialization

* clean up

* delete unused device op

* clean up

* add support for odd C value

acbd7bd7

03 Dec, 2021 1 commit

GEMM/Conv+BiasAdd+ReLU+Add (#55) · 41cdd380

Chao Liu authored Dec 02, 2021

* gemm+activation

* move C pointwise operation into threadwise copy

* add pointwise operation to A/B matrix

* update ckProfiler

* adding bias add

* adding bias add

* adding bias add

* added bias add; worked around compiler issues

* clean up

* clean up

* Update README.md

* Update README.md

* Update README.md

* clean up

* add conv_xdl example

* adding conv_xdl_bias_relu_add example

* add conv+bias+relu+add, but has register spill issue

* tweak

* tweak

* refactor

* Update README.md

update readme for example/2_gemm_xdl_bias_relu_add

* clean up

* Update README.md

update readme for example/3_conv_xdl

* Update README.md

41cdd380

02 Dec, 2021 1 commit
- add static_buffer_v2 zero out · 2cbb8976
  Jing Zhang authored Dec 02, 2021
  
  2cbb8976
18 Nov, 2021 2 commits

Use __builtin_memcpy to implement bit_cast and for accessing vector from pointer of scalars (#53) · 64350aff
Chao Liu authored Nov 18, 2021
```
* reworking vector_type

* use __builtin_memcpy for bit_cast and vector access of scalar pointer

* clean up
```
64350aff
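
The technique named in this commit title, copying object bytes with `__builtin_memcpy` rather than casting pointers, can be sketched roughly as below. This is an illustrative sketch only, not the code from the commit: `bit_cast_sketch`, `load_vec2`, and `Vec2` are hypothetical names, and `__builtin_memcpy` is assumed to be available as a GCC/Clang builtin.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical helper: reinterpret the bytes of `src` as a value of type To.
// Copying bytes avoids the aliasing problems of casting between pointer types.
template <typename To, typename From>
To bit_cast_sketch(const From& src)
{
    static_assert(sizeof(To) == sizeof(From), "sizes must match");
    static_assert(std::is_trivially_copyable<To>::value &&
                      std::is_trivially_copyable<From>::value,
                  "types must be trivially copyable");
    To dst;
    __builtin_memcpy(&dst, &src, sizeof(To));
    return dst;
}

// Hypothetical stand-in for a two-element vector type.
struct Vec2
{
    float x;
    float y;
};

// Read two consecutive scalars through a scalar pointer as one vector value.
inline Vec2 load_vec2(const float* p)
{
    assert(p != nullptr);
    Vec2 v;
    __builtin_memcpy(&v, p, sizeof(Vec2));
    return v;
}
```

With optimization enabled, compilers typically lower such fixed-size copies to plain register moves, which is presumably why the approach suits device code.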

v5r1 fusion kernels for inference (#49) · 970fa3e9

zjing14 authored Nov 18, 2021



* init

* refactor for 1x1

* rename e0_e1

* add e1 with bugs

* debug

* fixed

* fixed e1

* add timer

* improve threadwise gemm with dot2

* add e2

* tuning

* separate c2

* add nhwc

* restore nchwc

* clean

* opt

* fixed; tuning

* add BGlobalMoveSliceWindowStepHacks{}

* tuning

* repeat running

* adjust

* merge v5r1 nchwc

* add adaptors

* split k0 k1 in c_thread_grid

* split h and w

* remove v5r1 nhwc

* clean for pr

* remove host_conv_add

* clean code

* clean

* add dynamic support

* static mode

* test static

* add conv+add fusion

* fixed validation

* naming fix

* use activ_enum

* make static

* refactor conv_add for InMem::add

* add bias

* add conv_out

* add configurable makeddesc

* add maxpool fusion

* add maxpool host for validation

* enable static desc

* conv-only use v5r1_add

* test

* test

* for binary dumps

* fixed incorrect results due to typo

* clean

* debugging maxpool

* workaround with offset trick

* clean code

* modularize ops of fusion

* add gridwise_gemm_v3

* create separate fusion fun

* enable dynamic mode of conv and conv+resize_add

* add dynamic mode of maxpool

* add pass by point

* add activ_type as arguments

* merge develop

* clean

* reset config to old default

Co-authored-by: Chao Liu <chao.liu2@amd.com>

970fa3e9

16 Nov, 2021 2 commits
- fixed multiple definition issue of bfp16/fp32 conversion function when building ckProfiler (#51) · 0a66c54e
  zjing14 authored Nov 16, 2021
```
* fixed bfloat16 issues

* refactor type_convert
Co-authored-by: Chao Liu <chao.liu2@amd.com>
```
  0a66c54e
- updated bfloat16_to_float · 89e1ebd4
  Jing Zhang authored Nov 16, 2021
  
  89e1ebd4
15 Nov, 2021 2 commits

Add bfp16/int8 support into XDL GEMM operator (#50) · 3737bb03

zjing14 authored Nov 15, 2021



* init StaticBufferV2

* clean

* adopt old output stage for staticBufferV2

* clean

* remove hack

* clean

* clean

* add parameters

* clean code

* move c_buffer alloc into blockwise gemm

* add adaptors for m/n_thread_data_on_grid

* tweak gemm

* adjust blockwise_gemm_xdlops

* tweak

* update conv

* update script

* adding bwd 1x1

* update script

* adding 1x1 bwd

* debugging bwd 1x1 failure

* update script

* update script

* test

* test v100

* add bf16_1k

* clang-format

* clean

* add bfp16 for gfx908

* add verification

* clean up

* clean code

* restore bfl16

* clean

* add bfp16 support into gemm_driver

* apply new generator to other drivers

* add int8 support

* clean

* clean

* clean

* clean

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: root <root@hayabusa6111.amd.com>

3737bb03

FP16 data in-register transpose (#41) · b491ebf3

Chao Liu authored Nov 15, 2021

* start fixing 16bit data packing

* adding StaticTensor

* adding StaticTensor

* adding StaticTensor

* add missing constexpr

* adding static tensor

* adding static tensor

* adding transpose

* add inline asm for transpose 2x2 of half_t

* add general transpose_vectors(), but have unnecessary register initialization using v_mov

* fix unnecessary register initialization in transpose_vector by using more pass-by-reference

* add hardcoded logic for NHWC wrw

* improve asm for v_pack

* make ThreadwiseTensorSliceTransfer_v3r2 support any tensor

* tweak

* reorganize file

b491ebf3

14 Nov, 2021 1 commit

ckProfiler and device-level XDL GEMM operator (#48) · e823d518

Chao Liu authored Nov 14, 2021

* add DeviceGemmXdl

* update script

* fix naming issue

* fix comment

* output HostTensorDescriptor

* rename

* padded GEMM for fwd v4r4r4 nhwc

* refactor

* refactor

* refactor

* adding ckProfiler

* adding ckProfiler

* refactor

* fix tuning parameter bug

* add more gemm instances

* add more fp16 GEMM instances

* fix profiler driver

* fix bug in tuning parameter

* add fp32 gemm instances

* small fix

* refactor

* rename

* refactor gemm profiler; adding DeviceConv and conv profiler

* refactor

* fix

* add conv profiler

* refactor

* adding more GEMM and Conv instance

* Create README.md

Add build instruction for ckProfiler

* Create README.md

Add Readme for gemm_xdl example

* Update README.md

Remove build instruction from top most folder

* Update README.md

* clean up

e823d518

06 Oct, 2021 3 commits

[MIOpen Downstream] Fix Reduction Kernel (#34) · b2dc55f8

Qianfeng authored Oct 07, 2021



* Tiny fix in using data type template parameters in blockwise and direct_threadwise kernel

* Fix with regard to implementing GetZeroVal() in both kernel and host

* Avoid converting to compType from dstDataType before writing the output value

* Add half_t support to NumericLimits and make constexpr GetZeroVal() of binary operator

* Add CONSTANT decorator for descriptor read buffer

* Use get_thread_local_1d_id() for thread local Id

* Rename GetZeroVal() to GetReductionZeroVal() in the kernels

* Remove constexpr from initialized zeroVal and tiny fix in reduction_operator.hpp

* Occasional tiny simplification and update in the kernel files

* Update to re-order tensor dimensions on the host, split second_call kernel wrapper files and simplify reduce_all kernel wrappers

* Update to remove OpenCL tidy checking failures

* Update for better readability

* Remove unused code and unneeded template parameters in the kernel wrappers

Co-authored-by: Chao Liu <chao.liu2@amd.com>

b2dc55f8

Tweak GEMM kernel (#38) · b3e8d57d

Chao Liu authored Oct 06, 2021

* add parameters

* tweak gemm

* tweak

* update conv

* update script

* adding bwd 1x1

* update script

* adding 1x1 bwd

* debugging bwd 1x1 failure

* update script

* update script

* test

* test v100

* clean up

b3e8d57d

Add VectorType support into StaticBuffer (#27) · 846f462b

zjing14 authored Oct 06, 2021



* init StaticBufferV2

* clean

* adopt old output stage for staticBufferV2

* clean

* remove hack

* clean

* clean

* clean code

* move c_buffer alloc into blockwise gemm

* add adaptors for m/n_thread_data_on_grid

* adjust blockwise_gemm_xdlops

* reorder ops in GEMM hot loop

Co-authored-by: Chao Liu <chao.liu2@amd.com>

846f462b

29 Sep, 2021 1 commit

[Enhancements] Several bugfixes and refactoring of dynamic generic reduction (#1156) · dfb80c4e

Qianfeng authored Sep 29, 2021

* Squashed 'src/composable_kernel/' content from commit f6edda61

git-subtree-dir: src/composable_kernel
git-subtree-split: f6edda61

* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

* Squashed 'src/composable_kernel/' changes from f6edda61..5781adf5

5781adf5 Update develop (#5) (#6)
97e6d514 Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile
7b1ec41e refactor
49c33aae refactor
54b3e73d rename

git-subtree-dir: src/composable_kernel
git-subtree-split: 5781adf5



* fix

* refactor

* remove online compilation from CK

* refactor

* fix

* add ctest

* tidy

* add tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* add c-style pointer cast

* vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast

* fix clang warning suppression

* tidy

* suppress cppcheck

* fix enum issue

* revert changes to hip build

* fix kernel filename

* update CK build script

* rename

* rename

* make inner product compatible on gfx900

* Update src/include/miopen/solver/ck_utility_common.hpp
Co-authored-by: JD <Jehandad.Khan@amd.com>

* compiler parameter use stream

* use int instead of index_t in kernel wrapper

* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element

* refactor

* refactor

* change cmakelist

* change ck common utility

* fix

* Squashed 'src/composable_kernel/' changes from 5781adf5..31b40352

31b40352 Merge pull request #16 from ROCmSoftwarePlatform/develop
b62bf8c3 Merge pull request #14 from ROCmSoftwarePlatform/miopen_downstream_init_integration
ccc4a1d3 Merge pull request #8 from ROCmSoftwarePlatform/miopen_downstream_init_integration
67ad47e7 refactor
16effa76 refactor
a91b68df DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
2cbabbba use int instead of index_t in kernel wrapper
0834bc76 compiler parameter use stream
f2ac7832 make inner product compatible on gfx900
4e57b30a rename
c03045ce rename
b2589957 update CK build script
2c48039d fix kernel filename
d626dccc fix enum issue
643ebd4f tidy
ddd49ec9 fix clang warning suppression
4f566c62 vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast
172036d7 add c-style pointer cast
76f31319 tidy
d1842890 tidy
f885c131 tidy
80120f0a tidy
c3efeb5e tidy
56fc0842 tidy
54fba515 tidy
e62bae7a tidy
24c87289 add tidy
61487e0a fix
ae98b52a remove online compilation from CK
cb954213 refactor
73ca9701 Merge commit '437cc595c6e206dfebb118985b5171bbc1e29eab' into composable_kernel_init_integration_v3
3b866461 Merge pull request #7 from ROCmSoftwarePlatform/master
d09ea4f4 Update develop (#5)
3d32ae94 add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

git-subtree-dir: src/composable_kernel
git-subtree-split: 31b40352



* Tiny fix in using data type template parameters in blockwise and direct_threadwise kernel

* Fix with regard to implementing GetZeroVal() in both kernel and host

* Avoid converting to compType from dstDataType before writing the output value

* Add half_t support to NumericLimits and make constexpr GetZeroVal() of binary operator

* Add CONSTANT decorator for descriptor read buffer

* Use get_thread_local_1d_id() for thread local Id

* Rename GetZeroVal() to GetReductionZeroVal() in the kernels

* Remove constexpr from initialized zeroVal and tiny fix in reduction_operator.hpp

* Occasional tiny simplification and update in the kernel files

* Update in src/reducetensor.cpp for consistent IDs passing to the kernel

* Update to re-order tensor dimensions on the host, split second_call kernel wrapper files and simplify reduce_all kernel wrappers

* Update to remove OpenCL tidy checking failures

* Small updates in src/reducetensor.cpp

* Update for better readability

* Remove unused code and unneeded template parameters in the kernel wrappers

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: JD <Jehandad.Khan@amd.com>

dfb80c4e

27 Aug, 2021 2 commits

Misc fixes (#24) · 10bb8110

Chao Liu authored Aug 26, 2021

* use cast_pointer_to_generic_address_space() in v6r1 kernel wrapper; DynamicBuffer and buffer_load take customized invalid-element value; add buffer_load/store for fp64

* use remove_cvref_t

10bb8110

[SWDEV-281541][MSRCHA-100] Implementation of Dynamic Generic Reduction (#1108) · 9e80cdce

Qianfeng authored Aug 27, 2021



* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

* make inner product compatible on gfx900

* Update src/include/miopen/solver/ck_utility_common.hpp

* compiler parameter use stream

* use int instead of index_t in kernel wrapper

* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element

* Add dynamic generic reduction kernel layer (kernel wrappers, kernel implementations and utilities)

* Some updates to dynamic composable kernel facility for the need of dynamic generic reduction

* Update to generic reduction C++ host interface layer to support dynamic generic reduction

* Update to remove tidy complaints in host interface layer

* Change the unary operator form from void op(T &x) to T op(T x)

* Update to pass single workspace pointer for all kernels (fix for OpenCL backend)

* Use cppcheck-suppress to prevent some strange warnings

* Re-use operator [] and () for DynamicBuffer and update dependent code

* Remove useless code in first call threadwise/warpwise/blockwise kernel wrappers

* [performance] Remove un-needed local buffer initialization

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: JD <Jehandad.Khan@amd.com>

9e80cdce

25 Aug, 2021 1 commit

GlobalAtomicAdd for fp32/int32 (#23) · a7a758d8

zjing14 authored Aug 25, 2021



* add f32/i32 atomicAdd support into dynamicBuffer, and enable it in v1r3

* fixed

* fixed

* update comment

Co-authored-by: Chao Liu <chao.liu2@amd.com>

a7a758d8

23 Aug, 2021 1 commit
- magic division use __umulhi() (#19) · c6f26bb4
  Chao Liu authored Aug 23, 2021
  
  c6f26bb4
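
For context, "magic division" replaces an integer divide by a fixed divisor with a multiply-high, an add, and a shift. The sketch below shows one common formulation under stated assumptions and is not the code from this commit: `MagicNumbers`, `make_magic`, `magic_divide`, and `umulhi_sketch` are hypothetical names, the device intrinsic `__umulhi()` is emulated with a 64-bit multiply so the example runs on the host, and both divisor and dividend are assumed to be below 2^31.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the per-divisor constants.
struct MagicNumbers
{
    std::uint32_t multiplier;
    std::uint32_t shift;
};

// Host-side stand-in for __umulhi(): upper 32 bits of the 64-bit product.
inline std::uint32_t umulhi_sketch(std::uint32_t a, std::uint32_t b)
{
    return static_cast<std::uint32_t>((static_cast<std::uint64_t>(a) * b) >> 32);
}

// Precompute the constants once per divisor (assumes 1 <= divisor < 2^31).
inline MagicNumbers make_magic(std::uint32_t divisor)
{
    assert(divisor >= 1 && divisor < (1u << 31));
    std::uint32_t shift = 0;
    while((std::uint64_t{1} << shift) < divisor)
    {
        ++shift; // ends at ceil(log2(divisor))
    }
    const std::uint64_t one = 1;
    const std::uint64_t m   = ((one << 32) * ((one << shift) - divisor)) / divisor + 1;
    return {static_cast<std::uint32_t>(m), shift};
}

// Divide without a hardware divide (assumes dividend < 2^31 so the add cannot overflow).
inline std::uint32_t magic_divide(std::uint32_t dividend, MagicNumbers magic)
{
    assert(dividend < (1u << 31));
    const std::uint32_t tmp = umulhi_sketch(dividend, magic.multiplier);
    return (tmp + dividend) >> magic.shift;
}
```

The constants depend only on the divisor, so they can be computed once on the host and reused for every index calculation in a kernel.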
19 Aug, 2021 1 commit

Composable kernel init integration v3 (#1097) · 6fe3627a

Chao Liu authored Aug 19, 2021

* Squashed 'src/composable_kernel/' content from commit f6edda61

git-subtree-dir: src/composable_kernel
git-subtree-split: f6edda61

* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

* Squashed 'src/composable_kernel/' changes from f6edda61..5781adf5

5781adf5 Update develop (#5) (#6)
97e6d514 Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile
7b1ec41e refactor
49c33aae refactor
54b3e73d rename

git-subtree-dir: src/composable_kernel
git-subtree-split: 5781adf5



* fix

* refactor

* remove online compilation from CK

* refactor

* fix

* add ctest

* add c-style pointer cast

* vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast

* fix clang warning suppression

* tidy

* suppress cppcheck

* fix enum issue

* revert changes to hip build

* fix kernel filename

* update CK build script

* rename

* rename

* make inner product compatible on gfx900

* Update src/include/miopen/solver/ck_utility_common.hpp
Co-authored-by: JD <Jehandad.Khan@amd.com>

* compiler parameter use stream

* use int instead of index_t in kernel wrapper

* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element

* refactor

* refactor

* change cmakelist

* change ck common utility

* fix

Co-authored-by: JD <Jehandad.Khan@amd.com>

6fe3627a

16 Aug, 2021 2 commits
- refactor · 67ad47e7
  Chao Liu authored Aug 16, 2021
  
  67ad47e7
- refactor · 16effa76
  Chao Liu authored Aug 16, 2021
  
  16effa76
13 Aug, 2021 1 commit
- DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element · a91b68df
  Chao Liu authored Aug 13, 2021
  
  a91b68df
11 Aug, 2021 1 commit
- make inner product compatible on gfx900 · f2ac7832
  Chao Liu authored Aug 11, 2021
  
  f2ac7832
10 Aug, 2021 5 commits
- rename · c03045ce
  Chao Liu authored Aug 10, 2021
  
  c03045ce
- fix enum issue · d626dccc
  Chao Liu authored Aug 10, 2021
  
  d626dccc
- fix clang warning suppression · ddd49ec9
  Chao Liu authored Aug 10, 2021
  
  ddd49ec9
- vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast · 4f566c62
  Chao Liu authored Aug 10, 2021
  
  4f566c62
- add c-style pointer cast · 172036d7
  Chao Liu authored Aug 10, 2021
  
  172036d7
09 Aug, 2021 3 commits
- tidy · f885c131
  Chao Liu authored Aug 09, 2021
  
  f885c131
- tidy · 80120f0a
  Chao Liu authored Aug 09, 2021
  
  80120f0a
- tidy · 54fba515
  Chao Liu authored Aug 09, 2021
  
  54fba515
08 Aug, 2021 1 commit
- add tidy · 24c87289
  Chao Liu authored Aug 08, 2021
  
  24c87289
30 Jul, 2021 1 commit
- update to clang-format-10 · 82fae390
  Chao Liu authored Jul 30, 2021
  
  82fae390
27 Jul, 2021 1 commit

[MIOpen Downstream] Initial MIOpen integration (#52) · f63a23ac

Chao Liu authored Jul 27, 2021

* update online kernel wrapper bundle all descriptors in a tuple

* change __CONSTANT__ to CONSTANT

* rename

* adding tuning

* added IsValidCompileParameter

* reorganize

* adding tunable for fp16 and int8

* fix kernel compile warning and bug fixes

* suppress warning about cast CONSTANT (address space 4) pointer

* fix building issue

f63a23ac

18 Jul, 2021 1 commit

reorganize files to prepare for MIOpen integration (#51) · 12649254

Chao Liu authored Jul 18, 2021

* change olc cmake

* adding online compile to fwd-v4r5r2

* update scripts

* rename fwd-v4r5r2 to fwd-v6r1

* clean up

12649254

17 Jul, 2021 1 commit

Add xdlops v4r4r4 into online compilation (#48) · fbdf4332

zjing14 authored Jul 16, 2021



* init for v4r4 xdlops olc

* refactor wrap

* init impl of v4r4 nchw xdlops olc

* tuning

* test perf

* fixed v4r4 nhwc

* tuned v4r4 nhwc

* use gridwise_gemm_xdlops_v2r3

* swap a/b

* add pointer support into offline v2r3

* debugging v4r4r4 transform for olc

* change timer of olc

* refactor v4r4 xdlops nchw olc

* remove transform fun in v4r4 xdlops nhwc olc

Co-authored-by: Chao Liu <chao.liu2@amd.com>

fbdf4332

08 Jul, 2021 1 commit
- Update default launch bounds (#43) · 2f82cfb1
  Chao Liu authored Jul 08, 2021
```
* update default launch bounds
```
  2f82cfb1