Commits · 303b1a86241eb03b15f644de993818ead6306604 · gaoqiong / composable_kernel

27 Dec, 2021 2 commits
- remove step hack · 303b1a86
  ltqin authored Dec 27, 2021
- Merge branch 'develop' into conv_splitk_f32 · aaa89914
  ltqin authored Dec 27, 2021
26 Dec, 2021 1 commit

Fusion Conv+Bias+ReLU(+Add) (#62) · acbd7bd7

Chao Liu authored Dec 26, 2021

* fix relu

* clean up

* clean up

* adding 1x1 conv

* adding 1x1 conv

* added 1x1 conv

* refactor

* refactor

* refactor

* added profiler for conv+bias+relu+add

* clean up

* adding conv+bias+relu

* adding conv+bias+relu

* added conv+bias+relu

* Update README.md

* update cpu verification

* adding c shuffle

* update static_tensor for dealing with invalid element

* adding c shuffle

* debugging

* fix bug

* convert to fp16 before shuffle

* shuffle more than one M/NRepeat

* clean up

* remove coordinate step hack from GridwiseGemm_k0mk1_k0nk1_mn_xdlops_v3r1

* clean up

* remove coordinate step hack from all gridwise gemm xdl

* clean up coordinate step hack

* clean up coordinate step hack

* ThreadwiseTensorSliceTransfer_v3r2 support pointwise op on both src and dst

* adding output shuffle in conv+bias+relu+add

* update

* added conv+bias+relu+add with c shuffle

* added conv+bias+relu+add with c shuffle

* fix forward_sweep bugs in threadwise copy

* clean up

* refactor

* clean up

* clean up

* added conv_c_shuffle+bias_relu

* clean up

* added conv+bias+relu+atomic_add

* clean up

* clean up

* clean up

* clean up

* clean up

* clean up

* misc fixes; add 1x1 specialization

* clean up

* delete unused device op

* clean up

* add support for odd C value

acbd7bd7
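
For readers unfamiliar with the fusion named in this commit, the per-element math of a Conv+Bias+ReLU(+Add) epilogue can be sketched on the host as below. This is only an illustration: the function names are hypothetical, and the operation order (bias add, then ReLU, then residual add) is an assumption rather than something this log confirms.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical per-element epilogue: y = relu(acc + bias).
inline float bias_relu(float acc, float bias)
{
    return std::max(acc + bias, 0.0f);
}

// Hypothetical per-element epilogue with residual: y = relu(acc + bias) + residual.
// The order of ReLU and the residual add is assumed, not taken from the commit.
inline float bias_relu_add(float acc, float bias, float residual)
{
    return bias_relu(acc, bias) + residual;
}

// Apply the epilogue to a row-major [N, K] buffer of convolution accumulators.
// bias holds one value per output channel K and must be non-empty;
// residual has the same shape as acc.
inline std::vector<float> apply_bias_relu_add(const std::vector<float>& acc,
                                              const std::vector<float>& bias,
                                              const std::vector<float>& residual)
{
    const std::size_t k = bias.size();
    std::vector<float> out(acc.size());
    for(std::size_t i = 0; i < acc.size(); ++i)
    {
        out[i] = bias_relu_add(acc[i], bias[i % k], residual[i]);
    }
    return out;
}
```

Fusing these steps into the convolution's output stage avoids writing the intermediate tensor to memory and reading it back for each pointwise operation.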

24 Dec, 2021 1 commit
- add new line at the end of device_gemm_xdl_instance.hpp · f8804804
  ltqin authored Dec 24, 2021
16 Dec, 2021 1 commit
- add test · 1b4ae8b5
  ltqin authored Dec 16, 2021
14 Dec, 2021 1 commit
- Merge branch 'develop' into conv_splitk_f32 · 982e59b3
  ltqin authored Dec 14, 2021
13 Dec, 2021 1 commit

manually apply bug fix changes in pr #63 (#64) · a4f24233

Chao Liu authored Dec 12, 2021

* Bug in BlockwiseGemmXdlops_k0mk1_k0nk1_m0n0m1n1m2m3m4n2_v1::MakeCGridDescriptor_M0_N0_M1_N1_M2_M3_M4_N2()
* Bug in ThreadwiseTensorSliceTransfer_v1r3 logic for calculating "forward_sweep"

a4f24233

09 Dec, 2021 4 commits
- remove macro for splitk switch · f683fed7
  ltqin authored Dec 09, 2021
- fixed MPerBlock=96 · b59d5490
  ltqin authored Dec 09, 2021
- add element wise operation · 0eed5076
  ltqin authored Dec 09, 2021
- Merge branch 'develop' into conv_splitk_f32 · c29dc4c5
  ltqin authored Dec 09, 2021
04 Dec, 2021 1 commit
- fix ReLU formula (#61) · fd3d907a
  Chao Liu authored Dec 04, 2021

  * fix relu
  * clean up
  * clean up
03 Dec, 2021 1 commit

GEMM/Conv+BiasAdd+ReLU+Add (#55) · 41cdd380

Chao Liu authored Dec 02, 2021

* gemm+activation

* move C pointwise operation into threadwise copy

* add pointwise operation to A/B matrix

* update ckProfiler

* adding bias add

* adding bias add

* adding bias add

* added bias add; worked around compiler issues

* clean up

* clean up

* Update README.md

* Update README.md

* Update README.md

* clean up

* add conv_xdl example

* adding conv_xdl_bias_relu_add example

* add conv+bias+relu+add, but has register spill issue

* tweak

* tweak

* refactor

* Update README.md

update readme for example/2_gemm_xdl_bias_relu_add

* clean up

* Update README.md

update readme for example/3_conv_xdl

* Update README.md

41cdd380

02 Dec, 2021 5 commits
- renaming/comments · d7a0a3f9
  Jing Zhang authored Dec 02, 2021
- add lost config · 134af43b
  ltqin authored Dec 02, 2021
- add m=96 tuning parameter · 5576da22
  ltqin authored Dec 02, 2021
- add static_buffer_v2 zero out · 2cbb8976
  Jing Zhang authored Dec 02, 2021
- fixed c_buffer alloc · d798c9b8
  Jing Zhang authored Dec 02, 2021
01 Dec, 2021 1 commit
- Merge branch 'develop' into conv_splitk_f32 · a037693f
  ltqin authored Dec 01, 2021
30 Nov, 2021 2 commits
- fix layout naming convention (#56) · 4041850f
  Chao Liu authored Nov 30, 2021
- added test for magic number division (#58) · 237d4ca0
  Chao Liu authored Nov 30, 2021
25 Nov, 2021 4 commits
- add tuning parameter for TT · 0694d6ed
  ltqin authored Nov 25, 2021
- add tuning parameter for TN · b282e62f
  ltqin authored Nov 25, 2021
- add tuning parameter for NT · 000db488
  ltqin authored Nov 25, 2021
- grid size change to 720 · b98e339d
  ltqin authored Nov 25, 2021
24 Nov, 2021 3 commits
- add args for packed gemm (#54) · 567f5e9c
  zjing14 authored Nov 24, 2021
- add all tuning parameter to f32 mkkn · 6a2157f4
  ltqin authored Nov 24, 2021
- using atomic · 114f9298
  ltqin authored Nov 24, 2021
23 Nov, 2021 2 commits
- set c matrix zero · b7ec2078
  ltqin authored Nov 23, 2021
- add file device_gemm_splitk_xdl.hpp · d1998945
  ltqin authored Nov 23, 2021
22 Nov, 2021 1 commit
- add DeviceGemmSplitKXdl · a624666d
  ltqin authored Nov 22, 2021
18 Nov, 2021 3 commits

Use __builtin_memcpy to implement bit_cast and for accessing vector from pointer of scalars (#53) · 64350aff

Chao Liu authored Nov 18, 2021

* reworking vector_type

* use __builtin_memcpy for bit_cast and vector access of scalar pointer

* clean up

64350aff
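
The technique named in this commit title can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `bit_cast_sketch`, `float4_sketch`, and `load_float4` are hypothetical names, and the example assumes a GCC- or Clang-compatible compiler that provides `__builtin_memcpy`.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical bit_cast: reinterpret the bytes of src as type To.
// Copying through memcpy avoids the undefined behaviour of
// pointer-based type punning.
template <typename To, typename From>
To bit_cast_sketch(const From& src)
{
    static_assert(sizeof(To) == sizeof(From), "types must have the same size");
    static_assert(std::is_trivially_copyable<To>::value &&
                      std::is_trivially_copyable<From>::value,
                  "types must be trivially copyable");
    To dst;
    __builtin_memcpy(&dst, &src, sizeof(To));
    return dst;
}

// Hypothetical 4-wide vector type used only for this illustration.
struct float4_sketch
{
    float x[4];
};

// Read four consecutive scalars through a scalar pointer as one vector,
// again via memcpy rather than casting the pointer type.
inline float4_sketch load_float4(const float* p)
{
    float4_sketch v;
    __builtin_memcpy(&v, p, sizeof(v));
    return v;
}
```

For example, on a platform with IEEE 754 single-precision floats, `bit_cast_sketch<std::uint32_t>(1.0f)` yields `0x3f800000`. Compilers typically lower a fixed-size copy like this to a plain register move or vector load.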

v5r1 fusion kernels for inference (#49) · 970fa3e9

zjing14 authored Nov 18, 2021

* init

* refactor for 1x1

* rename e0_e1

* add e1 with bugs

* debug

* fixed

* fixed e1

* add timer

* improve threadwise gemm with dot2

* add e2

* tuning

* separate c2

* add nhwc

* restore nchwc

* clean

* opt

* fixed; tuning

* add BGlobalMoveSliceWindowStepHacks{}

* tuning

* repeat running

* adjust

* merge v5r1 nchwc

* add adaptors

* split k0 k1 in c_thread_grid

* split h and w

* remove v5r1 nhwc

* clean for pr

* remove host_conv_add

* clean code

* clean

* add dynamic support

* static mode

* test static

* add conv+add fusion

* fixed validation

* naming fix

* use activ_enum

* make static

* refactor conv_add for InMem::add

* add bias

* add conv_out

* add configurable makeddesc

* add maxpool fusion

* add maxpool host for validation

* enable static desc

* conv-only use v5r1_add

* test

* test

* for binary dumps

* fixed incorrect results due to typo

* clean

* debugging maxpool

* workaround with offset trick

* clean code

* modularize ops of fusion

* add gridwise_gemm_v3

* create separate fusion fun

* enable dynamic mode of conv and conv+resize_add

* add dynamic mode of maxpool

* add pass by point

* add activ_type as arguments

* merge develop

* clean

* reset config to old default

Co-authored-by: Chao Liu <chao.liu2@amd.com>

970fa3e9

Fixed bfp16 host_conv_fwd (#52) · a651ea4f

zjing14 authored Nov 18, 2021

* fixed bfloat16 issues

* refactor type_convert

* fixed host_convolution_forward for ushort

Co-authored-by: Chao Liu <chao.liu2@amd.com>

a651ea4f

16 Nov, 2021 2 commits
- fixed multiple definition issue of bfp16/fp32 conversion function when building ckProfiler (#51) · 0a66c54e
  zjing14 authored Nov 16, 2021

  * fixed bfloat16 issues
  * refactor type_convert

  Co-authored-by: Chao Liu <chao.liu2@amd.com>
- updated bfloat16_to_float · 89e1ebd4
  Jing Zhang authored Nov 16, 2021
15 Nov, 2021 2 commits

Add bfp16/int8 support into XDL GEMM operator (#50) · 3737bb03

zjing14 authored Nov 15, 2021

* init StaticBufferV2

* clean

* adopt old output stage for staticBufferV2

* clean

* remove hack

* clean

* clean

* add parameters

* clean code

* move c_buffer alloc into blockwise gemm

* add adaptors for m/n_thread_data_on_grid

* tweak gemm

* adjust blockwise_gemm_xdlops

* tweak

* update conv

* update script

* adding bwd 1x1

* update script

* adding 1x1 bwd

* debugging bwd 1x1 failure

* update script

* update script

* test

* test v100

* add bf16_1k

* clang-format

* clean

* add bfp16 for gfx908

* add verification

* clean up

* clean code

* restore bfl16

* clean

* add bfp16 support into gemm_driver

* apply new generator to other drivers

* add int8 support

* clean

* clean

* clean

* clean

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: root <root@hayabusa6111.amd.com>

3737bb03

FP16 data in-register transpose (#41) · b491ebf3

Chao Liu authored Nov 15, 2021

* start fixing 16bit data packing

* adding StaticTensor

* adding StaticTensor

* adding StaticTensor

* add missing constexpr

* adding static tensor

* adding static tensor

* adding transpose

* add inline asm for transpose 2x2 of half_t

* add general transpose_vectors(), but have unnecessary register initialization using v_mov

* fix unnecessary register initialization in transpose_vector by using more pass-by-reference

* add hardcoded logic for NHWC wrw

* improve asm for v_pack

* make ThreadwiseTensorSliceTransfer_v3r2 support any tensor

* tweak

* reorganize file

b491ebf3

14 Nov, 2021 1 commit

ckProfiler and device-level XDL GEMM operator (#48) · e823d518

Chao Liu authored Nov 14, 2021

* add DeviceGemmXdl

* update script

* fix naming issue

* fix comment

* output HostTensorDescriptor

* rename

* padded GEMM for fwd v4r4r4 nhwc

* refactor

* refactor

* refactor

* adding ckProfiler

* adding ckProfiler

* refactor

* fix tuning parameter bug

* add more gemm instances

* add more fp16 GEMM instances

* fix profiler driver

* fix bug in tuning parameter

* add fp32 gemm instances

* small fix

* refactor

* rename

* refactor gemm profiler; adding DeviceConv and conv profiler

* refactor

* fix

* add conv profiler

* refactor

* adding more GEMM and Conv instance

* Create README.md

Add build instruction for ckProfiler

* Create README.md

Add Readme for gemm_xdl example

* Update README.md

Remove build instruction from top most folder

* Update README.md

* clean up

e823d518

27 Oct, 2021 1 commit

[Bug Fix] GridwiseGemm_bk0mk1_bk0nk1_mn_xdlops_v2r4 loop issue (#44) · 6014185a

ltqin authored Oct 27, 2021

* change method for computing kpad

* remove unused variable: batchlen

* change KPerBlock to K0PerBlock

* fix bug for k0 == k0perblock

* fix bug in getting k0 index

* use math::integer_divide_ceil

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

6014185a
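
The ceiling-division helper mentioned in this commit can be illustrated with a short sketch. The names below are hypothetical stand-ins, and how the commit actually derives the padded K value is not shown in this log; the padding function only demonstrates the common "round up to a multiple" pattern, assuming a non-negative numerator and a positive divisor.

```cpp
#include <cstdint>

// Hypothetical stand-in for a helper such as math::integer_divide_ceil:
// returns ceil(x / y) for x >= 0 and y > 0 using integer arithmetic only.
constexpr std::int64_t integer_divide_ceil_sketch(std::int64_t x, std::int64_t y)
{
    return (x + y - 1) / y;
}

// Round k up to the next multiple of step (assumed padding pattern,
// not necessarily the formula used in the commit).
constexpr std::int64_t pad_to_multiple(std::int64_t k, std::int64_t step)
{
    return integer_divide_ceil_sketch(k, step) * step;
}

// Compile-time checks of the two helpers.
static_assert(integer_divide_ceil_sketch(10, 4) == 3, "10 / 4 rounds up to 3");
static_assert(pad_to_multiple(100, 32) == 128, "100 pads up to 128");
```

Compared with a hand-written loop-count calculation, a single helper like this makes the edge case where the dimension is an exact multiple of the block size explicit: `pad_to_multiple(64, 32)` stays at 64.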