Commits · 5b178874a1b2a1cae217e87e1988ab92a40d71b8 · yangql / composable_kernel-1

05 Mar, 2022 2 commits

Chao Liu authored Mar 05, 2022

* fix tests

* remove useless file

* fix test build

* reduce parallelism when compiling

* fix test

5b178874

Example for conv2d backward weight fp16 (#106) · 7a9b93f4

ltqin authored Mar 05, 2022



* add wrw reference

* start device

* raw not split version

* run simple example

* start to use atomic add

* simple transform result correct

* first version that can run

* fix atomic and set operator choice

* add check split-k

* format

* change input parameter

* add pad for t total

* rename example index
Co-authored-by: ltqin <letaoqin@amd.com>

7a9b93f4

04 Mar, 2022 4 commits

[Bf16 & int8] [example & ckprofiler] (#100) · 7e9a9d32

rocking5566 authored Mar 05, 2022



* Add int8 of mk_nk_mn to the ckProfiler

* Add example of int8 gemm

* Fix typo, use ushort instead of half_t for bfloat16

* replace ushortXXX_t to bhalfXXX_t

* rename ushort to bhalf_t

* Add bf16 example

* Add bf16 gemm to ckProfiler

* Fix alignment

* Fix typo

* Add unit test for gemm_xdl int8

* Add gemm_xdl fp32 unit test

* Add gemm_xdl bf16 unit test

* fix build

* fix build issue due to merge conflict

* Fix build

* Fix build error
Co-authored-by: rocking <chunylai@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

7e9a9d32

fix type in PR #101 (#107) · 0c79af12
Chao Liu authored Mar 04, 2022

0c79af12

Refactor threadwise copy using sfcurve (#101) · 0619ebf7

Jianfeng Yan authored Mar 04, 2022



* add space_filling_curve

* cleanup and move space_filling_curve into test

* WIP: start refactoring threadwise_transfer_v1r3

* threadwise_copy works but needs further refactoring

* add some comments

* add SpaceFillingCurve::GetIndices()

* minor changes

* removed GetIndices; refactored GetDstCoordinateResetStep

* add DynamicBuffer::Transfer, but Add is not tested

* rebased against develop

* threadwise_copy_v6r1/v6r2/v6r3 using space-filling curve start to work

* minor changes

* refactored threadcopy v3r1, v2; removed old implementations

* clang-format

* cleanup

* fix a typo in v6r3

* format
Co-authored-by: Chao Liu <chao.liu2@amd.com>

0619ebf7

NHWC conv 2d: bwd fp32/fp16/bfp16/int8, Device level tuning and host API (#92) · c254e5ab

ltqin authored Mar 04, 2022



* start conv2d bwd api

* kernel running

* add bwd reference

* change to no shuffle

* fix bwd reference

* pass verification

* add Filter1x1Stride1Pad0 and start testing

* change some tuning parameter

* fix test error

* add fp16 tuning parameter

* add bf16 tuning parameter

* add int8 tuning parameters

* change fp32 tuning parameter

* add bwd to profiler

* fix bug for bwd profiler

* fix ckProfiler bug

* change conv2d_bwd_xdl to fp16

* fix bug in comments

* fix precompile id

* fix enum conv name

* change _bwd_ to _bwd_data_

* change conv2d_bwd example id

* bwd to bwd data

* fix prehead

* fix MakeDefaultBlock2CTileMap, imported from merge of develop

* format bwd instance

* bwd to bwd data

* change name bwd to bwd data

* change name bwd to bwd data in example

* format code

* change conv2d bwd data id in example

* rewrite readme for example

* fix CalculateMagicNumbers about div zero

* add workaround CK_WORKAROUND_SWDEV_325164

* change test_conf2d_bwd_data show info

* format

* fix bug for workaround:CK_WORKAROUND_SWDEV_325164

* format tuning parameters

* format tuning parameters again

* format tuning parameters 3

* format tuning parameters 4

* remove add function template

* format

* update comment
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

c254e5ab

03 Mar, 2022 1 commit

Update test CMakeLists to add new tests automatically and add Jenkins stage for tests (#88) · 992f71e3

JD authored Mar 03, 2022



* add docker file and make default target buildable

* add Jenkinsfile

* remove empty env block

* fix package stage

* remove render group from docker run

* clean up Jenkins file

* add cppcheck as dev dependency

* update cmake file

* Add profiler build stage

* add hip_version config file for reduction operator

* correct jenkins var name

* Build release instead of debug

* Update test CMakeLists.txt
reorg test dir
add test stage

* reduce compile threads to prevent compiler crash

* add optional debug stage, update second test

* remove old test target

* fix tests to return proper results and self review

* Fix package name and make test run without args

* change Dockerfile to use rocm4.3.1

* remove parallelism from build

* Lower parallelism
Co-authored-by: Chao Liu <chao.liu2@amd.com>

992f71e3

28 Feb, 2022 1 commit

Allow distinct K0/K1 values for A/B block descriptor (#98) · 6d4450ef

Anthony Chang authored Feb 28, 2022



* add gitignore

* host tensor: allow generating sequentially increasing value in a given dimension

* gridwise gemm v3r1: allow distinct K0/K1 values for A/B block descriptor

- remove dangling header include
- modify example gemm_xdl accordingly
- infer KPack value from M/NPerXdl
- device conv2d fwd: update parameters accordingly for the underlying gridwise gemm v3r1
(API for conv2d fwd stays the same for now until we decide to expose individual K0s for activation and weight)

* add LDS data dump utility

* profiler: reflect API change for distinct K0/K1 for A/B matrices

* profiler: add conflict-free LDS write FP16 kernel instances

* fix accidental perf regression

* address feedback; cosmetic changes

* clang-format for new files

* format
Co-authored-by: Chao Liu <chao.liu2@amd.com>

6d4450ef

25 Feb, 2022 2 commits

Split k f16 (#97) · e221d11e

zjing14 authored Feb 25, 2022



* init for splitk f16

* a working prototype

* debug

* perf debug

* update example

* instances for mk kn

* add instances for all layers

* clean

* clean

* add tuning

* format

* add mn_padding into irregular tile

* clean
Co-authored-by: Chao Liu <chao.liu2@amd.com>

e221d11e

Space filling curve (#96) · bdedf64b

Jianfeng Yan authored Feb 24, 2022

* add space_filling_curve

* cleanup and move space_filling_curve into test

* add functions for backward and forward step; hard coded results in unit test

* minor changes

bdedf64b

23 Feb, 2022 3 commits

Add gridwise GEMM pipeline (#89) · 22d438ae

Chao Liu authored Feb 23, 2022

* clean up

* add multiple thread scratch to ThreadwiseTensorSliceTransfer_v3r1

* add 2 stage prefetch

* add more sanity check into transform_tensor_descriptor

* tweak

* enabling 2 stage prefetch to existing gridwise gemm; tweak

* enabling 2 stage prefetch to existing gridwise gemm

* move gridwise gemm pipeline in class; clean up

* add some irregular tile size

* update CalculateHasMainK0BlockLoop for multi-stage-prefetch

* refactor gridwise gemm pipeline class

22d438ae

Unify Convolution FWD XDL 1D/2D implementation. (#93) · 756a7617

Adam Osewski authored Feb 23, 2022



* Convolution ND

* Code unification across dimensions for generating tensor descriptors.
* Example
* Instances

* Move convnd f32 instance file to comply with repo structure.

* Conv 1D tensor layouts.

* Formatting and use ReferenceConv

* Reference ConvFwd supporting 1D and 2D convolution.

* Debug printing TensorLayout name.

* Conv fwd 1D instance f32

* Refactor conv ND example.

Needed to support various conv dimensions

* Rename conv nd example directory to prevent conflicts.

* Refactor some common utility to single file.

Plus some tests.

* Refactor GetHostTensorDescriptor + UT.

* Add 1D test case.

* Test reference convolution 1d/2d

* Remove some leftovers.

* Fix convolution example error for 1D

* Refactor test check errors utility function.

* Test Conv2D Fwd XDL

* More UT for 1D case.

* Parameterize input & weight initializers.

* Rename example to prevent conflicts.

* Split convnd instance into separate files for 1d/2d

* Address review comments.

* Fix data type for flops/gbytes calculations.

* Assign example number 11.
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

756a7617

Conv3d new (#94) · 6dfb92bb

Jianfeng Yan authored Feb 22, 2022



* conv3d compiles but has memory error

* conv3d works

* fix performance issue by using __builtin_amdgcn_readfirstlane

* change MakeBlock2CTileMap to MakeDefaultBlock2CTileMap; change c_blockid_to* to cblockid_to*

* clang-format

* remove CK_EXPERIMENTAL_PASS_TENSOR_DESCRIPTOR_BY_*; moved wrapper into DeviceConv3d

* format

* remove useless macro

* add comment
Co-authored-by: Chao Liu <chao.liu2@amd.com>

6dfb92bb

21 Feb, 2022 1 commit

Gemm alpha beta profiler (fp32 & fp16) (#91) · 19c5d6e6

rocking5566 authored Feb 22, 2022



* [What] Refactor verification of gemm alpha_beta, move to reference operation
[Why] Sync with other verification

* Profile mk_nk for gemm bias 2d

* Support bias 2d with mn * kn in profiler

* Support bias 2d with km*kn and km*nk in profiler

* Support fp32 bias 2d in profiler

* format

* format
Co-authored-by: rocking <chunylai@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

19c5d6e6

19 Feb, 2022 1 commit

Initial Setup for CI (#86) · 2778e997

JD authored Feb 18, 2022



* add docker file and make default target buildable

* add Jenkinsfile

* remove empty env block

* fix package stage

* remove render group from docker run

* clean up Jenkins file

* add cppcheck as dev dependency

* update cmake file

* Add profiler build stage

* add hip_version config file for reduction operator

* correct jenkins var name

* Build release instead of debug

* clean up
Co-authored-by: Chao Liu <chao.liu2@amd.com>

2778e997

12 Feb, 2022 1 commit

NHWC conv 2d: fwd bfp16/int8, Device level tuning and host API (#73) · 880fbee9

ltqin authored Feb 12, 2022



* add fwd bf16 conv

* change tuning parameter

* add int8 for conv fwd

* remove comments

* change tuning parameter for int8

* change init int8 example

* add test for conv2d fwd

* change device operation file pos because merge develop

* fwd int8 use reference

* test_conv_fwd use reference

* add brackets for if statement

* rename fwd example name

* remove StaticBufferOfVectorTypeV2

* tweak example
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

880fbee9

11 Feb, 2022 4 commits

Add small tile size for fp16/fp32 and NN layout (#80) · 20a672d0

zjing14 authored Feb 11, 2022



* add DeviceGemmSplitKXdl

* add file device_gemm_splitk_xdl.hpp

* set c matrix zero

* using atomic

* add all tuning parameter to f32 mkkn

* grid size change to 720

* add tuning parameter for NT

* add tuning parameter for TN

* add tuning parameter for TT

* add m=96 tuning parameter

* add lost config

* debug

* fix sweep

* add failed tuning params

* fixed sweep logic

* clean

* add padding to M/N for irr tile size

* clean code

* add element wise operation

* fixed MPerBlock=96

* remove macro for split-k switch

* add test

* add new line at the end of device_gemm_xdl_instance.hpp

* remove step hack

* separate split-k instance files

* add tuning parameters

* change desired grid size to parameters

* remove slice length

* add desiredgridsize parameter to ckProfiler

* add missing file device_gemm_xdl_splitk_instance.hpp

* change desired grid size to kbatch

* format

* format

* clean up

* add selection of device_instances

* clean code

* clean code

* add small tile size in fp16 nn

* test for rocm 4.5

* merge develop

* clean

* clean

* clean

* remove unused code

* add padding switch to device_gemm_xdl

* add padding switch for ksplit fp32

* clean

* clean

* add files

* rename

* Update profiler.cpp

* format
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: ltqin <letao.qin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

20a672d0

Batched GEMM for fp16 (#79) · b53e9d08

zjing14 authored Feb 11, 2022

* prepare host for batched_gemm

* init commit of batched kernels

* fixed

* refine transform with freeze

* m/n padding

* fixed a bug; clean

* add small tiles

* clean

* clean code

* clean code

* add nt, tn, tt layout

* add missing file

* use StaticBufferTupleOfVector instead

* add reference_batched_gemm

* fixed a macro

b53e9d08

Support alpha beta scaling for GEMM (#78) · 6f928a08

rocking5566 authored Feb 11, 2022



* [What] Add 2d version of bias, prepare to implement alpha / beta scaling

* Add alpha / beta functor

* Refine parameter of example

* [What] Use real type instead of template
[Why] Prevent implicit cast

* Rename parameter for general operator

* Remove redundant comment

* Fix compile error
Co-authored-by: rocking <chunylai@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

6f928a08

fix build breaks (#81) · 904cbe2a

Anthony Chang authored Feb 11, 2022



- device_gemm_xdl_c_shuffle function signature matches split-k
- retire host_driver since it is no longer maintained
- linter error (unused variable)
Co-authored-by: Chao Liu <chao.liu2@amd.com>

904cbe2a

07 Feb, 2022 1 commit

GEMM+Bias+ReLU+Add (#76) · 823657ed

Chao Liu authored Feb 06, 2022

* tweak conv for odd C

* update script

* clean up elementwise op

* fix build

* clean up

* added example for gemm+bias+relu+add

* added example for gemm+bias+relu

* add profiler for gemm_s_shuffle; re-org files

* add profiler

* fix build

* clean up

* clean up

* clean up

* fix build

823657ed

04 Feb, 2022 1 commit

References for conv2d fwd bias relu and add (#75) · 690c75a7

ltqin authored Feb 04, 2022



* add reference

* clean up

* add reference for conv

* rename
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

690c75a7

03 Feb, 2022 2 commits

Replace LLVM intrinsics with clang builtins (#65) · 6d92959a

zjing14 authored Feb 02, 2022

* test mfma builtins

* add fp16 builtins

* add int8 builtins

* add bf16 builtins

* simplify host conv forward

* clean

* clean

6d92959a

add split-k GEMM (#59) · 4be7f019

ltqin authored Feb 03, 2022



* add DeviceGemmSplitKXdl

* add file device_gemm_splitk_xdl.hpp

* set c matrix zero

* using atomic

* add all tuning parameter to f32 mkkn

* grid size change to 720

* add tuning parameter for NT

* add tuning parameter for TN

* add tuning parameter for TT

* add m=96 tuning parameter

* add lost config

* add element wise operation

* fixed MPerBlock=96

* remove macro for split-k switch

* add test

* add new line at the end of device_gemm_xdl_instance.hpp

* remove step hack

* separate split-k instance files

* add tuning parameters

* change desired grid size to parameters

* remove slice length

* add desiredgridsize parameter to ckProfiler

* add missing file device_gemm_xdl_splitk_instance.hpp

* change desired grid size to kbatch

* format

* format

* clean up

* add selection of device_instances

* clean code

* fix build issue
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>

4be7f019
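
The bullets in the split-k commit above ("set c matrix zero", "using atomic", "change desired grid size to kbatch") suggest the usual split-K scheme: zero the output, cut the K dimension into `kbatch` slices, and let each slice add its partial product into C. Below is a minimal host-side sketch of that idea in plain C++. The function name `split_k_gemm_sketch`, its parameters, and the row-major layout are assumptions for illustration only; this is not the `DeviceGemmSplitKXdl` interface, and the sequential `+=` stands in for what would be an atomic add when slices run concurrently on a GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side sketch of split-K GEMM; not the DeviceGemmSplitKXdl API.
// A is row-major M x K, B is row-major K x N, the result C is row-major M x N.
std::vector<float> split_k_gemm_sketch(const std::vector<float>& a,
                                       const std::vector<float>& b,
                                       std::size_t M,
                                       std::size_t N,
                                       std::size_t K,
                                       std::size_t k_batch)
{
    assert(k_batch > 0);
    assert(a.size() == M * K && b.size() == K * N);

    // C must start at zero because every K slice accumulates into it.
    std::vector<float> c(M * N, 0.0f);

    // Ceiling division; trailing slices may be shorter or even empty.
    const std::size_t k_per_batch = (K + k_batch - 1) / k_batch;

    for(std::size_t kb = 0; kb < k_batch; ++kb)
    {
        const std::size_t k_begin = std::min(K, kb * k_per_batch);
        const std::size_t k_end   = std::min(K, k_begin + k_per_batch);

        for(std::size_t m = 0; m < M; ++m)
        {
            for(std::size_t n = 0; n < N; ++n)
            {
                float partial = 0.0f;
                for(std::size_t k = k_begin; k < k_end; ++k)
                {
                    partial += a[m * K + k] * b[k * N + n];
                }
                // Sequential here; concurrent slices would need an atomic add.
                c[m * N + n] += partial;
            }
        }
    }
    return c;
}
```

Because floating-point addition is not associative, results for different `k_batch` values can differ in the last bits for general inputs; they agree exactly only when every partial sum is exactly representable.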

25 Jan, 2022 1 commit

Do not hardcode the function parameter, use template instead. (#72) · ca47a6cf

rocking5566 authored Jan 25, 2022

* Do not hardcode the function parameter, use template instead.

* [What] Remove AThreadTransferSrcResetCoordinateAfterRun and BThreadTransferSrcResetCoordinateAfterRun in host API
[Why] "C_Shuffle" version is supposed to be similar to the vanilla one

* Fix typo
Let DeviceGemmXdl_C_Shuffle use kernel_gemm_xdlops_v3r1

ca47a6cf

21 Jan, 2022 1 commit

Add gemm_shuffle host api (#71) · 4d40b197

rocking5566 authored Jan 21, 2022

* [What]
1. Add DeviceGemmXdl_C_Shuffle
2. Revise example of gemm_xdl
[Why] Prepare to add shuffle version of D = alpha * (A * B) + beta * C
[How] Imitate DeviceGemmXdl and device_conv2d_fwd_xdl_c_shuffle_nhwc_kyxc_nhwk.hpp

4d40b197
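
The formula named in the commit above, D = alpha * (A * B) + beta * C, can be written as a plain host-side reference. This is only a sketch under stated assumptions: the function name `reference_gemm_alpha_beta`, the row-major layout, and the `float` element type are illustrative choices, not the `DeviceGemmXdl_C_Shuffle` interface or the repository's reference operation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference for D = alpha * (A * B) + beta * C.
// A is row-major M x K, B is row-major K x N, C and D are row-major M x N.
std::vector<float> reference_gemm_alpha_beta(const std::vector<float>& a,
                                             const std::vector<float>& b,
                                             const std::vector<float>& c,
                                             std::size_t M,
                                             std::size_t N,
                                             std::size_t K,
                                             float alpha,
                                             float beta)
{
    assert(a.size() == M * K && b.size() == K * N && c.size() == M * N);

    std::vector<float> d(M * N, 0.0f);
    for(std::size_t m = 0; m < M; ++m)
    {
        for(std::size_t n = 0; n < N; ++n)
        {
            // Accumulate one element of A * B.
            float acc = 0.0f;
            for(std::size_t k = 0; k < K; ++k)
            {
                acc += a[m * K + k] * b[k * N + n];
            }
            // Scale the product and blend in the existing C element.
            d[m * N + n] = alpha * acc + beta * c[m * N + n];
        }
    }
    return d;
}
```

With `alpha = 1` and `beta = 0` this reduces to an ordinary GEMM, which makes it easy to check against an existing reference.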

18 Jan, 2022 1 commit

Fix building issue for examples (#66) · 6260ced2

Chao Liu authored Jan 17, 2022

* fix build issue

6260ced2

26 Dec, 2021 1 commit

Fusion Conv+Bias+ReLU(+Add) (#62) · acbd7bd7

Chao Liu authored Dec 26, 2021

* fix relu

* clean up

* clean up

* adding 1x1 conv

* adding 1x1 conv

* added 1x1 conv

* refactor

* refactor

* refactor

* added profiler for conv+bias+relu+add

* clean up

* adding conv+bias+relu

* adding conv+bias+relu

* added conv+bias+relu

* Update README.md

* update cpu verification

* adding c shuffle

* update static_tensor for dealing with invalid element

* adding c shuffle

* debugging

* fix bug

* convert to fp16 before shuffle

* shuffle more than one M/NRepeat

* clean up

* remove coordinate step hack from GridwiseGemm_k0mk1_k0nk1_mn_xdlops_v3r1

* clean up

* remove coordinate step hack from all gridwise gemm xdl

* clean up coordinate step hack

* clean up coordinate step hack

* ThreadwiseTensorSliceTransfer_v3r2 support pointwise op on both src and dst

* adding output shuffle in conv+bias+relu+add

* update

* added conv+bias+relu+add with c shuffle

* added conv+bias+relu+add with c shuffle

* fix forward_sweep bugs in threadwise copy

* clean up

* refactor

* clean up

* clean up

* added conv_c_shuffle+bias_relu

* clean up

* added conv+bias+relu+atomic_add

* clean up

* clean up

* clean up

* clean up

* clean up

* clean up

* misc fixes; add 1x1 specialization

* clean up

* delete unused device op

* clean up

* add support for odd C value

acbd7bd7

13 Dec, 2021 1 commit

manually apply bug fix changes in pr #63 (#64) · a4f24233

Chao Liu authored Dec 12, 2021

* Bug in BlockwiseGemmXdlops_k0mk1_k0nk1_m0n0m1n1m2m3m4n2_v1::MakeCGridDescriptor_M0_N0_M1_N1_M2_M3_M4_N2()
* Bug in ThreadwiseTensorSliceTransfer_v1r3 logic for calculating "forward_sweep"

a4f24233

04 Dec, 2021 1 commit

fix ReLU formula (#61) · fd3d907a

Chao Liu authored Dec 04, 2021

* fix relu

* clean up

* clean up

fd3d907a

03 Dec, 2021 1 commit

GEMM/Conv+BiasAdd+ReLU+Add (#55) · 41cdd380

Chao Liu authored Dec 02, 2021

* gemm+activation

* move C pointwise operation into threadwise copy

* add pointwise operation to A/B matrix

* update ckProfiler

* adding bias add

* adding bias add

* adding bias add

* added bias add; worked around compiler issues

* clean up

* clean up

* Update README.md

* Update README.md

* Update README.md

* clean up

* add conv_xdl example

* adding conv_xdl_bias_relu_add example

* add conv+bias+relu+add, but has register spill issue

* tweak

* tweak

* refactor

* Update README.md

update readme for example/2_gemm_xdl_bias_relu_add

* clean up

* Update README.md

update readme for example/3_conv_xdl

* Update README.md

41cdd380

02 Dec, 2021 3 commits

renaming/comments · d7a0a3f9

Jing Zhang authored Dec 02, 2021

d7a0a3f9

add static_buffer_v2 zero out · 2cbb8976

Jing Zhang authored Dec 02, 2021

2cbb8976

fixed c_buffer alloc · d798c9b8

Jing Zhang authored Dec 02, 2021

d798c9b8

30 Nov, 2021 2 commits

fix layout naming convention (#56) · 4041850f

Chao Liu authored Nov 30, 2021

4041850f

added test for magic number division (#58) · 237d4ca0

Chao Liu authored Nov 30, 2021

237d4ca0

24 Nov, 2021 1 commit

add args for packed gemm (#54) · 567f5e9c

zjing14 authored Nov 24, 2021

567f5e9c

18 Nov, 2021 3 commits

Use __builtin_memcpy to implement bit_cast and for accessing vector from pointer of scalars (#53) · 64350aff

Chao Liu authored Nov 18, 2021

* reworking vector_type

* use __builtin_memcpy for bit_cast and vector access of scalar pointer

* clean up

64350aff
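
The technique named in the commit above, implementing `bit_cast` with `__builtin_memcpy`, can be sketched as follows. Copying the object representation through `memcpy` avoids the undefined behaviour of casting between unrelated pointer types, and compilers typically lower the fixed-size copy to a register move. The name `bit_cast_sketch` and the `static_assert` checks are assumptions for illustration; the repository's actual `bit_cast` may differ. `__builtin_memcpy` is a GCC/Clang builtin, so this sketch assumes one of those compilers.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical sketch of a memcpy-based bit_cast; not the library's own definition.
template <typename To, typename From>
To bit_cast_sketch(const From& src)
{
    static_assert(sizeof(To) == sizeof(From), "source and destination sizes must match");
    static_assert(std::is_trivially_copyable<To>::value, "To must be trivially copyable");
    static_assert(std::is_trivially_copyable<From>::value, "From must be trivially copyable");

    To dst;
    // Copy the raw bytes of src into dst without reinterpreting pointers.
    __builtin_memcpy(&dst, &src, sizeof(To));
    return dst;
}
```

On a platform with IEEE 754 single-precision floats, `bit_cast_sketch<std::uint32_t>(1.0f)` yields `0x3f800000`.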

v5r1 fusion kernels for inference (#49) · 970fa3e9

zjing14 authored Nov 18, 2021



* init

* refactor for 1x1

* rename e0_e1

* add e1 with bugs

* debug

* fixed

* fixed e1

* add timer

* improve threadwise gemm with dot2

* add e2

* tuning

* separate c2

* add nhwc

* restore nchwc

* clean

* opt

* fixed; tuning

* add BGlobalMoveSliceWindowStepHacks{}

* tuning

* repeat running

* adjust

* merge v5r1 nchwc

* add adaptors

* split k0 k1 in c_thread_grid

* split h and w

* remove v5r1 nhwc

* clean for pr

* remove host_conv_add

* clean code

* clean

* add dynamic support

* static mode

* test static

* add conv+add fusion

* fixed validation

* naming fix

* use activ_enum

* make static

* refactor conv_add for InMem::add

* add bias

* add conv_out

* add configurable makeddesc

* add maxpool fusion

* add maxpool host for validation

* enable static desc

* conv-only use v5r1_add

* test

* test

* for binary dumps

* fixed incorrect results due to typo

* clean

* debugging maxpool

* workaround with offset trick

* clean code

* modularize ops of fusion

* add gridwise_gemm_v3

* create separate fusion function

* enable dynamic mode of conv and conv+resize_add

* add dynamic mode of maxpool

* add pass by pointer

* add activ_type as arguments

* merge develop

* clean

* reset config to old default
Co-authored-by: Chao Liu <chao.liu2@amd.com>

970fa3e9

Fixed bfp16 host_conv_fwd (#52) · a651ea4f

zjing14 authored Nov 18, 2021



* fixed bfloat16 issues

* refactor type_convert

* fixed host_convolution_forward for ushort
Co-authored-by: Chao Liu <chao.liu2@amd.com>

a651ea4f