Commits · a79b95ebe84cb8d179920c44cf27fa97a1cee637 · gaoqiong / composable_kernel

05 Mar, 2022 4 commits

Merge remote-tracking branch 'origin/develop' into int8_qunatization_gemm_xdl · a79b95eb
Chao Liu authored Mar 05, 2022

a79b95eb

Example for conv2d backward weight fp16 (#106) · 7a9b93f4

ltqin authored Mar 05, 2022



* add wrw reference

* start device

* raw not split version

* run simple example

* start to use atomic add

* simple transform result correct

* first version that can run

* fix atomic and set operator choice

* add check split-k

* format

* change input parameter

* add pad for t total

* rename example index
Co-authored-by: ltqin <letaoqin@amd.com>

7a9b93f4

[What] gemm + relu inference · 487d9868
rocking authored Mar 05, 2022

[How] gemm + requant + relu + requant + clamp

487d9868
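
The "[How]" line above names a chain of stages: gemm, requant, relu, requant, clamp. A minimal host-side sketch of such a chain follows. It assumes that "requant" means scaling an int32 accumulator by a float factor and rounding to nearest; the commit message does not say so. The names `requant`, `gemm_requant_relu_clamp`, `scale_gemm` and `scale_relu` are hypothetical and do not come from the repository.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: scale an int32 value by a float factor, round to nearest.
inline std::int32_t requant(std::int32_t value, float scale)
{
    return static_cast<std::int32_t>(std::lround(static_cast<float>(value) * scale));
}

// Row-major A (M x K), B (K x N) and result C (M x N), all int8.
std::vector<std::int8_t> gemm_requant_relu_clamp(const std::vector<std::int8_t>& a,
                                                 const std::vector<std::int8_t>& b,
                                                 int M, int N, int K,
                                                 float scale_gemm, float scale_relu)
{
    assert(a.size() == static_cast<std::size_t>(M) * K);
    assert(b.size() == static_cast<std::size_t>(K) * N);

    std::vector<std::int8_t> c(static_cast<std::size_t>(M) * N);
    for(int m = 0; m < M; ++m)
    {
        for(int n = 0; n < N; ++n)
        {
            // gemm: accumulate in int32 so int8 products cannot overflow
            std::int32_t acc = 0;
            for(int k = 0; k < K; ++k)
                acc += static_cast<std::int32_t>(a[m * K + k]) * b[k * N + n];

            std::int32_t v = requant(acc, scale_gemm); // requant
            v = std::max<std::int32_t>(v, 0);          // relu
            v = requant(v, scale_relu);                // requant
            v = std::min<std::int32_t>(std::max<std::int32_t>(v, -128), 127); // clamp to int8
            c[m * N + n] = static_cast<std::int8_t>(v);
        }
    }
    return c;
}
```

The final clamp keeps the result within the int8 range before the narrowing cast; the device kernel in the commit may fuse or order these stages differently.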
Fix build error · 929f72ab
rocking authored Mar 05, 2022

929f72ab

04 Mar, 2022 7 commits

[Bf16 & int8] [example & ckprofiler] (#100) · 7e9a9d32

rocking5566 authored Mar 05, 2022



* Add int8 of mk_nk_mn to the ckProfiler

* Add example of int8 gemm

* Fix typo, use ushort instead of half_t for bfloat16

* replace ushortXXX_t to bhalfXXX_t

* rename ushort to bhalf_t

* Add bf16 example

* Add bf16 gemm to ckProfiler

* Fix alignment

* Fix typo

* Add unit test for gemm_xdl int8

* Add gemm_xdl fp32 unit test

* Add gemm_xdl bf16 unit test

* fix build

* fix build issue due to merge conflict

* Fix build

* Fix build error
Co-authored-by: rocking <chunylai@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

7e9a9d32
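
The bullets above mention storing bfloat16 in a 16-bit unsigned type (`ushort`, later renamed `bhalf_t`). As a rough illustration of why an unsigned 16-bit integer works as storage, the sketch below converts between `float` and bfloat16 by keeping the upper 16 bits of the IEEE-754 single-precision bit pattern. The alias and function names are assumptions for illustration, and truncation is only one possible rounding choice; the library's actual conversion may round differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Assumption: a 16-bit unsigned storage type, mirroring the alias named in the commit.
using bhalf_t = std::uint16_t;

static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");

// float -> bf16 by truncation: keep sign, exponent and the top 7 mantissa bits.
inline bhalf_t float_to_bf16_truncate(float x)
{
    // Truncating a NaN whose payload sits only in the low 16 bits would yield infinity.
    assert(!std::isnan(x));
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    return static_cast<bhalf_t>(bits >> 16);
}

// bf16 -> float: the stored bits become the upper half, the lower half is zero.
inline float bf16_to_float(bhalf_t h)
{
    const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float x;
    std::memcpy(&x, &bits, sizeof(x));
    return x;
}
```

Values whose low 16 mantissa bits are zero, such as 1.0f or 0.5f, round-trip exactly; other values lose precision.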

fix type in PR #101 (#107) · 0c79af12
Chao Liu authored Mar 04, 2022

0c79af12
fix build issue due to merge conflict · 0d55111e
Chao Liu authored Mar 04, 2022

0d55111e
Merge remote-tracking branch 'origin/develop' into bf16_int8_ckprofiler · c2b3bede
Chao Liu authored Mar 04, 2022

c2b3bede
Fix build · ef2defdc
rocking authored Mar 05, 2022

ef2defdc

Refactor threadwise copy using sfcurve (#101) · 0619ebf7

Jianfeng Yan authored Mar 04, 2022



* add space_filling_curve

* cleanup and move space_filling_curve into test

* WIP: start refactoring threadwise_transfer_v1r3

* threadwise_copy works but needs further refactoring

* add some comments

* add SpaceFillingCurve::GetIndices()

* minor changes

* removed GetIndices; refactored GetDstCoordinateResetStep

* add DynamicBuffer::Transfer, but Add is not tested

* rebased against develop

* threadwise_copy_v6r1/v6r2/v6r3 using space-filling curve start to work

* minor changes

* refactored threadcopy v3r1, v2; removed old implementations

* clang-format

* cleanup

* fix a typo in v6r3

* format
Co-authored-by: Chao Liu <chao.liu2@amd.com>

0619ebf7

NHWC conv 2d: bwd fp32/fp16/bfp16/int8, Device level tuning and host API (#92) · c254e5ab

ltqin authored Mar 04, 2022



* start conv2d bwd api

* kernel running

* add bwd reference

* change to no shuffle

* fix bwd reference

* pass verification

* add Filter1x1Stride1Pad0 and start testing

* change some tuning parameter

* fix test error

* add fp16 tuning parameter

* add bf16 tuning parameter

* add int8 tuning parameters

* change fp32 tuning parameter

* add bwd to profiler

* fix bug for bwd profiler

* fix ckProfiler bug

* change conv2d_bwd_xdl to fp16

* fix bug in comments

* fix precompile id

* fix enum conv name

* change _bwd_ to _bwd_data_

* change conv2d_bwd example id

* bwd to bwd data

* fix prehead

* fix MakeDefaultBlock2CTileMap, import from merge develop

* format bwd instance

* bwd to bwd data

* change name bwd to bwd data

* change name bwd to bwd data in example

* format code

* change conv2d bwd data id in example

* rewrite readme for example

* fix CalculateMagicNumbers about div zero

* add workaround CK_WORKAROUND_SWDEV_325164

* change test_conf2d_bwd_data show info

* format

* fix bug for workaround:CK_WORKAROUND_SWDEV_325164

* format tuning parameters

* format tuning parameters again

* format tuning parameters 3

* format tuning parameters 4

* remove add function template

* format

* update comment
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

c254e5ab

03 Mar, 2022 3 commits

fix build · 820c25cb
Chao Liu authored Mar 03, 2022

820c25cb
Merge remote-tracking branch 'origin/develop' into bf16_int8_ckprofiler · 892cb743
Chao Liu authored Mar 03, 2022

892cb743

Update test CMakeLists to add new tests automatically and add Jenkins stage for tests (#88) · 992f71e3

JD authored Mar 03, 2022



* add docker file and make default target buildable

* add Jenkinsfile

* remove empty env block

* fix package stage

* remove render group from docker run

* clean up Jenkins file

* add cppcheck as dev dependency

* update cmake file

* Add profiler build stage

* add hip_version config file for reduction operator

* correct jenkins var name

* Build release instead of debug

* Update test CMakeLists.txt
reorg test dir
add test stage

* reduce compile threads to prevent compiler crash

* add optional debug stage, update second test

* remove old test target

* fix tests to return proper results and self review

* Fix package name and make test run without args

* change Dockerfile to use rocm4.3.1

* remove parallelism from build

* Lower parallelism
Co-authored-by: Chao Liu <chao.liu2@amd.com>

992f71e3

02 Mar, 2022 3 commits
- Add gemm_xdl bf16 unit test · ba132d28
  rocking authored Mar 02, 2022
  
  ba132d28
- Add gemm_xdl fp32 unit test · bdba8d6f
  rocking authored Mar 02, 2022
  
  bdba8d6f
- Add unit test for gemm_xdl int8 · d1628241
  rocking authored Mar 02, 2022
  
  d1628241
01 Mar, 2022 2 commits
- Fix typo · bfb3f413
  rocking authored Mar 02, 2022
  
  bfb3f413
- Fix alignment · 6c3e2cdd
  rocking authored Mar 02, 2022
  
  6c3e2cdd
28 Feb, 2022 1 commit

Allow distinct K0/K1 values for A/B block descriptor (#98) · 6d4450ef

Anthony Chang authored Feb 28, 2022



* add gitignore

* host tensor: allow generating sequentially increasing value in a given dimension

* gridwise gemm v3r1: allow distinct K0/K1 values for A/B block descriptor

- remove dangling header include
- modify example gemm_xdl accordingly
- infer KPack value from M/NPerXdl
- device conv2d fwd: update parameters accordingly for the underlying gridwise gemm v3r1
(API for conv2d fwd stays the same for now until we decide to expose individual K0s for activation and weight)

* add LDS data dump utility

* profiler: reflect API change for distinct K0/K1 for A/B matrices

* profiler: add conflict-free LDS write FP16 kernel instances

* fix accidental perf regression

* address feedback; cosmetic changes

* clang-format for new files

* format
Co-authored-by: Chao Liu <chao.liu2@amd.com>

6d4450ef

27 Feb, 2022 1 commit
- Add bf16 gemm to ckProfiler · 245a9e0e
  rocking authored Feb 27, 2022
  
  245a9e0e
26 Feb, 2022 2 commits
- Add bf16 example · e0d22b24
  rocking authored Feb 26, 2022
  
  e0d22b24
- rename ushort to bhalf_t · a13bf453
  rocking authored Feb 26, 2022
  
  a13bf453
25 Feb, 2022 6 commits
- replace ushortXXX_t to bhalfXXX_t · 010ef9dc
  rocking authored Feb 26, 2022
  
  010ef9dc
- Fix typo, use ushort instead of half_t for bfloat16 · 63e10e34
  rocking authored Feb 26, 2022
  
  63e10e34
- Add example of int8 gemm · c50e3de5
  rocking authored Feb 26, 2022
  
  c50e3de5
- Add int8 of mk_nk_mn to the ckProfiler · f6138c40
  rocking authored Feb 26, 2022
  
  f6138c40
- Split k f16 (#97) · e221d11e
  zjing14 authored Feb 25, 2022

  * init for splitk f16

  * a working prototype

  * debug

  * perf debug

  * update example

  * instances for mk kn

  * add instances for all layers

  * clean

  * clean

  * add tuning

  * format

  * add mn_padding into irregular tile

  * clean
  Co-authored-by: Chao Liu <chao.liu2@amd.com>

  e221d11e
- Space filling curve (#96) · bdedf64b
  Jianfeng Yan authored Feb 24, 2022

  * add space_filling_curve

  * cleanup and move space_filling_curve into test

  * add functions for backward and forward step; hard coded results in unit test

  * minor changes

  bdedf64b
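
The space-filling-curve commit above mentions functions for forward and backward steps. One simple curve with the useful property that consecutive positions are adjacent is a snake (boustrophedon) order over a 2D index space. The sketch below shows that ordering only as an illustration. Whether the commit uses this exact ordering is an assumption, and `snake_order` and `forward_step` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Visit a rows x cols index space in snake order:
// even rows run left to right, odd rows run right to left.
std::vector<std::array<int, 2>> snake_order(int rows, int cols)
{
    assert(rows >= 0 && cols >= 0);
    std::vector<std::array<int, 2>> out;
    out.reserve(static_cast<std::size_t>(rows) * cols);
    for(int r = 0; r < rows; ++r)
    {
        for(int i = 0; i < cols; ++i)
        {
            const int c = (r % 2 == 0) ? i : cols - 1 - i;
            out.push_back(std::array<int, 2>{r, c});
        }
    }
    return out;
}

// Index delta needed to move from position i to position i + 1 on the curve.
std::array<int, 2> forward_step(const std::vector<std::array<int, 2>>& curve, std::size_t i)
{
    assert(i + 1 < curve.size());
    return std::array<int, 2>{curve[i + 1][0] - curve[i][0],
                              curve[i + 1][1] - curve[i][1]};
}
```

Because every forward step changes exactly one coordinate by one, a copy routine walking this curve can update its coordinate incrementally instead of recomputing it from scratch; a backward step is the negated forward step of the previous position.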
23 Feb, 2022 3 commits

Add gridwise GEMM pipeline (#89) · 22d438ae

Chao Liu authored Feb 23, 2022

* clean up

* add multiple thread scratch to ThreadwiseTensorSliceTransfer_v3r1

* add 2 stage prefetch

* add more sanity check into transform_tensor_descriptor

* tweak

* enabling 2 stage prefetch to existing gridwise gemm; tweak

* enabling 2 stage prefetch to existing gridwise gemm

* move gridwise gemm pipeline in class; clean up

* add some irregular tile size

* update CalculateHasMainK0BlockLoop for multi-stage-prefetch

* refactor gridwise gemm pipeline class

22d438ae

Unify Convolution FWD XDL 1D/2D implementation. (#93) · 756a7617

Adam Osewski authored Feb 23, 2022



* Convolution ND

* Code unification across dimensions for generating tensor descriptors.
* Example
* Instances

* Move convnd f32 instance file to comply with repo structure.

* Conv 1D tensor layouts.

* Formatting and use ReferenceConv

* Reference ConvFwd supporting 1D and 2D convolution.

* Debug printing TensorLayout name.

* Conv fwd 1D instance f32

* Refactor conv ND example.

Needed to support various conv dimensions

* Rename conv nd example directory to prevent conflicts.

* Refactor some common utility to single file.

Plus some tests.

* Refactor GetHostTensorDescriptor + UT.

* Add 1D test case.

* Test reference convolution 1d/2d

* Remove some leftovers.

* Fix convolution example error for 1D

* Refactor test check errors utility function.

* Test Conv2D Fwd XDL

* More UT for 1D case.

* Parameterize input & weight initializers.

* Rename example to prevent conflicts.

* Split convnd instance into separate files for 1d/2d

* Address review comments.

* Fix data type for flops/gbytes calculations.

* Assign example number 11.
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

756a7617

Conv3d new (#94) · 6dfb92bb

Jianfeng Yan authored Feb 22, 2022



* conv3d compiles but has memory error

* conv3d works

* fix performance issue by using __builtin_amdgcn_readfirstlane

* change MakeBlock2CTileMap to MakeDefaultBlock2CTileMap; change c_blockid_to* to cblockid_to*

* clang-format

* remove CK_EXPERIMENTAL_PASS_TENSOR_DESCRIPTOR_BY_*; moved wrapper into DeviceConv3d

* format

* remove useless macro

* add comment
Co-authored-by: Chao Liu <chao.liu2@amd.com>

6dfb92bb

21 Feb, 2022 1 commit

Gemm alpha beta profiler (fp32 & fp16) (#91) · 19c5d6e6

rocking5566 authored Feb 22, 2022



* [What] Refactor verification of gemm alpha_beta, move to reference operation
[Why] Sync with other verification

* Profile mk_nk for gemm bias 2d

* Support bias 2d with mn * kn in profiler

* Support bias 2d with km*kn and km*nk in profiler

* Support fp32 bias 2d in profiler

* format

* format
Co-authored-by: rocking <chunylai@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

19c5d6e6

19 Feb, 2022 1 commit

Initial Setup for CI (#86) · 2778e997

JD authored Feb 18, 2022



* add docker file and make default target buildable

* add Jenkinsfile

* remove empty env block

* fix package stage

* remove render group from docker run

* clean up Jenkins file

* add cppcheck as dev dependency

* update cmake file

* Add profiler build stage

* add hip_version config file for reduction operator

* correct jenkins var name

* Build release instead of debug

* clean up
Co-authored-by: Chao Liu <chao.liu2@amd.com>

2778e997

12 Feb, 2022 1 commit

NHWC conv 2d: fwd bfp16/int8, Device level tuning and host API (#73) · 880fbee9

ltqin authored Feb 12, 2022



* add fwd bf16 conv

* change tuning parameter

* add int8 for conv fwd

* remove comments

* change tuning parameter for int8

* change init int8 example

* add test for conv2d fwd

* change device operation file pos because merge develop

* fwd int8 use reference

* test_conv_fwd use reference

* add bracket for if statement

* rename fwd example name

* remove StaticBufferOfVectorTypeV2

* tweak example
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

880fbee9

11 Feb, 2022 4 commits

Add small tile size for fp16/fp32 and NN layout (#80) · 20a672d0

zjing14 authored Feb 11, 2022



* add DeviceGemmSplitKXdl

* add file device_gemm_splitk_xdl.hpp

* set c matrix zero

* using atomic

* add all tuning parameter to f32 mkkn

* grid size change to 720

* add tuning parameter for NT

* add tuning parameter for TN

* add tuning parameter for TT

* add m=96 tuning parameter

* add lost config

* debug

* fix sweep

* add failed tuning params

* fixed sweep logic

* clean

* add padding to M/N for irr tile size

* clean code

* add element wise operation

* fixed MPerBlock=96

* remove macro for split-k switch

* add test

* add new line at the end of device_gemm_xdl_instance.hpp

* remove step hack

* separate split-k instance files

* add tuning parameters

* change desired grid size to parameters

* remove slice length

* add desiredgridsize parameter to ckProfiler

* add missing file device_gemm_xdl_splitk_instance.hpp

* change desired grid size to kbatch

* format

* format

* clean up

* add selection of device_instances

* clean code

* clean code

* add small tile size in fp16 nn

* test for rocm 4.5

* merge develop

* clean

* clean

* clean

* remove no-use code

* add padding switch to device_gemm_xdl

* add padding switch for ksplit fp32

* clean

* clean

* add files

* rename

* Update profiler.cpp

* format
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: ltqin <letao.qin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

20a672d0
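
Several bullets above ("set c matrix zero", "using atomic", "change desired grid size to kbatch") outline a split-K GEMM: the K dimension is cut into `kbatch` slices, each slice produces a partial product, and the partials are summed into a zero-initialized C. The host-side sketch below shows only that decomposition. It runs the slices sequentially, so a plain `+=` stands in for the atomic add a device kernel would need; the function name and signature are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major A (M x K), B (K x N); returns row-major C (M x N).
std::vector<float> gemm_splitk(const std::vector<float>& a,
                               const std::vector<float>& b,
                               int M, int N, int K, int kbatch)
{
    assert(kbatch > 0);
    assert(a.size() == static_cast<std::size_t>(M) * K);
    assert(b.size() == static_cast<std::size_t>(K) * N);

    // C must start at zero because every K slice accumulates into it.
    std::vector<float> c(static_cast<std::size_t>(M) * N, 0.0f);

    const int k_per_batch = (K + kbatch - 1) / kbatch; // ceiling division
    for(int kb = 0; kb < kbatch; ++kb)
    {
        const int k_begin = kb * k_per_batch;
        const int k_end   = std::min(K, k_begin + k_per_batch);
        for(int m = 0; m < M; ++m)
        {
            for(int n = 0; n < N; ++n)
            {
                float partial = 0.0f;
                for(int k = k_begin; k < k_end; ++k)
                    partial += a[m * K + k] * b[k * N + n];
                // On a device, slices run concurrently and this would be an atomic add.
                c[m * N + n] += partial;
            }
        }
    }
    return c;
}
```

Splitting K mainly helps when M and N are too small to fill the device with output tiles. Since floating-point addition is not associative, results can differ slightly between `kbatch` values for general inputs.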

Batched GEMM for fp16 (#79) · b53e9d08

zjing14 authored Feb 11, 2022

* prepare host for batched_gemm

* init commit of batched kernels

* fixed

* refine transform with freeze

* m/n padding

* fixed a bug; clean

* add small tiles

* clean

* clean code

* clean code

* add nt, tn, tt layout

* add missing file

* use StaticBufferTupleOfVector instead

* add reference_batched_gemm

* fixed a macro

b53e9d08

Support alpha beta scaling for GEMM (#78) · 6f928a08

rocking5566 authored Feb 11, 2022



* [What] Add 2d version of bias, prepare to implement alpha / beta scaling

* Add alpha / beta functor

* Refine parameter of example

* [What] Use real type instead of template
[Why] Prevent implicit cast

* Rename parameter for general operator

* Remove redundant comment

* Fix compile error
Co-authored-by: rocking <chunylai@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

6f928a08
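
For orientation, alpha/beta scaling in the usual BLAS sense computes `C = alpha * (A x B) + beta * C0`, where `C0` is an existing M x N matrix (here playing the role of the "2d version of bias" mentioned above). The reference sketch below assumes that convention; the commit message does not spell out the formula, and `gemm_alpha_beta` with this signature is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major A (M x K), B (K x N), C0 (M x N); returns alpha * A*B + beta * C0.
std::vector<float> gemm_alpha_beta(const std::vector<float>& a,
                                   const std::vector<float>& b,
                                   const std::vector<float>& c0,
                                   int M, int N, int K,
                                   float alpha, float beta)
{
    assert(a.size() == static_cast<std::size_t>(M) * K);
    assert(b.size() == static_cast<std::size_t>(K) * N);
    assert(c0.size() == static_cast<std::size_t>(M) * N);

    std::vector<float> c(static_cast<std::size_t>(M) * N);
    for(int m = 0; m < M; ++m)
    {
        for(int n = 0; n < N; ++n)
        {
            float acc = 0.0f;
            for(int k = 0; k < K; ++k)
                acc += a[m * K + k] * b[k * N + n];
            // Scale the product and the existing matrix separately, then add.
            c[m * N + n] = alpha * acc + beta * c0[m * N + n];
        }
    }
    return c;
}
```

Taking `alpha` and `beta` as concrete `float` parameters rather than template types follows the "[What] Use real type instead of template" bullet, which cites avoiding implicit casts as the reason.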

fix build breaks (#81) · 904cbe2a

Anthony Chang authored Feb 11, 2022



- device_gemm_xdl_c_shuffle function signature matches split-k
- retire host_driver since it is no longer maintained
- linter error (unused variable)
Co-authored-by: Chao Liu <chao.liu2@amd.com>

904cbe2a

07 Feb, 2022 1 commit

GEMM+Bias+ReLU+Add (#76) · 823657ed

Chao Liu authored Feb 06, 2022

* tweak conv for odd C

* update script

* clean up elementwise op

* fix build

* clean up

* added example for gemm+bias+relu+add

* added example for gemm+bias+relu

* add profiler for gemm_s_shuffle; re-org files

* add profiler

* fix build

* clean up

* clean up

* clean up

* fix build

823657ed