Commits · 000eefbf6ebf3c35363b05aa1b00262577b12eaa · gaoqiong / composable_kernel

13 Aug, 2022 1 commit

Anthony Chang authored Aug 13, 2022



* initial stub for gemm_gemm_xdl_cshuffle

* set up example code

* compiles

* prevent integer overflow

* harmonize interface between ref_gemm and ref_batched_gemm

* batched_gemm_gemm

* fix example

* host tensor gen: diagonal pattern in lowest two-dimensions only

* make c descriptors containing only integral constants

* clean up

* add BlockwiseGemmXdlops_v2 while exploring a unified approach

* implement proper interface

* tidy up example

* fix compilation warnings

* coarsely controlled 2nd gemm padding

* remove rocm-cmake's hard requirement for certain revision

* clang-format

* resolve merge conflict

* fix compilation error on gfx10

* adds acc0 elementwise op to interface

* attention host validation

* add blockwise softmax v1

* iteratively update softmax+gemm

* transpose both gemm0 and gemm1 xdl output so as to avoid broadcasting softmax max/sum

* add init method for easier debugging

* do away with manual thread cluster calculation

* generalize blockwise softmax interface

* row-wise softmax sum & max

* format

* rename to DeviceBatchedGemmSoftmaxGemm

* add gemm_softmax_gemm instances and tests

* comment
Co-authored-by: ltqin <letao.qin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

cac014f1
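
The "row-wise softmax sum & max" item above refers to the usual numerically stable softmax, where each row's maximum is subtracted before exponentiation and the row sum is used for normalization. The following is a minimal host-side sketch of that idea only. It is not the kernel added in this commit, and the function name `rowwise_softmax` and its signature are hypothetical, introduced here purely for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side helper (not part of composable_kernel).
// Applies softmax independently to each row of a row-major rows x cols matrix.
inline std::vector<float> rowwise_softmax(const std::vector<float>& x,
                                          std::size_t rows,
                                          std::size_t cols)
{
    assert(cols > 0 && x.size() == rows * cols);

    std::vector<float> y(x.size());
    for(std::size_t r = 0; r < rows; ++r)
    {
        const float* in = x.data() + r * cols;
        float* out      = y.data() + r * cols;

        // Row max: subtracted before exp() so large inputs do not overflow.
        const float row_max = *std::max_element(in, in + cols);

        // Exponentials and their row sum.
        float row_sum = 0.f;
        for(std::size_t c = 0; c < cols; ++c)
        {
            out[c] = std::exp(in[c] - row_max);
            row_sum += out[c];
        }

        // Normalize so that each row sums to one.
        for(std::size_t c = 0; c < cols; ++c)
        {
            out[c] /= row_sum;
        }
    }
    return y;
}
```

Because the row maximum is subtracted first, a row such as `{1000, 1000}` yields `{0.5, 0.5}` instead of overflowing in `exp()`.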

12 Aug, 2022 2 commits

Add example of conv_fwd_bias_relu_add for int4, int8, bfp16, fp16, and fp32 (#343) · 0c6ef7c1

Rostyslav Geyyer authored Aug 12, 2022



* [LWPCK-359] Initial commit

* Working version for fp16, add results to readme

* Update according to PR #341

* Update results in readme

* Add fp32 example

* Add bf16 example

* Update fp16 and fp32 examples

* Add int8 example

* Add separate lengths and strides tensors for D tensors
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>

0c6ef7c1

add g; fixed strides (#355) · 35e49f2d
zjing14 authored Aug 12, 2022

35e49f2d

11 Aug, 2022 17 commits

Add examples for GEMM + AddAddFastGelu (data type: int8, bf16, fp32) (#340) · 68b61504

Po Yen Chen authored Aug 12, 2022

* Add always_false<> util to delay symbol resolution

* Use always_false<> to prevent instantiating unwanted methods

* Add new specializations of AddAddFastGelu::operator() method

* Add GEMM + AddAddFastGelu examples for data types: int8, bf16, fp32

* Use floating point literal to simplify code

* Remove unnecessary capture in lambda expressions

* Extract fast GeLU calculation as standalone method

* Mark methods as 'constexpr'

* Add constraint for HostTensorDescriptor templated ctors

* Simplify HostTensorDescriptor ctor calls

* Add C++23 std::size_t literal suffix

* Use _uz suffix to shorten example code

* Remove unnecessary conversion to std::array<>

* Re-order include directives

* Remove C-style casting by literal suffix

* Remove unnecessary statements in main()

* Remove unused type parameter of always_false<>

* Remove unused include directive

* Exit main() by returning meaningful value

* Use 'if constexpr' to switch example flow

* Use std::is_same_v<> to shorten example code

* Add 'inline' specifier to literal functions

* Unify output methods in example

* Move common code into .inc file

* Add type check in type_convert<>()

* Add type_convert<float>() before computation

* Merge AddAddFastGelu method specializations

* Remove always_false<>

* Add constraint to AddAddFastGelu::operator() parameter types

68b61504
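
Several items in the commit above mention an `always_false<>` utility, `if constexpr`, and `std::is_same_v<>`. As a hedged illustration of how these are commonly combined in C++17, the sketch below rejects unsupported types at compile time only when the offending branch is actually instantiated. The definition of `always_false` and the function `bits_of` are assumptions made for this example; they are not copied from the repository, and the commit list shows that `always_false<>` was later removed.

```cpp
#include <cstdint>
#include <type_traits>

// Assumed definition: a type-dependent "false", so a static_assert that uses it
// is only evaluated when the enclosing template branch is instantiated.
template <typename T>
struct always_false : std::false_type
{
};

// Hypothetical example function: returns the bit width of a supported data type.
template <typename T>
constexpr int bits_of()
{
    if constexpr(std::is_same_v<T, float>)
    {
        return 32;
    }
    else if constexpr(std::is_same_v<T, std::int8_t>)
    {
        return 8;
    }
    else
    {
        // Only reached for unsupported T, where it produces a readable error.
        static_assert(always_false<T>::value, "unsupported data type");
    }
}
```

Writing `static_assert(false, ...)` directly in the last branch would be rejected by many compilers even when the branch is never taken, which is the usual motivation for the type-dependent helper.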

ckProfiler for layernorm (#330) · fdfd7eb5

rocking5566 authored Aug 12, 2022

* Refine parameter

* Add base class for layernorm

* Add layernorm instance

* Add layernorm to ckProfiler

* Remove redundant

* Add verification

* Fix compile error due to merge

fdfd7eb5

avoid LDS data hazard · b64a2860
Anthony Chang authored Aug 11, 2022

b64a2860
adds acc0 elementwise op to interface · 51fc99a8
Anthony Chang authored Aug 11, 2022

51fc99a8
fix compilation error on gfx10 · 8672733f
Anthony Chang authored Aug 10, 2022

8672733f
resolve merge conflict · 3c5a50f2
Anthony Chang authored Aug 08, 2022

3c5a50f2
clang-format · edc494df
Anthony Chang authored Aug 04, 2022

edc494df
coarsely controlled 2nd gemm padding · c9bef1c6
Anthony Chang authored Aug 10, 2022

c9bef1c6
fix compilation warnings · e55b67a0
Anthony Chang authored Aug 10, 2022

e55b67a0
implement proper interface · 5f94555b
Anthony Chang authored Aug 04, 2022

5f94555b
add BlockwiseGemmXdlops_v2 while exploring a unified approach · 98e4c0ce
Anthony Chang authored Aug 03, 2022

98e4c0ce
clean up · eceea10a
Anthony Chang authored Aug 03, 2022

eceea10a
make c descriptors containing only integral constants · 4ee34028
Anthony Chang authored Aug 03, 2022

4ee34028
batched_gemm_gemm · 408ba59b
Anthony Chang authored Jul 27, 2022

408ba59b
compiles · 047cee2b
Anthony Chang authored Jul 20, 2022

047cee2b
set up example code · 68b71534
Anthony Chang authored Jul 12, 2022

68b71534
initial stub for gemm_gemm_xdl_cshuffle · 89a5e847
Anthony Chang authored Jul 11, 2022

89a5e847

10 Aug, 2022 1 commit

Add batched/grouped_gemm contraction deviceOps (#349) · e08d68d2

zjing14 authored Aug 10, 2022



* convnd_fwd fp16 example

* update example

* update example

* update instance

* updating reference conv

* update reference conv

* update conv fwd profiler

* update conv 1d and 3d instance

* update include path

* clean

* update profiler for conv bwd data and weight

* update conv bwd weight

* clean

* update conv example

* update profiler for conv bwd weight

* update ckprofiler for conv bwd data

* fix reference conv bwd data bug; update conv bwd data test

* update examples

* fix initialization issue

* update test for conv fwd

* clean

* clean

* remove test case too sensitive to error threshold

* fix test

* clean

* fix build

* adding conv multiple d

* adding conv multiple D

* add matrix padder

* add gemm padding to convnd

* adding group conv

* update gemm multi-d

* refactor

* refactor

* refactor

* clean

* clean

* refactor

* refactor

* reorg

* add ds

* add bias

* clean

* add G

* adding group

* adding group

* adding group

* update Tensor

* clean

* update example

* update DeviceGemmMultipleD_Xdl_CShuffle

* update conv bwd-data and bwd-weight

* update contraction example

* update gemm and batch gemm with e permute

* fix example build

* instance for grouped conv1d

* update example

* adding group conv instance

* update gemm bilinear instance

* update gemm+add+add+fastgelu instance

* update profiler

* update profiler

* update test

* update test and client example

* clean

* add grouped conv into profiler

* update profiler

* clean

* add test grouped conv, update all conv test to gtest

* update test

* change gemm_c_permute with contraction

* add grouped_contraction

* add contraction in group_gemm

* add example of grouped_gemm with contraction

* add example of grouped_contraction_bias_e_permute

* clean

* fixed ds

* add m3n2 m2n3 examples into gemm_bias_e_permute
Co-authored-by: Chao Liu <chao.liu2@amd.com>

e08d68d2

03 Aug, 2022 1 commit

Update Group convolution (#341) · 75ab874e

Chao Liu authored Aug 03, 2022

* add conv oddC

* update example

* update example

* fix bug in example

* fix bug in group conv example

75ab874e

02 Aug, 2022 1 commit

CGEMM examples bf16, fp32, int8 (#332) · fb0dc358

Adam Osewski authored Aug 02, 2022



* Add int8 specialization for elementwise Add and Subtract.

* CGEMM examples bf16, fp32, int8

* Add convert reference output to CDataType.

* Skip BF16 data type during testing.

* Lower K value to get rid of accumulation error.

* Fix merge artifact.

* Fix changed function name: GetElementSpaceSize()

* Fix merge artifact.
Co-authored-by: Adam Osewski <aosewski@amd.com>

fb0dc358

29 Jul, 2022 1 commit

Clean up conv example, Instances, profiler and test (#324) · 500fa995

Chao Liu authored Jul 29, 2022

* convnd_fwd fp16 example

* update example

* update example

* update instance

* updating reference conv

* update reference conv

* update conv fwd profiler

* update conv 1d and 3d instance

* update include path

* clean

* update profiler for conv bwd data and weight

* update conv bwd weight

* clean

* update conv example

* update profiler for conv bwd weight

* update ckprofiler for conv bwd data

* fix reference conv bwd data bug; update conv bwd data test

* update examples

* fix initialization issue

* update test for conv fwd

* clean

* clean

* remove test case too sensitive to error threshold

* fix test

* clean

* fix build

* adding conv multiple d

* adding conv multiple D

* add matrix padder

* add gemm padding to convnd

* adding group conv

* update gemm multi-d

* refactor

* refactor

* refactor

* clean

* clean

* refactor

* refactor

* reorg

* add ds

* add bias

* clean

* add G

* adding group

* adding group

* adding group

* update Tensor

* clean

* update example

* update DeviceGemmMultipleD_Xdl_CShuffle

* update conv bwd-data and bwd-weight

* update contraction example

* update gemm and batch gemm with e permute

* fix example build

* instance for grouped conv1d

* update example

* adding group conv instance

* update gemm bilinear instance

* update gemm+add+add+fastgelu instance

* update profiler

* update profiler

* update test

* update test and client example

* clean

* add grouped conv into profiler

* update profiler

* clean

* add test grouped conv, update all conv test to gtest

* update test

500fa995

22 Jul, 2022 1 commit

Batched Gemm with multiD (#329) · d7d78290

zjing14 authored Jul 22, 2022



* add batched_gemm_multiD

* add ds

* rename file

* add batched_gemm_bias example

* add batch_strides into bmm_c_permute

* clean

* rename example_28 to example_29
Co-authored-by: Chao Liu <chao.liu2@amd.com>

d7d78290

21 Jul, 2022 1 commit

Grouped Gemm device with multiD grid (#319) · 7959dad5

zjing14 authored Jul 21, 2022



* replace gridwise_v2r3 with multiD

* adjust parameters

* add instances

* fixed test_grouped_gemm

* fix standalone softmax race condition around blockwise reduction

* fixed ci

* fixed comment: remove redundant workspace

* use instanceFactory

* add test layout

* add empty Ds

* add bias example

* use array

* separate examples
Co-authored-by: Anthony Chang <ac.chang@outlook.com>

7959dad5

15 Jul, 2022 1 commit

fix standalone softmax race condition around blockwise reduction (#323) · a11680cc
Anthony Chang authored Jul 15, 2022

a11680cc

13 Jul, 2022 1 commit

Standalone layernorm (#315) · 7f216620

rocking5566 authored Jul 14, 2022



* Implement layernorm kernel and deviceOp

* verify gpu kernel with host code

* 1. Separate gamma and beta from affine
2. Check if argument is valid

* clean

* Sync the naming

* Support sweep once mode if we can put k dimension data inside one block

* [What] Get length from upper length.
[Why] if we get length directly, we may get length after padding.

* We only use one block in K dimension.
Hence, we can simplify the indexing of global R/W.

* Use 1d descriptor for gamma and beta

* Add accElementwiseOp

* Extract layernorm host code

* Support different YVectorDim in GridwiseLayernorm

* Rename XSrcVectorDim to XYSrcVectorDim. Because we use same parameter in deviceOp

* Gamma and beta can share the VGPR.

* Add test for fp32 and fp16

* Fix concurrency bug and add a test case that may have failed originally

* Propagate NaN for layernorm
Co-authored-by: Chao Liu <chao.liu2@amd.com>

7f216620
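
The commit above verifies the GPU kernel against host code and keeps gamma and beta as separate per-column parameters. As a rough sketch of what such a host reference could look like, the code below normalizes each row to zero mean and unit variance and then applies gamma and beta. The function name `layernorm_rows`, its signature, and the default `epsilon` are hypothetical and are not taken from the repository.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host reference (names and signature are assumptions).
// x is a row-major rows x cols matrix; gamma and beta each hold cols values.
inline std::vector<float> layernorm_rows(const std::vector<float>& x,
                                         const std::vector<float>& gamma,
                                         const std::vector<float>& beta,
                                         std::size_t rows,
                                         std::size_t cols,
                                         float epsilon = 1e-5f)
{
    assert(cols > 0 && x.size() == rows * cols);
    assert(gamma.size() == cols && beta.size() == cols);

    std::vector<float> y(x.size());
    for(std::size_t r = 0; r < rows; ++r)
    {
        const std::size_t base = r * cols;

        // Mean over the normalized (column) dimension.
        float mean = 0.f;
        for(std::size_t c = 0; c < cols; ++c)
        {
            mean += x[base + c];
        }
        mean /= static_cast<float>(cols);

        // Biased variance over the same dimension.
        float var = 0.f;
        for(std::size_t c = 0; c < cols; ++c)
        {
            const float d = x[base + c] - mean;
            var += d * d;
        }
        var /= static_cast<float>(cols);

        // Normalize, then scale by gamma and shift by beta.
        const float inv_std = 1.f / std::sqrt(var + epsilon);
        for(std::size_t c = 0; c < cols; ++c)
        {
            y[base + c] = (x[base + c] - mean) * inv_std * gamma[c] + beta[c];
        }
    }
    return y;
}
```

In this sketch a NaN anywhere in a row propagates to every output of that row through the mean, which matches the intent of the "Propagate NaN for layernorm" item, although the kernel may achieve this differently.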

08 Jul, 2022 2 commits

GEMM pipeline v2 (#317) · 63914743

Po Yen Chen authored Jul 09, 2022



* format

* improving pipeline

* fix typo

* format

* adding thread group

* adding thread group

* adding thread group

* adding gemm pipeline

* tweak

* refactor

* refactor

* add missing type convert

* refactor

* refactor

* refactor

* clean

* fix build

* refactor

* format

* clean up

* use remove_cvref_t

* clean

* use pipeline_v2 for gemm kernel

* Remove inconsistent indent

* Fix compilation errors due to incomplete merge process

* Add missing include directives

* Fix compilation errors in currently unused files

* Add license in newly added files

* Re-format touched files by clang-format-10

* Fix wrong template argument count of DeviceGemm<>

* Use language construct to choose between types

* Use language construct to choose GEMM example instance

* Fix compilation error due to interface change

* Re-use type alias to avoid duplication

* Unify type alias usage in source file

* Only use v2 pipeline in one gridwise GEMM type

* Remove no-longer used include directives

* Add static_assert() to check pipeline type requirements

* Revert "Add static_assert() to check pipeline type requirements"

This reverts commit f0985f0a132671a1caaea92810c9f30dcf062bde.

* clean

* clean

* clean

* clean
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: shaojiewang <wsjmessi@163.com>

63914743

add conv1d/3d bwd weight instances (#318) · 763ca615

Shaojie WANG authored Jul 09, 2022

* add conv1d/3d bwd weight instances

* add profiler code

763ca615

07 Jul, 2022 1 commit

N-D Tensor Contraction example, instance, and client example (#270) · 4fe9c393

Chao Liu authored Jul 07, 2022

* adding contraction

* add contraction example

* update example

* update example

* format

* update readme

* clean header

* clean header

* contraction with multiple D

* rename

* fix naming issue; add instances for contraction+bilinear

* change assumed virtual layout of contraction; add client example

* update example

* update

* contraction+scale

* use type_convert

* rename

4fe9c393

06 Jul, 2022 1 commit

Batched Gemm with C Permute (#305) · 334361cb

zjing14 authored Jul 06, 2022



* init commit

* add c_permute

* add mnk padding

* fixed comments

* Fixed comments
Co-authored-by: Chao Liu <chao.liu2@amd.com>

334361cb

02 Jul, 2022 1 commit

Gemm+Bilinear (#316) · 9e4429f9

Chao Liu authored Jul 02, 2022

* refactor

* update example

* update example

* gemm bilinear

* clean

* update

9e4429f9

01 Jul, 2022 5 commits

modified grouped gemm addressing method (#307) · 8e374781

guangzlu authored Jul 01, 2022



* modified grouped gemm addressing method

* modified addressing method in device_grouped_gemm_xdl.hpp
Co-authored-by: root <root@dc-smc-13.amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

8e374781

Single-kernel GEMM + layernorm (#263) · 63fd5da6

Anthony Chang authored Jul 01, 2022



* dump lds content in appropriate precision type

* add squared add reduction op; allows sq sum

* initial stub from regular gemm impl

* layernorm example code & host verification

* initial layernorm implementation

* tidy up

* make C0 precision type consistent with C

* clang-tidy and additional comments

* tighten up example code

* account for extra flops/bytes from normalization

* clang-format

* c0 bias/beta/gamma now have their own precision type

* AccElemOp for gemm outputs prior to feeding to layernorm

* update workgroup mapping

* rename kernel template param to reflect its dual use

* use LDS mem pool for reduction workspace

* change cshuffle precision type to f16; clean up

* clang-format

* correct naming

* explicit cast

* fully implemented gemm + bias + activation + add + norm

* activation in correct order

* reflect reduction API's recent change

* amend

* clean up; add comment

* keep up with recent changes in reduction API

* format

* resolve merge conflicts
Co-authored-by: Chao Liu <chao.liu2@amd.com>

63fd5da6

add batch_stride into batched gemm (#314) · 1c8126a4

zjing14 authored Jul 01, 2022

* add batch_stride

* fixed test
Co-authored-by: Chao Liu <chao.liu2@amd.com>

1c8126a4
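
To illustrate what a batch stride means for a batched GEMM, here is a small host-side sketch in which each batch's matrix starts a fixed number of elements after the previous one. It assumes row-major, densely packed matrices within a batch. The function `ref_batched_gemm` and its parameter names are hypothetical and do not reproduce the library's reference implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference: C[g] = A[g] * B[g] for g in [0, G).
// A[g] is M x K, B[g] is K x N, C[g] is M x N, all row-major.
// batch_stride_* is the element offset between consecutive batches.
inline void ref_batched_gemm(const std::vector<float>& a,
                             const std::vector<float>& b,
                             std::vector<float>& c,
                             std::size_t G,
                             std::size_t M,
                             std::size_t N,
                             std::size_t K,
                             std::size_t batch_stride_a,
                             std::size_t batch_stride_b,
                             std::size_t batch_stride_c)
{
    // The last batch must still fit inside each buffer.
    assert(G == 0 || a.size() >= (G - 1) * batch_stride_a + M * K);
    assert(G == 0 || b.size() >= (G - 1) * batch_stride_b + K * N);
    assert(G == 0 || c.size() >= (G - 1) * batch_stride_c + M * N);

    for(std::size_t g = 0; g < G; ++g)
    {
        for(std::size_t m = 0; m < M; ++m)
        {
            for(std::size_t n = 0; n < N; ++n)
            {
                float acc = 0.f;
                for(std::size_t k = 0; k < K; ++k)
                {
                    acc += a[g * batch_stride_a + m * K + k] *
                           b[g * batch_stride_b + k * N + n];
                }
                c[g * batch_stride_c + m * N + n] = acc;
            }
        }
    }
}
```

One consequence of passing the stride explicitly is that a stride larger than the matrix size allows padding between batches, and a stride of zero reuses the same matrix for every batch.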

Improve external interface for GEMM and GEMM+add+add+fastgelu (#311) · 0dcb3496

Chao Liu authored Jun 30, 2022

* interface for GEMM and GEMM+add+add+fastgelu

* rename namespace

* instance factory

* fix build

* fix build; add GEMM client example

* clean

0dcb3496

Gemm + bias + c_permute (#312) · fa9a0a5c

zjing14 authored Jun 30, 2022

* init commit

* add desc

* finished c permute

* fixed vector lens

fa9a0a5c

30 Jun, 2022 1 commit

Standalone sweep once softmax kernel w/ ckProfiler (#295) · 93c99f3d

Anthony Chang authored Jul 01, 2022

* use 'sweep once' softmax kernel where applicable

* threadwise copy's dst buffer can specify invalid element value

* add int8 in/out float compute softmax support

give a bit of leeway for int absolute tolerance as there's a single data point of all test cases showing off-by-1 error

* format

* softmax inherits DeviceNormalization

* softmax profiler stub

* tighten up reference softmax interface

* example prints tensor dimension

* add fp32 to softmax profiler

* rename header

* hook with ckProfiler

* format

* resolve merge conflict

* resolve merge conflicts

* update normalization profiler help string

* resolve conflict

* typo

* remove residual

* softmax profiler: address feedback

* test for mixed precision input/output

* fully qualify ck::math::isnan

* add comment for device normalization interface

* revise wording

* constness for alpha/beta scalar pointer

93c99f3d

27 Jun, 2022 1 commit

external api for gemm + layernorm (#285) · 12235112

rocking5566 authored Jun 28, 2022

* Extract base class for elementwise

* Refactor interface of DeviceGemmReduce. Do not use tuple in interface

* [What] Rename d into reduce in gemm + reduction related code
[Why] Prepare to add d term for add

* Unify base class of gemm + reduce and gemm + bias + add + reduce

* 1. Rename gemm_bias_add_reduce for external api
 2. Refine cmake

* Add normalize device operation

* [What] Reorder the argument
[Why] Because d0 is also the input of c.

* Add type string

* Add example of gemm_bias_add_layernorm  via external api

* Refactor example code

* clang-format

* Fix compile error

* clang-format

* Add external api for gemm_add_add_layernorm and normalize

* Add client example

* clang-format

12235112