Commits · fused-gemm · gaoqiong / composable_kernel

13 Aug, 2022 8 commits

Merge branch 'fix_0813' into fused-gemm · 7ea9c9c4
Chao Liu authored Aug 13, 2022

Merge remote-tracking branch 'origin/develop' into fused-gemm · 2564c493
Chao Liu authored Aug 13, 2022

fix build · 8bea6b2d
Chao Liu authored Aug 13, 2022

Merge remote-tracking branch 'origin/develop' into fused-gemm · 000eefbf
Chao Liu authored Aug 13, 2022


ltqin authored Aug 13, 2022

* start

* read for gridwise gemm

* add MakeBGridDescriptor_K0_N0_N1_N2_N3_K1

* add thread copy desc and register buffer

* add K0PerBlock dim

* add read global data

* finish gridwise gemm

* finish blockwise gemm

* add print data

* add smallest config

* add compare code for gridwise gemm

* fix NXdlPerWave

* fix k0perthread and gridwise gemm main loop

* remove b matrix lds alloc

* fix name

* add test code

* create b_grid_desc_k0_k1_k2_n0_n1_n2_n3_k3 from parameter

* add double register

* modify b_thread_desc_

* add float

* fp16 tag

* add tail for pipeline

* finish main loop

* optimize main loop

* start clear gridwise gemm

* clear code

* clear redundant code

* change file name

* change file name

* fix bug after merge develop

* fix input parameters

* using MultiK0 control b load data loop

* fix some config

* 4 buffer

* fix bug

* one can use

* change read order

* change buffer array to tuple

* change to 8 buffer

* interleave buffer load

* change to 16

* read 8 buffer

* add data buffer to template

* fix after merge develop(head file)

* format

* change to 4 buffer

* remove unnecessary lambda fun

10b3278b

Add examples for reduction fp16/fp32/bf16/int8/fp64 for 3d/4d/5d (#342) · 14932e8d

Qianfeng authored Aug 13, 2022

* Update the reduce_blockwise example to support user specified data type and input+reducing dimensions

* Add examples for using reduce_multiblock_atomic_add

* Add more running examples to the default command-line

* Remove unnecessary header includes

* Update to the example README.md


Gemm multiple d multiple r (#335) · 6c3c06bf

rocking5566 authored Aug 13, 2022

* Imitate XXX_gemm_multiple_d, add XXX_gemm_multiple_d_multiple_r for gemm + reduction

* Implement run of kernel

* Add example

* Fix typo in parameter

* Rewrite the reduceMax example

* Rewrite the reduceMean + reduceMeanSquare example

* Refine naming

* Refine folder name

* refine naming

* Rewrite the gemm + bias + relu + add + layernorm example

* Rewrite the gemm + layernorm example

* clang-format

* Fix bug if sync lds

* Fix compile error


Fused attention (#345) · cac014f1

Anthony Chang authored Aug 13, 2022



* initial stub for gemm_gemm_xdl_cshuffle

* set up example code

* compiles

* prevent integer overflow

* harmonize interface between ref_gemm and ref_batched_gemm

* batched_gemm_gemm

* fix example

* host tensor gen: diagonal pattern in lowest two-dimensions only

* make c descriptors containing only integral constants

* clean up

* add BlockwiseGemmXdlops_v2 while exploring a unified approach

* implement proper interface

* tidy up example

* fix compilation warnings

* coarsely controlled 2nd gemm padding

* remove rocm-cmake's hard requirement for certain revision

* clang-format

* resolve merge conflict

* fix compilation error on gfx10

* adds acc0 elementwise op to interface

* attention host validation

* add blockwise softmax v1

* iteratively update softmax+gemm

* transpose both gemm0 and gemm1 xdl output so as to avoid broadcasting softmax max/sum

* add init method for easier debugging

* do away with manual thread cluster calculation

* generalize blockwise softmax interface

* row-wise softmax sum & max

* format

* rename to DeviceBatchedGemmSoftmaxGemm

* add gemm_softmax_gemm instances and tests

* comment
Co-authored-by: ltqin <letao.qin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>


12 Aug, 2022 4 commits

Move literal ""_uz & ""_zu into namespace 'ck::literals' (#354) · a670a5a0
Po Yen Chen authored Aug 13, 2022
```
* Move literal ""_uz & ""_zu into namespace 'literals'

* Move namespace 'literals' as 'ck::literals'
```
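
For readers unfamiliar with the `_uz` / `_zu` suffixes named in the commit above, here is a minimal sketch of how such user-defined literals can be scoped inside a nested namespace. This is not the code from the commit: the namespace `sketch` and the helpers `make_lengths` and `element_count` are hypothetical names chosen for illustration, and the actual definitions in `ck::literals` may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical namespace for illustration only; the commit itself
// refers to 'ck::literals'.
namespace sketch {
namespace literals {

// User-defined literal: 3_uz has type std::size_t.
inline constexpr std::size_t operator""_uz(unsigned long long value)
{
    return static_cast<std::size_t>(value);
}

// Same conversion under the alternative spelling.
inline constexpr std::size_t operator""_zu(unsigned long long value)
{
    return static_cast<std::size_t>(value);
}

} // namespace literals
} // namespace sketch

// Hypothetical helper: the suffix is only visible after an explicit
// using-directive, so it does not leak into unrelated code.
inline std::vector<std::size_t> make_lengths()
{
    using namespace sketch::literals;
    return {2_uz, 3_uz, 4_uz};
}

// Hypothetical helper: product of all lengths.
inline std::size_t element_count(const std::vector<std::size_t>& lengths)
{
    assert(!lengths.empty());
    std::size_t total = 1;
    for(const std::size_t length : lengths)
    {
        total *= length;
    }
    return total;
}
```

Placing the operators in a dedicated `literals` namespace follows the same pattern as `std::literals`: callers opt in with a using-directive where they want the suffix, which replaces casts such as `static_cast<std::size_t>(3)` without affecting other translation units.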

Add example of conv_fwd_bias_relu_add for int4, int8, bfp16, fp16, and fp32 (#343) · 0c6ef7c1

Rostyslav Geyyer authored Aug 12, 2022



* [LWPCK-359] Initial commit

* Working version for fp16, add results to readme

* Update according to PR #341

* Update results in readme

* Add fp32 example

* Add bf16 example

* Update fp16 and fp32 examples

* Add int8 example

* Add separate lengths and strides tensors for D tensors
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>


add g; fixed strides (#355) · 35e49f2d
zjing14 authored Aug 12, 2022


Build docker only once in CI, fix conv_bwd logfile names. (#353) · de60d290

Illia Silin authored Aug 12, 2022

* build docker in separate stage

* build docker with only one prefix

* add parallel statement

* add docker repo url

* fix the name of perf_conv_bwd_data log file


11 Aug, 2022 24 commits
- Add examples for GEMM + AddAddFastGelu (data type: int8, bf16, fp32) (#340) · 68b61504
  Po Yen Chen authored Aug 12, 2022
```
* Add always_false<> util to delay symbol resolution

* Use always_false<> to prevent trying instantiate unwanted method

* Add new specializations of AddAddFastGelu::operator() method

* Add GEMM + AddAddFastGelu examples for data types: int8, bf16, fp32

* Use floating point literal to simplify code

* Remove unnecessary capture in lambda expressions

* Extract fast GeLU calculation as standalone method

* Mark methods as 'constexpr'

* Add constraint for HostTensorDescriptor templated ctors

* Simplify HostTensorDescriptor ctor calls

* Add C++23 std::size_t literal suffix

* Use _uz suffix to shorten example code

* Remove unnecessary conversion to std::array<>

* Re-order include directives

* Remove C-style casting by literal suffix

* Remove unnecessary statements in main()

* Remove unused type parameter of always_false<>

* Remove unused include directive

* Exit main() by returning meaningful value

* Use 'if constexpr' to switch example flow

* Use std::is_same_v<> to shorten example code

* Add 'inline' specifier to literal functions

* Unify output methods in example

* Move common codes into .inc file

* Add type check in type_convert<>()

* Add type_convert<float>() before computation

* Merge AddAddFastGelu method specializations

* Remove always_false<>

* Add constraint to AddAddFastGelu::operator() parameter types
```
- ckProfiler for layernorm (#330) · fdfd7eb5
  rocking5566 authored Aug 12, 2022
```
* Refine parameter

* Add base class for layernorm

* Add layernorm instance

* Add layernorm to ckProfiler

* Remove redundant

* Add verification

* Fix compile error due to merge
```
- avoid LDS data hazard · b64a2860
  Anthony Chang authored Aug 11, 2022
  
- add gemm_gemm instances and tests · 8aa44bcd
  Anthony Chang authored Aug 11, 2022
  
- adds acc0 elementwise op to interface · 51fc99a8
  Anthony Chang authored Aug 11, 2022
  
- fix compilation error on gfx10 · 8672733f
  Anthony Chang authored Aug 10, 2022
  
- resolve merge conflict · 3c5a50f2
  Anthony Chang authored Aug 08, 2022
  
- clang-format · edc494df
  Anthony Chang authored Aug 04, 2022
  
- remove rocm-cmake's hard requirement for certain revision · 00331ee4
  Anthony Chang authored Aug 04, 2022
  
- coarsely controlled 2nd gemm padding · c9bef1c6
  Anthony Chang authored Aug 10, 2022
  
- fix compilation warnings · e55b67a0
  Anthony Chang authored Aug 10, 2022
  
- tidy up example · ed424975
  Anthony Chang authored Aug 04, 2022
  
- implement proper interface · 5f94555b
  Anthony Chang authored Aug 04, 2022
  
- add BlockwiseGemmXdlops_v2 while exploring a unified approach · 98e4c0ce
  Anthony Chang authored Aug 03, 2022
  
- clean up · eceea10a
  Anthony Chang authored Aug 03, 2022
  
- make c descriptors containing only integral constants · 4ee34028
  Anthony Chang authored Aug 03, 2022
  
- host tensor gen: diagonal pattern in lowest two-dimensions only · caf2b2ed
  Anthony Chang authored Aug 01, 2022
  
- fix example · b790e44b
  Anthony Chang authored Aug 01, 2022
  
- batched_gemm_gemm · 408ba59b
  Anthony Chang authored Jul 27, 2022
  
- harmonize interface between ref_gemm and ref_batched_gemm · b57c3879
  Anthony Chang authored Jul 27, 2022
  
- prevent integer overflow · 237371ad
  Anthony Chang authored Jul 27, 2022
  
- compiles · 047cee2b
  Anthony Chang authored Jul 20, 2022
  
- set up example code · 68b71534
  Anthony Chang authored Jul 12, 2022
  
- initial stub for gemm_gemm_xdl_cshuffle · 89a5e847
  Anthony Chang authored Jul 11, 2022
  
10 Aug, 2022 1 commit

Add batched/grouped_gemm contraction deviceOps (#349) · e08d68d2

zjing14 authored Aug 10, 2022



* convnd_fwd fp16 example

* update example

* update example

* update instance

* updating reference conv

* update reference conv

* update conv fwd profiler

* update conv 1d and 3d instance

* update include path

* clean

* update profiler for conv bwd data and weight

* update conv bwd weight

* clean

* update conv example

* update profiler for conv bwd weight

* update ckprofiler for conv bwd data

* fix reference conv bwd data bug; update conv bwd data test

* update examples

* fix initialization issue

* update test for conv fwd

* clean

* clean

* remove test case too sensitive to error threshold

* fix test

* clean

* fix build

* adding conv multiple d

* adding conv multiple D

* add matrix padder

* add gemm padding to convnd

* adding group conv

* update gemm multi-d

* refactor

* refactor

* refactor

* clean

* clean

* refactor

* refactor

* reorg

* add ds

* add bias

* clean

* add G

* adding group

* adding group

* adding group

* update Tensor

* clean

* update example

* update DeviceGemmMultipleD_Xdl_CShuffle

* update conv bwd-data and bwd-weight

* update contraction example

* update gemm and batch gemm with e permute

* fix example build

* instance for grouped conv1d

* update example

* adding group conv instance

* update gemm bilinear instance

* update gemm+add+add+fastgelu instance

* update profiler

* update profiler

* update test

* update test and client example

* clean

* add grouped conv into profiler

* update profiler

* clean

* add test grouped conv, update all conv test to gtest

* update test

* change gemm_c_permute with contraction

* add grouped_contraction

* add contraction in group_gemm

* add example of grouped_gemm with contraction

* add example of grouped_contraction_bias_e_permute

* clean

* fixed ds

* add m3n2 m2n3 examples into gemm_bias_e_permute
Co-authored-by: Chao Liu <chao.liu2@amd.com>


08 Aug, 2022 1 commit

Fix QA, allow switching compiler versions, fix google test compilation error. (#348) · aba7fefc

Illia Silin authored Aug 08, 2022

* allow selecting compiler version

* fix typo

* add Wno-deprecated flag for google tests

* change git repo, fix qa log files names

* change the git clone syntax

* use Omkar's git credentials

* try to use jenkins as git user

* try using illsilin username for gerrit repo with ssh key

* try new gerrit authorization

* change ssh key syntax

* try another way of passing ssh key to docker

* add mount ssh in dockerfile

* create .ssh folder

* move ssh-keyscan to later

* get rid of npm call

* build first docker image on master

* check the contents of the .ssh folder

* try replacing omkars creds with gerrit creds

* use open repo, clean up changes

* get rid of ssh default argument


07 Aug, 2022 1 commit
- fix bug in gemm profiler (#344) · 146972f4
  Chao Liu authored Aug 07, 2022
  
03 Aug, 2022 1 commit

Update Group convolution (#341) · 75ab874e

Chao Liu authored Aug 03, 2022

* add conv oddC

* update example

* update example

* fix bug in example

* fix bug in group conv example
