Commits · 31d2d52aee63f8fc81049bbce2ecfd1694ea7525 · gaoqiong / composable_kernel

20 Sep, 2022 3 commits
- merge develop · 31d2d52a
  wangshaojie6 authored Sep 20, 2022
  
  31d2d52a
- passing a mask struct · 5718bc14
  wangshaojie6 authored Sep 20, 2022
  
  5718bc14
- Add batched attention special kernel instances (#424) · 7c788e10
  Anthony Chang authored Sep 20, 2022
```
* sanity check

* add attribution

* add irregular k tile size for batched attention

* format
```
  7c788e10
19 Sep, 2022 7 commits

work around inline asm potential hazard using intrinsic (#416) · c6b8b472
Anthony Chang authored Sep 20, 2022

c6b8b472

Grouped batched attention + permute (#412) · 9287b7c6

Anthony Chang authored Sep 20, 2022

* grouped attn without batch validates; now move toward grouped batched attn

* grouped batched attention

* working

* remove debug logging

clean up

clean up

* reintroduce g_ prefix back to host tensor variables

* format

* rename file

* restore old file

* rename

* consolidate padded/non-padded attention example

* harmonize padding specialization in attn examples

9287b7c6

Conv bwd data multiple d (#404) · 27858374

Shaojie WANG authored Sep 20, 2022



* init commit of convnd bwd data

* begin compiling example

* have a first version that produces a correct result

* refine device level launch kernel code

* add more instances in example and get right results

* clang-format

* format example file

* add more instances

* fix instances

* adding conv_bwd_data multiple_d

* adding conv_bwd_data multiple_d

* adding conv_bwd multiple d

* adding conv_bwd multiple d

* adding conv_bwd multiple d

* refactor

* refactor

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* refactor

* update conv fwd's bias impl

* refactor

* reorg file

* clean up cmake

* clean

* clean

* clean
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

27858374

add test and instance for MNK padding · 4976eb0c
wangshaojie6 authored Sep 19, 2022

4976eb0c
add k padding · 812521df
wangshaojie6 authored Sep 19, 2022

812521df
fix error: check left bottom corner for tile skipping · 12e7df12
wangshaojie6 authored Sep 19, 2022

12e7df12
check left bottom corner for tile skipping · 616a51fd
wangshaojie6 authored Sep 19, 2022

616a51fd
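
The two tile-skipping commits above (12e7df12, 616a51fd) only name the check, so the following is a minimal sketch of the idea rather than the kernel's actual code. It assumes the usual decoder convention for a lower-triangular mask, where element (m, n) is kept when n <= m. The `Tile` struct and both function names are hypothetical and do not come from the repository.

```cpp
#include <cassert>

// Hypothetical tile descriptor in element coordinates (half-open ranges).
struct Tile
{
    int m_begin, m_end; // rows    [m_begin, m_end)
    int n_begin, n_end; // columns [n_begin, n_end)
};

// Assumed mask convention: (m, n) is masked out when it lies above the diagonal.
inline bool is_masked_out(int m, int n) { return n > m; }

// The bottom-left corner has the largest row and the smallest column in the
// tile, so it is the element most likely to survive the mask. If even that
// element is masked out, every element of the tile is, and the tile can be
// skipped without computing it.
inline bool can_skip_tile(const Tile& t)
{
    assert(t.m_begin < t.m_end && t.n_begin < t.n_end); // non-empty tile
    return is_masked_out(t.m_end - 1, t.n_begin);
}
```

Checking a single corner keeps the skip decision to one comparison per tile; tiles that straddle the diagonal are not skipped and still need per-element masking.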

17 Sep, 2022 2 commits
- fix compile error · 878f66fb
  wangshaojie6 authored Sep 17, 2022
  
  878f66fb
- Merge branch 'develop' into att_lower_triangle · abed3c1f
  wangshaojie6 authored Sep 17, 2022
  
  abed3c1f
16 Sep, 2022 7 commits
- disable print for group conv multiple D (#421) · 43c898f6
  Chao Liu authored Sep 16, 2022
  
  43c898f6
- merge remote develop · fefd767f
  wangshaojie6 authored Sep 16, 2022
  
  fefd767f
- clang-format · 96b0f78c
  wangshaojie6 authored Sep 16, 2022
  
  96b0f78c
- add gtest for bmm masking scale softmax bmm permute · 97dcc7b2
  wangshaojie6 authored Sep 16, 2022
  
  97dcc7b2
- add test file · 200dd06b
  wangshaojie6 authored Sep 16, 2022
  
  200dd06b
- add test · 6e0a93d2
  wangshaojie6 authored Sep 16, 2022
  
  6e0a93d2
- add 10 instance for masking bmm + scale + softmax + bmm + permute kernels · 8cdcad67
  wangshaojie6 authored Sep 16, 2022
  
  8cdcad67
15 Sep, 2022 3 commits
- add some comments on example · cf480e8a
  wangshaojie6 authored Sep 15, 2022
  
  cf480e8a
- remove lower triangle gemm reference struct · 74320196
  wangshaojie6 authored Sep 15, 2022
  
  74320196
- rename template and remove default template value · 7ae26b79
  wangshaojie6 authored Sep 15, 2022
  
  7ae26b79
14 Sep, 2022 8 commits

batched_gemm + multiple_d + gemm + multiple_d (#394) · 370efa6c

ltqin authored Sep 15, 2022



* refactor

* start

* add device gemm file

* add BatchStrideD0

* add stridd0

* add gridwise file

* add d0 parameters to gridwise gemm

* add c layout transformer

* add d0 threadwise copy

* init kernel

* init kernel

* regular code

* nm desc put to out

* kernel parameter can not use reference

* host add bias+gelu

* run right for bias+gelu

* change AddFastGelu into another file

* interface add d1 bias parameters

* add d1 parameter to argument

* add d1 parameter to gridwise

* first pass of all code, not verified

* gelu change to relu and GetElementSpaceSize bug

* add instance

* start add to ckprofiler

* ckprofiler finish code

* change input parameter for ckProfiler

* fix host bias+gelu bug

* show help for ckProfiler

* fix bug where kernel launch ignores parameters

* add pad and fix about bug

* multiple d0

* add dynamic d0_element_op

* change profiler and instance to multiple d0

* example have 2 d0

* remove some comments not using

* change 2 d0 have self  parameters

* change d element_op name

* change class name(multiple_d)

* fix bug

* fix bug that don't find file

* update profiler

* refactor

* update profiler

* clean

* revert example change

* add gon layout

* optimize parameter for gno

* add gon to gemm+gemm

* change helping input parameters

* change to GemmPadder_v2

* using ForEach

* fix gb_per_sec
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: ltqin <letaoqin@amd.com>

370efa6c

add template to distinguish masking kernel · 1dc91af9
wangshaojie6 authored Sep 15, 2022

1dc91af9
attention with lower triangle mask with tile skipping · 7b18e6fd
wangshaojie6 authored Sep 14, 2022

7b18e6fd
Merge branch 'att_diagnal' into att_lower_triangle · a614e299
wangshaojie6 authored Sep 14, 2022

a614e299
fix n2 compute error · 1ebc21d4
wangshaojie6 authored Sep 14, 2022

1ebc21d4
use 7*13 group · e392ce24
wangshaojie6 authored Sep 14, 2022

e392ce24
add decoder lower triangular mask calculation · 336a7065
danyao12 authored Sep 14, 2022

336a7065
functionality right with lower triangle mask · 870a2482
wangshaojie6 authored Sep 14, 2022

870a2482

13 Sep, 2022 3 commits
- Upgrade the OS and ROCM versions. (#411) · b22ebd44
  Illia Silin authored Sep 13, 2022
```
* upgrade the OS and ROCM versions in CK docker

* add cxx flags to link code with rocm5.2 and ck-9110 compiler

* rename the docker image

* run ONNX gemms using init=1
```
  b22ebd44
- init code for tile skipping · 506c8eb3
  wangshaojie6 authored Sep 13, 2022
  
  506c8eb3
- add lower triangle bmm · 3f9100cc
  wangshaojie6 authored Sep 13, 2022
  
  3f9100cc
09 Sep, 2022 1 commit

embedding fuse layernorm (#405) · efd1d257

carlushuang authored Sep 09, 2022



* add gridwise/device sparse embedding

* update code

* update code

* remove useless makefile

* code fix

* workable

* work properly

* emb add

* add more instance

* format

* remove useless code

* fix format

* fix clang-tidy

* clean

* fix a compile error
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>

efd1d257

08 Sep, 2022 1 commit

Fix gemm-softmax-gemm-permute padding cases (#409) · d6709dc3

Anthony Chang authored Sep 08, 2022

* fix example; make padding on by default in example; fix argument checks

* fix Gemm1KPack which has since regressed from PR #399

d6709dc3

07 Sep, 2022 1 commit

Add stderr to QA logfiles, process splitK and ONNX gemm kernels (#402) · ce74cea4

Illia Silin authored Sep 07, 2022

* add processing for the onnx_gemm and splitK_gemm

* add profile_onnx_gemm.sh

* add stderr to logfiles, add splitK and onnx gemm parsing

* enable splitK gemm results posting to db

ce74cea4

06 Sep, 2022 3 commits

Fused attention instances & padding tests (#395) · 868e5c55

Anthony Chang authored Sep 07, 2022

* modify comment

* trim unnecessary check

* add gemm spec in kernel name

* add TNTT gemm_gemm + atten kernel instances

* refactor attention padding to better fit in unit tests

This streamlines usage: "ResetNaNToMinusInf" is now hidden from the user-facing device op.
Also added compile-time conditionals that load OOB values as NaN only when padding is enabled

* add adhoc padding test for atten

* shrink input value range for attention kernel validation to avoid occasional error by 1e-3

Still unsure whether this kind of deterministic floating point accuracy issue is expected
or not. May want to try the exact same approach as the GPU kernel in the host reference
GEMM+Softmax+GEMM function to see if the accuracy discrepancy goes away. Until then,
shrink the input value range as it is less likely to produce errors of around 1e-3.

* attention kernel proper granular padding for all 4 dims

* IsSupportedArgument checks

* test more padded cases

* block PadK specialization in attention kernels

* workaround clang crash for gfx908

(gfx908 only) workaround for compiler crash in fused kernels on mainline #9110; #10738 seems ok
error message was "fatal error: error in backend: Error while trying to spill VGPR0 from class
VGPR_32: Cannot scavenge register without an emergency spill slot!"
this falls back to a less ideal way of handling NPadding in the fused attention kernel

* comment out kernels giving wrong results on MI100; MI200 doesn't seem affected

868e5c55
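
The padding note in #395 above says out-of-bounds values are loaded as NaN and that "ResetNaNToMinusInf" is hidden inside the device op. As a rough host-side illustration of why that combination works, the sketch below assumes the reset simply replaces NaN with minus infinity before the softmax, so padded elements end up with zero weight. The function names are hypothetical and this is not the kernel's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical helper named after the commit note: NaN marks a padded
// (out-of-bounds) element and is turned into -inf.
inline float reset_nan_to_minus_inf(float x)
{
    return std::isnan(x) ? -std::numeric_limits<float>::infinity() : x;
}

// Numerically stable row softmax that tolerates NaN-marked padding.
inline std::vector<float> softmax_with_padding(std::vector<float> row)
{
    float row_max = -std::numeric_limits<float>::infinity();
    for(float& x : row)
    {
        x       = reset_nan_to_minus_inf(x);
        row_max = std::max(row_max, x);
    }
    // This sketch requires at least one real (non-padded) element per row.
    assert(std::isfinite(row_max));

    float sum = 0.0f;
    for(float& x : row)
    {
        x = std::exp(x - row_max); // exp(-inf) is 0, so padding adds nothing
        sum += x;
    }
    for(float& x : row)
    {
        x /= sum;
    }
    return row;
}
```

For example, a row `{1, 1, NaN, NaN}` yields weights of 0.5 for the two real elements and exactly 0 for the two padded ones.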

GemmGemm TNNT instances (#399) · fe52c94c

Anthony Chang authored Sep 07, 2022

* add gemm_gemm TNNT instance

* sanitize Gemm1KPack

* disable instances that failed validation on mi100

fe52c94c

Softmax client example (#396) · 3da5c19e

Adam Osewski authored Sep 06, 2022



* Update Softmax device operation interface.

* Update ckProfiler.

* Update Softmax UT.

* Update example.

* Client example.

* Clang format
Co-authored-by: Adam Osewski <aosewski@amd.com>

3da5c19e

02 Sep, 2022 1 commit

[Hotfix] SplitK Gemm fp32 (#401) · 75891161

zjing14 authored Sep 02, 2022

* add scripts

* fixed splitK_gemm_fp32

* clean

* clean

* use gemm_xdl_splitK_c_shuffle into profiler

* remove device_gemm_xdl_splitk.hpp

75891161