Commits · cef3d91fb0ac81b671abb5e94e9268d2a80cbc08 · gaoqiong / composable_kernel

19 Sep, 2022 4 commits
- Test groupnorm kernel from device_instance · cef3d91f
  rocking authored Sep 19, 2022
  
- Merge branch 'develop' into group_norm · cb2b5c86
  rocking5566 authored Sep 19, 2022
  
- Add groupnorm ckProfiler · 7cda0a07
  rocking authored Sep 19, 2022
  
- Refine error message · e5b9beb3
  rocking authored Sep 16, 2022
  
16 Sep, 2022 4 commits
- disable print for group conv multiple D (#421) · 43c898f6
  Chao Liu authored Sep 16, 2022
  
- Add groupnorm test · 3d911b2a
  rocking authored Sep 16, 2022
  
- clang-format · ab10fbc0
  rocking authored Sep 16, 2022
  
- [What] Rename original layernorm into layernorm2d · e9070031
  rocking authored Sep 16, 2022
```
[Why] Prepare to add groupnorm using layernorm5d
```
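
The commit above mentions building groupnorm on top of a 5D layernorm. A minimal sketch of that idea follows, assuming an `[N, H, W, G, C]` layout in which each `(n, g)` pair is normalized over its `H*W*C` elements. The layout, the function names, and the omission of gamma/beta are illustrative assumptions, not the repository's actual kernel or API.

```python
import math


def layernorm_rows(rows, eps=1e-5):
    """Normalize each row to zero mean and unit variance (gamma/beta omitted)."""
    out = []
    for row in rows:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.append([(v - mean) * inv_std for v in row])
    return out


def groupnorm_nhwgc(x, n, h, w, g, c, eps=1e-5):
    """Group norm over a flat buffer assumed to be laid out as [N, H, W, G, C].

    For each (n, g) pair, the H*W*C elements of that group are gathered,
    normalized with the layernorm routine, and written back in place.
    """
    def offset(ni, hi, wi, gi, ci):
        # Row-major offset into the hypothetical [N, H, W, G, C] buffer.
        return (((ni * h + hi) * w + wi) * g + gi) * c + ci

    y = [0.0] * len(x)
    for ni in range(n):
        for gi in range(g):
            idx = [offset(ni, hi, wi, gi, ci)
                   for hi in range(h) for wi in range(w) for ci in range(c)]
            (normed,) = layernorm_rows([[x[i] for i in idx]], eps)
            for i, v in zip(idx, normed):
                y[i] = v
    return y
```

Viewed this way, groupnorm is a layernorm whose reduced dimensions are `H`, `W` and `C`, with `N` and `G` kept as the non-reduced dimensions.
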
15 Sep, 2022 5 commits
- Fuse sigmoid after groupnorm · 58188d46
  Rocking authored Sep 15, 2022
  
- Merge branch 'develop' into group_norm · aea3b411
  rocking5566 authored Sep 16, 2022
  
- Add reference for groupnorm · e4f8aa5c
  Rocking authored Sep 15, 2022
  
- [What] Fix layernorm bug for inputs with more than 2 dimensions. · 22a38a50
  Rocking authored Sep 15, 2022
```
[Why] We need to get upper length from merge transform instead of embed transform.
```
- Modify test, instance and client example · 8166d875
  rocking authored Sep 14, 2022
  
14 Sep, 2022 3 commits

batched_gemm + multiple_d + gemm + multiple_d (#394) · 370efa6c

ltqin authored Sep 15, 2022



* refactor

* start

* add device gemm file

* add BatchStrideD0

* add stride d0

* add gridwise file

* add d0 parameters to gridwise gemm

* add c layout transformer

* add d0 threadwise copy

* init kernel

* init kernel

* regular code

* nm desc put to out

* kernel parameter can not use reference

* host add bias+gelu

* run right for bias+gelu

* change AddFastGelu into another file

* interface add d1 bias parameters

* add d1 parameter to argument

* add d1 parameter to gridwise

* first complete code, not verified

* gelu change to relu and GetElementSpaceSize bug

* add instance

* start add to ckprofiler

* ckprofiler finish code

* change input parameter for ckProfiler

* fix host bias+gelu bug

* show help for ckProfiler

* fix bug where kernel launch ignored parameters

* add pad and fix about bug

* multiple d0

* add dynamic d0_element_op

* change profiler and instance to multiple d0

* example have 2 d0

* remove some comments not using

* give each of the 2 d0 its own parameters

* change d element_op name

* change class name(multiple_d)

* fix bug

* fix bug that don't find file

* update profiler

* refactor

* update profiler

* clean

* revert example change

* add gon layout

* optimize parameter for gno

* add gon to gemm+gemm

* change helping input parameters

* change to GemmPadder_v2

* using ForEach

* fix gb_per_sec

Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: ltqin <letaoqin@amd.com>

Allow shape of gamma and beta to be the same as x · 12673f3f
rocking authored Sep 14, 2022

Add groupnorm example by layernorm · 45220e05
rocking authored Sep 14, 2022
```
1. Reference is not ready
2. Shape of gamma and beta needs to be fixed
```

13 Sep, 2022 1 commit

Upgrade the OS and ROCM versions. (#411) · b22ebd44

Illia Silin authored Sep 13, 2022

* upgrade the OS and ROCM versions in CK docker

* add cxx flags to link code with rocm5.2 and ck-9110 compiler

* rename the docker image

* run ONNX gemms using init=1

09 Sep, 2022 1 commit

embedding fuse layernorm (#405) · efd1d257

carlushuang authored Sep 09, 2022



* add gridwise/device sparse embedding

* update code

* update code

* remove useless makefile

* code fix

* workable

* work properly

* emb add

* add more instance

* format

* remove useless code

* fix format

* fix clang-tidy

* clean

* fix a compile error

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>

08 Sep, 2022 1 commit

Fix gemm-softmax-gemm-permute padding cases (#409) · d6709dc3

Anthony Chang authored Sep 08, 2022

* fix example; make padding on by default in example; fix argument checks

* fix Gemm1KPack which has since regressed from PR #399

07 Sep, 2022 1 commit

Add stderr to QA logfiles, process splitK and ONNX gemm kernels (#402) · ce74cea4

Illia Silin authored Sep 07, 2022

* add processing for the onnx_gemm and splitK_gemm

* add profile_onnx_gemm.sh

* add stderr to logfiles, add splitK and onnx gemm parsing

* enable splitK gemm results posting to db

06 Sep, 2022 3 commits

Fused attention instances & padding tests (#395) · 868e5c55

Anthony Chang authored Sep 07, 2022

* modify comment

* trim unnecessary check

* add gemm spec in kernel name

* add TNTT gemm_gemm + atten kernel instances

* refactor attention padding to better fit in unit tests

This streamlines usage where "ResetNaNToMinusInf" is now hidden from user facing device op.
Also added compile-time conditionals that load OOB value as NaN only after padding is enabled

* add adhoc padding test for atten

* shrink input value range for attention kernel validation to avoid occasional error by 1e-3

Still unsure whether this kind of deterministic floating point accuracy issue is expected
or not. May want to try exact same approach as the GPU kernel in the host reference
GEMM+Softmax+GEMM function to see if the accuracy discrepancy goes away. Until then,
shrink the input value range as it is less likely to produce errors of around 1e-3.

* attention kernel proper granular padding for all 4 dims

* IsSupportedArgument checks

* test more padded cases

* block PadK specialization in attention kernels

* workaround clang crash for gfx908

(gfx908 only) workaround for compiler crash in fused kernels on mainline #9110; #10738 seems ok.
The error message was "fatal error: error in backend: Error while trying to spill VGPR0 from class
VGPR_32: Cannot scavenge register without an emergency spill slot!"
This falls back to a less ideal way of handling NPadding in the fused attention kernel.

* comment out kernels giving wrong results on MI100; MI200 doesn't seem affected

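
The padding notes in the commit above say that out-of-bounds values are loaded as NaN and that a "ResetNaNToMinusInf" step is applied inside the device op. A minimal sketch of why that makes padded columns harmless in softmax follows. The helper names, the list-based data layout, and the assumption that at least one position is valid are illustrative only; this is not the kernel's actual code.

```python
import math


def reset_nan_to_minus_inf(v):
    """Map a NaN (standing in for an out-of-bounds load) to -inf."""
    return -math.inf if math.isnan(v) else v


def padded_softmax(row, valid_len, padded_len):
    """Softmax over a row padded from valid_len to padded_len entries.

    Assumes 1 <= valid_len <= len(row) and valid_len <= padded_len.
    """
    # Simulate the padded load: positions past valid_len read as NaN.
    padded = [row[i] if i < valid_len else math.nan for i in range(padded_len)]
    scores = [reset_nan_to_minus_inf(v) for v in padded]
    m = max(scores)  # finite, because at least one position is valid
    exps = [math.exp(s - m) for s in scores]  # exp(-inf) == 0.0 for padding
    total = sum(exps)
    return [e / total for e in exps]
```

Because `exp(-inf)` is exactly zero, the padded positions add nothing to the normalization sum, so the valid positions get the same probabilities as an unpadded softmax would give them.
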

GemmGemm TNNT instances (#399) · fe52c94c

Anthony Chang authored Sep 07, 2022

* add gemm_gemm TNNT instance

* sanitize Gemm1KPack

* disable instances that failed validation on mi100

Softmax client example (#396) · 3da5c19e

Adam Osewski authored Sep 06, 2022



* Update Softmax device operation interface.

* Update ckProfiler.

* Update Softmax UT.

* Update example.

* Client example.

* Clang format

Co-authored-by: Adam Osewski <aosewski@amd.com>

02 Sep, 2022 1 commit

[Hotfix] SplitK Gemm fp32 (#401) · 75891161

zjing14 authored Sep 02, 2022

* add scripts

* fixed splitK_gemm_fp32

* clean

* clean

* use gemm_xdl_splitK_c_shuffle into profiler

* remove device_gemm_xdl_splitk.hpp

01 Sep, 2022 1 commit

add more datatype to gemm+gemm and conv+conv example (#397) · 204ef976

Chao Liu authored Sep 01, 2022

* refactor

* refactor

* adding int4/int8/fp16/bf16 for conv+conv and gemm+gemm

* adding int4/int8/fp16/bf16 for conv+conv and gemm+gemm

* clean

31 Aug, 2022 2 commits

Add examples of Conv + reduction (data type: int4, int8, bf16, fp16, fp32) (#380) · 46a675aa

Po Yen Chen authored Sep 01, 2022



* Refactor the design of DeviceGemmMultipleDMultipleR_Xdl_CShuffle

* Add 'DeviceGroupedConvFwdMultipleDMultipleR' interface

* Add DeviceGroupedConvFwdMultipleDMultipleR_Xdl_CShuffle

* Remove 'GridwiseConvFwdMultipleDMultipleR_xdl_cshuffle'

* Add 'TransformConvFwdToGemm<>' utility class (from Chao)

* Use 'TransformConvFwdToGemm<>' to shorten code

* Fix ill-formed method declaration

* Re-implement MakeRGridDescriptor_M() function

* Change problem description

* Use macro to define layout types

* Define K-reduced output tensor layout types

* Let user decide R output tensor layout

* Rename variables

* Add padding to the reduced output tensor if necessary

* Extract common code as helper method

* Remove debug message

* Add missing include directive

* Add partial fp16 Conv + Reduction example

* Add example verification code for 2D Conv problem

* Use type alias to simplify code

* Share code across different-dimension Conv problems

* Rename file/functions from run_conv_fwd* to run_convnd_fwd*

* Make example code more verbose

* Add code to support 1D & 3D Conv + Reduction on host

* Add more examples for data type: bf16, fp32

* Add example for int8

* Add custom target to group examples

* Use more general custom target name

* Change the description in error message

* Disable testing for example other than fp32

* Add example for int4 (just copy from int8)

* Fix wrong data type

* Use larger data type for intermediate tensors

* Finish int4 example

* Undefine macro PP_DEFINE_LAYOUT_TYPE() after use

* Use named variables to replace magic numbers

* Remove debug messages

* Use same A/B data type for host Conv in int4 example

* Add check for the 'RLayout' type argument

* Group same-dim-layouts together in 'LayoutSetting<>'

* Add 'final' specifier to utility classes

* Use different initialization method for examples

* Remove macro PP_DEFINE_LAYOUT_TYPE()

* Fix code-comment mismatch

* Use more reasonable initialization value for all data types

* Default use init_method=1 for all examples

* Remove never-used code

* Remove confusing out-of-date comments

* clean

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>

conv+conv (1x1 only) example using gemm+gemm (#393) · 4df6d93f
Chao Liu authored Aug 31, 2022
```
* refactor conv

* add conv+conv example, 1x1 only
```

30 Aug, 2022 2 commits

Gemm reduce examples int4/int8/fp32/bf16 (#368) · d00e6115

Adam Osewski authored Aug 30, 2022



* GEMM + Reduce max fp16+fp32

* GEMM + Max bf16 + int8

* Refactor common definitions.

* Refactor common func of mean meansquare example.

* More examples for mean meansquare.

* Update int8 examples and skip them because of random errors.

* Int4 examples.

* Fix examples for max int4/8

* Tensor conversion for int4 input data for mean meansquare example.

* Remove int4 mean_meansquare example

* Fix int8 mean_meansquare example.

- All ReductionAccData and R<N>DataType have to be F32. The INT32 data
type gives wrong results.

* Guard int4 with ifdef

* Change int8 example to add_addsquare due to div rounding err.

* Clang format

* Change the return type of common function.

* Get back int8 example with division.

* Remove int8 mean meansquare.

* Use proper cast for BF16 data type.

* Use ck::literals.

* Use proper data type for host tensors & reference.

- Use ReduceAccDataType for reference gemm output data type.
- Cast host reference output tensor to EDataType
- Fix ifdefs for int4.

Co-authored-by: Adam Osewski <aosewski@amd.com>

Padding for attention: bmm+scale+softmax+bmm kernel (#385) · 45adb736

Shaojie WANG authored Aug 31, 2022



* add padding algo for bmm+scale+softmax+bmm. Version for verification

* remove verification code

* remove comments

* add padded bmm scale softmax bmm example

* format

* refactor

* add comments for usages of padding bmm+scale+softmax+bmm

Co-authored-by: Chao Liu <lc.roy86@gmail.com>

29 Aug, 2022 2 commits
- Try to workaround flaky GemmSoftmaxGemm tests (#386) · 138faf39
  Anthony Chang authored Aug 29, 2022
```
* avoid potential hazard; flaky test issue persists

* pin down the random seed to avoid flakiness
```
- Fix the slow cpu reference batched gemm kernels. (#388) · 9061d39b
  Illia Silin authored Aug 29, 2022
```
* fix the performance of the batched gemm verification

* fix tabs
```
26 Aug, 2022 2 commits

Add an option to build CK with clang directly (#387) · 1e5b59df

Illia Silin authored Aug 26, 2022

* replace hipcc compiler with clang++

* build client app with hipcc

* build client app with clang

* add an option to build with hipcc or clang

* fix the environment for client app

* fix setting up compiler in cmake_build

* change the way the compiler is set

Fixed splitk gemm fp32 (#384) · 9881625b
zjing14 authored Aug 26, 2022
```
* add scripts

* fixed splitK_gemm_fp32

* clean

* clean
```

25 Aug, 2022 5 commits

More int4 tests. (#374) · 57fadf6f

Adam Osewski authored Aug 26, 2022



* More int4 UT.

* Disable BitwiseRepresentation UT.

* Add UT with static_cast

* Surround cout statements with #if

Co-authored-by: Adam Osewski <aosewski@amd.com>

GEMM batched/splitK/cgemm/grouped int4 examples (#383) · 3ab20fd7

Adam Osewski authored Aug 26, 2022



* Grouped GEMM int4.

* Formatting + fix K dimension for int8.

* Batched Gemm int4 example.

* CGEMM int4 example.

* Include inc files in clang-format.

* SplitK int4 example

* Refactoring of performance measurement.

* Fix #ifdef statements.

Co-authored-by: Adam Osewski <aosewski@amd.com>

Add int4 example for convnd_fwd_bias_relu_add (#375) · b73ae242

Rostyslav Geyyer authored Aug 25, 2022

* Add int4 example for convnd_fwd_bias_relu_add

* Fix AddReluAdd for building without int4 support

* Update CMakeLists.txt

* Format

* Convert int4 tensors for int8 kernel

* Fix device memory allocation

* Format

* Format

Add int4 reduction examples (#372) · d520d0cf

Qianfeng authored Aug 26, 2022

* Add int4 reduction examples

* Keep all uses of int4_t inside the preprocessor condition checks

add scripts (#382) · f246fd2c
zjing14 authored Aug 25, 2022

24 Aug, 2022 2 commits
- layernorm external api (#379) · e1a3fff6
  rocking5566 authored Aug 25, 2022
```
* Add layernorm client example

* [What] Add default make install dir to gitignore
[Why] client example need to make install
```
- Refactor the design of DeviceGemmMultipleDMultipleR_Xdl_CShuffle (#378) · 88e43744
  Po Yen Chen authored Aug 24, 2022
  