Commits · c26b46dec59ca37aebf3f46fc8530613621afa07 · gaoqiong / composable_kernel

27 Dec, 2022 14 commits
- format · c26b46de
  Anthony Chang authored Dec 13, 2022
  
- compute y dot dy · 15f1d4ad
  Anthony Chang authored Dec 13, 2022
  
- add description in example code · a3e487ca
  Anthony Chang authored Dec 13, 2022
  
- format · f1b2e521
  Anthony Chang authored Dec 07, 2022
  
- strictly follow natural indexing for traversing P tile to avoid jumping accesses (no snake pattern) · 4ae9919e
  Anthony Chang authored Dec 07, 2022
  
- can validate dV with relaxed error tolerance · b67a58c0
  Anthony Chang authored Nov 29, 2022
  
- start with dY · 8551dd43
  Anthony Chang authored Nov 21, 2022
- comment LDS bank conflict considerations · ecd5f7c9
  Anthony Chang authored Nov 25, 2022
  
- ready to plug in kernel · b1e544e2
  Anthony Chang authored Nov 16, 2022
  
- add transpose const counterpart · 4f6d52c1
  Anthony Chang authored Nov 16, 2022
  
- host softmax can run with pre-calculated stats for debug purposes · 3908c88b
  Anthony Chang authored Nov 15, 2022
  
- host attention seems to validate · 0aafc6be
  Anthony Chang authored Nov 15, 2022
  
- serialize tensor object in readable format · 25e26104
  Anthony Chang authored Nov 13, 2022
  
- helper function for transposing host tensor · a82804a4
  Anthony Chang authored Nov 11, 2022
  
15 Dec, 2022 4 commits

Add MNK padding, M = 0 support into grouped_gemm (#539) · 0345963e

zjing14 authored Dec 15, 2022



* add mnk padding, support m=0

* clean code

* clean code

Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>


disable the attention test that fails on MI100 (#540) · 11151175
Illia Silin authored Dec 15, 2022

Add interface GetTypeIdName() and GetTypeIdHashCode() for Device Op (#533) · 10c72ace
Qianfeng authored Dec 15, 2022


Add padding device_gemm_add_add_fastgelu_xdl_c_shuffle instances to enable arbitrary problem size (#535) · 9a1f2475

Rostyslav Geyyer authored Dec 14, 2022

* Add padding device_gemm_add_add_fastgelu_xdl_c_shuffle instances

* Add padding device_gemm_add_fastgelu_xdl_c_shuffle instances

* Add gemm_add_fastgelu profiler impl

* Add padding device_gemm_fastgelu_xdl_c_shuffle instances

* Add gemm_fastgelu profiler impl


14 Dec, 2022 1 commit
- Add a docker hub doc file (#538) · 74744cab
  Rostyslav Geyyer authored Dec 14, 2022
  
12 Dec, 2022 1 commit

Gridwise elementwise 2d (#466) · 0e5c264c

arai713 authored Dec 12, 2022



* added 2d gridwise elementwise

* added 2d version of device elementwise

* added example file with updated device elementwise call

* added Cmake file

* changed NumDim into 2D

* fixed compiler issues

* fixed indexing for loop step

* fixed NumDim dimension error

* changed blockID to 2D

* updated Grid Desc

* updated kernel call

* fixed 2d thread indexing

* added dimensions for example file

* commented out unused code

* changed vector load

* removed extra code

* temporarily removing vector load on 2nd dim

* changed vector load back, still causing errors

* altered indexing

* changed isSupportedArgument for 2D

* changed indexing + do/while

* fixed isSupportedArgument

* changed dimension for debugging

* fixed

* added testing printouts

* testing change

* added variables to distribute threads through both dimensions

* testing changes

* integrated variable for thread distribution into device elementwise and added as parameter for gridwise elementwise

* removed most of the extraneous code, testing with different dimensions

* testing

* removed debugging print statements

* moved 2d elementwise permute into elementwise permute directory

* fixed formatting

* removed debugging comments from threadwise transfer

Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>


08 Dec, 2022 1 commit

Make sure that GEMM sizes in K dimension are supported. (#527) · d58b7f51

Illia Silin authored Dec 08, 2022

* apply new K-dimension check in gemm_xdl_cshuffle

* add K-dim check to gemm_xdl and batched_gemm_xdl

* fix syntax

* fix syntax

* clean-up the debug output


07 Dec, 2022 3 commits
- Fix Grouped ConvBwdWeight test case failure (#524) · 614a7b1b
  Po Yen Chen authored Dec 08, 2022
```
* Use smaller tensor size in test

* Use even smaller tensor size

* Touch only failing test case inputs
```
- Add padding device_gemm_xdl instances (#529) · c7a4d361
  Rostyslav Geyyer authored Dec 07, 2022
```
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
```
- modified half function in math_v2.hpp (#528) · ce87b4f7
  guangzlu authored Dec 08, 2022
```
Co-authored-by: Chao Liu <chao.liu2@amd.com>
```
06 Dec, 2022 1 commit

Fix CI error. (#530) · d072790f

Illia Silin authored Dec 06, 2022

* ignore .git folder when doing clang-format

* fix syntax

* add backslashes before quotes

* add path filter for several extensions


02 Dec, 2022 3 commits

Fix bug where scaling may not be applied in some code path (#526) · d1567094
Anthony Chang authored Dec 03, 2022
```
* fix bug where scaling may not be applied in some code path

* more test

* revert accidental example code changes
```

Add multiple d gridwise gemm on Navi21 for ResNet50 (#517) · 23ecf0fa

ltqin authored Dec 03, 2022



* start add example

* add multiple d fp16 example

* device transfer elementwiseop to gridwise

* gridwise add multiple d

* change example for multiple d

* fix spill registers

* fix for passthrough element op

* fix int8 overflow

* change example file name

* add instance for dl multiple d

* example add DsDataType

* remove grouped_convolution_forward_dl.hpp

* add header file (was deleted before)

* fix unsupported device issue

* format

* remove passthrough check

Co-authored-by: letaoqin <letaoqin@amd.com>


[Navi3x-LWPCK-449] wmma_op + unit test (#484) · abf9cc6c

Haocong WANG authored Dec 03, 2022



* wmma_op + unit test

* add arch limitation to wmma test

* change arch limitation

* Refactor + add unit tests for all types (int4 compile failed)

* Add f32_16x16x16_bf16 unit test

* Remove int4 related

* delete deprecated test

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>


01 Dec, 2022 1 commit

Modularize ckProfiler operations (#514) · 8784a72e

Po Yen Chen authored Dec 02, 2022



* Re-structure ckProfiler source files

* Rename profiler.cpp to main.cpp

* Modularize ckProfiler operations

* Add description for profiler operations

* Use longer name to avoid name collision

* Use macro to delay expansion

* Use std::move() to avoid object copying

* Prohibit users from calling dtor

* Use macro to eliminate redundant code

* Make friend function hidden

* Add missing include directive <iostream>

* Fix wrong include directives

* Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test

Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>


30 Nov, 2022 2 commits

gemm, conv perchannel quantization (#503) · ad541ad6

rocking5566 authored Dec 01, 2022

* Use gemm_multiple_D instead

* Add gemm bias relu quantization example

* Add pure gemm quantization example

* Add quantization of perchannel conv + bias + relu example

* Refine the code

* Rename multiplier to requant_scale

* Rename the folder

* Remove redundant comment

* Rename the file. Prepare to add perchannel

* Add conv perchannel instance

* Move to quantization folder

* Add conv perchannel client example

* Apply Rangify constructor of HostTensorDescriptor & Tensor<>

* Fix merge error


BatchNorm backward instance/external API/profiler/tests (#519) · 63af525c

Qianfeng authored Dec 01, 2022

* Refine the device batchnorm-backward base API templates and data type assignments

* Remove duplicated kernel file

* Add batchnorm backward instances and external API

* Add batchnorm-backward profiler and tests

* Add client example which uses batchnorm backward external API

* Merge test/batchnorm_fwd and test/batchnorm_bwd into one directory

* Loosen the threshold for batchnorm-backward check_err()


29 Nov, 2022 3 commits

Fix split-k gemm test (#231) · 236bd148

Anthony Chang authored Nov 30, 2022



* properly return error flag; reveals bug in split-k gemm

* fix bug in split k

* update split-k test case

Co-authored-by: Chao Liu <chao.liu2@amd.com>


fix GetTypeString · 0e9c88ce
fsx950223 authored Nov 16, 2022


BatchNorm backward implementation (#461) · 44789d99

Qianfeng authored Nov 29, 2022

* Implemented batchnorm-backward Blockwise and Multiblock kernels

* Add batchnorm-backward device op

* Add batchnorm-backward host-reference op

* Add batchnorm-backward example

* Parameters renaming in batchnorm backward kernels and device op

* Change the example to loosen the threshold for ScaleDiff checking

* Add comments to explain the implementation of batchnorm-backward

* Parameters renaming again in batchnorm backward kernels

* Improve the expression calculation for performance

* Add batchnorm backward to README

* Add comments to explain inv-variance in batchnorm forward and backward

* Renaming the batchnorm forward training and inferring examples

* Add/update the comments for batchnorm-backward kernels

* Renaming again

* Add block_sync_lds between two consecutive blockwise reductions

* Move common expression 1/N out of the static_for loops

* Add dy_elementwise_op

* Renaming in backward example again

* Add checking for reduceDims in reference_batchnorm_backward

* Update comments and code format

* Rename in the comments

* Remove common expression out of the loop in reference_batchnorm_backward_nhwc_c

* Add block_sync_lds() between blockwise reduction again

* Fix comments again

* Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test


28 Nov, 2022 1 commit
- Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test (#516) · 5bf0475a
  Qianfeng authored Nov 29, 2022
25 Nov, 2022 1 commit

BatchNorm forward instance/external api/profiler/tests/client example (#511) · 4e6a5575

Qianfeng authored Nov 25, 2022



* Update to device_batchnorm_forward base class to include all template parameters for problem description

* Add batchnorm forward instances and external api

* Add batchnorm forward profiler module which uses the external api

* Add some comments in batchnorm_forward example to explain the dimensions in lengths[]

* Replace the reference_batchnorm_forward_nhwc_c by generic reference_batchnorm_forward

* Improvement to the batchnorm infer base API

* Add batchnorm forward client example which shows using the batchnorm forward external API

* Add test for batchnorm forward

* Tuning the batchnorm profiler initialized values and error threshold

* Add support for bhalf_t in instances/external api/tests

* Add support for int8_t in instances/external api/tests

* Add support for double in instances/external api/tests

* Let ScaleDataType and BiasDataType be same as XDataType and YDataType when creating instances

* Checking before running best instance in batchnorm_fwd_nhwc client example

* Add checking for YElementwiseOp in batchnorm_forward external API

* Add more types in batchnorm forward profiler

* Add more test lengths

Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>


20 Nov, 2022 1 commit

Client examples AddFastGelu and FastGelu + instances. (#509) · 43a889b7

Adam Osewski authored Nov 20, 2022



* FastGelu support for more data types.

* AddFastGelu & FastGelu instances.

* Client example.

* clang-format

* Remove unused stride variable.

* Add new line at EOF.

Co-authored-by: Adam Osewski <aosewski@amd.com>


17 Nov, 2022 1 commit
- Work around develop validation failure (#513) · 892a8d76
  Anthony Chang authored Nov 18, 2022
```
* workaround bf16 atten fwd issue on gfx908

* typo
```
15 Nov, 2022 2 commits

Add BF16 tests for batched_gemm_softmax_gemm_permute (#504) · 4c4c7328

guangzlu authored Nov 16, 2022



* fixed bug in softmax reference & add bf16 examples for batched_gemm_scale_softmax_gemm

* added bf16 tests for batched_gemm_softmax_gemm_permute

* changed format of device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp

* changed format device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp

* aligned annotations

* modified CMakeLists for examples

* add common example code of fp16/bf16 version for batched_gemm_scale_softmax_gemm_xdl

* use macro to control the instances

* added macro control into instances

* clang-format some files

* changed error tolerance for bf16

* changed index for 10_elementwise_normalization

* fixed xdlops code bug in amd_xdlops.hpp

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>


Add Conv Backward Data on Navi21 for ResNet50 (#499) · db0eb1ea

ltqin authored Nov 16, 2022



* start add example

* add device dl

* change launch kernel

* change init data method

* change example config

* add config valid check

* add instance for dl bwd

* add instance to ckProfiler

* reserver to profiler and cmakelist

* add instance to ckProfiler2

* change instance f32 config

* fix example return value

Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
