Commits · d4c84256f790d210ea5c952cb18ad13542f6c698 · gaoqiong / composable_kernel_ROCM

14 Aug, 2023 2 commits

Implement DPP8 based GEMM for Navi21 (#826) · d4c84256
Bartlomiej Wroblewski authored Aug 14, 2023

d4c84256

rocking authored Aug 15, 2023

* Do not hardcode stride

* devicePool2DFwd inherits from devicePool3DFwd

* Move instance declaration out of common

* Add dilation

* use the pool3d rank, because pool2d inherits from pool3d

* calculate Do Ho Wo for the dilation

* Fix header name

* Modify ckProfiler

* Remove pool2d instance

* Remove pool2d in profiler

* Remove pool2d and add dilation

* In the client example, this commit revises the following:
1. Add dilation.
2. Use pool3d to implement pool2d

* Refine naming and IsSupportedArgument()

* Add dilation to maxpool bwd example

* clang format

* 1. Remove useless header
2. Fix copyright
3. Refine naming

* Add layout parameter to pool fwd

* clang format

* Fix merge error

* Fix compile error

* Remove layout parameter in derived class

* Refine changelog

* Fix compile error

* Fix compiler error

* Add layout to external api and profiler

f60f0a5e
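
The "calculate Do Ho Wo for the dilation" step in the commit above can be illustrated with the standard output-length formula for a dilated window. This is a minimal sketch, not the library's code: the function name and parameter names are hypothetical, and it assumes the padded input is at least as long as the dilated window. Since pool2d is implemented through pool3d, the same formula would apply to each of the depth, height, and width axes, with the unused depth axis presumably fixed at length 1.

```cpp
#include <cassert>
#include <cstdint>

// Output length along one spatial axis (Do, Ho, or Wo) for a pooling window
// with dilation. Hypothetical helper; names are illustrative only.
// Assumes in_len + pad_left + pad_right >= dilation * (window - 1) + 1.
constexpr std::int64_t pool_out_length(std::int64_t in_len,
                                       std::int64_t window,
                                       std::int64_t stride,
                                       std::int64_t dilation,
                                       std::int64_t pad_left,
                                       std::int64_t pad_right)
{
    // A dilated window with `window` taps spans dilation * (window - 1) + 1 elements.
    const std::int64_t effective_window = dilation * (window - 1) + 1;
    return (in_len + pad_left + pad_right - effective_window) / stride + 1;
}
```

For example, an input of length 8 with window 3, stride 2, dilation 2, and one element of padding on each side gives an effective window of 5 and an output length of 3.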

11 Aug, 2023 2 commits

Add Normalization splitk instances (#829) · 03b8119e

rocking authored Aug 12, 2023

* Add normalization splitK to layernorm and groupnorm instances

* Fix bug of GetKPerThread()

* Refine naming

* clang format

03b8119e

Bump rocm-docs-core from 0.10.3 to 0.20.0 in /docs/sphinx (#844) · a5343db0

dependabot[bot] authored Aug 11, 2023

* Bump rocm-docs-core from 0.10.3 to 0.20.0 in /docs/sphinx

Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.10.3 to 0.20.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.10.3...v0.20.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>

* set min version of rocm-docs-core

---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sam Wu <sam.wu2@amd.com>

a5343db0

10 Aug, 2023 2 commits

Add the rocm5.7 RC1 compiler and use it for QA builds. (#842) · 6237bd12
Illia Silin authored Aug 10, 2023
```
* add docker for rocm5.7 RC1

* fix rocm5.7 rc1 build

* build QA with rocm5.7 rc1 compiler
```
6237bd12

Average pool backward deviceOP and example (#797) · 578142db

rocking authored Aug 10, 2023

* Add avgpool bwd reference code

* Refine naming

* Fix invalid in_element op in ref_conv

* Add example (only reference now)

* Add the full example of avgpool bwd

* Fix copyright

* Imitate MakeDescriptor from transform_conv_bwd_data_to_gemm_v1.hpp

* rename channel to c from k

* Arrange the code

* Imitate the argument from conv bwd

* Implement invoker

* Fix order of parameter in example

* Refactor reference code for different dimension

* Support different stride

* Check if argument is valid

* Fix kernel parameter for NDHWC, fastest dimension C is not reduced

* Add more data type in example

* Fix bug in example

* calculate Do Ho Wo according to the dilation

* Remove useless header

* Add comment in reference code

* Add layout parameter

* Remove layout in derived class

* Refine reference comment

578142db
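
As a rough illustration of what an average-pool backward reference computes, the sketch below handles a single 1D axis with stride, dilation, and left padding. It is not the reference code added by the commit: the function name is hypothetical, and it assumes the divisor is always the full window size (padding positions are counted), which may differ from the library's choice.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 1D average-pool backward reference (hypothetical helper, not the CK op).
// Each output gradient dout[o] is divided by the window size and accumulated
// into every input position its window covered; taps that land in the padding
// region are skipped. Assumption: the divisor is the full window size.
std::vector<float> avgpool1d_bwd_ref(const std::vector<float>& dout,
                                     std::size_t in_len,
                                     int window,
                                     int stride,
                                     int dilation,
                                     int pad_left)
{
    std::vector<float> din(in_len, 0.0f);
    for(std::size_t o = 0; o < dout.size(); ++o)
    {
        const float share = dout[o] / static_cast<float>(window);
        for(int k = 0; k < window; ++k)
        {
            // Input index touched by tap k of output position o.
            const long i = static_cast<long>(o) * stride +
                           static_cast<long>(k) * dilation - pad_left;
            if(i >= 0 && i < static_cast<long>(in_len))
                din[static_cast<std::size_t>(i)] += share;
        }
    }
    return din;
}
```

With overlapping windows (stride smaller than the window), an input position receives contributions from several outputs, which is why the loop accumulates rather than assigns.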

09 Aug, 2023 6 commits

Update the rocm version threshold to apply the -fno-offload-uniform-block flag. (#839) · cbbd172f

Illia Silin authored Aug 09, 2023

* add -fno-offload-uniform-block flag for rocm5.7 and up

* add a comment and compiler ticket number

* update the threshold rocm version

cbbd172f

Update the list of contributors. (#836) · 1b7da171

Illia Silin authored Aug 09, 2023

* add linting and update contributors list

* skip the linting and doc changes

* add Astha

* add YanXing

1b7da171

add gfx941 to the ckProfiler package (#840) · 9af519ee
Illia Silin authored Aug 09, 2023

9af519ee

Enable grouped conv with small K or C (#822) · 472fa029

Bartłomiej Kocot authored Aug 09, 2023

* Enable grouped conv with small K or C

* Add missing instances

* Refactor grouped conv fwd instances

* Fix fp16 instances since they support src_per_vec % 2 == 0

* Add generic instances

472fa029

Enable f16/f8 mixed precision mode (#820) · 9c54eaab

Rostyslav Geyyer authored Aug 09, 2023

* Enable f16/f8 mixed precision

* Add an argument to enable mixed precision

* Update for compatibility

* Add mixed precision example

* Introduce ComputeType argument

9c54eaab

add -fno-offload-uniform-block flag for rocm5.7 and up (#838) · 68026113
Illia Silin authored Aug 08, 2023
```
* add -fno-offload-uniform-block flag for rocm5.7 and up

* add a comment and compiler ticket number
```
68026113

07 Aug, 2023 2 commits

Allow building CK for specific data types and split off last remaining DL instances. (#830) · 08eb1769

Illia Silin authored Aug 07, 2023

* properly split conv_nd_bwd_data instances

* split conv2d_fwd instance data types

* split the gemm, conv2d_fwd and batched_gemm_softmax_gemm

* split the tests by data types where possible

* filter examples by DTYPES

* split few remaining examples by DTYPES

* filter most instances by DTYPES

* add new lines at end of headers, fix grouped_gemm profiler

* fix syntax

* split the ckprofiler instances by DTYPES

* split the conv2d and quantization DL and XDL instances

* fix the splitting of conv2d DL instances

* split softmax and pool_fwd tests for fp16 and fp32 types

* fix syntax

* fix the dl_int8 quantization instances isolation

08eb1769

Add wei_strides to grouped conv3d wei to keep consistency (#817) · 22443f7a

Bartłomiej Kocot authored Aug 07, 2023



* Add wei_strides to grouped conv3d wei to keep consistency

* Fix strides in client examples

* Unify backward weight api with forward

* Fix for example

* Fixes for examples

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

22443f7a

03 Aug, 2023 4 commits
- add an option to build ckProfiler package for specific architectures (#828) · 2474dddb
  Illia Silin authored Aug 03, 2023
  
  2474dddb
- Change to github_issue prefix · aac65a03
  Bartlomiej Kocot authored Aug 01, 2023
  
  aac65a03
- Rename the workaround to a proper issue name · e6a826d3
  Bartlomiej Kocot authored Aug 01, 2023
  
  e6a826d3
- Improve formatting of docs; Add a note about the DL_KERNELS flag (#825) · 8c13df07
  Bartlomiej Wroblewski authored Aug 03, 2023
```
* Improve formatting of docs; Add a note about the DL_KERNELS flag

* Change the recommended version of ROCm to 5.6
```
  8c13df07
02 Aug, 2023 1 commit

Update tuning parameter & compilation options of DeviceGemmXdl<> instance (layout=TT) (#819) · f7cc8c3b

Po Yen Chen authored Aug 02, 2023

* Enable pipeline v2 opt for layout=TT instance

* Use better thread mapping for reading A tile

* Conditionally enable pipeline v2 opt

* Allow enabling only fp16 gemm instances in profiler

* Fix formatting error

* Fix compilation error if we enable fp32 in profiler

f7cc8c3b

27 Jul, 2023 1 commit

Add s_nops after v_dot to avoid hazard (#808) · 7761e523

Bartłomiej Kocot authored Jul 27, 2023

* Add s_nops after v_dot to avoid hazard

* Fix builtin for inner_product fp16

* Skip inline version to builtin

* Add comments regarding isa

* Fix comment regarding s_nop

7761e523

26 Jul, 2023 4 commits

initial stream-k implementation with example (#699) · e7dca79d

carlushuang authored Jul 27, 2023



* initial stream-k implementation with example

* fix unexpected change in err

* improve performance a little by reorganizing the pipeline.

* improve perf a little by swizzling block idx

* add profiler

* update example

* fix spelling

* shrink karg for streamk

* support dynamic buffer using memory coherence glc_slc bit from template

* control memory coherence while constructing the dynamic buffer

* update reduction for streamk (not ready yet)

* Add template parameter to make_dynamic_buffer to support amd_buffer coherence setting

* fix build issue

* fix several bugs

* now result is correct, everything works (but has scratch)

* remove scratch by manually reset coordinate

* update device code

* fix a bug in final reduce

* fix something in example

* update async memset

* fix enum as camel case

* modify coherence enum name

* clean code and use atomic streamk by default

* remove unused var

* throw an exception if a pointer is empty

* fix format

* fix CI warning

* fix type in init

* modify CI error

* filter out on gfx10+

* restore changed example code

---------
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>

e7dca79d
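
The core idea behind stream-k is to split the total number of GEMM inner-loop iterations evenly across workgroups instead of assigning whole output tiles. The sketch below shows only that basic partitioning step under simplifying assumptions; the struct and function names are hypothetical, and the commit's actual kernel (block-index swizzling, reduction, coherence handling) is considerably more involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Half-open range [begin, end) of global loop iterations owned by one workgroup.
struct IterRange
{
    std::int64_t begin;
    std::int64_t end;
};

// Hypothetical helper: evenly split num_tiles * iters_per_tile iterations
// across num_workgroups. The first (total % num_workgroups) workgroups take
// one extra iteration. Iteration i belongs to output tile i / iters_per_tile,
// so a range may start or end in the middle of a tile; such partial tiles
// need a cross-workgroup reduction (for example, atomic accumulation).
inline IterRange streamk_iter_range(std::int64_t num_tiles,
                                    std::int64_t iters_per_tile,
                                    std::int64_t num_workgroups,
                                    std::int64_t wg_id)
{
    const std::int64_t total = num_tiles * iters_per_tile;
    const std::int64_t base  = total / num_workgroups;
    const std::int64_t extra = total % num_workgroups;
    const std::int64_t begin = wg_id * base + std::min(wg_id, extra);
    const std::int64_t end   = begin + base + (wg_id < extra ? 1 : 0);
    return {begin, end};
}
```

With 3 tiles of 4 iterations each and 5 workgroups, workgroup 1 would own iterations 3 to 5, covering the tail of tile 0 and the head of tile 1, which is the case where partial results must be combined.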

Disable DL kernels by default. (#816) · 9195435c
Illia Silin authored Jul 26, 2023

9195435c

Disable XDL kernels on unsupported HW; Add ck::is_xdl_supported (#768) · ac6d68b3

Bartłomiej Kocot authored Jul 26, 2023



* Disable XDL kernels on unsupported HW; Add ck::is_xdl_supported function (#765)

* Do not throw an error when GEMM problem is not supported.

---------
Co-authored-by: Bartlomiej Wroblewski <bwroblewski10@gmail.com>
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

ac6d68b3
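
The two changes in the commit above, gating XDL kernels on hardware support and reporting an unsupported problem without throwing, can be sketched as follows. This is a hypothetical stand-in, not the library's `ck::is_xdl_supported`: the device name is passed in as a parameter so the logic is testable, and the list of architectures is an assumption rather than something stated in the commit.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a hardware capability check. The architecture
// list below is an assumption about which targets have XDL (matrix-core)
// instructions; it is not taken from the commit.
inline bool is_xdl_supported_sketch(const std::string& device_name)
{
    return device_name == "gfx908" || device_name == "gfx90a" ||
           device_name == "gfx940" || device_name == "gfx941" ||
           device_name == "gfx942";
}

// Hypothetical argument check: returns false instead of throwing when either
// the hardware or the problem shape is unsupported, so callers can skip the
// instance and try another one.
inline bool is_supported_argument_sketch(const std::string& device_name,
                                         bool problem_shape_ok)
{
    if(!is_xdl_supported_sketch(device_name))
        return false;
    return problem_shape_ok;
}
```

Returning a boolean rather than raising an error lets a profiler or dispatcher iterate over many kernel instances and quietly skip the ones that do not apply.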

Refine the dimension of host tensor. This example only requires 1D (#812) · 016bd428
rocking authored Jul 26, 2023

016bd428

25 Jul, 2023 2 commits

Speed-up global memory reading for GEMM instances (#813) · f4ea5601
Po Yen Chen authored Jul 26, 2023
```
* Use better ThreadClusterLengths to speed up

* Update B tile reading pattern for layout=NN instance
```
f4ea5601

Add bias scalar vectorload = 1 for gemm bias gemm (#791) · 50643dd5

ltqin authored Jul 25, 2023

* first change bias load

* add bias dim and scalar vector parameter

* make CDE0BlockTransferSrcVectorDim not work

* change to instance

* add limit for CDE0BlockTransferSrcScalarPerVector

50643dd5

21 Jul, 2023 3 commits
- add ninja profiling tools to the base docker (#805) · 844b215d
  Illia Silin authored Jul 21, 2023
  
  844b215d
- add INSTANCES_ONLY cmake macro to build only instances (#807) · 7a29f711
  Illia Silin authored Jul 21, 2023
  
  7a29f711
- Grouped conv bwd wei NDHWGC/NDHWGK (#804) · 10732847
  Bartłomiej Kocot authored Jul 21, 2023
  
  10732847
18 Jul, 2023 3 commits

Grouped 3d conv backward data support (#799) · 49180fd6
Bartłomiej Kocot authored Jul 18, 2023
```
* Grouped 3d conv backward data support

* Fix comments
```
49180fd6
Remove type_convert bf16 to int32 and back (#802) · f82bd593
Rostyslav Geyyer authored Jul 18, 2023

f82bd593

Add mechanism to build CK for select data types, add Navi3x CI. (#790) · 189ea3b9

Illia Silin authored Jul 17, 2023

* allow building CK for specific data types

* add CI build and test stage on Navi3x without some int8 instances

* add missing gemm fp16 instances

* add the changes to the missed cmake file

* add empty lines at end of source files

* Do not build quantization client example on navi3 in CI

* disable batched_gemm_multi_d_int8 instances with DTYPES

* disable device_conv2d_bwd_data_instance with DTYPES

* fix ckprofiler for conv_bwd_data for int8

* properly isolate the conv_bwd_data int8 instances

* remove empty line

189ea3b9

17 Jul, 2023 1 commit

Add check for compiler GPU target support. (#800) · 4867db42

Illia Silin authored Jul 17, 2023

* check if gpu_targets are supported by compiler

* set default list of targets and filter for them

4867db42

15 Jul, 2023 1 commit
- Disable Werror to ignore xnack+ warnings (#794) · 03d3395b
  arvindcheru authored Jul 14, 2023
```
* Disable Werror to ignore xnack+ warnings
```
  03d3395b
12 Jul, 2023 1 commit

Support NHWGC conv2d_bwd_weight (#769) · 1ee99dca

Bartłomiej Kocot authored Jul 12, 2023



* Support NHWGC conv2d_bwd_weight

* Fix client example

* Fix client example

* Fix comments

* Redesign grouped_conv_bwd_weight instances

* Clang format fix

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

1ee99dca

07 Jul, 2023 1 commit
- change the build thread usage in CI (#787) · 87f2bbcf
  Illia Silin authored Jul 06, 2023
  
  87f2bbcf
06 Jul, 2023 4 commits

Add basic setup for precommit (#749) (#764) · 237f9cd3

Adam Osewski authored Jul 06, 2023



* Add basic setup for precommit

* Update README.md with instructions on installing precommit hooks

---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Bartlomiej Wroblewski <bwroblewski10@gmail.com>

237f9cd3

Split GEMM instance library & enable pipeline v2 optimization (#783) · 850144a0

Po Yen Chen authored Jul 06, 2023

* Move source file into sub-directories

* Add missing include directive

* Split DeviceGemmXdl<> fp16 instances

* Fix format

* Remove unnecessary CMakeLists.txt

* Add macros to toggle new features

* Remove debug message

* Turn off GEMM v2 pipeline optimization by default

* Fix format

* Extract duplicated string as list

* Enlarge indent in CMakeLists.txt

850144a0

Batchnorm splitk single kernel (#771) · 8f5cafaf

Qianfeng authored Jul 06, 2023

* Use dim 0 as faster dim for writing mean/var/count workspace in batchnorm multiblock method [performance]

* Add CountDataType as template parameter in blockwise_welford

* Add utility/get_shift.hpp

* Add BatchNorm multiblock single-kernel implementation

* Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a

* Renaming in device_batchnorm_forward_impl.hpp

* Tiny fix in the batchnorm_fwd profiler

* Revert "Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a"

This reverts commit d16d00919c43f10759e7b4e4d112125221ed9064.

* Use the old two-kernel batchnorm multiblock method for gfx1030

* Use the old two-kernel batchnorm multiblock method for gfx908

* use the single-kernel batchnorm multiblock method only for gfx90a

* Remove get_wave_id() from utility/get_id.hpp since it is not used

* Set true for testing running mean/variance and saving mean/invvariance in the examples

* Fix copyright wording

* Remove unneeded include in utility/get_id.hpp

* Add comments to workgroup_synchronization.hpp

* Remove unused code in gridwise_multiblock_batchnorm_forward.hpp

* Renaming in the kernels

* Remove unused kernel file

8f5cafaf
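
The multiblock batchnorm method described above relies on combining partial mean/variance/count results from several blocks. A minimal sketch of Welford's online update and the standard pairwise merge is shown below, with the count type as a separate template parameter in the spirit of the `CountDataType` change. The struct and its members are hypothetical and are not the interface of `blockwise_welford`.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical running-statistics accumulator using Welford's algorithm.
// T is the compute type; CountDataType is the type used for the sample count.
template <typename T, typename CountDataType>
struct WelfordState
{
    T mean{0};
    T m2{0}; // sum of squared deviations from the current mean
    CountDataType count{0};

    // Fold one sample into the running statistics.
    void update(T x)
    {
        ++count;
        const T delta = x - mean;
        mean += delta / static_cast<T>(count);
        m2 += delta * (x - mean);
    }

    // Merge another partial result (for example, from a different block).
    void merge(const WelfordState& other)
    {
        if(other.count == 0)
            return;
        const CountDataType total = count + other.count;
        const T delta             = other.mean - mean;
        mean += delta * static_cast<T>(other.count) / static_cast<T>(total);
        m2 += other.m2 + delta * delta * static_cast<T>(count) *
                             static_cast<T>(other.count) / static_cast<T>(total);
        count = total;
    }

    // Population variance, as used for batch statistics.
    T variance() const { return count > 0 ? m2 / static_cast<T>(count) : T{0}; }
};
```

Because the merge is associative up to rounding, partial states can be written to a workspace by each block and combined afterwards in any order.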

Move Device Ops implementations into impl directory. (#777) · f4dfc060
Adam Osewski authored Jul 06, 2023
```
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
```
f4dfc060