Commits · 73486a93a265b97ffa70775aa25d92c36f1d6fa1 · gaoqiong / composable_kernel

05 Oct, 2022 2 commits
- changed vector load · 73486a93
  Astha Rai authored Oct 05, 2022
  
  73486a93
- commented out unused code · 41bcd608
  Astha Rai authored Oct 05, 2022
  
  41bcd608
04 Oct, 2022 4 commits
- Merge branch 'gridwise_2d' of github.com:ROCmSoftwarePlatform/composable_kernel into gridwise_2d · b708807a
  Astha Rai authored Oct 04, 2022
  
  b708807a
- added dimensions for example file · be56fdef
  Astha Rai authored Oct 04, 2022
  
  be56fdef
- Merge branch 'develop' into gridwise_2d · 10d1a915
  arai713 authored Oct 04, 2022
  
  10d1a915
- fixed 2d thread indexing · 08848bb6
  Astha Rai authored Oct 04, 2022
  
  08848bb6
03 Oct, 2022 3 commits

Chao Liu authored Oct 03, 2022

* update cmake script

* update readme

* Update README.md

* add citation

* add images

* Update README.md

* update

* Update README.md

* Update CONTRIBUTORS.md

* Update README.md

* Update CITATION.cff

* Update README.md

* Update CITATION.cff

* update doc

* Update CONTRIBUTORS.md

* Update LICENSE

* update

9d8f834a

Update doc (#464) · 6de749e2

Chao Liu authored Oct 03, 2022

* update cmake script

* update readme

* Update README.md

* add citation

* add images

* Update README.md

* update

* Update README.md

* Update CONTRIBUTORS.md

* Update README.md

* Update CITATION.cff

* Update README.md

* Update CITATION.cff

* update doc

* Update CONTRIBUTORS.md

* Update LICENSE

6de749e2

update document: Readme, contributors, citation, (#463) · 473ba5bc

Chao Liu authored Oct 03, 2022

* update cmake script

* update readme

* Update README.md

* add citation

* add images

* Update README.md

* update

* Update README.md

* Update CONTRIBUTORS.md

* Update README.md

* Update CITATION.cff

* Update README.md

* Update CITATION.cff

473ba5bc

01 Oct, 2022 1 commit

Allow setting ROCM version, activate ccache, etc. (#462) · 7fc3ed76

Illia Silin authored Oct 01, 2022

* enable ccache and decouple it from MIOpen ccache use

* fix the ccache check script

* use another method to get server name

* fix syntax

* add quotes around the server name variable

* use check_host as function

* change syntax

* fix syntax

* test if server name is parsed correctly

* try different syntax

* check the env var value

* test new check node function

* add ROCMVERSION parameter and fix script syntax

* fix script syntax

* add missing instances of rocm version

* install ccache in the docker image

* do not check GPU in clang format stage, clean up old code

* update defaults and clean up

7fc3ed76

28 Sep, 2022 3 commits
- updated kernel call · 5f01c06f
  Astha Rai authored Sep 28, 2022
  
  5f01c06f
- updated Grid Desc · 1d97c3a4
  Astha Rai authored Sep 28, 2022
  
  1d97c3a4
- changed blockID to 2D · facdb52e
  Astha Rai authored Sep 28, 2022
  
  facdb52e
27 Sep, 2022 2 commits

Fix build issues, set new compiler default, etc. (#451) · b8825547

Illia Silin authored Sep 27, 2022

* add an option to select specific compiler commit

* change the logic of forcing building a docker

* add check for compiler commit in dockerfile

* compiler check syntax fix

* change compiler selection logic

* fix the new compiler build issue

* set new compiler as default, update dev-requirements

* fix jenkins syntax

* fix docker syntax

* get rid of hipcc.pl editing in jenkinsfile

* fix the hipcc.pl in both places

* try to fix the 10738 compiler linking bug

* fix syntax

* use dockerhub to store images

* use newer amd-stg-open commit as default

b8825547

fixed NumDim dimension error · 76b44c60
Astha Rai authored Sep 27, 2022

76b44c60

26 Sep, 2022 3 commits
- fixed indexing for loop step · 4dfcf974
  Astha Rai authored Sep 26, 2022
  
  4dfcf974
- fixed compiler issues · 88d5d8d0
  Astha Rai authored Sep 26, 2022
  
  88d5d8d0
- changed NumDim into 2D · 085d9d11
  Astha Rai authored Sep 26, 2022
  
  085d9d11
25 Sep, 2022 4 commits
- added Cmake file · 5da7cd69
  Astha Rai authored Sep 25, 2022
  
  5da7cd69
- added example file with updated device elementwise call · 0e2be9c2
  Astha Rai authored Sep 25, 2022
  
  0e2be9c2
- added 2d version of device elementwise · 9e07a42f
  Astha Rai authored Sep 25, 2022
  
  9e07a42f
- added 2d gridwise elementwise · ad0470b5
  Astha Rai authored Sep 25, 2022
  
  ad0470b5
23 Sep, 2022 1 commit

Fix device instance library to include all instances (#418) · 2c6d63d0

JD authored Sep 23, 2022



* fix device instance library to add all instances

* remove cppcheck from requirements.txt

Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

2c6d63d0

22 Sep, 2022 2 commits
- fix build (#434) · e9d4e893
  Chao Liu authored Sep 22, 2022
```
* fix

* fix

* add instance
```
  e9d4e893
- Replace the obsolete offload-arch flags with GPU_TARGETS and fix a bug. (#437) · aa0b0515
  Illia Silin authored Sep 22, 2022
```
* replace obsolete offload-arch flags with GPU_TARGETS

* fix a build error for client app

* replace comma with semicolon in GPU_TARGETS
```
  aa0b0515
21 Sep, 2022 3 commits

Updated the supported components (#435) · 7acbf104
Lixun Zhang authored Sep 21, 2022

7acbf104

Build the CK targets only once. (#433) · 85b0920d

Illia Silin authored Sep 21, 2022

* build CK only once, use deb package in all subsequent stages

* update jenkins file

* change prefix for build_CK stage

* update writing deb metadata to control file

* update ubuntu source for docker, script syntax for deb package metadata

* try different way to create deb metadata

* clean up DEBIAN before creating one

* fix the CI folder names, fix splitK qa

* use correct docker in all stages, separate tests for splitK verification and performance

* clean old comments, change dir before packaging

* use different package syntax

* change packaging syntax

* package with cmake

* remove unnecessary build prefix

* get rid of unnecessary paths

* change paths during unpacking

* change script syntax while unpacking

* get rid of unnecessary steps

* get rid of comments in the scripts

* use double quotes for scripts

* add ccache during build, try dpkg -x

* pull and install each package separately

* use full package names

* try to use stashing for packages

* change stash/unstash syntax

* move unstash out of shell, run tests on any gpu node

* unpack each package separately

* try re-using existing workspace

* merge the build and test stages, only stash ckProfiler

* merge the build and test stages, only stash zipped ckProfiler

* fix syntax

* add GPU check before build and test, rename docker to usual name

85b0920d

fixed G offset calc for long_index (#428) · 01876afa
zjing14 authored Sep 21, 2022

01876afa

20 Sep, 2022 6 commits

fix build (#427) · 567f70f5
Chao Liu authored Sep 20, 2022
```
* fix build

* fix build
```
567f70f5

MNKO padding support on bmm+masking+scale+softmax+bmm+permute (#425) · ebab84b6

Shaojie WANG authored Sep 21, 2022



* add lower triangle bmm

* init code for tile skipping

* functionality right with lower triangle mask

* add decoder lower triangular mask calculation

* use 7*13 group

* fix n2 compute error

* attention with lower triangle mask with tile skipping

* add template to distinguish masking kernel

* rename template and remove default template value

* remove lower triangle gemm reference struct

* add some comments on example

* add 10 instance for masking bmm + scale + softmax + bmm + permute kernels

* add test

* add test file

* add gtest for bmm masking scale softmax bmm permute

* clang-format

* fix compile error

* check left bottom corner for tile skipping

* fix error: check left bottom corner for tile skipping

* add k padding

* add test and instance for MNK padding

* passing a mask struct

* fix instances

* delete unused comments

* format

Co-authored-by: danyao12 <yaodan@dc-smc-13.amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

ebab84b6

use rocm5.2 compiler as default, use same flags for amd-stg-open as for release (#426) · 9f7c1930
Illia Silin authored Sep 20, 2022

9f7c1930

Group norm (#417) · 4eba345f

rocking5566 authored Sep 20, 2022



* Add groupnorm example by layernorm
1. Reference is not ready
2. Shape of gamma and beta needs to be fixed

* Allow the shape of gamma and beta to be the same as x

* Modify test, instance and client example

* [What] Fix layernorm bug for more than 2 dimensions.
[Why] We need to get the upper length from the merge transform instead of the embed transform.

* Add reference for groupnorm

* Fuse sigmoid after groupnorm

* [What] Rename original layernorm into layernorm2d
[Why] Prepare to add groupnorm using layernorm5d

* clang-format

* Add groupnorm test

* Refine error message

* Add groupnorm ckProfiler

* Test groupnorm kernel from device_instance

* update example

* update profiler

* Fix test naming

* Fix argc number

* Move descriptor and sweeponce to argument for quick debugging

Co-authored-by: Chao Liu <chao.liu2@amd.com>

4eba345f

Add 'Permute' device op & example (#408) · f584ab0c

Po Yen Chen authored Sep 20, 2022

* Add example folder for 'DeviceElementwise'

* Re-structure example files

* Move common parts into common.hpp

* Use more strict input

* Add more helper methods in 'DeviceElementwise'

* Use more specific method to write example

* Allow specify problem through command line argument

* Allow specify problem 'axes' through command line argument

* Add check to template type argument

* Add transpose_shape() to generalize shape permute

* Generalize transpose utility functions

* Use better name for tensor indices

* Add checks in helper functions

* Remove debug messages

* Refine error message for check_err()

* Generalize variable naming in example code

* Add device op 'DevicePermute'

This device op is clone of 'DeviceElementwise'

* Use 'DevicePermute' device op in example

* Remove 'elementwise' from identifiers

* Remove 'elementwise' from file paths

* Remove base class of 'DevicePermute'

* Let 'DevicePermute' inherit from 'BaseOperator'

* Add simple type traits to validate device op type

* Add static_assert() to check type constraints

* Create 'DevicePermuteBase' to generate methods

* Use indirect base type to generate methods

* Remove 'is_device_op<>' type traits

* Only accept single-input-single-output for 'DevicePermute'

* Simplify 'DevicePermute' interface

* Re-format 'DeviceElementwise'

* Use CRTP to generate overridden virtual method

* Remove unnecessary include directives

* Distinguish input & output shape in 'DevicePermute'

* Passing 'axes' to 'DevicePermute'

* Use more reasonable return value for Invoker::Run()

* Add 'GridwisePermute' kernel

This kernel is a clone of 'GridwiseElementwise_1D'

* Remove no-longer used type argument

* Check if input/output shape meet the requirement

* Remove no-longer used method

* Remove never-entered-if-clause

* Change problem description for 'DevicePermute'

* Transform descriptor into 3 dimensions

* Add debug code to verify result

* Add comment to indicate template argument location

* Add N/H/WPerBlock template parameter to 'DevicePermute'

* Rename 'GridwisePermute' to 'GridwiseCopy'

* Check tensor descriptor dimensions in 'GridwiseElementwise_1D'

* Add missing include directive

* Add 'BlockSize' parameter to 'DevicePermute'

* Remove no-longer used method

* Add 'BlockToTileMap' for 'GridwiseCopy'

* Use the normal Block2TileMap convention

* Rename 'BlockToTileMap' as 'Block2TileMap'

* Fix most of compilation errors

* Let 'Block2TileMap' map block to 2d coordinate

* Allow data transfer in 'GridwiseCopy'

* Fix wrong output descriptor for 2nd blockwise copy

* Rename 'GridwiseCopy' as 'GridwisePermute'

* Remove '1d' in identifiers

* Remove commented-out codes

* Remove 'MPerThread' template parameter

* Separate template parameters

* Unify variable naming convention

* Use more verbose way to create expressions

* Add template parameter 'InBlockLdsExtraW'

* Release the constraint on In/OutGridDesc

* Use data type directly as template argument

* Re-arrange template arguments for blockwise copy

* Remove no-longer used template parameters

* Embed layout in the variable names

* Add GridwisePermute::CheckValidity()

* Extract local types as template parameters

* Rename local type alias

* Add more template parameters (vector width related)

* Calculate new SrcVectorDim/DstVectorDim after merge descriptor dimensions

* Fill tensor values start from 1

* Re-format example code

* Avoid too-large block id

* Add comment

* Make sure 'SrcVectorDim' is not same as 'DstVectorDim'

* Add check for the 'VectorDim' & 'ScalarPerVector' template params

* Let 'DstVectorDim' equals 'SrcVectorDim' after transpose out grid desc

* Remove no-longer used template parameter 'NPerBlock'

* Fix wrong descriptor creation logics

* Specify problem in each example

* Use better example name

* Add new example 'example_permute_NxHxW_fp32'

* Add example for demonstrating bundle multiple elems in tensor

* Add support to permute multiple elements together

* Change the default problem size

* Add span<> class template

* Use span<> to generalize check_err() interface

* Fix ambiguous ctor call

* Avoid creating unnecessary objects

* Use helper functions to simplify example code

* Add example for 4xfp16 permute

* Disable failed-to-compile example

* Add check for the NUM_ELEMS_IN_BUNDLE

* Remove redundant parameter in helper lambda function

* Add check for the input tensor type's byte-size

* Check scalar-per-vector with padded length

* Use more verbose name to avoid name collision

* Use fixed 'VectorDim' & 'ScalarPerVector' for LDS

* Embed shape info in name of descriptor constructor

* Rename example folder '36_permute' into '37_permute'

* Avoid using too-large LDS in kernel code

* Remove redundant example

* Use switch() to group similar codes

* Add const to the span<> type argument

* Simply initialize tensor with floating point values

* Use fp16 as data type in all examples

* Enlarge tensor size in example

* Enlarge N-dim in example

* Add check for the bundled type in example

* Use stricter error threshold

* Remove global load/store loop in kernel code

* Measure execution time by default

* Use faster device op config for example 'NxHxW_fp16'

* Use faster device op config for example '1xHxW_fp16'

* Use faster device op config for example 'HxWx4_fp16'

* Remove cmd arg parsing logics

* Rename functions

* Extract bundle permutation logic out

* Simplify permute bundle example

* Add Tensor<>::GetElementSpaceSizeInBytes()

* Add Tensor<>::data()

* Use new methods to simplify code

* Use type alias to replace duplicated code

* Use existing method to shorten code

* Allow FillUniformDistribution to accept range argument

* Initialize random values in range

* Add Tensor<>::size()

* Use more meaningful names in permute bundle example

* Use more meaningful names in permute element examples

* Use rangified copy() to copy elements

* Use function return value directly to eliminate variables

* Add to_array() conversion tool to eliminate more variables

* Add Tensor<>::AsSpan<>() to create view of tensor values

* Use AsSpan() to shorten check_err() calls

* Remove no-longer-used 'using' directives

* Move 'using' directive to proper code position

* Remove redundant variables

* Remove useless static_assert()

* Add check for range types

* Declare variable right before first use

* Move long return type as trailing return type

* Add BaseInvokerCRTP<> class template to generate method

* Create new base type for 'DevicePermute' implementations

* Move 'NumDim' template param to the first

* Rename 'DevicePermute' to 'DevicePermuteImpl'

* Add 'noexcept' specifier to CRTP generated method

* Move 'Block2TileMap' definition into 'GridwisePermute'

* Use type alias to reduce code

* Unify naming style in 'DevicePermute'

* Add comments in 'GridwisePermute'

* Rename permute example folder

* Use std::cerr to report error

* Use larger shape in examples

* Rename '38_permute' to '39_permute'

* Make sure we use unsigned type for shape & indices

* Remove opt-ed out assertion

* Remove template BaseInvokerCRTP<>

f584ab0c

Add batched attention special kernel instances (#424) · 7c788e10
Anthony Chang authored Sep 20, 2022
```
* sanity check

* add attribution

* add irregular k tile size for batched attention

* format
```
7c788e10

19 Sep, 2022 3 commits

work around inline asm potential hazard using intrinsic (#416) · c6b8b472
Anthony Chang authored Sep 20, 2022

c6b8b472

Grouped batched attention + permute (#412) · 9287b7c6

Anthony Chang authored Sep 20, 2022

* grouped attn without batch validates; now move toward grouped batched attn

* grouped batched attention

* working

* remove debug logging

clean up

clean up

* reintroduce g_ prefix back to host tensor variables

* format

* rename file

* restore old file

* rename

* consolidate padded/non-padded attention example

* harmonize padding specialization in attn examples

9287b7c6

Conv bwd data multiple d (#404) · 27858374

Shaojie WANG authored Sep 20, 2022



* init commit of convnd bwd data

* begin compiling example

* have a first version that produces a correct result

* refine device level launch kernel code

* add more instances in example and get right results

* clang-format

* format example file

* add more instances

* fix instances

* adding conv_bwd_data multiple_d

* adding conv_bwd_data multiple_d

* adding conv_bwd multiple d

* adding conv_bwd multiple d

* adding conv_bwd multiple d

* refactor

* refactor

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* refactor

* update conv fwd's bias impl

* refactor

* reorg file

* clean up cmake

* clean

* clean

* clean

Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

27858374

16 Sep, 2022 1 commit
- disable print for group conv multiple D (#421) · 43c898f6
  Chao Liu authored Sep 16, 2022
  
  43c898f6
14 Sep, 2022 1 commit

batched_gemm + multiple_d + gemm + multiple_d (#394) · 370efa6c

ltqin authored Sep 15, 2022

* refactor

* start

* add device gemm file

* add BatchStrideD0

* add stridd0

* add gridwise file

* add d0 parameters to gridwise gemm

* add c layout transformer

* add d0 threadwise copy

* init kernel

* init kernel

* regular code

* nm desc put to out

* kernel parameter can not use reference

* host add bias+gelu

* run right for bias+gelu

* change AddFastGelu into another file

* interface add d1 bias parameters

* add d1 parameter to argument

* add d1 parameter to gridwise

* first complete code, not verified

* gelu change to relu and GetElementSpaceSize bug

* add instance

* start add to ckprofiler

* ckprofiler finish code

* change input parameter for ckProfiler

* fix host bias+gelu bug

* show help for ckProfiler

* fix bug for launch kernel ignore parameters

* add pad and fix about bug

* multiple d0

* add dynamic d0_element_op

* change profiler and instance to multiple d0

...

370efa6c

13 Sep, 2022 1 commit

Upgrade the OS and ROCM versions. (#411) · b22ebd44

Illia Silin authored Sep 13, 2022

* upgrade the OS and ROCM versions in CK docker

* add cxx flags to link code with rocm5.2 and ck-9110 compiler

* rename the docker image

* run ONNX gemms using init=1

b22ebd44