Commits · 2c1ed8b2138ea1308dae4087d9ed1f5d8ab52766 · gaoqiong / composable_kernel

19 Jun, 2022 1 commit

GEMM with Multiple Source, GEMM+Bias+Add+FastGeLU example and ckProfiler (#241) · 56adf7e9

Chao Liu authored Jun 19, 2022

* add gelu and fast_gelu

* added GeLU and fast GeLU

* clean up

* add gemm+fastgelu example

* add gemm+gelu instances

* update profiler

* clean up

* clean up

* adding gemm+bias+activation

* clean

* adding bias

* clean

* adding gemm multiple d

* debugging

* add gemm bias add fastgelu

* rename, clean

* refactoring; add readme

* refactor

* refactor

* refactor

* refactor

* refactor

* refactor

* fix

* fix

* update example

* update example

* rename

* update example

* add ckProfiler

* clean

* clean

* clean

* clean

* add comment

* use type_convert

* clean

* clean element wise op

56adf7e9
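
To make the fused operation in this commit concrete, here is a minimal host-side sketch of GEMM + bias + add + FastGeLU. It is not code from the repository: the function names, the row-major layout, and the per-column bias shape are assumptions for illustration. The log does not show the commit's exact FastGeLU formula, so the widely used tanh approximation stands in for it.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Tanh-based GeLU approximation. Assumption: the commit's exact formula is
// not shown in the log, so the common tanh form is used here.
inline float fast_gelu(float x)
{
    const float sqrt_2_over_pi = 0.7978845608f;
    return 0.5f * x * (1.0f + std::tanh(sqrt_2_over_pi * (x + 0.044715f * x * x * x)));
}

// Hypothetical host reference, row-major storage:
// E[m][n] = fast_gelu(sum_k A[m][k] * B[k][n] + bias[n] + D1[m][n])
inline std::vector<float> gemm_bias_add_fast_gelu(const std::vector<float>& a,
                                                  const std::vector<float>& b,
                                                  const std::vector<float>& bias,
                                                  const std::vector<float>& d1,
                                                  std::size_t M,
                                                  std::size_t N,
                                                  std::size_t K)
{
    assert(a.size() == M * K && b.size() == K * N);
    assert(bias.size() == N && d1.size() == M * N);

    std::vector<float> e(M * N);
    for(std::size_t m = 0; m < M; ++m)
    {
        for(std::size_t n = 0; n < N; ++n)
        {
            float acc = 0.0f; // accumulate the GEMM result in float
            for(std::size_t k = 0; k < K; ++k)
                acc += a[m * K + k] * b[k * N + n];
            // Bias, residual add, and activation are applied in one pass.
            e[m * N + n] = fast_gelu(acc + bias[n] + d1[m * N + n]);
        }
    }
    return e;
}
```

A reference like this is mainly useful for checking a fused device kernel against an unfused host computation.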

17 Jun, 2022 4 commits

Regulate reduction accumulator operations and Element-wise operations (#274) · 1f543bfa

Qianfeng authored Jun 18, 2022

* Remove template from Reduction operation classes and add template to their operator() and GetIdentityValue() interfaces

* Change to unary elementwise operators and the reduce_unary_operator (class for mapping) and dependent variations in all host layers

* Remove the data type template parameter from reduce_binary_operator (class for mapping) and dependent variations in host layers

* Add InMemoryDataOperatonSupportedOnDataType to check the matching between data type and InMemoryDataOperation

* Use struct-scope operator template instantiation for binary and unary element-wise operations

* Change a few more elementwise operations to use template for operator()

* Tiny correction in Normalize operator

* Add static_assert to check the data type applicability for some reduction accumulator and element-wise operations

* Correction in some examples with regard to using ReduceAccDataType

* Use static_assert for UnaryDivide

* Update to merged codes to use Element-wise operations and Reduction Accumulator operations correctly

* Tiny fix with regard to SetWorkSpacePointer()

1f543bfa
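
The interface change described in this commit (non-templated operation classes whose `operator()` and `GetIdentityValue()` are templates, guarded by `static_assert`) can be sketched roughly as follows. The `Add` and `Max` structs and the `reduce` helper are hypothetical stand-ins, not the library's actual definitions.

```cpp
#include <cassert>
#include <limits>
#include <type_traits>
#include <vector>

// Hypothetical accumulator ops: the struct itself is not a template;
// the data type is chosen per call through the member templates.
struct Add
{
    template <typename T>
    static constexpr T GetIdentityValue()
    {
        return static_cast<T>(0);
    }

    template <typename T>
    constexpr void operator()(T& a, T b) const
    {
        static_assert(std::is_arithmetic<T>::value, "Add needs an arithmetic type");
        a = a + b;
    }
};

struct Max
{
    template <typename T>
    static constexpr T GetIdentityValue()
    {
        return std::numeric_limits<T>::lowest();
    }

    template <typename T>
    constexpr void operator()(T& a, T b) const
    {
        static_assert(std::is_arithmetic<T>::value, "Max needs an arithmetic type");
        if(a < b)
            a = b;
    }
};

// Generic reduction: start from the identity value, fold every element in.
template <typename Op, typename T>
T reduce(const std::vector<T>& values)
{
    T acc = Op::template GetIdentityValue<T>();
    Op op{};
    for(T v : values)
        op(acc, v);
    return acc;
}
```

With this shape, one `Add` type serves `int`, `float`, and `double` reductions, and an unsupported type fails at compile time rather than at run time.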

use universal workspace pointer in bwd-weight (#286) · 63cdd923
Shaojie WANG authored Jun 18, 2022

63cdd923
add p_workspace to baseargument (#275) · c7a96ed5
ltqin authored Jun 17, 2022

c7a96ed5

Gemm + bias + relu + add + layernorm (#272) · 6eb55499

rocking5566 authored Jun 17, 2022

* Copy "gemm reduce" to "gemm bias add reduce"

* Implement gemm bias add reduction

* Fix compiler error due to merge from develop

* Add tensor operation for gemm + bias + add + reduce

* Add gemm_bias_add_reduce to ckProfiler

* Add c1 functor

* Refine type

* Use reduceAccDataType instead of explicitly float

* Change to use check_err()

* Do relu in float32 instead of bhalf_t, because bhalf_t is unsigned

* Refactor relu. using type_trait instead of overloading

* Rename DxsReduceAccElementwiseOperation to DxsReduceAccElementwiseOperation

* Fix denominator

* Refine naming

* Fix denominator in host

* Remove useless include header

* Use AccDataType

* Fix static_cast order

* Refine type

* [What] Remove tuple type in the base class
[Why] External API depends on the base class. If the base class is tied to a type, we will need many classes for different types

6eb55499
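
The note above about doing relu in float32 because bhalf_t is unsigned can be illustrated with a small sketch. It assumes bf16 is held as raw bits in an unsigned 16-bit integer (the upper half of an IEEE float32). The type alias and helper names are hypothetical, and the float-to-bf16 conversion truncates instead of rounding to keep the example short.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumption: bf16 is stored as the raw upper 16 bits of a float32.
using bhalf_raw = std::uint16_t;

inline float bf16_to_float(bhalf_raw h)
{
    std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Truncating conversion (no rounding), enough for this illustration.
inline bhalf_raw float_to_bf16(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return static_cast<bhalf_raw>(bits >> 16);
}

// Incorrect: compares the unsigned bit pattern, so a negative value
// (sign bit set, large unsigned number) passes through unchanged.
inline bhalf_raw relu_on_raw_bits(bhalf_raw x)
{
    return x > 0 ? x : static_cast<bhalf_raw>(0);
}

// Correct: widen to float32, apply relu there, narrow back.
inline bhalf_raw relu_via_float(bhalf_raw x)
{
    const float f = bf16_to_float(x);
    return float_to_bf16(f > 0.0f ? f : 0.0f);
}
```

For a negative input such as -2.0, the raw-bit version returns the input unchanged, while the float32 version clamps it to zero.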

16 Jun, 2022 1 commit

example for convnd bwd weight bf16 splitk (#265) · 561ec12f

Shaojie WANG authored Jun 17, 2022

* add GetWorkSpaceSize to base arg and make an example on convnd_bwd_weight

* add bwd weight for bf16: init

* remove redundant compute

* use datatype and split k to check whether a workspace is used

* remove unused computation for work space size

* add some code for bfp16

* add device/grid unary op

* add unary type convert to bwd-weight example

* support bf16 splitk kernel for convnd bwd weight

* 1. remove comments. 2. add checkvalidity. 3. add gridsize computation

* add workspace size check

* fix format

* change function name

561ec12f

15 Jun, 2022 1 commit
- clean up; add comment · b86b318b
  Anthony Chang authored Jun 15, 2022
  
  b86b318b
02 Jun, 2022 7 commits
- use old ctile to avoid conv2d fwd bias relu add compute error (#271) · 1c5d06f2
  Shaojie WANG authored Jun 03, 2022
  
  1c5d06f2
- amend · 54d032b0
  Anthony Chang authored Jun 02, 2022
  
  54d032b0
- reflect reduction API's recent change · f8c44314
  Anthony Chang authored Jun 02, 2022
  
  f8c44314
- activation in correct order · 6c496076
  Anthony Chang authored Jun 02, 2022
  
  6c496076
- fully implemented gemm + bias + activation + add + norm · 93235bb4
  Anthony Chang authored Jun 02, 2022
  
  93235bb4
- explicit cast · 31b3f1dc
  Anthony Chang authored Jun 02, 2022
  
  31b3f1dc
- Unify the naming of the math functions used by the host and kernel (#262) · 86185bd7
  Qianfeng authored Jun 02, 2022

* Use the unified naming for math functions on host and HIP kernel

* Corresponding change/simplification in reduction host/profiler/examples due to unified math functions renaming

* Renaming GetReductionZeroVal() to GetIdentityValue()

* Tiny renaming in profile_reduce_impl.hpp

* More renaming in profile_reduce_impl.hpp

* Replace zeroVal by identityVal

* Remove ck_ prefix in the naming of ck::math provided functions

  86185bd7
01 Jun, 2022 1 commit
- correct naming · a537a8aa
  Anthony Chang authored Jun 01, 2022
  
  a537a8aa
31 May, 2022 10 commits

Pass gemm_descs for grouped gemm via __constant__ buff (#232) · b6eaf3eb

zjing14 authored May 31, 2022

* moved gemm_descs_args into const buff

* use CK_CONSTANT_ADDRESS_SPACE instead of global constant

* clean

* moved hipMemAlloc outside of deviceOp

* add SetWorkSpacePointer

* fix ignore

b6eaf3eb

Multi-kernel CGEMM (#230) · 7b1e2c37

myamlak authored May 31, 2022

* Reference CGEMM + test stub

* Format.

* Incomplete simple implementation

* Library instances

* Sketch of tests

* Test fixes.

* Example added

* Cosmetics

* Add elementwise operation kernel and example

* Add comment

* Add template argument for dim. Prepare to support multiple dimensions

* Rename example

* Support 1 dimension

* Add static assert

* Add comment

* Second auxiliary buffer added

* Extract pad

* Remove redundant argument

* Support any dimension for elementwise operation

* Remove line

* Let it be the multiple number of CU

* Move thread per block to the parameter of constructor

* Consuming binary ops to do A+B / A-B

* Fix + cosmetics + bf16 test commented out temporarily

* Format

* Enabling bf16 test

* Revert "Enabling bf16 test"

This reverts commit f497e2ba.

* Fix + test reenabled

* fix build

* Revert "fix build"

This reverts commit d7310238.

* post PR #235 merge fix

* amend

* Single workspace for cgemm + helper

* Perf calc fix

* Review remarks: static_cast

* Review remarks: binary ops templated

* Cleaning

* Removal of instances and their tests

* Review remarks from aosew addressed

* Review remark: unnecessary attribute

* Post-merge fixes

* Restrict 4gemm to PassThrough + bug fix

* Review remarks

* update licence

* change cgemm example to fp16
Co-authored-by: rocking <chunylai@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Anthony Chang <ac.chang@outlook.com>

7b1e2c37
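
The multi-kernel approach in this commit (real GEMMs combined by element-wise A+B / A-B, with auxiliary buffers) matches the standard identity for a complex product: the real part is Re(A)Re(B) - Im(A)Im(B) and the imaginary part is Re(A)Im(B) + Im(A)Re(B). The sketch below is a host-side illustration of that decomposition. All names and the split real/imaginary storage are assumptions, not the repository's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<float>; // row-major, real-valued

// Plain real GEMM: C (M x N) = A (M x K) * B (K x N).
inline Matrix real_gemm(const Matrix& a, const Matrix& b,
                        std::size_t M, std::size_t N, std::size_t K)
{
    assert(a.size() == M * K && b.size() == K * N);
    Matrix c(M * N, 0.0f);
    for(std::size_t m = 0; m < M; ++m)
        for(std::size_t n = 0; n < N; ++n)
            for(std::size_t k = 0; k < K; ++k)
                c[m * N + n] += a[m * K + k] * b[k * N + n];
    return c;
}

// Hypothetical container: real and imaginary parts kept in separate buffers.
struct ComplexMatrix
{
    Matrix re;
    Matrix im;
};

// Complex GEMM from four real GEMMs plus element-wise subtract and add.
inline ComplexMatrix cgemm_4gemm(const ComplexMatrix& a, const ComplexMatrix& b,
                                 std::size_t M, std::size_t N, std::size_t K)
{
    ComplexMatrix c{Matrix(M * N), Matrix(M * N)};

    // Two auxiliary buffers hold the partial products for each output part.
    Matrix aux0 = real_gemm(a.re, b.re, M, N, K);
    Matrix aux1 = real_gemm(a.im, b.im, M, N, K);
    for(std::size_t i = 0; i < M * N; ++i)
        c.re[i] = aux0[i] - aux1[i];

    aux0 = real_gemm(a.re, b.im, M, N, K);
    aux1 = real_gemm(a.im, b.re, M, N, K);
    for(std::size_t i = 0; i < M * N; ++i)
        c.im[i] = aux0[i] + aux1[i];

    return c;
}
```

As a quick check, multiplying the 1x1 matrices (1 + 2i) and (3 + 4i) gives -5 + 10i.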

clang-format · dde01029
Anthony Chang authored May 31, 2022

dde01029
change cshuffle precision type to f16; clean up · 597155e8
Anthony Chang authored May 31, 2022

597155e8
use LDS mem pool for reduction workspace · bf44991f
Anthony Chang authored May 31, 2022

bf44991f
rename kernel template param to reflect its dual use · 3db406f0
Anthony Chang authored May 31, 2022

3db406f0
update workgroup mapping · 728384fa
Anthony Chang authored May 31, 2022

728384fa
AccElemOp for gemm outputs prior to feeding to layernorm · ddc76a8b
Anthony Chang authored May 31, 2022

ddc76a8b
c0 bias/beta/gamma now have their own precision type · 12db5b6d
Anthony Chang authored May 31, 2022

12db5b6d

Minor fix for recent PR (#260) · 85fc91c3

Chao Liu authored May 30, 2022

* fix example

* update IsSupportedArgument

* fix

* disable fp64 conv example as test

85fc91c3

30 May, 2022 9 commits

gemm + layernorm (#261) · d32a67a9

rocking5566 authored May 31, 2022

* Implement reduction mean and reduction square mean

* Refine file name

* Add reduce mean and square mean

* Fix parameter name

* Add normalize device op (not implement invoker::run())

* Remove epsilon

* Refine deviceop

* Add 5ary elementwise for normalization

* Add layernorm example

* layerNorm verification

* Fix compiler error due to merge from develop

* Fix typo

* Fix compile error

* Refine naming

* [What] Support non pointer for invoker and argument
[Why] Sync coding style with gemm

* Refine folder name

* Refine class name

* Evaluate perf of the kernel

* Fix compile error

* [What] Refine perf evaluation in example of gemm + reduction
[Why] Evaluation of gemm + reduction may cause verification to fail, because evaluation does not initialize global memory

* clang-format

d32a67a9
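
The steps listed in this commit (reduce to a mean and a mean of squares, then normalize with an element-wise op) follow the usual layer-normalization arithmetic, where the variance is computed as E[x^2] - E[x]^2. Below is a minimal host-side sketch under that assumption. The function name, the row-major layout, and the explicit `epsilon` parameter are illustrative choices and are not taken from the commit.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference: normalize each row of an M x N row-major matrix.
// y = (x - mean) / sqrt(mean_sq - mean^2 + epsilon) * gamma + beta
inline std::vector<float> layernorm_rows(const std::vector<float>& x,
                                         const std::vector<float>& gamma,
                                         const std::vector<float>& beta,
                                         std::size_t M,
                                         std::size_t N,
                                         float epsilon)
{
    assert(N > 0 && x.size() == M * N);
    assert(gamma.size() == N && beta.size() == N);

    std::vector<float> y(M * N);
    for(std::size_t m = 0; m < M; ++m)
    {
        // Two reductions over the row: sum and sum of squares.
        float sum = 0.0f, sq_sum = 0.0f;
        for(std::size_t n = 0; n < N; ++n)
        {
            const float v = x[m * N + n];
            sum += v;
            sq_sum += v * v;
        }
        const float mean     = sum / static_cast<float>(N);
        const float mean_sq  = sq_sum / static_cast<float>(N);
        const float variance = mean_sq - mean * mean;
        const float inv_std  = 1.0f / std::sqrt(variance + epsilon);

        // Element-wise step combining x, mean, the variance term, gamma and beta.
        for(std::size_t n = 0; n < N; ++n)
            y[m * N + n] = (x[m * N + n] - mean) * inv_std * gamma[n] + beta[n];
    }
    return y;
}
```

Computing the variance from the two means lets both reductions run in a single pass over the GEMM output, at the cost of some numerical robustness compared with a two-pass variance.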

clang-format · d08aa99e
Anthony Chang authored May 30, 2022

d08aa99e
clang-tidy and additional comments · ebdb48ae
Anthony Chang authored May 30, 2022

ebdb48ae
make C0 precision type consistent with C · 7392e40c
Anthony Chang authored May 29, 2022

7392e40c
tidy up · ac6977f7
Anthony Chang authored May 29, 2022

ac6977f7
initial layernorm implementation · 2d91fd12
Anthony Chang authored May 29, 2022

2d91fd12
initial stub from regular gemm impl · 83fde45b
Anthony Chang authored May 29, 2022

83fde45b
add squared add reduction op; allows sq sum · 7f3c6e28
Anthony Chang authored May 29, 2022

7f3c6e28
dump lds content in appropriate precision type · 8c144c7a
Anthony Chang authored May 29, 2022

8c144c7a

27 May, 2022 1 commit

Fixing conv bug (#258) · 91d8b7d6

Chao Liu authored May 27, 2022



* debugging conv

* fix oversight where ctile map is constructed before initializing c desc

* example program should return an error code

* clean up

* changed Block2CTileMap in conv2d and convnd

* clean up

* clean up

* cleanup
Co-authored-by: Anthony Chang <ac.chang@outlook.com>

91d8b7d6

26 May, 2022 1 commit

Add FP64 XDL GEMM built-in function (#199) · 3e6c2610

ltqin authored May 27, 2022



* add intrin_mfma_f64_16x16x4f64

* add example

* gemm reference add double data type

* change init data

* fix M N PerXdlops

* fix ifdef

* add comparison config

* add conv fwd example

* format log out

* change rc matrix register layout

* reorganize example

* reorganize example 2

* format, because of merging develop

* fix call impl adding acc data type

* lost ;

* add compiler warning

* change example tuning parameters

* add test for fp64

* add instance

* add test/gemm/gemm_fp64.cpp

* fix get name issue

* remove some tuning parameter

* fix conflict

* format

* use integer value for GEMM test

* add acc data type

* remove typeid because fp16

* fix streamconfig etc bug from merging develop

* format

* remove test_gemm_xdl_fp64

* add AccDataType

* AccDataType problem
Co-authored-by: qinletao <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

3e6c2610

25 May, 2022 3 commits

Hotfix binary elementwise (for broadcast on fastest axis) (#254) · 82d7d993

rocking5566 authored May 26, 2022



* Support different length of ScalarPerVector

* Add example of broadcast on fastest axis

* Typo

* Refine fastest example

* Add dimension check

* Modify fastest broadcast example to 3d

* Require users to give scalarPerVector explicitly

* 1. Add CScalarPerVector
2. Setting scalarPerVector to 1 is needed not only for broadcast on the fastest axis

* Rename var

* Move IsScalarPerVectorValid() inside IsSupportedArgument()

* Separate GridDesc_M0 into A, B and C

* rename var

* Rename var of length
Co-authored-by: rocking <chunylai@amd.com>

82d7d993

Tensile-style block to C tile map (#239) · e579c9e5

Anthony Chang authored May 25, 2022

* fix build

* Revert "fix build"

This reverts commit d7310238.

* post PR #235 merge fix

* amend

* adds tensile-style c-tile map

* make it dynamic version

* add k-split flavor tile map

* apply tensile-style tile map to all xdl gridwise gemms

* remove dead code
Co-authored-by: Chao Liu <chao.liu2@amd.com>

e579c9e5

minor fix for recent PR (#255) · 61851ae2
Chao Liu authored May 24, 2022

* minor fix

* clean

61851ae2

24 May, 2022 1 commit

Navi21 gemm (#197) · 40b59a63

Jianfeng Yan authored May 24, 2022



* start adding navi21 GEMM

* navi_gemm_km_kn_mn_fp32 compiles and passes one test.

* rename variables and functions in gridwise_gemm_dlops_v1r3

* add other 3 layouts; format instance

* adding more tuning parameters

add tuning parameters for other 3 layouts

* add gemm_dlops_f16

* tmp

* add dependence of DeviceGemm::IsSupportedArg() on arch

* minor changes

* minor changes

* minor changes

* minor changes

* minor changes

* minor changes

* minor changes

* push gemm_dlops into profiler

* minor changes

* if using xdl or dlops is moved into profiler_gemm_impl

* minor changes

* minor changes

* remove is_xdl from profile_gemm_impl

* make IsSupportedArg dependent on arch for other device_gemm

* minor changes

* minor changes

* fix a bug in f_generate_tensor_value

* add 64x64x64 for gemm_dlops_int8

* add 64x64x64 for gemm_dlops_int8

* comment out 3 layouts in gemm_dlops_int8; add 32x32x32 for gemm_dlops_int8; init A values to 1

* fix

* start fixing tuning parameters

* minor

* minor changes

* minor changes

* minor changes

* fixing

* adding example

* adding example

* adding example

* add gemm fp32 example

* clean up

* use 128x128x16 as MNK tile in navi21 gemm example

* bug fix

* fix test

* use new block c tile

* clean

* fix build
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: shaojiewang <wsjmessi@163.com>

40b59a63