Commits · 18781f5620e422a40b1a2ba203b0f4a226910422 · gaoqiong / composable_kernel

06 Sep, 2022 11 commits
- Remove 'elementwise' from identifiers · 18781f56
  Po-Yen, Chen authored Sep 06, 2022
  
- Use 'DevicePermute' device op in example · 9c5dd6bf
  Po-Yen, Chen authored Sep 06, 2022
  
- Add device op 'DevicePermute' · 1fdcf492
  Po-Yen, Chen authored Sep 06, 2022
```
This device op is a clone of 'DeviceElementwise'
```
- Generalize variable naming in example code · 60ab70d8
  Po-Yen, Chen authored Sep 06, 2022
  
- Refine error message for check_err() · 31d758fb
  Po-Yen, Chen authored Sep 06, 2022
  
- Remove debug messages · 43d4bd7a
  Po-Yen, Chen authored Sep 06, 2022
  
- Add checks in helper functions · 7ebb1cbf
  Po-Yen, Chen authored Sep 06, 2022
  
- Use better name for tensor indices · e1f959fd
  Po-Yen, Chen authored Sep 06, 2022
  
- Generalize transpose utility functions · db32635c
  Po-Yen, Chen authored Sep 06, 2022
  
- Add transpose_shape() to generalize shape permute · 98498486
  Po-Yen, Chen authored Sep 06, 2022
  
- Add check to template type argument · 185f7844
  Po-Yen, Chen authored Sep 06, 2022
  
05 Sep, 2022 8 commits
- Allow specifying problem 'axes' through command line argument · 75831d9e
  Po-Yen, Chen authored Sep 05, 2022
  
- Allow specifying problem through command line argument · 8e71cad0
  Po-Yen, Chen authored Sep 05, 2022
  
- Use more specific method to write example · 19147f59
  Po-Yen, Chen authored Sep 05, 2022
  
- Add more helper methods in 'DeviceElementwise' · 665b73ff
  Po-Yen, Chen authored Sep 05, 2022
  
- Use more strict input · 8a1ccdd4
  Po-Yen, Chen authored Sep 05, 2022
  
- Move common parts into common.hpp · 58945ac2
  Po-Yen, Chen authored Sep 05, 2022
  
- Re-structure example files · ccd26cbd
  Po-Yen, Chen authored Sep 05, 2022
  
- Add example folder for 'DeviceElementwise' · ef22508c
  Po-Yen, Chen authored Sep 05, 2022
  
02 Sep, 2022 1 commit

[Hotfix] SplitK Gemm fp32 (#401) · 75891161

zjing14 authored Sep 02, 2022

* add scripts

* fixed splitK_gemm_fp32

* clean

* clean

* use gemm_xdl_splitK_c_shuffle in profiler

* remove device_gemm_xdl_splitk.hpp

01 Sep, 2022 1 commit

Add more data types to gemm+gemm and conv+conv examples (#397) · 204ef976

Chao Liu authored Sep 01, 2022

* refactor

* refactor

* adding int4/int8/fp16/bf16 for conv+conv and gemm+gemm

* adding int4/int8/fp16/bf16 for conv+conv and gemm+gemm

* clean

31 Aug, 2022 2 commits

Add examples of Conv + reduction (data type: int4, int8, bf16, fp16, fp32) (#380) · 46a675aa

Po Yen Chen authored Sep 01, 2022



* Refactor the design of DeviceGemmMultipleDMultipleR_Xdl_CShuffle

* Add 'DeviceGroupedConvFwdMultipleDMultipleR' interface

* Add DeviceGroupedConvFwdMultipleDMultipleR_Xdl_CShuffle

* Remove 'GridwiseConvFwdMultipleDMultipleR_xdl_cshuffle'

* Add 'TransformConvFwdToGemm<>' utility class (from Chao)

* Use 'TransformConvFwdToGemm<>' to shorten code

* Fix ill-formed method declaration

* Re-implement MakeRGridDescriptor_M() function

* Change problem description

* Use macro to define layout types

* Define K-reduced output tensor layout types

* Let user decide R output tensor layout

* Rename variables

* Add padding to the reduced output tensor if necessary

* Extract common code as helper method

* Remove debug message

* Add missing include directive

* Add partial fp16 Conv + Reduction example

* Add example verification code for 2D Conv problem

* Use type alias to simplify code

* Share code across different-dimension Conv problems

* Rename file/functions from run_conv_fwd* to run_convnd_fwd*

* Make example code more verbose

* Add code to support 1D & 3D Conv + Reduction on host

* Add more examples for data type: bf16, fp32

* Add example for int8

* Add custom target to group examples

* Use more general custom target name

* Change the description in error message

* Disable testing for examples other than fp32

* Add example for int4 (just copy from int8)

* Fix wrong data type

* Use larger data type for intermediate tensors

* Finish int4 example

* Undefine macro PP_DEFINE_LAYOUT_TYPE() after use

* Use named variables to replace magic numbers

* Remove debug messages

* Use same A/B data type for host Conv in int4 example

* Add check for the 'RLayout' type argument

* Group same-dim-layouts together in 'LayoutSetting<>'

* Add 'final' specifier to utility classes

* Use different initialization method for examples

* Remove macro PP_DEFINE_LAYOUT_TYPE()

* Fix code-comment mismatch

* Use more reasonable initialization value for all data types

* Use init_method=1 by default for all examples

* Remove never-used code

* Remove confusing out-of-date comments

* clean
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>

conv+conv (1x1 only) example using gemm+gemm (#393) · 4df6d93f
Chao Liu authored Aug 31, 2022
```
* refactor conv

* add conv+conv example, 1x1 only
```
30 Aug, 2022 2 commits

Gemm reduce examples int4/int8/fp32/bf16 (#368) · d00e6115

Adam Osewski authored Aug 30, 2022



* GEMM + Reduce max fp16+fp32

* GEMM + Max bf16 + int8

* Refactor common definitions.

* Refactor common func of mean meansquare example.

* More examples for mean meansquare.

* Update int8 examples and skip them because of random errors.

* Int4 examples.

* Fix examples for max int4/8

* Tensor conversion for int4 input data for mean meansquare example.

* Remove int4 mean_meansquare example

* Fix int8 mean_meansquare example.

- All ReductionAccData and R<N>DataType have to be F32. The INT32 data type gives wrong results.

* Guard int4 with ifdef

* Change int8 example to add_addsquare due to division rounding error.

* Clang format

* Change the return type of common function.

* Get back int8 example with division.

* Remove int8 mean meansquare.

* Use proper cast for BF16 data type.

* Use ck::literals.

* Use proper data type for host tensors & reference.

- Use ReduceAccDataType for reference gemm output data type.
- Cast host reference output tensor to EDataType
- Fix ifdefs for int4.
Co-authored-by: Adam Osewski <aosewski@amd.com>

Padding for attention: bmm+scale+softmax+bmm kernel (#385) · 45adb736

Shaojie WANG authored Aug 31, 2022



* add padding algo for bmm+scale+softmax+bmm. Version for verification

* remove verification code

* remove comments

* add padded bmm scale softmax bmm example

* format

* refactor

* add comments for usages of padding bmm+scale+softmax+bmm
Co-authored-by: Chao Liu <lc.roy86@gmail.com>

29 Aug, 2022 2 commits
- Try to work around flaky GemmSoftmaxGemm tests (#386) · 138faf39
  Anthony Chang authored Aug 29, 2022
```
* avoid potential hazard; flaky test issue persists

* pin down the random seed to avoid flakiness
```
- Fix the slow cpu reference batched gemm kernels. (#388) · 9061d39b
  Illia Silin authored Aug 29, 2022
```
* fix the performance of the batched gemm verification

* fix tabs
```
26 Aug, 2022 2 commits

Add an option to build CK with clang directly (#387) · 1e5b59df

Illia Silin authored Aug 26, 2022

* replace hipcc compiler with clang++

* build client app with hipcc

* build client app with clang

* add an option to build with hipcc or clang

* fix the environment for client app

* fix setting up compiler in cmake_build

* change the way the compiler is set

Fixed splitk gemm fp32 (#384) · 9881625b
zjing14 authored Aug 26, 2022
```
* add scripts

* fixed splitK_gemm_fp32

* clean

* clean
```
25 Aug, 2022 5 commits

More int4 tests. (#374) · 57fadf6f

Adam Osewski authored Aug 26, 2022



* More int4 UT.

* Disable BitwiseRepresentation UT.

* Add UT with static_cast

* Surround cout statements with #if
Co-authored-by: Adam Osewski <aosewski@amd.com>

GEMM batched/splitK/cgemm/grouped int4 examples (#383) · 3ab20fd7

Adam Osewski authored Aug 26, 2022



* Grouped GEMM int4.

* Formatting + fix K dimension for int8.

* Batched Gemm int4 example.

* CGEMM int4 example.

* Include inc files in clang-format.

* SplitK int4 example

* Refactoring of performance measurement.

* Fix #ifdef statements.
Co-authored-by: Adam Osewski <aosewski@amd.com>

Add int4 example for convnd_fwd_bias_relu_add (#375) · b73ae242

Rostyslav Geyyer authored Aug 25, 2022

* Add int4 example for convnd_fwd_bias_relu_add

* Fix AddReluAdd for building without int4 support

* Update CMakeLists.txt

* Format

* Convert int4 tensors for int8 kernel

* Fix device memory allocation

* Format

* Format

Add int4 reduction examples (#372) · d520d0cf

Qianfeng authored Aug 26, 2022

* Add int4 reduction examples

* Keep all uses of int4_t inside the preprocessor condition check

add scripts (#382) · f246fd2c
zjing14 authored Aug 25, 2022

24 Aug, 2022 2 commits
- layernorm external api (#379) · e1a3fff6
  rocking5566 authored Aug 25, 2022
```
* Add layernorm client example

* [What] Add default make install dir to gitignore
[Why] client example needs to make install
```
- Refactor the design of DeviceGemmMultipleDMultipleR_Xdl_CShuffle (#378) · 88e43744
  Po Yen Chen authored Aug 24, 2022
  
23 Aug, 2022 4 commits

Add examples of Gemm (data type: int4) (#367) · fa2d894b

Po Yen Chen authored Aug 24, 2022

* Add GEMM examples for int4

Currently the source files are just copied from int8 examples

* Re-use pre-defined alias in int4 examples

* Distinguish user-side type from kernel-side type

* Add int4_t support for check_err()

* Allow conversion between Tensor<> specializations

* Re-format source files

* Use different type for host tensors

* Re-use CopyAsType<>() to implement copy ctor

* Re-use element-wise operation type alias

* Fix typo in alias names

* Complete the int4 examples

* Add constraint to Tensor<> templated methods

* Add type traits 'is_signed_integral<>'

* Add type constraints for integer version check_err<>()

* Allow comparing different-sized integral types in check_err()

* Check converted Tensor<int4_t> with golden Tensor<int8_t>

* Remove constraint of Tensor<>::CopyAsType()

* Avoid compilation error when ck::int4_t support is disabled

* Remove debug messages

* Add #error directive to prevent compiling sources with wrong setting

* Simplify tensor usages in examples

* Add constraint to check_err() input reference type

* Align design with other PR

* Use ""_uz to simplify example code

* Avoid over-generalizing check_err()

* Re-format GEMM instance template arguments

* Extract int4 example common code

* Sort include directives

* Move #include directives into new header

* Move common code together

* Re-format template argument in example code

* Reuse same implementation code for most of GEMM examples

* Re-format common.hpp

* Unify structured comment in examples

* Use reinterpret_cast<>() for cross-type pointer conversion

* Revert "Add type traits 'is_signed_integral<>'"

This reverts commit f2c148efaedf42c8ee66032dac6d13a1003b0f3a.

* Allow unsigned integer arguments for check_err()

* Fix compilation error in check_err()

* Remove unnecessary copy ctor for Tensor<>

* Mark Tensor<> special member functions as 'default'

* Use more strict condition to add code in examples

* Fix wrong program return value of GEMM examples

* Handle the case where the user specifies all the strides

* Fix never-ran examples

* Exit successfully if GEMM instance does not support given problem

* Add missing 'else' keyword

* Re-format CMakeLists.txt

* Add wrapper function to hide value conversion while copying memory

* Add new DeviceMem API to copy memory

* Use new DeviceMem API to implement examples

* Revert "Add new DeviceMem API to copy memory"

This reverts commit 3f190b0779ceedf7aaf0b380712fda0518de72c1.

* Add conversion ctor for Tensor<>

* Write Tensor<> conversion logic explicitly in example code

* Convert Tensor<> values after transferring data to host

Attention with output permutation (#370) · e0d8806c

Anthony Chang authored Aug 24, 2022

* comment on specialization for TensorSpecialization::Packed

* gemm_softmax_gemm with output permutation

* scaling

* refactor MatrixPadder; rename to GemmPadder

* remove old sanity check

* restore original gemm_softmax_gemm

* revise comment in gemm_softmax_gemm example

* use GetElementSpaceSize()

* remove extra header

* typo

* remove archaic DeviceOpPtr

Add examples of batched/grouped/SplitK Gemm for int8/bfp16/fp16/fp32 (#361) · 60914583

zjing14 authored Aug 23, 2022



* add examples into grouped/batched_gemm

* adding splitK examples

* fixed splitK

* add bfp16 int8 example into splitK

* formatting

* use static_cast

* added common for batched_gemm

* add commons for examples of splitK/batched/grouped_gemm

* return true

* adjust splitK check tol

* update example
Co-authored-by: Chao Liu <lc.roy86@gmail.com>

Add example of Gemm + AddAddFastGelu (data type: int4) (#369) · 2327f1a6

Po Yen Chen authored Aug 23, 2022

* Add custom target to bundle examples together

* Add int4 example conditionally (just copy from int8 example)

* Extract common code into common.hpp

* Move ref gemm type alias into data-type-specific sources

* Add #error directive to prevent compiling with wrong setting

* Let AddAddFastGelu support int4 parameter type

* Let check_err() support int4 parameter type

* Add wrapper function to hide value conversion while copying memory

* Finish int4 example for GEMM + AddAddFastGelu

* Add new DeviceMem API to copy memory

* Use new DeviceMem API to implement examples

* Fix wrong use of macro 'CK_EXPERIMENTAL_BIT_INT_EXTENSION_INT4'

* Revert "Add new DeviceMem API to copy memory"

This reverts commit e26e7af71e1f982a4ca7406401e2fc9b1f086b32.

* Add conversion ctor for Tensor<>

* Add 'const' specifier to Tensor<>::CopyAsType()

* Convert Tensor<> values before/after transfer between host & device
