Commits · 702c74453f319f9a056223281fd533c624083f44 · gaoqiong / composable_kernel

07 Sep, 2022 1 commit
- Check if input/output shape meet the requirement · 2e5d4f91
  Po-Yen, Chen authored Sep 07, 2022
  
06 Sep, 2022 22 commits
- Remove no-longer used type argument · d356c871
  Po-Yen, Chen authored Sep 06, 2022
  
- Fused attention instances & padding tests (#395) · 868e5c55
  Anthony Chang authored Sep 07, 2022
```
* modify comment

* trim unnecessary check

* add gemm spec in kernel name

* add TNTT gemm_gemm + atten kernel instances

* refactor attention padding to better fit in unit tests

This streamlines usage: "ResetNaNToMinusInf" is now hidden from the user-facing device op.
Also added compile-time conditionals that load OOB value as NaN only after padding is enabled

* add adhoc padding test for atten

* shrink input value range for attention kernel validation to avoid occasional errors of about 1e-3

Still unsure whether this kind of deterministic floating-point accuracy issue is expected
or not. May want to try the exact same approach as the GPU kernel in the host reference
GEMM+Softmax+GEMM function to see if the accuracy discrepancy goes away. Until then,
shrink the input value range as it is less likely to produce errors of around 1e-3.

* attention kernel proper granular padding for all 4 dims

* IsSupportedArgument checks

* test more padded cases

* block PadK specialization in attention kernels

* workaround clang crash for gfx908

(gfx908 only) workaround for compiler crash in fused kernels on mainline #9110; #10738 seems ok
error message was "fatal error: error in backend: Error while trying to spill VGPR0 from class
VGPR_32: Cannot scavenge register without an emergency spill slot!"
this falls back to a less ideal way of handling NPadding in the fused attention kernel

* comment out kernels giving wrong results on MI100; MI200 doesn't seem affected
```
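The padding notes in #395 mention loading out-of-bounds values as NaN and a "ResetNaNToMinusInf" step that is hidden inside the device op. The idea can be sketched on the host as below. This is only an illustration under stated assumptions: the function names `reset_nan_to_minus_inf` and `padded_softmax`, the `float` element type, and the single-row layout are hypothetical and are not taken from the composable_kernel sources.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical helper: NaN marks a padded (out-of-bounds) element.
// Mapping it to -infinity makes exp() return exactly 0 for that slot.
inline float reset_nan_to_minus_inf(float x)
{
    return std::isnan(x) ? -std::numeric_limits<float>::infinity() : x;
}

// Softmax over one row that may contain NaN-marked padding.
// Assumption: the row holds at least one valid (non-NaN) element.
inline std::vector<float> padded_softmax(std::vector<float> row)
{
    assert(!row.empty());

    float row_max = -std::numeric_limits<float>::infinity();
    for(float& x : row)
    {
        x       = reset_nan_to_minus_inf(x);
        row_max = std::max(row_max, x);
    }

    float sum = 0.f;
    for(float& x : row)
    {
        x = std::exp(x - row_max); // padded slots become exp(-inf) == 0
        sum += x;
    }

    for(float& x : row)
    {
        x /= sum;
    }
    return row;
}
```

With this scheme the padded slots contribute nothing to the row sum, so the valid entries get the same probabilities they would have in an unpadded row.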
- Passing 'axes' to 'DevicePermute' · 5b63400a
  Po-Yen, Chen authored Sep 06, 2022
  
- Softmax client example (#396) · 3da5c19e
  Adam Osewski authored Sep 06, 2022
```
* Update Softmax device operation interface.

* Update ckProfiler.

* Update Softmax UT.

* Update example.

* Client example.

* Clang format
Co-authored-by: Adam Osewski <aosewski@amd.com>
```
- Distinguish input & output shape in 'DevicePermute' · 50f5ce49
  Po-Yen, Chen authored Sep 06, 2022
  
- Simplify 'DevicePermute' interface · 339e51d1
  Po-Yen, Chen authored Sep 06, 2022
  
- Only accept single-input-single-output for 'DevicePermute' · 5ae42120
  Po-Yen, Chen authored Sep 06, 2022
  
- Remove 'is_device_op<>' type traits · 179092df
  Po-Yen, Chen authored Sep 06, 2022
  
- Use indirect base type to generate methods · 32a2d78b
  Po-Yen, Chen authored Sep 06, 2022
  
- Add static_assert() to check type constraints · ea343345
  Po-Yen, Chen authored Sep 06, 2022
  
- Add simple type traits to validate device op type · 70757860
  Po-Yen, Chen authored Sep 06, 2022
  
- Remove 'elementwise' from file paths · 6c4268f9
  Po-Yen, Chen authored Sep 06, 2022
  
- Remove 'elementwise' from identifiers · 18781f56
  Po-Yen, Chen authored Sep 06, 2022
  
- Use 'DevicePermute' device op in example · 9c5dd6bf
  Po-Yen, Chen authored Sep 06, 2022
  
- Generalize variable naming in example code · 60ab70d8
  Po-Yen, Chen authored Sep 06, 2022
  
- Refine error message for check_err() · 31d758fb
  Po-Yen, Chen authored Sep 06, 2022
  
- Remove debug messages · 43d4bd7a
  Po-Yen, Chen authored Sep 06, 2022
  
- Add checks in helper functions · 7ebb1cbf
  Po-Yen, Chen authored Sep 06, 2022
  
- Use better name for tensor indices · e1f959fd
  Po-Yen, Chen authored Sep 06, 2022
  
- Generalize transpose utility functions · db32635c
  Po-Yen, Chen authored Sep 06, 2022
  
- Add transpose_shape() to generalize shape permute · 98498486
  Po-Yen, Chen authored Sep 06, 2022
  
- Add check to template type argument · 185f7844
  Po-Yen, Chen authored Sep 06, 2022
  
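Several of the commits above revolve around permuting a tensor shape by a list of 'axes' (for example "Add transpose_shape() to generalize shape permute"). A minimal sketch of such a helper is shown here. Only the name `transpose_shape` comes from the commit title; the signature, the use of `std::array`, and the convention `result[i] = shape[axes[i]]` are assumptions and may differ from the actual example code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical signature: permute 'shape' so that result[i] == shape[axes[i]].
// 'axes' is assumed to be a permutation of 0..N-1, which is checked in debug builds.
template <std::size_t N>
std::array<std::size_t, N> transpose_shape(const std::array<std::size_t, N>& shape,
                                           const std::array<std::size_t, N>& axes)
{
    std::array<bool, N> seen{}; // all false initially
    std::array<std::size_t, N> result{};

    for(std::size_t i = 0; i < N; ++i)
    {
        // each axis index must be in range and used exactly once
        assert(axes[i] < N && !seen[axes[i]]);
        seen[axes[i]] = true;
        result[i]     = shape[axes[i]];
    }
    return result;
}
```

For instance, under this convention a shape of {2, 3, 4} with axes {2, 0, 1} yields {4, 2, 3}.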
05 Sep, 2022 7 commits
- Allow specifying problem 'axes' through command line argument · 75831d9e
  Po-Yen, Chen authored Sep 05, 2022
  
- Allow specifying problem through command line argument · 8e71cad0
  Po-Yen, Chen authored Sep 05, 2022
  
- Use more specific method to write example · 19147f59
  Po-Yen, Chen authored Sep 05, 2022
  
- Use more strict input · 8a1ccdd4
  Po-Yen, Chen authored Sep 05, 2022
  
- Move common parts into common.hpp · 58945ac2
  Po-Yen, Chen authored Sep 05, 2022
  
- Re-structure example files · ccd26cbd
  Po-Yen, Chen authored Sep 05, 2022
  
- Add example folder for 'DeviceElementwise' · ef22508c
  Po-Yen, Chen authored Sep 05, 2022
  
01 Sep, 2022 1 commit

add more datatype to gemm+gemm and conv+conv example (#397) · 204ef976

Chao Liu authored Sep 01, 2022

* refactor

* refactor

* adding int4/int8/fp16/bf16 for conv+conv and gemm+gemm

* adding int4/int8/fp16/bf16 for conv+conv and gemm+gemm

* clean


31 Aug, 2022 2 commits

Add examples of Conv + reduction (data type: int4, int8, bf16, fp16, fp32) (#380) · 46a675aa

Po Yen Chen authored Sep 01, 2022



* Refactor the design of DeviceGemmMultipleDMultipleR_Xdl_CShuffle

* Add 'DeviceGroupedConvFwdMultipleDMultipleR' interface

* Add DeviceGroupedConvFwdMultipleDMultipleR_Xdl_CShuffle

* Remove 'GridwiseConvFwdMultipleDMultipleR_xdl_cshuffle'

* Add 'TransformConvFwdToGemm<>' utility class (from Chao)

* Use 'TransformConvFwdToGemm<>' to shorten code

* Fix ill-formed method declaration

* Re-implement MakeRGridDescriptor_M() function

* Change problem description

* Use macro to define layout types

* Define K-reduced output tensor layout types

* Let user decide R output tensor layout

* Rename variables

* Add padding to the reduced output tensor if necessary

* Extract common code as helper method

* Remove debug message

* Add missing include directive

* Add partial fp16 Conv + Reduction example

* Add example verification code for 2D Conv problem

* Use type alias to simplify code

* Share code across different-dimension Conv problems

* Rename file/functions from run_conv_fwd* to run_convnd_fwd*

* Make example code more verbose

* Add code to support 1D & 3D Conv + Reduction on host

* Add more examples for data type: bf16, fp32

* Add example for int8

* Add custom target to group examples

* Use more general custom target name

* Change the description in error message

* Disable testing for example other than fp32

* Add example for int4 (just copy from int8)

* Fix wrong data type

* Use larger data type for intermediate tensors

* Finish int4 example

* Undefine macro PP_DEFINE_LAYOUT_TYPE() after use

* Use named variables to replace magic numbers

* Remove debug messages

* Use same A/B data type for host Conv in int4 example

* Add check for the 'RLayout' type argument

* Group same-dim-layouts together in 'LayoutSetting<>'

* Add 'final' specifier to utility classes

* Use different initialization method for examples

* Remove macro PP_DEFINE_LAYOUT_TYPE()

* Fix code-comment mismatch

* Use more reasonable initialization value for all data types

* Default use init_method=1 for all examples

* Remove never-used code

* Remove confusing out-of-date comments

* clean
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>


conv+conv (1x1 only) example using gemm+gemm (#393) · 4df6d93f
Chao Liu authored Aug 31, 2022
```
* refactor conv

* add conv+conv example, 1x1 only
```

30 Aug, 2022 2 commits

Gemm reduce examples int4/int8/fp32/bf16 (#368) · d00e6115

Adam Osewski authored Aug 30, 2022



* GEMM + Reduce max fp16+fp32

* GEMM + Max bf16 + int8

* Refactor common definitions.

* Refactor common func of mean meansquare example.

* More examples for mean meansquare.

* Update int8 examples and skip them because of random errors.

* Int4 examples.

* Fix examples for max int4/8

* Tensor conversion for int4 input data for mean meansquare example.

* Remove int4 mean_meansquare example

* Fix int8 mean_meansquare example.

- All ReductionAccData and R<N>DataType have to be F32. The INT32 data
  type is giving wrong results.

* Guard int4 with ifdef

* Change int8 example to add_addsquare due to div rounding err.

* Clang format

* Change the return type of common function.

* Get back int8 example with division.

* Remove int8 mean meansquare.

* Use proper cast for BF16 data type.

* Use ck::literals.

* Use proper data type for host tensors & reference.

- Use ReduceAccDataType for reference gemm output data type.
- Cast host reference output tensor to EDataType
- Fix ifdefs for int4.
Co-authored-by: Adam Osewski <aosewski@amd.com>


Padding for attention: bmm+scale+softmax+bmm kernel (#385) · 45adb736

Shaojie WANG authored Aug 31, 2022



* add padding algo for bmm+scale+softmax+bmm. Version for verification

* remove verification code

* remove comments

* add padded bmm scale softmax bmm example

* format

* refactor

* add comments for usages of padding bmm+scale+softmax+bmm
Co-authored-by: Chao Liu <lc.roy86@gmail.com>


25 Aug, 2022 3 commits

GEMM batched/splitK/cgemm/grouped int4 examples (#383) · 3ab20fd7

Adam Osewski authored Aug 26, 2022



* Grouped GEMM int4.

* Formatting + fix K dimension for int8.

* Batched Gemm int4 example.

* CGEMM int4 example.

* Include inc files in clang-format.

* SplitK int4 example

* Refactoring of performance measurement.

* Fix #ifdef statements.
Co-authored-by: Adam Osewski <aosewski@amd.com>


Add int4 example for convnd_fwd_bias_relu_add (#375) · b73ae242

Rostyslav Geyyer authored Aug 25, 2022

* Add int4 example for convnd_fwd_bias_relu_add

* Fix AddReluAdd for building without int4 support

* Update CMakeLists.txt

* Format

* Convert int4 tensors for int8 kernel

* Fix device memory allocation

* Format

* Format


Add int4 reduction examples (#372) · d520d0cf

Qianfeng authored Aug 26, 2022

* Add int4 reduction examples

* Contain all uses of int4_t inside the preprocessor condition check


23 Aug, 2022 2 commits

Add examples of Gemm (data type: int4) (#367) · fa2d894b

Po Yen Chen authored Aug 24, 2022

* Add GEMM examples for int4

Currently the source files are just copied from int8 examples

* Re-use pre-defined alias in int4 examples

* Distinguish user-side type from kernel-side type

* Add int4_t support for check_err()

* Allow conversion between Tensor<> specializations

* Re-format source files

* Use different type for host tensors

* Re-use CopyAsType<>() to implement copy ctor

* Re-use element-wise operation type alias

* Fix typo in alias names

* Complete the int4 examples

* Add constraint to Tensor<> templated methods

* Add type traits 'is_signed_integral<>'

* Add type constraints for integer version check_err<>()

* Allow comparing different-sized integral types in check_err()

* Check converted Tensor<int4_t> with golden Tensor<int8_t>

* Remove constraint of Tensor<>::CopyAsType()

* Avoid compilation error when ck::int4_t support is disabled

* Remove debug messages

* Add #error directive to prevent compiling sources with wrong settings

* Simplify tensor usages in examples

* Add constraint to check_err() input reference type

* Align design with other PR

* Use ""_uz to simplify example code

* Avoid over-generalizing check_err()

* Re-format GEMM instance template arguments

* Extract int4 example common codes

* Sort include directives

* Move #include directives into new header

* Move common codes together

* Re-format template argument in example code

* Reuse same implementation code for most of GEMM examples

* Re-format common.hpp

* Unify structured comment in examples

* Use reinterpret_cast<>() for cross-type pointer conversion

* Revert "Add type traits 'is_signed_integral<>'"

This reverts commit f2c148efaedf42c8ee66032dac6d13a1003b0f3a.

* Allow unsigned integer arguments for check_err()

* Fix compilation error in check_err()

* Remove unnecessary copy ctor for Tensor<>

* Mark Tensor<> special member functions as 'default'

* Use more strict condition to add code in examples

* Fix wrong program return value of GEMM examples

* Handle the case where the user specifies all the strides

* Fix never-ran examples

* Exit successfully if GEMM instance does not support given problem

* Add missing 'else' keyword

* Re-format CMakeLists.txt

* Add wrapper function to hide value conversion while copying memory

* Add new DeviceMem API to copy memory

* Use new DeviceMem API to implement examples

* Revert "Add new DeviceMem API to copy memory"

This reverts commit 3f190b0779ceedf7aaf0b380712fda0518de72c1.

* Add conversion ctor for Tensor<>

* Write Tensor<> conversion logics explicitly in example code

* Convert Tensor<> values after transfer data to host

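The int4 GEMM commit (#367) lists steps such as "Allow comparing different-sized integral types in check_err()" and "Check converted Tensor<int4_t> with golden Tensor<int8_t>". One way to compare integer buffers of different widths is to widen both sides before comparing, as in the sketch below. The function name `check_equal_integral` is hypothetical, the real `check_err()` signature is not shown on this page, and the sketch assumes every value fits in `std::int64_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for an integer check_err(): exact element-wise
// comparison of two buffers whose integral element types may differ in size.
// Assumption: all values are representable in std::int64_t.
template <typename Out, typename Ref>
bool check_equal_integral(const std::vector<Out>& out, const std::vector<Ref>& ref)
{
    static_assert(std::is_integral<Out>::value && std::is_integral<Ref>::value,
                  "both element types must be integral");

    if(out.size() != ref.size())
    {
        return false;
    }

    for(std::size_t i = 0; i < out.size(); ++i)
    {
        // widen both operands so signed/unsigned and narrow/wide types compare by value
        if(static_cast<std::int64_t>(out[i]) != static_cast<std::int64_t>(ref[i]))
        {
            return false;
        }
    }
    return true;
}
```

Widening to a common signed type avoids the implicit conversions that can make a mixed signed/unsigned `==` give surprising results.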

Attention with output permutation (#370) · e0d8806c

Anthony Chang authored Aug 24, 2022

* comment on specialization for TensorSpecialization::Packed

* gemm_softmax_gemm with output permutation

* scaling

* refactor MatrixPadder; rename to GemmPadder

* remove old sanity check

* restore original gemm_softmax_gemm

* revise comment in gemm_softmax_gemm example

* use GetElementSpaceSize()

* remove extra header

* typo

* remove archaic DeviceOpPtr
