Commits · 80f0da377f0bc80ef717e62c18a454e7f987dcf6 · gaoqiong / composable_kernel

01 Dec, 2023 4 commits
- fix build (#46) · 80f0da37
  Chao Liu authored Dec 01, 2023
  
- Ck tiled main fixing for building xformers (#44) · ca2105a4
  Qianfeng authored Dec 02, 2023
```
* Add include/ck/config.h to support xformers C++ extension building

* Disable exp() and log() overloading for half_t to support xformers C++ extension building

* config.h.default

---------
Co-authored-by: Chao Liu <chao.liu2@amd.com>
```
- format (#45) · 99c9d3b7
  Chao Liu authored Dec 01, 2023
  
- fix bug · 87c7888e
  Chao Liu authored Nov 30, 2023
  
30 Nov, 2023 2 commits

Fixed GroupedGemmFixedNK with hipGraph (#1065) · 49df1dc5

zjing14 authored Nov 30, 2023



* fixed examples; add async_mem_set

* add stream to all deviceOp using SetWorkspace

---------
Co-authored-by: Jing Zhang <jizha@amd.com>


Introduce wrapper for layout (#1054) · 8ff845f2

Bartłomiej Kocot authored Nov 30, 2023

* Introduce wrapper for layout

* Extend functionality

* Fix for getLength

* Comment fixes

* Add comments and remove not needed getters


29 Nov, 2023 1 commit

Disable transpose device op for MI300 (#1050) · a2969aa8

arai713 authored Nov 29, 2023



* added working example for 5D input using 1D kernel

* example with 5D input tensor and 2d kernel - not working: issues with arguments

* added updated version of 3d device op - changed descriptors/dims

* added example file to check kernel

* fixed descriptor and isSupportedArgument stride problem

* added and modified kernel for 3d - updated tids/loop

* adding some more 5d example files

* fixed some issues

* changes made for testing

* working version: fixed error in stride for A, still a bit inefficient

* cleaned up formatting/comments

* updating formatting

* more formatting fixes

* fixing cmake, adding back gpu targets in cmake script

* adding client example

* added instances for client example

* fixed errors in client example

* implemented client ex with device_elementwise.hpp and device_elementwise_3d_impl.hpp

* removed extra files

* minor formatting and naming fixes

* adding test files and profiler

* fixing minor error

* minor fix

* removed unnecessary comments, renamed files

* updated instance list for client example, added different layout example

* removing instances

* fixed error in instance generation

* remove comments

* update profiler and client example tensor layouts

* fixed errors in test/profiler

* updated vector dim access to enable vector load

* updated test/profiler files

* updated example with 1d kernel

* updating profiler

* renamed files

* disabled device op for MI300

* skip elementwise_permute_2d on gfx94x

* Update CMakeLists.txt

* fixing CMake - disabling some GPU targets

---------
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>


28 Nov, 2023 2 commits
- recover default niter (#1064) · ae5e5181
  zjing14 authored Nov 28, 2023
  
- Switch default f8 conversion to stochastic rounding (#1048) · 6ef034f6
  Rostyslav Geyyer authored Nov 27, 2023
```
* Switch default f8 conversion to stochastic rounding

* Refactor f8-related type_converts

* Add an element-wise op
```
27 Nov, 2023 1 commit
- Add missing check for K padding in XDL GEMM (#1056) · 60ecfd73
  Bartlomiej Wroblewski authored Nov 27, 2023
  
25 Nov, 2023 1 commit

Add basic support for direct loads from global to LDS (#999) · 627054b9

Bartlomiej Wroblewski authored Nov 25, 2023

* Add basic support for direct loads from global to LDS

* Clean the code and comments

* Add support for fp16

* Add comments

* Add check for thread cluster lengths

* Align non-direct-load fp16 example

* Small fixes

* Extend IsSupported to check for supported GPU gens

* Build examples only on the supported HW

* Do not throw when instance not supported in 04 example

* Review: Apply review suggestions

* Review: small fix

* Review: small fix


21 Nov, 2023 1 commit
- Merge with (not the latest) upstream CK (#32) · 0a7174ad
  Chao Liu authored Nov 21, 2023
```
* fix build for old ck examples

* fix build for old ck
```
17 Nov, 2023 1 commit

Improve 4k gemm perf (#1047) · e8cddfdc

zjing14 authored Nov 17, 2023



* improve 4k gemm perf

* add f8 instances

* format

---------
Co-authored-by: Jing Zhang <jizha@amd.com>


15 Nov, 2023 4 commits

increase warm-up to 10 iter (#28) · 496be40e
Chao Liu authored Nov 15, 2023


Fmha pr 2 (#26) · 3753c4bc

carlushuang authored Nov 16, 2023

* support hdim=64/128 in same example code

* support v transpose

* revert gemm.cpp; it was not intended to be modified

* remove useless code

* fix a bug for swizzle C encoding, no perf change

* optimize LDS encoding

* update LDS layout

* clean up code


Log CDEBlockTransferScalarPerVector_NPerBlock in conv fwd multiD xdl (#1042) · 1fefd82e

Bartłomiej Kocot authored Nov 15, 2023

* Log CDEBlockTransferScalarPerVector_NPerBlock in conv_fwd_multi_d_xdl implementation

* Log CDEBlockTransferScalarPerVector_NPerBlock in conv fwd multiD xdl


Fix check for conv Fwd Filter1x1Pad0 (#1040) · 3ef3102f
Bartłomiej Kocot authored Nov 15, 2023
```
* Fix check for conv Fwd Filter1x1Pad0

* Fix check for conv Fwd Filter1x1Pad0
```

14 Nov, 2023 1 commit

Introduce multiABD api and deprecate multiD (#1035) · f2398f61

Bartłomiej Kocot authored Nov 14, 2023

* Introduce multiABD api and deprecate multiD

* Replace multiD with multiABD

* Mark structures as deprecated

* Change doxygen deprecated to note to avoid warnings


13 Nov, 2023 1 commit

Hip tensor permute (#1002) · 454cf7bd

arai713 authored Nov 13, 2023

* adding files for F32 example

* adding functioning implementation with scalar multiplication and unary operator support

* added fp 16 type check in unary square

* updating scalar multiplication as an operator

* functioning version with scalar operator

* changing strides for col major

* updated column major implementation

* working column major implementation

* cleaned up comments, rearranged/renamed files


10 Nov, 2023 2 commits

Support multi AB for grouped conv fwd xdl (#1027) · 49e52bb3

Bartłomiej Kocot authored Nov 10, 2023

* Support multi AB for grouped conv fwd xdl

* Add instances

* Add client example

* Add example

* Add interface test

* Minor fixes

Minor fixes

Minor fixes

* Comment fixes

* Fixes

* Reference fix

* Test xdl fixes

* Improve multi_ab interface test


Backward of gamma and beta for layernorm and groupnorm (#1013) · 1db75603

rocking authored Nov 10, 2023

* Add layernorm backward reference code

* Add groupnorm backward reference code

* Add example

* clang format

* Fix bug in reference layernorm and groupnorm

* Fix naming

* Refine naming

* Add device op for normalization bwd gamma and beta

* Refine template parameter

* Add bwd gamma & beta of kernel

* 1. Add groupnorm example
2. Refine layernorm naming

* Narrow down the static check for performance

* Refine variable name


09 Nov, 2023 2 commits

Transpose 3d (#984) · 3af8c81a

arai713 authored Nov 08, 2023



* added working example for 5D input using 1D kernel

* example with 5D input tensor and 2d kernel - not working: issues with arguments

* added updated version of 3d device op - changed descriptors/dims

* added example file to check kernel

* fixed descriptor and isSupportedArgument stride problem

* added and modified kernel for 3d - updated tids/loop

* adding some more 5d example files

* fixed some issues

* changes made for testing

* working version: fixed error in stride for A, still a bit inefficient

* cleaned up formatting/comments

* updating formatting

* more formatting fixes

* fixing cmake, adding back gpu targets in cmake script

* adding client example

* added instances for client example

* fixed errors in client example

* implemented client ex with device_elementwise.hpp and device_elementwise_3d_impl.hpp

* removed extra files

* minor formatting and naming fixes

* adding test files and profiler

* fixing minor error

* minor fix

* removed unnecessary comments, renamed files

* updated instance list for client example, added different layout example

* removing instances

* fixed error in instance generation

* remove comments

* update profiler and client example tensor layouts

* fixed errors in test/profiler

* updated vector dim access to enable vector load

* updated test/profiler files

* updated example with 1d kernel

* updating profiler

* renamed files

---------
Co-authored-by: Jing Zhang <jizha@amd.com>


Layernorm4d (#1022) · a3d9a2cd

rocking authored Nov 09, 2023



* Rename folder

* Add layernorm 4d fwd example

* Rename original layernorm example

* Add layernorm 4d f16 test

* Add layernorm4d_fwd client example

* Support layernorm4D in ckProfiler

* Rename groupnorm to groupnorm fwd in example

* Rename layernorm and group fwd in test

* Rename normalization to normalization_fwd (instances)

* Add fwd to DeviceNormalization

* Rename external api header

* Rename folder, because we can also add bwd in this folder

* Add fwd in layernorm and groupnorm (profiler)

* Fix compile error

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>


08 Nov, 2023 1 commit
- Support fp64 contraction on gfx94x. (#1029) · ce526211
  Illia Silin authored Nov 08, 2023
```
* enable contraction fp64 on gfx94*

* fix the logic
```
07 Nov, 2023 1 commit

Add Gemm instances for performance improvement (#1018) · 98fd41f5

zjing14 authored Nov 07, 2023



* improve kpad

* more tuning parameters

* f16_f8_fp16

* cut test time

* add f16_f8_fp16

* add f16_f8_f16

* testing instances for skinny cases

* format

* clean

* add fp16_f8_fp16

* clang-format

* add grouped gemm instances

* fixed profile grouped_gemm

* clean

* clean

* clean

* clean

* clean

* add missing instance func

* fixed interface

---------
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: root <root@sh5-1e707-rc06-38.mkm.dcgpu>

98fd41f5

03 Nov, 2023 1 commit
- unify q persistent in register (#24) · e71aa1d6
  carlushuang authored Nov 03, 2023
```
* unify q persistent in register

* add refactor warp_gemm dispatcher
```
02 Nov, 2023 1 commit

Add support for mixed precision in contraction scale and bilinear (#973) · 4ef704d8

Bartlomiej Wroblewski authored Nov 02, 2023



* Add support for mixed precision in contraction scale and bilinear (#936)

* Extract common functionality to separate files

* Reference contraction: Remove incorrect consts from type_converts

* Reference contraction: Add missing type_convert for dst value

* Reference contraction: Fix incorrect order of B matrix dimensions

* Add support for mixed precision in contraction scale and bilinear

* Move using statements from instances to a common file

* Move using statements from examples to a common file

* Fix the order of B matrix dimensions across examples and profiler

* Fix the computation of error threshold

* Make ComputeDataType an optional argument

* Include possible DataType -> ComputeDataType casting error in the threshold

* Remove commented code

* Make the ComputeDataType an optional argument in instance

---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>


01 Nov, 2023 1 commit
- Add ScaleAddScaleAddRelu post op for conv fwd (#1006) · f27ea94e
  Bartłomiej Kocot authored Nov 02, 2023
```
* Add ScaleAddScaleAddRelu post op for conv fwd

* Fixes

* Fix instance file name

* Minor fix
```
31 Oct, 2023 2 commits

Enable gfx941 & gfx942 support for DeviceGemmXdl<> device op (#1017) · 675b6978
Po Yen Chen authored Nov 01, 2023
```
* Enable gfx942 support for DeviceGemmXdl<> device op

* Enable gfx941 support for DeviceGemmXdl<> device op
```

Add support for groups in Img2Col/Col2Img (#1007) · 2e824c6d

Bartłomiej Kocot authored Oct 31, 2023

* Add support for groups in Img2Col/Col2Img

* Fix interface test

* Fix interface test G to N

* Improve performance

* Change gemm layout to 3d

* Fixes


28 Oct, 2023 1 commit

Fix the fp8 gemm for large tensors on MI300. (#1011) · f46a6ffa

Illia Silin authored Oct 27, 2023



* Fix the fp8 conversion

* Try clipping value before conversion

* Fix return

* Simplify with a const

* reduce the gemm input tensor values to reduce round-off error

* replace if-else with lambda

* fix syntax

---------
Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>


27 Oct, 2023 1 commit

support batch & nhead, and scale (#20) · 95889861

carlushuang authored Oct 27, 2023

* support batch & nhead

* support scale

* tile scheduler

* rename tile-scheduler to tile-partitioner

* add some exp2 math

* fix a bug when changing tile size


26 Oct, 2023 1 commit
- fix compile error for fmha_fwd example (#21) · b7abe77a
  carlushuang authored Oct 26, 2023
  
20 Oct, 2023 1 commit
- Fix bf8 conversion issues (#1003) · 1fd27d52
  Rostyslav Geyyer authored Oct 20, 2023
```
* Fix the conversion

* Add bf8 functionality

* Enable example on MI200 as well
```
19 Oct, 2023 5 commits

Fix the DL kernel issues on Navi3x. (#998) · f7331c60
Illia Silin authored Oct 19, 2023
```
* apply the patch for dl kernels on gfx11

* build DL kernels on navi32 CI
```

Misc fixes (#994) · b4fc4d0b

Qianfeng authored Oct 20, 2023

* reinterpret_cast to const char* in dumpBufferToFile to be compatible with both const and non-const input pointers

* Add seed input to GeneratorTensor_4 for normal_distribution generator

* Add GetTypeString() for DeviceElementwiseImpl

* Add HIP_CHECK_ERROR macro


Extend available elementwise operations with conv examples (#995) · 82f3a835

Bartłomiej Kocot authored Oct 19, 2023

* Extend available elementwise operations with conv examples

* Fixes

* Remove not needed convert

* Update CMakeFile and dir name


refactor gemm+softmax+gemm (#19) · 7ccf0bb5
Chao Liu authored Oct 19, 2023
```
* refactor gemm+softmax+gemm using block-gemm

* reorg files

* clean
```

add fmha fwd pipeline (#17) · 9f36ac7c

carlushuang authored Oct 19, 2023



* Revert "Extract gemm0 prefetch0 out from loop"

This reverts commit d3b56f39f9fd12edb476b24ae9cf480841d311e4.

* add fmha fwd pipeline

* Extract gemm0 prefetch0 out from loop

* move blockSize to another place; fix a missing header in tile_window_impl_static_distribution.hpp

* remove KArgs from tile modules

---------
Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>


18 Oct, 2023 1 commit

Layernorm and groupnorm support to save mean and inverse std in forward (#929) · 3696fe1c

rocking authored Oct 19, 2023

* save mean and inverse std in normalization

* Save mean and inverse std in splitK

* Vector save mean and inv std

* Modify instance for save mean and std

* simplify the layernorm example

* Save mean and std in groupnorm example

* Save mean and inv std in ckProfiler and test

* Remove compute data type from base class

* Save mean and inv std in client example

* Add changelog

* clang format

* Fix compile error

* Refine naming

* Avoid error in bf16

* revert changelog
