Commits · 9588986115297924fbfad40caa2d6a07f10c2098 · gaoqiong / composable_kernel

27 Oct, 2023 1 commit

support batch & nhead, and scale (#20) · 95889861

carlushuang authored Oct 27, 2023

* support batch & nhead

* support scale

* tile scheduler

* rename tile-scheduler to tile-partitioner

* add some exp2 math

* fix a bug when changing tile size

95889861

19 Oct, 2023 3 commits

refactor gemm+softmax+gemm (#19) · 7ccf0bb5

Chao Liu authored Oct 19, 2023

* refactor gemm+softmax+gemm using block-gemm

* reorg files

* clean

7ccf0bb5

Revert "slice kv, and use 3d padding LDS layout (#15)" (#18) · 2dfbfbbc

Chao Liu authored Oct 19, 2023

This reverts commit 7b1a0b7f.

2dfbfbbc

add fmha fwd pipeline (#17) · 9f36ac7c

carlushuang authored Oct 19, 2023



* Revert "Extract gemm0 prefetch0 out from loop"

This reverts commit d3b56f39f9fd12edb476b24ae9cf480841d311e4.

* add fmha fwd pipeline

* Extract gemm0 prefetch0 out from loop

* move blockSize to another place; fix a missing header in tile_window_impl_static_distribution.hpp

* remove KArgs from tile modules

---------
Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

9f36ac7c

18 Oct, 2023 1 commit

Pre-compute coordinates to speed up store_tile() for TileWindowWithStaticDistribution<> (#12) · 63bc96e3

Po Yen Chen authored Oct 18, 2023



* Extract store_tile() logic as a method

* Extract load_tile() logic as a method

* Rename type alias

* Extract common logic as traits

* Remove unnecessary access specifier

* Add ComputeMode for TileWindowWithStaticDistribution

* Put field check into Traits

* More definition of Traits types

* Use clearer static_assert() message

* Enable pre-compute coordinates in store_tile()

* Re-format static assert

* Undo changes to the wrong method

* Enable pre-compute coords for store_tile()

* Remove static_vector usage

* Add method to move non-member coordinates

* Force using pre-computed coordinates in Store()

* Fix wrong access for SFC_Ys

* Change comment

* Allow users to hint # access per coord

* Add comment for noting remove data members later

* Unify FIXME comments

* Replace FIXME comments by TODO

* Let user specify HintNumCoords

* clean

* clean

* clean

* clean

* refactor load/store for window

* clean

* clean

* bug fix for window; clean

---------
Co-authored-by: Chao Liu <chao.liu2@amd.com>

63bc96e3

12 Oct, 2023 2 commits

Refactor 1010 (#14) · 7337ec25

Chao Liu authored Oct 12, 2023

* refactor

* refactor

* change load_tile, update block gemm

* debug

* clean

* clean

* experiment lod

* workaround spilling issue

* clean

7337ec25

slice kv, and use 3d padding LDS layout (#15) · 7b1a0b7f

carlushuang authored Oct 12, 2023

* slice kv, and use 3d padding LDS layout

* add missing sync

* put sync to another place

* move sync place

* revert to normal

7b1a0b7f

06 Oct, 2023 1 commit

add tensor slicing API (#7) · 6491acda

carlushuang authored Oct 06, 2023



* add tensor slicing API

* remove redundant ck namespace

* better gemm_gemm interface

* modify gemm_gemm

* add slice_tile api

* fix merge bug

* update to 3d padding, since we no longer need that much LDS size

* clean

* clean

* clean

* clean

* clean

* clean

* clean

* clean

* clean

* clean

* clean

* clean

* clean

* clean

---------
Co-authored-by: Chao Liu <chao.liu2@amd.com>

6491acda

03 Oct, 2023 1 commit

Shuffle in thread (#13) · 1cf54e86

Chao Liu authored Oct 03, 2023

* adding in-thread shuffle

* update softmax example

* refactor grid gemm

* refactor gemm: layouts

* bug fix

* clean

* clean

1cf54e86

14 Sep, 2023 2 commits

Batch gemm softmax gemm (#11) · 2837e6b3

Chao Liu authored Sep 14, 2023

* make it simple

* batched gemm+softmax+gemm

2837e6b3

Remove program server (#10) · 6bc9ee05

Chao Liu authored Sep 13, 2023

* removing program server

* specify launch bound per kernel instance

6bc9ee05

13 Sep, 2023 1 commit

Gemm+softmax+gemm (#9) · f3baea0d

Chao Liu authored Sep 12, 2023

* adding gemm+softmax+gemm

f3baea0d

05 Sep, 2023 3 commits

add softmax example (#6) · 98109c8b

Chao Liu authored Sep 05, 2023

98109c8b

Tile program init bulk PR (#4) · 0e92deb7

Chao Liu authored Sep 05, 2023



Tile Program init bulk PR

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>

0e92deb7

Add image to column kernel (#867) · 0077eeb3

Bartłomiej Kocot authored Sep 05, 2023

* Add image to column kernel

* Add instances, tests, profiler, example

* Add client example

* Several fixes of image to column

* Fix variable name in device_image_to_column_impl

* Several fixes of image to column profiler

* Fix num_btype calculation

* Make new measurements for correct bytes calculation

0077eeb3

31 Aug, 2023 2 commits

Grouped Gemm with Fixed K and N with SplitK (#818) · f5ec04f0

zjing14 authored Aug 31, 2023



* move all arguments into device

* add b2c_tile_map

* add examples

* add SetDeviceKernelArgs

* dedicated fixed_nk solution

* init client api

* add grouped_gemm_bias example

* add an instance

* add instances

* formatting

* fixed cmake

* Update EnableCompilerWarnings.cmake

* Update cmake-ck-dev.sh

* clean; fixed comments

* fixed comment

* add instances for fp32 output

* add instances for fp32 output

* add fp32 out client example

* fixed CI

* init commit for kbatch

* add splitk gridwise

* format

* fixed

* clean deviceop

* clean code

* finish splitk

* fixed instances

* change m_loops to tile_loops

* add setkbatch

* clean code

* add splitK+bias

* add instances

* opt mk_nk instances

* clean examples

* fixed CI

* remove zero

* finished non-zero

* clean

* clean code

* optimized global_barrier

* fixed ci

* fixed CI

* removed AddBias

* format

* fixed CI

* fixed CI

* move 20_grouped_gemm to 21_grouped_gemm

---------
Co-authored-by: Jing Zhang <jizha@amd.com>

f5ec04f0

MaxPool & AvgPool bwd instances, test, ckProfiler, client example (#861) · 866377de

rocking authored Aug 31, 2023

* Add maxpool instances

* Rename index pool to max pool.

* Add maxpool bwd bf16 instances

* Add avg pool bwd instances

* Rename avgpool and maxpool to avg_pool3d and max_pool

* Add bf16 pool fwd instances

* Add max pool bwd to ckProfiler

* Add avg pool3d bwd to ckProfiler

* Add avg pool bwd test

* Fix bug of reference pool fwd (dilation)

* Fix bug of max pool bwd (dilation and initZero)

* Support bf16 compute data type

* Force compute type to be f32, because atomicAdd only supports f32

* Add max pool bwd test

* Rename folder

* Rename pool

* Add max pool bwd client example

* Add avg pool bwd client example

* Add missing workspace

* clang format

* Rename macro

* remove useless header

* remove useless layout

866377de

29 Aug, 2023 1 commit

add an example of customized type convert - bfp16_rtn (#869) · 38ada109

zjing14 authored Aug 29, 2023



* add an example of customized bfp16_rtn

* fixed threadwise_copy

---------
Co-authored-by: Jing Zhang <jizha@amd.com>

38ada109

23 Aug, 2023 1 commit

use correct data types in cmake conditions for splitk gemm example (#862) · 7c71dc7e

Illia Silin authored Aug 23, 2023

7c71dc7e

22 Aug, 2023 1 commit

Add instances/ckProfiler/client example for fp8/fp16 mixed precision Gemm (#853) · eac50708

Rostyslav Geyyer authored Aug 22, 2023



* Add ComputeType arg to splitk device and gridwise ops

* Update for gridwise op compatibility

* Update bf16 and int8 splitk gemm examples with ComputeType

* Add instances

* Update ckProfiler for mixed precision cases

* Add a mixed precision splitK gemm client example

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

eac50708

14 Aug, 2023 2 commits

Implement DPP8 based GEMM for Navi21 (#826) · d4c84256

Bartlomiej Wroblewski authored Aug 14, 2023

d4c84256

Refactor pool fwd (#815) · f60f0a5e

rocking authored Aug 15, 2023

* Do not hardcode stride

* devicePool2DFwd Inherit devicePool3DFwd

* Move instance declaration out of common

* Add dilation

* use the pool3d rank, because pool2d inherits pool3d

* calculate Do Ho Wo for the dilation

* Fix header name

* Modify ckProfiler

* Remove pool2d instance

* Remove pool2d in profiler

* Remove pool2d and add dilation

* In the client example, this commit revises the following:
1. Add dilation.
2. Use pool3d to implement pool2d

* Refine naming and IsSupportedArgument()

* Add dilation to maxpool bwd example

* clang format

* 1. Remove useless header
2. Fix copyright
3. Refine naming

* Add layout parameter to pool fwd

* clang format

* Fix merge error

* Fix compile error

* Remove layout parameter in derived class

* Refine changelog

* Fix compile error

* Fix compiler error

* Add layout to external api and profiler

f60f0a5e

10 Aug, 2023 1 commit

Average pool backward deviceOP and example (#797) · 578142db

rocking authored Aug 10, 2023

* Add avgpool bwd reference code

* Refine naming

* Fix invalid in_element op in ref_conv

* Add example (only reference now)

* Add the full example of avgpool bwd

* Fix copyright

* Imitate MakeDescriptor from transform_conv_bwd_data_to_gemm_v1.hpp

* rename channel to c from k

* Arrange the code

* Imitate the argument from conv bwd

* Implement invoker

* Fix order of parameter in example

* Refactor reference code for different dimension

* Support different stride

* Check if argument is valid

* Fix kernel parameter for NDHWC, fastest dimension C is not reduced

* Add more data type in example

* Fix bug in example

* calculate Do Ho Wo according to the dilation

* Remove useless header

* Add comment in reference code

* Add layout parameter

* Remove layout in derived class

* Refine reference comment

578142db

09 Aug, 2023 1 commit

Enable f16/f8 mixed precision mode (#820) · 9c54eaab

Rostyslav Geyyer authored Aug 09, 2023

* Enable f16/f8 mixed precision

* Add an argument to enable mixed precision

* Update for compatibility

* Add mixed precision example

* Introduce ComputeType argument

9c54eaab

07 Aug, 2023 2 commits

Allow building CK for specific data types and split off last remaining DL instances. (#830) · 08eb1769

Illia Silin authored Aug 07, 2023

* properly split conv_nd_bwd_data instances

* split conv2d_fwd instance data types

* split the gemm, conv2d_fwd and batched_gemm_softmax_gemm

* split the tests by data types where possible

* filter examples by DTYPES

* split few remaining examples by DTYPES

* filter most instances by DTYPES

* add new lines at end of headers, fix grouped_gemm profiler

* fix syntax

* split the ckprofiler instances by DTYPES

* split the conv2d and quantization DL and XDL instances

* fix the splitting of conv2d DL instances

* split softmax and pool_fwd tests for fp16 and fp32 types

* fix syntax

* fix the dl_int8 quantization instances isolation

08eb1769

Add wei_strides to grouped conv3d wei to keep consistency (#817) · 22443f7a

Bartłomiej Kocot authored Aug 07, 2023



* Add wei_strides to grouped conv3d wei to keep consistency

* Fix strides in client examples

* Unify backward weight api with forward

* Fix for example

* Fixes for examples

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

22443f7a

26 Jul, 2023 3 commits

initial stream-k implementation with example (#699) · e7dca79d

carlushuang authored Jul 27, 2023



* initial stream-k implementation with example

* fix unexpected change in err

* improve performance a little by reorganizing the pipeline.

* improve perf a little by swizzling block idx

* add profiler

* update example

* fix spelling

* shrink karg for streamk

* support dynamic buffer using memory coherence glc_slc bit from template

* control memory coherence while constructing dynamic buffer

* update reduction for streamk(not ready yet)

* Add template parameter to make_dynamic_buffer to support amd_buffer coherence setting

* fix build issue

* fix several bugs

* now result is correct, everything works (but has scratch)

* remove scratch by manually reset coordinate

* update device code

* fix a bug in final reduce

* fix something in example

* update async memset

* fix enum as camel case

* modify coherence enum name

* clean code and use atomic streamk by default

* remove unused var

* throw exception if there is an empty pointer

* fix format

* fix CI warning

* fix type in init

* modify CI error

* filter out on gfx10+

* restore changed example code

---------
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>

e7dca79d

Disable XDL kernels on unsupported HW; Add ck::is_xdl_supported (#768) · ac6d68b3

Bartłomiej Kocot authored Jul 26, 2023



* Disable XDL kernels on unsupported HW; Add ck::is_xdl_supported function (#765)

* Do not throw an error when GEMM problem is not supported.

---------
Co-authored-by: Bartlomiej Wroblewski <bwroblewski10@gmail.com>
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

ac6d68b3

Refine the dimension of host tensor. This example only requires 1D (#812) · 016bd428

rocking authored Jul 26, 2023

016bd428

25 Jul, 2023 1 commit

Add bias scalar vectorload = 1 for gemm bias gemm (#791) · 50643dd5

ltqin authored Jul 25, 2023

* first change bias load

* add bias dim and scalar vector parameter

* make CDE0BlockTransferSrcVectorDim not work

* changes to instance

* add limit for CDE0BlockTransferSrcScalarPerVector

50643dd5

18 Jul, 2023 1 commit

Add mechanism to build CK for select data types, add Navi3x CI. (#790) · 189ea3b9

Illia Silin authored Jul 17, 2023

* allow building CK for specific data types

* add CI build and test stage on Navi3x without some int8 instances

* add missing gemm fp16 instances

* add the changes to the missed cmake file

* add empty lines at end of source files

* Do not build quantization client example on navi3 in CI

* disable batched_gemm_multi_d_int8 instances with DTYPES

* disable device_conv2d_bwd_data_instance with DTYPES

* fix ckprofiler for conv_bwd_data for int8

* properly isolate the conv_bwd_data int8 instances

* remove empty line

189ea3b9

12 Jul, 2023 1 commit

Support NHWGC conv2d_bwd_weight (#769) · 1ee99dca

Bartłomiej Kocot authored Jul 12, 2023



* Support NHWGC conv2d_bwd_weight

* Fix client example

* Fix client example

* Fix comments

* Redesign grouped_conv_bwd_weight instances

* Clang format fix

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

1ee99dca

06 Jul, 2023 2 commits

Batchnorm splitk single kernel (#771) · 8f5cafaf

Qianfeng authored Jul 06, 2023

* Use dim 0 as faster dim for writing mean/var/count workspace in batchnorm multiblock method [performance]

* Add CountDataType as template parameter in blockwise_welford

* Add utility/get_shift.hpp

* Add BatchNorm multiblock single-kernel implementation

* Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a

* Renaming in device_batchnorm_forward_impl.hpp

* Tiny fix in the batchnorm_fwd profiler

* Revert "Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a"

This reverts commit d16d00919c43f10759e7b4e4d112125221ed9064.

* Use the old two-kernel batchnorm multiblock method for gfx1030

* Use the old two-kernel batchnorm multiblock method for gfx908

* use the single-kernel batchnorm multiblock method only for gfx90a

* Remove get_wave_id() from utility/get_id.hpp since it is not used

* Set true for testing running mean/variance and saving mean/invvariance in the examples

* Fix to copy-right words

* Remove un-needed including in utility/get_id.hpp

* Add comments to workgroup_synchronization.hpp

* Remove un-used codes in gridwise_multiblock_batchnorm_forward.hpp

* Renaming in the kernels

* Remove un-used kernel file

8f5cafaf

Move Device Ops implementations into impl directory. (#777) · f4dfc060

Adam Osewski authored Jul 06, 2023

Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>

f4dfc060

05 Jul, 2023 1 commit

Add fp8 GEMM and an example for it (#767) · 1cf50031

Rostyslav Geyyer authored Jul 04, 2023

* Add fp8 xdl gemm

* Add example

* Use int8 intrinsics for buffer load/store

* Format

* Update cmakelists

1cf50031

19 Jun, 2023 2 commits

do not build gemm-gemm and conv-conv examples for gfx94* (#761) · 645eb2f2

Illia Silin authored Jun 19, 2023

* do not build gemm-gemm and conv-conv examples for gfx94*

* do not build gemm-gemm and conv-conv examples on navi

645eb2f2

Maxpool bwd (#750) · 341ad956

rocking authored Jun 19, 2023

* Add maxpool f32 kernel and example

* Revise copyright

* Add device pool bwd device op

* Support f16 and bf16

* Add compute datatype for reference code.
Prevent error in bf16

* Fix type error

* Remove layout

* Fix bf16 error

* Add f16 and bf16 example

* Add more operations

* Implement IsSupportedArgument

* Add changelog

* Add comment

* Add comment

* Remove useless header

* Move initialize of workspace to the run

* Move set din zero to the device operator

* Save din_length_raw

* Remove useless header

* Calculate gridsize according to the number of CU

* Calculate gridSize according to the number of CU.
Remove useless header

* Add put example

* Remove useless header

* Fix CI fail

341ad956

15 Jun, 2023 1 commit

Enable gfx941 and gfx942 architectures. (#752) · 027e46ee

Illia Silin authored Jun 15, 2023

* enable gfx941/942 targets

* fix clang format

* fix the cmake logic for multiple targets

* fix cmake syntax for looping over targets

* add gfx941/942 support for gemm_xdl instances

027e46ee

12 Jun, 2023 1 commit

Fix flash attn mask bug (#733) · 0ede66de

ltqin authored Jun 12, 2023



* add check input parameter

* add instance for vector load = 1

* move general instance to first pos

* fix read bias code

* regular code for bias load

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

0ede66de

01 Jun, 2023 1 commit

Simplify kernel argument of device operator Device(Batched)GemmXdl<> (#723) · 9eae73df

Po Yen Chen authored Jun 02, 2023



* Remove M/N/KPad local variables

* Use M/N/KPad to name padded lengths

* Replace duplicated local variable by parameters

* Rename variables M/N/KRaw to M/N/K

* Move AK0/BK0 compute logic into GridwiseGemm

* Use macro to shorten code

* Move CalculateGridSize() logic into GridwiseGemm

* Add comment to credit the implementation source

* Reuse the existing implementation

* Remove no-longer used data members

* Remove elementwise-op objects from interfaces

* Reserve kernel arg as whole object in interfaces

* Remove redundant data member

* Make 3rd type parameter optional

* Remove unnecessary type parameters

* Remove no-longer used descriptor-creation methods

* Move kernel arg type definition into GridwiseGemm

* Add macro to switch between code sections

* Move argument field computing logic into device op side

* Make utility method 'static'

* Declare special methods

* Unify MakeArgument() usage

* Adapt the new GridwiseGemm interface

* Push-down class 'GridwiseGemm::Argument' fields

* Remove no-longer used methods

* Add unused parameters

* Force copying parameters in 'Embed' ctor

* Remove no-longer used descriptors

* Fallback change on BaseArgument

* Remove macro 'INTEGER_DIVIDE_CEIL'

* Make variable naming more consistent

* Make sure methods are only invoked on right place

* Remove trailing underscore in public attribute name

* Remove necessary methods

* Hide computing logic of derived attributes

* Make new 'Embed' ctor only available for device code

* Make sure 'Embed' type args are not references

* Move check for karg.K into CheckValidity()

* Remove more integer division logic from device code

* Undo changes on Embed

* Separate 'Problem' concept out from 'Argument'

* Add overloaded version of __builtin_amdgcn_readfirstlane()

* Remove 'static' specifiers

* Remove more 'static' specifier

* Replace unsigned char by std::byte

* Add 'const' specifier to never changing variable

* Add 'inline' specifier to function definition

* Share same name for kernel interfaces

* Fix wrong boundary calculation logic

* Leave the third template arg for compatibility

* Remove unnecessary parameters

* Fix wrong error message (for type name)

* Create descriptor on device side

* Fix wrong debug message

* Remove no-longer used data members

* Rename type trait

* Remove std:: qualifier from standard types

* Replace 'size_t' by 'unsigned'

* Use type alias to hint usage

* Replace static_for<> by ordinary 'for' loop

* Reject unsupported argument

* Rename readfirstlane() to amd_wave_read_first_lane()

* Rename file readfirstlane.hpp as amd_wave_read_first_lane.hpp

* Update function calls

* Reorder statements

* Re-format files

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

9eae73df