Commits · 5683ea4ed8f9f9e852dd92bdecbbf4cbd7cf45e5 · gaoqiong / composable_kernel

07 Aug, 2023 1 commit

Allow building CK for specific data types and split off last remaining DL instances. (#830) · 08eb1769

Illia Silin authored Aug 07, 2023

* properly split conv_nd_bwd_data instances

* split conv2d_fwd instance data types

* split the gemm, conv2d_fwd and batched_gemm_softamx_gemm

* split the tests by data types where possible

* filter examples by DTYPES

* split few remaining examples by DTYPES

* filter most instances by DTYPES

* add new lines at end of headers, fix grouped_gemm profiler

* fix syntax

* split the ckprofiler instances by DTYPES

* split the conv2d and quantization DL and XDL instances

* fix the splitting of conv2d DL instances

* split softmax and pool_fwd tests for fp16 and fp32 types

* fix syntax

* fix the dl_int8 quantization instances isolation

08eb1769

31 May, 2023 1 commit
- update copyright headers (#726) · b94fd0b2
  Illia Silin authored May 31, 2023
  
  b94fd0b2
15 May, 2023 1 commit

Update staging branch. (#706) · 72b7ae25

Illia Silin authored May 15, 2023



* update daily build from rocm 5.4.3 to 5.5 (#693)

* Fix grouped_gemm_splitk kernels on MI300. (#694)

* replace amd_buffer_atomic_add with hip_atomic_add

* fix grouped_gemm_splitk kernels on mi300

* fix syntax

* revert experimental atomic_add changes

---------
Co-authored-by: Jing Zhang <jizhan@amd.com>

* Fix the group of quantization_int8 kernels on MI300. (#695)

* replace amd_buffer_atomic_add with hip_atomic_add

* fix grouped_gemm_splitk kernels on mi300

* fix syntax

* revert experimental atomic_add changes

* fix the group of kernels from ticket 723 on MI300

---------
Co-authored-by: Jing Zhang <jizhan@amd.com>

* Optimize bf16 conversion (#664)

* Add TypeConvert class and start refactoring

* Refactor TypeConvert as a struct

* Get back to template functions type_convert

* Add a type_convert_bf16_rtn, set rtz as default

* Clean up

* Add UnaryConvertPrecision struct for high-precision workloads

* Format

* Update type_convert to UnaryConvert on threadwise level

* Update UnaryConvertPrecision

* Format

* Fix chmod

* Add a flag to pick converion method

* Format

* Remove the added flag

* Merge elementwise op with type conversion

* Move type_convert to elemwise op, update the op

* Update type_convert_precision -> bf16_convert_rtn

* Clean up

* Update comments

* Update the CK_WORKAROUND_DENORM_FIX flag handling

* Update the unneeded op to work but warn user

* Remove the message

* Use a PassThrough instead of ConvertBF16RTN to calcaulate reference

* Format

* Add missing include

* Normalization/split k (#615)

* Add contraction profiler and tests (#701)

* Add contraction profiler and tests

* Build and style fixes

* Allow to use any elementwise operator for ref_contraction

* Introduce profile_contraction_scale and profile_contraction_bilinear

* Make ref_contraction generic and extend interface tests

* Stylistic minor fixes

* Extend test_contraction_interface

---------
Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

72b7ae25

03 May, 2023 1 commit

Fix grouped_gemm_splitk kernels on MI300. (#694) · 4a51d2da

Illia Silin authored May 03, 2023



* replace amd_buffer_atomic_add with hip_atomic_add

* fix grouped_gemm_splitk kernels on mi300

* fix syntax

* revert experimental atomic_add changes

---------
Co-authored-by: Jing Zhang <jizhan@amd.com>

4a51d2da

24 Apr, 2023 1 commit

Grouped Gemm + SplitK + simplified Kernel Args (#669) · 8bb2bb4a

Adam Osewski authored Apr 24, 2023



* simplify karg in device/grid split-k op

* fix mk_kn_mn instances

* add more instances

* B2C with 3D grid for KSplit

* Remove unused code.

* Use default B2C (3D grid) in grid gemm v2r4r2.

* Device gemm splitk use B2C map.

* Device GroupedGemmXdlSplitKCShuffle

* Example for GroupedGemm Xdl SplitK

* Introduce Device GroupedGemmSplitK

* Fix updating kbatch size.

* Add instance mk-nk-mn

* Enable set kbatch in profiler.

* Add GGemmSplitK mk-kn-mn instances

* Add more instances & split into multiple files.

* minor fix

* tuning

* clean

* disabled failed instances

* use pipe v2

* Ignore arg on not supported arch.

* fix warning

---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>

8bb2bb4a

15 Mar, 2023 1 commit

Device Op GroupedGemmMultipleD + example fp16 (#633) · a2d5ca8e

Adam Osewski authored Mar 15, 2023



* Pass shared mem pointer as pointer to void.

* Device Op GroupedGEMM Multiple D

* Example for grouped gemm multiple d.

* Add MI200 to supported archs.

---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>

a2d5ca8e

11 Nov, 2022 1 commit

Rangify constructor of HostTensorDescriptor & Tensor<> (#445) · 4a2a56c2

Po Yen Chen authored Nov 12, 2022

* Rangify STL algorithms

This commit adapts rangified std::copy(), std::fill() & std::transform()

* Rangify check_err()

By rangifying check_err(), we can not only compare values between
std::vector<>s, but also compare any ranges which have same value
type.

* Allow constructing Tensor<> like a HostTensorDescriptor

* Simplify Tensor<> object construction logics

* Remove more unnecessary 'HostTensorDescriptor' objects

* Re-format example code

* Re-write more HostTensorDescriptor ctor call

4a2a56c2

13 Oct, 2022 1 commit

Refactor device op implementations into `impl` subdirectory. (#420) · 30480288

Adam Osewski authored Oct 13, 2022



* Move kernel implementation files under impl directory.

* Update examples paths.

* Update device kernel impl include paths.

* Update tensor operation instances include paths.

* Update profiler and tests include paths.

* Clang-format

* Update include paths for batched gemm reduce

* Refactor UnitTest ConvNDBwdWeight.

* Refactor fwd and bwd data convND UT.

* Fix used test macro.

* Fix include path.

* Fix include paths.

* Fix include paths in profiler and tests.

* Fix include paths.
Co-authored-by: Adam Osewski <aosewski@amd.com>

30480288

25 Aug, 2022 1 commit

GEMM batched/splitK/cgemm/grouped int4 examples (#383) · 3ab20fd7

Adam Osewski authored Aug 26, 2022



* Grouped GEmm int4.

* Formatting + fix K dimension for int8.

* Batched Gemm int4 example.

* CGEMM int4 example.

* Include inc filese in clang-format.

* SplitK int4 example

* Refactoring of performance measurement.

* Fix #ifdef statements.
Co-authored-by: Adam Osewski <aosewski@amd.com>

3ab20fd7

23 Aug, 2022 1 commit

Add examples of batched/grouped/SplitK Gemm for int8/bfp16/fp16/fp32 (#361) · 60914583

zjing14 authored Aug 23, 2022



* add examples into grouped/batched_gemm

* adding splitK examples

* fixed splitK

* add bfp16 int8 example into splitK

* formatting

* use static_cast

* added common for batched_gemm

* add commons for examples of splitK/batched/grouped_gemm

* return true

* adjust splitK check tol

* update example
Co-authored-by: Chao Liu <lc.roy86@gmail.com>

60914583

29 Jul, 2022 1 commit

Clean up conv example, Instances, profiler and test (#324) · 500fa995

Chao Liu authored Jul 29, 2022

* convnd_fwd fp16 example

* update example

* update example

* update instance

* updating refernce conv

* update reference conv

* update conv fwd profiler

* update conv 1d and 3d instance

* update include path

* clean

* update profiler for conv bwd data and weight

* update conv bwd weight

* clean

* update conv example

* update profiler for conv bwd weight

* update ckprofiler for conv bwd data

* fix reference conv bwd data bug; update conv bwd data test

* update examples

* fix initialization issue

* update test for conv fwd

* clean

* clean

* remove test case too sensitive to error threshhold

* fix test

* clean

* fix build

* adding conv multiple d

* adding conv multiple D

* add matrix padder

* add gemm padding to convnd

* adding group conv

* update gemm multi-d

* refactor

* refactor

* refactor

* clean

* clean

* refactor

* refactor

* reorg

* add ds

* add bias

* clean

* add G

* adding group

* adding group

* adding group

* update Tensor

* clean

* update example

* update DeviceGemmMultipleD_Xdl_CShuffle

* update conv bwd-data and bwd-weight

* upate contraction example

* update gemm and batch gemm with e permute

* fix example build

* instance for grouped conv1d

* update example

* adding group conv instance

* update gemm bilinear instance

* update gemm+add+add+fastgelu instance

* update profiler

* update profiler

* update test

* update test and client example

* clean

* add grouped conv into profiler

* update profiler

* clean

* add test grouped conv, update all conv test to gtest

* update test

500fa995

21 Jul, 2022 1 commit

Grouped Gemm device with multiD grid (#319) · 7959dad5

zjing14 authored Jul 21, 2022



* replace gridwise_v2r3 with multiD

* adjust parameters

* add instances

* fixed test_grouped_gemm

* fix standalone softmax race condition around blockwise reduction

* fixed ci

* fixed comment: remove redundant workspace

* use instanceFactory

* add test layout

* add empty Ds

* add bias example

* use array

* sperate examples
Co-authored-by: Anthony Chang <ac.chang@outlook.com>

7959dad5

25 Jun, 2022 2 commits

add license in file (#303) · d3051d75
Chao Liu authored Jun 24, 2022

d3051d75

Absolute include path (#281) · d1db6a0c

Chao Liu authored Jun 24, 2022

* ad gelu and fast_gelu

* added GeLU and fast GeLU

* clean up

* add gemm+fastgelu example

* add gemm+gelu instances

* update profiler

* clean up

* clean up

* adding gemm+bias+activation

* clean

* adding bias

* clean

* adding gemm multiple d

* debugging

* add gemm bias add fastgelu

* rename, clean

* refactoring; add readme

* refactor

* refactor

* refactor

* refactor

* refactor

* refactor

* fix

* fix

* update example

* update example

* rename

* update example

* add ckProfiler

* clean

* clean

* clean

* clean

* add client app example

* update readme

* delete obselete files

* remove old client app

* delete old file

* cleaning

* clean

* remove half

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path for all examples

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* revert client app example

* clean build

* fix build

* temporary disable client test on Jenkins

* clean

* clean

* clean

d1db6a0c

31 May, 2022 1 commit

Pass gemm_descs for grouped gemm via __constant__ buff (#232) · b6eaf3eb

zjing14 authored May 31, 2022

* moved gemm_descs_args into const buff

* use CK_CONSTANT_ADDRESS_SPACE instead of global constant

* clean

* moved hipMemAlloc outside of deviceOp

* add SetWorkSpacePointer

* fix ignore

b6eaf3eb

26 May, 2022 1 commit

Add FP64 XDL GEMM built-in function (#199) · 3e6c2610

ltqin authored May 27, 2022



* add intrin_mfma_f64_16x16x4f64

* add example

* gemm reference add double data type

* chang init data

* fix M N PerXdlops

* fix ifdef

* add comparsion config

* add conv fwd example

* format log out

* change rc matrix egister layout

* reorganize example

* reorganize example 2

* format,because merge develop

* fix call impl adding acc data type

* lost ;

* add compiler warning

* change example tunning parameters

* add test for fp64

* add instance

* add test/gemm/gemm_fp64.cpp

* fix get name issue

* remove some tunning parameter

* fix conflict

* format

* use integer value for GEMM test

* add acc data type

* remove typeid because fp16

* fix streamconfig etc bug from merging develop

* format

* remove test_gemm_xdl_fp64

* add AccDataType

* AccDataType problem
Co-authored-by: qinletao <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

3e6c2610

13 May, 2022 1 commit

Validate examples in CI (#233) · 9f71ff48

Anthony Chang authored May 14, 2022



* validate examples in ctest runs

* format

* fix usage of check_err

* amend

* add example codes to custom target 'check'
Co-authored-by: Chao Liu <chao.liu2@amd.com>

9f71ff48

12 May, 2022 1 commit

Add host API (#220) · cec69bc3

JD authored May 12, 2022



* Add host API

* manually rebase on develop

* clean

* manually rebase on develop

* exclude tests from all target

* address review comments

* update client app name

* fix missing lib name

* clang-format update

* refactor

* refactor

* refactor

* refactor

* refactor

* fix test issue

* refactor

* refactor

* refactor

* upate cmake and readme
Co-authored-by: Chao Liu <chao.liu2@amd.com>

cec69bc3

09 May, 2022 1 commit

Resolution of issue #153: Add compiler warning on comparing int and size_t (#212) · f03a1738

myamlak authored May 09, 2022



* Turning compare warnings on

* Cleaning part I

* Cleaning part II

* Explicit static_cast to ck::type_convert

* Resolving large tensor size issue.

* format

* revert change to tensor descriptor; promote lementSpaceSize to 64bit

* use integer value for GEMM test

* Review remarks

* Review remarks + issues with (un)signed arithmetic

* Format fix

* Format

* Clang-format.

* fix 2gb limit issue
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Adam Osewski <aosewski@amd.com>

f03a1738

05 Apr, 2022 1 commit

Common forward convolution utility refactor. (#141) · abf4bdb9

Adam Osewski authored Apr 05, 2022



* Convolution ND

* Code unification across dimensions for generating tensor descriptors.
* Example
* Instances

* Move convnd f32 instance file to comply with repo structure.

* Conv 1D tensor layouts.

* Formatting and use ReferenceConv

* Reference ConvFwd supporting 1D and 2D convolution.

* Debug printing TensorLayout name.

* Conv fwd 1D instance f32

* Refactor conv ND example.

Needed to support various conv dimensio.

Needed to support various conv dimensions

* Rename conv nd example director to prevent conflicts.

* Refactor some common utility to single file.

Plus some tests.

* Refactor GetHostTensorDescriptor + UT.

* Add 1D test case.

* Test reference convolution 1d/2d

* Remove some leftovers.

* Fix convolution example error for 1D

* Refactor test check errors utility function.

* Test Conv2D Fwd XDL

* More UT for 1D case.

* Parameterize input & weight initializers.

* Rename example to prevent conflicts.

* Split convnd instance into separate files for 1d/2d

* Address review comments.

* Fix data type for flops/gbytes calculations.

* Assign example number 11.

* 3D cases for convolution utility functions.

* 3D reference convolution.

* Add support for 3D convolution.

* Check for inputs bigger than  2GB.

* Formatting

* Support for bf16/f16/f32/i8 - conv instances + UT.

* Use check_err from test_util.hpp.

* Split convnd test into separate files for each dim.

* Fix data generation and use proper instances.

* Formatting

* Skip tensor initialization if not necessary.

* Fix CMakefiles.

* Remove redundant conv2d_fwd test.

* Lower problem size for conv3D UT.

* 3D case for convnd example.

* Remove leftovers after merge.

* Add Conv Specialization string to GetTypeString

* Skip instance causing numerical errors.

* Small fixes.

* Remove redundant includes.

* Fix namespace name error.

* Script for automatic testing and logging convolution fwd UTs

* Comment out numactl cmd.

* Refine weights initalization and relax rtol for fp16

* Move test_util.hpp to check_err.hpp

* Refine weights initalization and relax rtol for fp16

* Refactor common part of test conv utils.

* Move utility function to single common place.

* Add additional common functions to utility.

* Refactor convnd_fwd_xdl examples.

* Remove redundant files.
* Unify structure.

* Add constructor to ConvParams.

* And add input parameters validation.

* Modify conv examples to use single utility file.

* Remove check_error from host_tensor.hpp

* Get rid of check_indices function.

* Remove bf16_to_f32 function overload for scalars.

* Fix namespace.

* Add half_float::half for check_err.

* Fix conv params size in UT.

* Fix weights initialization for int8.

* Fix weights initialization for int8.

* Add type_convert when store output in ref conv 1D.

* Get back old conv2d_fwd_xdl operation.

* Silence conv debug print.

* format

* clean

* clean

* Fix merge.

* Fix namespace for check_err

* Formatting.

* Fix merge artifacts.

* Remove deleted header.

* Fix some includes and use ck::utils::check_err.

* Remove unused check_indices restored by previous merge.

* Fix namespaces after merge.

* Fix compilation error.

* Small fixes.

* Use common functions.
* Fix filename
* Fix namespaces.

* Fix merge artifact - retrieve removed by accident fun.

* Fix ConvForwardSpecialization.

* Adhere to coding style rules.

* Fix merge artifacts.
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

abf4bdb9

31 Mar, 2022 1 commit

Compile for gfx908 and gfx90a (#130) · cd167e49

Chao Liu authored Mar 31, 2022

* adding compilation for multiple targets

* fix build

* clean

* update Jekinsfile

* update readme

* update Jenkins

* use ck::half_t instead of ushort for bf16

* rename enum classes

* clean

* rename

* clean

cd167e49

24 Mar, 2022 1 commit
- fixed alloc mem size (#145) · 12f4cfce
  zjing14 authored Mar 23, 2022
  
  12f4cfce
22 Mar, 2022 1 commit

Grouped GEMM for fp16 (#126) · 716f1c7f

zjing14 authored Mar 22, 2022

* init of grouped_gemm

* 2 gemm test

* perf test

* clean

* wrap desc into a struct

* test cast static_arr to pointer

* add ptr to GemmDesc

* add grouped gemm profiler

* fixed mem issue with unique_ptr

* clean

* clean

* finished ckprofiler

* Update README.md

* readme

* fixed readme

* add example

* improve code

* fixed comments: reserve, seperate ptr and gemm_shapes

* merge group and non-group

* fixed comments: replace push_back with emplace_back to avoid copy constructor

* fixed comments: unified blk2ctile; add test

* ci fix

* fixed ci

* fixed ci

* fixed ci

716f1c7f