Commits · 83d926dcb1393a800f16d91d3609901ee87084de · gaoqiong / composable_kernel

15 Jun, 2023 1 commit
- clang-format · 83d926dc
  aska-0096 authored Jun 15, 2023
  
  83d926dc
13 Jun, 2023 1 commit
- deprecate inline asm wmma · d44f6660
  aska-0096 authored Jun 13, 2023
  
  d44f6660
12 Jun, 2023 1 commit

Fix incomplete object size (=4n + 3) support of amd_wave_read_first_lane() (#738) · 7c24654c

Po Yen Chen authored Jun 12, 2023

* Fix wrong pointer type

* Rename type trait get_unsigned_int<> to get_carrier<>

* Add 3-bytes carrier type

* Add missing __device__ specifier

* Rename template non-type parameter

* Leave the rest byte uninitialized

* Avoid invoking (host) STL algorithms

* Remove unnecessary 'inline' specifier

* Extract common logic out as helper method

* Hide dummy member function

* Add missing __device__ specifier

7c24654c
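
The "4n + 3" case in the commit above concerns objects whose size is not a multiple of four bytes when they are moved through a 32-bit lane read. The following is a minimal host-side sketch of that idea, not the commit's implementation: `broadcast32` and `read_through_words` are hypothetical names, `broadcast32` only stands in for the point where an intrinsic such as `__builtin_amdgcn_readfirstlane()` would be called, and the tail is handled with a zero-padded 32-bit word rather than the dedicated carrier types the commit describes.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical stand-in for a 32-bit lane broadcast. On the host it simply
// returns its argument.
inline std::uint32_t broadcast32(std::uint32_t v) { return v; }

// Copy an object by passing it through 32-bit words, then a tail of 1 to 3
// bytes when sizeof(T) is not a multiple of 4.
template <typename T>
T read_through_words(const T& src)
{
    static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
    static_assert(std::is_default_constructible<T>::value, "T must be default constructible");

    constexpr std::size_t full_words = sizeof(T) / 4;
    constexpr std::size_t tail_bytes = sizeof(T) % 4;

    const unsigned char* in = reinterpret_cast<const unsigned char*>(&src);
    T dst{};
    unsigned char* out = reinterpret_cast<unsigned char*>(&dst);

    // Whole 32-bit words.
    for(std::size_t i = 0; i < full_words; ++i)
    {
        std::uint32_t w;
        std::memcpy(&w, in + 4 * i, 4);
        w = broadcast32(w);
        std::memcpy(out + 4 * i, &w, 4);
    }

    // Remaining 1, 2 or 3 bytes: only tail_bytes bytes are read and written,
    // so nothing past the end of the object is touched.
    if(tail_bytes != 0)
    {
        std::uint32_t w = 0;
        std::memcpy(&w, in + 4 * full_words, tail_bytes);
        w = broadcast32(w);
        std::memcpy(out + 4 * full_words, &w, tail_bytes);
    }
    return dst;
}
```

For a 7-byte struct this performs one full-word transfer followed by a 3-byte tail, which is the case named in the commit title.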

08 Jun, 2023 1 commit
- support dynamic buffer using memory coherence glc_slc bit from template (#725) · 016ebaa7
  carlushuang authored Jun 08, 2023
  
  016ebaa7
31 May, 2023 2 commits

update copyright headers (#726) · b94fd0b2
Illia Silin authored May 31, 2023

b94fd0b2

Add class type support for __builtin_amdgcn_readfirstlane() (#711) · 582e31e8

Po Yen Chen authored May 31, 2023

* Add overloaded version of __builtin_amdgcn_readfirstlane()

* Remove 'static' specifiers

* Remove more 'static' specifier

* Replace unsigned char by std::byte

* Add 'const' specifier to never changing variable

* Add 'inline' specifier to function definition

* Fix wrong boundary calculation logic

* Rename type trait

* Remove std:: qualifier from standard types

* Replace 'size_t' by 'unsigned'

* Use type alias to hint usage

* Replace static_for<> by ordinary 'for' loop

* Rename readfirstlane() to amd_wave_read_first_lane()

* Rename file readfirstlance.hpp as amd_wave_read_first_lane.hpp

* Reorder statements

582e31e8

24 May, 2023 1 commit

Clean-up the headers (#713) · ac9e01e2

Illia Silin authored May 24, 2023



* fix headers for gpu instances

* remove unused headers

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

ac9e01e2

10 May, 2023 1 commit
- 1. Enable 2-stage global Prefetch (may cause VGPR spilling) · 0bb08f4b
  aska-0096 authored May 10, 2023
  
  2. Enable FP16 accumulator blockwise_gemm
  
  0bb08f4b
04 May, 2023 1 commit

Optimize bf16 conversion (#664) · b076a02a

Rostyslav Geyyer authored May 04, 2023

* Add TypeConvert class and start refactoring

* Refactor TypeConvert as a struct

* Get back to template functions type_convert

* Add a type_convert_bf16_rtn, set rtz as default

* Clean up

* Add UnaryConvertPrecision struct for high-precision workloads

* Format

* Update type_convert to UnaryConvert on threadwise level

* Update UnaryConvertPrecision

* Format

* Fix chmod

* Add a flag to pick conversion method

* Format

* Remove the added flag

* Merge elementwise op with type conversion

* Move type_convert to elemwise op, update the op

* Update type_convert_precision -> bf16_convert_rtn

* Clean up

* Update comments

* Update the CK_WORKAROUND_DENORM_FIX flag handling

* Update the unneeded op to work but warn user

* Remove the message

* Use a PassThrough instead of ConvertBF16RTN to calculate reference

* Format

* Add missing include

b076a02a
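
The commit above mentions two float-to-bf16 rounding modes, rtz (round toward zero) and rtn (round to nearest). As a rough illustration of how the two differ, here is a minimal host-side sketch. It is not the library's code: the function names are hypothetical, and the rtn variant assumes ties-to-even with NaN preservation, which the log does not spell out.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: reinterpret a float as its 32-bit pattern.
inline std::uint32_t float_bits(float x)
{
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof(u));
    return u;
}

// Round toward zero: keep the upper 16 bits and drop the rest.
inline std::uint16_t float_to_bf16_rtz(float x)
{
    return static_cast<std::uint16_t>(float_bits(x) >> 16);
}

// Round to nearest, ties to even (assumed), keeping NaN as NaN.
inline std::uint16_t float_to_bf16_rtn(float x)
{
    const std::uint32_t u = float_bits(x);

    // NaN: exponent all ones and a non-zero mantissa. Force a mantissa bit
    // that survives truncation so the result stays a NaN.
    if((u & 0x7FFFFFFFu) > 0x7F800000u)
        return static_cast<std::uint16_t>((u >> 16) | 0x0040u);

    // Add 0x7FFF plus the lowest kept bit, so exact ties round to even.
    const std::uint32_t lsb = (u >> 16) & 1u;
    return static_cast<std::uint16_t>((u + 0x7FFFu + lsb) >> 16);
}
```

For an input such as `1.01171875f`, which lies exactly halfway between two bf16 values, the rtz variant keeps the lower neighbour while the rtn variant rounds up to the even one.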

28 Apr, 2023 1 commit

Syncing up from internal repo to enable MI300. (#690) · 4feebedd

Illia Silin authored Apr 28, 2023



* enable gfx940

* switch between intrinsic mfma routines on mi100/200 and mi300

* fix mfma_int8 on MI300

* disable 2 int8 examples on MI300

* Update cmake-ck-dev.sh

* restore gitignore file

* modify Jenkinsfile to the internal repo

---------
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>

4feebedd

21 Apr, 2023 1 commit

fix layernorm, reduction Ops (#4) · 394dbf83

Haocong WANG authored Apr 21, 2023



* [Navi3x] Fix Gridwise_multiple_d operation (#649)

* Add CMake Option "USE_OPT_NAVI3X"

* fix bug

* standardize docs (#655)

* Separate bibtex requirement from rocm-docs-core (#656)

* separate bibtex requirement from rocm-docs-core

* point requirements to source rocm-docs-core repo

* Add CMake Option "USE_OPT_NAVI3X" (#647)

* Add CMake Option "USE_OPT_NAVI3X"

* remove navi3x opt compile option from cmake script

* Conv + quantization + tanh  (#645)

* Rename file. Prepare to support another activation

* Add comment for quantization

* Extract out_elementop

* Add tanh example

* Add conv + bias + tanh quantization instance

* Add missing parameter

* Refine cmake

* Add external api and client example

* Extract variable in example

* Fix the comment

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

* Add a denorm test fix (#603)

* Add type_convert implementations for bf16

* Add the fix for conv_fwd

* Add the fix for conv_bwd_data

* Add the fix for conv_bwd_weight

* Format

* Format

* Another format

* Add a macro to use workaround on MI200 only

* Format

---------
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>

* simplify karg in device/grid of split-k op (#644)

* simplify karg in device/grid split-k op

* fix mk_kn_mn instances

* add more instances

* use name from tensor layout

* fix 3rd dword of buffer source descriptor (#659)

* add fp64 instances (#658)
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>

* Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665)

This reverts commit bb5530af.

* Groupnorm + swish external api (#668)

* Rename to proper naming

* Add example of groupnorm + swish

* Extract duplicate code in example

* Add groupnorm + swish instances

* Refactor instance generation, split into multiple cpp files

* Add external api and client example

* Refine profiler message

* Use ck math version of exp

* Refine problem size in example

* Add host version of exp

* add a macro to turn on/off denorm fix (off by default) (#673)

* add a macro to turn off denorm fix by default

* expose the macro

---------
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>

* fixed quant example (#672)
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>

* Add dependabot config and pin rocm-docs-core (#663)

* [gtest] suppress unsafe buffer warn (#670)

ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912



* Add memory index guard in wmma device ops (#667)

* Add more macros to turn on/off denorm fix (#678)
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>

* Fix a typo (#676)

* Add (#677)

* Allow using ROCm release candidate compilers. (#679)

* enable use of rocm5.5 release candidate 4

* upgrade to ROCM5.5 RC5

* try fix the PUB_KEY error, remove the cmake-data package

* upgrade to latest cmake version

* use private dockerhub repo for rocm5.5 rc5

* add missing bracket

* Disable SkipLDS & Align AIT api

* Update dependabot config (#682)
Co-authored-by: samjwu <samjwu@users.noreply.github.com>

* update attn api

* solve type_convert bug + enable

---------
Co-authored-by: Sam Wu <sjwu@ualberta.ca>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: samjwu <samjwu@users.noreply.github.com>
Co-authored-by: haocwang <Haocong.WANG@amd.com>

394dbf83

19 Apr, 2023 1 commit

Merge origin dev (#2) · cad3212d

Haocong WANG authored Apr 19, 2023



* [Navi3x] Fix Gridwise_multiple_d operation (#649)

* Add CMake Option "USE_OPT_NAVI3X"

* fix bug

* standardize docs (#655)

* Separate bibtex requirement from rocm-docs-core (#656)

* separate bibtex requirement from rocm-docs-core

* point requirements to source rocm-docs-core repo

* Add CMake Option "USE_OPT_NAVI3X" (#647)

* Add CMake Option "USE_OPT_NAVI3X"

* remove navi3x opt compile option from cmake script

* Conv + quantization + tanh  (#645)

* Rename file. Prepare to support another activation

* Add comment for quantization

* Extract out_elementop

* Add tanh example

* Add conv + bias + tanh quantization instance

* Add missing parameter

* Refine cmake

* Add external api and client example

* Extract variable in example

* Fix the comment

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

* Add a denorm test fix (#603)

* Add type_convert implementations for bf16

* Add the fix for conv_fwd

* Add the fix for conv_bwd_data

* Add the fix for conv_bwd_weight

* Format

* Format

* Another format

* Add a macro to use workaround on MI200 only

* Format

---------
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>

* simplify karg in device/grid of split-k op (#644)

* simplify karg in device/grid split-k op

* fix mk_kn_mn instances

* add more instances

* use name from tensor layout

* fix 3rd dword of buffer source descriptor (#659)

* add fp64 instances (#658)
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>

* Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665)

This reverts commit bb5530af.

* Groupnorm + swish external api (#668)

* Rename to proper naming

* Add example of groupnorm + swish

* Extract duplicate code in example

* Add groupnorm + swish instances

* Refactor instance generation, split into multiple cpp files

* Add external api and client example

* Refine profiler message

* Use ck math version of exp

* Refine problem size in example

* Add host version of exp

* add a macro to turn on/off denorm fix (off by default) (#673)

* add a macro to turn off denorm fix by default

* expose the macro

---------
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>

* fixed quant example (#672)
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>

* Add dependabot config and pin rocm-docs-core (#663)

* [gtest] suppress unsafe buffer warn (#670)

ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912



* Add memory index guard in wmma device ops (#667)

* Add more macros to turn on/off denorm fix (#678)
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>

* Fix a typo (#676)

* Add (#677)

* Allow using ROCm release candidate compilers. (#679)

* enable use of rocm5.5 release candidate 4

* upgrade to ROCM5.5 RC5

* try fix the PUB_KEY error, remove the cmake-data package

* upgrade to latest cmake version

* use private dockerhub repo for rocm5.5 rc5

* add missing bracket

* add vector load check

* solve conflicts

---------
Co-authored-by: Sam Wu <sjwu@ualberta.ca>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

cad3212d

10 Apr, 2023 1 commit

Groupnorm + swish external api (#668) · ed3a2e52

rocking5566 authored Apr 10, 2023

* Rename to proper naming

* Add example of groupnorm + swish

* Extract duplicate code in example

* Add groupnorm + swish instances

* Refactor instance generation, split into multiple cpp files

* Add external api and client example

* Refine profiler message

* Use ck math version of exp

* Refine problem size in example

* Add host version of exp

ed3a2e52

29 Mar, 2023 2 commits

Add a denorm test fix (#603) · dbd8f94b

Rostyslav Geyyer authored Mar 29, 2023



* Add type_convert implementations for bf16

* Add the fix for conv_fwd

* Add the fix for conv_bwd_data

* Add the fix for conv_bwd_weight

* Format

* Format

* Another format

* Add a macro to use workaround on MI200 only

* Format

---------
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>

dbd8f94b

Conv + quantization + tanh (#645) · 389e84a8

rocking5566 authored Mar 30, 2023



* Rename file. Prepare to support another activation

* Add comment for quantization

* Extract out_elementop

* Add tanh example

* Add conv + bias + tanh quantization instance

* Add missing parameter

* Refine cmake

* Add external api and client example

* Extract variable in example

* Fix the comment

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

389e84a8

20 Mar, 2023 1 commit

rtn in ternary way (#632) · 8a659a2e

Dan Yao authored Mar 21, 2023



* rtn in ternary way

* Check both flags to preserve NaN

* Format

* Rearrange flag1

* Apply suggestions from code review
Co-authored-by: Ronan Keryell <ronan@keryell.fr>

---------
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: Ronan Keryell <ronan@keryell.fr>

8a659a2e

15 Mar, 2023 2 commits

gemm/Conv xdlops + dlops quantization (#625) · 16dc18e0

rocking5566 authored Mar 16, 2023



* Add conv perlayer quantization

* Add gemm_dlops quantization

* Support int8 for innerproduct

* Refine gemm dlops int8 kernel parameter

* Support gfx908(MI100) and gfx90a(MI200)

* clang-format

* Rename example number

* Support different layout for d tensor

* Add conv dlops perchannel quantization example

* Move to example 40

* Extract the common code for different platform (dlops and xdlops)

* Move to subfolder. Prepare to add other op of quantization

* Refine the quantization instance library

* Add conv dl instances and client example

* Remove unnecessary type

* Add gemm quantization instance

* Add external api and client example

* Refine num_bytes

* Separate different layouts into different cpp files

* Add more xdl instances

* Revert "Remove unnecessary type"

This reverts commit 820869182f6a8f62b2c9004101ba6bf76b96be14.

* Remove CShuffleDataType in dlops
Let acc and CShuffleDataType be the same in xdlops

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

16dc18e0

Fix arch limitation bug (#639) · ea028ac6
Haocong WANG authored Mar 15, 2023

ea028ac6

10 Mar, 2023 1 commit

[Navi3x] Multiple issue fix (#612) · 087e3105

Haocong WANG authored Mar 11, 2023



* Change gridwise gemm mD blockwise gemm to naive

* RRR Gemm fix

* Fix RCR gemm bug

* Isolate wmma instructions

* Update amd_inline_asm.hpp

* Update amd_wmma.hpp

* Update amd_wmma.hpp

* fix syntax and update Jenkinsfile

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>

087e3105

09 Mar, 2023 1 commit
- fix a bug with non-dword-aligned offset when OOB, in case crash (#616) · 76fcdc60
  carlushuang authored Mar 09, 2023
  
  Co-authored-by: zjing14 <zhangjing14@gmail.com>
  
  76fcdc60
24 Feb, 2023 1 commit
- Mat-A LDS Bypass sanity pass · d4adc71a
  aska-0096 authored Feb 24, 2023
  
  d4adc71a
16 Feb, 2023 4 commits
- temp save, add asm backend flag to amd_wmma · c811a0e9
  aska-0096 authored Feb 16, 2023
  
  c811a0e9
- format · 8dbb73b1
  aska-0096 authored Feb 16, 2023
  
  8dbb73b1
- fix a host tensor bug and clean up flash-attn code · 27dc055b
  aska-0096 authored Feb 16, 2023
  
  27dc055b
- sanity check pass · 4ddda63b
  aska-0096 authored Feb 16, 2023
  
  4ddda63b
15 Feb, 2023 1 commit

Improve normalization (#580) · 6a6163a3

rocking5566 authored Feb 16, 2023

* Sync the order of type string with template parameter

* Add more instances

* Check the vector size and remove redundant var

* Extract var to static, prepare to separate sweep once kernel

* Separate sweeponce flow and optimize the flow

* 1. Rename AccDatatype in normalization to computeData
2. Rename AccElementwiseOperation to YElementwiseOperation in normalization

* Remove useless code

* Update naive variance kernel

* Refine string

* Fix typo

* Support naive variance for device_normalization

* Check the blocksize

* Share the VGPR of x and y

* Share the VGPR of gamma and beta

* Add more instances

* Support fp16 sqrt for experiment

* Add CHANGELOG

* Fix typo

* clang-format

6a6163a3

09 Feb, 2023 1 commit

Gemm+layernorm instance, ckProfiler, client example (#568) · f7d28f3e

rocking5566 authored Feb 10, 2023

* Add gemm + layernorm instance

* Add ckProfiler

* Add test

* Add client example

* Detect if user forgot to set the workspace

* Use literal in the example

* [What] use builtin function for sqrt
[Why] compiler will not use v_sqrt_f64_e64 if we use ::sqrt()

* check gemm validity in IsSupportedArgument

* Add more testcases

* Merge duplicated folder in client example

* Print more information

* Use better kernel parameter for MS problem size

* clang format

* Add constexpr for if condition and remove redundant include

* Remove cstdlib and add constexpr

f7d28f3e

03 Feb, 2023 1 commit
- save progress · a0a469e4
  aska-0096 authored Feb 03, 2023
  
  a0a469e4
18 Jan, 2023 1 commit

Wavelet (inter-wave consumer-producer) GEMM (#310) · 1cfa8760

Raman R jana authored Jan 18, 2023



* wavelet gemm programming model support for CK

* GEMM pipeline update for wavelet programming model

* Updated wavelet programming pipeline

* fixes for global-write for math-wave

* fixed bug in global writes

* Updated comments for better readability

* fixed clang format errors

* added block_lds without barrier sync

* clean

* clean

* clean

* clean

* refactor

* prototype

4 layouts

fix default stride

all problem sizes

tidy

move file

update build script

restore old file

fix build

* refactor standalone test to use gemm test harness

* simplify gemm test

* update build script

* remove redundant

* early return when cmd arg doesn't match

* tidy

* report failure when result not validated

* tidy

* Add comment depicting B2C mapping pattern.

* Formatting & comments.

* Comparison with custom B2C mapping pattern.

* Example for wavelet gemm.

* Add wavelet to Gemm standalone test.

* Remove debug code.

* Remove dangling #endif directive.

Co-authored-by: root <Raman Jana>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Anthony Chang <ac.chang@outlook.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

1cfa8760

17 Jan, 2023 2 commits

Reduction external API and client examples (#493) · 80e05267

Qianfeng authored Jan 17, 2023



* Change to the DeviceReduce base class template to include all problem description information

* Add external api for reduction

* Add client example to test the reduction external api

* Spelling correction

* Re-implement the host_reduction to follow the DeviceReduce base API format

* Change the reduce profiler to call the external API for collecting device instances

* Rename reduce client example directory from 08_reduce to 12_reduce

* Remove (void) before the functional call

* Tiny update in reduce client example

* Tiny update in profile_reduce_impl.hpp

* Rename the reduce client example directory
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

80e05267

[Navi3x-LWPCK-545] Block-wise GEMM + Real GEMM_WMMA_FP16 (#541) · 919aeb1f

Haocong WANG authored Jan 17, 2023

* wmma_op + unit test

* add arch limitation to wmma test

* change arch limitation

* Refactor + Add all type unit test(int4 compile failed)

* Add f32_16x16x16_bf16 unit test

* tempsave

* tempsave

* tempsave

* runtime bug, cannot find symbol

* workaround for incorrect HIP warpSize return value

* debugging

* tempsave

* Correctness OK, waiting for optimization

* Tidy up + format

* temp save

* temp save, reproduce the v_bfi_b32 issue

* add inline asm for wmmaop test

* tidy up

* clean some debug purpose code

* discard some codes

* clang format

* clang format

* compiler issue fixed + increase tile size

919aeb1f

12 Jan, 2023 1 commit

Remove including of cmath (#551) · a17b0414

Qianfeng authored Jan 12, 2023

* Let cmath included when compiling host codes in math_v2.hpp

* Remove including of cmath in device_base.hpp and device_permute.hpp

a17b0414

15 Dec, 2022 2 commits
- clang format · 5d5891b0
  aska-0096 authored Dec 15, 2022
  
  5d5891b0
- clean some debug purpose code · 2a0e5439
  aska-0096 authored Dec 15, 2022
  
  2a0e5439
13 Dec, 2022 1 commit
- temp save, reproduce the v_bfi_b32 issue · e43df26a
  aska-0096 authored Dec 13, 2022
  
  e43df26a
12 Dec, 2022 1 commit
- temp save · 9739ede0
  aska-0096 authored Dec 12, 2022
  
  9739ede0
09 Dec, 2022 2 commits
- Tidy up + format · 0a808724
  aska-0096 authored Dec 09, 2022
  
  0a808724
- Correctness OK, waiting for optimization · 9bd44685
  aska-0096 authored Dec 09, 2022
  
  9bd44685
07 Dec, 2022 1 commit
- modified half function in math_v2.hpp (#528) · ce87b4f7
  guangzlu authored Dec 08, 2022
  
  Co-authored-by: Chao Liu <chao.liu2@amd.com>
  
  ce87b4f7
02 Dec, 2022 1 commit

[Navi3x-LWPCK-449] wmma_op + unit test (#484) · abf9cc6c

Haocong WANG authored Dec 03, 2022



* wmma_op + unit test

* add arch limitation to wmma test

* change arch limitation

* Refactor + Add all type unit test(int4 compile failed)

* Add f32_16x16x16_bf16 unit test

* Remove int4 related

* delete deprecated test
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

abf9cc6c