Commits · bec6fbc65fe766ab23fe563675703defdb0dd2be · gaoqiong / composable_kernel_ROCM

09 Nov, 2024 1 commit

dummycoderfe authored Nov 09, 2024



* add moe_sorting & check ok

* fix comments & typo

* Run remod.py under include/ck_tile & example/ck_tile directories

* format codes

* fix output ci check bug

* fix moe sorting readme and error commit file

* use magic div to accelerate compute

* add a loop unroll for moe lds ops

* add extblocksnel to set zeros for moebufs

* [Ck_tile] moe set zero run ok, add size check and fix ref check

* [Ck_tile]fix moe_sorting fuse set_zero remod

* [Ck_tile] change name style, fix zero buffer size err, change folder

* [Ck_tile] moe_sorting: fix name style

* [Ck_tile] moe_sorting, remove useless params in traits

* [Ck_tile] change outputtile cnt * unit_size; change output buf alloc

---------
Co-authored-by: dummycoderfe <noplydummmycoder@163.com>
Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>

bec6fbc6

08 Nov, 2024 1 commit

[Ck tile] layernorm2d fwd optimize (#1637) · 686a58a9

dummycoderfe authored Nov 08, 2024



* optimize small N case using vec io and using rcp div

* [Ck_tile] layernorm, add param to control fastdiv; change generate codes and test pass

* [Ck_tile] fix blockSize compute in Generic2dBlockShape

* [Ck_tile]fix kfastfdiv template style

* [Ck_tile] layernorm, fix style in review

---------
Co-authored-by: dummycoderfe <noplydummmycoder@163.com>


07 Nov, 2024 1 commit
- enable compilation for generic navi targets (#1645) · 75c5bfa3
  Illia Silin authored Nov 07, 2024
  
05 Nov, 2024 1 commit
- Statically Cast Pointer Offset (#1631) · d0e3a70a
  darren-amd authored Nov 05, 2024
```
* explicit cast ptr offset

* formatting change
```
02 Nov, 2024 1 commit

[CK_TILE] layernorm have more accurate residual (#1623) · cb6c5d39

carlushuang authored Nov 02, 2024



* more accurate residual

* modify comment

* Fix literal case in README.md

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>


01 Nov, 2024 2 commits

[Ck_tile] smoothquant (#1617) · fbd65454

rocking authored Nov 01, 2024



* fix compile error

* fix typo of padding

* Add smoothquant op

* Add smoothquant instance library

* refine type

* add test script

* Re-generate smoothquant.hpp

* Always use 'current year' in copyright

* use Generic2dBlockShape instead

* Add vector = 8 instance back

* Find exe path automatically

* Simplify the api condition

* Remove debugging code

* update year

* Add blank line between function declaration

* explicitly cast return value to dim3

* refine return value

* Fix default warmup and repeat value

* Add comment

* refactor smoothquant cmake

* Add README

* Fix typo

---------
Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>


[layernorm] hot fix (#1620) · 550248de
carlushuang authored Nov 01, 2024
```
* hot fix ln

* some rename
```

31 Oct, 2024 1 commit

[CK_TILE] layernorm support fused-quant/fused-add (#1604) · c3a4800c

carlushuang authored Oct 31, 2024

* add prenorm/postnorm support, refactor using generate.py

* update README

* update README

* fix format

* update some description and fix format

* update format

* format

* use non-raw for loading

* format and update n4096

* dynamic-quant ready

* update readme

* support fused dynamic-quant

* update fused-quant, with smooth

* update README

* update args

* update some based on comment


30 Oct, 2024 5 commits

Remove virtual destructors from unary ops (#1610) · 9a8a5213
Bartłomiej Kocot authored Oct 30, 2024
```
* Remove virtual destructors from unary ops

* Fixes

* Fixes

* clang format fixes
```
clang-format (#1612) · 7d911154
rocking authored Oct 30, 2024


[CK-Tile] Universal gemm memory bound pipeline (#1558) · 24d996aa

Adam Osewski authored Oct 30, 2024

* CK-Tile GEMM with memory bound pipeline.

* Memory bound gemm pipeline.

* Fix not closed namespace.

* Block gemm mem pipeline draft.

* Do not use ck_tile:: within ck_tile namespace.

* Refactoring & Move Layout info to pipeline problem.

* Get hot loop and TailNum information before launching kernel.

* Fixes in pipeline.

* Add comment to load_tile_raw and change variable naming style.

* Few small changes & formatting.

* Do not use macro.

* Add gtests.

* Use AccDataType for Output of MFMA instruction.

* Formatting.

* Refactor gemm examples.

* Switch over to current block gemm.

* Use currently available pipeline policy.

* Refactoring and review comments.

* Fixes after merge.

* Add missing include.

* Add load tile overload which accepts output tensor as parameter.

* This gives an 8% perf boost at the cost of using more registers.

* Rename example.

* Small changes.

* Fix compilation err and lower K.

* Support different layouts for A/B

* Fix vector size for different layouts.

* Rename Alignment into VectorSize

* Unblock tests.


[Ck tile] support rmsnorm and related fusion (#1605) · 3d609534

rocking authored Oct 30, 2024

* Add reduce2d new api

* Prevent users from using cross-warp reduction

* Fix bug of std calculation

* Add rmsnorm2d

* Add rmsnorm small example

* Remove static assert to prevent compile fail

* Add script to test performance and correctness

* Add missing cmake change

* refine naming

* refine example of rmsnorm

* Fix bug of rmsnorm

* Refine naming

* Fix cmake

* clang format

* Refine pipeline name

* Add add_rmsnorm2d_rdquant kernel

* Add reduce op

* host verification

* Fix bug of one pass pipeline

* Refine tile size

* Add two pass pipeline

* Rename two pass to three pass

* Fix bug of kSaveX == false

* Add instance library

* Add test script

* Fix bug of x verification

* Add save_x to trait

* Add README

* Move reduce2d into reduce folder

* Fix bug of welford when number of m warp > 1

* remove redundant comment

* 1. move 06_rmsnorm2d to 10_rmsnorm2d
2. move 07_add_rmsnorm2d_rdquant to 11_add_rmsnorm2d_rdquant

* clang format and add missing header

* Add host validation of add + layernorm2d + rsquant

* Revert "Add host validation of add + layernorm2d + rsquant"

This reverts commit 936cb457978b928b90eff89a08fcdb7dc8bbed67.

* Remove deprecated flag


[CK_TILE] Add fmha fwd headdim96 support (#1608) · 86322218

Qianfeng authored Oct 30, 2024



* Add ceil_to_qualified_tile_length()

* Rename kK0BlockLength to kQKHeaddim

* Add kSubQKHeaddim concept to support headdim96

* Fix in math.hpp to avoid using __half interfaces

* Add LdsBufferSequence instance for headdim96

* Update in fmha_fwd/fmha_fwd_splitkv codegen to support hd96 testing

* Disable hd96 instance generation in codegen fmha_fwd and fmha_fwd_splitkv to save compiling time

* Reformat one file

* Fix text alignment in fmha_fwd_splitkv.py

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>


29 Oct, 2024 3 commits
- [CK_TILE] add scatter_gather (#1609) · 4d7e063a
  valarLip authored Oct 29, 2024
  
- [CK_TILE] add generic_permute (#1607) · 9fbd72e9
  valarLip authored Oct 29, 2024
  
- fix compilation errors for gfx12 with clang20 (#1606) · 922e42a0
  Illia Silin authored Oct 28, 2024
  
26 Oct, 2024 4 commits

topk_softmax (#1592) · b098b71b

carlushuang authored Oct 26, 2024

* topk_softmax

* remove some file

* fix atomic linear_offset

* address various comments, and change sfc get_index api to static(tuple)


Add dynamic elementwise op (#1426) · 31bf253a

Bartłomiej Kocot authored Oct 26, 2024



* Add dynamic elementwise op
Co-authored-by: ThruptiRajLakshmanaGowda <thruptiraj.lakshmanagowda@amd.com>

* CI issues fix

* Custom parameter value for dynamic functions - Comments addressed

---------
Co-authored-by: ThruptiRajLakshmanaGowda <thruptiraj.lakshmanagowda@amd.com>
Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>


[CK_TILE] More fmha splitkv optimizations (#1588) · 54f0e6f4

Po Yen Chen authored Oct 26, 2024

* Use pre-defined constants for readability

* Use vector write for o_acc tensor

* Remove no-longer used policy method

* Deprecate no-longer used policy/pipeline

* Specify gemm0/gemm1 block warps separately in codegen

* Fix wrong ps_idx creation logic

* Add single-warp block gemm

* Support single-warp gemm0

* Make MakeCBlockTile() a static method

* Use MakeCBlockTile() to get underlying tile distribution

* Use kNumGemm1Warps to compute # threads for gemm1

* Put normal case in the if clause

* Refine fmha splitkv block mapping

* Refine & fix the lse_acc/o_acc layout

* Fix wrong LDS size for K tile

* Use kK0=64 for hdim=128,256 fmha splitkv kernels

* Use kK1=64 for hdim=32,64,128 fmha splitkv kernels

* Undo kK0/kK1 changes

* Use more reasonable GetAlignmentV() computation

* Using store_tile() in fmha splitkv kernel epilogue


add int8 gemm multiply multiply a8w8 (#1591) · 37f7afed

valarLip authored Oct 26, 2024



* add int8 gemm multiply multiply a8w8

* uncomment

* clang-format-12

* Add example_gemm_multiply_multiply_xdl_int8

* Remove shell scripts

* update preprocess number for mi308; bring back printout in ckprofiler

* format

---------
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: Haocong WANG <haocwang@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>


25 Oct, 2024 2 commits

Generic threshold calculation (#1546) · 9385caa3

aledudek authored Oct 25, 2024

* Calculate generic relative threshold pool3dfwd

* Calculate absolute error threshold pool3d fwd

* Generic threshold calculation take max input for relative error pool3dfwd

* Remove max possible value for error calculation at runtime

* Remove debug print in pool3dfwd

* Pool3d fwd adjusted types in generic threshold calculation

* Generic threshold calculation take into account number of accumulations and accdatatype

* Generic threshold fix final error formula

* Generic threshold calculation - num of accs fix

* Generic threshold calculation - adjust absolute error

* Generic threshold calculation - OutDataType in absolute error


hot_fix epsilon pos (#1597) · 9183ce69
dummycoderfe authored Oct 25, 2024
```
Co-authored-by: dummycoderfe <noplydummmycoder@163.com>
```

22 Oct, 2024 2 commits

Explicit cast values to half (#1593) · 4d5248e2
Jatin Chaudhary authored Oct 22, 2024
```
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
```

update layernorm (#1570) · 0394f8a7

ltqin authored Oct 22, 2024

* port layernorm

* change warp_welford.hpp

* Update warpshuffle

* 1. Add save mean and save std back
2. Move construction of tensor_view and tile_window to operator()

* refine welford max count calculation

* unify layernorm api

* Rename file

* Remove save mean and inv std

* Revert "refine welford max count calculation"

This reverts commit 02236580.

* Fix order of parameter

* refine welford max count calculation again

* Remove fp32 instances

* Fix bug of padding

* refactor api

* Support bf16

* Extract common function

* Refine arg of operator()

* Add kMThreadPerBlock to template parameter

* clang format

* Refine variable name

* Refine file name

* remove redundant line

* refactor layernorm2d pipeline and add block-per-block utility

* fix name

* rename more

* add more block-per-tile instance

* remove duplicated define

* update instance for 2048, 1024 case

* support up to 2048 now

* opt loading

* add n1536

* Add two pass pipeline

* format

* Fix incorrect type

* parallel compilation

* Use smaller N

* fix 2p pass

* Support Repeat_M in distribution

* Refine naming

* Add reduce example

---------
Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>


21 Oct, 2024 1 commit

[CK_TILE] Optimize fmha splitkv & splitkv combine kernels (#1577) · 95e722a3

Po Yen Chen authored Oct 21, 2024

* Use smaller width for lse_accum dist tensor

* Update pipeline comment

* Fix wrong distribution for lse_accum

* Remove duplicate dim in lse_accum dist encoding

* Decide fmha splitkv combine kernel kBlockSize by kM0

* Remove assumption of MPerThread=1

* Add log<4> & log<8> specialization

* Enlarge occupancy array

* Fix vector size for small tile

* Add support for kMaxSplits=8

* Re-format gemm.hpp

* Use 16x16x16 warp gemm for fwd_splitkv

* Centralize policy code changes

* Leave fp8/bf8 tile settings unchanged


16 Oct, 2024 1 commit

[CK_TILE] Improve headdim96 performance for fmha-bwd (#1573) · 14c3cfb1

Qianfeng authored Oct 16, 2024



* Add kQKHeaddimForGemmN and kVHeaddimForGemmN in order to support headdim 96

* Remove the use of MakeKRegBlockDescriptor and MakeVRegBlockDescriptor

* Fix in bwd_pipeline_default_policy

* Remove kQKHeaddim and rename kQKHeaddimForGemmN to kQKHeaddim in the bwd kernel and pipelines

* Replace kVHeaddimForGemmN by kVHeaddim and kDoDvHeaddim

* Update to hd96 tile settings

* Add smoke test scripts for fmha-bwd hd96

* Revert "Add smoke test scripts for fmha-bwd hd96"

This reverts commit 7ca7e1a93dc65eb99ce3ff4e82693589830e42a2.

* Remove hd96 tile settings in fmha_bwd codegen to save compiling

* Fix lost code line in bwd_pipeline_default_policy

* Merge kDoDvHeaddim/kPadHeadDimDoDv to kVHeaddim/kPadHeadDimV and remove TileFmhaBwdTraits

* Rename KRegSliceBlockDescriptor/VRegSliceBlockDescriptor to KRegBlockDescriptor/VRegBlockDescriptor

* tiny adjustments

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: danyao12 <Dan.Yao@amd.com>


15 Oct, 2024 2 commits
- [CK_TILE] Add block universal gemm pipeline policy (#1557) · d02a92cc
  Bartłomiej Kocot authored Oct 15, 2024
```
* [CK_TILE] Add block universal gemm pipeline policy

* Fixes

* fixes2

* Fixes3

* fixeS
```
- Apply ROCm 6.2 WA to ROCm 6.3 and later (#1563) · 9868fd02
  Po Yen Chen authored Oct 15, 2024
  
14 Oct, 2024 3 commits

Add custom type vector support (#1333) · 4cf70b36

Rostyslav Geyyer authored Oct 14, 2024



* Add non_native_vector_type

* Add a test

* Add non-native vector type

* Fix CTOR

* Fix non-native vector type of 1

* Fix CTORs

* Use vector_type to cover non-native implementation as well

* Update the test

* Format

* Format

* Fix copyright years

* Remove BoolVecT so far

* Add AsType test cases

* Update assert error message

* Remove redundant type

* Update naming

* Add complex half type with tests

* Add tests for vector reshaping

* Add missing alignas

* Update test/data_type/test_custom_type.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Compare custom types to built-in types

* Add default constructor test

* Add an alignment test

---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>


Add transpose scale amax example (#1547) · f21cda25
Bartłomiej Kocot authored Oct 14, 2024
```
* Add transpose scale amax example

* fixes

* Tune reduce instance
```
decouple the calling from gemm_pipeline (#1571) · 35c1777d
Thomas Ning authored Oct 14, 2024
```
* decouple the calling from gemm_pipeline

* clang format
```

12 Oct, 2024 1 commit
- Implement GetWorkSpaceSize from BaseOperator. (#1564) · 29d384d0
  Adam Osewski authored Oct 12, 2024
  
10 Oct, 2024 1 commit

Ck tile gemm cshuffle & CK Tile GEMM restructure (#1535) · 6f27bc98

Thomas Ning authored Oct 10, 2024



* Make the cshuffle compilable

* modify the reference on gpu and cpu. Correct access of cshuffle

* fix the cpu reference code

* Complete the in tile shuffle logic

* restructure the kernel template input

* change the naming pattern of ck_tile gemm pipeline

* Re-format files using remod.py

* Solve the fmha conflict with gemm

* Comment Addressed from Carlus

---------
Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>


09 Oct, 2024 1 commit
- Fixes small memory leak from missing hipEventDestroy (#1554) · ceaed8e0
  Christopher Millette authored Oct 09, 2024
  
08 Oct, 2024 2 commits

[CK_TILE] Update example README files & fix script compatibility issue (#1548) · 0c094daa

Po Yen Chen authored Oct 08, 2024

* Fix text alignment of ArgParser::print()

* Update example README files

* Clarify make-ck-dev.sh <arch> usage

* Only keep some of the arguments from '-?' output

* Undo command line output changes in README

* Only keep existing arguments in doc and update description

* Fix text alignment

* Make cmake-ck-*.sh compatible with 'sh' command


[CK_TILE] Simplify the codes in splitkv_combine pipeline (#1549) · 74d68e3b

Qianfeng authored Oct 08, 2024



* Simplify the codes in splitkv_combine pipeline

* Always set kPadSeqLenK=true for fmha splitkv kernels

* Change in Oacc Alignment and TileDistribution to be more adaptable to tile sizes

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>


07 Oct, 2024 3 commits

Fix build logic using GPU_ARCHS. (#1536) · 7d8ea5f0

Illia Silin authored Oct 07, 2024

* update build logic with GPU_ARCHS

* fix the GPU_ARCHS build for codegen

* unset GPU_TARGETS when GPU_ARCHS are set


[CK_TILE] Fix conv param multiple definition (#1550) · cc8f466a
Bartłomiej Kocot authored Oct 07, 2024
```
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
```

[Ck tile] Support layernorm one pass (#1512) · 0023f01a

rocking authored Oct 07, 2024



* Fix compile error

* Add one pass pipeline

* Extract creating tile_window to operator()

* clang format

* reduce duplicated code

* do not hardcode

* Support padding in layernorm

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>


04 Oct, 2024 1 commit

Adding seed and offset pointer support to the philox random number generator. (#1523) · c24fae23

kylasa authored Oct 04, 2024



* Adding seed and offset pointer support to the philox random number generator.

* Separating seed and offset pointer checks with different condition statements.

* Changes include adding support for device seed and offset pointers; a union is used to store seed/offset values and device pointers to minimize device SGPR usage.

* Correcting a typo in the readme file

* Re-format files using remod.py

* Use STL type for API parameters

* Use simpler struct design for drop_seed & drop_offset

* Undo unnecessary changes

* Sync kargs style for fmha_fwd.hpp/.cpp

* Use templated union to reduce code

* Use structured binding to make code more readable

---------
Co-authored-by: Sudhir Kylasa <sukylasa@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
