Commits · 85d6fcd30ab0a615cfc9e107dda67fe7bb8201f3 · gaoqiong / composable_kernel_ROCM

04 Feb, 2025 3 commits
- Add Grouped Convolution and GEMM documentation (#1719) · 85d6fcd3
  Bartłomiej Kocot authored Feb 04, 2025
```
* Add Grouped Convolution docs

* Add gemm docs

* Update docs

* fix
```
- Fix duplication of pk_add_f16 symbols (#1858) · 11e4082d
  Bartłomiej Kocot authored Feb 04, 2025
  
- Fix pk_int4 cast and add pk_int4 dtype in ck tile (#1854) · 9ee69dd2
  Bartłomiej Kocot authored Feb 04, 2025
```
* Fix pk_int4 cast and add pk_int4 dtype in ck tile

* fixes

* Improvements

* fix typo
```
31 Jan, 2025 1 commit

Codegen hipRTC compilation (#1579) · 2e3183af

arai713 authored Jan 31, 2025



* updating codegen build for MIOpen access: adding .cmake for codegen component

* updating CMake

* adding in header guards for some headers due to issues with hiprtc compilation in MIOpen

* some more header guards

* putting env file in header guard

* cleaning up some includes

* updated types file for hiprtc purposes

* fixed types file: bit-wise/memcpy issue

* updating multiple utility files to deal with standard header inclusion for hiprtc

* added some more header guards in the utility files, replacing some standard header functionality

* added some more header guards

* fixing some conflicts in utility files, another round of header guards

* fixing errors in data type file

* resolved conflict errors in a few utility files

* added header guards/replicated functionality in device files

* resolved issues with standard headers in device files: device_base and device_grouped_conv_fwd_multiple_abd

* resolved issues with standard headers in device files: device_base.hpp, device_grouped_conv_fwd_multiple_abd.hpp, device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp

* added header guards for gridwise gemm files: gridwise_gemm_multiple_abd_xdl_cshuffle.hpp and gridwise_gemm_multiple_d_xdl_cshuffle.hpp

* fixed issue with numerics header, removed from transform_conv_fwd_to_gemm and added to device_column_to_image_impl, device_grouped_conv_fwd_multiple_abd_xdl_cshuffle, device_grouped_conv_fwd_multiple_abd_xdl_cshuffle_v3, device_image_to_column_impl

* replaced standard header usage and added header guards in block to ctile map and gridwise_gemm_pipeline_selector

* resolved errors in device_gemm_xdl_splitk_c_shuffle files in regards to replacement of standard headers in previous commit

* added replicated functionality for standard header methods in utility files

* replaced standard header functionality in threadwise tensor slice transfer files and added header guards in element_wise_operation.hpp

* temp fix for namespace error in MIOpen

* remove standard header usage in codegen device op

* removed standard header usage in elementwise files, resolved namespace errors

* formatting fix

* changed codegen argument to ON for testing

* temporarily removing codegen compiler flag for testing purposes

* added codegen flag again, set default to ON

* set codegen flag default back to OFF

* replaced enable_if_t standard header usage in data_type.hpp

* added some debug prints to pinpoint issues in MIOpen

* added print outs to debug in MIOpen

* removed debug print outs from device op

* resolved stdexcept include error

* formatting fix

* adding includes to new fp8 file to resolve ck::enable_if_t errors

* made changes to amd_wave_read_first_lane

* updated functionality in type utility file

* fixed end of file issue

* resolved errors in type utility file, added functionality to array utility file

* fixed standard header usage replication in data_type file, resolves error with failing examples on navi3x

* formatting fix

* replaced standard header usage in amd_ck_fp8 file

* added include to random_gen file

* removed and replicated standard header usage from data_type and type_convert files for fp8 changes

* replicated standard unsigned integer types in random_gen

* resolved comments from review: put calls to reinterpret_cast for size_t in header guards

* updated/added copyright headers

* removed duplicate header

* fixed typo in header guard

* updated copyright headers

---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
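
Most of the bullets in this commit describe one recurring pattern: standard headers are wrapped in guards so that a runtime compiler without access to them can still build the code, and the small pieces that were needed from those headers are replicated locally. The following is only a minimal sketch of that pattern. The macro `EXAMPLE_CODE_GEN_RTC` and the namespace `example` are hypothetical names chosen for illustration; the actual guard macro and file layout are not shown in this log.

```cpp
// Hypothetical guard macro: assumed to be defined only when compiling through
// a runtime compiler that cannot include standard library headers.
#ifndef EXAMPLE_CODE_GEN_RTC
#include <type_traits>
#endif

namespace example {

#ifdef EXAMPLE_CODE_GEN_RTC
// Replicated functionality: a local enable_if for builds without <type_traits>.
template <bool B, typename T = void>
struct enable_if
{
};

template <typename T>
struct enable_if<true, T>
{
    using type = T;
};

template <bool B, typename T = void>
using enable_if_t = typename enable_if<B, T>::type;
#else
// Regular builds simply forward to the standard library.
template <bool B, typename T = void>
using enable_if_t = std::enable_if_t<B, T>;
#endif

// Callers use example::enable_if_t and never name std:: directly,
// so the same source compiles in both build modes.
template <typename T, enable_if_t<(sizeof(T) <= 4), int> = 0>
constexpr int bytes_if_small(T)
{
    return static_cast<int>(sizeof(T));
}

} // namespace example
```

Routing every use through one project-local alias keeps the guard in a single place instead of scattering `#ifdef` blocks across call sites.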


30 Jan, 2025 2 commits

[CK Tile] Spatially local GEMM tile partitioner. (#1843) · ce448002

Adam Osewski authored Jan 31, 2025

* Add spatially local tile partitioner

* Use 1D Grid size & create partitioner object.

* Docs & use 1D partitioner in example.

* Clang format.

* Change kernel grid size

Now: X is the # of output C-tiles,
     Y is the batch count
     Z is the splitK

* Formatting & more doc.

* Clang format.

* Fix batched gemm test. Use 1d partitioner.

* Move condition.

* Fix ctor.

* clang-format.
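
The grid layout described in this commit (X is the number of output C-tiles, Y is the batch count, Z is the splitK factor) can be sketched as below. This is an illustration under assumptions, not the repository's code: `GridDim`, `ceil_div`, `make_grid`, and all parameter names are hypothetical, and the C-tile count is assumed to be the product of the tile counts along M and N.

```cpp
#include <cstdint>

// Hypothetical stand-in for a 3D launch grid.
struct GridDim
{
    std::uint32_t x; // number of output C-tiles (flattened to 1D)
    std::uint32_t y; // batch count
    std::uint32_t z; // splitK factor
};

// Integer division rounding up, so partial tiles still get a workgroup.
constexpr std::uint32_t ceil_div(std::uint32_t a, std::uint32_t b)
{
    return (a + b - 1) / b;
}

// Assumption: C is M x N and is covered by tiles of m_per_block x n_per_block.
constexpr GridDim make_grid(std::uint32_t m,
                            std::uint32_t n,
                            std::uint32_t m_per_block,
                            std::uint32_t n_per_block,
                            std::uint32_t batch_count,
                            std::uint32_t split_k)
{
    return GridDim{ceil_div(m, m_per_block) * ceil_div(n, n_per_block), batch_count, split_k};
}
```

With a 1D tile index in X, the mapping from that index back to a 2D tile coordinate is left to the partitioner, which is what makes a spatially local ordering possible.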


[CK TILE] Implement cshuffle algorithm (#1842) · 25e2e0f0

Bartłomiej Kocot authored Jan 30, 2025

* [CK TILE] Implement cshuffle algorithm

* Rebase

* Vector store size fixes

* fixes

* Fixes

* fixes

* fmha fix

* fixes

* fixes of fixes


29 Jan, 2025 1 commit

add batched_transpose implementation (#1660) · c5fff071

fangche123 authored Jan 29, 2025



* add batched_transpose implementation

---------
Co-authored-by: root <root@ctr-ubbsmc16.amd.com>
Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>


28 Jan, 2025 1 commit

Change flag to CK_GFX90A_DENORM_WORKAROUND (#1817) · d6a4605e

darren-amd authored Jan 28, 2025

* Change flag from CK_WORKAROUND_DENORM_FIX to CK_GFX90A_DENORM_WORKAROUND for more clarity. Also changed the definition macros to be more clear.


27 Jan, 2025 2 commits

Add OCP FP8 support in CK_TILE (#1829) · 35aebe59
Andriy Roshchenko authored Jan 27, 2025
```
* Add OCP FP8 to CK_TILE

* Validate OCP FP8 in FMHA FWD under VALID=1
```

[CK-Tile] Enable vectorized reads on all layouts & improve perf. (#1835) · 39dc25a9

Adam Osewski authored Jan 27, 2025



* Refactor universal gemm policy.

* Adapt example to refactor changes.

* Introduce static encoding pattern

* Adding shuffled encoding patterns.

* Fix err in reverse tuple.

* Add transpose_tile2d

* Small refactoring + doc

* Enable reading on contiguous dimension in all layouts.

* Transpose A/B register tile if needed for comp v3 pipeline.

* Take contiguous dim size when calculating dram vector load size.

* A/B smem pack size taken from WarpGemm attributes

* Update B LDS layout and setup tile distribution pattern at class level.

* Fix static assert.

* Fix errors in examples.

* Formatting & fix IsTranspose

* Fix VectorSize & refactor.

* Add error logging messages.

* Fix VecLoadSize and TransposeC for mem pipeline.

* Update unit-tests & disable mem pipeline.

* Clang format

* Update include/ck_tile/core/tensor/tile_window.hpp
Co-authored-by: jakpiase <jakub.piasecki@amd.com>

* Fix compilation and reviewers comments.

* Refactor unit-test. Fallback to non-universal gemm.

Need to use GemmPipelineAGmemBGmemCRegV1 for now,
since GemmKernel now also supports non-K-major vector reads.

---------
Co-authored-by: jakpiase <jakub.piasecki@amd.com>


24 Jan, 2025 2 commits

Implement fp8 quant for layernorm and rmsnorm (#1814) · 64d5c4d6
ruanjm authored Jan 24, 2025


[CK_TILE] not using structures under ck_tile/ops for ck_tile/host (#1834) · 5b9b083d

carlushuang authored Jan 24, 2025



* not using structures under ck_tile/ops for ck_tile/host

* update as constexpr function

* Rename fn

* Update other examples.

---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>


22 Jan, 2025 1 commit
- add fp8 as dst (#1830) · 052a7265
  carlushuang authored Jan 22, 2025
  
21 Jan, 2025 2 commits

Simplify static_cast if-lands (#1828) · 3db77bc4
Mateusz Ozga authored Jan 21, 2025


CK-Tile Grouped GEMM refactor and post PR fixes (#1756) · 3c93d3c4

Mateusz Ozga authored Jan 21, 2025

* Grouped gemm simple code refactor

* Offset invoker

* Invoke generic Run, and replace name of partitioner variable

* Tests fix type

* Removed namespaces

* Add template param to avoid implicit cast

* Remove generic function

* Constant value

* underlying enum type to int16_t

* Generalize partitioner function

* Remove whitespaces

* Rename function

* Using support

* Clang-format

* Clang-format

* Fn-partitioner description fn

* Typo

* Typo 2

* Better description

* Better description

* Refactor after review

* Use ctr instead of set fn

* Invoke ctr and typo

* Comments

* Remove unnecessary comment

* Review, remove modulo


20 Jan, 2025 1 commit

Add CK_TIME_KERNEL as toggleable CMake Variable (#1794) · 3fb2f5ac

lucbruni-amd authored Jan 20, 2025

* Disable CK_TIME_KERNEL by Default, Add as CMake Variable

* Enable CK_TIME_KERNEL by Default, Maintaining CMake Variable Functionality.

* Fix build error.
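
As a rough illustration of how a build-time toggle like this is typically consumed in C++, the sketch below defaults the macro to enabled (matching the "Enable CK_TIME_KERNEL by Default" bullet) and only measures elapsed time when it is non-zero. The helper `run_maybe_timed` is hypothetical, and how the repository actually uses the macro is an assumption not shown in this log.

```cpp
#include <chrono>

// Assumption: the build system may pass -DCK_TIME_KERNEL=0 or =1 on the
// compiler command line; when it passes nothing, timing stays enabled.
#ifndef CK_TIME_KERNEL
#define CK_TIME_KERNEL 1
#endif

// Hypothetical helper: runs `work` once and returns the elapsed time in
// milliseconds when timing is enabled, or 0.0 when it is compiled out.
template <typename F>
double run_maybe_timed(F&& work)
{
#if CK_TIME_KERNEL
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
#else
    work();
    return 0.0;
#endif
}
```

Guarding the definition with `#ifndef` is what lets the command-line value take precedence over the in-source default.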


19 Jan, 2025 1 commit
- fix a bug for int4 scale weight only kernel (#1820) · 86d1b46a
  Mingtao Gu authored Jan 19, 2025
```
Co-authored-by: mtgu0705 <mtgu@amd.com>
```
18 Jan, 2025 1 commit
- [CK_TILE] Add error threshold calculation for gemm examples (#1821) · bdddf1ea
  Bartłomiej Kocot authored Jan 18, 2025
  
16 Jan, 2025 2 commits

Fix and optimize dynamic unary elementwise (#1818) · 1519ce91
Bartłomiej Kocot authored Jan 16, 2025
```
* Fix and optimize dynamic unary elementwise

* fix
```

[CK_TILE] Fix mock token id, support g1u1/g1u0 through same inline code block (#1808) · 1ff50e78

carlushuang authored Jan 16, 2025

* fix mock token id

* prepare host for g1u1

* reformat inline-asm

* restructure uk_0

* restructure gate_up

* done

* change default to init=1

* update readme

* fix a bug in interleave pipeline

* rcp for silu


15 Jan, 2025 2 commits

Add rounding for float to bf16 conversion as default (#1812) · 7790e8c3

Bartłomiej Kocot authored Jan 15, 2025

* Add rounding for float to bf16 conversion

* Add bhalf test

* Add inf test bhalf

* Refactor

* update cmake

* Fixes
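
To show why rounding matters for this conversion, the sketch below contrasts plain truncation with round-to-nearest-even when narrowing a 32-bit float to bf16 (the upper 16 bits of the float). The commit does not state which rounding mode it uses, so round-to-nearest-even is an assumption here, and all function names are hypothetical.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Helper for building test inputs from raw bit patterns.
inline float bits_to_float(std::uint32_t bits)
{
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Truncation: keep the upper 16 bits and discard the rest.
inline std::uint16_t float_to_bf16_truncate(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return static_cast<std::uint16_t>(bits >> 16);
}

// Round-to-nearest-even (assumed mode): add a bias of 0x7FFF plus the lowest
// kept bit, so exact ties round toward the value whose last kept bit is 0.
inline std::uint16_t float_to_bf16_rne(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    if(std::isnan(f))
    {
        // Keep NaN a NaN by forcing a mantissa bit that survives narrowing.
        return static_cast<std::uint16_t>((bits >> 16) | 0x0040u);
    }
    const std::uint32_t lsb = (bits >> 16) & 1u;
    bits += 0x7FFFu + lsb;
    return static_cast<std::uint16_t>(bits >> 16);
}
```

For an input such as the bit pattern `0x3F80C000`, truncation yields `0x3F80` while the rounded conversion yields `0x3F81`; infinities pass through unchanged because their lower 16 bits are zero.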


[CK_TILE] Add Various Fusion Functions to RMSNorm (#1802) · 04dd3148

ruanjm authored Jan 15, 2025



* Add shortcut to RMSNorm

* Modify test for adding shortcut for RMSNorm

* Add fused parameter into tests

* 1. Add YDataType. 2. rmsnorm2d_fwd_traits_ from rmsnorm2d_fwd.hpp to rmsnorm2d_fwd_api.cpp and rmsnorm2d_fwd_instance_common.hpp

* 1. Supports various strides and precisions.

* Add support of Epilogue

* Add fuse and epilogue support to rmsnorm ref

* Modify rmsnorm example

* Refactor tests/examples

* Bug fix for newly added tests/examples

* Bug fix for new tests 2

* Modify smoke test scripts

remove dbg code

* Supports non-smooth dynamic quant

* Update Rmsnorm2dFwd::GetName()

* rename xscale and prec_sx to smoothscale and prec_sm

Bug fix after rename

Remove files

* change example_rmsnorm2d_fwd.cpp

* update performance calculator

* Fix issue in two-pass when fuse add is enabled

* Remove comment of beta

---------
Co-authored-by: rocking <ChunYu.Lai@amd.com>


13 Jan, 2025 2 commits

CK Tile GEMM CICD fixed & register block method refactor (#1776) · 5d671a5f

Thomas Ning authored Jan 12, 2025

* refactor the block_gemm_areg_breg_creg_v1 and add the v2 policy with 2x2 warp gemm

* Finished the 2x2 warp gemm policy and the block selection mechanism

* Clang format

* address poyen's comment

* Address feedback

* Fixed the compilation issue

* Change the function name


Update for fmha_fwd qs_ks_vs pipeline (#1810) · 3d50f57f

Qianfeng authored Jan 13, 2025



* Update for fmha_fwd qs_ks_vs pipeline

* Remove __builtin_amdgcn_sched_barrier(0)

* Move the p_compute to p conversion earlier to try to increase VGPR reuse

* Enable GetQKBlockGemm to use WarpGemm-16x16x16 for QLoadOnce==false situation

* Re-add __builtin_amdgcn_sched_barrier(0)

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>


10 Jan, 2025 1 commit

Grouped convolution backward weight special vector size loads (#1772) · fd46a01d

Bartłomiej Kocot authored Jan 10, 2025

* Grouped convolution backward weight special vector size loads

* Instances and tests

* Fixes

* Add 7 and 13 special cases

* fix comments

* Fix

* Fix2

* fixes

* fix atomic add bf16


08 Jan, 2025 12 commits
- Disable building DPP kernels by default (#1804) · 26b3829c
  darren-amd authored Jan 08, 2025
```
* Disable building DPP kernels by default

* Disable building dpp instances, examples, or tests if DPP_KERNELS is not set

* Add new DPP_KERNELS flag to readme
```
- mark unused args · ad697c78
  Max Podkorytov authored Jan 07, 2025
  
- run clang-format -style=file · a2e6ad62
  Max Podkorytov authored Jan 07, 2025
  
- run clang-format==12 · aa59ecaa
  Max Podkorytov authored Dec 19, 2024
  
- update comment in the policy · 82fb3f84
  Max Podkorytov authored Dec 19, 2024
  
- update qsksvs comment · 4daa82b4
  Max Podkorytov authored Dec 19, 2024
  
- remove dead code · 66c5b715
  Max Podkorytov authored Dec 19, 2024
  
- clang-format and remove dead code · edb78a47
  Max Podkorytov authored Dec 19, 2024
  
- roll back splitkv · 60113859
  Max Podkorytov authored Dec 18, 2024
  
- update qsksvs pipeline · bfc997a7
  Max Podkorytov authored Dec 18, 2024
  
- qsksvs pipeline changes to mirror qrksvs · f7942b99
  Max Podkorytov authored Dec 17, 2024
  
- enable bias feature that adds bias before adding residual (for rtpllm project) (#1741) · d5c8a334
  AMD-dteng authored Jan 08, 2025
```
* 1. enable bias feature that adds bias before adding residual; 2. change block size from 128->64 when m<64 in fp16

* delete comment

* 1.remove fmha change 2.change buffer name from bias to xbias

* Now bias can be used independently from fadd

* change kbias to kxbias

---------
Co-authored-by: feli <felix.li@amd.com>
```
07 Jan, 2025 1 commit

[CK_TILE] fmha fwd splitkv optimization for decode (seqlen_q=1) (#1789) · 24b12d04

Po Yen Chen authored Jan 07, 2025



* Update license year

* Add initial code to override decode problem

* Fix splitkv traits/args overriding error

* Reshape and transpose lse for decode

* Remove debug code

* Prettify example code

* Use better function name

* Add kMergeNumHeadGroupsSeqLenQ flag

Kernel users can use this switch to turn the optimization on or off for
some problem sizes

* Add missing flag declarations

* Turn off kMergeNumHeadGroupsSeqLenQ by default in codegen

* Group similar statements together

* Remove assumption of seqlen_q=1

* Remove kMergeNumHeadGroupsSeqLenQ from splitkv combine kernel

* Support kMergeNumHeadGroupsSeqLenQ=true in fmha splitkv kernel

* Run kMergeNumHeadGroupsSeqLenQ=true kernels when needed

* Fix group mode block skip logic

* Undo changes of normal fwd kernel

* Update in GridSize() and using GridSize() for splitkv kernel (#1799)

---------
Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>


04 Jan, 2025 2 commits
- Fix universal gemm profiler for pk_i4_t (#1790) · 888317e6
  Bartłomiej Kocot authored Jan 04, 2025
```
* Fix universal gemm profiler for pk_i4_t

* fix
```
- terminology clean-up (#1792) · 8ea375bb
  Illia Silin authored Jan 03, 2025
  