Commits · d032ea56798526f6f3c9a26ec4a89d3f30e2aeae · gaoqiong / composable_kernel_ROCM

30 Jan, 2025 4 commits
- Add docstrings · d032ea56
  Rostyslav Geyyer authored Jan 30, 2025
  
  d032ea56
- Remove unneeded AsType accessors · 97c7e725
  Rostyslav Geyyer authored Jan 30, 2025
  
  97c7e725
- Update pack/unpack methods · bcc12098
  Rostyslav Geyyer authored Jan 30, 2025
  
  bcc12098
- Fix build logic · acf8854e
  Rostyslav Geyyer authored Jan 30, 2025
  
  acf8854e
29 Jan, 2025 2 commits
- Add conversions · b8f4de71
  Rostyslav Geyyer authored Jan 29, 2025
  
  b8f4de71
- Add a flag · c98974ee
  Rostyslav Geyyer authored Jan 29, 2025
  
  c98974ee
27 Jan, 2025 1 commit
- Add size checks in pack function · 2a807013
  Rostyslav Geyyer authored Jan 27, 2025
  
  2a807013
24 Jan, 2025 2 commits
- Fix merge · 7c6a541b
  Rostyslav Geyyer authored Jan 24, 2025
  
  7c6a541b
- Update unpack signature · 86950b3a
  Rostyslav Geyyer authored Jan 24, 2025
  
  86950b3a
22 Jan, 2025 3 commits
- fix typo · 6a747f03
  illsilin authored Jan 22, 2025
  
  6a747f03
- fix typo · 108f2733
  illsilin authored Jan 22, 2025
  
  108f2733
- fix build for multiple archs · 50010cf9
  illsilin authored Jan 21, 2025
  
  50010cf9
16 Jan, 2025 4 commits
- Fix and optimize dynamic unary elementwise (#1818) · 1519ce91
  Bartłomiej Kocot authored Jan 16, 2025
```
* Fix and optimize dynamic unary elementwise

* fix
```
  1519ce91
- Add missing type aliases · 17d1e68b
  Rostyslav Geyyer authored Jan 16, 2025
  
  17d1e68b
- Add vector support · 3a64757f
  Rostyslav Geyyer authored Jan 16, 2025
  
  3a64757f
- [CK_TILE] Fix mock token id, support g1u1/g1u0 through same inline code block (#1808) · 1ff50e78
  carlushuang authored Jan 16, 2025
```
* fix mock token id

* prepare host for g1u1

* reformat inline-asm

* restructure uk_0

* restructure gate_up

* done

* change default to init=1

* update readme

* fix a bug in interleave pipeline

* rcp for silu
```
  1ff50e78
15 Jan, 2025 2 commits

Add rounding for float to bf16 conversion as default (#1812) · 7790e8c3

Bartłomiej Kocot authored Jan 15, 2025

* Add rounding for float to bf16 conversion

* Add bhalf test

* Add inf test bhalf

* Refactor

* update cmake

* Fixes

7790e8c3
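
As an illustration of what "rounding for float to bf16 conversion" usually means, here is a minimal sketch of round-to-nearest-even compared with plain truncation. The function names (`float_to_bf16_rne`, `float_to_bf16_trunc`, `bits_to_float`) are hypothetical and are not taken from the repository; the actual change in #1812 may be implemented differently.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not CK's API): reinterpret a 32-bit pattern as a float.
inline float bits_to_float(std::uint32_t bits)
{
    float x;
    std::memcpy(&x, &bits, sizeof(x));
    return x;
}

// Truncation: keep the upper 16 bits of the float and drop the rest.
inline std::uint16_t float_to_bf16_trunc(float x)
{
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    return static_cast<std::uint16_t>(bits >> 16);
}

// Round-to-nearest-even (assumed rounding mode for this sketch).
inline std::uint16_t float_to_bf16_rne(float x)
{
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));

    // NaN: keep the sign and force a quiet-NaN mantissa bit so that
    // rounding cannot turn the value into infinity.
    if((bits & 0x7fffffffu) > 0x7f800000u)
        return static_cast<std::uint16_t>((bits >> 16) | 0x0040u);

    // Add 0x7fff plus the lowest kept bit: exact ties go to the even value.
    const std::uint32_t lsb = (bits >> 16) & 1u;
    bits += 0x7fffu + lsb;
    return static_cast<std::uint16_t>(bits >> 16);
}
```

Infinity passes through unchanged in this sketch, because adding the rounding bias to `0x7f800000` does not carry into the kept bits.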

[CK_TILE] Add Various Fusion Functions to RMSNorm (#1802) · 04dd3148

ruanjm authored Jan 15, 2025



* Add shortcut to RMSNorm

* Modify test for adding shortcut for RMSNorm

* Add fused parameter into tests

* 1. Add YDataType. 2. Move rmsnorm2d_fwd_traits_ from rmsnorm2d_fwd.hpp to rmsnorm2d_fwd_api.cpp and rmsnorm2d_fwd_instance_common.hpp

* Support various strides and precisions.

* Add support of Epilogue

* Add fuse and epilogue support to rmsnorm ref

* Modify rmsnorm example

* Refactor tests/examples

* Bug fix for newly added tests/examples

* Bug fix for new tests 2

* Modify smoke test scripts

remove dbg code

* Support non-smooth dynamic quant

* Update Rmsnorm2dFwd::GetName()

* rename xscale and prec_sx to smoothscale and prec_sm

Bug fix after rename

Remove files

* change example_rmsnorm2d_fwd.cpp

* update performance calculator

* Fix issue in two-pass when fuse add is enabled

* Remove comment of beta

---------
Co-authored-by: rocking <ChunYu.Lai@amd.com>

04dd3148

13 Jan, 2025 2 commits

CK Tile GEMM CICD fixed & register block method refactor (#1776) · 5d671a5f

Thomas Ning authored Jan 12, 2025

* refactor the block_gemm_areg_breg_creg_v1 and add the v2 policy with 2x2 warp gemm

* Finished the 2x2 warp gemm policy and the block selection mechanism

* Clang format

* address poyen's comment

* Address feedback

* Fixed the compilation issue

* Change the function name

5d671a5f

Update for fmha_fwd qs_ks_vs pipeline (#1810) · 3d50f57f

Qianfeng authored Jan 13, 2025



* Update for fmha_fwd qs_ks_vs pipeline

* Remove __builtin_amdgcn_sched_barrier(0)

* Move the p_compute to p conversion earlier to try to increase VGPR reuse

* Enable GetQKBlockGemm to use WarpGemm-16x16x16 for QLoadOnce==false situation

* Re-add __builtin_amdgcn_sched_barrier(0)

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

3d50f57f

10 Jan, 2025 1 commit

Grouped convolution backward weight special vector size loads (#1772) · fd46a01d

Bartłomiej Kocot authored Jan 10, 2025

* Grouped convolution backward weight special vector size loads

* Instances and tests

* Fixes

* Add 7 and 13 special cases

* fix comments

* Fix

* Fix2

* fixes

* fix atomic add bf16

fd46a01d

08 Jan, 2025 12 commits
- Disable building DPP kernels by default (#1804) · 26b3829c
  darren-amd authored Jan 08, 2025
```
* Disable building DPP kernels by default

* Disable building dpp instances, examples, or tests if DPP_KERNELS is not set

* Add new DPP_KERNELS flag to readme
```
  26b3829c
- mark unused args · ad697c78
  Max Podkorytov authored Jan 07, 2025
  
  ad697c78
- run clang-format -style=file · a2e6ad62
  Max Podkorytov authored Jan 07, 2025
  
  a2e6ad62
- run clang-format==12 · aa59ecaa
  Max Podkorytov authored Dec 19, 2024
  
  aa59ecaa
- update comment in the policy · 82fb3f84
  Max Podkorytov authored Dec 19, 2024
  
  82fb3f84
- update qsksvs comment · 4daa82b4
  Max Podkorytov authored Dec 19, 2024
  
  4daa82b4
- remove dead code · 66c5b715
  Max Podkorytov authored Dec 19, 2024
  
  66c5b715
- clang-format and remove dead code · edb78a47
  Max Podkorytov authored Dec 19, 2024
  
  edb78a47
- roll back splitkv · 60113859
  Max Podkorytov authored Dec 18, 2024
  
  60113859
- update qsksvs pipeline · bfc997a7
  Max Podkorytov authored Dec 18, 2024
  
  bfc997a7
- qsksvs pipeline changes to mirror qrksvs · f7942b99
  Max Podkorytov authored Dec 17, 2024
  
  f7942b99
- enable bias feature that adds bias before adding residual (for rtpllm project) (#1741) · d5c8a334
  AMD-dteng authored Jan 08, 2025
```
* 1. enable bias feature that adds bias before adding residual; 2. change block size from 128->64 when m<64 in fp16

* delete comment

* 1.remove fmha change 2.change buffer name from bias to xbias

* Now bias can be used independently from fadd

* change kbias to kxbias

---------
Co-authored-by: feli <felix.li@amd.com>
```
  d5c8a334
07 Jan, 2025 2 commits

[MX FP8] Add Scaled Type Convert Functions for OCP FP8/BF8 data types (#271) · c4a05057

Andriy Roshchenko authored Jan 07, 2025

* Move scaled_type_convert functions to a separate header

* Introduce MX data tests

* Build MX tests only on relevant architectures

* Refactor E8M0 scale implementation

* Fix `config.h` typo

* Cleanup deprecated symbols

* Refactor `amd_ck_fp8.hpp`

* `scaled_type_convert` for `f8_ocp_t`

* Implement test for MX FP8 scaled type convert

* Implement test for MX BF8 scaled type convert

* Scaled type convert for vectors of 2 FP8 elements

* Scaled type convert for vectors of 16 FP8 elements

* Implementation of scaled conversion from F32 to F8

* Add tests for scaled conversions from FP32 to FP8

* Add documentation to the test functions

* Implementation of scaled conversion from F32x2 to F8x2

* Implementation of scaled conversion from F32x16 to F8x16

* Implementation of scaled conversion from F32x32 to F8x32

* Implementation of scaled conversion from F8x32 to F32x32

* Verified on the emulator

c4a05057
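
For context on the scaled conversions listed above: in the OCP Microscaling (MX) formats, the shared scale is an E8M0 value, an 8-bit exponent with bias 127 and no sign or mantissa, where `0xFF` encodes NaN. A minimal sketch of decoding the scale and applying it to an already-decoded element follows. The names `e8m0_to_float` and `apply_mx_scale` are hypothetical and are not the repository's `scaled_type_convert` API.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper (not CK's API): decode an E8M0 scale to float.
// The encoded value is 2^(e - 127); 0xFF is reserved for NaN.
inline float e8m0_to_float(std::uint8_t e)
{
    if(e == 0xFF)
        return std::numeric_limits<float>::quiet_NaN();
    return std::ldexp(1.0f, static_cast<int>(e) - 127);
}

// Hypothetical helper: apply a shared block scale to one element that has
// already been converted from FP8/BF8 to float.
inline float apply_mx_scale(std::uint8_t scale, float element)
{
    return e8m0_to_float(scale) * element;
}
```

Because the scale is a pure power of two, multiplying by it only shifts the element's exponent, so no mantissa precision is lost unless the result overflows or underflows.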

[CK_TILE] fmha fwd splitkv optimization for decode (seqlen_q=1) (#1789) · 24b12d04

Po Yen Chen authored Jan 07, 2025



* Update license year

* Add initial code to override decode problem

* Fix splitkv traits/args overriding error

* Reshape and transpose lse for decode

* Remove debug code

* Prettify example code

* Use better function name

* Add kMergeNumHeadGroupsSeqLenQ flag

Kernel user can use this switch to turn on/off optimization for
some problem sizes

* Add missing flag declarations

* Turn off kMergeNumHeadGroupsSeqLenQ by default in codegen

* Group similar statements together

* Remove assumption of seqlen_q=1

* Remove kMergeNumHeadGroupsSeqLenQ from splitkv combine kernel

* Support kMergeNumHeadGroupsSeqLenQ=true in fmha splitkv kernel

* Run kMergeNumHeadGroupsSeqLenQ=true kernels when needed

* Fix group mode block skip logic

* Undo changes of normal fwd kernel

* Update GridSize() and use GridSize() for splitkv kernel (#1799)

---------
Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>

24b12d04

06 Jan, 2025 1 commit

Add MXFP6 and MXBF6 conversion methods (#270) · e093146e

Rostyslav Geyyer authored Jan 06, 2025

* Add conversions

* Add tests

* Add docstrings

* Add scaled conversions

* Add fp6/bf6 tests

* Remove misleading fp4 test case

* Add docstrings

* Clean up

* Address comments

* Set stricter tolerances for RNE tests

* Add missing tests

* Add native conversions to float

* Revert "Add native conversions to float"

This reverts commit 09467111f73b753c8cc3d597533b187940353dab.

* Update copyright years

e093146e

04 Jan, 2025 2 commits
- Fix universal gemm profiler for pk_i4_t (#1790) · 888317e6
  Bartłomiej Kocot authored Jan 04, 2025
```
* Fix universal gemm profiler for pk_i4_t

* fix
```
  888317e6
- terminology clean-up (#1792) · 8ea375bb
  Illia Silin authored Jan 03, 2025
  
  8ea375bb
03 Jan, 2025 2 commits

[CK_TILE] naive attn support FP8 KVCache quant (#1747) · 6df5fe2a

carlushuang authored Jan 03, 2025



* quant

* fix bug

* simple smoothquant after softmax

* update kv-quant

* update stride

* fix fp8-pertoken-kvcache

* update int8/fp8 quant support

---------

Co-authored-by: so <a.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

6df5fe2a

Implement the fp16xint4 scale weight only kernel for Ali (#1786) · 4f62f6e9

Mingtao Gu authored Jan 03, 2025



* enable int4 scale (weight only) kernel

* format some files

* Add unit test for int4 weight only

* fixed and formatted code

* fixed

* formatted

* formatted

* fixed

* fixed a bug in the ckProfiler, and formatted the code

---------
Co-authored-by: mtgu0705 <mtgu@amd.com>

4f62f6e9