Commits · lynm_bwd_dteng · gaoqiong / composable_kernel_ROCM

06 Feb, 2025 1 commit
- update dweight cal · 3289e656
  AMD-dteng authored Feb 06, 2025
  
24 Jan, 2025 1 commit
- optimize for dgrad · b0b399d9
  AMD-dteng authored Jan 24, 2025
  
22 Jan, 2025 1 commit
- temp commit · 30e15644
  AMD-dteng authored Jan 22, 2025
  
14 Jan, 2025 1 commit
- local base version · 677a842e
  AMD-dteng authored Jan 14, 2025
  
13 Jan, 2025 3 commits

CK Tile GEMM CICD fixed & register block method refactor (#1776) · 5d671a5f

Thomas Ning authored Jan 12, 2025

* refactor the block_gemm_areg_breg_creg_v1 and add the v2 policy with 2x2 warp gemm

* Finished the 2x2 warp gemm policy and the block selection mechanism

* Clang format

* address poyen's comment

* Address feedback

* Fixed the compilation issue

* Change the function name


[CK_TILE] Adjust kBlockSize of reduce example for better perf (#1779) · 0b8f117f
ClementLinCF authored Jan 13, 2025
```
* Observed a 2x perf improvement with kBlockSize = 256
* Using 512 threads may lead to redundant computations
```

Update for fmha_fwd qs_ks_vs pipeline (#1810) · 3d50f57f

Qianfeng authored Jan 13, 2025



* Update for fmha_fwd qs_ks_vs pipeline

* Remove __builtin_amdgcn_sched_barrier(0)

* Move the p_compute-to-p conversion earlier to try to increase VGPR reuse

* Enable GetQKBlockGemm to use WarpGemm-16x16x16 for QLoadOnce==false situation

* Re-add __builtin_amdgcn_sched_barrier(0)

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>


10 Jan, 2025 2 commits

Grouped convolution backward weight special vector size loads (#1772) · fd46a01d

Bartłomiej Kocot authored Jan 10, 2025

* Grouped convolution backward weight special vector size loads

* Instances and tests

* Fixes

* Add 7 and 13 special cases

* fix comments

* Fix

* Fix2

* fixes

* fix atomic add bf16


Ck tile/gemm perf measure (#1750) · 73a076ee

Thomas Ning authored Jan 09, 2025



* Finished adding the performance benchmark for ck tile gemm

* Fix the executable rename problem

* fix the executable name error

* delete the unsupported layout combinations

* Update run_full_test.sh

* Update benchmark_mem_pipeline.sh

* Update benchmark_basic.sh

* change the executable of gemm_universal

* change ck_tile_gemm script permissions

* Addressed the comment

* Addressed the comment

* Fixed the comments

* Fixed Comment

* roll back the malfunctioned change

* Fix the Typo

* finalize the tile_gemm_fp16 performance monitoring

* fix the stash names for ck_tile gemm logs

* change the stashing logic

* change stashing syntax

---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>


08 Jan, 2025 12 commits
- Disable building DPP kernels by default (#1804) · 26b3829c
  darren-amd authored Jan 08, 2025
```
* Disable building DPP kernels by default

* Disable building dpp instances, examples, or tests if DPP_KERNELS is not set

* Add new DPP_KERNELS flag to readme
```
- mark unused args · ad697c78
  Max Podkorytov authored Jan 07, 2025
  
- run clang-format -style=file · a2e6ad62
  Max Podkorytov authored Jan 07, 2025
  
- run clang-format==12 · aa59ecaa
  Max Podkorytov authored Dec 19, 2024
  
- update comment in the policy · 82fb3f84
  Max Podkorytov authored Dec 19, 2024
  
- update qsksvs comment · 4daa82b4
  Max Podkorytov authored Dec 19, 2024
  
- remove dead code · 66c5b715
  Max Podkorytov authored Dec 19, 2024
  
- clang-format and remove dead code · edb78a47
  Max Podkorytov authored Dec 19, 2024
  
- roll back splitkv · 60113859
  Max Podkorytov authored Dec 18, 2024
  
- update qsksvs pipeline · bfc997a7
  Max Podkorytov authored Dec 18, 2024
  
- qsksvs pipeline changes to mirror qrksvs · f7942b99
  Max Podkorytov authored Dec 17, 2024
  
- enable bias feature that adds bias before adding residual (for rtpllm project) (#1741) · d5c8a334
  AMD-dteng authored Jan 08, 2025
```
* 1. enable bias feature that adds bias before adding residual; 2. change block size from 128->64 when m<64 in fp16

* delete comment

* 1.remove fmha change 2.change buffer name from bias to xbias

* Now bias can be used independently from fadd

* change kbias to kxbias

---------
Co-authored-by: feli <felix.li@amd.com>
```
07 Jan, 2025 3 commits

Update LICENSE to 2025 (#1797) · a6b761c3
spolifroni-amd authored Jan 07, 2025


Bump rocm-docs-core from 1.12.1 to 1.13.0 in /docs/sphinx (#1798) · 9f6bf9ab

dependabot[bot] authored Jan 07, 2025

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.12.1 to 1.13.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.12.1...v1.13.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>


[CK_TILE] fmha fwd splitkv optimization for decode (seqlen_q=1) (#1789) · 24b12d04

Po Yen Chen authored Jan 07, 2025



* Update license year

* Add initial code to override decode problem

* Fix splitkv traits/args overriding error

* Reshape and transpose lse for decode

* Remove debug code

* Prettify example code

* Use better function name

* Add kMergeNumHeadGroupsSeqLenQ flag

Kernel users can use this switch to turn the optimization on or off for some problem sizes

* Add missing flag declarations

* Default turn off kMergeNumHeadGroupsSeqLenQ in codegen

* Group similar statements together

* Remove assumption of seqlen_q=1

* Remove kMergeNumHeadGroupsSeqLenQ from splitkv combine kernel

* Support kMergeNumHeadGroupsSeqLenQ=true in fmha splitkv kernel

* Run kMergeNumHeadGroupsSeqLenQ=true kernels when needed

* Fix group mode block skip logic

* Undo changes of normal fwd kernel

* Update in GridSize() and using GridSize() for splitkv kernel (#1799)

---------
Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>


04 Jan, 2025 3 commits

Fix universal gemm profiler for pk_i4_t (#1790) · 888317e6
Bartłomiej Kocot authored Jan 04, 2025
```
* Fix universal gemm profiler for pk_i4_t

* fix
```

Bump rocm-docs-core from 1.12.0 to 1.12.1 in /docs/sphinx (#1788) · 37b35146

dependabot[bot] authored Jan 03, 2025

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.12.0 to 1.12.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.12.0...v1.12.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>


terminology clean-up (#1792) · 8ea375bb
Illia Silin authored Jan 03, 2025


03 Jan, 2025 4 commits

[CK_TILE]naive attn support FP8 KVCache quant (#1747) · 6df5fe2a

carlushuang authored Jan 03, 2025



* quant

* fix bug

* simple smoothquant after softmax

* update kv-quant

* update stride

* fix fp8-pertoken-kvcache

* update int8/fp8 quant support

---------

Co-authored-by: so <a.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>


Implement the fp16xint4 scale weight only kernel for Ali (#1786) · 4f62f6e9

Mingtao Gu authored Jan 03, 2025



* enable int4 scale (weight only) kernel

* format some files

* Add unit test for int4 weight only

* fixed and formatted code

* fixed

* formatted

* formatted

* fixed

* fixed a bug in the ckProfiler, and formatted the code

---------
Co-authored-by: mtgu0705 <mtgu@amd.com>


Ck tile/layernorm: implement naive reduce, opt performance (#1784) · 4bc61041

feli authored Jan 03, 2025



* add no welford

* enable output raw

* raw of int8

* fix build

* fix smoke test err

* [ck_tile]layernorm: fix welford ok, set int8 and bf16 small N as default and others open by generate

* [cktile]layernorm, fix err commit files and remove useless

* fix quant 8192 err & change norm_reduce class and file name

---------
Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>


Add afagaj to CODEOWNERS (#1787) · 17e8efb5
John Afaganis authored Jan 02, 2025


02 Jan, 2025 2 commits

BF16 GEMM Stream-K (#1541) · 9e95d54c

Muhammed Emin Ozturk authored Jan 02, 2025



* initial

* Cmake file

* successful compilation but validation failed

* Cmake

* update

* gpu validation

* gemm universal

* gemm universal sk update

* sk bf16 universal instance

* gemm_universal_streamk.hpp

* only build for gfx94

* Cmakelist

* profiler update, bf16 sk only works at gfx42

* clang

* clang

* clang all

* no need flags

* cmake script

* delete comment

* gemm universal sk fix

* clang

* profiler fix

* clang

* update

* update

* delete comment

* code formatting

* cmake

* fix instance

* clang

* argument supported

* argument supported and clang

* update

* fix

* removing unnecessary comments

* clang formatting

* Update library/src/tensor_operation_instance/gpu/CMakeLists.txt
Co-authored-by: afagaj <john.afaganis@gmail.com>

* CopyRight Comment 2025

* clang reformatting

* copy right 2025

---------
Co-authored-by: Emin Ozturk <ozturk.27@osu.edu>
Co-authored-by: root <root@ctr-ubbsmc16.amd.com>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-008.hpcfund>
Co-authored-by: root <root@splinter-126-wr-d3.amd.com>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t006-001.hpcfund>
Co-authored-by: Muhammed Emin Ozturk <meozturk@login1.hpcfund>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-004.hpcfund>
Co-authored-by: Emin Ozturk <emin.ozturk@utah.edu>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t008-001.hpcfund>
Co-authored-by: afagaj <john.afaganis@gmail.com>


Jing's contribution: prototype of mixed precision gemm FP16/BF16xint4 GEMM (#1762) · 1d8e4ec2

Adam Osewski authored Jan 02, 2025



* add a prototype of int4

* clean

* debug

* clean

* clean

* move packed into dynamic_buffer

* fixed coord reset

* add fast pki4 to half conversion

* fix

* fixed reference and host_tensor

* fixed tensor init

* format

* debug i4_to_f16_convert

* format

* fixed splitk

* weight permute

* add b tile permute

* clean

* weight permute with splitki

* format

* improve weight layout

* add and_or_b32

* fixed splitk crash

* add permute switch as a template

* recover v3r1

* clean

* failure with intrawave v2

* fixed

* fixed

* add ckProfiler

* add bfp16 support

* add bf16 example

* fixed int4 to bhalf_t conversion

* format

* fixed int4 to bf16 conversion

* clean

* add instances for mem

* clean

* fixed host tensor size

* fixed

* debug

* fixed

* add pk_i4_t as a struct

* fix

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* revert

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* fixed comments

* revert

* clean

* revert

* revert

* fixed

* Update CMakeLists.txt

* Update script/cmake-ck-dev.sh
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update include/ck/tensor_operation/gpu/element/unary_element_wise_operation.hpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update CMakeLists.txt
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* fixed

* fixed

* fixed

* revert

* revert

* add comments

* format

* fixed assert

* fixed

* Fix I4 define in ckProfiler

* Fixed example_gemm_xdl_bf16_pk_i4_v3 test failure

---------
Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>


01 Jan, 2025 1 commit
- Add NGCHW bf16 grouped conv fwd instances (#1783) · 159fa319
  Bartłomiej Kocot authored Jan 01, 2025
```
* Add NGCHW bf16 grouped conv fwd instances

* add missed cmake
```
29 Dec, 2024 1 commit

Remove using partitioner for all fmha kernels (#1778) · 4e076909

Qianfeng authored Dec 29, 2024

* Remove using tile partitioner for fmha_fwd_kernel

* Remove using tile partitioner for fmha_fwd_splitkv and splitkv-combine kernels

* Remove using tile partitioner for fmha_fwd_appendkv kernel

* Unify the format of GetTileIndex


28 Dec, 2024 1 commit

[CK TILE] GEMM and Batched GEMM SplitK support (#1724) · af664948

Bartłomiej Kocot authored Dec 28, 2024

* [CK TILE] Add split K support in GEMM

* Updates

* Fixes

* rebase

* fix

* Fix

* fixes

* support for batched gemm


25 Dec, 2024 1 commit
- Correct the dtype checking logic (#1775) · 4c2eff02
  Po Yen Chen authored Dec 25, 2024
  
23 Dec, 2024 1 commit
- [CK_TILE] optimize moe-sorting kernel (#1771) · 3d15f364
  carlushuang authored Dec 23, 2024
```
* opt moe sorting

* remove commented code
```
20 Dec, 2024 2 commits
- fix typo for CK_USE_OCP_FP8 (#1769) · 07339c73
  Illia Silin authored Dec 20, 2024
  
- hot-fix (#1768) · 1c45ca35
  carlushuang authored Dec 20, 2024
  