Commits · 07339c738396ebeae57374771ded4dcf11bddf1e · gaoqiong / composable_kernel_ROCM

20 Dec, 2024 4 commits

fix typo for CK_USE_OCP_FP8 (#1769) · 07339c73
Illia Silin authored Dec 20, 2024

07339c73
hot-fix (#1768) · 1c45ca35
carlushuang authored Dec 20, 2024

1c45ca35

[CK_TILE] Add fmha fwd N-Warp S-Shuffle pipeline (fmha fwd splitkv pipeline variant) (#1705) · 37cdbf4f

Po Yen Chen authored Dec 20, 2024



* Add check for zero values

* Add static assertions

* Remove invalid option '-e' in smoke_test.sh

* Use correct path of smoke_test.sh

* Avoid zero-sized shared memory array

* Add warning comment

* Replace expr by integer_divide_ceil() call

* Use more readable constant names

* Write down assumption as static assertion

* Add more diagnostic error messages

* Fix wrong BlockWarps when using default pipeline policy

* Add more static assertions for A LDS desc

* Allow using vector size < 8 for data type fp16/bf16

* Align vector size between DRAM dist & LDS desc

* Remove no-longer used func decl

* Fix wrong displayed pipeline name

* Undo policy template changes for tile_example_gemm_basic

* Add missing space and make error message stand out

* Unify print precision

* Add missing include directive <iomanip>

* Replace constant 64 by get_warp_size() call

* Replace constant 128 by named variable: BankLength

* Add kAMBlock/kBNBlock attributes

* Allow using different A/B warp dist for multiple blocks

* Add helper function to get warp dist encodings

* Add 4x64x4 fp16 warp gemm attribute impl

* Complete the A/B warp dist encoding logic

* Fix wrong thread mapping for C matrix

* Use smaller vector size for small tile

* Add static assert to block unsupported warp gemm impl

* Extract common code out as helper method

* Add 4x64x16 fp16 warp gemm type alias

* Add comment to warn developers

* Undo WarpGemmAtrributeMfma<> changes

* Use more clear static assertion error message

* Add trivial wrapper to get warp dstr encodings

* Only transpose warp gemm result if it's square

* Fix compilation error

* Support multi-block warp gemm (on N direction)

* Remove duplicated code

* Fix output encoding of warp gemm

* Fix wrong shape of WarpGemmAtrributeMfmaIterateK<>

* Remove unused code

* Fix wrong shape of WarpGemmAttributeMfmaImplF16F16F32M4N64K4

* Add type config for bf16_t

* Add 4x64x16 bf16 warp gemm

* Update WarpGemmAtrributeMfmaIterateKAndTransposedCDistribution

* Add 64x4x4 fp16/bf16 warp gemm impl

* Add 64x4x16 fp16/bf16 warp gemm

* Add static assertion for better error diagnostic

* Get Q dram dstr directly from block gemm

* Add missing header: fused_moe.hpp

* Allow specifying different warp-gemm for gemm0 & gemm1

* Store P matrix into LDS before gemm1

* Fix inconsistent kernel name

* Remove constraint on gemm0 & gemm1 block warps

* Remove unsupported vector size from checking list

* Allow using 4x64x16 warp gemm for gemm0

* Finish policy customization

* Finish pipeline modification

* Use block warps in codegen

* Fix wrong rank of m_lds_window origin

* Use better distributed tensor

* Make P-store earlier

* Remove duplicated expressions

* Remove unnecessary tile window

* Create new files for new splitkv pipeline

* Separate old/new pipeline codegen logic

* Sync changes from develop

* Undo gemm kernel/pipeline changes

* Undo gemm example changes

* Remove blank lines

* Fix typo

* Use new warp gemm interface

* Fix link error

* Fix wrong pipeline tag

* Fix more link error

* Avoid unnecessary padding

* Always use vector load for K

* Padding on fastest dimension when necessary

* Force padding Q on hdim_q

* Set high dimension padding flag to false

* Re-format headers

* Use warps=<1, 4, 1> for both gemm0 & gemm1

* Fix compilation errors

* Remove m/l shuffle logics

* Ignore duplicate data when write lse_acc

* Use gemm0 block warps as lds tile width

* Remove hard-coded numbers

* Fix wrong distribution width

* Remove unnecessary code

* Add s_barrier before writing to LDS

* Store Q into LDS before gemm0

* Fix wrong Q tile size

* Use simple Q lds descriptor for debugging

* Use more realistic Q lds descriptor

* Add comment & use better variable name

* Make Q lds space not overlapped with others

* Remove unnecessary block_tile_reduce_sync() call

* Move Q load statements

* Move block_sync_lds() right before use

* Re-order instructions

* Remove unnecessary lambda expression

* Use 8 threads on kMaxSplits direction while doing reduction

* Tiny correction for using 8 threads on kMaxSplits direction for combine kernel

* Padding num_split direction of o_acc tile window to 4x

* Update splitkv combine pipeline design

* Add kN1 back to splitkv combine pipeline problem

* Fix compilation errors

* Add missing template parameter

* Fix wrong splitkv combine kernel name

* Fix wrong origin

* Fix wrong LDS descriptor shape

* Fix sync & reduction logics

* Remove unnecessary static assertions

* Extract tile size computation logics

* Make sure we can reuse padding flags in combine kernels

* Rename variables

* Use OaccDataType in BlockFmhaSplitKVCombinePipelineTileSizes<>

* Remove unnecessary static assertion

* Fix function name typo

* Add constraint on kN1 template parameter

* Hide K tile loading latency in earlier iteration

* Fix wrong splitkv kernel name

* Use s_shuffling to replace p_shuffling, which removes the need for cross-warp reduction

* Rename pipeline

* Fix wrong pipeline name attribute

* Add GetAlignmentQ() for NWarpSShuffle pipeline

* Separate Q tile into dram tile & register tile concepts

* Remove non-square warp gemm transpose c type alias

* Fallback tile size changes for fmha fwd splitkv

* Remove redundant change

* Refine naming for the S tile

* Use better naming of the S tile dstr (read from lds)

* Share Q lds with K lds

* Tiny change

* Fix with using static_for for passing CI checking

---------
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>

37cdbf4f

fix profiler_grouped_gemm (#1766) · 2944c508
Illia Silin authored Dec 19, 2024

2944c508

19 Dec, 2024 1 commit

Apply Ck-tile argument parser for vectors [I/O] (#1758) · e758d006

Mateusz Ozga authored Dec 19, 2024

* Parser for a vector was added. Additionally, we validate the correctness of numbers

* Remove unnecessary comments

* Review part 1

* Review part 2

* Add const to variadic lambda

* Rename C->K

e758d006

18 Dec, 2024 3 commits

[CK TILE] Refactor GemmKernel to be reused by other GEMM related operators (#1730) · 453ca373

aledudek authored Dec 18, 2024

* Gemm Kernel Refactor part1

* Gemm Kernel Refactor common gemm pipeline part2

* [CK TILE] Refactor batched gemm to reuse GemmKernel

* [CK TILE] Refactor GemmKernel - review changes part1

* [CK TILE] Refactor GemmKernel - references fix

* [CK TILE] Refactor GemmKernel - naming changes, add problem

* [CK_TILE] Refactor GemmKernel - update tests

* [CK_TILE] Refactor GemmKernel - review changes

* [CK_TILE] Refactor GemmKernel - update test

* [CK_TILE] Refactor GemmKernel - constness fixes

* [CK_TILE] Refactor GemmKernel - update tests

453ca373

Disambiguate bit_cast (#1749) · 1c1b3363

Xiaodong Wang authored Dec 18, 2024



Adding namespace to disambiguate with std::bit_cast
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

1c1b3363

[CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm (#1743) · f6c4d614

aledudek authored Dec 18, 2024

* [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm

* [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm - review changes

* [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm - review fix

f6c4d614

17 Dec, 2024 6 commits

updated fp16 instances to be on parity with universal gemm instances (#1754) · d9e37c68
Harisankar Sadasivan authored Dec 17, 2024
```
* updated fp16 instances to be on parity with universal gemm instances

* corrected instance name to streamk instance
```
d9e37c68
Pass build flags to config.h (#1760) · 689a5ae4
Illia Silin authored Dec 17, 2024
```
* pass the build flags to config.h

* fix clang format
```
689a5ae4
refactor conditional usage; fix build on rocm6.1 where the reference didn't exist · 6ef8d3c2
Max Podkorytov authored Dec 12, 2024

6ef8d3c2

Bump rocm-docs-core from 1.11.0 to 1.12.0 in /docs/sphinx (#1753) · 0e54d7ae

dependabot[bot] authored Dec 17, 2024

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.11.0 to 1.12.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.11.0...v1.12.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

0e54d7ae

Added unit tests for CK Tile compute bound gemm pipeline (#1728) · 627a27bd
jakpiase authored Dec 17, 2024

627a27bd

Enhance printing functionality (#1751) · d46196f2

Adam Osewski authored Dec 17, 2024



* Added object print with all template parameters

* fix clang format

---------
Co-authored-by: ravil-mobile <ravil.aviva.com@gmail.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>

d46196f2

16 Dec, 2024 5 commits
- clarify release notes bullet point · 0fd6978d
  Max Podkorytov authored Dec 10, 2024
  
  0fd6978d
- add contributing placeholder · 1b75c77d
  Max Podkorytov authored Dec 10, 2024
  
  1b75c77d
- add pull request template placeholder · 30a37cac
  Max Podkorytov authored Dec 10, 2024
  
  30a37cac
- add template placeholders · a8ad7fcc
  Max Podkorytov authored Dec 10, 2024
  
  a8ad7fcc
- upgrade sqlalchemy version (#1748) · fdfe2102
  Illia Silin authored Dec 15, 2024
```
* upgrade sqlalchemy version

* replace the connection with engine in to_sql call

* change the hipTensor ctest syntax
```
  fdfe2102
15 Dec, 2024 1 commit

added moe interleaving pipeline (#1712) · f57d720c

Xu, Shengnan authored Dec 15, 2024



* added moe interleaving pipeline

* remove redundant code

* formatter

---------
Co-authored-by: root <root@hjbog-srdc-14.amd.com>

f57d720c

14 Dec, 2024 2 commits
- upgrade pandas package (#1746) · d68974a5
  Illia Silin authored Dec 13, 2024
  
  d68974a5
- Add zstd lib for building hipTensor. (#1745) · 41ebf117
  Illia Silin authored Dec 13, 2024
```
* add zstd library to CI docker

* fix the libzstd name
```
  41ebf117
13 Dec, 2024 2 commits

Add SplitK support into Batched GEMM V3 (#1729) · 4d8fce33

Bartłomiej Kocot authored Dec 13, 2024



* add bmm api

* add bf16 multi_d

* add ckProfiler for bf16

* add ckProfiler files

* add more instance; fixed 64bit index issue

* fixed naming

* enabled batched Ds

* use long_index for ds offsets

* clean

* add bmm fp8 ckProfiler

* Update example/24_batched_gemm/batched_gemm_xdl_bf16_v3.cpp
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update example/24_batched_gemm/batched_gemm_xdl_fp8_rowwise_v3.cpp
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update example/24_batched_gemm/run_batched_gemm_example_rowwise.inc
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn.hpp
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_mem_v1_default_instance.cpp
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_mem_v2_default_instance.cpp
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update profiler/src/profile_gemm_universal_batched.cpp
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update profiler/include/profiler/profile_gemm_universal_batched_impl.hpp
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* clean

* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp

* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp

* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_comp_default_instance.cpp

* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp

* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp

* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp

* refactor batch offset func

* add splitk support into bmm_v3

* clean

* clean

* format

* fixed

* fix

---------
Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>

4d8fce33

Ck tile/smoothquant out stride (#1742) · 4e731776

chenjun authored Dec 13, 2024

* add ck_tile/smoothquant out stride parameter

* Remove the default stride value

---------

Co-authored-by: so <a.com>

4e731776

12 Dec, 2024 1 commit

[CK_TILE] naive attn (#1708) · 77a38e02

carlushuang authored Dec 12, 2024

* add reference attention fwd

* refactor addresser

* update

* paged, and i8 reflect-quant

* lets call it forward-quant

* fix error in decode variation

* update naive-attn

* fix page table

* fix build err

77a38e02

10 Dec, 2024 4 commits
- add missing stdexcept header (#1740) · 357a0b1c
  Illia Silin authored Dec 10, 2024
  
  357a0b1c
- Upgrade to Ubuntu22.04 as default OS. (#1738) · 90d8410d
  Illia Silin authored Dec 10, 2024
```
* upgrade to ubuntu 22.04

* try adding -u root docker options for ubuntu 22
```
  90d8410d
- Make sure we call __hneg with half to remove ambiguous error (#1736) · 67497a04
  Jatin Chaudhary authored Dec 10, 2024
  
  67497a04
- [CK TILE] Use config name instead of data type in FmhaFwdTypeConfig<config> (#1731) · 94ae7113
  rocking authored Dec 10, 2024
```
* Add data type config, Prepare to add mix precision in the future

* Fix compile error
```
  94ae7113
09 Dec, 2024 3 commits
- build CI for gfx12 by default (#1734) · 23cf2026
  Illia Silin authored Dec 09, 2024
  
  23cf2026
- update CI timeout limits (#1733) · 2f088b87
  Illia Silin authored Dec 09, 2024
  
  2f088b87
- remove unnecessary file (#1732) · c773cc25
  Illia Silin authored Dec 09, 2024
  
  c773cc25
06 Dec, 2024 5 commits

Refactor CI performance tests. (#1726) · 355893cd

Illia Silin authored Dec 06, 2024

* merge the build and performance tests CI stages together

* add gemm performance test on gfx11/gfx12

* add suffixes to distinguish gemm performance logs from different archs

* use smaller gemm set in CI for gfx10/gfx11/gfx12

* disable performance tests on gfx1030

* fix the stashing logic

* fix finding python3 for mha instances

355893cd

Add copy assignment op test (#1718) · 5e6bd75a
Rostyslav Geyyer authored Dec 06, 2024
```
* Add copy assignment op test

* Add a deep copy testing
```
5e6bd75a
Support large batch tensors in grouped conv bwd data (#1711) · 261f1759
Bartłomiej Kocot authored Dec 06, 2024
```
* Support large batch tensors in grouped conv bwd data

* Fix multiD

* fixes

* fixes

* fixes
```
261f1759
Undo padding-flag changes in fmha_fwd_kernel.hpp (#1725) · 58e7f37f
Po Yen Chen authored Dec 06, 2024

58e7f37f

Upgrade default compiler to ROCm6.3 (#1723) · 86990558

Illia Silin authored Dec 05, 2024



* upgrade to rocm6.3 compiler

* Proposed solution to convnd test failures in ROCm 6.3

---------
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>

86990558

05 Dec, 2024 2 commits

Add IsSupportedArgument() to gemm_kernel (#1698) · feb9a2bd

jakpiase authored Dec 05, 2024

* add IsSupportedArgument to gemm_kernel

* add ut and do some refactoring

* switched to ck_tile's integral_constant

feb9a2bd

Bump rocm-docs-core from 1.10.0 to 1.11.0 in /docs/sphinx (#1720) · d2d1d177

dependabot[bot] authored Dec 04, 2024

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.10.0 to 1.11.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.10.0...v1.11.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

d2d1d177

04 Dec, 2024 1 commit

Ck tile grouped GEMM example (#1713) · 4cb3d7d7

Mateusz Ozga authored Dec 04, 2024



* Ck-tile, impl. grouped gemm

* Workspace is allocated by user, and is passed to the function

* Prepare test to new api design

* Unify GemTransKernelArgs, removing N0 param

* Add 1 to dim3 in partitioner

* Typo: gem - > gemm

---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

4cb3d7d7