- 13 Feb, 2025 1 commit

aska-0096 authored

- 10 Feb, 2025 1 commit

Mingtao Gu authored
* remove redundant kernels
* added batched_gemm_xdl_fp16int4_b_scale_v3
* enabled split-K
* added the batched_gemm_b_scale ckProfiler; hit a function issue
* fixed typos and some bugs
* fixed the ckProfiler build issue
* updated some debug info
* fixed some bugs and refactored the code
* fixed a function bug
* formatted files
* uncommented the ckProfiler CMakeLists
* fixed ckProfiler for batched_gemm_b_scale

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
- 05 Feb, 2025 1 commit

aska-0096 authored

- 20 Jan, 2025 1 commit

deepsek authored
* feat: add BF16 input instances
* feat: add BF16 profiler code
* fix: reorder enum types
* fix: CI failure due to clang-format
* fix: clang script format issue
* fix: clang-format broke the CMakeLists file
- 17 Jan, 2025 1 commit

deepsek authored
* fix: preprocessor if/else logic error
* fix: added macros as preferred by the CK team
- 13 Jan, 2025 1 commit

feli authored
* port tiles from a8w8
* remove files used for debugging
* add instances
* remove all non-GEMM targets from CMake
* merge; implement fp16
* restore CMake from develop
* add missing files; fix clang-format

Co-authored-by: coderfeli <coderfeli@163.com>
- 03 Jan, 2025 1 commit

Mingtao Gu authored
* enable the int4 scale (weight-only) kernel
* format some files
* add a unit test for int4 weight-only
* fix and format code
* fix a bug in the ckProfiler and format the code

Co-authored-by: mtgu0705 <mtgu@amd.com>
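For context on what these int4 weight-only (scale) kernels compute, here is a minimal host-side reference sketch: the 4-bit weights in B are widened and multiplied by a scale before the multiply-accumulate. It assumes a per-column scale for B, and the function name and layouts are illustrative only, not CK's API.

```cpp
// Illustrative host-side reference only -- not the CK kernel.
// Each 4-bit weight of B is dequantized with its column scale on the fly.
#include <cstdint>
#include <vector>

void dequant_gemm_ref(const std::vector<float>& A,        // M x K, row-major
                      const std::vector<int8_t>& B_int4,  // K x N, values in [-8, 7]
                      const std::vector<float>& B_scale,  // one scale per column of B (assumption)
                      std::vector<float>& C,              // M x N, row-major
                      int M, int N, int K)
{
    for(int m = 0; m < M; ++m)
        for(int n = 0; n < N; ++n)
        {
            float acc = 0.f;
            for(int k = 0; k < K; ++k)
            {
                // dequantize the 4-bit weight with its column scale
                const float b = B_scale[n] * static_cast<float>(B_int4[k * N + n]);
                acc += A[m * K + k] * b;
            }
            C[m * N + n] = acc;
        }
}
```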
- 02 Jan, 2025 3 commits

Muhammed Emin Ozturk authored
* initial implementation and CMake files
* successful compilation, but validation failed
* GPU validation
* GEMM universal stream-K updates
* stream-K bf16 universal instance
* gemm_universal_streamk.hpp
* only build for gfx94
* profiler update; bf16 stream-K only works on gfx942
* remove unneeded flags
* gemm universal stream-K fix
* profiler fix
* fix instance
* argument-supported checks
* remove unnecessary comments; clang-format cleanups
* update library/src/tensor_operation_instance/gpu/CMakeLists.txt
* copyright comment 2025

Co-authored-by: afagaj <john.afaganis@gmail.com>
Co-authored-by: Emin Ozturk <ozturk.27@osu.edu>
Co-authored-by: root <root@ctr-ubbsmc16.amd.com>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-008.hpcfund>
Co-authored-by: root <root@splinter-126-wr-d3.amd.com>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t006-001.hpcfund>
Co-authored-by: Muhammed Emin Ozturk <meozturk@login1.hpcfund>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-004.hpcfund>
Co-authored-by: Emin Ozturk <emin.ozturk@utah.edu>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t008-001.hpcfund>
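The stream-K work above changes how GEMM work is assigned to workgroups. Below is a minimal sketch of the stream-K partitioning idea, assuming an even split of the combined tile/K-loop iteration space; all sizes and names are illustrative, not CK's actual scheduler.

```cpp
// Sketch of stream-K partitioning: the full iteration space
// (output tiles x K-loop steps per tile) is split evenly across workgroups,
// so a tile's K loop may be shared by two workgroups whose partial results
// are reduced afterwards.
#include <cstdio>

int main()
{
    const int num_tiles        = 10; // number of output (M, N) tiles (illustrative)
    const int k_iters_per_tile = 8;  // K-loop steps needed per tile (illustrative)
    const int num_workgroups   = 6;  // e.g. one per compute unit (illustrative)

    const int total_iters = num_tiles * k_iters_per_tile;

    for(int wg = 0; wg < num_workgroups; ++wg)
    {
        // evenly split [0, total_iters) across workgroups
        const int begin = static_cast<int>(static_cast<long>(total_iters) * wg / num_workgroups);
        const int end   = static_cast<int>(static_cast<long>(total_iters) * (wg + 1) / num_workgroups);

        // which tiles and which K-steps this workgroup touches
        printf("wg %d: iters [%d, %d) -> tiles %d..%d\n",
               wg, begin, end, begin / k_iters_per_tile, (end - 1) / k_iters_per_tile);
    }
    return 0;
}
```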
aska-0096 authored

Adam Osewski authored
* add a prototype of int4
* move packing into dynamic_buffer
* fix coordinate reset
* add fast pk_i4-to-half conversion
* fix reference and host_tensor; fix tensor init
* debug i4-to-f16 conversion
* fix split-K
* weight permute; add B tile permute; weight permute with split-K
* improve weight layout
* add and_or_b32
* fix split-K crash
* add permute switch as a template
* recover v3r1
* fix failure with intrawave v2
* add ckProfiler
* add bf16 support and a bf16 example
* fix int4 to bhalf_t conversion
* add instances for mem
* fix host tensor size
* add pk_i4_t as a struct
* apply review updates to example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp and example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp
* update CMakeLists.txt, script/cmake-ck-dev.sh and include/ck/tensor_operation/gpu/element/unary_element_wise_operation.hpp
* fix assert
* fix the I4 define in ckProfiler
* fix the example_gemm_xdl_bf16_pk_i4_v3 test failure

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>
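The pk_i4_t work above packs two 4-bit integers into one byte and converts them to half/bf16 on the fly. The following is a simplified sketch of that packing and unpacking, using plain float as a stand-in for the half type; the struct and helper names are illustrative, not the actual CK implementation.

```cpp
// Sketch of the packed-int4 idea behind pk_i4_t (not the CK type):
// two signed 4-bit values live in one byte and are unpacked and widened
// before the multiply.
#include <cstdint>
#include <utility>

struct pk_int4_sketch
{
    uint8_t data; // low nibble = first value, high nibble = second value
};

// sign-extend a 4-bit nibble to a signed 8-bit integer
inline int8_t sign_extend4(uint8_t nibble) { return static_cast<int8_t>(nibble << 4) >> 4; }

// unpack both 4-bit values and convert to float (stand-in for half/bf16)
inline std::pair<float, float> unpack_to_float(pk_int4_sketch p)
{
    const int8_t lo = sign_extend4(p.data & 0xF);
    const int8_t hi = sign_extend4(p.data >> 4);
    return {static_cast<float>(lo), static_cast<float>(hi)};
}
```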
- 31 Dec, 2024 1 commit

aska-0096 authored
- 30 Dec, 2024 2 commits
- 27 Dec, 2024 3 commits
- 25 Dec, 2024 1 commit

coderfeli authored
- 24 Dec, 2024 2 commits
- 23 Dec, 2024 1 commit

coderfeli authored
- 13 Dec, 2024 1 commit

Bartłomiej Kocot authored
* add bmm API
* add bf16 multi_d
* add ckProfiler support and files for bf16
* add more instances; fix the 64-bit index issue
* fix naming
* enable batched Ds
* use long_index for Ds offsets
* add bmm fp8 ckProfiler
* apply review updates to the batched_gemm examples, the gemm_universal_batched bf16 instances, the universal batched GEMM profiler, and device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp
* refactor the batch offset function
* add split-K support to bmm_v3

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
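The 64-bit index and long_index changes above address offset overflow for large batched tensors. A minimal sketch of the idea, with illustrative names (this is not CK's actual batch-offset code):

```cpp
// Sketch: with large batch strides, batch_id * stride overflows 32-bit
// arithmetic, so the offset is computed in a 64-bit type before indexing.
#include <cstdint>

using long_index_t = int64_t;

inline long_index_t get_batch_offset(int32_t batch_id, long_index_t batch_stride)
{
    // promote to 64 bits before the multiply so large strides do not overflow
    return static_cast<long_index_t>(batch_id) * batch_stride;
}

// usage: pointer arithmetic with the 64-bit offset
// const float* d_batch = d_base + get_batch_offset(g_idx, d_batch_stride);
```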
- 27 Nov, 2024 1 commit

Adam Osewski authored
* A few small fixes.
* New GroupedGemm instances (BF16).
* Unify and refactor the GroupedGEMM device API; adapt changes to the new API.
* Adapt the grouped GEMM profiler.
* Accept multiple kbatch values in the grouped GEMM profiler; delete the obsolete two-stage version, now covered by grouped GEMM.
* Update the unit test for grouped GEMM.
* Fix thresholds for BF16 and F8; unblock tests.
* Fix a few instances; multiple small fixes.
* Check dynamic casting when adapting to the new API.
* Uncomment a few data types in the grouped GEMM profiler.
* Fix the call to SetDeviceArgs.
* Fix the profile grouped gemm multiply tile loop.
* Fix grouped GEMM tile loop kernel args in client examples.
* Address review comments.
- 21 Nov, 2024 1 commit

Harisankar Sadasivan authored
* universal stream-K fp8 changes and ckProfiler instances
* revert strides to -1 and the verification options
* exclude fp8 on pre-gfx94 for universal_streamk
* PR-review revisions: permissions reverted, removed HIP error checks

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
- 18 Nov, 2024 1 commit

Bartłomiej Kocot authored
* Batched GEMM Multiple D based on Universal GEMM
* CI fixes

Co-authored-by: Jing Zhang <jizhan@fb.com>
- 15 Nov, 2024 1 commit

Illia Silin authored
- 06 Nov, 2024 1 commit

rocking authored
- 05 Nov, 2024 2 commits
- 01 Nov, 2024 1 commit

Illia Silin authored
* disable fp8 gemm_universal on gfx90a and gfx908 by default
* fix cmake syntax
* fix clang format
* add ifdefs in amd_xdlops
* disable fp8 gemm instances on gfx90a by default
* update readme
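The ifdef work above gates fp8 paths by target architecture. Below is a schematic of that general pattern, assuming the compiler-defined __gfx942__ device macro; CK_ENABLE_FP8_XDL is a hypothetical macro used purely for illustration, and this is not the actual amd_xdlops.hpp code.

```cpp
// Schematic arch-gating pattern (illustrative, not CK's code): fp8 MFMA paths
// are compiled only for architectures with the hardware instructions; other
// targets skip the instantiation or fall back.
#if defined(__gfx942__)
#define CK_ENABLE_FP8_XDL 1 // hypothetical macro, for illustration only
#else
#define CK_ENABLE_FP8_XDL 0 // e.g. gfx908 / gfx90a: no native fp8 MFMA
#endif

#if CK_ENABLE_FP8_XDL
// ... emit the fp8 MFMA-based kernel / instance here ...
#else
// ... skip the fp8 instantiation (or route to an alternative path) ...
#endif
```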
- 30 Oct, 2024 1 commit

aska-0096 authored
- 26 Oct, 2024 1 commit

valarLip authored
* add int8 GEMM multiply multiply a8w8
* uncomment
* clang-format-12
* add example_gemm_multiply_multiply_xdl_int8
* remove shell scripts
* update preprocess number for MI308; bring back printout in ckProfiler

Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: Haocong WANG <haocwang@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
- 23 Oct, 2024 1 commit

Bartłomiej Kocot authored
- 22 Oct, 2024 1 commit

Bartłomiej Kocot authored
* Enable grouped conv bwd wei bf16 NGCHW
* Assorted fixes
- 21 Oct, 2024 4 commits

chenjun authored
Thomas Ning authored
* draft for adding a ckProfiler instance
* support the ckProfiler instance with the same data types
* add a small feature for switching the M and N variables
* partially solve the incorrect-result problem
* fixes based on CI/CD
chenjun authored

chenjun authored
- 07 Oct, 2024 1 commit

Illia Silin authored
* update the build logic with GPU_ARCHS
* fix the GPU_ARCHS build for codegen
* unset GPU_TARGETS when GPU_ARCHS is set
- 20 Sep, 2024 1 commit

Bartłomiej Kocot authored
* Support NGCHW in grouped conv fwd
* Remove an unneeded variable
* Fixes
- 17 Sep, 2024 1 commit

aledudek authored
* Extend the pool3d fwd avg and max operations with f8_t and int8_t types
* Pack MaxPool3dFwd params together
* Fix MaxPool3dFwd AVG instances
* Decrease verification precision for bf16
* Adjust tests + review changes
* Adjust the threshold for F8
* Adjust compute types for MAX op instances
* Fix ComputeDataType mismatch in tests and profiler for AVG
* Rename max_pool3d_fwd to pool3d_fwd
* Adjust CMakeLists

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
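The bf16/F8 threshold adjustments above (and the earlier grouped GEMM threshold fixes) reflect that verification tolerances must scale with the data type's precision. Here is a minimal sketch of such a type-dependent check, with approximate epsilon values; the helper is illustrative, not CK's verification code.

```cpp
// Sketch: the relative tolerance is scaled by the machine epsilon of the
// narrowest type in the computation, so bf16 and f8 results get looser
// bounds than fp32.
#include <cmath>

// approximate machine epsilons: fp32 ~ 2^-23, bf16 ~ 2^-7, f8 (e4m3) ~ 2^-3
constexpr double eps_fp32 = 1.0 / (1 << 23);
constexpr double eps_bf16 = 1.0 / (1 << 7);
constexpr double eps_f8   = 1.0 / (1 << 3);

// pass if |out - ref| <= atol + rtol * |ref|, with rtol derived from epsilon
inline bool check_close(double out, double ref, double type_eps, double k_reduce)
{
    const double rtol = k_reduce * type_eps; // k_reduce grows with the reduction length
    const double atol = type_eps;
    return std::fabs(out - ref) <= atol + rtol * std::fabs(ref);
}
```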