Commits · 598cfd77dd66ccd36f8cffcc5453def97088d3da · gaoqiong / composable_kernel_ROCM

01 Oct, 2024 1 commit

[CK_TILE] Change output accum tensor layout of fmha fwd split-kv & combine kernels (#1527) · a1c07e8d

Po Yen Chen authored Oct 01, 2024

* Use same layout for o_acc and o tensor

* Use better param names in partitioner

* Remove redundant kargs 'max_seqlen_q'

* Use better param names in splitkv kernel

* Add comment for additional kernel arguments

* Sync empty loop early return logics between pipelines

* Pass more arguments to cmake in scripts

* Align backslashes

* Fix wrong o_acc tensor view strides

* Change o_acc layout if o_perm=0

* Handle whole row masked via attn_bias

* Use vector width = 1 for o_acc

* Use more even split sizes

a1c07e8d

20 Sep, 2024 1 commit
- Add support for NGCHW in grouped conv fwd (#1499) · 4ba52b35
  Bartłomiej Kocot authored Sep 20, 2024
```
* Support NGCHW in grouped conv fwd

* Remove not needed variable

* Fixes
```
  4ba52b35
03 Sep, 2024 1 commit
- Add support for NGCHW in grouped conv bwd wei (#1491) · 73b67f29
  Bartłomiej Kocot authored Sep 03, 2024
```
* Add support for NGCHW in grouped conv bwd wei

* Comments fixes

* navi fixes

* Update function names
```
  73b67f29
20 Aug, 2024 1 commit
- Convert MIOpen driver to ckProfiler script typos fix (#1476) · dc82daa8
  Bartłomiej Kocot authored Aug 20, 2024
  
  dc82daa8
19 Aug, 2024 1 commit
- Add script to convert MIOpen driver to ckProfiler (#1472) · a6a79665
  Bartłomiej Kocot authored Aug 19, 2024
```
* Add script to convert MIOpen driver to ckProfiler

* Fix
```
  a6a79665
16 Aug, 2024 1 commit

Add performance and large tensor tests for grouped conv (#1456) · 2581727d

Bartłomiej Kocot authored Aug 16, 2024



* Add performance and large tensor tests for grouped conv

* Resize tests

* Resize tests

* update the python script to parse the grouped_conv results

* Remove int8 tests

* change bwd wei layout

---------
Co-authored-by: illsilin <Illia.Silin@amd.com>

2581727d

12 Aug, 2024 1 commit

Rewrite *sh reduce unit tests to gtest: part 1 (#1407) · ab60b390

Mateusz Ozga authored Aug 12, 2024



* Rewrite .sh test to Gtest

* review changes

* Remove unused comments

* Review v2

* Typo

* Separate UT: AMAX, MAX, MIN; added template params to trigger them

* Update test/reduce/reduce_no_index.cpp

---------
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

ab60b390

07 Aug, 2024 1 commit

Run CK_TILE FMHA benchmarks and collect the performance data. (#1447) · 12c1f68d

Illia Silin authored Aug 07, 2024

* run ck_tile benchmarks after the smoke tests and store logs

* change the path of fmha benchmark logs

* change the way of stashing ck_tile fmha logs

* prevent the errors in stages where no logs are generated

* fix the ck_tile fmha log names and headers

* generate the fmha performance logs in the root folder

* change jenkins script arguments format

* use exact file names for stashing

* modify scripts to process FMHA performance results

* unstash FMHA logs before parsing them

12c1f68d

22 Jul, 2024 1 commit
- Revert Support access per groups and filter2x3 in grouped conv fwd (#1382) (#1406) · 5d8c3d81
  Bartłomiej Kocot authored Jul 22, 2024
  
  5d8c3d81
08 Jul, 2024 1 commit
- Add ckProfiler support for forward 3D convolutions with OUT element-wise operations. (#1354) · eb44e047
  Andriy Roshchenko authored Jul 08, 2024
  
  eb44e047
06 Jul, 2024 1 commit

Universal streamk with atomics (#1360) · 75e622f0

Harisankar Sadasivan authored Jul 05, 2024

* universal streamk with atomics, with ckProfiler support. grid_size and streamk strategy are tunable. A grid_size of -1 sets #WGs = maximum occupancy × num_CUs. The implementation supports several streamk policies: 1-tile, 2-tile, 3-tile and 4-tile. A streamk strategy of -1 selects the default streamk policy (4-tile).

* Update README.md

* fixing clang-format issues

* removed conflicts in struct members between streamk and universal streamk

* corrected arg parsing for streamk and universal streamk

* added stream-k policies for 3 tile and 4 tile

* fixed argument type issue with parsing cmd args

* made the changes suggested in PR review: removed comments and corrected copyright

* file permissions updated

* added default value support for grid_size and streamk-policy selection set to -1

* print messages for arguments

* print messages for arguments

* print messages for arguments1

75e622f0
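
The defaulting rules described in the commit message above (grid_size of -1 and streamk strategy of -1) can be sketched in a few lines. This is only an illustration of the stated behaviour, not the library's code: the function name `resolve_streamk_launch`, the `StreamKLaunch` type, and the validation of other inputs are assumptions made for the example.

```python
from dataclasses import dataclass

# The commit message names the 4-tile policy as the default.
DEFAULT_STREAMK_POLICY = 4
# Policies listed in the commit message: 1-tile, 2-tile, 3-tile and 4-tile.
SUPPORTED_POLICIES = (1, 2, 3, 4)


@dataclass(frozen=True)
class StreamKLaunch:
    """Hypothetical container for the resolved launch parameters."""
    num_workgroups: int
    tile_policy: int


def resolve_streamk_launch(grid_size, streamk_policy, max_occupancy, num_cus):
    """Resolve the -1 sentinels as described in the commit message.

    grid_size == -1      -> #WGs = maximum occupancy x num_CUs
    streamk_policy == -1 -> default policy (4-tile)
    Rejecting other non-positive or unknown values is an assumption.
    """
    if grid_size == -1:
        num_workgroups = max_occupancy * num_cus
    elif grid_size > 0:
        num_workgroups = grid_size
    else:
        raise ValueError("grid_size must be -1 or a positive integer")

    if streamk_policy == -1:
        tile_policy = DEFAULT_STREAMK_POLICY
    elif streamk_policy in SUPPORTED_POLICIES:
        tile_policy = streamk_policy
    else:
        raise ValueError("streamk_policy must be -1 or one of 1, 2, 3, 4")

    return StreamKLaunch(num_workgroups, tile_policy)
```

For example, with a hypothetical device that has 104 compute units and an occupancy of 2 workgroups per CU, passing -1 for both arguments resolves to 208 workgroups and the 4-tile policy.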

10 May, 2024 1 commit
- Code clean-up (#1285) · 566b6480
  Illia Silin authored May 10, 2024
```
* code clean-up

* remove the profiling output samples
```
  566b6480
16 Apr, 2024 1 commit

introducing ck_tile! (#1216) · db376dd8

carlushuang authored Apr 16, 2024

* enable gfx940

* switch between intrinsic mfma routines on mi100/200 and mi300

* fix mfma_int8 on MI300

* disable 2 int8 examples on MI300

* Update cmake-ck-dev.sh

* restore gitignore file

* modify Jenkinsfile to the internal repo

* Bump rocm-docs-core from 0.24.0 to 0.29.0 in /docs/sphinx

Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.0 to 0.29.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.0...v0.29.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>

* initial enablement of gfx950

* fix clang format

* disable examples 31 and 41 int8 on gfx950...

db376dd8

14 Apr, 2024 1 commit

[GEMM] Gemm universal device operation (#1154) · f83e9701

Haocong WANG authored Apr 14, 2024



* Optimize GEMM on MI200/300:
1. Add new blockwise gemm pipeline
2. Add irregular splitk instances

* clang format + typo fix

* Fix a bug

* initial commit

* Add more instances to irregular splitk

* blkgemm pipeline v1~4 prototype

* Sanity Checked. Known issue:
1. Poor performance of splitk
2. Register spill on blkgemmpipeline v3

* Sanity and Performance fix:
1. fix a bug related to sanity in grouped b2c mapping
2. fix a bug related to sanity and performance in splitk offset

* Sanity and API update:
1. Remove prefetch stage
2. Fix valid check bug
3. Add first gemm_universal instance into ckProfiler

* Add NN instances for gemm universal

* 1. Add NT instances for gemm_universal
2. Fix a bug about Kpadding in gemm_universal

* Fix a bug regarding padding Odd K number

* remove kernel print

* Fix KPadding bug...

* Update safety check

* another try to fix kpadding..

* Sanity checked

* new instances..

* clang format+typo fix

* remove clang format script's change

* Add non-hotloop compile option

* 1. Add fp16xfp8 example
2. pull packed convert f8 from pr1150

* Some miscs.. opt and fix

* Add pipeline description docs

* Split universal gemm instance library to cut profiler compiling time

* uncomment cmakefile

* Fix a bug caused by blockwise_gemm_pipe_v2

* reduce default splitk to 1

* Add 224x256x64 tile size

* update, including:
1. Experiment pipeline 5~7
2. Optimization for pipeline 4
3. Organized instance library

* temp save

* temp save

* Permuted lds layout, sanity and function checked

* clang format

* Move OOB check from RunRead to RunWrite, for better software pipeline.
TODO: agpr spill when NN layout

* clangformat

* A/B splitpipe scheduler for v3

* Fix two bugs

* bug fix

* fix a bug in oob check

* Example for mixed fp16_fp8 gemm

* Clean experimental code blocks

* Add mixed precision gemm into profiler

* tempsave

* optimize m/n major lds layout

* Add RRR GEMM mixed precision instances

* Optimize f8 matrix transpose

* Add test_gemm_universal

* A/B split schedule for blkpip v5

* Take ds_read2 into iglp scheduling scheme

* format

* fixed cmake

* Add llvm-option into CI cmake flag

---------
Co-authored-by: Jing Zhang <jizhan@amd.com>

f83e9701

02 Apr, 2024 2 commits
- Update cmake-ck-dev.sh · 54793dfd
  zjing14 authored Apr 02, 2024
  
  54793dfd
- Update cmake-ck-dev.sh · e1947323
  zjing14 authored Apr 02, 2024
  
  e1947323
22 Mar, 2024 1 commit
- Add elementwise with dynamic vector dim (#1198) · 9c052804
  Bartłomiej Kocot authored Mar 22, 2024
```
* Add elementwise with dynamic vector dim

* Reduce number of instances

* Fixes

* Fixes
```
  9c052804
18 Mar, 2024 1 commit

Re-enable the performance tracking in CI. (#1203) · bdcd0374

Illia Silin authored Mar 18, 2024

* test CK with rocm6.1 RC2

* add docker credentials for pull

* update the performance db name

* use environment variable for db name

* add rocm-llvm-dev package to ck docker

* turn off verification for daily performance runs

* do not stash ckProfiler on MI300 node

* add processing of mixed gemms to qa, fix parsing of splitk gemm logs

* fix the splitk gemm log file name

* turn the timing on for splitk gemm performance

bdcd0374

12 Mar, 2024 1 commit
- some small changes · 9a9cb884
  illsilin authored Mar 11, 2024
  
  9a9cb884
11 Mar, 2024 1 commit
- fixed conflicts · 7d6cea85
  Jing Zhang authored Mar 11, 2024
  
  7d6cea85
10 Mar, 2024 1 commit
- enable gridwise · e05f0762
  Jing Zhang authored Mar 09, 2024
  
  e05f0762
09 Mar, 2024 1 commit
- fixed · 255fbc56
  Jing Zhang authored Mar 09, 2024
  
  255fbc56
08 Mar, 2024 1 commit
- fixed c_output · 7cb8a89f
  Jing Zhang authored Mar 08, 2024
  
  7cb8a89f
29 Feb, 2024 2 commits
- fixed wmma · 0b914465
  Jing Zhang authored Feb 29, 2024
  
  0b914465
- fixed layout · 2052dfc9
  Jing Zhang authored Feb 29, 2024
  
  2052dfc9
27 Feb, 2024 1 commit
- remove unnecessary changes · 924639f9
  aska-0096 authored Feb 27, 2024
  
  924639f9
26 Feb, 2024 1 commit
- Todo: fix gemm_bilinear_wmma instances compilation bug · 18d5297b
  aska-0096 authored Feb 26, 2024
  
  18d5297b
31 Jan, 2024 1 commit
- add new performance tests for mixed fp16/fp8 gemms (#1151) · 112b691b
  Illia Silin authored Jan 31, 2024
  
  112b691b
24 Jan, 2024 1 commit

Fixing most of the cppcheck errors. (#1142) · 180e5720

Illia Silin authored Jan 24, 2024

* fix cppcheck errors, first pass

* fix format

* fix returned value in examples

* add macro definitions for cppcheck

* fix the profile_gemm logic

* update the gemm profiler logic

* add more definitions to cppcheck, fix a couple more errors

* replace runtime error with message in device function

* fix a couple of int4 issues

* no return for fill function

* fix errors in data_types.hpp

* fix format

* fix few remaining errors

* fix errors in data_types.hpp

* fix last couple of errors in data_types.hpp

180e5720

09 Nov, 2023 1 commit
- add linker script to QA builds (#1030) · 68f2b5e7
  Illia Silin authored Nov 08, 2023
  
  68f2b5e7
07 Nov, 2023 1 commit

Add Gemm instances for performance improvement (#1018) · 98fd41f5

zjing14 authored Nov 07, 2023



* improve kpad

* more tuning parameters

* f16_f8_fp16

* cut test time

* add f16_f8_fp16

* add f16_f8_f16

* testing instances for skinny cases

* format

* clean

* add fp16_f8_fp16

* clang-format

* add grouped gemm instances

* fixed profile grouped_gemm

* clean

* clean

* clean

* clean

* clean

* add missing instance func

* fixed interface

---------
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: root <root@sh5-1e707-rc06-38.mkm.dcgpu>

98fd41f5

30 Oct, 2023 1 commit

Enable sccache in the default docker and CI. (#1009) · 4e44a9e8

Illia Silin authored Oct 30, 2023



* replace ccache with sccache, pin package versions

* put ccache back temporarily to avoid breaking other CI jobs

* add sccache_wrapper.sh script

* fix the package version syntax

* fix the pymysql package issue

* run sccache_wrapper before build if ccache server found

* set the paths before calling the sccache_wrapper

* use /tmp instead of /usr/local for cache

* try using sccache --start-server instead of wrapper

* try using redis server with sccache

* define SCCACHE_REDIS

* add redis and ping packages, and redis port

* use the new sccache redis server

* do not use sccache with staging compiler

* fix the condition syntax

* add stunnel to redis

* add tunnel verification

* separate caches for different architectures

* fix syntax for the cache tag

* use double brackets for conditions

* add bash line to the script

* add a switch for sccache and only use it in build stage

* run check_host function when enabling sccache

* fix the invocation tags for sccache

* fix groovy syntax

* set the invocation tag in groovy

* disable sccache in clang-format stage

* try another syntax for invocation tags

* use local sccache server if can't connect to redis

* fix script syntax

* update README

* refresh readme

* readme updates

* remove the timing and verification caveat from readme

---------
Co-authored-by: Lisa Delaney <lisa.delaney@amd.com>

4e44a9e8

11 Oct, 2023 2 commits

Revert "Grouped Gemm with looping over the tiles. (#788)" (#982) · c99323be
zjing14 authored Oct 11, 2023
```
This reverts commit a4f72a31.
```
c99323be

Grouped Gemm with looping over the tiles. (#788) · a4f72a31

Adam Osewski authored Oct 11, 2023



* Introduce LocalBlockToCTileMap.

* Change the signature of the CalculateBottomIndex() function, which no
longer accepts any argument. The B2C map, which is already passed as an
argument to the kernel Run function, now calculates the block's local ID
outside, at the kernel entry point (__global__ function).
The LocalB2C map stores the local block ID as a member.

* Use LocalBlockToCTile map in device ops.

* First draft of tile loop work distribution.

* Fix typo.

* Simplify kernel arguments.

Calculate descriptors & B2C maps on the device.

* Use looping kernel.

* Fix B2C constructor.

* Fix Navi21 errors.

* Calculate tile start/end in device kernel.

* Change Run API to accept user provided workspace buffer.

* Add new line at EOF.

* Move Gemm KernelArguments to device op interface.

* Remove unused code.

* Update API.

* Launch grid size which is min of occupancy vs tile count

* Get back to use constant memory for gemm descriptors.

* Remove unused code.

* Add default virtual method implementation.

* Update comments to conform with doxygen style.

* Fix doc style and unused parameters.

* Add thread cluster lengths to kernel name.

* Remove old splitk impl and replace it with tile looping one.

* Modify instances.

* set KPerBlock to 64
* maximize wherever possible vector load size.

* Fix instances cluster lengths.

* Change comment style.

* Use 128b store where possible in instances.

* Update test cases, since KPerBlock has doubled.

* Update output stream operator for Sequence.

* Add pipeline version to GroupedGEMM device op type string.

* Fix pipeline version type logging.

* Fix input tensors type after merge.

* Fix compiler error.

* Fix output stream operator for Pipeline version.

* Store using 128b.

* Set of instances with kpb 32/64

* Limit number of instances

* Remove commented out instances.

* Fix function name.

* Limit the number of instances.

Add pipeline version to the regular instances

* Change thr cluster layout for reading B tensor.

* disabled failed instances

---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Jing Zhang <jizha@amd.com>

a4f72a31
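
Two items in the commit above, "Launch grid size which is min of occupancy vs tile count" and "Calculate tile start/end in device kernel", describe a tile-looping work distribution. The sketch below shows one way such a scheme can work: it caps the grid at the tile count and gives each block a contiguous range of tiles. The contiguous split, and all function and parameter names, are assumptions for illustration; the commit message does not specify how the actual kernel partitions tiles.

```python
def launch_grid_size(occupancy_blocks, tile_count):
    """Grid size is the smaller of resident-block capacity and tile count."""
    return min(occupancy_blocks, tile_count)


def tile_range_for_block(block_id, grid_size, tile_count):
    """Return the half-open [start, end) tile range one block loops over.

    Assumed scheme: contiguous ranges whose sizes differ by at most one,
    with the first (tile_count % grid_size) blocks taking the extra tile.
    """
    base, extra = divmod(tile_count, grid_size)
    start = block_id * base + min(block_id, extra)
    end = start + base + (1 if block_id < extra else 0)
    return start, end


def distribute_tiles(occupancy_blocks, tile_count):
    """List the tile range for every launched block."""
    grid_size = launch_grid_size(occupancy_blocks, tile_count)
    return [
        tile_range_for_block(block_id, grid_size, tile_count)
        for block_id in range(grid_size)
    ]
```

With more tiles than resident blocks, each block loops over several tiles; with fewer tiles than blocks, the grid shrinks so that no block is launched without work.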

31 Aug, 2023 1 commit

Grouped Gemm with Fixed K and N with SplitK (#818) · f5ec04f0

zjing14 authored Aug 31, 2023



* move all arguments into device

* add b2c_tile_map

* add examples

* add SetDeviceKernelArgs

* dedicated fixed_nk solution

* init client api

* add grouped_gemm_bias example

* add an instance

* add instances

* formatting

* fixed cmake

* Update EnableCompilerWarnings.cmake

* Update cmake-ck-dev.sh

* clean; fixed comments

* fixed comment

* add instances for fp32 output

* add instances for fp32 output

* add fp32 out client example

* fixed CI

* init commit for kbatch

* add splitk gridwise

* format

* fixed

* clean deviceop

* clean code

* finish splitk

* fixed instances

* change m_loops to tile_loops

* add setkbatch

* clean code

* add splitK+bias

* add instances

* opt mk_nk instances

* clean examples

* fixed CI

* remove zero

* finished non-zero

* clean

* clean code

* optimized global_barrier

* fixed ci

* fixed CI

* removed AddBias

* format

* fixed CI

* fixed CI

* move 20_grouped_gemm to 21_grouped_gemm

---------
Co-authored-by: Jing Zhang <jizha@amd.com>

f5ec04f0

23 Aug, 2023 1 commit

[HotFix] add config and version files to pass on build info (#856) · c8a8385f

Jun Liu authored Aug 23, 2023

* experiment with config file

* experiment with version.h config

* add more info to version.h

* minor updates

* minor updates

* fix case where DTYPE is not used

* large amount of files but minor changes

* remove white space

* minor changes to add more MACROs

* fix cmakedefine01

* fix issue with CK internal conflict

* fix define and define value

* fix clang-format

* fix formatting issue

* experiment with cmake

* clang format v12 to be consistent with miopen

* avoid clang-format for config file

c8a8385f

03 Aug, 2023 2 commits
- Fp16AInt8B_GEMM sanity · b5083bfe
  aska-0096 authored Aug 03, 2023
  
  b5083bfe
- debug code enabled · 5cf73a5e
  aska-0096 authored Aug 03, 2023
  
  5cf73a5e
27 Jul, 2023 1 commit

Add s_nops after v_dot to avoid hazard (#808) · 7761e523

Bartłomiej Kocot authored Jul 27, 2023

* Add s_nops after v_dot to avoid hazard

* Fix builtin for inner_product fp16

* Skip inline version to builtin

* Add comments regarding isa

* Fix comment regarding s_nop

7761e523

06 Jul, 2023 1 commit

Add basic setup for precommit (#749) (#764) · 237f9cd3

Adam Osewski authored Jul 06, 2023



* Add basic setup for precommit

* Update README.md with instructions on installing precommit hooks

---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Bartlomiej Wroblewski <bwroblewski10@gmail.com>

237f9cd3