1. 14 Oct, 2024 1 commit
  2. 10 Oct, 2024 2 commits
  3. 08 Oct, 2024 3 commits
    • Add a gpu gemm reference kernel (#1528) · aa932445
      Rostyslav Geyyer authored
      
      
      * Add a gpu gemm reference kernel (a minimal sketch follows this entry)
      
      * Switch to gpu reference in gemm examples
      
      * Remove redundant arguments
      
      * Update all related examples
      
      * Update more examples
      
      * Try fewer threads per block
      
      * Try even fewer threads per block
      
      * Add support for all matrix layouts
      
      * Increase block size
      
      * Clean up
      
      * Remove hardcoded strides
      
      * Clean up
      
      * Try a column-major case
      
      * Revert back to row-major
      
      * Run both CPU and GPU verification
      
      ---------
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
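
      A rough illustration of what such a GPU-side reference looks like — a naive one-thread-per-C-element kernel, assuming row-major layouts (the kernel name and launch shape here are illustrative, not the CK implementation):

      ```cpp
      #include <hip/hip_runtime.h>

      // Naive reference GEMM: C = A * B, row-major, one thread per C element.
      // The real reference kernel also covers the other matrix layouts and
      // strides mentioned in the commits above.
      __global__ void reference_gemm(const float* A, const float* B, float* C,
                                     int M, int N, int K)
      {
          const int n = blockIdx.x * blockDim.x + threadIdx.x;
          const int m = blockIdx.y * blockDim.y + threadIdx.y;
          if(m >= M || n >= N)
              return;

          float acc = 0.f;
          for(int k = 0; k < K; ++k)
              acc += A[m * K + k] * B[k * N + n]; // row-major inner product
          C[m * N + n] = acc;
      }

      // Launch sketch; the block size is the knob the "fewer threads per
      // block" commits experiment with:
      //   dim3 block(16, 16);
      //   dim3 grid((N + 15) / 16, (M + 15) / 16);
      //   reference_gemm<<<grid, block>>>(dA, dB, dC, M, N, K);
      ```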
    • [CK_TILE] Update example README files & fix script compatibility issue (#1548) · 0c094daa
      Po Yen Chen authored
      * Fix text alignment of ArgParser::print()
      
      * Update example README files
      
      * Clarify make-ck-dev.sh <arch> usage
      
      * Only keep some of the arguments from the '-?' output
      
      * Undo command line output changes in README
      
      * Only keep existing arguments in doc and update description
      
      * Fix text alignment
      
      * Make cmake-ck-*.sh compatible with 'sh' command
    • [CK_TILE] Simplify the code in splitkv_combine pipeline (#1549) · 74d68e3b
      Qianfeng authored
      
      
      * Simplify the code in splitkv_combine pipeline (the combine step is sketched after this entry)
      
      * Always set kPadSeqLenK=true for fmha splitkv kernels
      
      * Change Oacc alignment and TileDistribution to be more adaptable to tile sizes
      
      ---------
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
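
      For context, the combine step this pipeline implements merges per-split partial results: each split i produces a partial attention output o_acc[i] (softmax-normalized within the split) and its log-sum-exp lse[i]. A host-side sketch of the merge, with illustrative names rather than the CK_TILE API:

      ```cpp
      #include <algorithm>
      #include <cmath>
      #include <vector>

      // Combine per-split FMHA partial outputs by their relative softmax mass.
      std::vector<float> combine_splits(const std::vector<std::vector<float>>& o_acc,
                                        const std::vector<float>& lse)
      {
          const size_t num_splits = lse.size();
          const size_t hdim       = o_acc[0].size();

          float m = lse[0]; // max lse for numerical stability
          for(float l : lse)
              m = std::max(m, l);

          std::vector<float> o(hdim, 0.f);
          float sum = 0.f;
          for(size_t i = 0; i < num_splits; ++i)
          {
              const float w = std::exp(lse[i] - m); // weight of split i
              sum += w;
              for(size_t d = 0; d < hdim; ++d)
                  o[d] += w * o_acc[i][d];
          }
          for(size_t d = 0; d < hdim; ++d)
              o[d] /= sum; // renormalize across splits
          return o;
      }
      ```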
  4. 07 Oct, 2024 2 commits
  5. 04 Oct, 2024 1 commit
    • Adding seed and offset pointer support to the philox random number generator. (#1523) · c24fae23
      kylasa authored
      
      
      * Adding seed and offset pointer support to the philox random number generator.
      
      * Separating seed and offset pointer checks with different condition statements.
      
      * Changes include adding support for device seed and offset pointers; a union is used to store seed/offset values or device pointers to minimize device SGPR usage (a sketch of this pattern follows this entry)
      
      * Correcting a typo in the readme file
      
      * Re-format files using remod.py
      
      * Use STL type for API parameters
      
      * Use simpler struct design for drop_seed & drop_offset
      
      * Undo unnecessary changes
      
      * Sync kargs style for fmha_fwd.hpp/.cpp
      
      * Use templated union to reduce code
      
      * Use structured binding to make code more readable
      
      ---------
      Co-authored-by: Sudhir Kylasa <sukylasa@amd.com>
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
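
      A minimal sketch of the union pattern these commits describe, assuming illustrative type and member names (the real kargs differ): the same bytes hold either an immediate seed/offset value or a device pointer, so the kernel-argument footprint (and SGPR use) does not grow.

      ```cpp
      #include <cstdint>
      #include <utility>

      // Templated union: immediate value or device pointer in the same bytes.
      template <typename T>
      union ValueOrPointer
      {
          T        val; // immediate value passed from the host
          const T* ptr; // device pointer, dereferenced inside the kernel
      };

      struct DropoutKargs
      {
          ValueOrPointer<uint64_t> drop_seed;
          ValueOrPointer<uint64_t> drop_offset;
          bool is_device_ptr; // selects which union member is active
      };

      // Structured binding makes the call site readable, e.g.:
      //   auto [seed, offset] = get_seed_offset(kargs);
      inline std::pair<uint64_t, uint64_t> get_seed_offset(const DropoutKargs& kargs)
      {
          if(kargs.is_device_ptr)
              return {*kargs.drop_seed.ptr, *kargs.drop_offset.ptr}; // device-side read
          return {kargs.drop_seed.val, kargs.drop_offset.val};
      }
      ```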
  6. 01 Oct, 2024 2 commits
    • [CK_TILE] Change output accum tensor layout of fmha fwd split-kv & combine kernels (#1527) · a1c07e8d
      Po Yen Chen authored
      * Use same layout for o_acc and o tensor
      
      * Use better param names in partitioner
      
      * Remove redundant kargs 'max_seqlen_q'
      
      * Use better param names in splitkv kernel
      
      * Add comment for additional kernel arguments
      
      * Sync empty loop early return logics between pipelines
      
      * Pass more arguments to cmake in scripts
      
      * Align backslashes
      
      * Fix wrong o_acc tensor view strides
      
      * Change o_acc layout if o_perm=0
      
      * Handle whole row masked via attn_bias
      
      * Use vector width = 1 for o_acc
      
      * Use more even split sizes (sketched after this entry)
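
      A sketch of what "more even split sizes" means, under the assumption that splits partition seqlen_k (illustrative helper, not the kernel's partitioner): distribute the remainder one row at a time instead of leaving the last split nearly empty.

      ```cpp
      #include <utility>

      // Map split_idx to its [start, end) range of seqlen_k rows.
      std::pair<int, int> split_range(int seqlen_k, int num_splits, int split_idx)
      {
          const int base  = seqlen_k / num_splits; // every split gets at least this
          const int rem   = seqlen_k % num_splits; // first 'rem' splits get one extra
          const int extra = split_idx < rem ? split_idx : rem;
          const int start = split_idx * base + extra;
          const int len   = base + (split_idx < rem ? 1 : 0);
          return {start, start + len};
      }
      ```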
    • Complex Contraction CK Bilinear Example (#1061) · 4cd1dc7f
      M.Emin Ozturk authored
      
      
      * complex type contraction (the real-arithmetic decomposition is sketched after this entry)
      
      * bug fix
      
      * update
      
      * Tensor Contraction Complex Data Type is working
      
      * 4D Kernel
      
      * some change
      
      * validation check in progress
      
      * validation issue
      
      * fp32 verification error is fixed
      
      * fp32 and fp64 are done
      
      * remove old files
      
      * remove cmake files
      
      * remove cmake files
      
      * Readme
      
      * img verification
      
      * CMakeList
      
      * number changed
      
      ---------
      Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
      Co-authored-by: Emin Ozturk <emin.ozturk@utah.edu>
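
      The underlying identity, for reference: a complex bilinear contraction E = alpha*(A x B) + beta*D reduces to real arithmetic via (a+bi)(c+di) = (ac - bd) + (ad + bc)i. A host sketch with plain std::complex (the CK example maps the same identity onto real contraction kernels; names are illustrative):

      ```cpp
      #include <complex>
      #include <vector>

      using cf = std::complex<float>;

      // Complex bilinear GEMM as a stand-in for the 4-D contraction:
      // E = alpha * (A x B) + beta * E, row-major A (MxK) and B (KxN).
      void complex_bilinear_gemm(int M, int N, int K, cf alpha,
                                 const std::vector<cf>& A, const std::vector<cf>& B,
                                 cf beta, std::vector<cf>& E)
      {
          for(int m = 0; m < M; ++m)
              for(int n = 0; n < N; ++n)
              {
                  cf acc{0.f, 0.f};
                  for(int k = 0; k < K; ++k)
                      acc += A[m * K + k] * B[k * N + n];
                  E[m * N + n] = alpha * acc + beta * E[m * N + n]; // bilinear update
              }
      }
      ```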
  7. 27 Sep, 2024 1 commit
  8. 26 Sep, 2024 1 commit
  9. 18 Sep, 2024 1 commit
  10. 17 Sep, 2024 1 commit
    • Customize filesystem in CK for legacy systems (#1509) · ded0d83d
      Jun Liu authored
      
      
      * Legacy support: customized filesystem (a minimal stand-in is sketched after this entry)
      
      * Update cmakefile for python alternative path
      
      * fix build issues
      
      * CK has no boost dependency
      
      * More fixes to issues found on legacy systems
      
      * fix clang format issue
      
      * Check if blob is correctly generated in cmake
      
      * fix the python issues
      
      * add a compiler flag for codegen when using alternative python
      
      * use target_link_options instead of target_compile_options
      
      ---------
      Co-authored-by: illsilin <Illia.Silin@amd.com>
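
      A minimal sketch of the kind of std::filesystem stand-in this PR introduces, assuming only the path operations CK needs (the actual CK header is more complete):

      ```cpp
      #include <string>

      // Tiny path type for toolchains without a usable <filesystem>.
      namespace ck_fs {

      class path
      {
          std::string str_;

      public:
          path() = default;
          path(std::string s) : str_(std::move(s)) {}

          // Join with a separator, mimicking std::filesystem::path::operator/.
          path operator/(const path& rhs) const
          {
              if(str_.empty())
                  return rhs;
              if(str_.back() == '/')
                  return path(str_ + rhs.str_);
              return path(str_ + "/" + rhs.str_);
          }

          const char* c_str() const { return str_.c_str(); }
          const std::string& string() const { return str_; }
      };

      } // namespace ck_fs
      ```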
  11. 14 Sep, 2024 1 commit
  12. 13 Sep, 2024 1 commit
    • Customize filesystem in CK for legacy systems (#1509) · 81bc1496
      Jun Liu authored
      
      
      * Legacy support: customized filesystem
      
      * Update cmakefile for python alternative path
      
      * fix build issues
      
      * CK has no boost dependency
      
      * More fixes to issues found on legacy systems
      
      * fix clang format issue
      
      * Check if blob is correctly generated in cmake
      
      * fix the python issues
      
      * add a compiler flag for codegen when using alternative python
      
      * use target_link_options instead of target_compile_options
      
      ---------
      Co-authored-by: illsilin <Illia.Silin@amd.com>
  13. 09 Sep, 2024 1 commit
  14. 07 Sep, 2024 1 commit
    • Ck tile gemm example (#1488) · caacd388
      Thomas Ning authored
      
      
      * Checkpoint: Finished with the tile example & kernel verification, working on the different matrix layout
      
      * Finished the Matrix Layout feature set up. Note: Need to modify the inner block to solve the shuffle problem in the future.
      
      * Fix: Clang Format, API fixed from fmha
      
      * fix with better naming convention
      
      * revert back the pipeline code of fmha
      
      * Fixed: Addressed the comments and merged the GEMM shape of the GEMM and FMHA operators into one.
      
      * clang format with the reference_gemm file
      
      * convert the clang format with the remod.py
      
      * Changed the format and variable names of the kernel gemm_shape and partitioner (a tile-partitioner sketch follows this entry)
      
      ---------
      Co-authored-by: thomasning <thomasning@banff-cyxtera-s70-4.ctr.dcgpu>
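
      A sketch of the partitioner role mentioned above, with assumed names: map a linear workgroup id to the (m, n) origin of the C tile it owns. The real CK_TILE partitioner adds swizzling for cache locality.

      ```cpp
      // 2-D GEMM tile partitioner: linear block id -> output tile coordinates.
      struct GemmTilePartitioner
      {
          int M, N;                 // problem size
          int MPerBlock, NPerBlock; // C tile size per workgroup

          // number of workgroups needed to cover the C matrix
          int grid_size() const
          {
              const int m_blocks = (M + MPerBlock - 1) / MPerBlock;
              const int n_blocks = (N + NPerBlock - 1) / NPerBlock;
              return m_blocks * n_blocks;
          }

          // block_id -> top-left corner of this workgroup's C tile
          void tile_origin(int block_id, int& m0, int& n0) const
          {
              const int n_blocks = (N + NPerBlock - 1) / NPerBlock;
              m0 = (block_id / n_blocks) * MPerBlock;
              n0 = (block_id % n_blocks) * NPerBlock;
          }
      };
      ```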
  15. 05 Sep, 2024 1 commit
  16. 29 Aug, 2024 1 commit
  17. 28 Aug, 2024 1 commit
    • [CK_TILE] Add PagedAttention kernels (#1387) · c1569892
      Po Yen Chen authored
      
      
      * Use dictionary to configure all the functions
      
      * Add init codegen logic for fmha fwd appendkv
      
      * Call HIP_CHECK_ERROR() macro to get real source info
      
      * Set up meaningful arguments
      
      * Sync kernel name with the codegen
      
      * Add knew/vnew tensors to the kernel argument
      
      * Fix wrong K values after appending
      
      * Fix vnew append error
      
      * Extract common logics
      
      * Fix Vnew tile dstr for row major case
      
      * Conditionally add fwd_splitkv API in fmha_fwd example
      
      * Conditionally add call to fmha_fwd_splitkv()
      
      * Remove "EXAMPLE_" prefix of cmake variables
      
      * Register API handlers automatically
      
      * Early return if 0 < s_k_new is not supported
      
      * Show message if we are ignoring option
      
      * Unify CMakeLists.txt coding style
      
      * Set num_splits=1 if split-kv is not supported
      
      * Add length/stride getters for HostTensor
      
      * Add RoPE example utilities
      
      * Add reference_rotary_position_embedding() (not implemented)
      
      * Finish reference_rotary_position_embedding() impl
      
      * Fix typo of HostTensor<>::get_length()
      
      * Fix compilation errors
      
      * Fix wrong answer when interleaved=false
      
      * Fix wrong answer when interleaved=true
      
      * Append K/V in the host verification code
      
      * Simplify K appending logics
      
      * Simplify v_host_ref definition
      
      * Reduce input/output dimensions
      
      * Rename function: add "batched" prefix
      
      * Apply RoPE on host side
      
      * Rename RoPE utility function
      
      * Fix wrong tensor size
      
      * Avoid invoking deprecated method 'find_module'
      
      * Pass RoPE kernel args
      
      * Create Rotary Cos/Sin tile windows in kernel
      
      * Add compute data type alias for RoPE
      
      * Randomly generate seqlen_knew if needed
      
      * Fix seqlen_knew enabling check logic
      
      * Add minimum seqlen_k to generate compliance kvcache
      
      * Fix compilation error in debug mode
      
      * Fix wrong boundaries
      
      * Fix wrong seqlen_k for kvcache
      
      * Rename variables used in distribution encoding
      
      * Fix rotary cos/sin tensor/tile size
      
      * Add constraint to the rotary_dim option
      
      * Remove unused inner namespace
      
      * Add dram distribution for rotary_cos/rotary_sin (interleaved)
      
      * Only apply interleaved RoPE on Knew for now
      
      * Fix wrong thread starting offset
      
      * Instantiate multiple kernels for RoPE approaches
      
      * Clean-up pipeline
      
      * Fix error in RoPE host reference
      
      * Handle RoPE half-rotated logics
      
      * Support 8x rotary_dim under half-rotated RoPE
      
      * Add comment
      
      * Apply elementwise function to the loaded tiles
      
      * Unify parameter/variable naming style
      
      * Remove constness from q_ptr
      
      * Add code blocks for q_tile
      
      * Apply RoPE to q_tile
      
      * Remove debug print code in kernel
      
      * Fix wrong knew/vnew appending positions
      
      * Use better naming for tile indices
      
      * Add make_tile_window() for adding distribution only
      
      * Skip code if # of block is more than needed
      
      * Move thread locating logics into policy
      
      * Remove always true static_assert()
      
      * Rename header
      
      * Rename RotaryEmbeddingEnum
      
      * Extract rotary embedding logic out
      
      * Re-order parameters
      
      * Align naming of some tile size constants
      
      * Rename more tile size constants
      
      * Fix wrong grid size
      
      * Fix wrong shape of knew_host/vnew_host
      
      * Fix wrong index into knew_host/vnew_host
      
      * Fix wrong rotary_cos/rotary_sin memory size for Q
      
      * Extract Q/Knew vector size to helper methods
      
      * Use different rotary_cos/rotary_sin distr for Q/Knew
      
      * Update host/device specifiers
      
      * Fix wrong data type for Q rotary_cos/rotary_sin
      
      * Remove RoPEComputeDataType type alias
      
      * Shift rotary_cos/rotary_sin by cache_seqlen_k
      
      * Add comment on why 't' is used for all padding flags
      
      * Align commit message to the real comment
      
      * Fix wrong pipeline
      
      * Rename utility function
      
      * Disable host verification if API not exist
      
      * Fix wrong rope key for fp8 pipeline
      
      * Allow only apply RoPE on Q (without append KV)
      
      * Add append-kv smoke tests
      
      * Remove debug statements
      
      * Remove more debug statements
      
      * Re-arrange the 'set +x' command
      
      * Remove no-longer used method in pipeline
      
      * Add missing init code
      
      * Refine pipeline padding settings
      
      * Enlarge rotary_dim limit (8 -> 16)
      
      * Enlarge KPerThread for rotary_interleaved=false
      
      * Update rotary_dim range in smoke_test_fwd.sh
      
      * Add template argument 'kIsPagedKV' for splitkv kernels
      
      * Launch splitkv kernel if given page_block_size
      
      * Fix wrong kernel name
      
      * Fix seqlen_k_min for pre-fill case (1 -> 0)
      
      * Add copy_const<> type trait
      
      * Add another make_tile_window()
      
      * Introduce 'TileWindowNavigator' types (the page-block lookup involved is sketched after this entry)
      
      * Simplify TileWindowNavigator interfaces
      
      * Fix tile window navigation bugs
      
      * Disable calling fmha_fwd()
      
      * Remove ununnecessary data members
      
      * Simplify more make_tile_window() overloads
      
      * Move V tile through TileWindowNavigator
      
      * Fix uneven split checking logic
      
      * Move code after decide seqlen_q/seqlen_k
      
      * Make sure we always start reading complete tile
      
      * Use 128 as minimum page_block_size
      
      * Fix wrong origin for bias
      
      * Add batch_stride_k/batch_stride_v in group mode
      
      * Unify origin
      
      * Add missing kernel arguments for group mode
      
      * Add paged-kv codegen logic for appendkv kernels
      
      * Add block_table kernel args for appendkv kernel
      
      * Add tile navigators to the appendkv kernel
      
      * Fix wrong tensor descriptor lengths
      
      * Pass re-created tile window to pipeline
      
      * Fix wrong strides for appendkv kernel
      
      * Allow transit tile_window to another page-block
      
      * Handle cross-page-block write
      
      * Do not perform write again if already in last page-block
      
      * Always add fmha_fwd() api
      
      * Add missing group mode argument
      
      * Remove debug macro usages
      
      * Rename option s_k_new to s_knew
      
      * Separate splitkv/non-splitkv args/traits
      
      * Remove fmha_fwd_dispatch()
      
      * Fix compilation errors
      
      * Remove dropout code in splitkv kernel
      
      * Allow problem types without defining kHasDropout attr
      
      * Use generic lambda to init traits objects
      
      * Separate more non-splitkv & splitkv traits/args
      
      * Display more info for specific kernels
      
      * Show more detailed warning message
      
      * Rename 'max_num_blocks' to 'max_num_page_blocks'
      
      * Remove no-longer used pipeline files
      
      * Wrap code by #if directives
      
      * Move functors to the beginning of validation code
      
      * Use generic lambda to init all the api traits/args
      
      * Fix wrong seqlen for kvcache
      
      * Add missing comment
      
      * Rename TileWindowNavigator to PageBlockNavigator
      
      * Only expose necessary methods (not attributes)
      
      * Re-order pipeline parameters
      
      * Refine smoke_test_fwd.sh
      
      * Fix wrong argument count
      
      * Make tile window directly via PageBlockNavigator
      
      * Remove unused template parameter
      
      * Remove group mode from appendkv kernel
      
      * Fix skcheck logic
      
      * Fix wrong syntax in skcheck expr
      
      * Use meaningful options in smoke test
      
      * Remove options
      
      * Fix formatting
      
      * Fix more format
      
      * Re-organize bash functions
      
      * Pass cache_batch_idx to kernels
      
      * Support cache_batch_idx in example
      
      * Fix compilation error
      
      * Add more appendkv test
      
      * Add more case for appendkv
      
      * Fix nonexistent attribute
      
      * Remove 0 < seqlen_knew constraint
      
      * Clarify the case in warning message
      
      * Remove macro checking
      
      * Force batch mode when invoking appendkv & splitkv apis
      
      * Fix mode overriding logics
      
      * Fix wrong parameter name
      
      * Randomize seqlen_k if use kvcache
      
      * Use randomized seqlen_k for kvcache
      
      * Avoid using too small rotary_cos & rotary_sin
      
      * Rename parameter
      
      * Add seqlen_q & seqlen_k rules
      
      * Add comment
      
      * Add more comments
      
      * Fix compilation errors
      
      * Fix typo in comment
      
      * Remove type argument
      
      * Avoid seqlen_k=0 for kvcache
      
      * Revert "Avoid seqlen_k=0 for kvcache"
      
      This reverts commit 21c4df89e416182e8e9bc78e67bd4b98dbb6c88d.
      
      * Fix wrong uneven split checking logics
      
      * Only randomize kvcache seqlen_k if 1 < batch
      
      * Return earlier if split is empty
      
      * Revert "Only randomize kvcache seqlen_k if 1 < batch"
      
      This reverts commit b9a4ab0d7e3c2beecc0fccafd2a13259dd06299c.
      
      * Re-order seqlen_k_start adjustment logics
      
      * Fix compilation errors
      
      * Re-format script
      
      * Find executable from folder automatically
      
      * Fix kvcache seqlen_k generating logic
      
      * Make comment more clear
      
      * Fix wrong knew/vnew appending logic on host
      
      * Add s_barrier to sync threads
      
      * Revert "Add s_barrier to sync threads"
      
      This reverts commit d3f550f30c0a4d9df15c613015d5dff268d6746d.
      
      * Support only using 1 row of rotary_cos/rotary_sin
      
      * Rotate Q in different way
      
      * Unify tensor view creation logics
      
      * Fix wrong argument
      
      * Add mask to switch how we use the rotary_cos/sin
      
      * Move attr from traits to problem
      
      * Move has_mask to fmha_fwd_appendkv_args
      
      * Support use uint32_t as SAD operand in Alibi<>
      
      * Use sad_u32() in splitkv kernels
      
      * Store tensor views in PageBlockNavigator
      
      * Use stored tensor view to update tile windows
      
      * Enlarge tensor view size
      
      * Remove debug code
      
      * Fix wrong tensor view size
      
      * Wrap tensor view into PageBlockNavigator
      
      * Add DataType member to PageBlockNavigator
      
      * Remove unnecessary member functions
      
      * Refine macro use
      
      * Fix typo
      
      * Add blank line between directives and actual code
      
      * Re-format files
      
      * Remove type in comment
      
      ---------
      Co-authored-by: carlushuang <carlus.huang@amd.com>
      Co-authored-by: rocking <ChunYu.Lai@amd.com>
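
      A sketch of the role a PageBlockNavigator-style helper plays, with assumed names and interface: translate a logical seqlen_k position into a physical (page-block, in-block offset) pair through the per-batch block table, and detect when a moving tile window crosses into the next page block (the case where the tile window must be re-created).

      ```cpp
      #include <cstdint>

      // Paged KV-cache lookup: logical row -> (physical block, row in block).
      struct PageBlockNavigator
      {
          const int32_t* block_table;     // physical block ids for one batch entry
          int            page_block_size; // K/V rows stored per page block

          void locate(int logical_row, int& block_id, int& row_in_block) const
          {
              block_id     = block_table[logical_row / page_block_size];
              row_in_block = logical_row % page_block_size;
          }

          // true if advancing a tile window of 'tile_rows' rows starting at
          // 'row' spills into the next page block
          bool crosses_block(int row, int tile_rows) const
          {
              return (row / page_block_size) !=
                     ((row + tile_rows - 1) / page_block_size);
          }
      };
      ```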
  18. 21 Aug, 2024 2 commits
    • Adding Instances and Examples for FP8-based Scaled Convolution and AMAX Reduction. (#1473) · c3515f27
      Andriy Roshchenko authored
      * Enable CMakePresets build
      
      * Verify Convolution, Scaling and ReLU algorithms.
      
      * Add tensor element-wise scale and type cast operation.
      
      * Reduction implemented but does not work.
      
      * Exploration of Reduction functionality.
      
      * Completed example for Convolution scaled with ReLU activation and AMAX reduction.
      
      * WIP: Add required instances for convolution.
      
      * WIP: Create client example. Implement convolution stage.
      
      * Add elementwise instances.
      
      * Add elementwise scale + convert example.
      
      * Add reduction instances.
      
      * WIP: Client example for AMAX reduction.
      
      * WIP: Add instances for multistage reduction.
      
      * WIP: Implementation of multistage reduction.
      
      * Refactoring.
      
      * Clean up.
      
      * Add CMakePresets.json
      
      * Guard off FP8 instances when the data type is not available.
      
      * Add example for Scaled FP8 Convolution with AMAX reduction (the AMAX/scale computation is sketched after this entry).
      
      * Refactor CombConvScaleRelu instances.
      
      * Add CombConvScale instances.
      
      * Add client example for Scaled FP8 Convolution with AMAX reduction.
      
      * Cleanup.
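
      For reference, the AMAX-reduction-plus-scaling arithmetic these examples exercise, as a host sketch (448 is the e4m3 maximum finite value; names are illustrative, not the CK instance API):

      ```cpp
      #include <cmath>
      #include <cstddef>

      // AMAX reduction: maximum absolute value of a tensor.
      float amax_reduce(const float* data, size_t n)
      {
          float amax = 0.f;
          for(size_t i = 0; i < n; ++i)
              amax = std::fmax(amax, std::fabs(data[i]));
          return amax;
      }

      // Derive the multiplier that maps the observed range onto FP8 e4m3.
      float fp8_scale_from_amax(float amax)
      {
          constexpr float fp8_e4m3_max = 448.f;
          return amax > 0.f ? fp8_e4m3_max / amax : 1.f; // apply before the f8 cast
      }
      ```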
    • Set RNE fp8 conversion as a default (#1458) · e20f20ef
      Rostyslav Geyyer authored
      * Set RNE fp8 conversion as a default (round-to-nearest-even is demonstrated after this entry)
      
      * Update f8 tests
      
      * Disable failing test on gfx11
      
      * Update bf8 tests
      
      * Add a flag
      
      * Fix the flag
      
      * Raise flag for gfx10 as well
      
      * Temp commit for tolerance testing
      
      * Update tolerances
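
      RNE (round-to-nearest-even) breaks ties toward the even neighbor, which keeps rounding error unbiased; fp8 conversion applies the same rule to the discarded mantissa bits. A small demonstration with integer ties (illustrative only, not the CK conversion code):

      ```cpp
      #include <cfenv>
      #include <cmath>
      #include <cstdio>

      int main()
      {
          std::fesetround(FE_TONEAREST); // nearbyint() now rounds ties to even
          const double ties[] = {0.5, 1.5, 2.5, 3.5};
          for(double t : ties)
              std::printf("%.1f -> %.0f\n", t, std::nearbyint(t)); // 0, 2, 2, 4
          return 0;
      }
      ```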
  19. 20 Aug, 2024 1 commit
    • Adding Instances and Examples for FP8-based Scaled Convolution with ReLU Activation and AMAX Reduction. (#1469) · a94113a9
      Andriy Roshchenko authored
      
      * Enable CMakePresets build
      
      * Verify Convolution, Scaling and ReLU algorithms.
      
      * Add tensor element-wise scale and type cast operation.
      
      * Reduction implemented but does not work.
      
      * Exploration of Reduction functionality.
      
      * Completed example for Convolution scaled with ReLU activation and AMAX reduction.
      
      * WIP: Add required instances for convolution.
      
      * WIP: Create client example. Implement convolution stage.
      
      * Add elementwise instances.
      
      * Add elementwise scale + convert example.
      
      * Add reduction instances.
      
      * WIP: Client example for AMAX reduction.
      
      * WIP: Add instances for multistage reduction.
      
      * WIP: Implementation of multistage reduction.
      
      * Refactoring.
      
      * Clean up.
      
      * Guard off FP8 instances when the data type is not available.
      
      * Improve output readability.
      
      * Addressing reviewer's comments.
  20. 16 Aug, 2024 1 commit
    • [CK_TILE] FA bwd kernels optimization (#1397) · 79a5d9c1
      Dan Yao authored
      
      
      * tmp save
      
      * fix batch deterministic bugs
      
      * fix group deterministic bugs
      
      * codegen update
      
      * reorder files
      
      * bias support
      
      * hd256 bias support
      
      * bwd smoke test update
      
      * simplify convert dq (the deterministic dq reduction is sketched after this entry)
      
      * fix hd256 dropout scratch
      
      * do{}while() -> while(){}
      
      * comments
      
      * remove FmhaBwdTilePartitioner
      
      * save clear_tile
      
      * refactor dropout
      
      * code cleanup
      
      * code cleanup
      
      * comments
      
      * fix epilogue problem
      
      * fix fwd dropout
      
      * group convert_dq opt
      
      * fix dq alignment
      
      * Do not store storerandval in bwd for flash attention integration
      
      * fix hd32 error and boost performance
      
      * revert
      
      * Remove duplicated WarpGemm definitions in the policy file
      
      * dropout patch for mrepeat 16*16
      
      * code sync up
      
      * dq_acc stride
      
      * dq_acc stride stuff
      
      * codegen update
      
      * fwd dropout revert
      
      * fix hd128 scratches and boost performance
      
      * receipt 3 for simplified smoke test
      
      * more strides for fa integration
      
      * fix hd64 scratches and boost performance
      
      * non-iglp pipeline for headdim padding cases
      
      * dpad same as dvpad for flash attention integration
      
      * unpadded lse&d for group mode
      
      * Support unpad layout for group lse
      
      * Support unpad lse layout for splitkv
      
      * Fix stride for splitkv kernel
      
      * fix unpadded lse issue in fwd splitkv
      
      * comment
      
      * solve lds read&write conflicts
      
      * rename
      
      * bias rename
      
      * tile index revert
      
      ---------
      
      Co-authored-by: danyao12 <danyao12>
      Co-authored-by: rocking <ChunYu.Lai@amd.com>
      Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
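
      A sketch of the deterministic-dQ scheme several of these commits touch, under the assumption that each K/V block writes its partial dQ to a strided dq_acc workspace and a convert_dq pass reduces the splits in a fixed order (names and layout are illustrative, not the kernel's):

      ```cpp
      #include <cstddef>

      // Reduce per-k-block partial dQ in a fixed order, so the result does not
      // depend on atomic scheduling (the source of non-determinism it replaces).
      void convert_dq(const float* dq_acc, // [num_k_blocks][seqlen_q * hdim] slices
                      float* dq,           // [seqlen_q * hdim]
                      size_t num_k_blocks, size_t seqlen_q, size_t hdim,
                      size_t split_stride) // stride between k-block slices
      {
          for(size_t q = 0; q < seqlen_q; ++q)
              for(size_t d = 0; d < hdim; ++d)
              {
                  float acc = 0.f;
                  for(size_t kb = 0; kb < num_k_blocks; ++kb) // fixed summation order
                      acc += dq_acc[kb * split_stride + q * hdim + d];
                  dq[q * hdim + d] = acc; // cast to fp16/bf16 in the real kernel
              }
      }
      ```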
  21. 14 Aug, 2024 1 commit
    • [GEMM] gemm_universal related optimization (#1453) · 3049b546
      Haocong WANG authored
      
      
      * replace buffer_atomic with global_atomic
      
      * fixed global_atomic_add
      
      * added bf16 atomic_add (a CAS-based sketch follows this entry)
      
      * format
      
      * clang-format-12
      
      * clean
      
      * clean
      
      * add guards
      
      * Update gtest.cmake
      
      * enabled splitk_gemm_multi_d
      
      * format
      
      * add ckProfiler
      
      * format
      
      * fixed naming
      
      * format
      
      * clean
      
      * clean
      
      * add guards
      
      * fix clang format
      
      * format
      
      * add kbatch printout
      
      * clean
      
      * Add rocm6.2 related gemm optimization
      
      * Limit bf16 atomic usage
      
      * remove redundant RCR gemm_universal instance
      
      * Add RRR fp8 gemm universal instance
      
      * Bug fix
      
      * Add GPU_TARGET guard to FP8/BF8 target
      
      * bug fix
      
      * update cmake
      
      * remove all fp8/bf8 example if arch not support
      
      * Enable fp8 RRR support in ckProfiler
      
      * limit greedy-reverse flag to gemm_universal in ckProfiler
      
      ---------
      Co-authored-by: Jing Zhang <jizhan@fb.com>
      Co-authored-by: Jing Zhang <jizhan@meta.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
      Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
      Co-authored-by: illsilin <Illia.Silin@amd.com>
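
      A CAS-based sketch of a bf16 atomic add of the kind this PR enables, for split-k accumulation on architectures without a native bf16 atomic. This version truncates on the way back to bf16 for brevity (a real instance would round) and is a pattern illustration, not CK code:

      ```cpp
      #include <hip/hip_runtime.h>
      #include <cstdint>

      __device__ inline float bf16_to_float(uint16_t h)
      {
          return __uint_as_float(uint32_t(h) << 16); // bf16 = top half of a float
      }

      __device__ inline uint16_t float_to_bf16_trunc(float f)
      {
          return uint16_t(__float_as_uint(f) >> 16); // truncation, for brevity
      }

      // Atomic bf16 add via 32-bit compare-and-swap on the containing word;
      // works for either half of the word the element lives in.
      __device__ void atomic_add_bf16(uint16_t* addr, float val)
      {
          auto* word = reinterpret_cast<uint32_t*>(
              reinterpret_cast<uintptr_t>(addr) & ~uintptr_t(3));
          const bool hi = (reinterpret_cast<uintptr_t>(addr) & 2) != 0;

          uint32_t old = *word, assumed;
          do
          {
              assumed = old;
              const uint16_t cur = hi ? uint16_t(assumed >> 16)
                                      : uint16_t(assumed & 0xffff);
              const uint16_t sum = float_to_bf16_trunc(bf16_to_float(cur) + val);
              const uint32_t newval =
                  hi ? (assumed & 0x0000ffffu) | (uint32_t(sum) << 16)
                     : (assumed & 0xffff0000u) | sum;
              old = atomicCAS(word, assumed, newval); // retry if someone raced us
          } while(old != assumed);
      }
      ```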
  22. 13 Aug, 2024 1 commit
  23. 12 Aug, 2024 1 commit
  24. 07 Aug, 2024 1 commit
    • Run CK_TILE FMHA benchmarks and collect the performance data. (#1447) · 12c1f68d
      Illia Silin authored
      * run ck_tile benchmarks after the smoke tests and store logs
      
      * change the path of fmha benchmark logs
      
      * change the way of stashing ck_tile fmha logs
      
      * prevent the errors in stages where no logs are generated
      
      * fix the ck_tile fmha log names and headers
      
      * generate the fmha performance logs in the root folder
      
      * change jenkins script arguments format
      
      * use exact file names for stashing
      
      * modify scripts to process FMHA performance results
      
      * unstash FMHA logs before parsing them
  25. 06 Aug, 2024 2 commits
    • Fix for beta!=0 in reduce (#1440) · b74d4d4d
      jakpiase authored
      * fix for beta!=0 in reduce (the required epilogue blend is sketched after this entry)
      
      * add reviewers' suggestions
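
      For context, a beta != 0 reduction epilogue must blend with the prior output, y = alpha * reduce(x) + beta * y_old, which means reading y before overwriting it. A host sketch (illustrative, not the fixed kernel):

      ```cpp
      #include <vector>

      // Row-wise sum reduction with alpha/beta blending of the old output.
      void reduce_rows(const std::vector<float>& x, std::vector<float>& y,
                       int rows, int cols, float alpha, float beta)
      {
          for(int r = 0; r < rows; ++r)
          {
              float acc = 0.f;
              for(int c = 0; c < cols; ++c)
                  acc += x[r * cols + c];
              // the bug class fixed here: with beta != 0, the old y must
              // participate instead of being discarded
              y[r] = alpha * acc + beta * y[r];
          }
      }
      ```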
    • Add Grouped Conv Fwd Large Tensor kernel (#1432) · 4ec5c52a
      Bartłomiej Kocot authored
      * Support 64 bit indexing
      
      * Add new grouped conv fwd kernel for large tensors
      
      * Add instances large tensor
      
      * Fixes for transform conv to gemm
      
      * Fixes
      
      * fixes
      
      * Remove unneeded instances
      
      * examples fixes
      
      * Remove unneeded ds arrays
      
      * Fix tests
      
      * Add 2GB check in gridwise dl (the check is sketched after this entry)
      
      * Fixes
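
      The 2GB check guards kernels that index buffers with 32-bit offsets: such kernels can only address up to 2^31 - 1 bytes per buffer, so larger tensors must take the large-tensor (64-bit indexing) path. A sketch of the guard, with assumed names:

      ```cpp
      #include <cstdint>
      #include <limits>

      // True if a tensor exceeds what 32-bit byte offsets can address.
      bool needs_large_tensor_kernel(int64_t num_elements, int64_t bytes_per_element)
      {
          const int64_t bytes = num_elements * bytes_per_element;
          return bytes > int64_t(std::numeric_limits<int32_t>::max());
      }
      ```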
  26. 31 Jul, 2024 1 commit
  27. 24 Jul, 2024 1 commit
    • Adding more instances of grouped convolution 3d forward for FP8 with ConvScale+Bias element-wise operation. (#1412) · 4a8a1bef
      Andriy Roshchenko authored
      
      * Add CMakePresets configurations.
      
      * Add binary elementwise ConvScaleAdd and an example (a plausible host-side reading is sketched after this entry).
      
      * Numerical verification of results.
      
      Observed significant irregularities in F8 to F32 type conversions:
      ```log
      ConvScaleAdd: float=145.000000   f8_t=160.000000    e=144.000000
      ConvScaleAdd: float=97.000000   f8_t=96.000000    e=104.000000
      ConvScaleAdd: float=65.000000   f8_t=64.000000    e=72.000000
      ```
      
      * Implemented ConvScaleAdd + Example.
      
      * Add ConvScale+Bias Instances
      
      * Add Client Example for ConvScale+Bias
      
      * Fix number of bytes in an example.
      
      * Cleanup.
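
      One plausible host-side reading of the ConvScaleAdd arithmetic named above — the actual CK functor may differ: scale the convolution output and the second operand independently, add, and saturate to the FP8-representable range before the f8 cast (the limited f8 range is what the quoted log is probing):

      ```cpp
      #include <algorithm>

      // Element-wise epilogue: e = conv_out * scale_conv + bias * scale_bias,
      // clamped to the e4m3 finite range (+/-448) ahead of the f8_t cast.
      float conv_scale_add(float conv_out, float bias,
                           float scale_conv, float scale_bias)
      {
          const float e = conv_out * scale_conv + bias * scale_bias;
          return std::clamp(e, -448.f, 448.f);
      }
      ```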
  28. 19 Jul, 2024 2 commits
    • [GEMM] F8 GEMM, performance optimized. (#1384) · 8c90f25b
      Haocong WANG authored
      
      
      * add ab_scale init support
      
      * enabled interwave
      
      * add scale type; update isSupport
      
      * adjust example
      
      * clean
      
      * enable f8 pure gemm rcr ckprofiler
      
      * Add gemm_multiply_multiply instances
      
      * clang format
      
      * Optimize for ScaleBlockMNK=128 (block-scaled semantics are sketched after this entry)
      
      * enable abscale f8 gemm ck profiler
      
      * Add pure f8 gemm test suite
      
      * Reverting to the state of project at f60fd77
      
      * update copyright
      
      * clang format
      
      * update copyright
      
      ---------
      Co-authored-by: root <jizhan@amd.com>
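
      A host-reference sketch of the block-scale (ab_scale) semantics these commits optimize, assuming M, N, K divisible by ScaleBlockMNK and illustrative layouts (not the CK instance API): each ScaleBlockMNK x ScaleBlockMNK tile of A and B carries one dequantization scale.

      ```cpp
      #include <vector>

      // Block-scaled F8 GEMM reference; A/B hold f8 values widened to float.
      void gemm_ab_scale(int M, int N, int K, int SB /*ScaleBlockMNK, e.g. 128*/,
                         const std::vector<float>& A,       // row-major M x K
                         const std::vector<float>& a_scale, // [M/SB][K/SB]
                         const std::vector<float>& B,       // row-major K x N
                         const std::vector<float>& b_scale, // [K/SB][N/SB]
                         std::vector<float>& C)             // row-major M x N
      {
          const int ks = K / SB, ns = N / SB;
          for(int m = 0; m < M; ++m)
              for(int n = 0; n < N; ++n)
              {
                  float acc = 0.f;
                  for(int k = 0; k < K; ++k)
                  {
                      // pick the scales of the block this k-step falls in
                      const float sa = a_scale[(m / SB) * ks + k / SB];
                      const float sb = b_scale[(k / SB) * ns + n / SB];
                      acc += (A[m * K + k] * sa) * (B[k * N + n] * sb);
                  }
                  C[m * N + n] = acc;
              }
      }
      ```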
    • Universal gemm splitk using reduce (with multi-d) (#1341) · c544eb4d
      ltqin authored
      
      
      * init for reduce_threadwise multi_d
      
      * add reduce_threadwise_multi_d
      
      * add reduce_multi_d
      
      * clean
      
      * start adding another splitk device op
      
      * add reduce template parameter to SplitKBatchOffset
      
      * add reduce c matrix
      
      * clean up code
      
      * change example data type to bf16
      
      * add bf16Ai8B example
      
      * remove reduce template parameter
      
      * add splitk atomic status to v4
      
      * example add multi d parameters
      
      * device op add multi-d parameters
      
      * add multi-d to reduce (the split-k reduce epilogue is sketched after this entry)
      
      * fix kbatch=1 bug
      
      * change B layout to col in bf16Ai8B example
      
      * remove float adding struct
      
      * change  multi-d interface
      
      * change file and class name
      
      * remove multi-d of bf16Ai8B example
      
      * change IsReduce function to IsReduceAdd
      
      * change example layout to RRR from RCR
      
      * set ds stride according to layout
      
      * reset parameter layout
      
      * add gemm universal reduce instance
      
      * add reduce factory
      
      * add profile_gemm_universal_reduce
      
      * add reduce to profiler
      
      * fix reduce instance
      
      * fix profiler reduce compiling bug
      
      * format
      
      * format library instance code
      
      * add mem instance for reduce library
      
      * fix call instance names
      
      * add workspace for reduce in ckProfiler
      
      * format
      
      * add mnpadding to reduce library instance
      
      * add fp16 instance to reduce of profiler
      
      * change copyright time
      
      * restore profiler cmake file
      
      * add reduce text to instances
      
      * add DsLayout and DsDataType to instances template parameter
      
      * fixed gemm_reduce_multi_d
      
      * add an example without multi_d
      
      * Update common.hpp
      
      * Update gtest.cmake
      
      * Update gemm_xdl_splitk_reduce_bf16.cpp
      
      * clean
      
      * Update gtest.cmake
      
      * format
      
      * fix api
      
      * format
      
      * default parameter change to RRR
      
      * add vector_len for multi_d
      
      * format
      
      * Update gtest.cmake
      
      * fix bf16Ai8B elementwise op
      
      * add ReduceDataType
      
      * move ReduceDataType to end position
      
      * format
      
      * remove googletest git method address
      
      * fix copyright time
      
      * update init data
      
      ---------
      Co-authored-by: root <jizhan@amd.com>
      Co-authored-by: letaoqin <letaoqin@amd.com>
      Co-authored-by: Jing Zhang <jizhan@meta.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
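
      A sketch of the split-k-via-reduce scheme this PR builds, with assumed names: each k-batch writes its partial C into a workspace slice, then a reduce pass sums the slices and applies the multi-D element-wise epilogue (here a plain add of one D operand):

      ```cpp
      #include <vector>

      // Split-k reduce epilogue: sum per-k-batch partials, then apply multi-D.
      void splitk_reduce_multi_d(const std::vector<float>& workspace, // [kbatch][M*N]
                                 const std::vector<float>& d,         // [M*N] extra operand
                                 std::vector<float>& c,               // [M*N] output
                                 int kbatch, int mn)
      {
          for(int i = 0; i < mn; ++i)
          {
              float acc = 0.f;
              for(int kb = 0; kb < kbatch; ++kb)
                  acc += workspace[kb * mn + i]; // reduce partial results
              c[i] = acc + d[i]; // multi-D epilogue (here: plain add)
          }
      }
      ```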
  29. 16 Jul, 2024 1 commit
  30. 12 Jul, 2024 1 commit
  31. 11 Jul, 2024 2 commits