Commits · 2541812e515481e2072f714a40e5e439bf0154a4 · gaoqiong / composable_kernel_ROCM

"vscode:/vscode.git/clone" did not exist on "f7b613332c347f47a4f337b73ee9d66e6d62b51c"

14 May, 2024 2 commits
- Use different active wait scheme. · 2541812e
  Adam Osewski authored May 14, 2024
  
  2541812e
- Fix Coherency bits and for gmem ordering through compiler builtins. · cfbef574
  Adam Osewski authored May 14, 2024
  
  cfbef574
29 Apr, 2024 1 commit
- Add vector instruction coherency bits for gfx94 targets. · 0695ccd8
  Adam Osewski authored Apr 29, 2024
  
  0695ccd8
14 Apr, 2024 1 commit

[GEMM] Gemm universal device operation (#1154) · f83e9701

Haocong WANG authored Apr 14, 2024



* Optimize GEMM on MI200/300:
1. Add new blockwise gemm pipeline
2. Add irregular splitk intances

* clang format + typo fix

* Fix a bug

* initial commit

* Add more instances to irregular splitk

* blkgemm pipeline v1~4 prototype

* Sanity Checked. Known issue:
1. Poor performance of splitk
2. Register spill on blkgemmpipeline v3

* Sanity and Performance fix:
1. fix a bug related to sanity in grouped b2c mapping
2. fix a bug related to sanity and performance in splitk offset

* Sanity and API update:
1. Remove prefetch stage
2. Fix valid check bug
3, Add first gemm_universal instance into ckProfiler

* Add NN instances for gemm universal

* 1. Add NT instances for gemm_universal
2. Fix a bug about Kpadding in gemm_universal

* Fix a bug regarding padding Odd K number

* remove kernel print

* Fix KPadding bug...

* Update safety check

* another try to fix kpadding..

* Sanity checked

* new instances..

* clang format+typo fix

* remove clang format script's change

* Add non-hotloop compile option

* 1. Add fp16xfp8 example
2. pull packed convert f8 from pr1150

* Some miscs.. opt and fix

* Add pipeline description docs

* Split universal gemm instance library to cut profiler compiling time

* uncomment cmakefile

* Fix a bug caused by blockwise_gemm_pipe_v2

* reduce default splitk to 1

* Add 224x256x64 tile size

* update, including:
1. Experiment pipeline 5~7
2. Optimization for pipeline 4
3. Organized instance library

* temp save

* temp save

* Permuted lds layout, sanity and function checked

* clang format

* Move OOB check from RunRead to RunWrite, for better software pipeline.
TODO: agpr spill when NN layout

* clangformat

* A/B splitpipe scheduler for v3

* Fix two bugs

* bug fix

* fix a bug in oob check

* Example for mixed fp16_fp8 gemm

* Clean experimental code blocks

* Add mixed precision gemm into profiler

* tempsave

* optimize m/n major lds layout

* Add RRR GEMM  mixed precision instances

* Optimize f8 matrix transpose

* Add test_gemm_universal

* A/B spilt schedule for blkpip v5

* Take ds_read2 into iglp scheduling scheme

* format

* fixed cmake

* Add llvm-option into CI cmake flag

---------
Co-authored-by: Jing Zhang <jizhan@amd.com>

f83e9701

02 Apr, 2024 1 commit
- Introduce combined elementwise ops (#1217) · 9a194837
  Bartłomiej Kocot authored Apr 03, 2024
```
* Introduce combined elementwise ops

* Introduce refrence elementwise
```
  9a194837
20 Mar, 2024 1 commit
- Single flag per workgroup synchronization scheme. · 5b2eb1ab
  Adam Osewski authored Mar 20, 2024
  
  5b2eb1ab
09 Mar, 2024 1 commit

Navi3 rel (#1176) · 1837040a

zjing14 authored Mar 08, 2024



* wmma_op + unit test

* add arch limitation to wmma test

* change arch limitation

* Refactor + Add all type unit test(int4 compile failed)

* Add f32_16x16x16_bf16 unit test

* tempsave

* tempsave

* tempsave

* runtime bug, cannot find symbol

* workaround for incorrect HIP warpSize return value

* debugging

* tempsave

* Correctness OK, waiting for optimization

* Tidy up + format

* temp save

* temp save, reproduce the v_bfi_b32 issue

* add inline asm for wmmaop test

* tidy up

* clean some debug purpose code

* discard some codes

* clang format

* clang format

* compiler issue fixed + increase tile size

* navi3x_multipleD+example

* temp save

* workable

* batchedgemm[OK], groupconv[debug]

* groupconv: Sanity check[OK], Performance[Bad]

* navi3x_groupconv_need_optimization

* create necessary files

* save progress

* Add Inter-Row thread transfer

* save progress

* save debugging progress

* sanity check pass

* fix a host tensor bug and clean up flash-attn code

* format

* cancel unnecessary change

* cancel unnecessary change

* cancel unnecessary change

* temp save, add asm backend flag to amd_wmma

* Mat-A LDS Bypass sanity pass

* temp save

* gemm sanity fix

* Porting new blockwise gemm to flash attention

* Example branch provide to compiler team

* tempsave

* Fix a bug

* batched gemm ported

* conv A-skip lds ported

* Skip B-Lds real gemm

* Skip B Lds Gemm + MulD

* batched gemm, conv, skip b lds

* format

* Attn, skip b lds

* Change GridwiseOp nam

* fix a typo caused bug

* Skip A_Lds sanity pass, Skip B_Lds scratch occured

* Bug found, intra-row permute off caused

* bug found

* a fix

* disable buffer load due to incorrect 3rd dword

* update fmha config, no scratch generated

* update 3rd dword

* fmha config update

* FMHA, add support to gfx1101/gfx1102

* Merge origin dev (#2)

* [Navi3x] Fix Gridwise_multiple_d operation (#649)

* Add CMake Option "USE_OPT_NAVI3X"

* fix bug

* standardize docs (#655)

* Separate bibtex requirement from rocm-docs-core (#656)

* separate bibtex requirement from rocm-docs-core

* point requirements to source rocm-docs-core repo

* Add CMake Option "USE_OPT_NAVI3X" (#647)

* Add CMake Option "USE_OPT_NAVI3X"

* remove navi3x opt compile option from cmake script

* Conv + quantization + tanh  (#645)

* Rename file. Prepare to support another activation

* Add comment for quantization

* Extract out_elementop

* Add tanh example

* Add conv + bias + tanh quantization instance

* Add missing parameter

* Refine cmake

* Add external api and client example

* Extract variable in example

* Fix the comment

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

* Add a denorm test fix (#603)

* Add type_convert implementations for bf16

* Add the fix for conv_fwd

* Add the fix for conv_bwd_data

* Add the fix for conv_bwd_weight

* Format

* Format

* Another format

* Add a macro to use workaround on MI200 only

* Format

---------
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>

* simplify karg in device/grid of split-k op (#644)

* simplify karg in device/grid split-k op

* fix mk_kn_mn instances

* add more instances

* use name from tensor layout

* fix 3rd dword of buffer source descriptor (#659)

* add fp64 instances (#658)
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>

* Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665)

This reverts commit bb5530af

.

* Groupnorm + swish external api (#668)

* Rename to proper naming

* Add example of groupnorm + swish

* Extract duplicate code in example

* Add groupnorm + swish instances

* Ractor instance generation, split into multiple cpp file

* Add external api and client example

* Refine profiler message

* Use ck math version of exp

* Refine problem size in example

* Add host version of exp

* add a marco to turn on/off denorm fix (off by default) (#673)

* add a marco to turn off denorm fix by default

* expose the marco

---------
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>

* fixed quant example (#672)
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>

* Add dependabot config and pin rocm-docs-core (#663)

* [gtest] suppress unsafe buffer warn (#670)

ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912



* Add memory index guard in wmma device ops (#667)

* Add more macros to turn on/off denorm fix (#678)
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>

* Fix a typo (#676)

* Add (#677)

* Allow using ROCm release candidate compilers. (#679)

* enable use of rocm5.5 release candidate 4

* upgrade to ROCM5.5 RC5

* try fix the PUB_KEY error, remove the cmake-data package

* upgrade to latest cmake version

* use private dockerhub repo for rocm5.5 rc5

* add missing bracket

* add vector load check

* solve conflicts

---------
Co-authored-by: Sam Wu <sjwu@ualberta.ca>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

* Disable SkipLDS & Align AIT api (#3)

* fix layernorm, reduction Ops (#4)

* [Navi3x] Fix Gridwise_multiple_d operation (#649)

* Add CMake Option "USE_OPT_NAVI3X"

* fix bug

* standardize docs (#655)

* Separate bibtex requirement from rocm-docs-core (#656)

* separate bibtex requirement from rocm-docs-core

* point requirements to source rocm-docs-core repo

* Add CMake Option "USE_OPT_NAVI3X" (#647)

* Add CMake Option "USE_OPT_NAVI3X"

* remove navi3x opt compile option from cmake script

* Conv + quantization + tanh  (#645)

* Rename file. Prepare to support another activation

* Add comment for quantization

* Extract out_elementop

* Add tanh example

* Add conv + bias + tanh quantization instance

* Add missing parameter

* Refine cmake

* Add external api and client example

* Extract variable in example

* Fix the comment

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

* Add a denorm test fix (#603)

* Add type_convert implementations for bf16

* Add the fix for conv_fwd

* Add the fix for conv_bwd_data

* Add the fix for conv_bwd_weight

* Format

* Format

* Another format

* Add a macro to use workaround on MI200 only

* Format

---------
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>

* simplify karg in device/grid of split-k op (#644)

* simplify karg in device/grid split-k op

* fix mk_kn_mn instances

* add more instances

* use name from tensor layout

* fix 3rd dword of buffer source descriptor (#659)

* add fp64 instances (#658)
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>

* Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665)

This reverts commit bb5530af

.

* Groupnorm + swish external api (#668)

* Rename to proper naming

* Add example of groupnorm + swish

* Extract duplicate code in example

* Add groupnorm + swish instances

* Ractor instance generation, split into multiple cpp file

* Add external api and client example

* Refine profiler message

* Use ck math version of exp

* Refine problem size in example

* Add host version of exp

* add a marco to turn on/off denorm fix (off by default) (#673)

* add a marco to turn off denorm fix by default

* expose the marco

---------
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>

* fixed quant example (#672)
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>

* Add dependabot config and pin rocm-docs-core (#663)

* [gtest] suppress unsafe buffer warn (#670)

ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912



* Add memory index guard in wmma device ops (#667)

* Add more macros to turn on/off denorm fix (#678)
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>

* Fix a typo (#676)

* Add (#677)

* Allow using ROCm release candidate compilers. (#679)

* enable use of rocm5.5 release candidate 4

* upgrade to ROCM5.5 RC5

* try fix the PUB_KEY error, remove the cmake-data package

* upgrade to latest cmake version

* use private dockerhub repo for rocm5.5 rc5

* add missing bracket

* Disable SkipLDS & Align AIT api

* Update dependabot config (#682)
Co-authored-by: samjwu <samjwu@users.noreply.github.com>

* update attn api

* solve type_convert bug + enable

---------
Co-authored-by: Sam Wu <sjwu@ualberta.ca>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: samjwu <samjwu@users.noreply.github.com>
Co-authored-by: haocwang <Haocong.WANG@amd.com>

* fix typo

* Fix attention with causal mask

* multiple fix, try ait compile

* Add A/B not use LDS pipeline

* Clang format, Add gfx1101, gfx1102 support of FMHA example

* cancel change of format script

* 1. Enable 2-stage global Prefetch ( May cause VGPR spilling)
2. Enable FP16 accumulator blockwise_gemm

* clang-format

* 1. change blockwise gemm loopover direction from kmn to mnk ( ~1% improvement)
2. change kernel timing mode to 50 warmup + 50 timed repeat

* Update low level abstration of blockwise gemm wmma

* (2/5) bilinear gemm pass, perf bug: skip a lds has lower performance than skip b lds

* (3/5) batched gemm pass, perf bug: skip a lds has lower performance than skip b lds

* (4/5) grouped conv pass

* (5/5) attention pass, todo: debug lds perf bug

* AIT Attention API refactor (#8)

* sanity pass

* sanity pass 2

* confirm significant performance regression.

* turn on all instances

* turn off instance format

* Fix bug & tunning & format

* DML meta, self_attn+cross_attn

* sanity pass

* remove useless flag

* update tile and problem size used in AIT attention

* bug fix in grouped conv supporting check

* deprecate inline asm wmma

* Bug fix: double lds skip

* clang-format

* Fix errors in
1. example, fmha
2. gridwise pipeline
3. deviceop, fmha, change some containers from vector to array

* part2 of previous commit

* clang format

* API fix of gridwisegemmpipeline

* separate array base and vector base attention tensor transformation

* fix gemm

* clang format

* add gemm fp16 instances

* Temp save

* fpAintB kernel compile pass

* Sanity pass.

* Temp save

* debug code enabled

* Fp16AInt8B_GEMM sanity

* MQA implementation

* GQA-4 example

* tempsave

* Compile pass

* New implementation of fp16Aint8B Gemm, Acheieve similar math throughput with native fp16 Gemm

* format

* Todo: fix gemm_bilinear_wmma instances compilation bug

* Solve a bug when K1=16

* remove unnecessary changes

* Remove tensor layout limitation to LDS usage in tesnor contraction

* update self-attention and cross-attention

* fix a typo of name

* Add arch limiter for fp8 gemm

* enable fp8 gemm_xdl for all gfx9 targets

* temporarily disable gemm_xdl_fp16_fp8 on MI100/200

* fix the cmake logic for gemm_xdl_fp16_fp8

* re-enable the gemm_xdl_fp16_fp8 on MI100/200

---------
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: Sam Wu <sjwu@ualberta.ca>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: samjwu <samjwu@users.noreply.github.com>
Co-authored-by: haocwang <Haocong.WANG@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>

1837040a

01 Mar, 2024 1 commit
- Update clipping for fp8/bf8 conversion (#1182) · acfb3392
  Rostyslav Geyyer authored Mar 01, 2024
```
* Update clipping for fp8 conversion

* Add clipping for bf8 conversion

* Format
```
  acfb3392
29 Feb, 2024 1 commit
- Fix synchronization scheme. · b398481e
  Adam Osewski authored Feb 29, 2024
  
  b398481e
27 Feb, 2024 1 commit

Clip fp8 to +/-240 on all targets. (#1172) · d0c7b451

Illia Silin authored Feb 27, 2024

* clip fp8 to +/-240 on all targets

* if inputs to fp8 conversion are +/-inf, they remain unaltered

* increase tolerance for test_elementwise_layernorm to prevent false errors

* change the input values for gemm examples to floats

* reduce gemm example float input values to prevent errors

* increase the tolerance for gemm examples

d0c7b451

07 Feb, 2024 3 commits
- Use memory fence and volatile attribute for synchronization flags. · 9cb25b86
  Adam Osewski authored Feb 07, 2024
  
  9cb25b86
- Clean up. · a74b2263
  Adam Osewski authored Feb 07, 2024
  
  a74b2263
- Implement direct loads split-K GEMM kernel (#1137) · 69518582
  Bartlomiej Wroblewski authored Feb 07, 2024
```
* WIP: Implement direct loads split-K GEMM kernel

* Clean the review

---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
```
  69518582
02 Feb, 2024 1 commit

Add support for more Navi2x and Navi3x models. (#1152) · 180f16f9

Illia Silin authored Feb 02, 2024

* add support for navi2x and navi3x models

* fix syntax

* use common macro for different mi300 architectures

180f16f9

29 Jan, 2024 1 commit
- Clean up and change how neighbours are counted. · e954c206
  Adam Osewski authored Jan 29, 2024
  
  e954c206
24 Jan, 2024 1 commit

Fixing most of the cppcheck errors. (#1142) · 180e5720

Illia Silin authored Jan 24, 2024

* fix cppcheck errors, first pass

* fix format

* fix returned value in examples

* add macro definitions for cppcheck

* fix the profile_gemm logic

* update the gemm profiler logic

* add more difinitions to cppcheck, fix couple more errors

* replace runtime error with message in device function

* fix a couple of int4 issues

* no return for fill function

* fix errors in data_types.hpp

* fix format

* fix few remaining errors

* fix errors in data_types.hpp

* fix last couple of errors in datat_types.hpp

180e5720

23 Jan, 2024 1 commit
- Commit debug WIP for sharing. · 7e71ea99
  Adam Osewski authored Jan 23, 2024
  
  7e71ea99
19 Jan, 2024 1 commit

Add optimized copy to ck wrapper (#1126) · 7e4eb4b8

Bartłomiej Kocot authored Jan 19, 2024



* Add optimized copy to ck wrapper

* Example optimizations

* Fixes

* Move img2col test to client example

* Refactor example

* Fix docs

* Fixes

* Fix

* Fixes

* Fixes

* Fixes

* Fixes

* Fixes

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

7e4eb4b8

11 Jan, 2024 1 commit
- Do not clear cthread buffer if needed. · f9f2cdf9
  root authored Jan 11, 2024
```
- Add output stream operators for LoopSched and PiplineVer
```
  f9f2cdf9
03 Jan, 2024 1 commit

Add tensor partition and generic copy for ck wrapper (#1108) · 4234b3a6

Bartłomiej Kocot authored Jan 03, 2024

* Add tensor partition and generic copy for ck wrapper

* Update changelog

* Stylistic fixes

* Change shape/strides logic to descriptor transforms

* Fixes

* Fix client example

* Fix comments

4234b3a6

13 Dec, 2023 1 commit
- Fix the bugs (#1099) · 6891e4d1
  Rostyslav Geyyer authored Dec 13, 2023
  
  6891e4d1
06 Dec, 2023 2 commits

Few fixes. · 88a4fbfb

Adam Osewski authored Dec 06, 2023

* prevent clearing c_thread_buffer between consecutive k-dim data tiles GEMM.
* Limit the number of launched thread blocks.

88a4fbfb

Introduce wrapper library (#1071) · 836b7e55

Bartłomiej Kocot authored Dec 06, 2023

* Introduce wrapper library

* Update cmake files

* Revert "Update cmake files"

This reverts commit c27f88b56590c11a88e26d5d0df7aca51a08133d.

* Fix comments

836b7e55

03 Dec, 2023 1 commit

Add support for double buffering in direct load GEMM kernel (#1052) · bc4bf9bd

Bartlomiej Wroblewski authored Dec 03, 2023

This PR introduces support for double buffering in LDS into GEMM kernels that use direct load instructions.

Direct loads now use inline asm instead of intrinsics. Usage of intrinsics results in compiler adding additional waitcnt instructions what breaks possible load/compute overlap in case of double buffering.

Usage of inline asm results in the need to use sched_barrier in order to make sure that compiler cannot incorrectly reschedule instructions since it does not know the data dependencies between global->LDS and LDS->registers.

bc4bf9bd

01 Dec, 2023 1 commit
- Change return type from inxed_t to uint32_t for GetFlagValue. · f8cbbd1b
  Adam Osewski authored Dec 01, 2023
```
Update doc.
```
  f8cbbd1b
30 Nov, 2023 1 commit

Introduce wrapper for layout (#1054) · 8ff845f2

Bartłomiej Kocot authored Nov 30, 2023

* Introduce wrapper for layout

* Extend functionality

* Fix for getLength

* Comment fixes

* Add comments and remove not needed getters

8ff845f2

28 Nov, 2023 1 commit

Switch default f8 conversion to stochastic rounding (#1048) · 6ef034f6

Rostyslav Geyyer authored Nov 27, 2023

* Switch default f8 conversion to stochastic rounding

* Refactor f8-related type_converts

* Add an element-wise op

6ef034f6

25 Nov, 2023 1 commit

Add basic support for direct loads from global to LDS (#999) · 627054b9

Bartlomiej Wroblewski authored Nov 25, 2023

* Add basic support for direct loads from global to LDS

* Clean the code and comments

* Add support for fp16

* Add comments

* Add check for thread cluster lengths

* Align non-direct-load fp16 example

* Small fixes

* Extend IsSupported to check for supported GPU gens

* Build examples only on the supported HW

* Do not throw when instance not supported in 04 example

* Review: Apply review suggestions

* Review: small fix

* Review: small fix

627054b9

07 Nov, 2023 1 commit

Add Gemm instances for performance improvement (#1018) · 98fd41f5

zjing14 authored Nov 07, 2023



* improve kpad

* more tuning parameters

* f16_f8_fp16

* cut test time

* add f16_f8_fp16

* add f16_f8_f16

* testing instances for skinny cases

* format

* clean

* add fp16_f8_fp16

* clang-format

* add grouped gemm instalces

* fixed profile grouped_gemm

* clean

* clean

* clean

* clean

* clean

* add missing instance func

* fixed inferface

---------
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: root <root@sh5-1e707-rc06-38.mkm.dcgpu>

98fd41f5

28 Oct, 2023 1 commit

Fix the fp8 gemm for large tensors on MI300. (#1011) · f46a6ffa

Illia Silin authored Oct 27, 2023



* Fix the fp8 conversion

* Try clipping value before conversion

* Fix return

* Simplify with a const

* reduce the gemm input tensor values to reduce round-off error

* replace if-else with lambda

* fix syntax

---------
Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>

f46a6ffa

20 Oct, 2023 1 commit
- Fix bf8 conversion issues (#1003) · 1fd27d52
  Rostyslav Geyyer authored Oct 20, 2023
```
* Fix the conversion

* Add bf8 functionality

* Enable example on MI200 as well
```
  1fd27d52
19 Oct, 2023 7 commits
- Fix the DL kernel issues on Navi3x. (#998) · f7331c60
  Illia Silin authored Oct 19, 2023
```
* apply the patch for dl kernels on gfx11

* build DL kernels on navi32 CI
```
  f7331c60
- Extend available elementwise operations with conv examples (#995) · 82f3a835
  Bartłomiej Kocot authored Oct 19, 2023
```
* Extend available elementwise operations with conv examples

* Fixes

* Remove not needed convert

* Update CMakeFile and dir name
```
  82f3a835
- Add support for halft in reduction op Add. · 28a68428
  Adam Osewski authored Oct 19, 2023
  
  28a68428
- Refactor out FlagCount function. · fb0b3b75
  Adam Osewski authored Oct 16, 2023
  
  fb0b3b75
- Add more functionality to work scheduler. · 986f1d2b
  Adam Osewski authored Oct 05, 2023
  
  986f1d2b
- Add store and reset functions to workgroup barrier. · ae275040
  Adam Osewski authored Oct 05, 2023
  
  ae275040
- Strided Reduction Tile Loop work scheduler. · 030128fa
  Adam Osewski authored Oct 02, 2023
  
  030128fa
18 Oct, 2023 1 commit

Clean DTYPES conditions in CMake (#974) · bf435140

zjing14 authored Oct 18, 2023



* Add a condition to build fp8 instances

* simplified buffer_load/store

* add bfp8/fp8

* fixed

* remove all f8/bf8 condition include folder

* fixed cmake conditions

* fixed DTYPES=fp16/bfp16

* fix

* fixed buffer_load

* fixed buffer_store

* fix

* clean example cmake files

* fixed ci

* fixed cit

---------
Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Jing Zhang <jizha@amd.com>

bf435140

16 Oct, 2023 1 commit
- workaround with float (#992) · 39430bfd
  zjing14 authored Oct 16, 2023
```
Co-authored-by: Jing Zhang <jizha@amd.com>
```
  39430bfd