Commits · 974d67f211ebc5ddbed1ab085d453630ea275caf · gaoqiong / composable_kernel_ROCM

29 Oct, 2024 1 commit
- Add missing constructor. · 974d67f2
  Andriy Roshchenko authored Oct 29, 2024
23 Oct, 2024 1 commit
- Enable build of example_gemm_xdl_fp8_bf8 test. · 7e2f7c95
  Andriy Roshchenko authored Oct 23, 2024
21 Oct, 2024 1 commit
- Add constexpr where applicable. · 807a4818
  Andriy Roshchenko authored Oct 21, 2024
18 Oct, 2024 1 commit
- Enable OCP build of example_gemm_xdl_fp8. · 739d3db9
  Andriy Roshchenko authored Oct 18, 2024
16 Oct, 2024 2 commits
- Fix compilation error for gfx942 architecture. · 4a50b93a
  Andriy Roshchenko authored Oct 16, 2024
- Refactoring. Move FP8 definitions into a separate header file. · ca99f301
  Andriy Roshchenko authored Oct 16, 2024
15 Oct, 2024 3 commits
- Implement ConvertFP16Nearest and ConvertFP16Stochastic tests. · e36b09b7
  Andriy Roshchenko authored Oct 15, 2024
- Implement ConvertFP32Stochastic test. · 487cb570
  Andriy Roshchenko authored Oct 15, 2024
- Implement ConvertFP32Nearest test. · 2052651b
  Andriy Roshchenko authored Oct 15, 2024
14 Oct, 2024 1 commit
- enable bf16 atomic add on gfx950 · ca15fa77
  illsilin authored Oct 14, 2024
11 Oct, 2024 3 commits
- Implement FP8OCP tests for half_t type conversions. · 2bd1b9cf
  Andriy Roshchenko authored Oct 11, 2024
- Implement FP8OCP test for stochastic rounding mode. · c76b765a
  Andriy Roshchenko authored Oct 11, 2024
- Remove dependence on possibly undeclared alias. · d40d1ff1
  Andriy Roshchenko authored Oct 11, 2024
10 Oct, 2024 1 commit
- Implementation of ConvertFP32Nearest in test_fp8_ocp. · 13dd3ab5
  Andriy Roshchenko authored Oct 10, 2024
03 Oct, 2024 1 commit
- Initial introduction of OFP8 data types. · 79a4b17f
  Andriy Roshchenko authored Oct 03, 2024
20 Sep, 2024 1 commit

Remove unsupported (fp8) type from Add memory operation. (#1521) · 0c39954d

Adam Osewski authored Sep 20, 2024

The dynamic buffer does not support fp8 in its `Update` operation, so fp8 cannot be used with `InMemoryDataOperation::Add`.

12 Sep, 2024 1 commit

Pool2d max/avg kernel in the BWD version (#1494) · 448c0f56

Mateusz Ozga authored Sep 12, 2024

* Add pool2d instance BWD AVG

* Add pool2d instance BWD MAX

* Fix: avg review

* Fix review: part2

* Fix - enable test when type is compiled

* Fix review part3

11 Sep, 2024 1 commit

Added structural sparsity blockwise gemm (#1435) · 2a261afc

jakpiase authored Sep 11, 2024



* Implemented smfmac xdlops

* Added smfmac blockwise xdlops

* fixes

* add reviewers suggestions

---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

14 Aug, 2024 1 commit

[GEMM] gemm_universal related optimization (#1453) · 3049b546

Haocong WANG authored Aug 14, 2024



* replace buffer_atomic with global_atomic

* fixed global_atomic_add

* added bf16 atomic_add

* format

* clang-format-12

* clean

* clean

* add guards

* Update gtest.cmake

* enabled splitk_gemm_multi_d

* format

* add ckProfiler

* format

* fixed naming

* format

* clean

* clean

* add guards

* fix clang format

* format

* add kbatch printout

* clean

* Add rocm6.2 related gemm optimization

* Limit bf16 atomic usage

* remove redundant RCR gemm_universal instance

* Add RRR fp8 gemm universal instance

* Bug fix

* Add GPU_TARGET guard to FP8/BF8 target

* bug fix

* update cmake

* remove all fp8/bf8 examples if the arch does not support them

* Enable fp8 RRR support in ckProfiler

* limit greedy-reverse flag to gemm_universal in ckProfiler

---------
Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: Jing Zhang <jizhan@meta.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>

07 Aug, 2024 1 commit

Remove reinterpret_cast uses that result in undefined behaviour. (#1445) · 901e5f15

Juan Manuel Martinez Caamaño authored Aug 07, 2024

* Remove reinterpret_cast uses that result in undefined behaviour. Use a bitcast instead.

See https://en.cppreference.com/w/cpp/language/reinterpret_cast#Type_accessibility



Closes #1439

* fix clang format

---------
Co-authored-by: illsilin <Illia.Silin@amd.com>
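
To illustrate the kind of change this commit describes (replacing a type-punning `reinterpret_cast` with a bitcast), here is a minimal C++ sketch. The helper names `bit_cast_sketch` and `float_bits` are hypothetical and are not taken from the repository; the sketch also assumes a 32-bit IEEE-754 `float`.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper (not CK's actual API): copies the object representation
// of `src` into a new object of type To. Unlike dereferencing a
// reinterpret_cast'ed pointer, this does not violate the type-accessibility
// (strict aliasing) rules.
template <typename To, typename From>
To bit_cast_sketch(const From& src)
{
    static_assert(sizeof(To) == sizeof(From), "source and target sizes must match");
    static_assert(std::is_trivially_copyable<To>::value, "To must be trivially copyable");
    static_assert(std::is_trivially_copyable<From>::value, "From must be trivially copyable");

    To dst;
    std::memcpy(&dst, &src, sizeof(To));
    return dst;
}

// The pattern being removed would look roughly like this (undefined behaviour,
// because a float object is read through a uint32_t lvalue):
//   std::uint32_t bits = *reinterpret_cast<const std::uint32_t*>(&f);

// Well-defined alternative built on the helper above.
inline std::uint32_t float_bits(float f)
{
    return bit_cast_sketch<std::uint32_t>(f);
}
```

Compilers typically lower the `std::memcpy` of a small fixed size to a plain register move, so this form usually costs nothing at run time. In C++20, `std::bit_cast` from `<bit>` offers the same semantics.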


06 Aug, 2024 2 commits

Add missing constexpr to if conditions (#1444) · fd9ef4e6
Juan Manuel Martinez Caamaño authored Aug 06, 2024

Add Grouped Conv Fwd Large Tensor kernel (#1432) · 4ec5c52a

Bartłomiej Kocot authored Aug 06, 2024

* Support 64 bit indexing

* Add new grouped conv fwd kernel for large tensors

* Add instances large tensor

* Fixes for transform conv to gemm

* Fixes

* fixes

* Remove unneeded instances

* examples fixes

* Remove unneeded ds arrays

* Fix tests

* Add 2GB check in gridwise dl

* Fixes

24 Jul, 2024 1 commit
- Add support for half_t and bfloat to reduction operations (#1395) · ffabd70a
  Bartłomiej Kocot authored Jul 24, 2024
```
* Add support for half_t and bfloat to reduction operations

* Fix bhalf convert

* Next fix bf16
```
17 Jul, 2024 1 commit
- Replace the use of __expf with __ocml_exp_f32 to work around the test_softmax_rank4 failure (#1394) · ee768148
  Qianfeng authored Jul 18, 2024
04 Jul, 2024 1 commit
- Fix issue with multiple targets and remove smfmac tests from unsupported test targets (#1372) · 95907384
  Jun Liu authored Jul 03, 2024
27 Jun, 2024 2 commits
- Add structural sparsity gemm instruction tests (#1309) · ed21948b
  jakpiase authored Jun 27, 2024
```
* first version of smfmac test

* add reviewer comments

* add reviewer suggestions
```
- Merging the gfx12 code into public repo. (#1362) · 941d1f7c
  Illia Silin authored Jun 27, 2024
25 Jun, 2024 1 commit

CK Instance Gen (#1145) · 3e9711f0

arai713 authored Jun 25, 2024



* Format

* Format

* Format

* Remove const

* Use the right template

* Format

* Format

* add row/col instances

* Add missing file

* fixed

* fixing block to etile error

* Format

* Updates

* Format

* fixed rrr layout

* generating a sample JSON file: currently contains includes, prologue/epilogue and instances

* version where the json is passed into the instances to generate a key

* updated run function to just launch kernel

* updated run function: only contains kernel object, json file is updated but still needs to be cleaned up, added front-end API to parse JSON into character buffer

* adding in testing files

* cleaned up comments, still need to work on including header files

* removed unneeded files

* removed/commented out JSON implementation

* added fusion(prologue/epilogue) into instance generation

* working on instance selection

* added instance selection, need to fix instance validation

* removed block2etile map validity check for testing purposes

* test running: failing due to incorrect files/input

* all grid descs/ptrs completed, but device file not found

* Update test and embed modules

* Restore older version

* added convolution operation, written test, debugging generated code for compilation

* attempting to include CK in host directory: _Float16 error

* CK header file issues

* slight fix

* don't crash when hip can't report total memory

* dump generated code to a file

* changing sizes

* creating tensor descriptors using CK methods: set up grid desc manually, also trying to set up an argument pointer - this needs to be fixed

* some fixes to call the device code

* separating test files for conv and gemm

* completed arg ptr, now have linking errors

* clang format fix

* resolved linker issues in conv test

* remove dependency on libutility from ck

* resolved num dim error

* properly passing arg ptr, errors with passing typenames: redefinition/redeclaration

* undo the commenting of device function

* hand created kernel code to find rtc issues

* dump the full src to file

* resolved redeclaration errors, cleaned up errors for Amber's kernel code

* debugging purposes: redeclaration error

* config files

* resolved errors for NumTensor and redeclaration, formatted version.h

* resolved most errors in manually added kernel and my own. error with calling kernel object: overloaded function type

* WIP: close to getting kernel compiled

* WIP: fixing rtc errors

* fixed sequence errors, formatting, still one error with run fcn

* yay: kernel compiles and runs

* updated templated/generated version to run and compile

* minor fixes

* working generated example, resolved memory access error due to padding

* adding in reference kernel, validation failing against reference

* debugging: printing kernel args

* reduced error in results

* debugged reference kernel and output errors, added to generated version, currently debugging prologue function issues

* working validation (using reference convolution) with prologue function for both hard-coded and generated version

* WIP: create an alt version that creates Argument on the device

* wip: added new duplicate files, fixed fusion templating errors from working example, setting up kernel arguments

* wip: making necessary methods device code

* added grid descs, working on grid pointers, errors with stl numerics

* wip: updating kernel args - issue, replacing some std functions

* replaced std::accumulate call with temp hardcoded version

* wip: args causing memory issue

* Construct Argument object inside the kernel and use it to call convolution device function. Code runs and verification passes

* adding object file dump

* temporary hardcoding of grid size, can remove device op inst + arg ptr

* minor fix for grid size

* added modified example where arg ptr is created on the device for generated version as well

* removed device op instance and arg ptr from modified examples

* moving device op file for testing purposes and to properly build CK

* commenting out print-outs

* adjust compiler args to produce a valid ELF file

* temporary removal of validation

* reverting compiler args back for working example

* retrieve necessary arguments from generated template parameters in correct format

* calculating grid size on host-side, still need to clean up process, pass parameters to host functions properly

* scaled up factory functions/wrapper structs to implement host-side launch parameter calculations using CK host side functions - in hard-coded example

* temporary change to generate ELF format binary object file

* removed unnecessary code, added comments

* formatting fix

* cleaned up code, added new tests, restructured library: move helper into CK

* refactored launch parameter calculation to be more concise

* renamed files and variables for more clarity/uniformity

* more code cleaning, removed debug statements

* moved majority of my files into codegen directory, running properly

* updated Embed.cmake(string_view) in codegen directory

* updated host directory to match Embed.cmake as well

* added old tests in

* updated instance generation methods to be more concise

* removed layout from launch parameter calculation

* working test

* fixed issue with verification, all instances working

* updated verification in other tests

* removed duplicate matrix padder file, removed code dumps

* removed old hard-coded tests

* removed old host directory, all files in codegen directory now

* fixed copyright in files

* commenting out validation

* renamed files

* made changes for review: fixed copyright, renamed files for clarity, removed comments, refactored code

* updated headers

* removing duplicate file for fwd conv to gemm, merging with original file

* fix building codegen with clang++ directly

* resolving build error from conv_fwd_to_gemm

* fix for previous error

* renaming tests

* created common test file

* cleaned up code, added comments

* renamed device op

* fixed typos in comments

* removed extra space

* code cleanup: resolving Amber's comments

* removed wrapper struct for matrix padder, fixed template

* cleaned up if statements for better readability

---------
Co-authored-by: Paul <pfultz2@yahoo.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: M. Amber Hassaan <amber_474@yahoo.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

21 Jun, 2024 1 commit
- WA for rocm-6.2+ s constraint for buffer resource (#1346) · fa129c1a
  carlushuang authored Jun 22, 2024
```
* WA for rocm-6.2+ s constraint for buffer resource

* add missing memory clobber
```
18 Jun, 2024 1 commit
- Add read_first_lane function for int64 (#1347) · 8faec23c
  Bartłomiej Kocot authored Jun 18, 2024
17 May, 2024 1 commit
- replace the ENV macro with CK_ENV (#1296) · 1274861a
  Illia Silin authored May 17, 2024
10 May, 2024 1 commit
- Code clean-up (#1285) · 566b6480
  Illia Silin authored May 10, 2024
```
* code clean-up

* remove the profiling output samples
```
09 May, 2024 1 commit
- Add vector instruction coherency bits for gfx94 targets. (#1268) · 3c043cd1
  Adam Osewski authored May 09, 2024
07 May, 2024 1 commit

Enable logging in CK with environment variable. (#1278) · bf420976

Illia Silin authored May 07, 2024



* enable logging using environment variable

* update ck.hpp header

* fix typo

* fix clang format

* Update include/ck/utility/env.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

---------
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
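
As a rough illustration of the idea in this commit (gating log output on an environment variable), here is a minimal C++ sketch. The variable name `SKETCH_LOGGING` and the helpers `parse_flag`, `logging_enabled`, and `log_if_enabled` are hypothetical; they are not taken from `include/ck/utility/env.hpp`, whose actual interface is not shown in this log.

```cpp
#include <cstdlib>
#include <cstring>
#include <iostream>

// Hypothetical helper: treats "1", "true", and "on" as enabled.
// Any other value, or an unset variable (nullptr), counts as disabled.
inline bool parse_flag(const char* value)
{
    if(value == nullptr)
        return false;
    return std::strcmp(value, "1") == 0 || std::strcmp(value, "true") == 0 ||
           std::strcmp(value, "on") == 0;
}

// Looks up the (hypothetical) environment variable and interprets it as a flag.
inline bool logging_enabled(const char* name = "SKETCH_LOGGING")
{
    return parse_flag(std::getenv(name));
}

// Writes the message to stderr only when logging is switched on.
inline void log_if_enabled(const char* message)
{
    if(logging_enabled())
        std::cerr << message << '\n';
}
```

With this sketch, running a program as `SKETCH_LOGGING=1 ./app` would turn the messages on without rebuilding, which is the main appeal of an environment-variable switch over a compile-time macro.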


26 Apr, 2024 1 commit

bf16A_Int8B with fastgelu/bias (#1264) · 0d0150db

zjing14 authored Apr 26, 2024

* changed the copy function to v7r2

* adding multi_abd

* in-progress

* add post-load oob check

* debugging

* adjust instances

* add run_lds

* add elementwise_op

* replace multi_abd_device with v3

* clean up

* clean

* clean

* Added LDSType

* profiling

* adjust oobcheck

* add missing file

* refactor

* clean

* add examples

25 Apr, 2024 2 commits

Grouped GEMM Multiple D tile loop. (#1247) · b4032629

Adam Osewski authored Apr 25, 2024

* Overload output stream operator for LoopScheduler and PipelineVersion

* Add Run overload accepting grid descriptors MK.

* Add __device__ keyword for CalculateGridSize

* Create device op GroupedGemmMultipleD

* Add GroupedGemm MultipleD Tile Loop implementation.

* Add an example for GroupedGemm MultipleD tile loop.

* Device Op GroupedGEMMTileLoop.

* A bunch of small changes in the example.

* CkProfiler

* Remove unused tparam.

* Fix include statement.

* Fix output stream overloads.

* Do not make descriptors and check validity until we find the group.

* Fix gemm desc initialization.

* Revert device op

* Fix compilation for DTYPES=FP16

* Validate tensor transfer parameters.

* Validate on host only NK dims if M is not known.

* Fix bug.

* A convenient debug func for selecting threads.

* Fix has main k block loop bug.

* Make sure that b2c has up to date tile offset.

* Output stream operator for Sequence type.

* Cmake file formatting.

Universal gemm flush cache (#1251) · f448d179

ltqin authored Apr 26, 2024



* add flush cache to device op

* add flush cache parameter to ckProfiler

* change calculate size a and b method

* change evaluation time method from AVERAGE to MEDIAN

* format code

* adjust some code

* fix core dumped

* remove loop call flush icache in kernel

* remove loop(outer) call flush icache

---------
Co-authored-by: letaoqin <letaoqin@amd.com>

16 Apr, 2024 1 commit
- add gfx1201 macro in amd_wmma header · 0b7addbd
  illsilin authored Apr 16, 2024
14 Apr, 2024 1 commit

[GEMM] Gemm universal device operation (#1154) · f83e9701

Haocong WANG authored Apr 14, 2024



* Optimize GEMM on MI200/300:
1. Add new blockwise gemm pipeline
2. Add irregular splitk instances

* clang format + typo fix

* Fix a bug

* initial commit

* Add more instances to irregular splitk

* blkgemm pipeline v1~4 prototype

* Sanity Checked. Known issue:
1. Poor performance of splitk
2. Register spill on blkgemmpipeline v3

* Sanity and Performance fix:
1. fix a bug related to sanity in grouped b2c mapping
2. fix a bug related to sanity and performance in splitk offset

* Sanity and API update:
1. Remove prefetch stage
2. Fix valid check bug
3. Add first gemm_universal instance into ckProfiler

* Add NN instances for gemm universal

* 1. Add NT instances for gemm_universal
2. Fix a bug about Kpadding in gemm_universal

* Fix a bug regarding padding Odd K number

* remove kernel print

* Fix KPadding bug...

* Update safety check

* another try to fix kpadding..

* Sanity checked

* new instances..

* clang format+typo fix

* remove clang format script's change

* Add non-hotloop compile option

* 1. Add fp16xfp8 example
2. pull packed convert f8 from pr1150

* Some miscs.. opt and fix

* Add pipeline description docs

* Split universal gemm instance library to cut profiler compiling time

* uncomment cmakefile

* Fix a bug caused by blockwise_gemm_pipe_v2

* reduce default splitk to 1

* Add 224x256x64 tile size

* update, including:
1. Experiment pipeline 5~7
2. Optimization for pipeline 4
3. Organized instance library

* temp save

* temp save

* Permuted lds layout, sanity and function checked

* clang format

* Move OOB check from RunRead to RunWrite, for better software pipeline.
TODO: agpr spill when NN layout

* clangformat

* A/B splitpipe scheduler for v3

* Fix two bugs

* bug fix

* fix a bug in oob check

* Example for mixed fp16_fp8 gemm

* Clean experimental code blocks

* Add mixed precision gemm into profiler

* tempsave

* optimize m/n major lds layout

* Add RRR GEMM mixed precision instances

* Optimize f8 matrix transpose

* Add test_gemm_universal

* A/B split schedule for blkpip v5

* Take ds_read2 into iglp scheduling scheme

* format

* fixed cmake

* Add llvm-option into CI cmake flag

---------
Co-authored-by: Jing Zhang <jizhan@amd.com>

04 Apr, 2024 1 commit
- enabled other types · 7d700bc0
  Jing Zhang authored Apr 03, 2024