Commits · ceaed8e097cbba105e23f465c6226ab48e37a3a8 · gaoqiong / composable_kernel_ROCM

09 Oct, 2024 1 commit
- Fixes small memory leak from missing hipEventDestroy (#1554) · ceaed8e0
  Christopher Millette authored Oct 09, 2024
  
  ceaed8e0
07 Oct, 2024 1 commit

Fix build logic using GPU_ARCHS. (#1536) · 7d8ea5f0

Illia Silin authored Oct 07, 2024

* update build logic with GPU_ARCHS

* fix the GPU_ARCHS build for codegen

* unset GPU_TARGETS when GPU_ARCHS are set

7d8ea5f0

04 Oct, 2024 1 commit
- Fix grouped gemm check to avoid overflow (#1545) · 6b54d2fa
  Bartłomiej Kocot authored Oct 04, 2024
  
  6b54d2fa
02 Oct, 2024 1 commit

Fix compilation errors generated by forthcoming Clang changes (#1544) · aeb7c91f

macurtis-amd authored Oct 02, 2024

Without this change, the following diagnostic is generated:
  a template argument list is expected after a name prefixed by the template
  keyword [-Wmissing-template-arg-list-after-template-kw]

See C++17 spec [temp.names] p5.

aeb7c91f
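
The diagnostic quoted above can be illustrated with a minimal sketch. The types and function names here are hypothetical and not taken from composable_kernel, and the actual fix in #1544 may have taken a different form; the sketch only shows the rule that a name prefixed by the `template` keyword must be followed by a template argument list.

```cpp
// Hypothetical type for illustration only; not part of composable_kernel.
struct Accessor
{
    // Member template whose argument can be defaulted.
    template <typename T = int>
    constexpr T Get() const
    {
        return T{42};
    }
};

template <typename A>
constexpr int ReadWithKeyword(const A& a)
{
    // return a.template Get();  // diagnosed: no argument list after `template`
    return a.template Get<>();   // OK: `template` is followed by an argument list
}

template <typename A>
constexpr int ReadPlain(const A& a)
{
    // OK: without an explicit argument list the `template` keyword is not needed.
    return a.Get();
}
```

Either spelling compiles cleanly; which one is appropriate depends on whether explicit template arguments are actually being passed at the call site.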

25 Sep, 2024 1 commit
- Fix compilation errors with Clang20.0. (#1533) · 42e6dcea
  Illia Silin authored Sep 25, 2024
```
* fix clang20 compilation errors for gfx90a

* fix clang20 compilation errors for gfx11 targets
```
  42e6dcea
20 Sep, 2024 2 commits
- Add support for NGCHW in grouped conv fwd (#1499) · 4ba52b35
  Bartłomiej Kocot authored Sep 20, 2024
```
* Support NGCHW in grouped conv fwd

* Remove not needed variable

* Fixes
```
  4ba52b35
- Remove unsupported (fp8) type from Add memory operation. (#1521) · 0c39954d
  Adam Osewski authored Sep 20, 2024
```
The dynamic buffer doesn't support fp8 in the `Update` operation, so fp8 cannot be used with `InMemoryDataOperation::Add`.
```
  0c39954d
13 Sep, 2024 1 commit

Customize filesystem in CK for legacy systems (#1509) · 81bc1496

Jun Liu authored Sep 13, 2024



* Legacy support: customized filesystem

* Update cmakefile for python alternative path

* fix build issues

* CK has no boost dependency

* More fixes to issues found on legacy systems

* fix clang format issue

* Check if blob is correctly generated in cmake

* fix the python issues

* add a compiler flag for codegen when using alternative python

* use target_link_options instead of target_compile_options

---------
Co-authored-by: illsilin <Illia.Silin@amd.com>

81bc1496

12 Sep, 2024 1 commit

Pool2d max/avg kernel in the BWD version (#1494) · 448c0f56

Mateusz Ozga authored Sep 12, 2024

* Add pool2d instance BWD AVG

* Add pool2d instance BWD MAX

* Fix: avg review

* Fix review: part2

* Fix - enable test when type is compiled

* Fix review part3

448c0f56

11 Sep, 2024 2 commits

Rewrite pool2d fwd (#1462) · e8d2887c

jakpiase authored Sep 11, 2024



* added pool2d fwd

* add tests

* add reviewers changes

* Revert "Merge remote-tracking branch 'origin/develop' into jakpiase/pool2d_fwd_new"

This reverts commit 6b2ba7ff8960b0a6ddbe30d8dac53eeb55a8597e, reversing
changes made to 22c82bea0caf3e0f29399100c1bb67b8003fc042.

* Revert "add reviewers changes"

This reverts commit 22c82bea0caf3e0f29399100c1bb67b8003fc042.

* added reviewers comments

* revert some old files

* add reviewers requests

---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

e8d2887c

Added structural sparsity blockwise gemm (#1435) · 2a261afc

jakpiase authored Sep 11, 2024



* Implemented smfmac xdlops

* Added smfmac blockwise xdlops

* fixes

* add reviewers suggestions

---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

2a261afc

05 Sep, 2024 2 commits

Modification to fix the issue "threadwise_tensor_slice_transfer_v5r1 issue #1279" (#1492) · 83788553

M.Emin Ozturk authored Sep 04, 2024



* issue fix, one line changed for tmp

* clang

---------
Co-authored-by: Emin Ozturk <emin.ozturk@utah.edu>
Co-authored-by: Harisankar Sadasivan <135730918+hsadasiv@users.noreply.github.com>

83788553

Add gemm universal bf16 instances (#1484) · 5b10dae6

Haocong WANG authored Sep 05, 2024



* revert ckprofiler change

* temp save

* Add test and test pass

* test pass

* Fix bug inside rotating buffer when tensor is not packed

* bug fix

* clang format

---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

5b10dae6

03 Sep, 2024 1 commit
- Add support for NGCHW in grouped conv bwd wei (#1491) · 73b67f29
  Bartłomiej Kocot authored Sep 03, 2024
```
* Add support for NGCHW in grouped conv bwd wei

* Comments fixes

* navi fixes

* Update function names
```
  73b67f29
02 Sep, 2024 1 commit

Revert "Revert "Revert Revert Support access per groups and filter2x3 in... · a9b170b5

Bartłomiej Kocot authored Sep 02, 2024

Revert "Revert "Revert Revert Support access per groups and filter2x3 in grouped conv fwd (#1382) (#1406) (#1415)" (#1455)" (#1490)

This reverts commit 5ff8eeeb.

a9b170b5

21 Aug, 2024 2 commits

Adding Instances and Examples for FP8-based Scaled Convolution and AMAX Reduction. (#1473) · c3515f27

Andriy Roshchenko authored Aug 21, 2024

* Enable CMakePresets build

* Verify Convolution, Scaling and ReLU algorithms.

* Add tensor element-wise scale and type cast operation.

* Reduction implemented but does not work.

* Exploration of Reduction functionality.

* Completed example for Convolution scaled with ReLu activation and AMAX reduction.

* WIP: Add required instances for convolution.

* WIP: Create client example. Implement convolution stage.

* Add elementwise instances.

* Add elementwise scale + convert example.

* Add reduction instances.

* WIP: Client example for AMAX reduction.

* WIP: Add instances for multistage reduction.

* WIP: Implementation of multistage reduction.

* Refactoring.

* Clean up.

* Add CMakePresets.json

* Guard off FP8 instances when the data type is not available.

* Add example for Scaled FP8 Convolution with AMAX reduction.

* Refactor CombConvScaleRelu instances.

* Add CombConvScale instances.

* Add client example for Scaled FP8 Convolution with AMAX reduction.

* Cleanup.

c3515f27

Set RNE fp8 conversion as a default (#1458) · e20f20ef

Rostyslav Geyyer authored Aug 21, 2024

* Set RNE fp8 conversion as a default

* Update f8 tests

* Disable failing test on gfx11

* Update bf8 tests

* Add a flag

* Fix the flag

* Raise flag for gfx10 as well

* Temp commit for tolerance testing

* Update tolerances

e20f20ef
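
For readers unfamiliar with RNE (round-to-nearest, ties-to-even), the following is a minimal sketch of the rounding rule only. It is a hypothetical helper, not CK's fp8 converter: it keeps a chosen number of mantissa bits of a binary32 value and ignores the narrower fp8 exponent range, saturation, and NaN/Inf handling.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not CK code): round an IEEE-754 binary32 bit pattern to
// `keep_bits` explicit mantissa bits using round-to-nearest-even.
// Assumes 0 <= keep_bits <= 22 and a finite input.
constexpr std::uint32_t RoundMantissaRneBits(std::uint32_t bits, int keep_bits)
{
    const int drop = 23 - keep_bits;                    // mantissa bits discarded
    const std::uint32_t lsb = (bits >> drop) & 1u;      // last bit that is kept
    // Adding (half - 1 + lsb) rounds to nearest and sends exact ties to the
    // value whose last kept bit is even.
    const std::uint32_t bias = (std::uint32_t{1} << (drop - 1)) - 1u + lsb;
    return (bits + bias) & ~((std::uint32_t{1} << drop) - 1u);
}

// Convenience wrapper operating on float values.
inline float RoundMantissaRne(float x, int keep_bits)
{
    std::uint32_t bits = 0;
    std::memcpy(&bits, &x, sizeof(bits));
    bits = RoundMantissaRneBits(bits, keep_bits);
    std::memcpy(&x, &bits, sizeof(x));
    return x;
}
```

With three mantissa bits kept (as in an e4m3-style format), 145.0f rounds down to 144.0f, while the exact ties 152.0f and 136.0f round to 160.0f and 128.0f respectively, the neighbours with an even last mantissa bit.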

14 Aug, 2024 1 commit

[GEMM] gemm_universal related optimization (#1453) · 3049b546

Haocong WANG authored Aug 14, 2024



* replace buffer_atomic with global_atomic

* fixed global_atomic_add

* added bf16 atomic_add

* format

* clang-format-12

* clean

* clean

* add guards

* Update gtest.cmake

* enabled splitk_gemm_multi_d

* format

* add ckProfiler

* format

* fixed naming

* format

* clean

* clean

* add guards

* fix clang format

* format

* add kbatch printout

* clean

* Add rocm6.2 related gemm optimization

* Limit bf16 atomic usage

* remove redundant RCR gemm_universal instance

* Add RRR fp8 gemm universal instance

* Bug fix

* Add GPU_TARGET guard to FP8/BF8 target

* bug fix

* update cmake

* remove all fp8/bf8 example if arch not support

* Enable fp8 RRR support in ckProfiler

* limit greedy-reverse flag to gemm_universal in ckProfiler

---------
Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: Jing Zhang <jizhan@meta.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>

3049b546

13 Aug, 2024 1 commit
- Support large: 12d tensor size for reduction kernel (#1465) · 0606e549
  Mateusz Ozga authored Aug 13, 2024
  
  0606e549
10 Aug, 2024 1 commit
- Fix bug with n block id calculation in DeviceGroupedConvXdlCShuffle (#1457) · 4a870942
  Bartłomiej Kocot authored Aug 10, 2024
```
* Fix typo in TransformConvFwdToGemm

* Fix bug in n offset calculation
```
  4a870942
09 Aug, 2024 1 commit

Revert "Revert Revert Support access per groups and filter2x3 in grouped conv... · 5ff8eeeb

Jun Liu authored Aug 08, 2024

Revert "Revert Revert Support access per groups and filter2x3 in grouped conv fwd (#1382) (#1406) (#1415)" (#1455)

This reverts commit 33b399cc.

5ff8eeeb

07 Aug, 2024 1 commit

Remove reinterpret_cast uses that result in undefined behaviour. (#1445) · 901e5f15

Juan Manuel Martinez Caamaño authored Aug 07, 2024

* Remove reinterpret_cast uses that result in undefined behaviour. Use a bitcast instead.

See https://en.cppreference.com/w/cpp/language/reinterpret_cast#Type_accessibility



Closes #1439

* fix clang format

---------
Co-authored-by: illsilin <Illia.Silin@amd.com>

901e5f15
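
As a rough illustration of the bitcast approach mentioned in this commit, the sketch below copies the object representation with `std::memcpy` instead of reading through a `reinterpret_cast` pointer. `BitCast` and `FloatBits` are hypothetical names; CK's own bitcast utility may be implemented differently.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper: reinterpret the bytes of `src` as a value of type To
// without violating the type-accessibility (strict aliasing) rules.
template <typename To, typename From>
To BitCast(const From& src)
{
    static_assert(sizeof(To) == sizeof(From), "sizes must match");
    static_assert(std::is_trivially_copyable<To>::value, "To must be trivially copyable");
    static_assert(std::is_trivially_copyable<From>::value, "From must be trivially copyable");
    To dst;
    std::memcpy(&dst, &src, sizeof(To));
    return dst;
}

inline std::uint32_t FloatBits(float x)
{
    // return *reinterpret_cast<const std::uint32_t*>(&x); // undefined behaviour
    return BitCast<std::uint32_t>(x);
}
```

Compilers typically lower the `memcpy` to a plain register move, so the well-defined version should not cost anything at runtime.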

06 Aug, 2024 3 commits

Add missing constexpr to if conditions (#1444) · fd9ef4e6
Juan Manuel Martinez Caamaño authored Aug 06, 2024

fd9ef4e6
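
A minimal sketch of why the `constexpr` matters in such conditions, using a hypothetical function that is not taken from the commit: with `if constexpr`, the branch that is not selected is discarded at compile time and never instantiated.

```cpp
#include <type_traits>

// Hypothetical example: returns the first element of an array, or the value
// itself for non-array types.
template <typename T>
constexpr auto FirstOrSelf(const T& v)
{
    if constexpr(std::is_array<T>::value)
    {
        return v[0]; // only instantiated when T is an array type
    }
    else
    {
        return v;
    }
}
```

With a plain `if`, both branches would be instantiated for every `T`, so `v[0]` would fail to compile for `int`, even though the condition is known at compile time.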
Fix for beta!=0 in reduce (#1440) · b74d4d4d
jakpiase authored Aug 06, 2024
```
* fix for beta!=0 in reduce

* add reviewers suggestions
```
b74d4d4d

Add Grouped Conv Fwd Large Tensor kernel (#1432) · 4ec5c52a

Bartłomiej Kocot authored Aug 06, 2024

* Support 64 bit indexing

* Add new grouped conv fwd kernel for large tensors

* Add instances large tensor

* Fixes for transform conv to gemm

* Fixes

* fixes

* Remove not needed instances

* examples fixes

* Remove not needed ds arrays

* Fix tests

* Add 2GB check in gridwise dl

* Fixes

4ec5c52a
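
The 2GB check and 64-bit indexing items above can be sketched roughly as follows. The helper name, the exact limit, and the comparison are assumptions made for illustration; the check in the gridwise kernel may differ.

```cpp
#include <cstdint>

// Assumed limit: byte offsets must stay within 2 GiB to be addressable with
// 32-bit signed arithmetic.
constexpr std::int64_t kTwoGB = std::int64_t{1} << 31;

// Hypothetical helper: true when a tensor of `num_elements` elements of
// `element_bytes` bytes each fits within the assumed 2 GiB limit.
constexpr bool FitsIn32BitOffsets(std::int64_t num_elements, std::int64_t element_bytes)
{
    // Compare via division in 64-bit so the size computation cannot wrap.
    return element_bytes > 0 && num_elements >= 0 && num_elements <= kTwoGB / element_bytes;
}
```

Done in 32-bit arithmetic, a product such as 2^30 elements times 4 bytes would wrap to zero and pass a naive size check, which is the kind of case a separate large-tensor kernel is meant to handle.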

31 Jul, 2024 1 commit

Codegen: isSupportedArgument check (#1417) · d32997a7

arai713 authored Jul 31, 2024

* added isSupportedArgument check into codegen device op

* adding function call

* remove commented code

d32997a7

30 Jul, 2024 1 commit
- Revert Revert Support access per groups and filter2x3 in grouped conv fwd (#1382) (#1406) (#1415) · 33b399cc
  Bartłomiej Kocot authored Jul 30, 2024
  
  33b399cc
25 Jul, 2024 1 commit

Add rotating buff for gemm_multi_d (#1411) · 105bd708

zjing14 authored Jul 25, 2024



* add rotating_buff for gemm_multi_d

* format

* Update flush_cache.hpp

* Update gtest.cmake

---------
Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: Haocong WANG <haocwang@amd.com>

105bd708

24 Jul, 2024 2 commits

Adding more instances of grouped convolution 3d forward for FP8 with... · 4a8a1bef

Andriy Roshchenko authored Jul 24, 2024

Adding more instances of grouped convolution 3d forward for FP8 with ConvScale+Bias element-wise operation. (#1412)

* Add CMakePresets configurations.

* Add binary elementwise ConvScaleAdd and an example.

* Numerical verification of results.

Observed significant irregularities in F8 to F32 type conversions:
```log
ConvScaleAdd: float=145.000000   f8_t=160.000000    e=144.000000
ConvScaleAdd: float=97.000000   f8_t=96.000000    e=104.000000
ConvScaleAdd: float=65.000000   f8_t=64.000000    e=72.000000
```

* Implemented ConvScaleAdd + Example.

* Add ConvScale+Bias Instances

* Add Client Example for ConvScale+Bias

* Fix number of bytes in an example.

* Cleanup.

4a8a1bef

Add support for half_t and bfloat to reduction operations (#1395) · ffabd70a
Bartłomiej Kocot authored Jul 24, 2024
```
* Add support for half_t and bfloat to reduction operations

* Fix bhalf convert

* Next fix bf16
```
ffabd70a

22 Jul, 2024 1 commit
- Revert Support access per groups and filter2x3 in grouped conv fwd (#1382) (#1406) · 5d8c3d81
  Bartłomiej Kocot authored Jul 22, 2024
  
  5d8c3d81
19 Jul, 2024 3 commits

[GEMM] F8 GEMM, performance optimized. (#1384) · 8c90f25b

Haocong WANG authored Jul 19, 2024



* add ab_scale init support

* enabled interwave

* add scale type; update isSupport

* adjust example

* clean

* enable f8 pure gemm rcr ckprofiler

* Add gemm_multiply_multiply instances

* clang format

* Optimize for ScaleBlockMNK=128

* enable abscale f8 gemm ck profiler

* Add pure f8 gemm test suite

* Reverting to the state of project at f60fd77

* update copyright

* clang format

* update copyright

---------
Co-authored-by: root <jizhan@amd.com>

8c90f25b

Universal gemm splitk using reduce (with multi-d) (#1341) · c544eb4d

ltqin authored Jul 19, 2024



* init for reduce_threadwise multi_d

* add reduce_threadwise_multi_d

* add reduce_multi_d

* clean

* start adding another splitk device op

* add reduce template parameter to SplitKBatchOffset

* add reduce c matrix

* clean up code

* change example data type to bf16

* add bf16Ai8B example

* remove reduce template parameter

* add splitk atomic status to v4

* example add multi d parameters

* device op add multi-d parameters

* add multi-d to reduce

* fix kbatch=1 bug

* change B layout to col in  bf16Ai8B example

* remove float adding struct

* change  multi-d interface

* change file and class name

* remove multi-d of bf16Ai8B example

* change IsReduce function to IsReduceAdd

* change example layout to RRR from RCR

* set ds stride according to layout

* reset parameter layout

* add gemm universal reduce instance

* add reduce factory

* add profile_gemm_universal_reduce

* add reduce to profiler

* fix reduce instance

* fix profiler reduce compiling bug

* format

* format library instance code

* add mem instance for reduce library

* fix call instance names

* add workspace for reduce in ckProfiler

* format

* add mnpadding to reduce library instance

* add fp16 instance to reduce of profiler

* change copyright time

* restore profiler cmake file

* add reduce text to instances

* add DsLayout and DsDataType to instances template parameter

* fixed gemm_reduce_multi_d

* add an example without multi_d

* Update common.hpp

* Update gtest.cmake

* Update gemm_xdl_splitk_reduce_bf16.cpp

* clean

* Update gtest.cmake

* format

* fix api

* format

* default parameter change to RRR

* add vector_len for multi_d

* format

* Update gtest.cmake

* fix bf16Ai8B elementwiseop

* add ReduceDataType

* move ReduceDataType to end position

* format

* remove googletest git method  address

* fix copyright time

* update init data

---------
Co-authored-by: root <jizhan@amd.com>
Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: Jing Zhang <jizhan@meta.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>

c544eb4d

Refactor transform conv to gemm fwd (#1391) · 70a814f1
Bartłomiej Kocot authored Jul 19, 2024
```
* Refactor transform conv to gemm fwd

* fixes codegen

* wmma fixes

* fix wmma

* Fix copyright
```
70a814f1

17 Jul, 2024 1 commit
- Replace the use of __expf with __ocml_exp_f32 to work around the test_softmax_rank4 failure (#1394) · ee768148
  Qianfeng authored Jul 18, 2024
  
  ee768148
16 Jul, 2024 1 commit

Adding more instances of grouped convolution 3d forward for FP8 with ConvScale... · 802a8a1d

Andriy Roshchenko authored Jul 16, 2024

Adding more instances of grouped convolution 3d forward for FP8 with ConvScale element-wise operation and ReLU activation. (#1386)

* Add CMakePresets configurations.

* Add ConvScale+ReLU Functor and an Example

* Account for ReLU FLOPs.

* Add instances of 3D convolutions with ConvscaleRelu operation.

* Implement Client Example

* Cleanup

802a8a1d

12 Jul, 2024 1 commit
- Support access per groups and filter3x3 in grouped conv fwd (#1382) · 82e8a78a
  Bartłomiej Kocot authored Jul 12, 2024
```
* Support access per groups and filter3x3 in grouped conv fwd

* Fixes for large cases

* Fixes for large tensors
```
  82e8a78a
06 Jul, 2024 1 commit

Universal streamk with atomics (#1360) · 75e622f0

Harisankar Sadasivan authored Jul 05, 2024

* Universal streamk with atomics, with ckprofiler support. grid_size and streamk strategy are tunable. A grid_size of -1 leads to #WGs = maximum occupancy × num_CUs. The implementation supports several streamk policies: 1-tile, 2-tile, 3-tile and 4-tile. A streamk strategy of -1 selects the default streamk policy (4-tile).

* Update README.md

* fixing clang-format issues

* removed conflicts in struct members between streamk and universal streamk

* corrected arg parsing for streamk and universal streamk

* added stream-k policies for 3 tile and 4 tile

* fixed argument type issue with parsing cmd args

* made changes suggested in PR review: removing comments and correcting copyright

* file permissions updated

* added default value support for grid_size and streamk-policy selection set to -1

* print messages for arguments

* print messages for arguments

* print messages for arguments1

75e622f0
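
The default handling described in the first bullet of this commit can be sketched as below. The struct, function, and parameter names are hypothetical and not CK's API; only the two rules (a grid_size of -1 means maximum occupancy times the number of CUs, and a strategy of -1 means the 4-tile policy) come from the commit message.

```cpp
// Hypothetical configuration type for illustration only.
struct StreamKConfig
{
    int grid_size;    // number of workgroups to launch
    int tile_policy;  // 1, 2, 3 or 4 (the N-tile stream-k policy)
};

// Resolve the -1 "use default" sentinels described in the commit message.
constexpr StreamKConfig ResolveStreamKDefaults(int grid_size,
                                               int strategy,
                                               int max_occupancy,
                                               int num_cus)
{
    return StreamKConfig{grid_size == -1 ? max_occupancy * num_cus : grid_size,
                         strategy == -1 ? 4 : strategy};
}
```

For example, with an assumed occupancy of 2 and 104 CUs, passing -1 for both arguments yields 208 workgroups and the 4-tile policy, while explicit values are passed through unchanged.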

04 Jul, 2024 2 commits
- Add structural sparsity xdlops (#1363) · eaa870a1
  jakpiase authored Jul 04, 2024
```
* Implemented smfmac xdlops

* add reviewer comments
```
  eaa870a1
- Fix issue with multiple targets and remove smfmac tests from unsupported test targets (#1372) · 95907384
  Jun Liu authored Jul 03, 2024
  
  95907384