Commits · 2e3183af4f2c8f15650eacb6a42eac6df1340141 · gaoqiong / composable_kernel_ROCM

31 Jan, 2025 1 commit

Codegen hipRTC compilation (#1579) · 2e3183af

arai713 authored Jan 31, 2025



* updating codegen build for MIOpen access: adding .cmake for codegen component

* updating CMake

* adding in header guards for some headers due to issues with hiprtc compilation in MIOpen

* some more header guards

* putting env file in header guard

* cleaning up some includes

* updated types file for hiprtc purposes

* fixed types file: bit-wise/memcpy issue

* updating multiple utility files to deal with standard header inclusion for hiprtc

* added some more header guards in the utility files, replacing some standard header functionality

* added some more header guards

* fixing some conflicts in utility files, another round of header guards

* fixing errors in data type file

* resolved conflict errors in a few utility files

* added header guards/replicated functionality in device files

* resolved issues with standard headers in device files: device_base and device_grouped_conv_fwd_multiple_abd

* resolved issues with standard headers in device files: device_base.hpp, device_grouped_conv_fwd_multiple_abd.hpp, device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp

* added header guards for gridwise gemm files: gridwise_gemm_multiple_abd_xdl_cshuffle.hpp and gridwise_gemm_multiple_d_xdl_cshuffle.hpp

* fixed issue with numerics header, removed from transform_conv_fwd_to_gemm and added to device_column_to_image_impl, device_grouped_conv_fwd_multiple_abd_xdl_cshuffle, device_grouped_conv_fwd_multiple_abd_xdl_cshuffle_v3, device_image_to_column_impl

* replaced standard header usage and added header guards in block to ctile map and gridwise_gemm_pipeline_selector

* resolved errors in device_gemm_xdl_splitk_c_shuffle files in regards to replacement of standard headers in previous commit

* added replicated functionality for standard header methods in utility files

* replaced standard header functionality in threadwise tensor slice transfer files and added header guards in element_wise_operation.hpp

* temp fix for namespace error in MIOpen

* remove standard header usage in codegen device op

* removed standard header usage in elementwise files, resolved namespace errors

* formatting fix

* changed codegen argument to ON for testing

* temporarily removing codegen compiler flag for testing purposes

* added codegen flag again, set default to ON

* set codegen flag default back to OFF

* replaced enable_if_t standard header usage in data_type.hpp

* added some debug prints to pinpoint issues in MIOpen

* added print outs to debug in MIOpen

* removed debug print outs from device op

* resolved stdexcept include error

* formatting fix

* adding includes to new fp8 file to resolve ck::enable_if_t errors

* made changes to amd_wave_read_first_lane

* updated functionality in type utility file

* fixed end of file issue

* resolved errors in type utility file, added functionality to array utility file

* fixed standard header usage replication in data_type file, resolves error with failing examples on navi3x

* formatting fix

* replaced standard header usage in amd_ck_fp8 file

* added include to random_gen file

* removed and replicated standard header usage from data_type and type_convert files for fp8 changes

* replicated standard unsigned integer types in random_gen

* resolved comments from review: put calls to reinterpret_cast for size_t in header guards

* updated/added copyright headers

* removed duplicate header

* fixed typo in header guard

* updated copyright headers

---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
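
Several items in this commit replace standard-header functionality (for example `enable_if_t`) with local equivalents so that headers still compile under hipRTC, where standard library headers may not be available. The following is a minimal sketch of that pattern, not the code from the commit. The macro `DEMO_RTC_BUILD`, the namespace `demo`, and the helpers `is_integral_like` and `twice` are hypothetical names chosen for illustration.

```cpp
// DEMO_RTC_BUILD is a hypothetical macro: assume it is defined only when the
// header is compiled by a runtime compiler that cannot see <type_traits>.
#ifndef DEMO_RTC_BUILD
#include <type_traits> // available in a normal offline build
#endif

namespace demo {

// Local replacement for std::enable_if / std::enable_if_t, so that SFINAE-based
// templates do not depend on a standard header.
template <bool Cond, typename T = void>
struct enable_if
{
};

template <typename T>
struct enable_if<true, T>
{
    using type = T;
};

template <bool Cond, typename T = void>
using enable_if_t = typename enable_if<Cond, T>::type;

// Tiny local trait (illustrative only), again avoiding <type_traits>.
template <typename T>
struct is_integral_like
{
    static constexpr bool value = false;
};

template <>
struct is_integral_like<int>
{
    static constexpr bool value = true;
};

template <>
struct is_integral_like<long>
{
    static constexpr bool value = true;
};

// Participates in overload resolution only for the types marked above.
template <typename T, enable_if_t<is_integral_like<T>::value, bool> = false>
constexpr T twice(T x)
{
    return x + x;
}

} // namespace demo

#ifndef DEMO_RTC_BUILD
// In a normal build, check that the local alias agrees with the standard one.
static_assert(std::is_same<demo::enable_if_t<true, int>, std::enable_if_t<true, int>>::value,
              "local enable_if_t should match std::enable_if_t");
#endif
```

With the guard macro undefined, `demo::twice(21)` compiles and the `static_assert` cross-checks the local alias against the standard one. With the macro defined, the same header no longer pulls in `<type_traits>`.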

2e3183af

25 Apr, 2024 1 commit

Grouped GEMM Multiple D tile loop. (#1247) · b4032629

Adam Osewski authored Apr 25, 2024

* Overload output stream operator for LoopScheduler and PipelineVersion

* Add Run overload accepting grid descriptors MK.

* Add __device__ keyword for CalculateGridSize

* Create device op GroupedGemmMultipleD

* Add GroupedGemm MultipleD Tile Loop implementation.

* Add an example for GroupedGemm MultipleD tile loop.

* Device Op GroupedGEMMTileLoop.

* Bunch of small changes in example.

* CkProfiler

* Remove unused tparam.

* Fix include statement.

* Fix output stream overloads.

* Do not make descriptors and check validity until we find group.

* Fix gemm desc initialization.

* Revert device op

* Fix compilation for DTYPES=FP16

* Validate tensor transfers parameters.

* Validate on host only NK dims if M is not known.

* Fix bug.

* A convenient debug func for selecting threads.

* Fix has main k block loop bug.

* Make sure that b2c has up to date tile offset.

* Output stream operator for Sequence type.

* Cmake file formatting.

b4032629

11 Oct, 2023 2 commits

Revert "Grouped Gemm with looping over the tiles. (#788)" (#982) · c99323be
zjing14 authored Oct 11, 2023
```
This reverts commit a4f72a31.
```
c99323be

Grouped Gemm with looping over the tiles. (#788) · a4f72a31

Adam Osewski authored Oct 11, 2023



* Introduce LocalBlockToCTileMap.

* Change the signature of the CalculateBottomIndex() function, which no longer
accepts any argument. The B2C map, which is already passed as an
argument to the kernel Run function, now computes the block's local ID
outside, at the kernel entry point (the __global__ function).
The LocalB2C map stores the local block ID as a member.

* Use LocalBlockToCTile map in device ops.

* First draft of tile loop work distribution.

* Fix typo.

* Simplify kernel arguments.

Calculate descriptors & B2C maps on the device.

* Use looping kernel.

* Fix B2C constructor.

* Fix Navi21 errors.

* Calculate tile start/end in device kernel.

* Change Run API to accept user provided workspace buffer.

* Add new line at EOF.

* Move Gemm KernelArguments to device op interface.

* Remove unused code.

* Update API.

* Launch grid size which is min of occupancy vs tile count

* Get back to use constant memory for gemm descriptors.

* Remove unused code.

* Add default virtual method implementation.

* Update comments to conform with doxygen style.

* Fix doc style and unused parameters.

* Add thread cluster lengths to kernel name.

* Remove old splitk impl and replace it with tile looping one.

* Modify instances.

* set KPerBlock to 64
* maximize vector load size wherever possible.

* Fix instances cluster lengths.

* Change comment style.

* Use 128b store where possible in instances.

* Update test cases, since KPerBlock has doubled.

* Update output stream operator for Sequence.

* Add pipeline version to GroupedGEMM device op type string.

* Fix pipeline version type logging.

* Fix input tensors type after merge.

* Fix compiler error.

* Fix output stream operator for Pipeline version.

* Store using 128b.

* Set of instances with kpb 32/64

* Limit number of instances

* Remove commented out instances.

* Fix function name.

* Limit the number of instances.

Add pipeline version to the regular instances

* Change thr cluster layout for reading B tensor.

* disabled failed instances

---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
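
The items "Launch grid size which is min of occupancy vs tile count" and "Calculate tile start/end in device kernel" describe the core of the tile-loop work distribution. The following host-side sketch simulates that idea under stated assumptions: each block is given one contiguous range of tiles, and the names `launch_grid_size`, `simulate_tile_loop` and `max_resident_blocks` are hypothetical. It is not the kernel from the commit.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

namespace demo {

// Never launch more blocks than can be resident at once (occupancy limit),
// and never more blocks than there are tiles to compute.
inline std::size_t launch_grid_size(std::size_t max_resident_blocks, std::size_t tile_count)
{
    return std::min(max_resident_blocks, tile_count);
}

// Host-side simulation of the device loop. Returns how many times each tile
// was processed; a correct distribution visits every tile exactly once.
inline std::vector<int> simulate_tile_loop(std::size_t max_resident_blocks,
                                           std::size_t tile_count)
{
    std::vector<int> visits(tile_count, 0);
    const std::size_t grid = launch_grid_size(max_resident_blocks, tile_count);
    if(grid == 0)
        return visits; // nothing to launch

    // Assumption: tiles are split into contiguous, equally sized ranges
    // (rounded up), one range per block.
    const std::size_t tiles_per_block = (tile_count + grid - 1) / grid;

    for(std::size_t block_id = 0; block_id < grid; ++block_id)
    {
        // On the device this would be computed per block from its block ID.
        const std::size_t tile_start = block_id * tiles_per_block;
        const std::size_t tile_end   = std::min(tile_start + tiles_per_block, tile_count);

        for(std::size_t tile = tile_start; tile < tile_end; ++tile)
            ++visits[tile]; // stand-in for computing one output tile
    }
    return visits;
}

} // namespace demo
```

For example, with room for 4 resident blocks and 10 tiles, the sketch launches 4 blocks that handle 3, 3, 3 and 1 tiles respectively.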

a4f72a31

31 May, 2023 1 commit
- update copyright headers (#726) · b94fd0b2
  Illia Silin authored May 31, 2023
  
  b94fd0b2
07 Jul, 2022 1 commit

N-D Tensor Contraction example, instance, and client example (#270) · 4fe9c393

Chao Liu authored Jul 07, 2022

* adding contraction

* add contraction example

* update example

* update example

* format

* update readme

* clean header

* clean header

* contraction with multiple D

* rename

* fix naming issue; add instances for contraction+bilinear

* change assumed virtual layout of contraction; add client example

* update example

* update

* contraction+scale

* use type_convert

* rename

4fe9c393

25 Jun, 2022 1 commit
- add license in file (#303) · d3051d75
  Chao Liu authored Jun 24, 2022
  
  d3051d75
19 Jun, 2022 1 commit

GEMM with Multiple Source, GEMM+Bias+Add+FastGeLU example and ckProfiler (#241) · 56adf7e9

Chao Liu authored Jun 19, 2022

* add gelu and fast_gelu

* added GeLU and fast GeLU

* clean up

* add gemm+fastgelu example

* add gemm+gelu instances

* update profiler

* clean up

* clean up

* adding gemm+bias+activation

* clean

* adding bias

* clean

* adding gemm multiple d

* debugging

* add gemm bias add fastgelu

* rename, clean

* refactoring; add readme

* refactor

* refactor

* refactor

* refactor

* refactor

* refactor

* fix

* fix

* update example

* update example

* rename

* update example

* add ckProfiler

* clean

* clean

* clean

* clean

* add comment

* use type_convert

* clean

* clean element wise op

56adf7e9

22 Mar, 2022 1 commit

Reduction for int8 and bfloat16 (#125) · 9a8ee8a3

Qianfeng authored Mar 23, 2022



* Use thread cluster descriptor and explicit M_K 2d descriptor to simplify Blockwise Reduction

* Replace ReduceDims with NumReduceDims as the Device Reduce interface template parameter

* Rename the folder name for the pool2d and reduce examples

* Update to reduction test scripts

* Add Readme for pool2d_fwd and reduce_blockwise examples

* Add support for int8_t reduction (ADD/AVG, MIN/MAX/AMAX)

* Tiny fix in reduce profiler and tiny update in reduce testing scripts

* Tiny fix in testing script profile_reduce_no_index.sh

* Tiny fix in testing script profile_reduce_no_index.sh

* Add support for bfp16 reduction (using bhalf_t = ushort)

* Tiny fix in amd_buffer_addressing.hpp

* Tiny change in script/profile_reduce_with_index.sh

* Use AccDataType for Beta value and use element_wise::PassThrough

* Use type_convert for type converting in host layer reduction

* Renaming and refining in Reduction profiler/device layer/examples

* Renaming and refining in Reduction profiler/device layer/examples

* Renaming all NumReduceDims to NumReduceDim

* Fix the leaked type_convert in ThreadwiseTensorSliceTransfer_v2

* Update to testing scripts to add bf16 support

* added more static_assert

* Remove buggy tunable configurations defined in device_reduce_instance_xxx.hpp

* Add static_assert to give compile-time warning for incorrect thread slice-size/vector-size configurations

* minor change

* Refine and fix (in GetWorkspaceSizeInBytes of MultiBlockPartialReduce) to make int8 completely pass

* Tiny renaming in gridwise_2d_reduction_multiblock_partial_reduce.hpp

* Tiny fix in script/profile_reduce_no_index.sh

* Refine in DeviceReduce layer with regard to using NumInvariantDim/NumReduceDim or InvariantDims/ReduceDims

* Generic renaming in host reduction and DeviceReduce layer

* Add support for 4-d all dimension reduction in the profiler and add_device_reduce_xxx instances

* Use multi-thread and simplification for host Reduction implementation

* Add ctest for reduction

* Update to clarify the use of the data init method in produce_reduce/example_reduce/test_reduce/

* Update to the reduce CTest executables to enable default testing behavior when no command-line argument is given

* Renaming
Co-authored-by: Jianfeng yan <jfyan008@gmail.com>
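
The item "Add support for bfp16 reduction (using bhalf_t = ushort)" stores bfloat16 values in a 16-bit unsigned integer. A bfloat16 value has the same layout as the upper 16 bits of an IEEE 754 float32, so conversion can be sketched with bit shifts. The sketch below is illustrative only: the conversion truncates rather than rounds, accumulation in `float` is an assumption, and the function names are hypothetical rather than taken from the commit.

```cpp
#include <cstdint>
#include <cstring>

namespace demo {

// 16-bit storage type for bfloat16, mirroring "bhalf_t = ushort" in the commit.
using bhalf_t = std::uint16_t;

// float32 -> bfloat16 by keeping the upper 16 bits (truncation, no rounding).
inline bhalf_t float_to_bf16_truncate(float x)
{
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits)); // well-defined bit copy
    return static_cast<bhalf_t>(bits >> 16);
}

// bfloat16 -> float32 by placing the 16 bits in the upper half of the word.
inline float bf16_to_float(bhalf_t h)
{
    const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float x;
    std::memcpy(&x, &bits, sizeof(x));
    return x;
}

// ADD reduction over bfloat16 inputs, accumulating in float (assumed here as
// the wider accumulator type) to limit precision loss.
inline float reduce_add(const bhalf_t* data, int n)
{
    float acc = 0.0f;
    for(int i = 0; i < n; ++i)
        acc += bf16_to_float(data[i]);
    return acc;
}

} // namespace demo
```

Values such as 1.0, 2.0 and 0.5 are exactly representable in bfloat16, so they survive the round trip unchanged. Other values lose their low 16 mantissa bits under truncation.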

9a8ee8a3

09 Mar, 2022 1 commit

Reorganize files, Part 1 (#119) · 5d37d7bf

Chao Liu authored Mar 08, 2022

* delete obsolete files

* move files

* build

* update cmake

* update cmake

* fix build

* reorg examples

* update cmake for example and test

5d37d7bf

19 Aug, 2021 1 commit

Composable kernel init integration v3 (#1097) · 6fe3627a

Chao Liu authored Aug 19, 2021

* Squashed 'src/composable_kernel/' content from commit f6edda61

git-subtree-dir: src/composable_kernel
git-subtree-split: f6edda61

* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

* Squashed 'src/composable_kernel/' changes from f6edda61..5781adf5

5781adf5 Update develop (#5) (#6)
97e6d514 Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile
7b1ec41e refactor
49c33aae refactor
54b3e73d rename

git-subtree-dir: src/composable_kernel
git-subtree-split: 5781adf5



* fix

* refactor

* remove online compilation from CK

* refactor

* fix

* add ctest

* add c-style pointer cast

* vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast

* fix clang warning suppression

* tidy

* suppress cppcheck

* fix enum issue

* revert changes to hip build

* fix kernel filename

* update CK build script

* rename

* rename

* make inner product compatible on gfx900

* Update src/include/miopen/solver/ck_utility_common.hpp
Co-authored-by: JD <Jehandad.Khan@amd.com>

* compiler parameter use stream

* use int instead of index_t in kernel wrapper

* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element

* refactor

* refactor

* change cmakelist

* change ck common utility

* fix
Co-authored-by: JD <Jehandad.Khan@amd.com>

6fe3627a

09 Aug, 2021 1 commit
- tidy · 54fba515
  Chao Liu authored Aug 09, 2021
  
  54fba515
25 Mar, 2021 1 commit

Dynamic tensor descriptor (#24) · fcbb9788

Chao Liu authored Mar 25, 2021



* support dynamic tensor descriptor

* use buffer load OOB feature for padding case

* add navi support

* add int8x4 inference kernel
Co-authored-by: Chao Liu <chao@ixt-rack-81.local.lan>
Co-authored-by: Jing Zhang <jizhan@amd.com>

fcbb9788

26 Sep, 2019 1 commit
- removing dependency on old tensor descriptor · 51a9fa1d
  Chao Liu authored Sep 26, 2019
  
  51a9fa1d
25 Sep, 2019 1 commit
- adding GetLinearDimensionMask() · 4f4aba48
  Chao Liu authored Sep 24, 2019
  
  4f4aba48
24 Sep, 2019 1 commit
- refactor · 545d9305
  Chao Liu authored Sep 24, 2019
  
  545d9305
22 Sep, 2019 1 commit
- WIP: explicitly separate offset component into compile-time, block-invariant... · 51884fc2
  Chao Liu authored Sep 21, 2019
```
WIP: explicitly separate offset component into compile-time, block-invariant and per-thread components
```
  51884fc2
21 Sep, 2019 1 commit
- adding logic to judge linear dimension · f00c1381
  Chao Liu authored Sep 20, 2019
  
  f00c1381
11 Sep, 2019 1 commit
- enabling padding for chwn format · 724e984b
  Chao Liu authored Sep 11, 2019
  
  724e984b
10 Sep, 2019 1 commit
- adding merge transform · ca42e910
  Chao Liu authored Sep 10, 2019
  
  ca42e910
09 Sep, 2019 1 commit
- more utility code · 7a7fe160
  Chao Liu authored Sep 09, 2019
  
  7a7fe160
05 Sep, 2019 1 commit
- adding dimension transformation · 0c05f427
  Chao Liu authored Sep 05, 2019
  
  0c05f427
06 Aug, 2019 2 commits
- added ReorderGiveOld2New() in Sequence and ConstantTensorDescriptor · 0271338e
  Chao Liu authored Aug 06, 2019
  
  0271338e
- reimplement threadwise copy · fdcfae3a
  Chao Liu authored Aug 06, 2019
  
  fdcfae3a
03 Aug, 2019 1 commit
- added new tensor copy operator · c01af899
  Chao Liu authored Aug 03, 2019
  
  c01af899
29 Jul, 2019 1 commit
- adding implicit gemm v4r4 · 9ba3b491
  Chao Liu authored Jul 28, 2019
  
  9ba3b491
20 Jun, 2019 1 commit
- refactor · 37b82b7e
  Chao Liu authored Jun 19, 2019
  
  37b82b7e
18 Jun, 2019 1 commit
- clean up for miopen · 23f633cd
  Chao Liu authored Jun 17, 2019
  
  23f633cd
17 Jun, 2019 2 commits
- refactoring · 9d59a39a
  Chao Liu authored Jun 17, 2019
  
  9d59a39a
- refactoring for miopen · 33d1e0e2
  Chao Liu authored Jun 17, 2019
  
  33d1e0e2
13 Jun, 2019 1 commit
- reorganized files · 1566b317
  Chao Liu authored Jun 13, 2019
  
  1566b317
12 Jun, 2019 1 commit
- reorganize files · 81497a93
  Chao Liu authored Jun 11, 2019
  
  81497a93
11 Jun, 2019 2 commits
- rename files, added header guard, added namespace · 88b77181
  Chao Liu authored Jun 11, 2019
  
  88b77181
- remove .hip extension · 05e04665
  Chao Liu authored Jun 11, 2019
  
  05e04665
07 Jun, 2019 1 commit
- use more constexpr for Array · 0a386c46
  Chao Liu authored Jun 06, 2019
  
  0a386c46
06 Jun, 2019 1 commit
- refactor · 7a89684f
  Chao Liu authored Jun 06, 2019
  
  7a89684f
05 Jun, 2019 1 commit
- use more constexpr · 709f13a6
  Chao Liu authored Jun 04, 2019
  
  709f13a6
04 Jun, 2019 1 commit
- try using more constexpr · 498e71b0
  Chao Liu authored Jun 04, 2019
  
  498e71b0
30 May, 2019 1 commit
- adding implicit gemm v4 (nchw, kcyx) · b2439ec9
  Chao Liu authored May 30, 2019
  
  b2439ec9
24 May, 2019 1 commit
- adding implicit gemm v3 · 1cc683a3
  Chao Liu authored May 23, 2019
  
  1cc683a3