- 09 Aug, 2024 1 commit
-
- 06 Aug, 2024 2 commits
-
-
bibek authored
* adding mha as static lib * add fmha fwd compile options * typo * fix python version * python version to 3 * increase path length * add max path flag in mha cmake * fix long path issue * mha currently only runs in gfx94x * only buld mha in mi300 * populate gpu_list * add mha compile flags * avoid building mha in gpu other then gfx94x * some comments and include ck_tile in rocm * use rocm_install * place ck_tile in include * correct ck_tile path --------- Co-authored-by:Illia Silin <98187287+illsilin@users.noreply.github.com>
-
Bartłomiej Kocot authored
* Support 64 bit indexing * Add new grouped conv fwd kernel for large tensors * Add instances large tensor * Fixes for transform conv to gemm * Fixes * fixes * Remove not needed instances * examples fixes * Remove not need ds arrays * Fix tests * Add 2GB check in gridwise dl * Fixes
-
- 05 Aug, 2024 1 commit
-
-
Illia Silin authored
* add --offload-compress compiler flag * only apply the --offload-compress flag to the ckProfiler * move the --offload-compress flag back to main cmake file * add offload-compress to target compile option of ckProfiler --------- Co-authored-by:carlushuang <carlus.huang@amd.com>
-
- 30 Jul, 2024 1 commit
-
-
Bartłomiej Kocot authored
-
- 24 Jul, 2024 1 commit
-
-
Andriy Roshchenko authored
Adding more instances of grouped convolution 3d forward for FP8 with ConvScale+Bias element-wise operation. (#1412) * Add CMakePresets configurations. * Add binary elementwise ConvScaleAdd and an example. * Numerical verification of results. Observed significant irregularities in F8 to F32 type conversions: ```log ConvScaleAdd: float=145.000000 f8_t=160.000000 e=144.000000 ConvScaleAdd: float=97.000000 f8_t=96.000000 e=104.000000 ConvScaleAdd: float=65.000000 f8_t=64.000000 e=72.000000 ``` * Implemented ConvScaleAdd + Example. * Add ConvScale+Bias Instances * Add Client Example for ConvScale+Bias * Fix number of bytes in an example.. * Cleanup.
-
- 23 Jul, 2024 1 commit
-
-
Haocong WANG authored
-
- 22 Jul, 2024 1 commit
-
-
Bartłomiej Kocot authored
-
- 19 Jul, 2024 2 commits
-
-
Haocong WANG authored
* add ab_scale init support * enabled interwave * add scale type; update isSupport * adjust example * clean * enable f8 pure gemm rcr ckprofiler * Add gemm_multiply_multiply instances * clang format * Optimize for ScaleBlockMNK=128 * enable abscale f8 gemm ck profiler * Add pure f8 gemm test suite * Reverting to the state of project at f60fd77 * update copyright * clang format * update copyright --------- Co-authored-by:root <jizhan@amd.com>
-
ltqin authored
* init for reduce_threadwise multi_d * add reduce_threadwise_multi_d * add reduce_multi_d * clean * start add an other splitk device op * add reduce template parameter to SplitKBatchOffset * add reduce c matrix * clean up code * change example data type to bf16 * add bf16Ai8B example * remove reduce template parameter * add splitk atomic status to v4 * example add multi d parameters * device op add multi-d parameters * add multi-d to reduce * fix kbach=1 bug * change B layout to col in bf16Ai8B example * remove float adding struct * change multi-d interface * change file and class name * remove multi-d of bf16Ai8B example * change IsReduce function to IsReduceAdd * change example layout to RRR from RCR * according layout to set ds stride * reset parameter layout * add gemm universal reduce instance * add reduce factory * add profile_gemm_universal_reduce * add reduce to profiler * fix reduce instance * fix profiler reduce compiling bug * format * format library instance code * add mem instance for reduce library * fix call instance names * add workspace for reduce in ckProfiler * format * add mnpading to reduce library instance * add fp16 instance to reduce of profiler * change copyright time * restore profiler cmake file * add reduce text to instances * add DsLayout and DsDataType to instances template parameter * fixed gemm_reduce_multi_d * add an example without multi_d * Update common.hpp * Update gtest.cmake * Update gemm_xdl_splitk_reduce_bf16.cpp * clean * Update gtest.cmake * format * fixe api * format * default parameter change to RRR * add vector_len for multi_d * format * Update gtest.cmake * fix bf16A iBB elementwiseop * add ReduceDataType * move ReduceDataType to end position * format * remove googletest git method address * fix copyright time * update init data --------- Co-authored-by:
root <jizhan@amd.com> Co-authored-by:
letaoqin <letaoqin@amd.com> Co-authored-by:
Jing Zhang <jizhan@meta.com> Co-authored-by:
zjing14 <zhangjing14@gmail.com>
-
- 16 Jul, 2024 2 commits
-
-
Andriy Roshchenko authored
Adding more instances of grouped convolution 3d forward for FP8 with ConvScale element-wise operation and ReLU activation. (#1386) * Add CMakePresets configurations. * Add ConvScale+ReLU Functor and an Example * Account for ReLU FLOPs. * Add instances of 3D convolutions with ConvscaleRelu operation. * Implement Client Example * Cleanup
-
Haocong WANG authored
-
- 12 Jul, 2024 1 commit
-
-
Bartłomiej Kocot authored
* Support access per groups and filter3x3 in grouped conv fwd * Fixes for large cases * Fixes for large tensors
-
- 11 Jul, 2024 1 commit
-
-
Rostyslav Geyyer authored
* Add an example * Add instances * Add a client example
-
- 09 Jul, 2024 1 commit
-
-
Illia Silin authored
* fix the cmake logic when building for various targets * another minor fix
-
- 08 Jul, 2024 1 commit
-
-
Andriy Roshchenko authored
-
- 06 Jul, 2024 1 commit
-
-
Harisankar Sadasivan authored
* universal streamk with atomics with ckprofiler support. grid_size and streamk strategy are tunable. grid_size of -1 leads to #WGs = maximum occupancy X num_CUs. implementation supports many different streamk policies: 1-tile, 2-tile, 3-tile and 4-tile. streamk strategy of -1 leads to default streamk policy (4-tile). * Update README.md * fixing clang-format issues * removed conflicts in struct members between streamk and universal streamk * corrected arg parsing for streamk and universal streamk * added stream-k policies for 3 tile and 4 tile * fixed argument type issue with parsing cmd args * changes suggested in PR review are made- removing comments and correcting copyright * file permissions updated * added default value support for grid_size and streamk-policy selection set to -1 * print messages for arguments * print messages for arguments * print messages for arguments1
-
- 27 Jun, 2024 2 commits
-
-
jakpiase authored
* first version of smfmac test * add reviewer comments * add reviewer suggestions
-
Illia Silin authored
-
- 22 Jun, 2024 1 commit
-
-
Andriy Roshchenko authored
Add instances of grouped convolution 3d forward with a ConvScale element-wise op for bf8@bf8->fp8 (#1326) We are adding more instances of grouped convolution 3d forward with a ConvScale element-wise operation. This commit handles bf8@bf8->fp8 data types combination. * Included an example. * Added instances. * Added a client example. --------- Co-authored-by:
Rostyslav Geyyer <rosty.geyyer@amd.com> Co-authored-by:
Bartłomiej Kocot <barkocot@amd.com>
-
- 18 Jun, 2024 1 commit
-
-
jakpiase authored
* switch to universal gemm in grouped gemm tile loop * minor fixes * add reviewers comments --------- Co-authored-by:Adam Osewski <19374865+aosewski@users.noreply.github.com>
-
- 12 Jun, 2024 1 commit
-
-
Rostyslav Geyyer authored
* Add fp8 bf8 conv example * Add instances * Add client example * Add random scale values * Format
-
- 11 Jun, 2024 1 commit
-
-
Bartłomiej Kocot authored
-
- 10 Jun, 2024 1 commit
-
-
Rostyslav Geyyer authored
* Update the element op * Add an example * Add instances * Add a client example * make sure new instances only build on gfx9 * Update element op and its handling * Format * Update instances to take element op as an argument * Update examples to use random scale values * Format * Update client example with random scales * Format --------- Co-authored-by:illsilin <Illia.Silin@amd.com>
-
- 05 Jun, 2024 2 commits
-
-
Bartłomiej Kocot authored
* Integrate universal gemm with conv fwd * Fix conv fwd wmma test * Fix instances * Remove direct load check
-
Rostyslav Geyyer authored
* Add a scale op * Update the element op * Add instances * Add an example * Add a client example * Add a flag check * Revert flag check addition * Fix flag check * Update d strides in example * Update d strides in client example * Apply suggestions from code review Update copyright header Co-authored-by:
Bartłomiej Kocot <barkocot@amd.com> * Move the example * Move the client example * Update element op * Update example with the new element op * Add scalar layout * Update example * Update kernel for scalar Ds * Revert kernel changes * Update element op * Update example to use scales' pointers * Format * Update instances * Update client example * Move element op to unary elements * Update element op to work with values instead of pointers * Update instances to take element op as an argument * Update examples to use random scale values --------- Co-authored-by:
Bartłomiej Kocot <barkocot@amd.com>
-
- 23 May, 2024 1 commit
-
-
Illia Silin authored
* split the gemm_multi_abd instances * update the dates
-
- 22 May, 2024 2 commits
-
-
Bartłomiej Kocot authored
* Optimize grouped conv bwd weight for small M and N * Fixes
-
Illia Silin authored
* set individual gpu targets for instances, examples, tests * fix path to hip compiler * fix path to hip compiler once more * aggregate device macros in ck_tile config header * fix the cmake logic for instances * fix clang format * add gfx900 and gfx906 to default set of targets
-
- 10 May, 2024 1 commit
-
-
Bartłomiej Kocot authored
-
- 08 May, 2024 1 commit
-
-
Bartłomiej Kocot authored
-
- 01 May, 2024 1 commit
-
-
Rostyslav Geyyer authored
-
- 29 Apr, 2024 1 commit
-
-
Rostyslav Geyyer authored
* Add a flag * Add flag check and messages --------- Co-authored-by:root <root@aus-g7-rogeyyer.amd.com>
-
- 26 Apr, 2024 3 commits
-
-
Haocong WANG authored
* Add bf16 instances * Add bf16 gemm universal example * tempsave * Add guard to navi compilation * workground on a specific mixed gemm instance ( bring back it when compiler fix upload) * fix formatting condition statement issue * solve conflict --------- Co-authored-by:Jun Liu <Liu.Jun@amd.com>
-
zjing14 authored
* Overload output stream operator for LoopScheduler and PiplineVersion * Add Run overload accepting grid descriptors MK. * Add __device__ keyword for CalculateGridSize * Create device op GroupedGemmMultipleD * Add GroupedGemm MultipleD Tile Loop implementation. * Add an example for GroupedGemm MultipleD tile loop. * Device Op GroupedGEMMTileLoop. * Bunch of small changes in exmaple. * CkProfiler * Remove unused tparam. * changed the copy function to v7r2 * adding multi_abd * in-progress * add post-load oob check * Fix include statement. * Fix output stream overloads. * Do not make descriptors and check validity untill we find group. * Fix gemm desc initialization. * debugging * adjust instances * add run_lds * add elemntwise_op * replace multi_abd_device with v3 * clean up * clean * clean * Revert device op * Fix compilation for DTYPES=FP16 * Validate tensor transfers paramters. * Added LDSType * profiling * adjust oobcheck * add missing file * Validate on host only NK dims if M is not known. * add * clean * refactor * clean * add examples * add fuse * add fusion and client example * Fix bug. * A convenient debug func for selecting threads. * Fix has main k block loop bug. * Make sure that b2c has up to date tile offset. * Output stream operator for Sequence type. * Cmake file formatting. * clean --------- Co-authored-by:Adam Osewski <Adam.Osewski@amd.com>
-
zjing14 authored
* changed the copy function to v7r2 * adding multi_abd * in-progress * add post-load oob check * debugging * adjust instances * add run_lds * add elemntwise_op * replace multi_abd_device with v3 * clean up * clean * clean * Added LDSType * profiling * adjust oobcheck * add missing file * refactor * clean * add examples
-
- 25 Apr, 2024 1 commit
-
-
Adam Osewski authored
* Overload output stream operator for LoopScheduler and PiplineVersion * Add Run overload accepting grid descriptors MK. * Add __device__ keyword for CalculateGridSize * Create device op GroupedGemmMultipleD * Add GroupedGemm MultipleD Tile Loop implementation. * Add an example for GroupedGemm MultipleD tile loop. * Device Op GroupedGEMMTileLoop. * Bunch of small changes in exmaple. * CkProfiler * Remove unused tparam. * Fix include statement. * Fix output stream overloads. * Do not make descriptors and check validity untill we find group. * Fix gemm desc initialization. * Revert device op * Fix compilation for DTYPES=FP16 * Validate tensor transfers paramters. * Validate on host only NK dims if M is not known. * Fix bug. * A convenient debug func for selecting threads. * Fix has main k block loop bug. * Make sure that b2c has up to date tile offset. * Output stream operator for Sequence type. * Cmake file formatting.
-
- 19 Apr, 2024 2 commits
-
-
Bartłomiej Kocot authored
* Refactor elementwise kernels * Instances fixes * Fix cmake * Fix max pool bwd test * Update two stage gemm split k * Restore elementwise scale for hiptensor backward compatiblity * Fix Acc data type check in conv fwd multiple abd * Disable conv fp64 fwd example * Update grouped conv weight multi d
-
jakpiase authored
* added bf16 and bf16@int8 mk_nk_mn instances * fix preprocessor guards
-
- 18 Apr, 2024 1 commit
-
-
Bartłomiej Kocot authored
* Add grouped conv bwd weight multi d kernel * Reference fix * Fix cmake files * bwd weight scale only xdl * Fixes * Fix client conv fwd example
-