- 10 Feb, 2025 2 commits
-
-
M.Emin Ozturk authored
-
Mingtao Gu authored
* remove redundant kernels. * added batched_gemm_xdl_fp16int4_b_scale_v3 * Enabled the split K. * added the batched_gemm_b_scale ckProfiler, meet function issue * fix some typo * fix ckProfiler build issue * fix some bugs * updated some debug info * comment some code * Fix * fixed some bugs and refactor the code * fixed a function bug. * formatted files. * formatted * uncommented the ckProfiler CMakeLists * fixed. * fix ckProfiler for batched_gemm_b_scale
---------
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
-
- 06 Feb, 2025 1 commit
-
-
ozturkosu authored
-
- 20 Jan, 2025 1 commit
-
-
deepsek authored
* Feat: Add bf16 input instances * feat: Add BF16 profiler code * fix: reorder enum types * fix: CI fail due to clang-format * fix: clang script format issue * fix: clang format broke cmakelist file
-
- 17 Jan, 2025 1 commit
-
-
deepsek authored
* fix: preprocessors logic error if/else * fix: added macros as preferred by CK team
-
- 13 Jan, 2025 1 commit
-
-
feli authored
* port tiles from a8w8 * rm debug used files * add instances * remove all non gemm in cmake * merge; impl fp16 * recover cmake from develop * add missed files; fix clang format
---------
Co-authored-by: coderfeli <coderfeli@163.com>
-
- 03 Jan, 2025 1 commit
-
-
Mingtao Gu authored
* enable int4 scale (weight only) kernel * format some files * Add unit test for int4 weight only * fixed and formatted code * fixed * formatted * formatted * fixed * fixed a bug in the ckProfiler, and formatted the code
---------
Co-authored-by: mtgu0705 <mtgu@amd.com>
-
- 02 Jan, 2025 2 commits
-
-
Muhammed Emin Ozturk authored
* initial * Cmake file * successful compilation but validation failed * Cmake * update * gpu validation * gemm universal * gemm universal sk update * sk bf16 universal instance * gemm_universal_streamk.hpp * only build for gfx94 * Cmakelist * profiler update, bf16 sk only works at gfx42 * clang * clang * clang all * no need flags * cmake script * delete comment * gemm universal sk fix * clang * profiler fix * clang * update * update * delete comment * code formatting * cmake * fix instance * clang * argument supported * argument supported and clang * update * fix * removing unnecessary comments * clang formatting
* Update library/src/tensor_operation_instance/gpu/CMakeLists.txt Co-authored-by: afagaj <john.afaganis@gmail.com>
* CopyRight Comment 2025 * clang reformatting * copy right 2025
---------
Co-authored-by: Emin Ozturk <ozturk.27@osu.edu>
Co-authored-by: root <root@ctr-ubbsmc16.amd.com>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-008.hpcfund>
Co-authored-by: root <root@splinter-126-wr-d3.amd.com>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t006-001.hpcfund>
Co-authored-by: Muhammed Emin Ozturk <meozturk@login1.hpcfund>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-004.hpcfund>
Co-authored-by: Emin Ozturk <emin.ozturk@utah.edu>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t008-001.hpcfund>
Co-authored-by: afagaj <john.afaganis@gmail.com>
-
Adam Osewski authored
* add a prototype of int4 * clean * debug * clean * clean * move packed into dynamic_buffer * fixed coord reset * add fast pki4 to half conversion * fix * fixed reference and host_tensor * fixed tensor init * format * debug i4_to_f16_convert * format * fixed splitk * weight permute * add b tile permute * clean * weight permute with splitki * format * improve weight layout * add and_or_b32 * fixed splitk crash * add permute switch as a template * recover v3r1 * clean * failure with intrawave v2 * fixed * fixed * add ckProfiler * add bfp16 support * add bf16 example * fixed int4 to bhalf_t conversion * format * fixed int4 to bf16 conversion * clean * add instances for mem * clean * fixed host tensor size * fixed * debug * fixed * add pk_i4_t as a struct * fix
* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* revert
* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* fixed comments * revert * clean * revert * revert * fixed * Update CMakeLists.txt
* Update script/cmake-ck-dev.sh Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update include/ck/tensor_operation/gpu/element/unary_element_wise_operation.hpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update CMakeLists.txt Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* fixed * fixed * fixed * revert * revert * add comments * format * fixed assert * fixed * Fix I4 define in ckProfiler * Fixed example_gemm_xdl_bf16_pk_i4_v3 test failed issue
---------
Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>
-
- 13 Dec, 2024 1 commit
-
-
Bartłomiej Kocot authored
* add bmm api * add bf16 multi_d * add ckProfiler for bf16 * add ckProfiler files * add more instance; fixed 64bit index issue * fixed naming * enabled batched Ds * use long_index for ds offsets * clean * add bmm fp8 ckProfiler
* Update example/24_batched_gemm/batched_gemm_xdl_bf16_v3.cpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* Update example/24_batched_gemm/batched_gemm_xdl_fp8_rowwise_v3.cpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* Update example/24_batched_gemm/run_batched_gemm_example_rowwise.inc Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn.hpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_mem_v1_default_instance.cpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_mem_v2_default_instance.cpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* Update profiler/src/profile_gemm_universal_batched.cpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* Update profiler/include/profiler/profile_gemm_universal_batched_impl.hpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* clean * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp * Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_comp_default_instance.cpp * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp * refactor batch offset func * add splitk support into bmm_v3 * clean * clean * format * fixed * fix
---------
Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
-
- 27 Nov, 2024 1 commit
-
-
Adam Osewski authored
* Few small fixes. * New GroupedGemm instances (BF16) * Unify and refactor GroupedGEMM device API. * Adapt changes to new API. * Adapt grouped gemm profiler. * Accept multiple kbatches for grouped gemm profiler. - delete obsolete two stage as it is now covered by grouped gemm * Update unit test for grouped gemm. * Fix thresholds for BF16 and F8. Unblock tests. * Fix few instances. * Multiple small fixes. * Adapt to new API, check dynamic casting. * Uncomment few data types in grouped gemm profiler. * Fix call to SetDeviceArgs. * Fix profile grouped gemm multiply tile loop. * Fix grouped gemm tile loop kernel args in client examples. * Review comments.
-
- 21 Nov, 2024 1 commit
-
-
Harisankar Sadasivan authored
* universal streamk fp8 changes & ckprofiler instances * revert strides to -1 and verification options * fp8 exclusion on pre-gfx94 for universal_streamk * PR review based revisions: permissions reverted, removed hip err checks
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
-
- 18 Nov, 2024 1 commit
-
-
Bartłomiej Kocot authored
* Batched GEMM Multiple D based on Universal GEMM Co-authored-by: Jing Zhang <jizhan@fb.com>
* CI fixes Co-authored-by: Jing Zhang <jizhan@fb.com>
---------
Co-authored-by: Jing Zhang <jizhan@fb.com>
-
- 15 Nov, 2024 1 commit
-
-
Illia Silin authored
-
- 06 Nov, 2024 1 commit
-
-
rocking authored
-
- 01 Nov, 2024 1 commit
-
-
Illia Silin authored
* disable fp8 gemm_universal on gfx90a and gfx908 by default * fix cmake syntax * fix clang format * add ifdefs in amd_xdlops * disable fp8 gemm instances on gfx90a by default * update readme
-
- 26 Oct, 2024 1 commit
-
-
valarLip authored
* add int8 gemm multiply multiply a8w8 * uncomment * clang-format-12 * Add example_gemm_multiply_multiply_xdl_int8 * Remove shell scripts * update preprocess number for mi308; bring back printout in ckprofiler * format
---------
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: Haocong WANG <haocwang@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
-
- 23 Oct, 2024 1 commit
-
-
Bartłomiej Kocot authored
-
- 22 Oct, 2024 1 commit
-
-
Bartłomiej Kocot authored
* Enable grouped conv bwd wei bf16 NGCHW * fixes * fixes * Fixes * fixes * fixes * Fixes
-
- 21 Oct, 2024 1 commit
-
-
Thomas Ning authored
* The draft on ckProfiler instance add * support the ck profiler instance with same data types * add a small feature on the M and N variable switch. * Partially solve the incorrect result problem * fix based on ci cd
-
- 07 Oct, 2024 1 commit
-
-
Illia Silin authored
* update build logic with GPU_ARCHS * fix the GPU_ARCHS build for codegen * unset GPU_TARGETS when GPU_ARCHS are set
-
- 20 Sep, 2024 1 commit
-
-
Bartłomiej Kocot authored
* Support NGCHW in grouped conv fwd * Remove not needed variable * Fixes
-
- 17 Sep, 2024 1 commit
-
-
aledudek authored
* Extend pool3d fwd avg, max operations by f8_t, int8_t types * Pack MaxPool3dFwd params together * Fix MaxPool3dFwd AVG instances * Decrease verification precision for bf16 * Adjust tests + review changes * Adjust threshold for F8 * Adjusted compute types for MAX op instances * Fix ComputeDataType mismatch in tests and profiler for AVG * Fix naming from max_pool3d_fwd to pool3d_fwd * Adjust CMakeLists
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
-
- 13 Sep, 2024 1 commit
-
-
jakpiase authored
* add pool2d fp8 and int8 * minor fixes * add formatting * add reviewer suggestions * add reviewer suggestions
-
- 12 Sep, 2024 1 commit
-
-
Mateusz Ozga authored
* Add pool2d instance BWD AVG * Add pool2d instance BWD MAX * Fix: avg review * Fix review: part2 * Fix - enable test when type is compiled * Fix review part3
-
- 11 Sep, 2024 1 commit
-
-
jakpiase authored
* added pool2d fwd * add tests * add reviewers changes * Revert "Merge remote-tracking branch 'origin/develop' into jakpiase/pool2d_fwd_new" This reverts commit 6b2ba7ff8960b0a6ddbe30d8dac53eeb55a8597e, reversing changes made to 22c82bea0caf3e0f29399100c1bb67b8003fc042. * Revert "add reviewers changes" This reverts commit 22c82bea0caf3e0f29399100c1bb67b8003fc042. * added reviewers comments * revert some old files * add reviewers requests
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
-
- 05 Sep, 2024 1 commit
-
-
Haocong WANG authored
* revert ckprofiler change * temp save * Add test and test pass * test pass * Fix bug inside rotating buffer when tensor is not packed * bug fix * clang format
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
-
- 03 Sep, 2024 1 commit
-
-
Bartłomiej Kocot authored
* Add support for NGCHW in grouped conv bwd wei * Comments fixes * navi fixes * Update function names
-
- 19 Aug, 2024 1 commit
-
-
Bartłomiej Kocot authored
* Add script to convert MIOpen driver to ckProfiler * Fix
-
- 14 Aug, 2024 1 commit
-
-
Haocong WANG authored
* replace buffer_atomic with global_atomic * fixed global_atomic_add * added bf16 atomic_add * format * clang-format-12 * clean * clean * add guards * Update gtest.cmake * enabled splitk_gemm_multi_d * format * add ckProfiler * format * fixed naming * format * clean * clean * add guards * fix clang format * format * add kbatch printout * clean * Add rocm6.2 related gemm optimization * Limit bf16 atomic usage * remove redundant RCR gemm_universal instance * Add RRR fp8 gemm universal instance * Bug fix * Add GPU_TARGET guard to FP8/BF8 target * bug fix * update cmake * remove all fp8/bf8 example if arch not support * Enable fp8 RRR support in ckProfiler * limit greedy-reverse flag to gemm_universal in ckProfiler
---------
Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: Jing Zhang <jizhan@meta.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
-
- 06 Aug, 2024 2 commits
-
-
Haocong WANG authored
-
Bartłomiej Kocot authored
* Support 64 bit indexing * Add new grouped conv fwd kernel for large tensors * Add instances large tensor * Fixes for transform conv to gemm * Fixes * fixes * Remove not needed instances * examples fixes * Remove not need ds arrays * Fix tests * Add 2GB check in gridwise dl * Fixes
-
- 05 Aug, 2024 1 commit
-
-
Illia Silin authored
* add --offload-compress compiler flag * only apply the --offload-compress flag to the ckProfiler * move the --offload-compress flag back to main cmake file * add offload-compress to target compile option of ckProfiler
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
-
- 31 Jul, 2024 1 commit
-
-
zjing14 authored
* fixed a typo * clean
---------
Co-authored-by: Jing Zhang <jizhan@fb.com>
-
- 19 Jul, 2024 2 commits
-
-
Haocong WANG authored
* add ab_scale init support * enabled interwave * add scale type; update isSupport * adjust example * clean * enable f8 pure gemm rcr ckprofiler * Add gemm_multiply_multiply instances * clang format * Optimize for ScaleBlockMNK=128 * enable abscale f8 gemm ck profiler * Add pure f8 gemm test suite * Reverting to the state of project at f60fd77 * update copyright * clang format * update copyright
---------
Co-authored-by: root <jizhan@amd.com>
-
ltqin authored
* init for reduce_threadwise multi_d * add reduce_threadwise_multi_d * add reduce_multi_d * clean * start adding another splitk device op * add reduce template parameter to SplitKBatchOffset * add reduce c matrix * clean up code * change example data type to bf16 * add bf16Ai8B example * remove reduce template parameter * add splitk atomic status to v4 * example add multi d parameters * device op add multi-d parameters * add multi-d to reduce * fix kbatch=1 bug * change B layout to col in bf16Ai8B example * remove float adding struct * change multi-d interface * change file and class name * remove multi-d of bf16Ai8B example * change IsReduce function to IsReduceAdd * change example layout to RRR from RCR * according layout to set ds stride * reset parameter layout * add gemm universal reduce instance * add reduce factory * add profile_gemm_universal_reduce * add reduce to profiler * fix reduce instance * fix profiler reduce compiling bug * format * format library instance code * add mem instance for reduce library * fix call instance names * add workspace for reduce in ckProfiler * format * add mnpadding to reduce library instance * add fp16 instance to reduce of profiler * change copyright time * restore profiler cmake file * add reduce text to instances * add DsLayout and DsDataType to instances template parameter * fixed gemm_reduce_multi_d * add an example without multi_d * Update common.hpp * Update gtest.cmake * Update gemm_xdl_splitk_reduce_bf16.cpp * clean * Update gtest.cmake * format * fix api * format * default parameter change to RRR * add vector_len for multi_d * format * Update gtest.cmake * fix bf16A iBB elementwiseop * add ReduceDataType * move ReduceDataType to end position * format * remove googletest git method address * fix copyright time * update init data
---------
Co-authored-by: root <jizhan@amd.com>
Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: Jing Zhang <jizhan@meta.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
-
- 08 Jul, 2024 1 commit
-
-
Andriy Roshchenko authored
-
- 06 Jul, 2024 1 commit
-
-
Harisankar Sadasivan authored
* universal streamk with atomics with ckprofiler support. grid_size and streamk strategy are tunable. grid_size of -1 leads to #WGs = maximum occupancy X num_CUs. implementation supports many different streamk policies: 1-tile, 2-tile, 3-tile and 4-tile. streamk strategy of -1 leads to default streamk policy (4-tile). * Update README.md * fixing clang-format issues * removed conflicts in struct members between streamk and universal streamk * corrected arg parsing for streamk and universal streamk * added stream-k policies for 3 tile and 4 tile * fixed argument type issue with parsing cmd args * changes suggested in PR review are made- removing comments and correcting copyright * file permissions updated * added default value support for grid_size and streamk-policy selection set to -1 * print messages for arguments * print messages for arguments * print messages for arguments1
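The default handling described in this commit message (grid_size of -1 and streamk strategy of -1) could be sketched roughly as below. This is only an illustration of the stated rules, not the code from the commit: the names `StreamKPolicy`, `resolve_grid_size`, and `resolve_streamk_policy` are hypothetical, and the mapping of the integers 1 to 4 onto the 1-tile to 4-tile policies is an assumption.

```cpp
#include <cassert>

// Hypothetical names throughout; this is not Composable Kernel's API.
// Assumption: policy integers 1..4 select the 1-tile..4-tile policies.
enum class StreamKPolicy
{
    OneTile   = 1,
    TwoTile   = 2,
    ThreeTile = 3,
    FourTile  = 4
};

// grid_size == -1 means "derive the workgroup count from the device":
// #WGs = maximum occupancy x number of compute units.
// Any other value is taken as an explicit, user-tuned grid size.
inline int resolve_grid_size(int requested, int max_occupancy, int num_cus)
{
    assert(max_occupancy > 0 && num_cus > 0);
    return requested == -1 ? max_occupancy * num_cus : requested;
}

// streamk strategy == -1 selects the default policy, which the commit
// message states is 4-tile.
inline StreamKPolicy resolve_streamk_policy(int requested)
{
    if(requested == -1)
    {
        return StreamKPolicy::FourTile;
    }
    assert(requested >= 1 && requested <= 4);
    return static_cast<StreamKPolicy>(requested);
}
```

Under these assumptions, a device with 304 compute units and an occupancy of 2 would resolve `grid_size = -1` to 608 workgroups, while an explicit value such as 128 passes through unchanged.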
-
- 28 Jun, 2024 1 commit
-
-
Illia Silin authored
-
- 27 Jun, 2024 1 commit
-
-
Illia Silin authored
-