- 18 Nov, 2024 1 commit
Illia Silin authored
* add bf16 gemms for gfx11/gfx12 * reduce the input values in test_gemm * add int8 wmma gemm instances for gfx11/gfx12 * add example gemm_wmma_int8 * fix bug in gemm_wmma_int8 test * increase bf16 gemm test tolerance * update the dates and clean-up commented-out instances
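The looser bf16 tolerance in this commit reflects bf16's short (8-bit) mantissa: an elementwise check against an fp32 reference needs a wider rtol than fp16 or fp32 GEMM tests. A minimal sketch of that kind of relative/absolute check (illustrative only, not the repository's check_err utility; the tolerance values are assumptions):

```cpp
// Minimal sketch: verify a bf16 GEMM result (converted to fp32) against an fp32
// reference with a tolerance wide enough for bf16's 8-bit mantissa.
#include <cmath>
#include <cstdio>
#include <vector>

bool check_close(const std::vector<float>& out, const std::vector<float>& ref,
                 float rtol, float atol)
{
    for(size_t i = 0; i < out.size(); ++i)
    {
        const float err = std::fabs(out[i] - ref[i]);
        if(err > atol + rtol * std::fabs(ref[i]))
        {
            std::printf("mismatch at %zu: %f vs %f\n", i, out[i], ref[i]);
            return false;
        }
    }
    return true;
}

int main()
{
    std::vector<float> out = {1.01f, 2.02f}, ref = {1.0f, 2.0f};
    // bf16 keeps ~8 mantissa bits, so a rtol on the order of 1e-2 is typical here.
    return check_close(out, ref, 1e-2f, 1e-3f) ? 0 : 1;
}
```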
- 05 Nov, 2024 1 commit
Andriy Roshchenko authored
- 23 Oct, 2024 1 commit
Andriy Roshchenko authored
- 08 Oct, 2024 1 commit
Rostyslav Geyyer authored
* Add a gpu gemm reference kernel * Switch to gpu reference in gemm examples * Remove redundant arguments * Update all related examples * Update more examples * Try fewer threads per block * Try even fewer threads per block * Add support for all matrix layouts * Increase block size * Clean up * Remove hardcoded strides * Clean up * Try a column-major case * Revert back to row-major * Run both CPU and GPU verification --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
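A stride-based reference GEMM of the kind described above can cover every matrix layout with a single kernel, because row- and column-major only differ in the strides passed in. A minimal HIP sketch (illustrative; the kernel name, signature, and launch shape are assumptions, not the CK implementation):

```cpp
// One thread per C(m, n); layouts are handled purely through strides.
#include <hip/hip_runtime.h>

__global__ void reference_gemm(const float* a, const float* b, float* c,
                               int M, int N, int K,
                               int stride_am, int stride_ak,
                               int stride_bk, int stride_bn,
                               int stride_cm, int stride_cn)
{
    const int m = blockIdx.y * blockDim.y + threadIdx.y;
    const int n = blockIdx.x * blockDim.x + threadIdx.x;
    if(m >= M || n >= N)
        return;

    float acc = 0.f;
    for(int k = 0; k < K; ++k)
        acc += a[m * stride_am + k * stride_ak] * b[k * stride_bk + n * stride_bn];

    c[m * stride_cm + n * stride_cn] = acc;
}

// Launch example for row-major A/B/C:
//   dim3 block(16, 16);
//   dim3 grid((N + 15) / 16, (M + 15) / 16);
//   reference_gemm<<<grid, block>>>(dA, dB, dC, M, N, K, K, 1, N, 1, N, 1);
```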
- 27 Jun, 2024 1 commit
Illia Silin authored
- 27 Apr, 2024 2 commits
Jing Zhang authored
Jing Zhang authored
- 21 Apr, 2024 1 commit
Jing Zhang authored
- 09 Mar, 2024 2 commits
Jing Zhang authored
zjing14 authored
* wmma_op + unit test * add arch limitation to wmma test * change arch limitation * Refactor + Add all type unit test(int4 compile failed) * Add f32_16x16x16_bf16 unit test * tempsave * tempsave * tempsave * runtime bug, cannot find symbol * workaround for incorrect HIP warpSize return value * debugging * tempsave * Correctness OK, waiting for optimization * Tidy up + format * temp save * temp save, reproduce the v_bfi_b32 issue * add inline asm for wmmaop test * tidy up * clean some debug purpose code * discard some codes * clang format * clang format * compiler issue fixed + increase tile size * navi3x_multipleD+example * temp save * workable * batchedgemm[OK], groupconv[debug] * groupconv: Sanity check[OK], Performance[Bad] * navi3x_groupconv_need_optimization * create necessary files * save progress * Add Inter-Row thread transfer * save progress * save debugging progress * sanity check pass * fix a host tensor bug and clean up flash-attn code * format * cancel unnecessary change * cancel unnecessary change * cancel unnecessary change * temp save, add asm backend flag to amd_wmma * Mat-A LDS Bypass sanity pass * temp save * gemm sanity fix * Porting new blockwise gemm to flash attention * Example branch provided to compiler team * tempsave * Fix a bug * batched gemm ported * conv A-skip lds ported * Skip B-Lds real gemm * Skip B Lds Gemm + MulD * batched gemm, conv, skip b lds * format * Attn, skip b lds * Change GridwiseOp name * fix a typo caused bug * Skip A_Lds sanity pass, Skip B_Lds scratch occurred * Bug found, intra-row permute off caused * bug found * a fix * disable buffer load due to incorrect 3rd dword * update fmha config, no scratch generated * update 3rd dword * fmha config update * FMHA, add support to gfx1101/gfx1102 * Merge origin dev (#2) * [Navi3x] Fix Gridwise_multiple_d operation (#649) * Add CMake Option "USE_OPT_NAVI3X" * fix bug * standardize docs (#655) * Separate bibtex requirement from rocm-docs-core (#656) * separate bibtex requirement from rocm-docs-core * point requirements to source rocm-docs-core repo * Add CMake Option "USE_OPT_NAVI3X" (#647) * Add CMake Option "USE_OPT_NAVI3X" * remove navi3x opt compile option from cmake script * Conv + quantization + tanh (#645) * Rename file. Prepare to support another activation * Add comment for quantization * Extract out_elementop * Add tanh example * Add conv + bias + tanh quantization instance * Add missing parameter * Refine cmake * Add external api and client example * Extract variable in example * Fix the comment --------- Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Add a denorm test fix (#603) * Add type_convert implementations for bf16 * Add the fix for conv_fwd * Add the fix for conv_bwd_data * Add the fix for conv_bwd_weight * Format * Format * Another format * Add a macro to use workaround on MI200 only * Format --------- Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com>
* simplify karg in device/grid of split-k op (#644) * simplify karg in device/grid split-k op * fix mk_kn_mn instances * add more instances * use name from tensor layout * fix 3rd dword of buffer source descriptor (#659) * add fp64 instances (#658) Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
* Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665) This reverts commit bb5530af. * Groupnorm + swish external api (#668) * Rename to proper naming * Add example of groupnorm + swish * Extract duplicate code in example * Add groupnorm + swish instances * Refactor instance generation, split into multiple cpp files * Add external api and client example * Refine profiler message * Use ck math version of exp * Refine problem size in example * Add host version of exp * add a macro to turn on/off denorm fix (off by default) (#673) * add a macro to turn off denorm fix by default * expose the macro --------- Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
* fixed quant example (#672) Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
* Add dependabot config and pin rocm-docs-core (#663) * [gtest] suppress unsafe buffer warn (#670) ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912 * Add memory index guard in wmma device ops (#667) * Add more macros to turn on/off denorm fix (#678) Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
* Fix a typo (#676) * Add (#677) * Allow using ROCm release candidate compilers. (#679) * enable use of rocm5.5 release candidate 4 * upgrade to ROCM5.5 RC5 * try fix the PUB_KEY error, remove the cmake-data package * upgrade to latest cmake version * use private dockerhub repo for rocm5.5 rc5 * add missing bracket * add vector load check * solve conflicts ---------
Co-authored-by: Sam Wu <sjwu@ualberta.ca>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Disable SkipLDS & Align AIT api (#3) * fix layernorm, reduction Ops (#4) * [Navi3x] Fix Gridwise_multiple_d operation (#649) * Add CMake Option "USE_OPT_NAVI3X" * fix bug * standardize docs (#655) * Separate bibtex requirement from rocm-docs-core (#656) * separate bibtex requirement from rocm-docs-core * point requirements to source rocm-docs-core repo * Add CMake Option "USE_OPT_NAVI3X" (#647) * Add CMake Option "USE_OPT_NAVI3X" * remove navi3x opt compile option from cmake script * Conv + quantization + tanh (#645) * Rename file. Prepare to support another activation * Add comment for quantization * Extract out_elementop * Add tanh example * Add conv + bias + tanh quantization instance * Add missing parameter * Refine cmake * Add external api and client example * Extract variable in example * Fix the comment --------- Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Add a denorm test fix (#603) * Add type_convert implementations for bf16 * Add the fix for conv_fwd * Add the fix for conv_bwd_data * Add the fix for conv_bwd_weight * Format * Format * Another format * Add a macro to use workaround on MI200 only * Format --------- Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com>
* simplify karg in device/grid of split-k op (#644) * simplify karg in device/grid split-k op * fix mk_kn_mn instances * add more instances * use name from tensor layout * fix 3rd dword of buffer source descriptor (#659) * add fp64 instances (#658) Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
* Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665) This reverts commit bb5530af. * Groupnorm + swish external api (#668) * Rename to proper naming * Add example of groupnorm + swish * Extract duplicate code in example * Add groupnorm + swish instances * Refactor instance generation, split into multiple cpp files * Add external api and client example * Refine profiler message * Use ck math version of exp * Refine problem size in example * Add host version of exp * add a macro to turn on/off denorm fix (off by default) (#673) * add a macro to turn off denorm fix by default * expose the macro --------- Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
* fixed quant example (#672) Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
* Add dependabot config and pin rocm-docs-core (#663) * [gtest] suppress unsafe buffer warn (#670) ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912 * Add memory index guard in wmma device ops (#667) * Add more macros to turn on/off denorm fix (#678) Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
* Fix a typo (#676) * Add (#677) * Allow using ROCm release candidate compilers. (#679) * enable use of rocm5.5 release candidate 4 * upgrade to ROCM5.5 RC5 * try fix the PUB_KEY error, remove the cmake-data package * upgrade to latest cmake version * use private dockerhub repo for rocm5.5 rc5 * add missing bracket * Disable SkipLDS & Align AIT api * Update dependabot config (#682) Co-authored-by: samjwu <samjwu@users.noreply.github.com>
* update attn api * solve type_convert bug + enable ---------
Co-authored-by: Sam Wu <sjwu@ualberta.ca>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: samjwu <samjwu@users.noreply.github.com>
Co-authored-by: haocwang <Haocong.WANG@amd.com>
* fix typo * Fix attention with causal mask * multiple fix, try ait compile * Add A/B not use LDS pipeline * Clang format, Add gfx1101, gfx1102 support of FMHA example * cancel change of format script * 1. Enable 2-stage global Prefetch (may cause VGPR spilling) 2. Enable FP16 accumulator blockwise_gemm * clang-format * 1. change blockwise gemm loopover direction from kmn to mnk (~1% improvement) 2. change kernel timing mode to 50 warmup + 50 timed repeat * Update low level abstraction of blockwise gemm wmma * (2/5) bilinear gemm pass, perf bug: skip a lds has lower performance than skip b lds * (3/5) batched gemm pass, perf bug: skip a lds has lower performance than skip b lds * (4/5) grouped conv pass * (5/5) attention pass, todo: debug lds perf bug * AIT Attention API refactor (#8) * sanity pass * sanity pass 2 * confirm significant performance regression. * turn on all instances * turn off instance format * Fix bug & tuning & format * DML meta, self_attn+cross_attn * sanity pass * remove useless flag * update tile and problem size used in AIT attention * bug fix in grouped conv supporting check * deprecate inline asm wmma * Bug fix: double lds skip * clang-format * Fix errors in 1. example, fmha 2. gridwise pipeline 3. deviceop, fmha, change some containers from vector to array * part2 of previous commit * clang format * API fix of gridwisegemmpipeline * separate array base and vector base attention tensor transformation * fix gemm * clang format * add gemm fp16 instances * Temp save * fpAintB kernel compile pass * Sanity pass. * Temp save * debug code enabled * Fp16AInt8B_GEMM sanity * MQA implementation * GQA-4 example * tempsave * Compile pass * New implementation of fp16Aint8B Gemm, achieve similar math throughput with native fp16 Gemm * format * Todo: fix gemm_bilinear_wmma instances compilation bug * Solve a bug when K1=16 * remove unnecessary changes * Remove tensor layout limitation to LDS usage in tensor contraction * update self-attention and cross-attention * fix a typo of name * Add arch limiter for fp8 gemm * enable fp8 gemm_xdl for all gfx9 targets * temporarily disable gemm_xdl_fp16_fp8 on MI100/200 * fix the cmake logic for gemm_xdl_fp16_fp8 * re-enable the gemm_xdl_fp16_fp8 on MI100/200 ---------
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: Sam Wu <sjwu@ualberta.ca>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: samjwu <samjwu@users.noreply.github.com>
Co-authored-by: haocwang <Haocong.WANG@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
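One of the items above, the fp16A/int8B GEMM, dequantizes the int8 B operand on the fly so the multiply-accumulate stays in floating point. A hedged host-side sketch of that idea (illustrative only; the per-column scale layout and function name are assumptions, not the CK device op):

```cpp
// Dequantize the int8 B tile with a per-column scale right before the multiply.
#include <cstdint>
#include <vector>

void gemm_fp16a_int8b(const std::vector<float>& a,        // stand-in for the fp16 A tile
                      const std::vector<int8_t>& b_q,     // quantized B tile (K x N)
                      const std::vector<float>& b_scale,  // one scale per column of B
                      std::vector<float>& c, int M, int N, int K)
{
    for(int m = 0; m < M; ++m)
        for(int n = 0; n < N; ++n)
        {
            float acc = 0.f;
            for(int k = 0; k < K; ++k)
                acc += a[m * K + k] * (static_cast<float>(b_q[k * N + n]) * b_scale[n]);
            c[m * N + n] = acc;
        }
}
```

Keeping the dequantized values in registers next to the fp16 A fragments is what the "similar math throughput with native fp16 Gemm" item above alludes to.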
- 29 Feb, 2024 2 commits
Jing Zhang authored
Jing Zhang authored
- 09 Aug, 2023 1 commit
aska-0096 authored
- 31 May, 2023 1 commit
Illia Silin authored
- 19 May, 2023 1 commit
aska-0096 authored
- 18 May, 2023 1 commit
aska-0096 authored
1. change blockwise gemm loopover direction from kmn to mnk (~1% improvement) 2. change kernel timing mode to 50 warmup + 50 timed repeat
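The warmup-plus-repeat timing mentioned here amortizes launch overhead and lets clocks settle before measurement. A rough sketch of the scheme with HIP events (illustrative; the helper name and default counts are assumptions, not the CK profiler API):

```cpp
#include <hip/hip_runtime.h>

template <typename Launch>
float time_kernel_ms(Launch launch, int warmup = 50, int repeat = 50)
{
    for(int i = 0; i < warmup; ++i)
        launch();                         // warm caches/clocks, not timed

    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);

    hipEventRecord(start);
    for(int i = 0; i < repeat; ++i)
        launch();
    hipEventRecord(stop);
    hipEventSynchronize(stop);

    float ms = 0.f;
    hipEventElapsedTime(&ms, start, stop);
    hipEventDestroy(start);
    hipEventDestroy(stop);
    return ms / repeat;                   // average time per launch
}
```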
- 10 May, 2023 1 commit
aska-0096 authored
1. Enable 2-stage global Prefetch (may cause VGPR spilling) 2. Enable FP16 accumulator blockwise_gemm
- 27 Apr, 2023 1 commit
aska-0096 authored
- 20 Apr, 2023 1 commit
Haocong WANG authored
- 29 Mar, 2023 1 commit
aska-0096 authored
- 23 Mar, 2023 2 commits
- 06 Mar, 2023 2 commits
- 01 Mar, 2023 1 commit
Haocong WANG authored
* fix a bug blocking wmma_gemm_multipleD * Utilize matrix padder in device_wmma_op * cosmetic change for gemm padding format * clang format * Change gridwise gemm from FIFO to KMN loop fashion
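The matrix padder mentioned above lets a fixed block tile cover arbitrary problem sizes by rounding each GEMM dimension up to a tile multiple; the padded region is masked out of the real computation. A small sketch of that rounding (the tile sizes below are hypothetical, not taken from device_wmma_op):

```cpp
#include <cstdio>

constexpr int round_up(int x, int tile) { return (x + tile - 1) / tile * tile; }

int main()
{
    // hypothetical tile sizes for a WMMA block
    constexpr int MPerBlock = 128, NPerBlock = 64, KPerBlock = 32;

    const int M = 1000, N = 777, K = 100;           // raw problem size
    std::printf("padded M=%d N=%d K=%d\n",
                round_up(M, MPerBlock),             // 1024
                round_up(N, NPerBlock),             // 832
                round_up(K, KPerBlock));            // 128
    return 0;
}
```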
- 28 Feb, 2023 1 commit
aska-0096 authored
- 27 Feb, 2023 1 commit
aska-0096 authored
- 24 Feb, 2023 1 commit
aska-0096 authored
- 17 Jan, 2023 1 commit
Haocong WANG authored
* wmma_op + unit test * add arch limitation to wmma test * change arch limitation * Refactor + Add all type unit test(int4 compile failed) * Add f32_16x16x16_bf16 unit test * tempsave * tempsave * tempsave * runtime bug, cannot find symbol * workaround for incorrect HIP warpSize return value * debugging * tempsave * Correctness OK, waiting for optimization * Tidy up + format * temp save * temp save, reproduce the v_bfi_b32 issue * add inline asm for wmmaop test * tidy up * clean some debug purpose code * discard some codes * clang format * clang format * compiler issue fixed + increase tile size
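The warpSize workaround noted above relates to WMMA on gfx11 running in wave32 mode while the runtime-reported warp size could not be trusted at the time; one common fix is to pin the wave size at compile time per target. A hedged sketch (the macro spellings and the choice of a compile-time constant are illustrative, not the exact workaround used in the commit):

```cpp
// Pick the wave size from the compilation target instead of the runtime warpSize.
#if defined(__gfx1100__) || defined(__gfx1101__) || defined(__gfx1102__)
static constexpr int wave_size = 32;   // RDNA3 WMMA path runs in wave32
#else
static constexpr int wave_size = 64;   // CDNA / GCN targets
#endif
```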
- 11 Jan, 2023 1 commit
aska-0096 authored
- 15 Dec, 2022 2 commits
- 12 Dec, 2022 1 commit
aska-0096 authored
- 09 Dec, 2022 1 commit
aska-0096 authored
- 30 Nov, 2022 1 commit
aska-0096 authored
- 16 Nov, 2022 1 commit
aska-0096 authored
- 13 Oct, 2022 1 commit
Adam Osewski authored
* Move kernel implementation files under impl directory. * Update example paths. * Update device kernel impl include paths. * Update tensor operation instances include paths. * Update profiler and tests include paths. * Clang-format * Update include paths for batched gemm reduce * Refactor UnitTest ConvNDBwdWeight. * Refactor fwd and bwd data convND UT. * Fix used test macro. * Fix include path. * Fix include paths. * Fix include paths in profiler and tests. * Fix include paths. Co-authored-by: Adam Osewski <aosewski@amd.com>
- 23 Aug, 2022 1 commit
Po Yen Chen authored
* Add GEMM examples for int4. Currently the source files are just copied from int8 examples * Re-use pre-defined alias in int4 examples * Distinguish user-side type from kernel-side type * Add int4_t support for check_err() * Allow conversion between Tensor<> specializations * Re-format source files * Use different type for host tensors * Re-use CopyAsType<>() to implement copy ctor * Re-use element-wise operation type alias * Fix typo in alias names * Complete the int4 examples * Add constraint to Tensor<> templated methods * Add type traits 'is_signed_integral<>' * Add type constraints for integer version check_err<>() * Allow comparing different-sized integral types in check_err() * Check converted Tensor<int4_t> with golden Tensor<int8_t> * Remove constraint of Tensor<>::CopyAsType() * Avoid compilation error while disabling ck::int4_t support * Remove debug messages * Add #error directive to prevent compile sources with wrong setting * Simplify tensor usages in examples * Add constraint to check_err() input reference type * Align design with other PR * Use ""_uz to simplify example code * Avoid too much generalizing check_err() * Re-format GEMM instance template arguments * Extract int4 example common codes * Sort include directives * Move #include directives into new header * Move common codes together * Re-format template argument in example code * Reuse same implementation code for most of GEMM examples * Re-format common.hpp * Unify structured comment in examples * Use reinterpret_cast<>() for cross-type pointer conversion * Revert "Add type traits 'is_signed_integral<>'" This reverts commit f2c148efaedf42c8ee66032dac6d13a1003b0f3a. * Allow unsigned integer arguments for check_err() * Fix compilation error in check_err() * Remove unnecessary copy ctor for Tensor<> * Mark Tensor<> special member functions as 'default' * Use more strict condition to add code in examples * Fix wrong program return value of GEMM examples * Handle the case while user specify all the strides * Fix never-ran examples * Exit successfully if GEMM instance does not support given problem * Add missing 'else' keyword * Re-format CMakeLists.txt * Add wrapper function to hide value conversion while copying memory * Add new DeviceMem API to copy memory * Use new DeviceMem API to implement examples * Revert "Add new DeviceMem API to copy memory" This reverts commit 3f190b0779ceedf7aaf0b380712fda0518de72c1. * Add conversion ctor for Tensor<> * Write Tensor<> conversion logics explicitly in example code * Convert Tensor<> values after transfer data to host
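The check_err() work listed above splits verification by value category: integral results (including int4 data widened to a larger integer type) must match exactly, while floating-point results get a tolerance. A hedged sketch of that split (the signature and tolerance defaults are assumptions, not the actual ck::utils::check_err):

```cpp
#include <cmath>
#include <cstdint>
#include <type_traits>
#include <vector>

template <typename T, typename U>
bool check_err(const std::vector<T>& out, const std::vector<U>& ref,
               double rtol = 1e-3, double atol = 1e-3)
{
    if(out.size() != ref.size())
        return false;

    for(std::size_t i = 0; i < out.size(); ++i)
    {
        if constexpr(std::is_integral_v<T> && std::is_integral_v<U>)
        {
            // different-sized integer types (e.g. int4 unpacked into int8 vs an
            // int32 reference) are widened and compared exactly
            if(static_cast<std::int64_t>(out[i]) != static_cast<std::int64_t>(ref[i]))
                return false;
        }
        else
        {
            const double err = std::fabs(double(out[i]) - double(ref[i]));
            if(err > atol + rtol * std::fabs(double(ref[i])))
                return false;
        }
    }
    return true;
}
```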
- 29 Jul, 2022 1 commit
Chao Liu authored
* convnd_fwd fp16 example * update example * update example * update instance * updating reference conv * update reference conv * update conv fwd profiler * update conv 1d and 3d instance * update include path * clean * update profiler for conv bwd data and weight * update conv bwd weight * clean * update conv example * update profiler for conv bwd weight * update ckprofiler for conv bwd data * fix reference conv bwd data bug; update conv bwd data test * update examples * fix initialization issue * update test for conv fwd * clean * clean * remove test case too sensitive to error threshold * fix test * clean * fix build * adding conv multiple d * adding conv multiple D * add matrix padder * add gemm padding to convnd * adding group conv * update gemm multi-d * refactor * refactor * refactor * clean * clean * refactor * refactor * reorg * add ds * add bias * clean * add G * adding group * adding group * adding group * update Tensor * clean * update example * update DeviceGemmMultipleD_Xdl_CShuffle * update conv bwd-data and bwd-weight * update contraction example * update gemm and batch gemm with e permute * fix example build * instance for grouped conv1d * update example * adding group conv instance * update gemm bilinear instance * update gemm+add+add+fastgelu instance * update profiler * update profiler * update test * update test and client example * clean * add grouped conv into profiler * update profiler * clean * add test grouped conv, update all conv test to gtest * update test
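Much of the conv work above rides on the implicit-GEMM view of forward convolution, which is also what the "add gemm padding to convnd" item targets: the conv problem is flattened to GEMM dimensions and then padded to tile multiples like any other GEMM. A sketch of the usual NHWC/KYXC mapping (the sizes are arbitrary; this is the standard mapping, not code lifted from the instances added here):

```cpp
// For NHWC activations and KYXC weights: GemmM = N*Ho*Wo, GemmN = K, GemmK = C*Y*X.
#include <cstdio>

int main()
{
    const int N = 4, C = 192, Hi = 71, Wi = 71;   // input
    const int K = 256, Y = 3, X = 3;              // weights
    const int Ho = Hi - Y + 1, Wo = Wi - X + 1;   // stride 1, no padding

    const long GemmM = 1L * N * Ho * Wo;
    const long GemmN = K;
    const long GemmK = 1L * C * Y * X;
    std::printf("GemmM=%ld GemmN=%ld GemmK=%ld\n", GemmM, GemmN, GemmK);
    return 0;
}
```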
- 25 Jun, 2022 1 commit
Chao Liu authored