- 12 Aug, 2024 1 commit
-
-
Mateusz Ozga authored
* Rewrite .sh test to Gtest * review chnages * Removew unused comments * Review v2 * Typo * Separete UT: AMAX, MAX, MIN; added template params to trigger them * Update test/reduce/reduce_no_index.cpp --------- Co-authored-by:Bartłomiej Kocot <barkocot@amd.com>
-
- 06 Aug, 2024 1 commit
-
-
Bartłomiej Kocot authored
* Support 64 bit indexing * Add new grouped conv fwd kernel for large tensors * Add instances large tensor * Fixes for transform conv to gemm * Fixes * fixes * Remove not needed instances * examples fixes * Remove not need ds arrays * Fix tests * Add 2GB check in gridwise dl * Fixes
-
- 01 Aug, 2024 1 commit
-
-
Illia Silin authored
* add compiler flags to fix compiler issues * fix typo. * disable test_smfmac_op on all devices except gfx942 * specify full path to compiler in CI
-
- 19 Jul, 2024 1 commit
-
-
Haocong WANG authored
* add ab_scale init support * enabled interwave * add scale type; update isSupport * adjust example * clean * enable f8 pure gemm rcr ckprofiler * Add gemm_multiply_multiply instances * clang format * Optimize for ScaleBlockMNK=128 * enable abscale f8 gemm ck profiler * Add pure f8 gemm test suite * Reverting to the state of project at f60fd77 * update copyright * clang format * update copyright --------- Co-authored-by:root <jizhan@amd.com>
-
- 12 Jul, 2024 1 commit
-
-
Bartłomiej Kocot authored
* Support access per groups and filter3x3 in grouped conv fwd * Fixes for large cases * Fixes for large tensors
-
- 09 Jul, 2024 1 commit
-
-
Illia Silin authored
* fix the cmake logic when building for various targets * another minor fix
-
- 04 Jul, 2024 1 commit
-
-
Jun Liu authored
-
- 27 Jun, 2024 2 commits
-
-
jakpiase authored
* first version of smfmac test * add reviewer comments * add reviewer suggestions
-
Illia Silin authored
-
- 21 Jun, 2024 1 commit
-
-
Bartłomiej Kocot authored
* Cmake add -Wno-nvcc-compt * Remove template without initialization list * dpp remove template without init list * Fixes
-
- 18 Jun, 2024 1 commit
-
-
Bartłomiej Kocot authored
* Fix continous dim selection in contraction * Fixes
-
- 14 Jun, 2024 1 commit
-
-
Bartłomiej Kocot authored
* Support large tensors in grouped conv fwd * Multi ABD fixes * Fix calculate element space size
-
- 05 Jun, 2024 1 commit
-
-
Bartłomiej Kocot authored
* Integrate universal gemm with conv fwd * Fix conv fwd wmma test * Fix instances * Remove direct load check
-
- 28 May, 2024 1 commit
-
-
carlushuang authored
* support cmdline seqlen decode * silent print * update readme * update kernel launch 3d * update tile partitioner * fix spill for bf16 * modify based on comment * modify payload_t * fix bug for alibi mode * fix alibi test err * refactor kernel launch, support select timer * add missing file * remove useless code * add some comments
-
- 22 May, 2024 2 commits
-
-
Bartłomiej Kocot authored
* Optimize grouped conv bwd weight for small M and N * Fixes
-
Illia Silin authored
* set individual gpu targets for instances, examples, tests * fix path to hip compiler * fix path to hip compiler once more * aggregate device macros in ck_tile config header * fix the cmake logic for instances * fix clang format * add gfx900 and gfx906 to default set of targets
-
- 15 May, 2024 1 commit
-
-
jakpiase authored
* add unit tests for grouped gemm two stage * add reviewers suggestions --------- Co-authored-by:Adam Osewski <19374865+aosewski@users.noreply.github.com>
-
- 10 May, 2024 2 commits
-
-
Illia Silin authored
* code clean-up * remove the profiling output samples
-
Bartłomiej Kocot authored
-
- 07 May, 2024 1 commit
-
-
carlushuang authored
* add alibi support * fix code * update code based on comment * Support more hdim * fix fp8 bias * support seqlen_k=0 case * remove unused printf * fix format --------- Co-authored-by:rocking <ChunYu.Lai@amd.com>
-
- 26 Apr, 2024 1 commit
-
-
Haocong WANG authored
* Add bf16 instances * Add bf16 gemm universal example * tempsave * Add guard to navi compilation * workground on a specific mixed gemm instance ( bring back it when compiler fix upload) * fix formatting condition statement issue * solve conflict --------- Co-authored-by:Jun Liu <Liu.Jun@amd.com>
-
- 25 Apr, 2024 1 commit
-
-
Adam Osewski authored
* Overload output stream operator for LoopScheduler and PiplineVersion * Add Run overload accepting grid descriptors MK. * Add __device__ keyword for CalculateGridSize * Create device op GroupedGemmMultipleD * Add GroupedGemm MultipleD Tile Loop implementation. * Add an example for GroupedGemm MultipleD tile loop. * Device Op GroupedGEMMTileLoop. * Bunch of small changes in exmaple. * CkProfiler * Remove unused tparam. * Fix include statement. * Fix output stream overloads. * Do not make descriptors and check validity untill we find group. * Fix gemm desc initialization. * Revert device op * Fix compilation for DTYPES=FP16 * Validate tensor transfers paramters. * Validate on host only NK dims if M is not known. * Fix bug. * A convenient debug func for selecting threads. * Fix has main k block loop bug. * Make sure that b2c has up to date tile offset. * Output stream operator for Sequence type. * Cmake file formatting.
-
- 18 Apr, 2024 1 commit
-
-
Bartłomiej Kocot authored
* Add grouped conv bwd weight multi d kernel * Reference fix * Fix cmake files * bwd weight scale only xdl * Fixes * Fix client conv fwd example
-
- 14 Apr, 2024 1 commit
-
-
Haocong WANG authored
* Optimize GEMM on MI200/300: 1. Add new blockwise gemm pipeline 2. Add irregular splitk intances * clang format + typo fix * Fix a bug * initial commit * Add more instances to irregular splitk * blkgemm pipeline v1~4 prototype * Sanity Checked. Known issue: 1. Poor performance of splitk 2. Register spill on blkgemmpipeline v3 * Sanity and Performance fix: 1. fix a bug related to sanity in grouped b2c mapping 2. fix a bug related to sanity and performance in splitk offset * Sanity and API update: 1. Remove prefetch stage 2. Fix valid check bug 3, Add first gemm_universal instance into ckProfiler * Add NN instances for gemm universal * 1. Add NT instances for gemm_universal 2. Fix a bug about Kpadding in gemm_universal * Fix a bug regarding padding Odd K number * remove kernel print * Fix KPadding bug... * Update safety check * another try to fix kpadding.. * Sanity checked * new instances.. * clang format+typo fix * remove clang format script's change * Add non-hotloop compile option * 1. Add fp16xfp8 example 2. pull packed convert f8 from pr1150 * Some miscs.. opt and fix * Add pipeline description docs * Split universal gemm instance library to cut profiler compiling time * uncomment cmakefile * Fix a bug caused by blockwise_gemm_pipe_v2 * reduce default splitk to 1 * Add 224x256x64 tile size * update, including: 1. Experiment pipeline 5~7 2. Optimization for pipeline 4 3. Organized instance library * temp save * temp save * Permuted lds layout, sanity and function checked * clang format * Move OOB check from RunRead to RunWrite, for better software pipeline. TODO: agpr spill when NN layout * clangformat * A/B splitpipe scheduler for v3 * Fix two bugs * bug fix * fix a bug in oob check * Example for mixed fp16_fp8 gemm * Clean experimental code blocks * Add mixed precision gemm into profiler * tempsave * optimize m/n major lds layout * Add RRR GEMM mixed precision instances * Optimize f8 matrix transpose * Add test_gemm_universal * A/B spilt schedule for blkpip v5 * Take ds_read2 into iglp scheduling scheme * format * fixed cmake * Add llvm-option into CI cmake flag --------- Co-authored-by:Jing Zhang <jizhan@amd.com>
-
- 09 Apr, 2024 1 commit
-
-
Bartłomiej Kocot authored
* Extend support for contraction up to 5D * Extend contraction bilinear instances * Fix interface test * Add 6d support, remove 3d,4d,5d * Fixes * Fix readme * Make defualt dim for contraction instances
-
- 04 Apr, 2024 1 commit
-
-
Illia Silin authored
-
- 02 Apr, 2024 1 commit
-
-
Illia Silin authored
* parse examples inside the add_example_executable function * fix the example 64 cmake file * add xdl flag to the gemm_bias_softmax_gemm_permute example * add filtering of tests based on architecture type * enable test_grouped_gemm for gfx9 only * enable test_transpose only for gfx9 * only linnk test_transpose if it gets built * split the gemm instances by architectures * split gemm_bilinear,grouped_conv_bwd_weight instances by targets * split instances by architecture * split grouped_conv instances by architecture * fix clang format * fix the if-else logic in group_conv headers * small fix for grouped convolution instances * fix the grouped conv bwd weight dl instances * fix client examples * only enable client examples 3 and 4 on gfx9 * set the gfx9 macro * make sure the architecture macros are set by cmake * use separate set of xdl/wmma flags for host code * sinmplify the main cmake file * add conv_fwd_bf8 instance declaration
-
- 22 Mar, 2024 1 commit
-
-
Bartłomiej Kocot authored
* Add elementwise with dynamic vector dim * Reduce number of instaces * Fixes * Fixes
-
- 09 Mar, 2024 1 commit
-
-
zjing14 authored
* wmma_op + unit test * add arch limitation to wmma test * change arch limitation * Refactor + Add all type unit test(int4 compile failed) * Add f32_16x16x16_bf16 unit test * tempsave * tempsave * tempsave * runtime bug, cannot find symbol * workaround for incorrect HIP warpSize return value * debugging * tempsave * Correctness OK, waiting for optimization * Tidy up + format * temp save * temp save, reproduce the v_bfi_b32 issue * add inline asm for wmmaop test * tidy up * clean some debug purpose code * discard some codes * clang format * clang format * compiler issue fixed + increase tile size * navi3x_multipleD+example * temp save * workable * batchedgemm[OK], groupconv[debug] * groupconv: Sanity check[OK], Performance[Bad] * navi3x_groupconv_need_optimization * create necessary files * save progress * Add Inter-Row thread transfer * save progress * save debugging progress * sanity check pass * fix a host tensor bug and clean up flash-attn code * format * cancel unnecessary change * cancel unnecessary change * cancel unnecessary change * temp save, add asm backend flag to amd_wmma * Mat-A LDS Bypass sanity pass * temp save * gemm sanity fix * Porting new blockwise gemm to flash attention * Example branch provide to compiler team * tempsave * Fix a bug * batched gemm ported * conv A-skip lds ported * Skip B-Lds real gemm * Skip B Lds Gemm + MulD * batched gemm, conv, skip b lds * format * Attn, skip b lds * Change GridwiseOp nam * fix a typo caused bug * Skip A_Lds sanity pass, Skip B_Lds scratch occured * Bug found, intra-row permute off caused * bug found * a fix * disable buffer load due to incorrect 3rd dword * update fmha config, no scratch generated * update 3rd dword * fmha config update * FMHA, add support to gfx1101/gfx1102 * Merge origin dev (#2) * [Navi3x] Fix Gridwise_multiple_d operation (#649) * Add CMake Option "USE_OPT_NAVI3X" * fix bug * standardize docs (#655) * Separate bibtex requirement from rocm-docs-core (#656) * separate bibtex requirement from rocm-docs-core * point requirements to source rocm-docs-core repo * Add CMake Option "USE_OPT_NAVI3X" (#647) * Add CMake Option "USE_OPT_NAVI3X" * remove navi3x opt compile option from cmake script * Conv + quantization + tanh (#645) * Rename file. Prepare to support another activation * Add comment for quantization * Extract out_elementop * Add tanh example * Add conv + bias + tanh quantization instance * Add missing parameter * Refine cmake * Add external api and client example * Extract variable in example * Fix the comment --------- Co-authored-by:
zjing14 <zhangjing14@gmail.com> * Add a denorm test fix (#603) * Add type_convert implementations for bf16 * Add the fix for conv_fwd * Add the fix for conv_bwd_data * Add the fix for conv_bwd_weight * Format * Format * Another format * Add a macro to use workaround on MI200 only * Format --------- Co-authored-by:
Rosty Geyyer <rosty.geyyer@amd.com> Co-authored-by:
zjing14 <zhangjing14@gmail.com> * simplify karg in device/grid of split-k op (#644) * simplify karg in device/grid split-k op * fix mk_kn_mn instances * add more instances * use name from tensor layout * fix 3rd dword of buffer source descriptor (#659) * add fp64 instances (#658) Co-authored-by:
root <root@ctr-ubbsmc15.amd.com> * Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665) This reverts commit bb5530af . * Groupnorm + swish external api (#668) * Rename to proper naming * Add example of groupnorm + swish * Extract duplicate code in example * Add groupnorm + swish instances * Ractor instance generation, split into multiple cpp file * Add external api and client example * Refine profiler message * Use ck math version of exp * Refine problem size in example * Add host version of exp * add a marco to turn on/off denorm fix (off by default) (#673) * add a marco to turn off denorm fix by default * expose the marco --------- Co-authored-by:
root <root@ctr-ubbsmc15.amd.com> * fixed quant example (#672) Co-authored-by:
root <root@ctr-ubbsmc15.amd.com> * Add dependabot config and pin rocm-docs-core (#663) * [gtest] suppress unsafe buffer warn (#670) ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912 * Add memory index guard in wmma device ops (#667) * Add more macros to turn on/off denorm fix (#678) Co-authored-by:
Rosty Geyyer <rosty.geyyer@amd.com> * Fix a typo (#676) * Add (#677) * Allow using ROCm release candidate compilers. (#679) * enable use of rocm5.5 release candidate 4 * upgrade to ROCM5.5 RC5 * try fix the PUB_KEY error, remove the cmake-data package * upgrade to latest cmake version * use private dockerhub repo for rocm5.5 rc5 * add missing bracket * add vector load check * solve conflicts --------- Co-authored-by:
Sam Wu <sjwu@ualberta.ca> Co-authored-by:
Sam Wu <sam.wu2@amd.com> Co-authored-by:
rocking5566 <ChunYu.Lai@amd.com> Co-authored-by:
zjing14 <zhangjing14@gmail.com> Co-authored-by:
Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com> Co-authored-by:
Rosty Geyyer <rosty.geyyer@amd.com> Co-authored-by:
carlushuang <carlus.huang@amd.com> Co-authored-by:
root <root@ctr-ubbsmc15.amd.com> Co-authored-by:
Jun Liu <Liu.Jun@amd.com> Co-authored-by:
Illia Silin <98187287+illsilin@users.noreply.github.com> * Disable SkipLDS & Align AIT api (#3) * fix layernorm, reduction Ops (#4) * [Navi3x] Fix Gridwise_multiple_d operation (#649) * Add CMake Option "USE_OPT_NAVI3X" * fix bug * standardize docs (#655) * Separate bibtex requirement from rocm-docs-core (#656) * separate bibtex requirement from rocm-docs-core * point requirements to source rocm-docs-core repo * Add CMake Option "USE_OPT_NAVI3X" (#647) * Add CMake Option "USE_OPT_NAVI3X" * remove navi3x opt compile option from cmake script * Conv + quantization + tanh (#645) * Rename file. Prepare to support another activation * Add comment for quantization * Extract out_elementop * Add tanh example * Add conv + bias + tanh quantization instance * Add missing parameter * Refine cmake * Add external api and client example * Extract variable in example * Fix the comment --------- Co-authored-by:
zjing14 <zhangjing14@gmail.com> * Add a denorm test fix (#603) * Add type_convert implementations for bf16 * Add the fix for conv_fwd * Add the fix for conv_bwd_data * Add the fix for conv_bwd_weight * Format * Format * Another format * Add a macro to use workaround on MI200 only * Format --------- Co-authored-by:
Rosty Geyyer <rosty.geyyer@amd.com> Co-authored-by:
zjing14 <zhangjing14@gmail.com> * simplify karg in device/grid of split-k op (#644) * simplify karg in device/grid split-k op * fix mk_kn_mn instances * add more instances * use name from tensor layout * fix 3rd dword of buffer source descriptor (#659) * add fp64 instances (#658) Co-authored-by:
root <root@ctr-ubbsmc15.amd.com> * Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665) This reverts commit bb5530af . * Groupnorm + swish external api (#668) * Rename to proper naming * Add example of groupnorm + swish * Extract duplicate code in example * Add groupnorm + swish instances * Ractor instance generation, split into multiple cpp file * Add external api and client example * Refine profiler message * Use ck math version of exp * Refine problem size in example * Add host version of exp * add a marco to turn on/off denorm fix (off by default) (#673) * add a marco to turn off denorm fix by default * expose the marco --------- Co-authored-by:
root <root@ctr-ubbsmc15.amd.com> * fixed quant example (#672) Co-authored-by:
root <root@ctr-ubbsmc15.amd.com> * Add dependabot config and pin rocm-docs-core (#663) * [gtest] suppress unsafe buffer warn (#670) ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912 * Add memory index guard in wmma device ops (#667) * Add more macros to turn on/off denorm fix (#678) Co-authored-by:
Rosty Geyyer <rosty.geyyer@amd.com> * Fix a typo (#676) * Add (#677) * Allow using ROCm release candidate compilers. (#679) * enable use of rocm5.5 release candidate 4 * upgrade to ROCM5.5 RC5 * try fix the PUB_KEY error, remove the cmake-data package * upgrade to latest cmake version * use private dockerhub repo for rocm5.5 rc5 * add missing bracket * Disable SkipLDS & Align AIT api * Update dependabot config (#682) Co-authored-by:
samjwu <samjwu@users.noreply.github.com> * update attn api * solve type_convert bug + enable --------- Co-authored-by:
Sam Wu <sjwu@ualberta.ca> Co-authored-by:
Sam Wu <sam.wu2@amd.com> Co-authored-by:
rocking5566 <ChunYu.Lai@amd.com> Co-authored-by:
zjing14 <zhangjing14@gmail.com> Co-authored-by:
Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com> Co-authored-by:
Rosty Geyyer <rosty.geyyer@amd.com> Co-authored-by:
carlushuang <carlus.huang@amd.com> Co-authored-by:
root <root@ctr-ubbsmc15.amd.com> Co-authored-by:
Jun Liu <Liu.Jun@amd.com> Co-authored-by:
Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by:
samjwu <samjwu@users.noreply.github.com> Co-authored-by:
haocwang <Haocong.WANG@amd.com> * fix typo * Fix attention with causal mask * multiple fix, try ait compile * Add A/B not use LDS pipeline * Clang format, Add gfx1101, gfx1102 support of FMHA example * cancel change of format script * 1. Enable 2-stage global Prefetch ( May cause VGPR spilling) 2. Enable FP16 accumulator blockwise_gemm * clang-format * 1. change blockwise gemm loopover direction from kmn to mnk ( ~1% improvement) 2. change kernel timing mode to 50 warmup + 50 timed repeat * Update low level abstration of blockwise gemm wmma * (2/5) bilinear gemm pass, perf bug: skip a lds has lower performance than skip b lds * (3/5) batched gemm pass, perf bug: skip a lds has lower performance than skip b lds * (4/5) grouped conv pass * (5/5) attention pass, todo: debug lds perf bug * AIT Attention API refactor (#8) * sanity pass * sanity pass 2 * confirm significant performance regression. * turn on all instances * turn off instance format * Fix bug & tunning & format * DML meta, self_attn+cross_attn * sanity pass * remove useless flag * update tile and problem size used in AIT attention * bug fix in grouped conv supporting check * deprecate inline asm wmma * Bug fix: double lds skip * clang-format * Fix errors in 1. example, fmha 2. gridwise pipeline 3. deviceop, fmha, change some containers from vector to array * part2 of previous commit * clang format * API fix of gridwisegemmpipeline * separate array base and vector base attention tensor transformation * fix gemm * clang format * add gemm fp16 instances * Temp save * fpAintB kernel compile pass * Sanity pass. * Temp save * debug code enabled * Fp16AInt8B_GEMM sanity * MQA implementation * GQA-4 example * tempsave * Compile pass * New implementation of fp16Aint8B Gemm, Acheieve similar math throughput with native fp16 Gemm * format * Todo: fix gemm_bilinear_wmma instances compilation bug * Solve a bug when K1=16 * remove unnecessary changes * Remove tensor layout limitation to LDS usage in tesnor contraction * update self-attention and cross-attention * fix a typo of name * Add arch limiter for fp8 gemm * enable fp8 gemm_xdl for all gfx9 targets * temporarily disable gemm_xdl_fp16_fp8 on MI100/200 * fix the cmake logic for gemm_xdl_fp16_fp8 * re-enable the gemm_xdl_fp16_fp8 on MI100/200 --------- Co-authored-by:
aska-0096 <haocwang@amd.com> Co-authored-by:
Sam Wu <sjwu@ualberta.ca> Co-authored-by:
Sam Wu <sam.wu2@amd.com> Co-authored-by:
rocking5566 <ChunYu.Lai@amd.com> Co-authored-by:
Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com> Co-authored-by:
Rosty Geyyer <rosty.geyyer@amd.com> Co-authored-by:
carlushuang <carlus.huang@amd.com> Co-authored-by:
root <root@ctr-ubbsmc15.amd.com> Co-authored-by:
Jun Liu <Liu.Jun@amd.com> Co-authored-by:
Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by:
samjwu <samjwu@users.noreply.github.com> Co-authored-by:
haocwang <Haocong.WANG@amd.com> Co-authored-by:
illsilin <Illia.Silin@amd.com>
-
- 20 Feb, 2024 1 commit
-
-
Bartłomiej Kocot authored
* Extend permute scale support up to 6D * Fixes * Fixes * Update profiler/README.md Co-authored-by:
Lisa <lisajdelaney@gmail.com> * Update profiler/README.md Co-authored-by:
Lisa <lisajdelaney@gmail.com> * Update profiler/README.md Co-authored-by:
Lisa <lisajdelaney@gmail.com> * Update profiler/README.md Co-authored-by:
Lisa <lisajdelaney@gmail.com> * Update profiler/README.md Co-authored-by:
Lisa <lisajdelaney@gmail.com> * Update profiler/README.md Co-authored-by:
Lisa <lisajdelaney@gmail.com> * Update profiler/README.md Co-authored-by:
Lisa <lisajdelaney@gmail.com> --------- Co-authored-by:
Lisa <lisajdelaney@gmail.com>
-
- 13 Feb, 2024 1 commit
-
-
Bartłomiej Kocot authored
* Add optimized blockwise gemm using ck wrapper * Add basic gemm example * Update docs * Add tutorial for gemm using ck wrapper * Add perf note * edits * Fix cmake * Fixes --------- Co-authored-by:Lisa Delaney <lisa.delaney@amd.com>
-
- 07 Feb, 2024 1 commit
-
-
jakpiase authored
-
- 02 Feb, 2024 2 commits
-
-
Illia Silin authored
* add support for navi2x and navi3x models * fix syntax * use common macro for different mi300 architectures
-
Bartłomiej Kocot authored
-
- 31 Jan, 2024 1 commit
-
-
Bartłomiej Kocot authored
* Add blockwise gemm to ck wrapper * Add blockwise gemm traits * Disable test_gemm for non xdl devices * Fixes * Add c layout descritpions
-
- 25 Jan, 2024 1 commit
-
-
rocking authored
* Add layernorm bwd gamma beta external api * Add groupnorm external api * Add layernorm bwd gamma beta profiler * Add groupnorm bwd gamma beta ckProfiler * Add layernorm & groupnorm bwd gamma beta test * Fix groupnorm bwd gamma beta profiler bug * Layernorm bwd weight client example * Groupnorm bwd weight client example * clang format * Remove useless header * Let inv_std be positive * Rename to num_bytes and move this calculation outside the loop
-
- 24 Jan, 2024 1 commit
-
-
Illia Silin authored
* fix cppcheck errors, first pass * fix format * fix returned value in examples * add macro definitions for cppcheck * fix the profile_gemm logic * update the gemm profiler logic * add more difinitions to cppcheck, fix couple more errors * replace runtime error with message in device function * fix a couple of int4 issues * no return for fill function * fix errors in data_types.hpp * fix format * fix few remaining errors * fix errors in data_types.hpp * fix last couple of errors in datat_types.hpp
-
- 19 Jan, 2024 1 commit
-
-
Bartłomiej Kocot authored
* Add optimized copy to ck wrapper * Example optimizations * Fixes * Move img2col test to client example * Refactor example * Fix docs * Fixes * Fix * Fixes * Fixes * Fixes * Fixes * Fixes --------- Co-authored-by:zjing14 <zhangjing14@gmail.com>
-
- 09 Jan, 2024 1 commit
-
-
Illia Silin authored
* allow setting the number of warmup cycles and iterations for profiler * fix the gemm_splitk and grouped_gemm examples
-
- 04 Jan, 2024 1 commit
-
-
arai713 authored
* added working example for 5D input using 1D kernel * example with 5D input tensor and 2d kernel - not working: issues with arguments * added updated version of 3d device op - changed descriptors/dims * added example file to check kernel * fixed descriptor and isSupportedArgument stride problem * added and modified kernel for 3d - updated tids/loop * adding some more 5d example files * fixed some issues * changes made for testing * working version: fixed error in stride for A, still a bit inefficient * cleaned up formatting/comments * updating formatting * more formatting fixes * fixing cmake, adding back gpu targets in cmake script * adding client example * added instances for client example * fixed errors in client example * implemented client ex with device_elementwise.hpp and device_elementwise_3d_impl.hpp * removed extra files * minor formatting and naming fixes * adding test files and profiler * fixing minor error * minor fix * removed unneccesary comments, renamed files * updated instance list for client example, added different layout example * removing instances * fixed error in instance generation * remove comments * update profiler and client example tensor layouts * fixed errors in test/profiler * updated vector dim access to enable vector load * updated test/profiler files * updated example with 1d kernel * updating profiler * renamed files * disabled device op for MI300 * skip elementwise_permute_2d on gfx94x * Update CMakeLists.txt * fixing CMake - disabling some GPU targets * added transpose profiler to CMake * fixed transpose profiler errors * fixed instances for tests/profiler * cleaned up code in transpose profiler source code * added some comments, updated copyright * made function arguments const where possible --------- Co-authored-by:
Jing Zhang <jizha@amd.com> Co-authored-by:
Jing Zhang <jizhan@amd.com> Co-authored-by:
zjing14 <zhangjing14@gmail.com>
-