1. 22 May, 2024 2 commits
  2. 15 May, 2024 1 commit
  3. 10 May, 2024 2 commits
  4. 07 May, 2024 1 commit
  5. 26 Apr, 2024 1 commit
    • [GEMM] UniversalGemm update (#1262) · 764164b4
      Haocong WANG authored
      
      
      * Add bf16 instances
      
      * Add bf16 gemm universal example
      
      * tempsave
      
      * Add guard to navi compilation
      
      * workaround for a specific mixed gemm instance (bring it back when the compiler fix is uploaded)
      
      * fix formatting condition statement issue
      
      * solve conflict
      
      ---------
      Co-authored-by: Jun Liu <Liu.Jun@amd.com>
  6. 25 Apr, 2024 1 commit
    • Grouped GEMM Multiple D tile loop. (#1247) · b4032629
      Adam Osewski authored
      * Overload output stream operator for LoopScheduler and PipelineVersion
      
      * Add Run overload accepting grid descriptors MK.
      
      * Add __device__ keyword for CalculateGridSize
      
      * Create device op GroupedGemmMultipleD
      
      * Add GroupedGemm MultipleD Tile Loop implementation.
      
      * Add an example for GroupedGemm MultipleD tile loop.
      
      * Device Op GroupedGEMMTileLoop.
      
      * Bunch of small changes in example.
      
      * CkProfiler
      
      * Remove unused tparam.
      
      * Fix include statement.
      
      * Fix output stream overloads.
      
      * Do not make descriptors and check validity until we find the group.
      
      * Fix gemm desc initialization.
      
      * Revert device op
      
      * Fix compilation for DTYPES=FP16
      
      * Validate tensor transfer parameters.
      
      * Validate only NK dims on host if M is not known.
      
      * Fix bug.
      
      * A convenient debug func for selecting threads.
      
      * Fix has main k block loop bug.
      
      * Make sure that b2c has an up-to-date tile offset.
      
      * Output stream operator for Sequence type.
      
      * Cmake file formatting.
  7. 18 Apr, 2024 1 commit
  8. 14 Apr, 2024 1 commit
    • [GEMM] Gemm universal device operation (#1154) · f83e9701
      Haocong WANG authored
      
      
      * Optimize GEMM on MI200/300:
      1. Add new blockwise gemm pipeline
      2. Add irregular splitk instances
      
      * clang format + typo fix
      
      * Fix a bug
      
      * initial commit
      
      * Add more instances to irregular splitk
      
      * blkgemm pipeline v1~4 prototype
      
      * Sanity checked. Known issues:
      1. Poor performance of splitk
      2. Register spill on blkgemmpipeline v3
      
      * Sanity and Performance fix:
      1. fix a bug related to sanity in grouped b2c mapping
      2. fix a bug related to sanity and performance in splitk offset
      
      * Sanity and API update:
      1. Remove prefetch stage
      2. Fix valid check bug
      3. Add first gemm_universal instance into ckProfiler
      
      * Add NN instances for gemm universal
      
      * 1. Add NT instances for gemm_universal
      2. Fix a bug about Kpadding in gemm_universal
      
      * Fix a bug regarding padding with odd K numbers
      
      * remove kernel print
      
      * Fix KPadding bug...
      
      * Update safety check
      
      * another try to fix kpadding..
      
      * Sanity checked
      
      * new instances..
      
      * clang format+typo fix
      
      * remove clang format script's change
      
      * Add non-hotloop compile option
      
      * 1. Add fp16xfp8 example
      2. pull packed convert f8 from pr1150
      
      * Some misc. optimizations and fixes
      
      * Add pipeline description docs
      
      * Split universal gemm instance library to cut profiler compiling time
      
      * uncomment cmakefile
      
      * Fix a bug caused by blockwise_gemm_pipe_v2
      
      * reduce default splitk to 1
      
      * Add 224x256x64 tile size
      
      * update, including:
      1. Experiment pipeline 5~7
      2. Optimization for pipeline 4
      3. Organized instance library
      
      * temp save
      
      * temp save
      
      * Permuted lds layout, sanity and function checked
      
      * clang format
      
      * Move OOB check from RunRead to RunWrite, for better software pipeline.
      TODO: agpr spill when NN layout
      
      * clangformat
      
      * A/B splitpipe scheduler for v3
      
      * Fix two bugs
      
      * bug fix
      
      * fix a bug in oob check
      
      * Example for mixed fp16_fp8 gemm
      
      * Clean experimental code blocks
      
      * Add mixed precision gemm into profiler
      
      * tempsave
      
      * optimize m/n major lds layout
      
      * Add RRR GEMM  mixed precision instances
      
      * Optimize f8 matrix transpose
      
      * Add test_gemm_universal
      
      * A/B split schedule for blkpip v5
      
      * Take ds_read2 into iglp scheduling scheme
      
      * format
      
      * fixed cmake
      
      * Add llvm-option into CI cmake flag
      
      ---------
      Co-authored-by: Jing Zhang <jizhan@amd.com>
  9. 09 Apr, 2024 1 commit
    • Extend support for contraction 6D (#1207) · ced5af16
      Bartłomiej Kocot authored
      * Extend support for contraction up to 5D
      
      * Extend contraction bilinear instances
      
      * Fix interface test
      
      * Add 6d support, remove 3d,4d,5d
      
      * Fixes
      
      * Fix readme
      
      * Make default dim for contraction instances
  10. 04 Apr, 2024 1 commit
  11. 02 Apr, 2024 1 commit
    • Split the instances by architecture. (#1223) · ae57e593
      Illia Silin authored
      * parse examples inside the add_example_executable function
      
      * fix the example 64 cmake file
      
      * add xdl flag to the gemm_bias_softmax_gemm_permute example
      
      * add filtering of tests based on architecture type
      
      * enable test_grouped_gemm for gfx9 only
      
      * enable test_transpose only for gfx9
      
      * only link test_transpose if it gets built
      
      * split the gemm instances by architectures
      
      * split gemm_bilinear,grouped_conv_bwd_weight instances by targets
      
      * split instances by architecture
      
      * split grouped_conv instances by architecture
      
      * fix clang format
      
      * fix the if-else logic in group_conv headers
      
      * small fix for grouped convolution instances
      
      * fix the grouped conv bwd weight dl instances
      
      * fix client examples
      
      * only enable client examples 3 and 4 on gfx9
      
      * set the gfx9 macro
      
      * make sure the architecture macros are set by cmake
      
      * use separate set of xdl/wmma flags for host code
      
      * simplify the main cmake file
      
      * add conv_fwd_bf8 instance declaration
  12. 22 Mar, 2024 1 commit
  13. 09 Mar, 2024 1 commit
    • Navi3 rel (#1176) · 1837040a
      zjing14 authored
      
      
      * wmma_op + unit test
      
      * add arch limitation to wmma test
      
      * change arch limitation
      
      * Refactor + Add all type unit test (int4 compile failed)
      
      * Add f32_16x16x16_bf16 unit test
      
      * tempsave
      
      * tempsave
      
      * tempsave
      
      * runtime bug, cannot find symbol
      
      * workaround for incorrect HIP warpSize return value
      
      * debugging
      
      * tempsave
      
      * Correctness OK, waiting for optimization
      
      * Tidy up + format
      
      * temp save
      
      * temp save, reproduce the v_bfi_b32 issue
      
      * add inline asm for wmmaop test
      
      * tidy up
      
      * clean some debug purpose code
      
      * discard some codes
      
      * clang format
      
      * clang format
      
      * compiler issue fixed + increase tile size
      
      * navi3x_multipleD+example
      
      * temp save
      
      * workable
      
      * batchedgemm[OK], groupconv[debug]
      
      * groupconv: Sanity check[OK], Performance[Bad]
      
      * navi3x_groupconv_need_optimization
      
      * create necessary files
      
      * save progress
      
      * Add Inter-Row thread transfer
      
      * save progress
      
      * save debugging progress
      
      * sanity check pass
      
      * fix a host tensor bug and clean up flash-attn code
      
      * format
      
      * cancel unnecessary change
      
      * cancel unnecessary change
      
      * cancel unnecessary change
      
      * temp save, add asm backend flag to amd_wmma
      
      * Mat-A LDS Bypass sanity pass
      
      * temp save
      
      * gemm sanity fix
      
      * Porting new blockwise gemm to flash attention
      
      * Example branch provide to compiler team
      
      * tempsave
      
      * Fix a bug
      
      * batched gemm ported
      
      * conv A-skip lds ported
      
      * Skip B-Lds real gemm
      
      * Skip B Lds Gemm + MulD
      
      * batched gemm, conv, skip b lds
      
      * format
      
      * Attn, skip b lds
      
      * Change GridwiseOp name
      
      * fix a bug caused by a typo
      
      * Skip A_Lds sanity pass, Skip B_Lds scratch occurred
      
      * Bug found, intra-row permute off caused
      
      * bug found
      
      * a fix
      
      * disable buffer load due to incorrect 3rd dword
      
      * update fmha config, no scratch generated
      
      * update 3rd dword
      
      * fmha config update
      
      * FMHA, add support to gfx1101/gfx1102
      
      * Merge origin dev (#2)
      
      * [Navi3x] Fix Gridwise_multiple_d operation (#649)
      
      * Add CMake Option "USE_OPT_NAVI3X"
      
      * fix bug
      
      * standardize docs (#655)
      
      * Separate bibtex requirement from rocm-docs-core (#656)
      
      * separate bibtex requirement from rocm-docs-core
      
      * point requirements to source rocm-docs-core repo
      
      * Add CMake Option "USE_OPT_NAVI3X" (#647)
      
      * Add CMake Option "USE_OPT_NAVI3X"
      
      * remove navi3x opt compile option from cmake script
      
      * Conv + quantization + tanh  (#645)
      
      * Rename file. Prepare to support another activation
      
      * Add comment for quantization
      
      * Extract out_elementop
      
      * Add tanh example
      
      * Add conv + bias + tanh quantization instance
      
      * Add missing parameter
      
      * Refine cmake
      
      * Add external api and client example
      
      * Extract variable in example
      
      * Fix the comment
      
      ---------
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
      
      * Add a denorm test fix (#603)
      
      * Add type_convert implementations for bf16
      
      * Add the fix for conv_fwd
      
      * Add the fix for conv_bwd_data
      
      * Add the fix for conv_bwd_weight
      
      * Format
      
      * Format
      
      * Another format
      
      * Add a macro to use workaround on MI200 only
      
      * Format
      
      ---------
      Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
      
      * simplify karg in device/grid of split-k op (#644)
      
      * simplify karg in device/grid split-k op
      
      * fix mk_kn_mn instances
      
      * add more instances
      
      * use name from tensor layout
      
      * fix 3rd dword of buffer source descriptor (#659)
      
      * add fp64 instances (#658)
      Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
      
      * Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665)
      
      This reverts commit bb5530af.
      
      * Groupnorm + swish external api (#668)
      
      * Rename to proper naming
      
      * Add example of groupnorm + swish
      
      * Extract duplicate code in example
      
      * Add groupnorm + swish instances
      
      * Refactor instance generation, split into multiple cpp files
      
      * Add external api and client example
      
      * Refine profiler message
      
      * Use ck math version of exp
      
      * Refine problem size in example
      
      * Add host version of exp
      
      * add a macro to turn on/off denorm fix (off by default) (#673)
      
      * add a macro to turn off denorm fix by default
      
      * expose the macro
      
      ---------
      Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
      
      * fixed quant example (#672)
      Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
      
      * Add dependabot config and pin rocm-docs-core (#663)
      
      * [gtest] suppress unsafe buffer warn (#670)
      
      ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912
      
      
      
      * Add memory index guard in wmma device ops (#667)
      
      * Add more macros to turn on/off denorm fix (#678)
      Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
      
      * Fix a typo (#676)
      
      * Add (#677)
      
      * Allow using ROCm release candidate compilers. (#679)
      
      * enable use of rocm5.5 release candidate 4
      
      * upgrade to ROCM5.5 RC5
      
      * try to fix the PUB_KEY error, remove the cmake-data package
      
      * upgrade to latest cmake version
      
      * use private dockerhub repo for rocm5.5 rc5
      
      * add missing bracket
      
      * add vector load check
      
      * solve conflicts
      
      ---------
      Co-authored-by: Sam Wu <sjwu@ualberta.ca>
      Co-authored-by: Sam Wu <sam.wu2@amd.com>
      Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
      Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
      Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
      Co-authored-by: carlushuang <carlus.huang@amd.com>
      Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
      Co-authored-by: Jun Liu <Liu.Jun@amd.com>
      Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
      
      * Disable SkipLDS & Align AIT api (#3)
      
      * fix layernorm, reduction Ops (#4)
      
      * [Navi3x] Fix Gridwise_multiple_d operation (#649)
      
      * Add CMake Option "USE_OPT_NAVI3X"
      
      * fix bug
      
      * standardize docs (#655)
      
      * Separate bibtex requirement from rocm-docs-core (#656)
      
      * separate bibtex requirement from rocm-docs-core
      
      * point requirements to source rocm-docs-core repo
      
      * Add CMake Option "USE_OPT_NAVI3X" (#647)
      
      * Add CMake Option "USE_OPT_NAVI3X"
      
      * remove navi3x opt compile option from cmake script
      
      * Conv + quantization + tanh  (#645)
      
      * Rename file. Prepare to support another activation
      
      * Add comment for quantization
      
      * Extract out_elementop
      
      * Add tanh example
      
      * Add conv + bias + tanh quantization instance
      
      * Add missing parameter
      
      * Refine cmake
      
      * Add external api and client example
      
      * Extract variable in example
      
      * Fix the comment
      
      ---------
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
      
      * Add a denorm test fix (#603)
      
      * Add type_convert implementations for bf16
      
      * Add the fix for conv_fwd
      
      * Add the fix for conv_bwd_data
      
      * Add the fix for conv_bwd_weight
      
      * Format
      
      * Format
      
      * Another format
      
      * Add a macro to use workaround on MI200 only
      
      * Format
      
      ---------
      Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
      
      * simplify karg in device/grid of split-k op (#644)
      
      * simplify karg in device/grid split-k op
      
      * fix mk_kn_mn instances
      
      * add more instances
      
      * use name from tensor layout
      
      * fix 3rd dword of buffer source descriptor (#659)
      
      * add fp64 instances (#658)
      Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
      
      * Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665)
      
      This reverts commit bb5530af.
      
      * Groupnorm + swish external api (#668)
      
      * Rename to proper naming
      
      * Add example of groupnorm + swish
      
      * Extract duplicate code in example
      
      * Add groupnorm + swish instances
      
      * Refactor instance generation, split into multiple cpp files
      
      * Add external api and client example
      
      * Refine profiler message
      
      * Use ck math version of exp
      
      * Refine problem size in example
      
      * Add host version of exp
      
      * add a macro to turn on/off denorm fix (off by default) (#673)
      
      * add a macro to turn off denorm fix by default
      
      * expose the macro
      
      ---------
      Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
      
      * fixed quant example (#672)
      Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
      
      * Add dependabot config and pin rocm-docs-core (#663)
      
      * [gtest] suppress unsafe buffer warn (#670)
      
      ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912
      
      
      
      * Add memory index guard in wmma device ops (#667)
      
      * Add more macros to turn on/off denorm fix (#678)
      Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
      
      * Fix a typo (#676)
      
      * Add (#677)
      
      * Allow using ROCm release candidate compilers. (#679)
      
      * enable use of rocm5.5 release candidate 4
      
      * upgrade to ROCM5.5 RC5
      
      * try to fix the PUB_KEY error, remove the cmake-data package
      
      * upgrade to latest cmake version
      
      * use private dockerhub repo for rocm5.5 rc5
      
      * add missing bracket
      
      * Disable SkipLDS & Align AIT api
      
      * Update dependabot config (#682)
      Co-authored-by: samjwu <samjwu@users.noreply.github.com>
      
      * update attn api
      
      * solve type_convert bug + enable
      
      ---------
      Co-authored-by: Sam Wu <sjwu@ualberta.ca>
      Co-authored-by: Sam Wu <sam.wu2@amd.com>
      Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
      Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
      Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
      Co-authored-by: carlushuang <carlus.huang@amd.com>
      Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
      Co-authored-by: Jun Liu <Liu.Jun@amd.com>
      Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
      Co-authored-by: samjwu <samjwu@users.noreply.github.com>
      Co-authored-by: haocwang <Haocong.WANG@amd.com>
      
      * fix typo
      
      * Fix attention with causal mask
      
      * multiple fix, try ait compile
      
      * Add A/B not use LDS pipeline
      
      * Clang format, Add gfx1101, gfx1102 support of FMHA example
      
      * cancel change of format script
      
      * 1. Enable 2-stage global prefetch (may cause VGPR spilling)
      2. Enable FP16 accumulator blockwise_gemm
      
      * clang-format
      
      * 1. change blockwise gemm loopover direction from kmn to mnk (~1% improvement)
      2. change kernel timing mode to 50 warmup + 50 timed repeat
      
      * Update low level abstration of blockwise gemm wmma
      
      * (2/5) bilinear gemm pass, perf bug: skip a lds has lower performance than skip b lds
      
      * (3/5) batched gemm pass, perf bug: skip a lds has lower performance than skip b lds
      
      * (4/5) grouped conv pass
      
      * (5/5) attention pass, todo: debug lds perf bug
      
      * AIT Attention API refactor (#8)
      
      * sanity pass
      
      * sanity pass 2
      
      * confirm significant performance regression.
      
      * turn on all instances
      
      * turn off instance format
      
      * Fix bug & tuning & format
      
      * DML meta, self_attn+cross_attn
      
      * sanity pass
      
      * remove useless flag
      
      * update tile and problem size used in AIT attention
      
      * bug fix in grouped conv supporting check
      
      * deprecate inline asm wmma
      
      * Bug fix: double lds skip
      
      * clang-format
      
      * Fix errors in
      1. example, fmha
      2. gridwise pipeline
      3. deviceop, fmha, change some containers from vector to array
      
      * part2 of previous commit
      
      * clang format
      
      * API fix of gridwisegemmpipeline
      
      * separate array base and vector base attention tensor transformation
      
      * fix gemm
      
      * clang format
      
      * add gemm fp16 instances
      
      * Temp save
      
      * fpAintB kernel compile pass
      
      * Sanity pass.
      
      * Temp save
      
      * debug code enabled
      
      * Fp16AInt8B_GEMM sanity
      
      * MQA implementation
      
      * GQA-4 example
      
      * tempsave
      
      * Compile pass
      
      * New implementation of fp16Aint8B Gemm, achieve similar math throughput to native fp16 Gemm
      
      * format
      
      * Todo: fix gemm_bilinear_wmma instances compilation bug
      
      * Solve a bug when K1=16
      
      * remove unnecessary changes
      
      * Remove tensor layout limitation to LDS usage in tensor contraction
      
      * update self-attention and cross-attention
      
      * fix a typo of name
      
      * Add arch limiter for fp8 gemm
      
      * enable fp8 gemm_xdl for all gfx9 targets
      
      * temporarily disable gemm_xdl_fp16_fp8 on MI100/200
      
      * fix the cmake logic for gemm_xdl_fp16_fp8
      
      * re-enable the gemm_xdl_fp16_fp8 on MI100/200
      
      ---------
      Co-authored-by: aska-0096 <haocwang@amd.com>
      Co-authored-by: Sam Wu <sjwu@ualberta.ca>
      Co-authored-by: Sam Wu <sam.wu2@amd.com>
      Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
      Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
      Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
      Co-authored-by: carlushuang <carlus.huang@amd.com>
      Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
      Co-authored-by: Jun Liu <Liu.Jun@amd.com>
      Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
      Co-authored-by: samjwu <samjwu@users.noreply.github.com>
      Co-authored-by: haocwang <Haocong.WANG@amd.com>
      Co-authored-by: illsilin <Illia.Silin@amd.com>
  14. 20 Feb, 2024 1 commit
  15. 13 Feb, 2024 1 commit
  16. 07 Feb, 2024 1 commit
  17. 02 Feb, 2024 2 commits
  18. 31 Jan, 2024 1 commit
  19. 25 Jan, 2024 1 commit
    • layernorm & groupnorm bwd gamma beta (#1133) · 28f68a5a
      rocking authored
      * Add layernorm bwd gamma beta external api
      
      * Add groupnorm external api
      
      * Add layernorm bwd gamma beta profiler
      
      * Add groupnorm bwd gamma beta ckProfiler
      
      * Add layernorm & groupnorm bwd gamma beta test
      
      * Fix groupnorm bwd gamma beta profiler bug
      
      * Layernorm bwd weight client example
      
      * Groupnorm bwd weight client example
      
      * clang format
      
      * Remove useless header
      
      * Let inv_std be positive
      
      * Rename to num_bytes and move this calculation outside the loop
  20. 24 Jan, 2024 1 commit
    • Fixing most of the cppcheck errors. (#1142) · 180e5720
      Illia Silin authored
      * fix cppcheck errors, first pass
      
      * fix format
      
      * fix returned value in examples
      
      * add macro definitions for cppcheck
      
      * fix the profile_gemm logic
      
      * update the gemm profiler logic
      
      * add more definitions to cppcheck, fix a couple more errors
      
      * replace runtime error with message in device function
      
      * fix a couple of int4 issues
      
      * no return for fill function
      
      * fix errors in data_types.hpp
      
      * fix format
      
      * fix few remaining errors
      
      * fix errors in data_types.hpp
      
      * fix last couple of errors in data_types.hpp
  21. 19 Jan, 2024 1 commit
  22. 09 Jan, 2024 1 commit
  23. 04 Jan, 2024 1 commit
    • Transpose profiler fix (#1114) · aa3e2d79
      arai713 authored
      
      
      * added working example for 5D input using 1D kernel
      
      * example with 5D input tensor and 2d kernel - not working: issues with arguments
      
      * added updated version of 3d device op - changed descriptors/dims
      
      * added example file to check kernel
      
      * fixed descriptor and isSupportedArgument stride problem
      
      * added and modified kernel for 3d - updated tids/loop
      
      * adding some more 5d example files
      
      * fixed some issues
      
      * changes made for testing
      
      * working version: fixed error in stride for A, still a bit inefficient
      
      * cleaned up formatting/comments
      
      * updating formatting
      
      * more formatting fixes
      
      * fixing cmake, adding back gpu targets in cmake script
      
      * adding client example
      
      * added instances for client example
      
      * fixed errors in client example
      
      * implemented client ex with device_elementwise.hpp and device_elementwise_3d_impl.hpp
      
      * removed extra files
      
      * minor formatting and naming fixes
      
      * adding test files and profiler
      
      * fixing minor error
      
      * minor fix
      
      * removed unnecessary comments, renamed files
      
      * updated instance list for client example, added different layout example
      
      * removing instances
      
      * fixed error in instance generation
      
      * remove comments
      
      * update profiler and client example tensor layouts
      
      * fixed errors in test/profiler
      
      * updated vector dim access to enable vector load
      
      * updated test/profiler files
      
      * updated example with 1d kernel
      
      * updating profiler
      
      * renamed files
      
      * disabled device op for MI300
      
      * skip  elementwise_permute_2d on gfx94x
      
      * Update CMakeLists.txt
      
      * fixing CMake - disabling some GPU targets
      
      * added transpose profiler to CMake
      
      * fixed transpose profiler errors
      
      * fixed instances for tests/profiler
      
      * cleaned up code in transpose profiler source code
      
      * added some comments, updated copyright
      
      * made function arguments const where possible
      
      ---------
      Co-authored-by: Jing Zhang <jizha@amd.com>
      Co-authored-by: Jing Zhang <jizhan@amd.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
  24. 03 Jan, 2024 1 commit
  25. 23 Dec, 2023 1 commit
  26. 20 Dec, 2023 1 commit
  27. 19 Dec, 2023 1 commit
    • Hip tensor permute unit test (#1068) · 12a8883c
      arai713 authored
      * adding files for F32 example
      
      * adding functioning implementation with scalar multiplication and unary operator support
      
      * added fp 16 type check in unary square
      
      * updating scalar multiplication as an operator
      
      * functioning version with scalar operator
      
      * changing strides for col major
      
      * updated column major implementation
      
      * working column major implementation
      
      * cleaned up comments, rearranged/renamed files
      
      * small edits to 3d transpose profiler
      
      * adding test/profiler/instance files for hipTensor permute unit test
      
      * added more test instances
      
      * cleaned up errors, randomized input tensor, added more instances
      
      * turned off time printouts
      
      * removed conflicting transpose profiler
      
      * rearranged some files
  28. 18 Dec, 2023 1 commit
    • layernorm and groupnorm backward data (#1083) · a69aa2a1
      rocking authored
      * rename folder
      
      * Add type string
      
      * Remove typo
      
      * Add deviceOp to backward x
      
      * Add comment to describe the behavior of backward normalization
      
      * Add kernel function, prepare to implement
      
      * implement generic kernel
      
      * Check vector size
      
      * Add sweep once pipeline for small reduce size
      
      * Fix bug of KRaw_ error
      
      * Fix bug of dx stride
      
      * sanity check for mean and rstd
      
      * backward x for groupnorm
      
      * Add bwd x instance
      
      * add layernorm 2d bwd gamma beta instances
      
      * Change save mean var type from f32 to f16 in f16 mode
      
      * Change the example to f16
      
      * Add groupnorm bwd gamma beta instance
      
      * Add groupnorm bwd x instance
      
      * Fix naming
      
      * Add layernorm bwd x ckprofiler
      
      * Add groupnorm bwd x profiler
      
      * clang format
      
      * Rename bwd x to bwd data
      
      * Fix bug of verification in profiler
      
      * Add test of layernorm and groupnorm bwd data
      
      * Add missing cmake
      
      * Add layernorm2d bwd data
      
      * rename fwd example
      
      * Add groupnorm client example
      
      * Fix typo. replace Invarient with Invariant
      
      * Add checking before running the best instance
  29. 15 Dec, 2023 1 commit
  30. 06 Dec, 2023 1 commit
    • Introduce wrapper library (#1071) · 836b7e55
      Bartłomiej Kocot authored
      * Introduce wrapper library
      
      * Update cmake files
      
      * Revert "Update cmake files"
      
      This reverts commit c27f88b56590c11a88e26d5d0df7aca51a08133d.
      
      * Fix comments
  31. 28 Nov, 2023 1 commit
    • Split the static library into several files. (#1044) · 7965d66a
      Illia Silin authored
      * split the static library into several
      
      * update lib paths and fix client example
      
      * do not use device_mha_operations for client examples
      
      * use appropriate libs to link to client examples
      
      * remove the gpu/transpose path from the list
      
      * try fixing client examples 3, 4, 9
      
      * add necessary libs for client examples
      
      * fix the layernorm client example
      
      * fix the client examples 23 and 24
      
      * fix typo
      
      * add interface library and refresh clang format
  32. 14 Nov, 2023 1 commit
  33. 10 Nov, 2023 1 commit
    • Support multi AB for grouped conv fwd xdl (#1027) · 49e52bb3
      Bartłomiej Kocot authored
      * Support multi AB for grouped conv fwd xdl
      
      * Add instances
      
      * Add client example
      
      * Add example
      
      * Add interface test
      
      * Minor fixes
      
      Minor fixes
      
      Minor fixes
      
      * Comment fixes
      
      * Fixes
      
      * Reference fix
      
      * Test xdl fixes
      
      * Improve multi_ab interface test
  34. 09 Nov, 2023 2 commits
    • Transpose 3d (#984) · 3af8c81a
      arai713 authored
      
      
      * added working example for 5D input using 1D kernel
      
      * example with 5D input tensor and 2d kernel - not working: issues with arguments
      
      * added updated version of 3d device op - changed descriptors/dims
      
      * added example file to check kernel
      
      * fixed descriptor and isSupportedArgument stride problem
      
      * added and modified kernel for 3d - updated tids/loop
      
      * adding some more 5d example files
      
      * fixed some issues
      
      * changes made for testing
      
      * working version: fixed error in stride for A, still a bit inefficient
      
      * cleaned up formatting/comments
      
      * updating formatting
      
      * more formatting fixes
      
      * fixing cmake, adding back gpu targets in cmake script
      
      * adding client example
      
      * added instances for client example
      
      * fixed errors in client example
      
      * implemented client ex with device_elementwise.hpp and device_elementwise_3d_impl.hpp
      
      * removed extra files
      
      * minor formatting and naming fixes
      
      * adding test files and profiler
      
      * fixing minor error
      
      * minor fix
      
      * removed unnecessary comments, renamed files
      
      * updated instance list for client example, added different layout example
      
      * removing instances
      
      * fixed error in instance generation
      
      * remove comments
      
      * update profiler and client example tensor layouts
      
      * fixed errors in test/profiler
      
      * updated vector dim access to enable vector load
      
      * updated test/profiler files
      
      * updated example with 1d kernel
      
      * updating profiler
      
      * renamed files
      
      ---------
      Co-authored-by: Jing Zhang <jizha@amd.com>
    • Layernorm4d (#1022) · a3d9a2cd
      rocking authored
      
      
      * Rename folder
      
      * Add layernorm 4d fwd example
      
      * Rename original layernorm example
      
      * Add layernorm 4d f16  test
      
      * Add layernorm4d_fwd client example
      
      * Support layernorm4D in ckProfiler
      
      * Rename groupnorm to groupnorm fwd in example
      
      * Rename layernorm and group fwd in test
      
      * Rename normalization to normalization_fwd (instances)
      
      * Add fwd to DeviceNormalization
      
      * Rename external api header
      
      * Rename folder, because we can also add bwd in this folder
      
      * Add fwd in layernorm and groupnorm (profiler)
      
      * Fix compile error
      
      ---------
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
  35. 07 Nov, 2023 1 commit
  36. 02 Nov, 2023 1 commit
    • Add support for mixed precision in contraction scale and bilinear (#973) · 4ef704d8
      Bartlomiej Wroblewski authored
      
      
      * Add support for mixed precision in contraction scale and bilinear (#936)
      
      * Extract common functionality to separate files
      
      * Reference contraction: Remove incorrect consts from type_converts
      
      * Reference contraction: Add missing type_convert for dst value
      
      * Reference contraction: Fix incorrect order of B matrix dimensions
      
      * Add support for mixed precision in contraction scale and bilinear
      
      * Move using statements from instances to a common file
      
      * Move using statements from examples to a common file
      
      * Fix the order of B matrix dimensions across examples and profiler
      
      * Fix the computation of error threshold
      
      * Make ComputeDataType an optional argument
      
      * Include possible DataType -> ComputeDataType casting error in the threshold
      
      * Remove commented code
      
      * Make the ComputeDataType an optional argument in instance
      
      ---------
      Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>