1. 22 May, 2024 1 commit
  2. 20 May, 2024 1 commit
  3. 17 May, 2024 2 commits
  4. 15 May, 2024 3 commits
  5. 10 May, 2024 2 commits
  6. 09 May, 2024 2 commits
  7. 08 May, 2024 2 commits
  8. 07 May, 2024 2 commits
  9. 02 May, 2024 1 commit
  10. 29 Apr, 2024 1 commit
  11. 26 Apr, 2024 3 commits
    • Haocong WANG's avatar
      [GEMM] UniversalGemm update (#1262) · 764164b4
      Haocong WANG authored
      
      
      * Add bf16 instances
      
      * Add bf16 gemm universal example
      
      * tempsave
      
      * Add guard to navi compilation
      
      * workground on a specific mixed gemm instance ( bring back it when compiler fix upload)
      
      * fix formatting condition statement issue
      
      * solve conflict
      
      ---------
      Co-authored-by: default avatarJun Liu <Liu.Jun@amd.com>
      764164b4
    • Rostyslav Geyyer's avatar
      Add element op (#1259) · f044ff71
      Rostyslav Geyyer authored
      f044ff71
    • zjing14's avatar
      bf16A_Int8B with fastgelu/bias (#1264) · 0d0150db
      zjing14 authored
      * changed the copy function to v7r2
      
      * adding multi_abd
      
      * in-progress
      
      * add post-load oob check
      
      * debugging
      
      * adjust instances
      
      * add run_lds
      
      * add elemntwise_op
      
      * replace multi_abd_device with v3
      
      * clean up
      
      * clean
      
      * clean
      
      * Added LDSType
      
      * profiling
      
      * adjust oobcheck
      
      * add missing file
      
      * refactor
      
      * clean
      
      * add examples
      0d0150db
  12. 25 Apr, 2024 2 commits
    • Adam Osewski's avatar
      Grouped GEMM Multiple D tile loop. (#1247) · b4032629
      Adam Osewski authored
      * Overload output stream operator for LoopScheduler and PiplineVersion
      
      * Add Run overload accepting grid descriptors MK.
      
      * Add __device__ keyword for CalculateGridSize
      
      * Create device op GroupedGemmMultipleD
      
      * Add GroupedGemm MultipleD Tile Loop implementation.
      
      * Add an example for GroupedGemm MultipleD tile loop.
      
      * Device Op GroupedGEMMTileLoop.
      
      * Bunch of small changes in exmaple.
      
      * CkProfiler
      
      * Remove unused tparam.
      
      * Fix include statement.
      
      * Fix output stream overloads.
      
      * Do not make descriptors and check validity untill we find group.
      
      * Fix gemm desc initialization.
      
      * Revert device op
      
      * Fix compilation for DTYPES=FP16
      
      * Validate tensor transfers paramters.
      
      * Validate on host only NK dims if M is not known.
      
      * Fix bug.
      
      * A convenient debug func for selecting threads.
      
      * Fix has main k block loop bug.
      
      * Make sure that b2c has up to date tile offset.
      
      * Output stream operator for Sequence type.
      
      * Cmake file formatting.
      b4032629
    • ltqin's avatar
      Universal gemm flush cache (#1251) · f448d179
      ltqin authored
      
      
      * add flush cache to device op
      
      * add flush cache parameter to ckProfiler
      
      * change calculate size a and b method
      
      * chang evaluation time method foro AVERAGE to MEDIAN
      
      * format code
      
      * adjust some code
      
      * fix core dumped
      
      * remove loop call flush icache in kernel
      
      * remove loop(outer) call flush icache
      
      ---------
      Co-authored-by: default avatarletaoqin <letaoqin@amd.com>
      f448d179
  13. 23 Apr, 2024 1 commit
  14. 22 Apr, 2024 1 commit
  15. 19 Apr, 2024 1 commit
    • Bartłomiej Kocot's avatar
      Refactor elementwise kernels (#1222) · ad1597c4
      Bartłomiej Kocot authored
      * Refactor elementwise kernels
      
      * Instances fixes
      
      * Fix cmake
      
      * Fix max pool bwd test
      
      * Update two stage gemm split k
      
      * Restore elementwise scale for hiptensor backward compatiblity
      
      * Fix Acc data type check in conv fwd multiple abd
      
      * Disable conv fp64 fwd example
      
      * Update grouped conv weight multi d
      ad1597c4
  16. 18 Apr, 2024 1 commit
  17. 16 Apr, 2024 2 commits
    • zjing14's avatar
      Added Multi_ABD support into Gemm and GroupedGemmFixedNK (#978) · 12865fbf
      zjing14 authored
      
      
      * added an example grouped_gemm_multi_abd
      
      * fixed ci
      
      * add setElementwiseOp
      
      * changed API
      
      * clean code: add multiA into example
      
      * fixed v7r2 copy
      
      * add transpose
      
      * clean
      
      * fixed vector_load check
      
      * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
      Co-authored-by: default avatarBartłomiej Kocot <barkocot@amd.com>
      
      * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
      Co-authored-by: default avatarBartłomiej Kocot <barkocot@amd.com>
      
      * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
      Co-authored-by: default avatarBartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp
      Co-authored-by: default avatarBartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp
      Co-authored-by: default avatarBartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: default avatarBartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: default avatarBartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: default avatarBartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: default avatarBartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: default avatarBartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: default avatarBartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
      Co-authored-by: default avatarBartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
      Co-authored-by: default avatarBartłomiej Kocot <barkocot@amd.com>
      
      * add reduce
      
      * testing
      
      * add example_b16_i8
      
      * refactor example
      
      * clean
      
      * add mpading
      
      * disable reduce for kbatch = 1
      
      * seperate reduce device op
      
      * add reduce op
      
      * add guard for workspace_size
      
      * add instances
      
      * format
      
      * fixed
      
      * add client example
      
      * add a colmajor
      
      * add instances
      
      * Update cmake-ck-dev.sh
      
      * Update profile_gemm_splitk.cpp
      
      * Update gridwise_gemm_xdlops_v2r4r2.hpp
      
      * format
      
      * Update profile_gemm_splitk.cpp
      
      * fixed
      
      * fixed
      
      * adjust test
      
      * adjust precision loss
      
      * adjust test
      
      * fixed
      
      * add bf16_i8 scale bias
      
      * fixed scale
      
      * fixed scale elementwise_op
      
      * revert contraction deviceop changes
      
      * fixed
      
      * Add AddFastGelu
      
      * Revert "Merge branch 'jizhan/gemm_splitk_reduce' into grouped_gemm_multi_abd_fixed_nk_example"
      
      This reverts commit 3b5d001efd74335b38dcb7d8c8877580b49d23a4, reversing
      changes made to 943199a99191661c5597c51ca8371a90bf57837e.
      
      * add Scales into elementwise
      
      * add gemm_multi_abd client example
      
      * add client examples
      
      * add rcr and crr
      
      * add grouped gemm client example
      
      * add grouped gemm client example
      
      * add instance for rcr crr
      
      * format
      
      * fixed
      
      * fixed cmake
      
      * fixed
      
      * fixed client_example
      
      * format
      
      * fixed contraction isSupport
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
      Co-authored-by: default avatarBartłomiej Kocot <barkocot@amd.com>
      
      * Update device_reduce_threadwise.hpp
      
      * clean
      
      * Fixes
      
      * Fix example
      
      ---------
      Co-authored-by: default avatarJing Zhang <jizha@amd.com>
      Co-authored-by: default avatarBartłomiej Kocot <barkocot@amd.com>
      12865fbf
    • carlushuang's avatar
      introducing ck_tile! (#1216) · db376dd8
      carlushuang authored
      * enable gfx940
      
      * switch between intrinsic mfma routines on mi100/200 and mi300
      
      * fix mfma_int8 on MI300
      
      * disable 2 int8 examples on MI300
      
      * Update cmake-ck-dev.sh
      
      * restore gitignore file
      
      * modify Jenkinsfile to the internal repo
      
      * Bump rocm-docs-core from 0.24.0 to 0.29.0 in /docs/sphinx
      
      Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.0 to 0.29.0.
      - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
      - [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
      - [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.0...v0.29.0
      
      )
      
      ---
      updated-dependencies:
      - dependency-name: rocm-docs-core
        dependency-type: direct:production
        update-type: version-update:semver-minor
      ...
      Signed-off-by: default avatardependabot[bot] <support@github.com>
      
      * initial enablement of gfx950
      
      * fix clang format
      
      * disable examples 31 and 41 int8 on gfx950
      
      * add code
      
      * fix build wip
      
      * fix xx
      
      * now can build
      
      * naming
      
      * minor fix
      
      * wip fix
      
      * fix macro for exp2; fix warpgemm a/b in transposedC
      
      * unify as tuple_array
      
      * Update the required Python version to 3.9
      
      * Update executable name in test scripts
      
      * re-structure tuple/array to avoid spill
      
      * Merge function templates
      
      * Fix format
      
      * Add constraint to array<> ctor
      
      * Re-use function
      
      * Some minor changes
      
      * remove wrong code in store_raw()
      
      * fix compile issue in transpose
      
      * Rename enum
      Rename 'cood_transform_enum' to 'coord_transform_enum'
      
      * let more integral_constant->constant, and formating
      
      * make sure thread_buffer can be tuple/array
      
      * temp fix buffer_store spill
      
      * not using custom data type by default, now we can have ISA-level same code as opt_padding
      
      * fix compile error, fp8 not ready now
      
      * fix fp8 duplicated move/shift/and/or problem
      
      * Default use CK_TILE_FLOAT_TO_FP8_STOCHASTIC rounding mode
      
      * fix scratch in fp8 kernel
      
      * update some readme
      
      * fix merge from upstream
      
      * sync with upstream
      
      * sync upstream again
      
      * sync 22
      
      * remove unused
      
      * fix clang-format
      
      * update README of ck_tile example
      
      * fix several issue
      
      * let python version to be 3.8 as minimal
      
      * remove ck_tile example from default cmake target like all/install/check
      
      * remove mistake
      
      * 1).support receipe in generate.py 2).use simplified mask type 3).change left/right to pass into karg
      
      * fix some bug in group-mode masking and codegen. update README
      
      * F8 quantization for FMHA forward (#1224)
      
      * Add SAccElementFunction, PComputeElementFunction, OAccElementFunction in pipeline
      
      * Add element function to fmha api
      
      * Adjust P elementwise function
      
      * Fix bug of elementwise op, our elementwise op is not inout
      
      * Add some elementwise op, prepare to quantization
      
      * Let generate.py can generate different elementwise function
      
      * To prevent compiler issue, remove the elementwise function we have not used.
      
      * Remove f8 pipeline, we should share the same pipeline even in f8
      
      * Remove remove_cvref_t
      
      * Avoid warning
      
      * Fix wrong fp8 QK/KV block gemm setting
      
      * Check fp8 rounding error in check_err()
      
      * Set fp8 rounding error for check_err()
      
      * Use CK_TILE_FLOAT_TO_FP8_STANDARD as default fp8 rounding mode
      
      * 1. codgen the f8 api and kernel
      2. f8 host code
      
      * prevent warning in filter mode
      
      * Remove not-in-use elementwise function kargs
      
      * Remove more not-in-use elementwise function kargs
      
      * Small refinements in C++ source files
      
      * Use conditional_t<> to simplify code
      
      * Support heterogeneous argument for binary function types
      
      * Re-use already-existing scales<> functor template
      
      * Fix wrong value produced by saturating
      
      * Generalize the composes<> template
      
      * Unify saturates<> implementation
      
      * Fix type errors in composes<>
      
      * Extend less_equal<>
      
      * Reuse the existing template less_equal<> in check_err()
      
      * Add equal<float> & equal<double>
      
      * Rename check_err() parameter
      
      * Rename check_err() parameter
      
      * Add FIXME comment for adding new macro in future
      
      * Remove unnecessary cast to void
      
      * Eliminate duplicated code
      
      * Avoid dividing api pool into more than 2 groups
      
      * Use more clear variable names
      
      * Use affirmative condition in if stmt
      
      * Remove blank lines
      
      * Donot perfect forwarding in composes<>
      
      * To fix compile error, revert generate.py back to 4439cc107dd90302d68a6494bdd33113318709f8
      
      * Fix bug of p element function
      
      * Add compute element op to host softmax
      
      * Remove element function in api interface
      
      * Extract user parameter
      
      * Rename pscale and oscale variable
      
      * rename f8 to fp8
      
      * rename more f8 to fp8
      
      * Add pipeline::operator() without element_functor
      
      * 1. Remove deprecated pipeline enum
      2. Refine host code parameter
      
      * Use quantization range as input
      
      * 1. Rename max_dtype to dtype_max.
      2. Rename scale to scale_s
      3.Add init description
      
      * Refine description
      
      * prevent early return
      
      * unify _squant kernel name in cpp, update README
      
      * Adjust the default range.
      
      * Refine error message and bias range
      
      * Add fp8 benchmark and smoke test
      
      * fix fp8 swizzle_factor=4 case
      
      ---------
      Co-authored-by: default avatarPo Yen Chen <PoYen.Chen@amd.com>
      Co-authored-by: default avatarcarlushuang <carlus.huang@amd.com>
      
      ---------
      Signed-off-by: default avatardependabot[bot] <support@github.com>
      Co-authored-by: default avatarillsilin <Illia.Silin@amd.com>
      Co-authored-by: default avatarIllia Silin <98187287+illsilin@users.noreply.github.com>
      Co-authored-by: default avatarJing Zhang <jizha@amd.com>
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      Co-authored-by: default avatardependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
      Co-authored-by: default avatarPo-Yen, Chen <PoYen.Chen@amd.com>
      Co-authored-by: default avatarrocking <ChunYu.Lai@amd.com>
      db376dd8
  18. 14 Apr, 2024 1 commit
    • Haocong WANG's avatar
      [GEMM] Gemm universal device operation (#1154) · f83e9701
      Haocong WANG authored
      
      
      * Optimize GEMM on MI200/300:
      1. Add new blockwise gemm pipeline
      2. Add irregular splitk intances
      
      * clang format + typo fix
      
      * Fix a bug
      
      * initial commit
      
      * Add more instances to irregular splitk
      
      * blkgemm pipeline v1~4 prototype
      
      * Sanity Checked. Known issue:
      1. Poor performance of splitk
      2. Register spill on blkgemmpipeline v3
      
      * Sanity and Performance fix:
      1. fix a bug related to sanity in grouped b2c mapping
      2. fix a bug related to sanity and performance in splitk offset
      
      * Sanity and API update:
      1. Remove prefetch stage
      2. Fix valid check bug
      3, Add first gemm_universal instance into ckProfiler
      
      * Add NN instances for gemm universal
      
      * 1. Add NT instances for gemm_universal
      2. Fix a bug about Kpadding in gemm_universal
      
      * Fix a bug regarding padding Odd K number
      
      * remove kernel print
      
      * Fix KPadding bug...
      
      * Update safety check
      
      * another try to fix kpadding..
      
      * Sanity checked
      
      * new instances..
      
      * clang format+typo fix
      
      * remove clang format script's change
      
      * Add non-hotloop compile option
      
      * 1. Add fp16xfp8 example
      2. pull packed convert f8 from pr1150
      
      * Some miscs.. opt and fix
      
      * Add pipeline description docs
      
      * Split universal gemm instance library to cut profiler compiling time
      
      * uncomment cmakefile
      
      * Fix a bug caused by blockwise_gemm_pipe_v2
      
      * reduce default splitk to 1
      
      * Add 224x256x64 tile size
      
      * update, including:
      1. Experiment pipeline 5~7
      2. Optimization for pipeline 4
      3. Organized instance library
      
      * temp save
      
      * temp save
      
      * Permuted lds layout, sanity and function checked
      
      * clang format
      
      * Move OOB check from RunRead to RunWrite, for better software pipeline.
      TODO: agpr spill when NN layout
      
      * clangformat
      
      * A/B splitpipe scheduler for v3
      
      * Fix two bugs
      
      * bug fix
      
      * fix a bug in oob check
      
      * Example for mixed fp16_fp8 gemm
      
      * Clean experimental code blocks
      
      * Add mixed precision gemm into profiler
      
      * tempsave
      
      * optimize m/n major lds layout
      
      * Add RRR GEMM  mixed precision instances
      
      * Optimize f8 matrix transpose
      
      * Add test_gemm_universal
      
      * A/B spilt schedule for blkpip v5
      
      * Take ds_read2 into iglp scheduling scheme
      
      * format
      
      * fixed cmake
      
      * Add llvm-option into CI cmake flag
      
      ---------
      Co-authored-by: default avatarJing Zhang <jizhan@amd.com>
      f83e9701
  19. 11 Apr, 2024 1 commit
  20. 04 Apr, 2024 1 commit
  21. 03 Apr, 2024 1 commit
  22. 02 Apr, 2024 3 commits
    • Bartłomiej Kocot's avatar
      Introduce combined elementwise ops (#1217) · 9a194837
      Bartłomiej Kocot authored
      * Introduce combined elementwise ops
      
      * Introduce refrence elementwise
      9a194837
    • Illia Silin's avatar
      Split the instances by architecture. (#1223) · ae57e593
      Illia Silin authored
      * parse examples inside the add_example_executable function
      
      * fix the example 64 cmake file
      
      * add xdl flag to the gemm_bias_softmax_gemm_permute example
      
      * add filtering of tests based on architecture type
      
      * enable test_grouped_gemm for gfx9 only
      
      * enable test_transpose only for gfx9
      
      * only linnk test_transpose if it gets built
      
      * split the gemm instances by architectures
      
      * split gemm_bilinear,grouped_conv_bwd_weight instances by targets
      
      * split instances by architecture
      
      * split grouped_conv instances by architecture
      
      * fix clang format
      
      * fix the if-else logic in group_conv headers
      
      * small fix for grouped convolution instances
      
      * fix the grouped conv bwd weight dl instances
      
      * fix client examples
      
      * only enable client examples 3 and 4 on gfx9
      
      * set the gfx9 macro
      
      * make sure the architecture macros are set by cmake
      
      * use separate set of xdl/wmma flags for host code
      
      * sinmplify the main cmake file
      
      * add conv_fwd_bf8 instance declaration
      ae57e593
    • zjing14's avatar
      improved zeroing (#1221) · 303d4594
      zjing14 authored
      303d4594
  23. 22 Mar, 2024 1 commit
  24. 13 Mar, 2024 1 commit
  25. 09 Mar, 2024 2 commits
    • Bartłomiej Kocot's avatar
      Fix warnings during wrapper docs generation (#1192) · 42fc8edd
      Bartłomiej Kocot authored
      * Fix warnings during wrapper docs generation
      
      * Fixes
      42fc8edd
    • zjing14's avatar
      Navi3 rel (#1176) · 1837040a
      zjing14 authored
      
      
      * wmma_op + unit test
      
      * add arch limitation to wmma test
      
      * change arch limitation
      
      * Refactor + Add all type unit test(int4 compile failed)
      
      * Add f32_16x16x16_bf16 unit test
      
      * tempsave
      
      * tempsave
      
      * tempsave
      
      * runtime bug, cannot find symbol
      
      * workaround for incorrect HIP warpSize return value
      
      * debugging
      
      * tempsave
      
      * Correctness OK, waiting for optimization
      
      * Tidy up + format
      
      * temp save
      
      * temp save, reproduce the v_bfi_b32 issue
      
      * add inline asm for wmmaop test
      
      * tidy up
      
      * clean some debug purpose code
      
      * discard some codes
      
      * clang format
      
      * clang format
      
      * compiler issue fixed + increase tile size
      
      * navi3x_multipleD+example
      
      * temp save
      
      * workable
      
      * batchedgemm[OK], groupconv[debug]
      
      * groupconv: Sanity check[OK], Performance[Bad]
      
      * navi3x_groupconv_need_optimization
      
      * create necessary files
      
      * save progress
      
      * Add Inter-Row thread transfer
      
      * save progress
      
      * save debugging progress
      
      * sanity check pass
      
      * fix a host tensor bug and clean up flash-attn code
      
      * format
      
      * cancel unnecessary change
      
      * cancel unnecessary change
      
      * cancel unnecessary change
      
      * temp save, add asm backend flag to amd_wmma
      
      * Mat-A LDS Bypass sanity pass
      
      * temp save
      
      * gemm sanity fix
      
      * Porting new blockwise gemm to flash attention
      
      * Example branch provide to compiler team
      
      * tempsave
      
      * Fix a bug
      
      * batched gemm ported
      
      * conv A-skip lds ported
      
      * Skip B-Lds real gemm
      
      * Skip B Lds Gemm + MulD
      
      * batched gemm, conv, skip b lds
      
      * format
      
      * Attn, skip b lds
      
      * Change GridwiseOp nam
      
      * fix a typo caused bug
      
      * Skip A_Lds sanity pass, Skip B_Lds scratch occured
      
      * Bug found, intra-row permute off caused
      
      * bug found
      
      * a fix
      
      * disable buffer load due to incorrect 3rd dword
      
      * update fmha config, no scratch generated
      
      * update 3rd dword
      
      * fmha config update
      
      * FMHA, add support to gfx1101/gfx1102
      
      * Merge origin dev (#2)
      
      * [Navi3x] Fix Gridwise_multiple_d operation (#649)
      
      * Add CMake Option "USE_OPT_NAVI3X"
      
      * fix bug
      
      * standardize docs (#655)
      
      * Separate bibtex requirement from rocm-docs-core (#656)
      
      * separate bibtex requirement from rocm-docs-core
      
      * point requirements to source rocm-docs-core repo
      
      * Add CMake Option "USE_OPT_NAVI3X" (#647)
      
      * Add CMake Option "USE_OPT_NAVI3X"
      
      * remove navi3x opt compile option from cmake script
      
      * Conv + quantization + tanh  (#645)
      
      * Rename file. Prepare to support another activation
      
      * Add comment for quantization
      
      * Extract out_elementop
      
      * Add tanh example
      
      * Add conv + bias + tanh quantization instance
      
      * Add missing parameter
      
      * Refine cmake
      
      * Add external api and client example
      
      * Extract variable in example
      
      * Fix the comment
      
      ---------
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      
      * Add a denorm test fix (#603)
      
      * Add type_convert implementations for bf16
      
      * Add the fix for conv_fwd
      
      * Add the fix for conv_bwd_data
      
      * Add the fix for conv_bwd_weight
      
      * Format
      
      * Format
      
      * Another format
      
      * Add a macro to use workaround on MI200 only
      
      * Format
      
      ---------
      Co-authored-by: default avatarRosty Geyyer <rosty.geyyer@amd.com>
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      
      * simplify karg in device/grid of split-k op (#644)
      
      * simplify karg in device/grid split-k op
      
      * fix mk_kn_mn instances
      
      * add more instances
      
      * use name from tensor layout
      
      * fix 3rd dword of buffer source descriptor (#659)
      
      * add fp64 instances (#658)
      Co-authored-by: default avatarroot <root@ctr-ubbsmc15.amd.com>
      
      * Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665)
      
      This reverts commit bb5530af
      
      .
      
      * Groupnorm + swish external api (#668)
      
      * Rename to proper naming
      
      * Add example of groupnorm + swish
      
      * Extract duplicate code in example
      
      * Add groupnorm + swish instances
      
      * Ractor instance generation, split into multiple cpp file
      
      * Add external api and client example
      
      * Refine profiler message
      
      * Use ck math version of exp
      
      * Refine problem size in example
      
      * Add host version of exp
      
      * add a marco to turn on/off denorm fix (off by default) (#673)
      
      * add a marco to turn off denorm fix by default
      
      * expose the marco
      
      ---------
      Co-authored-by: default avatarroot <root@ctr-ubbsmc15.amd.com>
      
      * fixed quant example (#672)
      Co-authored-by: default avatarroot <root@ctr-ubbsmc15.amd.com>
      
      * Add dependabot config and pin rocm-docs-core (#663)
      
      * [gtest] suppress unsafe buffer warn (#670)
      
      ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912
      
      
      
      * Add memory index guard in wmma device ops (#667)
      
      * Add more macros to turn on/off denorm fix (#678)
      Co-authored-by: default avatarRosty Geyyer <rosty.geyyer@amd.com>
      
      * Fix a typo (#676)
      
      * Add (#677)
      
      * Allow using ROCm release candidate compilers. (#679)
      
      * enable use of rocm5.5 release candidate 4
      
      * upgrade to ROCM5.5 RC5
      
      * try fix the PUB_KEY error, remove the cmake-data package
      
      * upgrade to latest cmake version
      
      * use private dockerhub repo for rocm5.5 rc5
      
      * add missing bracket
      
      * add vector load check
      
      * solve conflicts
      
      ---------
      Co-authored-by: default avatarSam Wu <sjwu@ualberta.ca>
      Co-authored-by: default avatarSam Wu <sam.wu2@amd.com>
      Co-authored-by: default avatarrocking5566 <ChunYu.Lai@amd.com>
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      Co-authored-by: default avatarRostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
      Co-authored-by: default avatarRosty Geyyer <rosty.geyyer@amd.com>
      Co-authored-by: default avatarcarlushuang <carlus.huang@amd.com>
      Co-authored-by: default avatarroot <root@ctr-ubbsmc15.amd.com>
      Co-authored-by: default avatarJun Liu <Liu.Jun@amd.com>
      Co-authored-by: default avatarIllia Silin <98187287+illsilin@users.noreply.github.com>
      
      * Disable SkipLDS & Align AIT api (#3)
      
      * fix layernorm, reduction Ops (#4)
      
      * [Navi3x] Fix Gridwise_multiple_d operation (#649)
      
      * Add CMake Option "USE_OPT_NAVI3X"
      
      * fix bug
      
      * standardize docs (#655)
      
      * Separate bibtex requirement from rocm-docs-core (#656)
      
      * separate bibtex requirement from rocm-docs-core
      
      * point requirements to source rocm-docs-core repo
      
      * Add CMake Option "USE_OPT_NAVI3X" (#647)
      
      * Add CMake Option "USE_OPT_NAVI3X"
      
      * remove navi3x opt compile option from cmake script
      
      * Conv + quantization + tanh  (#645)
      
      * Rename file. Prepare to support another activation
      
      * Add comment for quantization
      
      * Extract out_elementop
      
      * Add tanh example
      
      * Add conv + bias + tanh quantization instance
      
      * Add missing parameter
      
      * Refine cmake
      
      * Add external api and client example
      
      * Extract variable in example
      
      * Fix the comment
      
      ---------
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      
      * Add a denorm test fix (#603)
      
      * Add type_convert implementations for bf16
      
      * Add the fix for conv_fwd
      
      * Add the fix for conv_bwd_data
      
      * Add the fix for conv_bwd_weight
      
      * Format
      
      * Format
      
      * Another format
      
      * Add a macro to use workaround on MI200 only
      
      * Format
      
      ---------
      Co-authored-by: default avatarRosty Geyyer <rosty.geyyer@amd.com>
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      
      * simplify karg in device/grid of split-k op (#644)
      
      * simplify karg in device/grid split-k op
      
      * fix mk_kn_mn instances
      
      * add more instances
      
      * use name from tensor layout
      
      * fix 3rd dword of buffer source descriptor (#659)
      
      * add fp64 instances (#658)
      Co-authored-by: default avatarroot <root@ctr-ubbsmc15.amd.com>
      
      * Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665)
      
      This reverts commit bb5530af
      
      .
      
      * Groupnorm + swish external api (#668)
      
      * Rename to proper naming
      
      * Add example of groupnorm + swish
      
      * Extract duplicate code in example
      
      * Add groupnorm + swish instances
      
      * Ractor instance generation, split into multiple cpp file
      
      * Add external api and client example
      
      * Refine profiler message
      
      * Use ck math version of exp
      
      * Refine problem size in example
      
      * Add host version of exp
      
      * add a marco to turn on/off denorm fix (off by default) (#673)
      
      * add a marco to turn off denorm fix by default
      
      * expose the marco
      
      ---------
      Co-authored-by: default avatarroot <root@ctr-ubbsmc15.amd.com>
      
      * fixed quant example (#672)
      Co-authored-by: default avatarroot <root@ctr-ubbsmc15.amd.com>
      
      * Add dependabot config and pin rocm-docs-core (#663)
      
      * [gtest] suppress unsafe buffer warn (#670)
      
      ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912
      
      
      
      * Add memory index guard in wmma device ops (#667)
      
      * Add more macros to turn on/off denorm fix (#678)
      Co-authored-by: default avatarRosty Geyyer <rosty.geyyer@amd.com>
      
      * Fix a typo (#676)
      
      * Add (#677)
      
      * Allow using ROCm release candidate compilers. (#679)
      
      * enable use of rocm5.5 release candidate 4
      
      * upgrade to ROCM5.5 RC5
      
      * try fix the PUB_KEY error, remove the cmake-data package
      
      * upgrade to latest cmake version
      
      * use private dockerhub repo for rocm5.5 rc5
      
      * add missing bracket
      
      * Disable SkipLDS & Align AIT api
      
      * Update dependabot config (#682)
      Co-authored-by: default avatarsamjwu <samjwu@users.noreply.github.com>
      
      * update attn api
      
      * solve type_convert bug + enable
      
      ---------
      Co-authored-by: default avatarSam Wu <sjwu@ualberta.ca>
      Co-authored-by: default avatarSam Wu <sam.wu2@amd.com>
      Co-authored-by: default avatarrocking5566 <ChunYu.Lai@amd.com>
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      Co-authored-by: default avatarRostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
      Co-authored-by: default avatarRosty Geyyer <rosty.geyyer@amd.com>
      Co-authored-by: default avatarcarlushuang <carlus.huang@amd.com>
      Co-authored-by: default avatarroot <root@ctr-ubbsmc15.amd.com>
      Co-authored-by: default avatarJun Liu <Liu.Jun@amd.com>
      Co-authored-by: default avatarIllia Silin <98187287+illsilin@users.noreply.github.com>
      Co-authored-by: default avatarsamjwu <samjwu@users.noreply.github.com>
      Co-authored-by: default avatarhaocwang <Haocong.WANG@amd.com>
      
      * fix typo
      
      * Fix attention with causal mask
      
      * multiple fix, try ait compile
      
      * Add A/B not use LDS pipeline
      
      * Clang format, Add gfx1101, gfx1102 support of FMHA example
      
      * cancel change of format script
      
      * 1. Enable 2-stage global Prefetch ( May cause VGPR spilling)
      2. Enable FP16 accumulator blockwise_gemm
      
      * clang-format
      
      * 1. change blockwise gemm loopover direction from kmn to mnk ( ~1% improvement)
      2. change kernel timing mode to 50 warmup + 50 timed repeat
      
      * Update low level abstration of blockwise gemm wmma
      
      * (2/5) bilinear gemm pass, perf bug: skip a lds has lower performance than skip b lds
      
      * (3/5) batched gemm pass, perf bug: skip a lds has lower performance than skip b lds
      
      * (4/5) grouped conv pass
      
      * (5/5) attention pass, todo: debug lds perf bug
      
      * AIT Attention API refactor (#8)
      
      * sanity pass
      
      * sanity pass 2
      
      * confirm significant performance regression.
      
      * turn on all instances
      
      * turn off instance format
      
      * Fix bug & tunning & format
      
      * DML meta, self_attn+cross_attn
      
      * sanity pass
      
      * remove useless flag
      
      * update tile and problem size used in AIT attention
      
      * bug fix in grouped conv supporting check
      
      * deprecate inline asm wmma
      
      * Bug fix: double lds skip
      
      * clang-format
      
      * Fix errors in
      1. example, fmha
      2. gridwise pipeline
      3. deviceop, fmha, change some containers from vector to array
      
      * part2 of previous commit
      
      * clang format
      
      * API fix of gridwisegemmpipeline
      
      * separate array base and vector base attention tensor transformation
      
      * fix gemm
      
      * clang format
      
      * add gemm fp16 instances
      
      * Temp save
      
      * fpAintB kernel compile pass
      
      * Sanity pass.
      
      * Temp save
      
      * debug code enabled
      
      * Fp16AInt8B_GEMM sanity
      
      * MQA implementation
      
      * GQA-4 example
      
      * tempsave
      
      * Compile pass
      
      * New implementation of fp16Aint8B Gemm, Acheieve similar math throughput with native fp16 Gemm
      
      * format
      
      * Todo: fix gemm_bilinear_wmma instances compilation bug
      
      * Solve a bug when K1=16
      
      * remove unnecessary changes
      
      * Remove tensor layout limitation to LDS usage in tesnor contraction
      
      * update self-attention and cross-attention
      
      * fix a typo of name
      
      * Add arch limiter for fp8 gemm
      
      * enable fp8 gemm_xdl for all gfx9 targets
      
      * temporarily disable gemm_xdl_fp16_fp8 on MI100/200
      
      * fix the cmake logic for gemm_xdl_fp16_fp8
      
      * re-enable the gemm_xdl_fp16_fp8 on MI100/200
      
      ---------
      Co-authored-by: default avataraska-0096 <haocwang@amd.com>
      Co-authored-by: default avatarSam Wu <sjwu@ualberta.ca>
      Co-authored-by: default avatarSam Wu <sam.wu2@amd.com>
      Co-authored-by: default avatarrocking5566 <ChunYu.Lai@amd.com>
      Co-authored-by: default avatarRostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
      Co-authored-by: default avatarRosty Geyyer <rosty.geyyer@amd.com>
      Co-authored-by: default avatarcarlushuang <carlus.huang@amd.com>
      Co-authored-by: default avatarroot <root@ctr-ubbsmc15.amd.com>
      Co-authored-by: default avatarJun Liu <Liu.Jun@amd.com>
      Co-authored-by: default avatarIllia Silin <98187287+illsilin@users.noreply.github.com>
      Co-authored-by: default avatarsamjwu <samjwu@users.noreply.github.com>
      Co-authored-by: default avatarhaocwang <Haocong.WANG@amd.com>
      Co-authored-by: default avatarillsilin <Illia.Silin@amd.com>
      1837040a
  26. 06 Mar, 2024 1 commit
    • Paul Fultz II's avatar
      Add host lib (#1134) · 8eff4d62
      Paul Fultz II authored
      
      
      * Format
      
      * Format
      
      * Format
      
      * Remove const
      
      * Use the right template
      
      * Format
      
      * Format
      
      * add row/col instances
      
      * Add missing file
      
      * fixed
      
      * Format
      
      * Updates
      
      * Format
      
      * fixed rrr layout
      
      * Format
      
      * Update test and embed modules
      
      * Restore older version
      
      * Update year
      
      * Set -fPIC
      
      * Format
      
      * Use double for isnan
      
      * rename host folder to codegen + minor fix
      
      * add codegen CI test
      
      * add option to build components without building CK
      
      * fix the groovy syntax
      
      * fix typo
      
      * use the correct function for the codegen stage
      
      ---------
      Co-authored-by: default avatarJing Zhang <jizha@amd.com>
      Co-authored-by: default avatarIllia Silin <98187287+illsilin@users.noreply.github.com>
      Co-authored-by: default avatarillsilin <Illia.Silin@amd.com>
      8eff4d62