1. 25 Apr, 2024 1 commit
    • Universal gemm flush cache (#1251) · f448d179
      ltqin authored
      
      
      * add flush cache to device op
      
      * add flush cache parameter to ckProfiler
      
      * change calculate size a and b method
      
      * change evaluation time method from AVERAGE to MEDIAN
      
      * format code
      
      * adjust some code
      
      * fix core dumped
      
      * remove in-kernel loop call to flush icache

      * remove outer-loop call to flush icache
      
      ---------
      Co-authored-by: letaoqin <letaoqin@amd.com>
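      The switch from AVERAGE to MEDIAN above makes the reported kernel time
      robust to outliers (e.g. the first iterations after an icache flush).
      A minimal sketch of median timing, assuming a plain vector of
      per-iteration times rather than the actual ckProfiler interfaces:

      ```cpp
      #include <algorithm>
      #include <vector>

      // Median of per-iteration kernel times: unlike the mean, a few slow
      // cold-cache iterations do not skew the reported figure.
      float median_time_ms(std::vector<float> times)
      {
          std::sort(times.begin(), times.end());
          const auto n = times.size();
          return (n % 2 == 1) ? times[n / 2]
                              : 0.5f * (times[n / 2 - 1] + times[n / 2]);
      }
      ```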
  2. 23 Apr, 2024 1 commit
  3. 22 Apr, 2024 1 commit
  4. 19 Apr, 2024 2 commits
  5. 18 Apr, 2024 3 commits
  6. 16 Apr, 2024 3 commits
    • docs: fix broken contributing link (#1244) · 501a6b68
      peter authored
    • Added Multi_ABD support into Gemm and GroupedGemmFixedNK (#978) · 12865fbf
      zjing14 authored
      
      
      * added an example grouped_gemm_multi_abd
      
      * fixed ci
      
      * add setElementwiseOp
      
      * changed API
      
      * clean code: add multiA into example
      
      * fixed v7r2 copy
      
      * add transpose
      
      * clean
      
      * fixed vector_load check
      
      * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

      * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

      * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

      * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

      * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * add reduce
      
      * testing
      
      * add example_b16_i8
      
      * refactor example
      
      * clean
      
      * add MPadding
      
      * disable reduce for kbatch = 1
      
      * separate reduce device op
      
      * add reduce op
      
      * add guard for workspace_size
      
      * add instances
      
      * format
      
      * fixed
      
      * add client example
      
      * add a colmajor
      
      * add instances
      
      * Update cmake-ck-dev.sh
      
      * Update profile_gemm_splitk.cpp
      
      * Update gridwise_gemm_xdlops_v2r4r2.hpp
      
      * format
      
      * Update profile_gemm_splitk.cpp
      
      * fixed
      
      * fixed
      
      * adjust test
      
      * adjust precision loss
      
      * adjust test
      
      * fixed
      
      * add bf16_i8 scale bias
      
      * fixed scale
      
      * fixed scale elementwise_op
      
      * revert contraction deviceop changes
      
      * fixed
      
      * Add AddFastGelu
      
      * Revert "Merge branch 'jizhan/gemm_splitk_reduce' into grouped_gemm_multi_abd_fixed_nk_example"
      
      This reverts commit 3b5d001efd74335b38dcb7d8c8877580b49d23a4, reversing
      changes made to 943199a99191661c5597c51ca8371a90bf57837e.
      
      * add Scales into elementwise
      
      * add gemm_multi_abd client example
      
      * add client examples
      
      * add rcr and crr
      
      * add grouped gemm client example
      
      * add grouped gemm client example
      
      * add instance for rcr crr
      
      * format
      
      * fixed
      
      * fixed cmake
      
      * fixed
      
      * fixed client_example
      
      * format
      
      * fixed contraction isSupport
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update device_reduce_threadwise.hpp
      
      * clean
      
      * Fixes
      
      * Fix example
      
      ---------
      Co-authored-by: Jing Zhang <jizha@amd.com>
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
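      The Multi_ABD extension above generalizes GEMM so that A, B, and D are
      tuples of tensors, fused by elementwise operators before the multiply
      (A/B) and after accumulation (C/D/E). A host-side sketch of the
      semantics for two A and two D tensors; a_op and cde_op are hypothetical
      stand-ins (scale and bias-add) for CK's elementwise functors:

      ```cpp
      #include <vector>

      // Reference semantics only; the CK device op consumes packed tuples of
      // device pointers instead of individual vectors.
      void gemm_multi_abd_ref(int M, int N, int K,
                              const std::vector<float>& a0, const std::vector<float>& a1,
                              const std::vector<float>& b,
                              const std::vector<float>& d0, const std::vector<float>& d1,
                              std::vector<float>& e)
      {
          auto a_op   = [](float x0, float x1) { return x0 * x1; }; // e.g. elementwise scale of A
          auto cde_op = [](float acc, float y0, float y1) { return acc + y0 + y1; }; // e.g. bias adds
          for(int m = 0; m < M; ++m)
              for(int n = 0; n < N; ++n)
              {
                  float acc = 0.f;
                  for(int k = 0; k < K; ++k)
                      acc += a_op(a0[m * K + k], a1[m * K + k]) * b[k * N + n];
                  e[m * N + n] = cde_op(acc, d0[m * N + n], d1[m * N + n]);
              }
      }
      ```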
    • introducing ck_tile! (#1216) · db376dd8
      carlushuang authored
      * enable gfx940
      
      * switch between intrinsic mfma routines on mi100/200 and mi300
      
      * fix mfma_int8 on MI300
      
      * disable 2 int8 examples on MI300
      
      * Update cmake-ck-dev.sh
      
      * restore gitignore file
      
      * modify Jenkinsfile to the internal repo
      
      * Bump rocm-docs-core from 0.24.0 to 0.29.0 in /docs/sphinx
      
      Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.0 to 0.29.0.
      - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
      - [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
      - [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.0...v0.29.0)
      
      ---
      updated-dependencies:
      - dependency-name: rocm-docs-core
        dependency-type: direct:production
        update-type: version-update:semver-minor
      ...
      Signed-off-by: dependabot[bot] <support@github.com>
      
      * initial enablement of gfx950
      
      * fix clang format
      
      * disable examples 31 and 41 int8 on gfx950
      
      * add code
      
      * fix build wip
      
      * fix xx
      
      * now can build
      
      * naming
      
      * minor fix
      
      * wip fix
      
      * fix macro for exp2; fix warpgemm a/b in transposedC
      
      * unify as tuple_array
      
      * Update the required Python version to 3.9
      
      * Update executable name in test scripts
      
      * re-structure tuple/array to avoid spill
      
      * Merge function templates
      
      * Fix format
      
      * Add constraint to array<> ctor
      
      * Re-use function
      
      * Some minor changes
      
      * remove wrong code in store_raw()
      
      * fix compile issue in transpose
      
      * Rename enum
      Rename 'cood_transform_enum' to 'coord_transform_enum'
      
      * convert more integral_constant -> constant, and fix formatting
      
      * make sure thread_buffer can be tuple/array
      
      * temp fix buffer_store spill
      
      * do not use custom data type by default; now we can generate ISA-level identical code to opt_padding
      
      * fix compile error, fp8 not ready now
      
      * fix fp8 duplicated move/shift/and/or problem
      
      * Default use CK_TILE_FLOAT_TO_FP8_STOCHASTIC rounding mode
      
      * fix scratch in fp8 kernel
      
      * update some readme
      
      * fix merge from upstream
      
      * sync with upstream
      
      * sync upstream again
      
      * sync 22
      
      * remove unused
      
      * fix clang-format
      
      * update README of ck_tile example
      
      * fix several issues
      
      * set the minimal Python version to 3.8
      
      * remove ck_tile example from default cmake target like all/install/check
      
      * remove mistake
      
      * 1) support recipe in generate.py 2) use simplified mask type 3) change left/right to pass into karg
      
      * fix some bugs in group-mode masking and codegen; update README
      
      * F8 quantization for FMHA forward (#1224)
      
      * Add SAccElementFunction, PComputeElementFunction, OAccElementFunction in pipeline
      
      * Add element function to fmha api
      
      * Adjust P elementwise function
      
      * Fix bug of elementwise op, our elementwise op is not inout
      
      * Add some elementwise op, prepare to quantization
      
      * Let generate.py generate different elementwise functions

      * To prevent compiler issues, remove the elementwise functions we have not used.
      
      * Remove f8 pipeline, we should share the same pipeline even in f8
      
      * Remove remove_cvref_t
      
      * Avoid warning
      
      * Fix wrong fp8 QK/KV block gemm setting
      
      * Check fp8 rounding error in check_err()
      
      * Set fp8 rounding error for check_err()
      
      * Use CK_TILE_FLOAT_TO_FP8_STANDARD as default fp8 rounding mode
      
      * 1. codegen the f8 api and kernel
      2. f8 host code
      
      * prevent warning in filter mode
      
      * Remove not-in-use elementwise function kargs
      
      * Remove more not-in-use elementwise function kargs
      
      * Small refinements in C++ source files
      
      * Use conditional_t<> to simplify code
      
      * Support heterogeneous argument for binary function types
      
      * Re-use already-existing scales<> functor template
      
      * Fix wrong value produced by saturating
      
      * Generalize the composes<> template
      
      * Unify saturates<> implementation
      
      * Fix type errors in composes<>
      
      * Extend less_equal<>
      
      * Reuse the existing template less_equal<> in check_err()
      
      * Add equal<float> & equal<double>
      
      * Rename check_err() parameter
      
      * Rename check_err() parameter
      
      * Add FIXME comment for adding new macro in future
      
      * Remove unnecessary cast to void
      
      * Eliminate duplicated code
      
      * Avoid dividing api pool into more than 2 groups
      
      * Use clearer variable names
      
      * Use affirmative condition in if stmt
      
      * Remove blank lines
      
      * Do not use perfect forwarding in composes<>
      
      * To fix compile error, revert generate.py back to 4439cc107dd90302d68a6494bdd33113318709f8
      
      * Fix bug of p element function
      
      * Add compute element op to host softmax
      
      * Remove element function in api interface
      
      * Extract user parameter
      
      * Rename pscale and oscale variable
      
      * rename f8 to fp8
      
      * rename more f8 to fp8
      
      * Add pipeline::operator() without element_functor
      
      * 1. Remove deprecated pipeline enum
      2. Refine host code parameter
      
      * Use quantization range as input
      
      * 1. Rename max_dtype to dtype_max.
      2. Rename scale to scale_s.
      3. Add init description.
      
      * Refine description
      
      * prevent early return
      
      * unify _squant kernel name in cpp, update README
      
      * Adjust the default range.
      
      * Refine error message and bias range
      
      * Add fp8 benchmark and smoke test
      
      * fix fp8 swizzle_factor=4 case
      
      ---------
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
      Co-authored-by: carlushuang <carlus.huang@amd.com>
      
      ---------
      Signed-off-by: dependabot[bot] <support@github.com>
      Co-authored-by: illsilin <Illia.Silin@amd.com>
      Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
      Co-authored-by: Jing Zhang <jizha@amd.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
      Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
      Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
      Co-authored-by: rocking <ChunYu.Lai@amd.com>
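      Within the fp8 FMHA work above, quantization maps a user-supplied value
      range onto the fp8 representable range ("Use quantization range as
      input", "Rename scale to scale_s", "Rename max_dtype to dtype_max").
      A hedged sketch of that scaling, assuming e4m3 fp8 (max finite value
      448) and illustrative names; the actual ck_tile _squant kernels differ
      in detail:

      ```cpp
      #include <algorithm>

      constexpr float fp8_e4m3_max = 448.f; // "dtype_max" in the commit's terms

      // Values expected in [-quant_range, quant_range] are scaled to span
      // the fp8 range; the inverse scale is applied downstream.
      float make_scale(float quant_range) { return fp8_e4m3_max / quant_range; }

      float quantize(float x, float scale)
      {
          // saturate before the (device-side) cast to fp8
          return std::clamp(x * scale, -fp8_e4m3_max, fp8_e4m3_max);
      }
      ```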
  7. 15 Apr, 2024 1 commit
  8. 14 Apr, 2024 1 commit
    • [GEMM] Gemm universal device operation (#1154) · f83e9701
      Haocong WANG authored
      
      
      * Optimize GEMM on MI200/300:
      1. Add new blockwise gemm pipeline
      2. Add irregular splitk instances
      
      * clang format + typo fix
      
      * Fix a bug
      
      * initial commit
      
      * Add more instances to irregular splitk
      
      * blkgemm pipeline v1~4 prototype
      
      * Sanity checked. Known issues:
      1. Poor performance of splitk
      2. Register spill on blkgemmpipeline v3
      
      * Sanity and Performance fix:
      1. fix a bug related to sanity in grouped b2c mapping
      2. fix a bug related to sanity and performance in splitk offset
      
      * Sanity and API update:
      1. Remove prefetch stage
      2. Fix valid check bug
      3. Add first gemm_universal instance into ckProfiler
      
      * Add NN instances for gemm universal
      
      * 1. Add NT instances for gemm_universal
      2. Fix a bug about Kpadding in gemm_universal
      
      * Fix a bug regarding padding with odd K numbers
      
      * remove kernel print
      
      * Fix KPadding bug...
      
      * Update safety check
      
      * another try to fix kpadding..
      
      * Sanity checked
      
      * new instances..
      
      * clang format+typo fix
      
      * remove clang format script's change
      
      * Add non-hotloop compile option
      
      * 1. Add fp16xfp8 example
      2. pull packed convert f8 from pr1150
      
      * Some misc optimizations and fixes
      
      * Add pipeline description docs
      
      * Split universal gemm instance library to cut profiler compiling time
      
      * uncomment cmakefile
      
      * Fix a bug caused by blockwise_gemm_pipe_v2
      
      * reduce default splitk to 1
      
      * Add 224x256x64 tile size
      
      * update, including:
      1. Experiment pipeline 5~7
      2. Optimization for pipeline 4
      3. Organized instance library
      
      * temp save
      
      * temp save
      
      * Permuted lds layout, sanity and function checked
      
      * clang format
      
      * Move OOB check from RunRead to RunWrite, for better software pipeline.
      TODO: agpr spill when NN layout
      
      * clangformat
      
      * A/B splitpipe scheduler for v3
      
      * Fix two bugs
      
      * bug fix
      
      * fix a bug in oob check
      
      * Example for mixed fp16_fp8 gemm
      
      * Clean experimental code blocks
      
      * Add mixed precision gemm into profiler
      
      * tempsave
      
      * optimize m/n major lds layout
      
      * Add RRR GEMM mixed precision instances
      
      * Optimize f8 matrix transpose
      
      * Add test_gemm_universal
      
      * A/B split schedule for blkpip v5
      
      * Take ds_read2 into iglp scheduling scheme
      
      * format
      
      * fixed cmake
      
      * Add llvm-option into CI cmake flag
      
      ---------
      Co-authored-by: Jing Zhang <jizhan@amd.com>
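      Several bullets above (irregular splitk instances, the splitk offset
      fix, "reduce default splitk to 1") concern split-K GEMM: the K
      dimension is divided across k_batch workgroups that each compute a
      partial product over their K slice, and the partials are then reduced.
      A host-side sketch of the decomposition, not the device kernel itself:

      ```cpp
      #include <algorithm>
      #include <vector>

      void gemm_splitk_ref(int M, int N, int K, int k_batch,
                           const std::vector<float>& a, // MxK, row-major
                           const std::vector<float>& b, // KxN, row-major
                           std::vector<float>& c)       // MxN, row-major
      {
          const int k_per_batch = (K + k_batch - 1) / k_batch;
          std::fill(c.begin(), c.end(), 0.f);
          for(int kb = 0; kb < k_batch; ++kb) // one "workgroup" per K slice
          {
              const int k0 = kb * k_per_batch;              // splitk offset of this slice
              const int k1 = std::min(K, k0 + k_per_batch);
              for(int m = 0; m < M; ++m)
                  for(int n = 0; n < N; ++n)
                  {
                      float partial = 0.f;
                      for(int k = k0; k < k1; ++k)
                          partial += a[m * K + k] * b[k * N + n];
                      c[m * N + n] += partial; // reduction over partial results
                  }
          }
      }
      ```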
  9. 12 Apr, 2024 1 commit
  10. 11 Apr, 2024 3 commits
  11. 10 Apr, 2024 1 commit
  12. 09 Apr, 2024 3 commits
  13. 04 Apr, 2024 2 commits
  14. 03 Apr, 2024 1 commit
  15. 02 Apr, 2024 3 commits
    • Introduce combined elementwise ops (#1217) · 9a194837
      Bartłomiej Kocot authored
      * Introduce combined elementwise ops
      
      * Introduce reference elementwise
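      A combined elementwise op, as introduced here, chains existing functors
      so a single fused kernel can apply both. A minimal sketch assuming CK's
      usual operator()(y, x) convention; the type names are illustrative, not
      the classes added by the PR:

      ```cpp
      struct Relu
      {
          template <typename T>
          void operator()(T& y, const T& x) const { y = x > T{0} ? x : T{0}; }
      };

      struct Scale
      {
          float alpha = 1.f;
          template <typename T>
          void operator()(T& y, const T& x) const { y = static_cast<T>(alpha * x); }
      };

      // Applies Op1 first, then Op2 to its result, in one pass.
      template <typename Op1, typename Op2>
      struct Combined
      {
          Op1 op1{};
          Op2 op2{};
          template <typename T>
          void operator()(T& y, const T& x) const
          {
              T tmp;
              op1(tmp, x);
              op2(y, tmp);
          }
      };

      // usage: Combined<Scale, Relu>{Scale{2.f}}(y, x);
      ```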
    • Split the instances by architecture. (#1223) · ae57e593
      Illia Silin authored
      * parse examples inside the add_example_executable function
      
      * fix the example 64 cmake file
      
      * add xdl flag to the gemm_bias_softmax_gemm_permute example
      
      * add filtering of tests based on architecture type
      
      * enable test_grouped_gemm for gfx9 only
      
      * enable test_transpose only for gfx9
      
      * only link test_transpose if it gets built
      
      * split the gemm instances by architectures
      
      * split gemm_bilinear,grouped_conv_bwd_weight instances by targets
      
      * split instances by architecture
      
      * split grouped_conv instances by architecture
      
      * fix clang format
      
      * fix the if-else logic in group_conv headers
      
      * small fix for grouped convolution instances
      
      * fix the grouped conv bwd weight dl instances
      
      * fix client examples
      
      * only enable client examples 3 and 4 on gfx9
      
      * set the gfx9 macro
      
      * make sure the architecture macros are set by cmake
      
      * use separate set of xdl/wmma flags for host code
      
      * simplify the main cmake file
      
      * add conv_fwd_bf8 instance declaration
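      The per-architecture split gates instance registration and tests on
      macros that CMake now defines per target ("set the gfx9 macro", "make
      sure the architecture macros are set by cmake"). A hedged sketch of
      the resulting source-level pattern, with placeholder macro names that
      may differ from the flags actually used in CK:

      ```cpp
      // Only the instance set matching the compile target gets registered,
      // so e.g. XDL (gfx9) kernels are never built for WMMA (gfx11) targets.
      void add_device_gemm_instances(/* instance list */)
      {
      #if defined(CK_USE_XDL) // hypothetical gfx9 flag: MI100/200/300 XDL instances
          // add_device_gemm_xdl_instances(instances);
      #elif defined(CK_USE_WMMA) // hypothetical gfx11 flag: WMMA instances
          // add_device_gemm_wmma_instances(instances);
      #else // other targets fall back to DL instances
          // add_device_gemm_dl_instances(instances);
      #endif
      }
      ```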
    • improved zeroing (#1221) · 303d4594
      zjing14 authored
  16. 27 Mar, 2024 1 commit
  17. 22 Mar, 2024 3 commits
  18. 21 Mar, 2024 1 commit
  19. 20 Mar, 2024 1 commit
  20. 19 Mar, 2024 1 commit
  21. 18 Mar, 2024 2 commits
    • update the changelog for ROCm6.1 release (#1205) · 9e011bcd
      Illia Silin authored
      * update the changelog for ROCm6.1 release
      
      * modify the order of items in changelog, capitalize GEMMs
    • Re-enable the performance tracking in CI. (#1203) · bdcd0374
      Illia Silin authored
      * test CK with rocm6.1 RC2
      
      * add docker credentials for pull
      
      * update the performance db name
      
      * use environment variable for db name
      
      * add rocm-llvm-dev package to ck docker
      
      * turn off verification for daily performance runs
      
      * do not stash ckProfiler on MI300 node
      
      * add processing of mixed gemms to qa, fix parsing of splitk gemm logs
      
      * fix the splitk gemm log file name
      
      * turn the timing on for splitk gemm performance
  22. 15 Mar, 2024 1 commit
  23. 13 Mar, 2024 2 commits
  24. 12 Mar, 2024 1 commit