1. 10 Jan, 2025 1 commit
    • Thomas Ning's avatar
      Ck tile/gemm perf measure (#1750) · 73a076ee
      Thomas Ning authored
      
      
      * Finished adding the performance benchmark for ck tile gemm
      
      * Fix the executable rename problem
      
      * fix the executable name error
      
      * delete the unsupported layout combinations
      
      * Update run_full_test.sh
      
      * Update benchmark_mem_pipeline.sh
      
      * Update benchmark_basic.sh
      
      * change the executable of gemm_universal
      
      * change ck_tile_gemm script permissions
      
      * Addressed the comment
      
      * Addressed the comment
      
      * Fixed the comments
      
      * Fixed Comment
      
      * roll back the malfunctioned change
      
      * Fix the Typo
      
      * finalize the tile_gemm_fp16 performance monitoring
      
      * fix the stash names for ck_tile gemm logs
      
      * change the stashing logic
      
      * change stashing syntax
      
      ---------
      Co-authored-by: default avatarIllia Silin <98187287+illsilin@users.noreply.github.com>
      Co-authored-by: default avatarillsilin <Illia.Silin@amd.com>
      73a076ee
  2. 02 Jan, 2025 2 commits
  3. 16 Dec, 2024 1 commit
  4. 06 Dec, 2024 1 commit
    • Illia Silin's avatar
      Refactor CI performance tests. (#1726) · 355893cd
      Illia Silin authored
      * merge the build and performance tests CI stages together
      
      * add gemm performance test on gfx11/gfx12
      
      * add suffices to distinguish gemm performance logs from different archs
      
      * use smaller gemm set in CI for gfx10/gfx11/gfx12
      
      * disable performance tests on gfx1030
      
      * fix the shashing logic
      
      * fix finding python3 for mha instances
      355893cd
  5. 11 Nov, 2024 1 commit
  6. 22 Oct, 2024 1 commit
  7. 08 Oct, 2024 1 commit
    • Po Yen Chen's avatar
      [CK_TILE] Update example README files & fix script compatibility issue (#1548) · 0c094daa
      Po Yen Chen authored
      * Fix text alignment of ArgParser::print()
      
      * Update example README files
      
      * Clarify make-ck-dev.sh <arch> usage
      
      * Only keep some of the argument from '-?' output
      
      * Undo command line output changes in README
      
      * Only keep existing argument on doc and update description
      
      * Fix text alignment
      
      * Make cmake-ck-*.sh compatible with 'sh' command
      0c094daa
  8. 01 Oct, 2024 1 commit
    • Po Yen Chen's avatar
      [CK_TILE] Change output accum tensor layout of fmha fwd split-kv & combine kernels (#1527) · a1c07e8d
      Po Yen Chen authored
      * Use same layout for o_acc and o tensor
      
      * Use better param names in partitioner
      
      * Remove redundant kargs 'max_seqlen_q'
      
      * Use better param names in splitkv kernel
      
      * Add comment for additional kernel arguments
      
      * Sync empty loop early return logics between pipelines
      
      * Pass more arguments to cmake in scripts
      
      * Align backslashes
      
      * Fix wrong o_acc tensor view strides
      
      * Change o_acc layout if o_perm=0
      
      * Handle whole row masked via attn_bias
      
      * Use use vector width = 1 for o_acc
      
      * Use more even split sizes
      a1c07e8d
  9. 20 Sep, 2024 1 commit
  10. 03 Sep, 2024 1 commit
  11. 20 Aug, 2024 1 commit
  12. 19 Aug, 2024 1 commit
  13. 16 Aug, 2024 1 commit
  14. 12 Aug, 2024 1 commit
  15. 07 Aug, 2024 1 commit
    • Illia Silin's avatar
      Run CK_TILE FMHA benchmarks and collect the performance data. (#1447) · 12c1f68d
      Illia Silin authored
      * run ck_tile benchmarks after the smoke tests and store logs
      
      * change the path of fmha benchmark logs
      
      * change the way of stashig ck_tile fmha logs
      
      * prevent the errors in stages where no logs are generated
      
      * fix the ck_tile fmha log names and headers
      
      * generate the fmha performance logs in the root folder
      
      * change jenkins scrip arguments format
      
      * use exact file names for stashing
      
      * modify scripts to process FMHA performance results
      
      * unstash FMHA logs before parsing them
      12c1f68d
  16. 22 Jul, 2024 1 commit
  17. 08 Jul, 2024 1 commit
  18. 06 Jul, 2024 1 commit
    • Harisankar Sadasivan's avatar
      Universal streamk with atomics (#1360) · 75e622f0
      Harisankar Sadasivan authored
      * universal streamk with atomics with ckprofiler support. grid_size and streamk strategy are tunable. grid_size of -1 leads to #WGs = maximum occupancy X num_CUs. implementation supports many different streamk policies: 1-tile, 2-tile, 3-tile and 4-tile. streamk strategy of -1 leads to default streamk policy (4-tile). 
      
      * Update README.md
      
      * fixing clang-format issues
      
      * removed conflicts in struct members between streamk and universal streamk
      
      * corrected arg parsing for streamk and universal streamk
      
      * added stream-k policies for 3 tile and 4 tile
      
      * fixed argument type issue with parsing cmd args
      
      * changes suggested in PR review are made- removing comments and correcting copyright
      
      * file permissions updated
      
      * added default value support for grid_size and streamk-policy selection set to -1
      
      * print messages for arguments
      
      * print messages for arguments
      
      * print messages for arguments1
      75e622f0
  19. 10 May, 2024 1 commit
  20. 16 Apr, 2024 1 commit
    • carlushuang's avatar
      introducing ck_tile! (#1216) · db376dd8
      carlushuang authored
      * enable gfx940
      
      * switch between intrinsic mfma routines on mi100/200 and mi300
      
      * fix mfma_int8 on MI300
      
      * disable 2 int8 examples on MI300
      
      * Update cmake-ck-dev.sh
      
      * restore gitignore file
      
      * modify Jenkinsfile to the internal repo
      
      * Bump rocm-docs-core from 0.24.0 to 0.29.0 in /docs/sphinx
      
      Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.0 to 0.29.0.
      - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
      - [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
      - [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.0...v0.29.0
      
      )
      
      ---
      updated-dependencies:
      - dependency-name: rocm-docs-core
        dependency-type: direct:production
        update-type: version-update:semver-minor
      ...
      Signed-off-by: default avatardependabot[bot] <support@github.com>
      
      * initial enablement of gfx950
      
      * fix clang format
      
      * disable examples 31 and 41 int8 on gfx950
      
      * add code
      
      * fix build wip
      
      * fix xx
      
      * now can build
      
      * naming
      
      * minor fix
      
      * wip fix
      
      * fix macro for exp2; fix warpgemm a/b in transposedC
      
      * unify as tuple_array
      
      * Update the required Python version to 3.9
      
      * Update executable name in test scripts
      
      * re-structure tuple/array to avoid spill
      
      * Merge function templates
      
      * Fix format
      
      * Add constraint to array<> ctor
      
      * Re-use function
      
      * Some minor changes
      
      * remove wrong code in store_raw()
      
      * fix compile issue in transpose
      
      * Rename enum
      Rename 'cood_transform_enum' to 'coord_transform_enum'
      
      * let more integral_constant->constant, and formating
      
      * make sure thread_buffer can be tuple/array
      
      * temp fix buffer_store spill
      
      * not using custom data type by default, now we can have ISA-level same code as opt_padding
      
      * fix compile error, fp8 not ready now
      
      * fix fp8 duplicated move/shift/and/or problem
      
      * Default use CK_TILE_FLOAT_TO_FP8_STOCHASTIC rounding mode
      
      * fix scratch in fp8 kernel
      
      * update some readme
      
      * fix merge from upstream
      
      * sync with upstream
      
      * sync upstream again
      
      * sync 22
      
      * remove unused
      
      * fix clang-format
      
      * update README of ck_tile example
      
      * fix several issue
      
      * let python version to be 3.8 as minimal
      
      * remove ck_tile example from default cmake target like all/install/check
      
      * remove mistake
      
      * 1).support receipe in generate.py 2).use simplified mask type 3).change left/right to pass into karg
      
      * fix some bug in group-mode masking and codegen. update README
      
      * F8 quantization for FMHA forward (#1224)
      
      * Add SAccElementFunction, PComputeElementFunction, OAccElementFunction in pipeline
      
      * Add element function to fmha api
      
      * Adjust P elementwise function
      
      * Fix bug of elementwise op, our elementwise op is not inout
      
      * Add some elementwise op, prepare to quantization
      
      * Let generate.py can generate different elementwise function
      
      * To prevent compiler issue, remove the elementwise function we have not used.
      
      * Remove f8 pipeline, we should share the same pipeline even in f8
      
      * Remove remove_cvref_t
      
      * Avoid warning
      
      * Fix wrong fp8 QK/KV block gemm setting
      
      * Check fp8 rounding error in check_err()
      
      * Set fp8 rounding error for check_err()
      
      * Use CK_TILE_FLOAT_TO_FP8_STANDARD as default fp8 rounding mode
      
      * 1. codgen the f8 api and kernel
      2. f8 host code
      
      * prevent warning in filter mode
      
      * Remove not-in-use elementwise function kargs
      
      * Remove more not-in-use elementwise function kargs
      
      * Small refinements in C++ source files
      
      * Use conditional_t<> to simplify code
      
      * Support heterogeneous argument for binary function types
      
      * Re-use already-existing scales<> functor template
      
      * Fix wrong value produced by saturating
      
      * Generalize the composes<> template
      
      * Unify saturates<> implementation
      
      * Fix type errors in composes<>
      
      * Extend less_equal<>
      
      * Reuse the existing template less_equal<> in check_err()
      
      * Add equal<float> & equal<double>
      
      * Rename check_err() parameter
      
      * Rename check_err() parameter
      
      * Add FIXME comment for adding new macro in future
      
      * Remove unnecessary cast to void
      
      * Eliminate duplicated code
      
      * Avoid dividing api pool into more than 2 groups
      
      * Use more clear variable names
      
      * Use affirmative condition in if stmt
      
      * Remove blank lines
      
      * Donot perfect forwarding in composes<>
      
      * To fix compile error, revert generate.py back to 4439cc107dd90302d68a6494bdd33113318709f8
      
      * Fix bug of p element function
      
      * Add compute element op to host softmax
      
      * Remove element function in api interface
      
      * Extract user parameter
      
      * Rename pscale and oscale variable
      
      * rename f8 to fp8
      
      * rename more f8 to fp8
      
      * Add pipeline::operator() without element_functor
      
      * 1. Remove deprecated pipeline enum
      2. Refine host code parameter
      
      * Use quantization range as input
      
      * 1. Rename max_dtype to dtype_max.
      2. Rename scale to scale_s
      3.Add init description
      
      * Refine description
      
      * prevent early return
      
      * unify _squant kernel name in cpp, update README
      
      * Adjust the default range.
      
      * Refine error message and bias range
      
      * Add fp8 benchmark and smoke test
      
      * fix fp8 swizzle_factor=4 case
      
      ---------
      Co-authored-by: default avatarPo Yen Chen <PoYen.Chen@amd.com>
      Co-authored-by: default avatarcarlushuang <carlus.huang@amd.com>
      
      ---------
      Signed-off-by: default avatardependabot[bot] <support@github.com>
      Co-authored-by: default avatarillsilin <Illia.Silin@amd.com>
      Co-authored-by: default avatarIllia Silin <98187287+illsilin@users.noreply.github.com>
      Co-authored-by: default avatarJing Zhang <jizha@amd.com>
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      Co-authored-by: default avatardependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
      Co-authored-by: default avatarPo-Yen, Chen <PoYen.Chen@amd.com>
      Co-authored-by: default avatarrocking <ChunYu.Lai@amd.com>
      db376dd8
  21. 14 Apr, 2024 1 commit
    • Haocong WANG's avatar
      [GEMM] Gemm universal device operation (#1154) · f83e9701
      Haocong WANG authored
      
      
      * Optimize GEMM on MI200/300:
      1. Add new blockwise gemm pipeline
      2. Add irregular splitk intances
      
      * clang format + typo fix
      
      * Fix a bug
      
      * initial commit
      
      * Add more instances to irregular splitk
      
      * blkgemm pipeline v1~4 prototype
      
      * Sanity Checked. Known issue:
      1. Poor performance of splitk
      2. Register spill on blkgemmpipeline v3
      
      * Sanity and Performance fix:
      1. fix a bug related to sanity in grouped b2c mapping
      2. fix a bug related to sanity and performance in splitk offset
      
      * Sanity and API update:
      1. Remove prefetch stage
      2. Fix valid check bug
      3, Add first gemm_universal instance into ckProfiler
      
      * Add NN instances for gemm universal
      
      * 1. Add NT instances for gemm_universal
      2. Fix a bug about Kpadding in gemm_universal
      
      * Fix a bug regarding padding Odd K number
      
      * remove kernel print
      
      * Fix KPadding bug...
      
      * Update safety check
      
      * another try to fix kpadding..
      
      * Sanity checked
      
      * new instances..
      
      * clang format+typo fix
      
      * remove clang format script's change
      
      * Add non-hotloop compile option
      
      * 1. Add fp16xfp8 example
      2. pull packed convert f8 from pr1150
      
      * Some miscs.. opt and fix
      
      * Add pipeline description docs
      
      * Split universal gemm instance library to cut profiler compiling time
      
      * uncomment cmakefile
      
      * Fix a bug caused by blockwise_gemm_pipe_v2
      
      * reduce default splitk to 1
      
      * Add 224x256x64 tile size
      
      * update, including:
      1. Experiment pipeline 5~7
      2. Optimization for pipeline 4
      3. Organized instance library
      
      * temp save
      
      * temp save
      
      * Permuted lds layout, sanity and function checked
      
      * clang format
      
      * Move OOB check from RunRead to RunWrite, for better software pipeline.
      TODO: agpr spill when NN layout
      
      * clangformat
      
      * A/B splitpipe scheduler for v3
      
      * Fix two bugs
      
      * bug fix
      
      * fix a bug in oob check
      
      * Example for mixed fp16_fp8 gemm
      
      * Clean experimental code blocks
      
      * Add mixed precision gemm into profiler
      
      * tempsave
      
      * optimize m/n major lds layout
      
      * Add RRR GEMM  mixed precision instances
      
      * Optimize f8 matrix transpose
      
      * Add test_gemm_universal
      
      * A/B spilt schedule for blkpip v5
      
      * Take ds_read2 into iglp scheduling scheme
      
      * format
      
      * fixed cmake
      
      * Add llvm-option into CI cmake flag
      
      ---------
      Co-authored-by: default avatarJing Zhang <jizhan@amd.com>
      f83e9701
  22. 22 Mar, 2024 1 commit
  23. 18 Mar, 2024 1 commit
    • Illia Silin's avatar
      Re-enable the performance tracking in CI. (#1203) · bdcd0374
      Illia Silin authored
      * test CK with rocm6.1 RC2
      
      * add docker credentials for pull
      
      * update the performance db name
      
      * use environment variable for db name
      
      * add rocm-llvm-dev package to ck docker
      
      * turn off verification for daily performance runs
      
      * do not stash ckProfiler on MI300 node
      
      * add processing of mixed gemms to qa, fix parsing of splitk gemm logs
      
      * fix the splitk gemm log file name
      
      * turn the timing on for splitk gemm performance
      bdcd0374
  24. 31 Jan, 2024 1 commit
  25. 24 Jan, 2024 1 commit
    • Illia Silin's avatar
      Fixing most of the cppcheck errors. (#1142) · 180e5720
      Illia Silin authored
      * fix cppcheck errors, first pass
      
      * fix format
      
      * fix returned value in examples
      
      * add macro definitions for cppcheck
      
      * fix the profile_gemm logic
      
      * update the gemm profiler logic
      
      * add more difinitions to cppcheck, fix couple more errors
      
      * replace runtime error with message in device function
      
      * fix a couple of int4 issues
      
      * no return for fill function
      
      * fix errors in data_types.hpp
      
      * fix format
      
      * fix few remaining errors
      
      * fix errors in data_types.hpp
      
      * fix last couple of errors in datat_types.hpp
      180e5720
  26. 09 Nov, 2023 1 commit
  27. 07 Nov, 2023 1 commit
  28. 30 Oct, 2023 1 commit
    • Illia Silin's avatar
      Enable sccache in the default docker and CI. (#1009) · 4e44a9e8
      Illia Silin authored
      
      
      * replace ccache with sccache, pin package versions
      
      * put ccache back temporarily to avoid breaking other CI jobs
      
      * add sccashe_wrapper.sh script
      
      * fix the package version syntax
      
      * fix the pymysql package issue
      
      * run sccache_wrapper before build if ccache server found
      
      * set the paths before calling the sccache_wrapper
      
      * use /tmp instead of /usr/local for cache
      
      * try using sccache --start-server instead of wrapper
      
      * try using redis server with sccache
      
      * define SCCACHE_REDIS
      
      * add redis and ping packages, and redis port
      
      * use the new sccache redis server
      
      * do not use sccache with staging compiler
      
      * fix the condition syntax
      
      * add stunnel to redis
      
      * add tunnel verification
      
      * separate caches for different architectures
      
      * fix syntax for the cache tag
      
      * quse double brackets for conditions
      
      * add bash line to the script
      
      * add a switch for sccache and only use it in build stage
      
      * run check_host function when enabling sccache
      
      * fix the invocation tags for sccache
      
      * fix groovy syntax
      
      * set the invocation tag in groovy
      
      * disable sccache in clang-format stage
      
      * try another syntax for invocation tags
      
      * use local sccache server if can't connect to redis
      
      * fix script syntax
      
      * update README
      
      * refresh readme
      
      * readme updates
      
      * remove the timing and verification caveat from readme
      
      ---------
      Co-authored-by: default avatarLisa Delaney <lisa.delaney@amd.com>
      4e44a9e8
  29. 11 Oct, 2023 2 commits
    • zjing14's avatar
      Revert "Grouped Gemm with looping over the tiles. (#788)" (#982) · c99323be
      zjing14 authored
      This reverts commit a4f72a31.
      c99323be
    • Adam Osewski's avatar
      Grouped Gemm with looping over the tiles. (#788) · a4f72a31
      Adam Osewski authored
      
      
      * Introduce LocalBlockToCTileMap.
      
      * Change the signature of CalculateBottomIndex() function which now does
      not accept any argument. The B2C map which is already passed as an
      argument to the kernel Run function is calculating block's local id
      already outside at kernel entry point __global__ function.
      The LocalB2C map stores as members local block ID.
      
      * Use LocalBlockToCTile map in device ops.
      
      * First draft of tile loop work distribution.
      
      * Fix typo.
      
      * Simplify kernel arguments.
      
      Calculate descriptors & B2C maps on the device.
      
      * Use looping kernel.
      
      * Fix B2C constructor.
      
      * Fix Navi21 errors.
      
      * Calculate tile start/end in device kernel.
      
      * Change Run API to accept user provided workspace buffer.
      
      * Add new line at EOF.
      
      * Move Gemm KernelArguments to device op interface.
      
      * Remove unused code.
      
      * Update API.
      
      * Launch grid size which is min of occupancy vs tile count
      
      * Get back to use constant memory for gemm descriptors.
      
      * Remove unused code.
      
      * Add default virtual method implementation.
      
      * Update comments to conform with doxygen style.
      
      * Fix doc style and unused parameters.
      
      * Add thread cluster lengths to kernel name.
      
      * Remove old splitk impl and replace it with tile looping one.
      
      * Modify instances.
      
      * set KPerBlock to 64
      * maximize wherever possible vector load size.
      
      * Fix instances cluster lengths.
      
      * Change comment style.
      
      * Use 128b store where possible in instances.
      
      * Update test cases, since KPerBlock has doubled.
      
      * Update output stream operator for Sequence.
      
      * Add pipeline version to GroupedGEMM device op type string.
      
      * Fix pipeline version type logging.
      
      * Fix input tensors type after merge.
      
      * Fix compiler error.
      
      * Fix output stream operator for Pipeline version.
      
      * Store using 128b.
      
      * Set of instances with kpb 32/64
      
      * Limit number of instances
      
      * Remove commented out instances.
      
      * Fix function name.
      
      * Limit the number of instances.
      
      Add pipline version to the regular instances
      
      * Change thr cluster layout for reading B tensor.
      
      * disabled failed instances
      
      ---------
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      Co-authored-by: default avatarJing Zhang <jizha@amd.com>
      a4f72a31
  30. 31 Aug, 2023 1 commit
    • zjing14's avatar
      Grouped Gemm with Fixed K and N with SplitK (#818) · f5ec04f0
      zjing14 authored
      
      
      * move all arguments into device
      
      * add b2c_tile_map
      
      * add examples
      
      * add SetDeviceKernelArgs
      
      * dedicated fixed_nk solution
      
      * init client api
      
      * add grouped_gemm_bias example
      
      * add a instance
      
      * add instances
      
      * formatting
      
      * fixed cmake
      
      * Update EnableCompilerWarnings.cmake
      
      * Update cmake-ck-dev.sh
      
      * clean; fixed comments
      
      * fixed comment
      
      * add instances for fp32 output
      
      * add instances for fp32 output
      
      * add fp32 out client example
      
      * fixed CI
      
      * init commit for kbatch
      
      * add splitk gridwise
      
      * format
      
      * fixed
      
      * clean deviceop
      
      * clean code
      
      * finish splitk
      
      * fixed instances
      
      * change m_loops to tile_loops
      
      * add setkbatch
      
      * clean code
      
      * add splitK+bias
      
      * add instances
      
      * opt mk_nk instances
      
      * clean examples
      
      * fixed CI
      
      * remove zero
      
      * finished non-zero
      
      * clean
      
      * clean code
      
      * optimized global_barrier
      
      * fixed ci
      
      * fixed CI
      
      * removed AddBias
      
      * format
      
      * fixed CI
      
      * fixed CI
      
      * move 20_grouped_gemm to 21_grouped_gemm
      
      ---------
      Co-authored-by: default avatarJing Zhang <jizha@amd.com>
      f5ec04f0
  31. 23 Aug, 2023 1 commit
    • Jun Liu's avatar
      [HotFix] add config and version files to pass on build info (#856) · c8a8385f
      Jun Liu authored
      * experiment with config file
      
      * experiment with version.h config
      
      * add more info to version.h
      
      * minor updates
      
      * minor updates
      
      * fix case where DTYPE is not used
      
      * large amount of files but minor changes
      
      * remove white space
      
      * minor changes to add more MACROs
      
      * fix cmakedefine01
      
      * fix issue with CK internal conflict
      
      * fix define and define value
      
      * fix clang-format
      
      * fix formatting issue
      
      * experiment with cmake
      
      * clang format v12 to be consistent with miopen
      
      * avoid clang-format for config file
      c8a8385f
  32. 27 Jul, 2023 1 commit
  33. 06 Jul, 2023 1 commit
  34. 16 Jun, 2023 1 commit
  35. 15 Jun, 2023 1 commit
    • Illia Silin's avatar
      Enable gfx941 and gfx942 architectures. (#752) · 027e46ee
      Illia Silin authored
      * enable gfx941/942 targets
      
      * fix clang format
      
      * fix the cmake logic for multiple targets
      
      * fix cmake syntax for looping over targets
      
      * add gfx941/942 support for gemm_xdl instances
      027e46ee
  36. 28 Apr, 2023 1 commit
  37. 29 Mar, 2023 1 commit
  38. 15 Mar, 2023 1 commit