1. 07 Jan, 2025 1 commit
    • [CK_TILE] fmha fwd splitkv optimization for decode (seqlen_q=1) (#1789) · 24b12d04
      Po Yen Chen authored
      
      
      * Update license year
      
      * Add initial code to override decode problem
      
      * Fix splitkv traits/args overriding error
      
      * Reshape and transpose lse for decode
      
      * Remove debug code
      
      * Prettify example code
      
      * Use better function name
      
      * Add kMergeNumHeadGroupsSeqLenQ flag
      
      Kernel users can use this switch to turn the optimization on or off for
      some problem sizes
      
      * Add missing flag declarations
      
      * Default turn off kMergeNumHeadGroupsSeqLenQ in codegen
      
      * Group similar statements together
      
      * Remove assumption of seqlen_q=1
      
      * Remove kMergeNumHeadGroupsSeqLenQ from splitkv combine kernel
      
      * Support kMergeNumHeadGroupsSeqLenQ=true in fmha splitkv kernel
      
      * Run kMergeNumHeadGroupsSeqLenQ=true kernels when needed
      
      * Fix group mode block skip logic
      
      * Undo changes of normal fwd kernel
      
      * Update in GridSize() and using GridSize() for splitkv kernel (#1799)
      
      ---------
      Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>
  2. 31 Oct, 2024 1 commit
    • [CK_TILE] layernorm support fused-quant/fused-add (#1604) · c3a4800c
      carlushuang authored
      * add prenorm/postnorm support, refactor using generate.py
      
      * update README
      
      * update README
      
      * fix format
      
      * update some description and fix format
      
      * update format
      
      * format
      
      * use non-raw for loading
      
      * format and update n4096
      
      * dynamic-quant ready
      
      * update readme
      
      * support fused dynamic-quant
      
      * update fused-quant, with smooth
      
      * update README
      
      * update args
      
      * update some based on comment
  3. 26 Oct, 2024 1 commit
    • topk_softmax (#1592) · b098b71b
      carlushuang authored
      * topk_softmax
      
      * remove some file
      
      * fix atomix linear_offset
      
      * address various comment, and change sfc get_index api to static(tuple)
  4. 16 Apr, 2024 1 commit
    • introducing ck_tile! (#1216) · db376dd8
      carlushuang authored
      * enable gfx940
      
      * switch between intrinsic mfma routines on mi100/200 and mi300
      
      * fix mfma_int8 on MI300
      
      * disable 2 int8 examples on MI300
      
      * Update cmake-ck-dev.sh
      
      * restore gitignore file
      
      * modify Jenkinsfile to the internal repo
      
      * Bump rocm-docs-core from 0.24.0 to 0.29.0 in /docs/sphinx
      
      Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.0 to 0.29.0.
      - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
      - [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
      - [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.0...v0.29.0)
      
      ---
      updated-dependencies:
      - dependency-name: rocm-docs-core
        dependency-type: direct:production
        update-type: version-update:semver-minor
      ...
      Signed-off-by: dependabot[bot] <support@github.com>
      
      * initial enablement of gfx950
      
      * fix clang format
      
      * disable examples 31 and 41 int8 on gfx950
      
      * add code
      
      * fix build wip
      
      * fix xx
      
      * now can build
      
      * naming
      
      * minor fix
      
      * wip fix
      
      * fix macro for exp2; fix warpgemm a/b in transposedC
      
      * unify as tuple_array
      
      * Update the required Python version to 3.9
      
      * Update executable name in test scripts
      
      * re-structure tuple/array to avoid spill
      
      * Merge function templates
      
      * Fix format
      
      * Add constraint to array<> ctor
      
      * Re-use function
      
      * Some minor changes
      
      * remove wrong code in store_raw()
      
      * fix compile issue in transpose
      
      * Rename enum
      Rename 'cood_transform_enum' to 'coord_transform_enum'
      
      * let more integral_constant->constant, and formatting
      
      * make sure thread_buffer can be tuple/array
      
      * temp fix buffer_store spill
      
      * not using custom data type by default, now we can have ISA-level same code as opt_padding
      
      * fix compile error, fp8 not ready now
      
      * fix fp8 duplicated move/shift/and/or problem
      
      * Default use CK_TILE_FLOAT_TO_FP8_STOCHASTIC rounding mode
      
      * fix scratch in fp8 kernel
      
      * update some readme
      
      * fix merge from upstream
      
      * sync with upstream
      
      * sync upstream again
      
      * sync 22
      
      * remove unused
      
      * fix clang-format
      
      * update README of ck_tile example
      
      * fix several issues
      
      * let python version to be 3.8 as minimal
      
      * remove ck_tile example from default cmake target like all/install/check
      
      * remove mistake
      
      * 1).support receipe in generate.py 2).use simplified mask type 3).change left/right to pass into karg
      
      * fix some bug in group-mode masking and codegen. update README
      
      * F8 quantization for FMHA forward (#1224)
      
      * Add SAccElementFunction, PComputeElementFunction, OAccElementFunction in pipeline
      
      * Add element function to fmha api
      
      * Adjust P elementwise function
      
      * Fix bug of elementwise op, our elementwise op is not inout
      
      * Add some elementwise op, prepare to quantization
      
      * Let generate.py can generate different elementwise function
      
      * To prevent compiler issue, remove the elementwise function we have not used.
      
      * Remove f8 pipeline, we should share the same pipeline even in f8
      
      * Remove remove_cvref_t
      
      * Avoid warning
      
      * Fix wrong fp8 QK/KV block gemm setting
      
      * Check fp8 rounding error in check_err()
      
      * Set fp8 rounding error for check_err()
      
      * Use CK_TILE_FLOAT_TO_FP8_STANDARD as default fp8 rounding mode
      
      * 1. codegen the f8 api and kernel
      2. f8 host code
      
      * prevent warning in filter mode
      
      * Remove not-in-use elementwise function kargs
      
      * Remove more not-in-use elementwise function kargs
      
      * Small refinements in C++ source files
      
      * Use conditional_t<> to simplify code
      
      * Support heterogeneous argument for binary function types
      
      * Re-use already-existing scales<> functor template
      
      * Fix wrong value produced by saturating
      
      * Generalize the composes<> template
      
      * Unify saturates<> implementation
      
      * Fix type errors in composes<>
      
      * Extend less_equal<>
      
      * Reuse the existing template less_equal<> in check_err()
      
      * Add equal<float> & equal<double>
      
      * Rename check_err() parameter
      
      * Rename check_err() parameter
      
      * Add FIXME comment for adding new macro in future
      
      * Remove unnecessary cast to void
      
      * Eliminate duplicated code
      
      * Avoid dividing api pool into more than 2 groups
      
      * Use more clear variable names
      
      * Use affirmative condition in if stmt
      
      * Remove blank lines
      
      * Do not use perfect forwarding in composes<>
      
      * To fix compile error, revert generate.py back to 4439cc107dd90302d68a6494bdd33113318709f8
      
      * Fix bug of p element function
      
      * Add compute element op to host softmax
      
      * Remove element function in api interface
      
      * Extract user parameter
      
      * Rename pscale and oscale variable
      
      * rename f8 to fp8
      
      * rename more f8 to fp8
      
      * Add pipeline::operator() without element_functor
      
      * 1. Remove deprecated pipeline enum
      2. Refine host code parameter
      
      * Use quantization range as input
      
      * 1. Rename max_dtype to dtype_max.
      2. Rename scale to scale_s
      3. Add init description
      
      * Refine description
      
      * prevent early return
      
      * unify _squant kernel name in cpp, update README
      
      * Adjust the default range.
      
      * Refine error message and bias range
      
      * Add fp8 benchmark and smoke test
      
      * fix fp8 swizzle_factor=4 case
      
      ---------
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
      Co-authored-by: carlushuang <carlus.huang@amd.com>
      
      ---------
      Signed-off-by: dependabot[bot] <support@github.com>
      Co-authored-by: illsilin <Illia.Silin@amd.com>
      Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
      Co-authored-by: Jing Zhang <jizha@amd.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
      Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
      Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
      Co-authored-by: rocking <ChunYu.Lai@amd.com>