1. 26 Jun, 2024 2 commits
    • [CK_TILE] fmha forward split-kv + combine kernels (#1338) · 0cb2e06d
      Po Yen Chen authored
      
      
      * FA fwd dropout
      
      * FA bwd
      
      * epilogue reuse
      
      * CMakeLists update
      
      * [CK_TILE] support alibi (#1269)
      
      * add alibi support
      
      * fix code
      
      * update code based on comment
      
      * Support more hdim
      
      * fix fp8 bias
      
      * support seqlen_k=0 case
      
      * remove unused printf
      
      * fix format
      
      ---------
      Co-authored-by: rocking <ChunYu.Lai@amd.com>
      
      * now fwd/bwd can build
      
      * bwd alibi
      
      * add bwd validation stream_config
      
      * update generated filenames
      
      * update bwd kernel launch
      
      * CK_TILE_HOST_DEVICE in philox
      
      * Transpose -> transpose
      
      * format
      
      * format
      
      * format
      
      * Generate the instance for FA required
      
      * format
      
      * fix error in WarpGemm
      
      * Add num_splits option and dummy split-kv api method
      
      * Generate fmha_fwd_splitkv()
      
      * Add SplitKV kernel codegen logics
      
      * Add SplitKV combine kernel codegen logics
      
      * Fix mismatched return type
      
      * Clean-up code
      
      * Replace sentinel value before storing
      
      * Fix wrong layout of LSE/LSEacc/Oacc
      
      * Format codes
      
      * Fix o_acc memory error
      
      * Fix wrong kBlockSize used in policy
      
      * Reduce # of combine kernels
      
      * Fix split-kv combine kernel name
      
      * Fix wrong LDS indexing logics
      
      * Fix wrong loop counter step logic
      
      * Undo vector size changes
      
      * Remove no-longer used field
      
      * Remove inconsistent comment
      
      * Remove debug statements in example
      
      * Remove more debug statements
      
      * Add constness to local variables
      
      * Clean up generate.py
      
      * Fix unstable clang-format comment
      
      * Remove unused include directive
      
      * Use shorter template parameter name
      
      * Enable non-split-kv blobs
      
      * Update license date
      
      * Print num_splits conditionally
      
      * Undo disabling data types
      
      * Remove unnecessary tile size for fp8
      
      * Fix wrong pipeline args for fp8
      
      * Fix example output format
      
      * Remove more debug code in combine pipeline
      
      * Add stride kernel arguments for LSE/O acc workspace
      
      * Re-order split-kv pipeline call operator arguments
      
      * Pass LSE/O strides in kernel argument
      
      * Re-order pipeline call operator arguments
      
      * Use tensor_descriptor to locate LSEacc elements
      
      * Support providing invalid element for tensor view
      
      * Set invalid element value for LSEacc tensor view
      
      * Remove hand-written store_tile() code
      
      * Remove necessary value-overwrite logic
      
      * Add transposed lds descriptor
      
      * Support load_tile() for tile_window_with_static_lengths<>
      
      * Undo removing necessary value-overwrite logic
      
      * Use read descriptor to locate lds elements
      
      * Simplify pipeline source code
      
      * Add constraint to kMaxSplits
      
      * Default use kMaxSplits=64 in generate.py
      
      * Revert "Add constraint to kMaxSplits"
      
      This reverts commit 0a2132d758042e6fb0292f4e354909b8a4d1c118.
      
      * Revert "Default use kMaxSplits=64 in generate.py"
      
      This reverts commit c7d9c80b77320aec6559222bed7d47adcaefe4e3.
      
      * Decide alignment by the padding parameter
      
      * Remove no-longer used utility functions
      
      * Remove not-working code
      
      * Add comment & remove no-longer used code
      
      * Fix computation errors
      
      * Add heuristic to override num_splits option
      
      * Add constraint to kMaxSplits
      
      * Fix compilation error
      
      * Clean up pipeline code
      
      * Wrap pointer access as lambda function
      
      * Rename confusing methods
      
      * Use kLogMaxSplits as template parameter
      
      * Finish splitkv combine kernel codegen
      
      * Update kMaxSplits limit
      
      * Use smaller kM0 for splitkv combine kernel
      
      * Ignore dropout flag in splitkv pipeline
      
      * Unify flag usage
      
      * Add back flag kStoreLSE
      
      * Merge lambda calls in pipeline
      
      * Fix compilation errors
      
      * Avoid all empty splits
      
      * Always check for empty loop in splitkv pipelines
      
      * Re-order parameters
      
      * Remove redundant p_drop option check
      
      * Add traits/problem for fwd splitkv kernel
      
      * Conditionally enable uneven split boundary checks
      
      * Add comment for the splitkv traits field
      
      * Change even split criteria
      
      * Re-order statements
      
      * Refine occupancy value for hdim=128&256
      
      * Refine occupancy value for hdim=32&64
      
      * Remove redundant kernel argument
      
      * Separate fmha bwd codegen logics
      
      * Separate fmha fwd codegen logics
      
      * Remove redundant direction parameter in fwd&bwd codegen logics
      
      * Support generate multiple APIs for an example
      
      * Let 'api' an alias of 'direction' option
      
      * Remove choices for the 'direction' option
      
      * Use dictionary to config all the functions
      
      * Move fmha splitkv codegen logics to other file
      
      * Add fwd_splitkv api for tile_example_fmha_fwd
      
      ---------
      
      Co-authored-by: danyao12 <danyao12>
      Co-authored-by: carlushuang <carlus.huang@amd.com>
      Co-authored-by: rocking <ChunYu.Lai@amd.com>
      Co-authored-by: Jing Zhang <jizhan@amd.com>
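The split-kv and combine steps named in this commit can be illustrated with a small host-side sketch. It is not taken from the CK_TILE sources: the function names (`partial_attention`, `combine`, `split_kv_attention`) are hypothetical, the softmax scale factor is omitted, and only a single query row is handled. It assumes each split produces a locally normalized output plus its log-sum-exp (LSE), and that the combine step re-weights the partial outputs by their LSE values. Empty splits get LSE = -inf and therefore zero weight.

```python
import math


def partial_attention(q, keys, values, dim_v):
    """Softmax attention of one query over one KV split.

    Returns the split-local normalized output and its log-sum-exp (LSE).
    An empty split returns zeros and LSE = -inf so it gets zero weight later.
    """
    if not keys:
        return [0.0] * dim_v, -math.inf
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)
    out = [sum(e * v[d] for e, v in zip(exps, values)) / denom
           for d in range(dim_v)]
    return out, m + math.log(denom)


def combine(partials, dim_v):
    """Merge per-split (output, LSE) pairs into the full-sequence result."""
    lse_max = max(lse for _, lse in partials)
    if lse_max == -math.inf:  # every split was empty
        return [0.0] * dim_v, -math.inf
    weights = [math.exp(lse - lse_max) for _, lse in partials]
    total = sum(weights)
    out = [sum(w * o[d] for w, (o, _) in zip(weights, partials)) / total
           for d in range(dim_v)]
    return out, lse_max + math.log(total)


def split_kv_attention(q, keys, values, num_splits):
    """Split the KV sequence into num_splits chunks, attend, then combine."""
    dim_v = len(values[0])
    chunk = -(-len(keys) // num_splits)  # ceiling division
    partials = [
        partial_attention(q,
                          keys[i * chunk:(i + 1) * chunk],
                          values[i * chunk:(i + 1) * chunk],
                          dim_v)
        for i in range(num_splits)
    ]
    return combine(partials, dim_v)
```

Under these assumptions, `split_kv_attention(q, keys, values, n)` should agree with the unsplit result (`n = 1`) up to floating-point error for any `n`, including values of `n` that leave trailing splits empty.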
  2. 24 Jun, 2024 1 commit
  3. 21 Jun, 2024 1 commit
  4. 20 Jun, 2024 2 commits
  5. 19 Jun, 2024 1 commit
  6. 17 Jun, 2024 1 commit
  7. 13 Jun, 2024 1 commit
  8. 04 Jun, 2024 1 commit
    • CK Tile FA Training kernels (#1286) · 2cab8d39
      Dan Yao authored
      
      
      * FA fwd dropout
      
      * FA bwd
      
      * epilogue reuse
      
      * CMakeLists update
      
      * [CK_TILE] support alibi (#1269)
      
      * add alibi support
      
      * fix code
      
      * update code based on comment
      
      * Support more hdim
      
      * fix fp8 bias
      
      * support seqlen_k=0 case
      
      * remove unused printf
      
      * fix format
      
      ---------
      Co-authored-by: rocking <ChunYu.Lai@amd.com>
      
      * now fwd/bwd can build
      
      * bwd alibi
      
      * add bwd validation stream_config
      
      * update generated filenames
      
      * update bwd kernel launch
      
      * CK_TILE_HOST_DEVICE in philox
      
      * Transpose -> transpose
      
      * format
      
      * format
      
      * format
      
      * Generate the instance for FA required
      
      * format
      
      * fix error in WarpGemm
      
      ---------
      
      Co-authored-by: danyao12 <danyao12>
      Co-authored-by: carlushuang <carlus.huang@amd.com>
      Co-authored-by: rocking <ChunYu.Lai@amd.com>
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
      Co-authored-by: Jing Zhang <jizhan@amd.com>
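For readers unfamiliar with the ALiBi support mentioned in this commit, the following is a minimal sketch of the bias as described in the original ALiBi formulation, not of the CK_TILE kernels. The helper names `alibi_slopes` and `alibi_bias` are hypothetical. The sketch assumes a power-of-two head count and aligns the last query row with the last key; the kernels in this repository may use a different alignment or slope convention.

```python
def alibi_slopes(num_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... for a power-of-two head count n."""
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    ratio = 2.0 ** (-8.0 / num_heads)
    return [ratio ** (h + 1) for h in range(num_heads)]


def alibi_bias(slope, seqlen_q, seqlen_k):
    """Return bias[i][j] = -slope * |i + offset - j|.

    offset aligns the last query row with the last key (an assumption of
    this sketch). The bias is added to the attention scores before softmax.
    """
    offset = seqlen_k - seqlen_q
    return [[-slope * abs(i + offset - j) for j in range(seqlen_k)]
            for i in range(seqlen_q)]
```

Each head gets its own slope, so a full bias tensor would be built by calling `alibi_bias` once per entry of `alibi_slopes(num_heads)`.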
  9. 28 May, 2024 1 commit
    • [CK_TILE] support group from cmdline (#1295) · 5055b3bd
      carlushuang authored
      * support cmdline seqlen decode
      
      * silent print
      
      * update readme
      
      * update kernel launch 3d
      
      * update tile partitioner
      
      * fix spill for bf16
      
      * modify based on comment
      
      * modify payload_t
      
      * fix bug for alibi mode
      
      * fix alibi test err
      
      * refactor kernel launch, support select timer
      
      * add missing file
      
      * remove useless code
      
      * add some comments
  10. 20 May, 2024 1 commit
  11. 17 May, 2024 1 commit
  12. 15 May, 2024 1 commit
  13. 07 May, 2024 1 commit
  14. 22 Apr, 2024 1 commit
  15. 16 Apr, 2024 1 commit
    • introducing ck_tile! (#1216) · db376dd8
      carlushuang authored
      * enable gfx940
      
      * switch between intrinsic mfma routines on mi100/200 and mi300
      
      * fix mfma_int8 on MI300
      
      * disable 2 int8 examples on MI300
      
      * Update cmake-ck-dev.sh
      
      * restore gitignore file
      
      * modify Jenkinsfile to the internal repo
      
      * Bump rocm-docs-core from 0.24.0 to 0.29.0 in /docs/sphinx
      
      Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.0 to 0.29.0.
      - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
      - [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
      - [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.0...v0.29.0)
      
      ---
      updated-dependencies:
      - dependency-name: rocm-docs-core
        dependency-type: direct:production
        update-type: version-update:semver-minor
      ...
      Signed-off-by: dependabot[bot] <support@github.com>
      
      * initial enablement of gfx950
      
      * fix clang format
      
      * disable examples 31 and 41 int8 on gfx950
      
      * add code
      
      * fix build wip
      
      * fix xx
      
      * now can build
      
      * naming
      
      * minor fix
      
      * wip fix
      
      * fix macro for exp2; fix warpgemm a/b in transposedC
      
      * unify as tuple_array
      
      * Update the required Python version to 3.9
      
      * Update executable name in test scripts
      
      * re-structure tuple/array to avoid spill
      
      * Merge function templates
      
      * Fix format
      
      * Add constraint to array<> ctor
      
      * Re-use function
      
      * Some minor changes
      
      * remove wrong code in store_raw()
      
      * fix compile issue in transpose
      
      * Rename enum
      Rename 'cood_transform_enum' to 'coord_transform_enum'
      
      * let more integral_constant->constant, and formatting
      
      * make sure thread_buffer can be tuple/array
      
      * temp fix buffer_store spill
      
      * not using custom data type by default, now we can have ISA-level same code as opt_padding
      
      * fix compile error, fp8 not ready now
      
      * fix fp8 duplicated move/shift/and/or problem
      
      * Default use CK_TILE_FLOAT_TO_FP8_STOCHASTIC rounding mode
      
      * fix scratch in fp8 kernel
      
      * update some readme
      
      * fix merge from upstream
      
      * sync with upstream
      
      * sync upstream again
      
      * sync 22
      
      * remove unused
      
      * fix clang-format
      
      * update README of ck_tile example
      
      * fix several issues
      
      * let python version to be 3.8 as minimal
      
      * remove ck_tile example from default cmake target like all/install/check
      
      * remove mistake
      
      * 1) support recipe in generate.py 2) use simplified mask type 3) change left/right to pass into karg
      
      * fix some bugs in group-mode masking and codegen. update README
      
      * F8 quantization for FMHA forward (#1224)
      
      * Add SAccElementFunction, PComputeElementFunction, OAccElementFunction in pipeline
      
      * Add element function to fmha api
      
      * Adjust P elementwise function
      
      * Fix bug of elementwise op, our elementwise op is not inout
      
      * Add some elementwise op, prepare to quantization
      
      * Let generate.py can generate different elementwise function
      
      * To prevent compiler issue, remove the elementwise function we have not used.
      
      * Remove f8 pipeline, we should share the same pipeline even in f8
      
      * Remove remove_cvref_t
      
      * Avoid warning
      
      * Fix wrong fp8 QK/KV block gemm setting
      
      * Check fp8 rounding error in check_err()
      
      * Set fp8 rounding error for check_err()
      
      * Use CK_TILE_FLOAT_TO_FP8_STANDARD as default fp8 rounding mode
      
      * 1. codegen the f8 api and kernel
      2. f8 host code
      
      * prevent warning in filter mode
      
      * Remove not-in-use elementwise function kargs
      
      * Remove more not-in-use elementwise function kargs
      
      * Small refinements in C++ source files
      
      * Use conditional_t<> to simplify code
      
      * Support heterogeneous argument for binary function types
      
      * Re-use already-existing scales<> functor template
      
      * Fix wrong value produced by saturating
      
      * Generalize the composes<> template
      
      * Unify saturates<> implementation
      
      * Fix type errors in composes<>
      
      * Extend less_equal<>
      
      * Reuse the existing template less_equal<> in check_err()
      
      * Add equal<float> & equal<double>
      
      * Rename check_err() parameter
      
      * Rename check_err() parameter
      
      * Add FIXME comment for adding new macro in future
      
      * Remove unnecessary cast to void
      
      * Eliminate duplicated code
      
      * Avoid dividing api pool into more than 2 groups
      
      * Use more clear variable names
      
      * Use affirmative condition in if stmt
      
      * Remove blank lines
      
      * Do not perfect-forward in composes<>
      
      * To fix compile error, revert generate.py back to 4439cc107dd90302d68a6494bdd33113318709f8
      
      * Fix bug of p element function
      
      * Add compute element op to host softmax
      
      * Remove element function in api interface
      
      * Extract user parameter
      
      * Rename pscale and oscale variable
      
      * rename f8 to fp8
      
      * rename more f8 to fp8
      
      * Add pipeline::operator() without element_functor
      
      * 1. Remove deprecated pipeline enum
      2. Refine host code parameter
      
      * Use quantization range as input
      
      * 1. Rename max_dtype to dtype_max.
      2. Rename scale to scale_s
      3. Add init description
      
      * Refine description
      
      * prevent early return
      
      * unify _squant kernel name in cpp, update README
      
      * Adjust the default range.
      
      * Refine error message and bias range
      
      * Add fp8 benchmark and smoke test
      
      * fix fp8 swizzle_factor=4 case
      
      ---------
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
      Co-authored-by: carlushuang <carlus.huang@amd.com>
      
      ---------
      Signed-off-by: dependabot[bot] <support@github.com>
      Co-authored-by: illsilin <Illia.Silin@amd.com>
      Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
      Co-authored-by: Jing Zhang <jizha@amd.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
      Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
      Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
      Co-authored-by: rocking <ChunYu.Lai@amd.com>