1. 11 Feb, 2025 1 commit
    • copy over fmha example · 4007289a
      Max Podkorytov authored
      
      
      remove bwd related commands from cmakelists
      
      remove unused ops in the example;
      
      select only bf16/nodropout/nolse/batched
      
      pass validation in the example driver
      
      fork pipeline
      
      add a hardcoded score_mod
      
      fork the kernel
      
      abstract score_mod from a pipeline
      
      unhardcode score_mod and pass it as a cpp expression from codegen
      
      modify host attention impl accounting for score_mod
      
      use custom score for testing
      
      reorder score mod and scale in host verification
      
      use cmakelists as the single source of truth for score_mod function definition
      
      fix numeric mismatches
      
      run clang-format
      
      remove bwd related scripts
      
      edit test and benchmark scripts for the new example
      
      remove readme
      
      remove unused cases from smoke test
      
      re-add group-mode kernels
      
      Add pre_softmax functor (#1852)
      
      * Add pre_softmax functor
      
      * remove stray define
      
      * Move op out of pipeline, adds it to refnc
      
      ---------
      Co-authored-by: root <root@splinter-126-wr-d1.aus.dcgpu>
      Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
      
      added flex_attention in Jenkins file
      
      fixing clang
      
      fixing clang
      
      space added
      
      fixed copyright errors
      
      fixed even more clangformat
      
      formatting
      
      modified jenkins
      
      fixed typo
      
      added flex attention test for gfx90a and gfx942
      
      fixed typo
      
      fixed example name
      
      fixed example script name
      
      added perf logs for both gpu arch
      
      pipeline fixes for accuracy issues; disable pre-softmax function until its accuracy is fixed
      
      added stash and unstash for perf logs
      
      fixed typo in perf name
      
      print error message
      
      print success message
      
      hardcoded perf files names
      
      flex attention jenkins switch off
      
      flex attention jenkins switch off from settings
      
      fixed typo
      
      add context to score-mod signature
      4007289a
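
The commit above threads a user-supplied `score_mod` through the pipeline and the host verification code. As a rough illustration of the idea only (not the example's actual host implementation), here is a minimal single-head reference attention in plain Python. The `score_mod(score, q_idx, kv_idx)` signature, the helper names, and the choice to apply `score_mod` after scaling are all assumptions made for this sketch; the commit messages mention reordering score mod and scale but do not say which order was kept.

```python
import math

def reference_attention(q, k, v, score_mod, scale=None):
    """Single batch/head attention on nested lists.

    q: [seqlen_q][d], k: [seqlen_k][d], v: [seqlen_k][dv].
    score_mod is a hypothetical callback: (score, q_idx, kv_idx) -> score.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d) if scale is None else scale
    out = []
    for q_idx, q_row in enumerate(q):
        # Assumption: scale first, then apply the user-defined score modification.
        scores = [
            score_mod(scale * sum(a * b for a, b in zip(q_row, k_row)), q_idx, kv_idx)
            for kv_idx, k_row in enumerate(k)
        ]
        # Numerically stable softmax over the key dimension.
        row_max = max(scores)
        exps = [math.exp(s - row_max) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Weighted sum of the value rows.
        out.append([sum(p * v_row[c] for p, v_row in zip(probs, v))
                    for c in range(len(v[0]))])
    return out

def identity_mod(score, q_idx, kv_idx):
    """Leaves scores untouched (plain attention)."""
    return score

def causal_mod(score, q_idx, kv_idx):
    """Masks out keys that lie after the query position."""
    return score if kv_idx <= q_idx else float("-inf")

q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0]]
plain = reference_attention(q, k, v, identity_mod)
causal = reference_attention(q, k, v, causal_mod)
```

With `causal_mod`, the first query can only attend to the first key, so its output row equals the first row of `v`.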
  2. 21 Nov, 2024 1 commit
  3. 29 Aug, 2024 1 commit
  4. 28 Aug, 2024 1 commit
    • [CK_TILE] Add PagedAttention kernels (#1387) · c1569892
      Po Yen Chen authored
      
      
      * Use dictionary to config all the functions
      
      * Add init codegen logic for fmha fwd appendkv
      
      * Call HIP_CHECK_ERROR() macro to get real source info
      
      * Setup meaningful arguments
      
      * Sync kernel name with the codegen
      
      * Add knew/vnew tensors to the kernel argument
      
      * Fix wrong K values after appending
      
      * Fix vnew append error
      
      * Extract common logics
      
      * Fix Vnew tile dstr for row major case
      
      * Conditionally add fwd_splitkv API in fmha_fwd example
      
      * Conditionally add call to fmha_fwd_splitkv()
      
      * Remove "EXAMPLE_" prefix of cmake variables
      
      * Register API handlers automatically
      
      * Early return if 0 < s_k_new is not supported
      
      * Show message if we are ignoring option
      
      * Unify CMakeLists.txt coding style
      
      * Set num_splits=1 if split-kv is not supported
      
      * Add length/stride getters for HostTensor
      
      * Add RoPE example utilities
      
      * Add reference_rotary_position_embedding() (not implemented)
      
      * Finish reference_rotary_position_embedding() impl
      
      * Fix typo of HostTensor<>::get_length()
      
      * Fix compilation errors
      
      * Fix wrong answer when interleaved=false
      
      * Fix wrong answer when interleaved=true
      
      * Append K/V in the host verification code
      
      * Simplify K appending logics
      
      * Simplify v_host_ref definition
      
      * Reduce input/output dimensions
      
      * Rename function: add "batched" prefix
      
      * Apply RoPE on host side
      
      * Rename RoPE utility function
      
      * Fix wrong tensor size
      
      * Avoid invoking deprecated method 'find_module'
      
      * Pass RoPE kernel args
      
      * Create Rotary Cos/Sin tile windows in kernel
      
      * Add compute data type alias for RoPE
      
      * Randomly generate seqlen_knew if needed
      
      * Fix seqlen_knew enabling check logic
      
      * Add minimum seqlen_k to generate compliance kvcache
      
      * Fix compilation error in debug mode
      
      * Fix wrong boundaries
      
      * Fix wrong seqlen_k for kvcache
      
      * Rename variables used in distribution encoding
      
      * Fix rotary cos/sin tensor/tile size
      
      * Add constraint to the rotary_dim option
      
      * Remove unused inner namespace
      
      * Add dram distribution for rotary_cos/rotary_sin (interleaved)
      
      * Only apply interleaved RoPE on Knew for now
      
      * Fix wrong thread starting offset
      
      * Instantiate multiple kernels for RoPE approaches
      
      * Clean-up pipeline
      
      * Fix error in RoPE host reference
      
      * Handle RoPE half-rotated logics
      
      * Support 8x rotary_dim under half-rotated RoPE
      
      * Add comment
      
      * Apply elementwise function to the loaded tiles
      
      * Unify parameter/variable naming style
      
      * Remove constness from q_ptr
      
      * Add code blocks for q_tile
      
      * Apply RoPE to q_tile
      
      * Remove debug print code in kernel
      
      * Fix wrong knew/vnew appending positions
      
      * Use better naming for tile indices
      
      * Add make_tile_window() for adding distribution only
      
      * Skip code if # of block is more than needed
      
      * Move thread locating logics into policy
      
      * Remove always true static_assert()
      
      * Rename header
      
      * Rename RotaryEmbeddingEnum
      
      * Extract rotary embedding logic out
      
      * Re-order parameters
      
      * Align naming of some tile size constants
      
      * Rename more tile size constants
      
      * Fix wrong grid size
      
      * Fix wrong shape of knew_host/vnew_host
      
      * Fix wrong index into knew_host/vnew_host
      
      * Fix wrong rotary_cos/rotary_sin memory size for Q
      
      * Extract Q/Knew vector size to helper methods
      
      * Use different rotary_cos/rotary_sin distr for Q/Knew
      
      * Update host/device specifiers
      
      * Fix wrong data type for Q rotary_cos/rotary_sin
      
      * Remove RoPEComputeDataType type alias
      
      * Shift rotary_cos/rotary_sin by cache_seqlen_k
      
      * Add comment for why I just use 't' for all padding flags
      
      * Align commit message to the real comment
      
      * Fix wrong pipeline
      
      * Rename utility function
      
      * Disable host verification if API not exist
      
      * Fix wrong rope key for fp8 pipeline
      
      * Allow only apply RoPE on Q (without append KV)
      
      * Add append-kv smoke tests
      
      * Remove debug statements
      
      * Remove more debug statements
      
      * Re-arrange the 'set +x' command
      
      * Remove no-longer used method in pipeline
      
      * Add missing init code
      
      * Refine pipeline padding settings
      
      * Enlarge rotary_dim limit (8 -> 16)
      
      * Enlarge KPerThread for rotary_interleaved=false
      
      * Update rotary_dim range in smoke_test_fwd.sh
      
      * Add template argument 'kIsPagedKV' for splitkv kernels
      
      * Launch splitkv kernel if given page_block_size
      
      * Fix wrong kernel name
      
      * Fix seqlen_k_min for pre-fill case (1 -> 0)
      
      * Add copy_const<> type trait
      
      * Add another make_tile_window()
      
      * Introduce 'TileWindowNavigator' types
      
      * Simplify TileWindowNavigator interfaces
      
      * Fix tile window navigation bugs
      
      * Disable calling fmha_fwd()
      
      * Remove unnecessary data members
      
      * Simplify more make_tile_window() overloads
      
      * Move V tile through TileWindowNavigator
      
      * Fix uneven split checking logic
      
      * Move code after decide seqlen_q/seqlen_k
      
      * Make sure we always start reading complete tile
      
      * Use 128 as minimum page_block_size
      
      * Fix wrong origin for bias
      
      * Add batch_stride_k/batch_stride_v in group mode
      
      * Unify origin
      
      * Add missing kernel arguments for group mode
      
      * Add paged-kv codegen logic for appendkv kernels
      
      * Add block_table kernel args for appendkv kernel
      
      * Add tile navigators to the appendkv kernel
      
      * Fix wrong tensor descriptor lengths
      
      * Pass re-created tile window to pipeline
      
      * Fix wrong strides for appendkv kernel
      
      * Allow transit tile_window to another page-block
      
      * Handle cross-page-block write
      
      * Do not perform write again if already in last page-block
      
      * Always add fmha_fwd() api
      
      * Add missing group mode argument
      
      * Remove debug macro usages
      
      * Rename option s_k_new to s_knew
      
      * Separate splitkv/non-splitkv args/traits
      
      * Remove fmha_fwd_dispatch()
      
      * Fix compilation errors
      
      * Remove dropout code in splitkv kernel
      
      * Allow problem types without define kHasDropout attr
      
      * Use generic lambda to init traits objects
      
      * Separate more non-splitkv & splitkv traits/args
      
      * Display more info for specific kernels
      
      * Show more detailed warning message
      
      * Rename 'max_num_blocks' to 'max_num_page_blocks'
      
      * Remove no-longer used pipeline files
      
      * Wrap code by #if directives
      
      * Move functors to the beginning of validation code
      
      * Use generic lambda to init all the api traits/args
      
      * Fix wrong seqlen for kvcache
      
      * Add missing comment
      
      * Rename TileWindowNavigator to PageBlockNavigator
      
      * Only expose necessary methods (not attributes)
      
      * Re-order pipeline parameters
      
      * Refine smoke_test_fwd.sh
      
      * Fix wrong argument count
      
      * Make tile window directly via PageBlockNavigator
      
      * Remove unused template parameter
      
      * Remove group mode from appendkv kernel
      
      * Fix skcheck logic
      
      * Fix wrong syntax in skcheck expr
      
      * Use meaningful options in smoke test
      
      * Remove options
      
      * Fix formatting
      
      * Fix more format
      
      * Re-organize bash functions
      
      * Pass cache_batch_idx to kernels
      
      * Support cache_batch_idx in example
      
      * Fix compilation error
      
      * Add more appendkv test
      
      * Add more case for appendkv
      
      * Fix nonexistent attribute
      
      * Remove 0 < seqlen_knew constraint
      
      * Clarify the case in warning message
      
      * Remove macro checking
      
      * Force batch mode when invoking appendkv & splitkv apis
      
      * Fix mode overriding logics
      
      * Fix wrong parameter name
      
      * Randomize seqlen_k if use kvcache
      
      * Use randomized seqlen_k for kvcache
      
      * Avoid using too small rotary_cos & rotary_sin
      
      * Rename parameter
      
      * Add seqlen_q & seqlen_k rules
      
      * Add comment
      
      * Add more comments
      
      * Fix compilation errors
      
      * Fix typo in comment
      
      * Remove type argument
      
      * Avoid seqlen_k=0 for kvcache
      
      * Revert "Avoid seqlen_k=0 for kvcache"
      
      This reverts commit 21c4df89e416182e8e9bc78e67bd4b98dbb6c88d.
      
      * Fix wrong uneven split checking logics
      
      * Only randomize kvcache seqlen_k if 1 < batch
      
      * Return earlier if split is empty
      
      * Revert "Only randomize kvcache seqlen_k if 1 < batch"
      
      This reverts commit b9a4ab0d7e3c2beecc0fccafd2a13259dd06299c.
      
      * Re-order seqlen_k_start adjustment logics
      
      * Fix compilation errors
      
      * Re-format script
      
      * Find executable from folder automatically
      
      * Fix kvcache seqlen_k generating logic
      
      * Make comment more clear
      
      * Fix wrong knew/vnew appending logic on host
      
      * Add s_barrier to sync threads
      
      * Revert "Add s_barrier to sync threads"
      
      This reverts commit d3f550f30c0a4d9df15c613015d5dff268d6746d.
      
      * Support only using 1 row of rotary_cos/rotary_sin
      
      * Rotate Q in different way
      
      * Unify tensor view creation logics
      
      * Fix wrong argument
      
      * Add mask to switch how we use the rotary_cos/sin
      
      * Move attr from traits to problem
      
      * Move has_mask to fmha_fwd_appendkv_args
      
      * Support use uint32_t as SAD operand in Alibi<>
      
      * Use sad_u32() in splitkv kernels
      
      * Store tensor views in PageBlockNavigator
      
      * Use stored tensor view to update tile windows
      
      * Enlarge tensor view size
      
      * Remove debug code
      
      * Fix wrong tensor view size
      
      * Wrap tensor view into PageBlockNavigator
      
      * Add DataType member to PageBlockNavigator
      
      * Remove unnecessary member functions
      
      * Refine macro use
      
      * Fix typo
      
      * Add blank line between directives and actual code
      
      * Re-format files
      
      * Remove type in comment
      
      ---------
      Co-authored-by: carlushuang <carlus.huang@amd.com>
      Co-authored-by: rocking <ChunYu.Lai@amd.com>
      c1569892
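
Several messages in the commit above refer to `page_block_size`, a `block_table`, and cross-page-block writes when appending new K/V rows. The sketch below illustrates the general paged-KV-cache addressing idea in plain Python. It does not reproduce the kernel's `PageBlockNavigator`; the function names, the list-of-lists cache layout, and the tiny page size are hypothetical choices made for readability.

```python
def locate(block_table, page_block_size, token_idx):
    """Map a logical token index to (physical page-block id, offset in that block)."""
    logical_block, offset = divmod(token_idx, page_block_size)
    return block_table[logical_block], offset

def append_rows(cache, block_table, page_block_size, start, new_rows):
    """Write new_rows at logical positions start, start+1, ...

    A run of rows may cross from one page-block into the next; each row is
    resolved through the block table independently, so that case needs no
    special handling here.
    """
    for i, row in enumerate(new_rows):
        page, offset = locate(block_table, page_block_size, start + i)
        cache[page][offset] = row

# Hypothetical setup: 3 physical page-blocks of 4 rows each.
page_block_size = 4
cache = [[None] * page_block_size for _ in range(3)]
# Logical block 0 lives in physical block 2, logical block 1 in physical block 0.
block_table = [2, 0]

# Appending two rows at logical position 3 crosses a page-block boundary.
append_rows(cache, block_table, page_block_size, start=3, new_rows=["k3", "k4"])
```

After the call, `"k3"` sits in the last slot of physical block 2 and `"k4"` in the first slot of physical block 0.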
  5. 28 May, 2024 1 commit
    • [CK_TILE] support group from cmdline (#1295) · 5055b3bd
      carlushuang authored
      * support cmdline seqlen decode
      
      * silent print
      
      * update readme
      
      * update kernel launch 3d
      
      * update tile partitioner
      
      * fix spill for bf16
      
      * modify based on comment
      
      * modify payload_t
      
      * fix bug for alibi mode
      
      * fix alibi test err
      
      * refactor kernel launch, support select timer
      
      * add missing file
      
      * remove useless code
      
      * add some comments
      5055b3bd
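
The commit above adds decoding of per-batch sequence lengths from the command line for group mode. A minimal sketch of that kind of decoding follows, assuming a comma-separated list of lengths and a cumulative "sequence start" array for addressing variable-length batches packed back to back. Both the input syntax and the helper names are assumptions for illustration, not the example's actual option format.

```python
from itertools import accumulate

def decode_seqlens(arg):
    """Parse a comma-separated string such as "3,5,2" into a list of ints."""
    return [int(tok) for tok in arg.split(",") if tok]

def to_seqstart(seqlens):
    """Prefix sums with a leading 0: batch i spans [seqstart[i], seqstart[i+1])."""
    return [0, *accumulate(seqlens)]

seqlens = decode_seqlens("3,5,2")
seqstart = to_seqstart(seqlens)
```

Storing only the start offsets lets each batch locate its rows in a packed tensor without padding every sequence to the longest length.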
  6. 16 Apr, 2024 1 commit
    • introducing ck_tile! (#1216) · db376dd8
      carlushuang authored
      * enable gfx940
      
      * switch between intrinsic mfma routines on mi100/200 and mi300
      
      * fix mfma_int8 on MI300
      
      * disable 2 int8 examples on MI300
      
      * Update cmake-ck-dev.sh
      
      * restore gitignore file
      
      * modify Jenkinsfile to the internal repo
      
      * Bump rocm-docs-core from 0.24.0 to 0.29.0 in /docs/sphinx
      
      Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.0 to 0.29.0.
      - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
      - [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
      - [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.0...v0.29.0)
      
      ---
      updated-dependencies:
      - dependency-name: rocm-docs-core
        dependency-type: direct:production
        update-type: version-update:semver-minor
      ...
      Signed-off-by: dependabot[bot] <support@github.com>
      
      * initial enablement of gfx950
      
      * fix clang format
      
      * disable examples 31 and 41 int8 on gfx950
      
      * add code
      
      * fix build wip
      
      * fix xx
      
      * now can build
      
      * naming
      
      * minor fix
      
      * wip fix
      
      * fix macro for exp2; fix warpgemm a/b in transposedC
      
      * unify as tuple_array
      
      * Update the required Python version to 3.9
      
      * Update executable name in test scripts
      
      * re-structure tuple/array to avoid spill
      
      * Merge function templates
      
      * Fix format
      
      * Add constraint to array<> ctor
      
      * Re-use function
      
      * Some minor changes
      
      * remove wrong code in store_raw()
      
      * fix compile issue in transpose
      
      * Rename enum
      Rename 'cood_transform_enum' to 'coord_transform_enum'
      
      * let more integral_constant->constant, and formatting
      
      * make sure thread_buffer can be tuple/array
      
      * temp fix buffer_store spill
      
      * not using custom data type by default, now we can have ISA-level same code as opt_padding
      
      * fix compile error, fp8 not ready now
      
      * fix fp8 duplicated move/shift/and/or problem
      
      * Default use CK_TILE_FLOAT_TO_FP8_STOCHASTIC rounding mode
      
      * fix scratch in fp8 kernel
      
      * update some readme
      
      * fix merge from upstream
      
      * sync with upstream
      
      * sync upstream again
      
      * sync 22
      
      * remove unused
      
      * fix clang-format
      
      * update README of ck_tile example
      
      * fix several issues
      
      * let python version to be 3.8 as minimal
      
      * remove ck_tile example from default cmake target like all/install/check
      
      * remove mistake
      
      * 1) support recipe in generate.py 2) use simplified mask type 3) change left/right to pass into karg
      
      * fix some bugs in group-mode masking and codegen. update README
      
      * F8 quantization for FMHA forward (#1224)
      
      * Add SAccElementFunction, PComputeElementFunction, OAccElementFunction in pipeline
      
      * Add element function to fmha api
      
      * Adjust P elementwise function
      
      * Fix bug of elementwise op, our elementwise op is not inout
      
      * Add some elementwise op, prepare to quantization
      
      * Let generate.py generate different elementwise functions
      
      * To prevent compiler issue, remove the elementwise function we have not used.
      
      * Remove f8 pipeline, we should share the same pipeline even in f8
      
      * Remove remove_cvref_t
      
      * Avoid warning
      
      * Fix wrong fp8 QK/KV block gemm setting
      
      * Check fp8 rounding error in check_err()
      
      * Set fp8 rounding error for check_err()
      
      * Use CK_TILE_FLOAT_TO_FP8_STANDARD as default fp8 rounding mode
      
      * 1. codegen the f8 api and kernel
      2. f8 host code
      
      * prevent warning in filter mode
      
      * Remove not-in-use elementwise function kargs
      
      * Remove more not-in-use elementwise function kargs
      
      * Small refinements in C++ source files
      
      * Use conditional_t<> to simplify code
      
      * Support heterogeneous argument for binary function types
      
      * Re-use already-existing scales<> functor template
      
      * Fix wrong value produced by saturating
      
      * Generalize the composes<> template
      
      * Unify saturates<> implementation
      
      * Fix type errors in composes<>
      
      * Extend less_equal<>
      
      * Reuse the existing template less_equal<> in check_err()
      
      * Add equal<float> & equal<double>
      
      * Rename check_err() parameter
      
      * Rename check_err() parameter
      
      * Add FIXME comment for adding new macro in future
      
      * Remove unnecessary cast to void
      
      * Eliminate duplicated code
      
      * Avoid dividing api pool into more than 2 groups
      
      * Use more clear variable names
      
      * Use affirmative condition in if stmt
      
      * Remove blank lines
      
      * Do not use perfect forwarding in composes<>
      
      * To fix compile error, revert generate.py back to 4439cc107dd90302d68a6494bdd33113318709f8
      
      * Fix bug of p element function
      
      * Add compute element op to host softmax
      
      * Remove element function in api interface
      
      * Extract user parameter
      
      * Rename pscale and oscale variable
      
      * rename f8 to fp8
      
      * rename more f8 to fp8
      
      * Add pipeline::operator() without element_functor
      
      * 1. Remove deprecated pipeline enum
      2. Refine host code parameter
      
      * Use quantization range as input
      
      * 1. Rename max_dtype to dtype_max.
      2. Rename scale to scale_s
      3. Add init description
      
      * Refine description
      
      * prevent early return
      
      * unify _squant kernel name in cpp, update README
      
      * Adjust the default range.
      
      * Refine error message and bias range
      
      * Add fp8 benchmark and smoke test
      
      * fix fp8 swizzle_factor=4 case
      
      ---------
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
      Co-authored-by: carlushuang <carlus.huang@amd.com>
      
      ---------
      Signed-off-by: dependabot[bot] <support@github.com>
      Co-authored-by: illsilin <Illia.Silin@amd.com>
      Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
      Co-authored-by: Jing Zhang <jizha@amd.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
      Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
      Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
      Co-authored-by: rocking <ChunYu.Lai@amd.com>
      db376dd8