1. 16 Oct, 2024 1 commit
    • Qianfeng's avatar
      [CK_TILE] Improve headdim96 performance for fmha-bwd (#1573) · 14c3cfb1
      Qianfeng authored
      
      
      * Add kQKHeaddimForGemmN and kVHeaddimForGemmN in order to support headdim 96
      
      * Remove the using of MakeKRegBlockDescriptor and MakeVRegBlockDescriptor
      
      * Fix in bwd_piple_default_policy
      
      * Remove kQKHeaddim and rename kQKHeaddimForGemmN to kQKHeaddim in the bwd kernel and pipelines
      
      * Replace kVHeaddimForGemmN by kVHeaddim and kDoDvHeaddim
      
      * Update to hd96 tile settings
      
      * Add smoke test scripts for fmha-bwd hd96
      
      * Revert "Add smoke test scripts for fmha-bwd hd96"
      
      This reverts commit 7ca7e1a93dc65eb99ce3ff4e82693589830e42a2.
      
      * Remove hd96 tile settings in fmha_bwd codegen to save compiling
      
      * Fix lost code line in bwd_pipeline_default_policy
      
      * Merge kDoDvHeaddim/kPadHeadDimDoDv to kVHeaddim/kPadHeadDimV and remove TileFmhaBwdTraits
      
      * Rename KRegSliceBlockDescriptor/VRegSliceBlockDescriptor to KRegBlockDescriptor/VRegBlockDescriptor
      
      * tiny adjustments
      
      ---------
      Co-authored-by: default avatarPo Yen Chen <PoYen.Chen@amd.com>
      Co-authored-by: default avatardanyao12 <Dan.Yao@amd.com>
      14c3cfb1
  2. 15 Oct, 2024 2 commits
  3. 14 Oct, 2024 3 commits
  4. 12 Oct, 2024 1 commit
  5. 10 Oct, 2024 1 commit
    • Thomas Ning's avatar
      Ck tile gemm cshuffle & CK Tile GEMM restructure (#1535) · 6f27bc98
      Thomas Ning authored
      
      
      * ake the cshuffle compilable
      
      * modify Mhe reference on gpu and cpu. Correaccess of cshuffle
      
      * fix the cpu reference code
      
      * Complete the in tile shuffle logic
      
      * restructure the kernel template input
      
      * change the naming pattern of ck_tile gemm pipeline
      
      * Re-format files using remod.py
      
      * Solve the fmha conflict with gemm
      
      * Comment Addressed from Carlus
      
      ---------
      Co-authored-by: default avatarPo Yen, Chen <PoYen.Chen@amd.com>
      6f27bc98
  6. 09 Oct, 2024 1 commit
  7. 08 Oct, 2024 2 commits
  8. 07 Oct, 2024 3 commits
  9. 04 Oct, 2024 2 commits
  10. 02 Oct, 2024 1 commit
  11. 01 Oct, 2024 2 commits
  12. 27 Sep, 2024 1 commit
  13. 26 Sep, 2024 1 commit
  14. 25 Sep, 2024 1 commit
  15. 22 Sep, 2024 1 commit
  16. 20 Sep, 2024 2 commits
  17. 18 Sep, 2024 1 commit
  18. 14 Sep, 2024 1 commit
  19. 13 Sep, 2024 1 commit
    • Jun Liu's avatar
      Customize filesystem in CK for legacy systems (#1509) · 81bc1496
      Jun Liu authored
      
      
      * Legacy support: customized filesystem
      
      * Update cmakefile for python alternative path
      
      * fix build issues
      
      * CK has no boost dependency
      
      * More fixes to issues found on legay systems
      
      * fix clang format issue
      
      * Check if blob is correctly generated in cmake
      
      * fix the python issues
      
      * add a compiler flag for codegen when using alternative python
      
      * use target_link_options instead of target_compile_options
      
      ---------
      Co-authored-by: default avatarillsilin <Illia.Silin@amd.com>
      81bc1496
  20. 12 Sep, 2024 1 commit
  21. 11 Sep, 2024 2 commits
  22. 10 Sep, 2024 1 commit
  23. 07 Sep, 2024 1 commit
    • Thomas Ning's avatar
      Ck tile gemm example (#1488) · caacd388
      Thomas Ning authored
      
      
      * Checkpoint: Finished with the tile example & kernel verification, working on the different matrix layout
      
      * Finished the Matrix Layout feature set up. Note: Need to modify the inner block to solve the shuffle problem in the future.
      
      * Fix: Clang Format, API fixed from fmha
      
      * fix with better naming convention
      
      * revert back the pipeline code of fmha
      
      * Fixed: Addressed the comments and merge the GEMM shape of GEMM Operator and FMHA Operator to one.
      
      * clang format with the reference_gemm file
      
      * convert the clang format with the remod.py
      
      * Changed the format and variable name of the kernel gemm_shape and partitioner
      
      ---------
      Co-authored-by: default avatarthomasning <thomasning@banff-cyxtera-s70-4.ctr.dcgpu>
      caacd388
  24. 05 Sep, 2024 2 commits
  25. 03 Sep, 2024 1 commit
  26. 02 Sep, 2024 1 commit
  27. 30 Aug, 2024 2 commits
  28. 28 Aug, 2024 1 commit
    • Po Yen Chen's avatar
      [CK_TILE] Add PagedAttention kernels (#1387) · c1569892
      Po Yen Chen authored
      
      
      * Use dictionary to config all the functions
      
      * Add init codegen logic for fmha fwd appendkv
      
      * Call HIP_CHECK_ERROR() macro to get real source info
      
      * Setup meaningfull arguments
      
      * Sync kernel name with the codegen
      
      * Add knew/vnew tensors to the kernel argument
      
      * Fix wrong K values after appending
      
      * Fix vnew append errro
      
      * Extract common logics
      
      * Fix Vnew tile dstr for row major case
      
      * Conditionally add fwd_splitkv API in fmha_fwd example
      
      * Conditionally add call to fmha_fwd_splitkv()
      
      * Remove "EXAMPLE_" prefix of cmake variables
      
      * Regsiter API handlers automatically
      
      * Early return if 0 < s_k_new is not supported
      
      * Show message if we are ignoring option
      
      * Unify CMakeLists.txt coding style
      
      * Set num_splits=1 if split-kv is not supported
      
      * Add length/stride getters for HostTensor
      
      * Add RoPE example utilities
      
      * Add reference_rotary_position_embedding() (not implemented)
      
      * Finish reference_rotary_position_embedding() impl
      
      * Fix typo of HostTensor<>::get_length()
      
      * Fix compilation errors
      
      * Fix wrong answer when interleaved=false
      
      * Fix wrong answer when interleaved=true
      
      * Append K/V in the host verification code
      
      * Simplify K appending logics
      
      * Simplify v_host_ref definition
      
      * Reduce input/output dimensions
      
      * Rename function: add "batched" prefix
      
      * Apply RoPE on host side
      
      * Rename RoPE utility function
      
      * Fix wrong tensor size
      
      * Avoid invoking deprecated method 'find_module'
      
      * Pass RoPE kernel args
      
      * Create Rotary Cos/Sin tile windows in kernel
      
      * Add compute data type alias for RoPE
      
      * Randomly generate seqlen_knew if needed
      
      * Fix seqlen_knew enabling check logic
      
      * Add minimum seqlen_k to generate compliance kvcache
      
      * Fix compilation error in debug mode
      
      * Fix wrong boundaries
      
      * Fix wrong seqlen_k for kvcache
      
      * Rename variables used in distributio encoding
      
      * Fix rotary cos/sin tensor/tile size
      
      * Add constraint to the rotary_dim option
      
      * Remove unused inner namespace
      
      * Add dram distribution for rotary_cos/rotary_sin (interleaved)
      
      * Only apply interleaved RoPE on Knew for now
      
      * Fix wrong thread starting offset
      
      * Instantiate multiple kernels for RoPE approaches
      
      * Clean-up pipeline
      
      * Fix error in RoPE host reference
      
      * Handle RoPE half-rotated logics
      
      * Support 8x rotary_dim under half-rotated RoPE
      
      * Add comment
      
      * Apply elementwise function to the loaded tiles
      
      * Unify parameter/variable naming style
      
      * Remove constness from q_ptr
      
      * Add code blocks for q_tile
      
      * Apply RoPE to q_tile
      
      * Remove debug print code in kernel
      
      * Fix wrong knew/vnew appending positions
      
      * Use better naming for tile indices
      
      * Add make_tile_window() for adding distribution only
      
      * Skip code if # of block is more than needed
      
      * Move thread locating logics into policy
      
      * Remove always true static_assert()
      
      * Rename header
      
      * Rename RotaryEmbeddingEnum
      
      * Extract rotary embedding logic out
      
      * Re-order parameters
      
      * Align naming of some tile size constants
      
      * Rename more tile size constants
      
      * Fix wrong grid size
      
      * Fix wrong shape of knew_host/vnew_host
      
      * Fix wrong index into knew_host/vnew_host
      
      * Fix wrong rotary_cos/rotary_sin memory size for Q
      
      * Extract Q/Knew vector size to helper methods
      
      * Use different rotary_cos/rotary_sin distr for Q/Knew
      
      * Update host/device specifiers
      
      * Fix wrong data type for Q rotary_cos/rotary_sin
      
      * Remove RoPEComputeDataType type alias
      
      * Shift rotary_cos/rotary_sin by cache_seqlen_k
      
      * Add comment for why I just 't' for all padding flags
      
      * Align commit message to the real comment
      
      * Fix wrong pipeline
      
      * Rename utility function
      
      * Disable host verification if API not exist
      
      * Fix wrong rope key for fp8 pipeline
      
      * Allow only apply RoPE on Q (without append KV)
      
      * Add append-kv smoke tests
      
      * Remove debug statements
      
      * Remove more debug statements
      
      * Re-arrange the 'set +x' command
      
      * Remove no-longer used method in pipeline
      
      * Add missing init code
      
      * Refine pipeline padding settings
      
      * Enlarge rotary_dim limit (8 -> 16)
      
      * Enlarge KPerThread for rotary_interleaved=false
      
      * Update rotary_dim range in smoke_test_fwd.sh
      
      * Add template argument 'kIsPagedKV' for splitkv kernels
      
      * Launch splitkv kernel if given page_block_size
      
      * Fix wrong kernel name
      
      * Fix seqlen_k_min for pre-fill case (1 -> 0)
      
      * Add copy_const<> type trait
      
      * Add another make_tile_window()
      
      * Introduce 'TileWindowNavigator' types
      
      * Simplify TileWindowNavigator interfaces
      
      * Fix tile window navigation bugs
      
      * Disable calling fmha_fwd()
      
      * Remove ununnecessary data members
      
      * Simplify more make_tile_window() overloads
      
      * Move V tile through TileWindowNavigator
      
      * Fix uneven split checking logic
      
      * Move code after decide seqlen_q/seqlen_k
      
      * Make sure we always start reading complete tile
      
      * Use 128 as minimus page_block_size
      
      * Fix wrong origin for bias
      
      * Add batch_stride_k/batch_stride_v in group mode
      
      * Unify origin
      
      * Add missing kernel arguments for group mode
      
      * Add paged-kv codegen logic for appendkv kernels
      
      * Add block_table kernel args for appendkv kernel
      
      * Add tile navigators to the appendkv kernel
      
      * Fix wrong tensor descriptor lengths
      
      * Pass re-created tile window to pipeline
      
      * Fix wrong strides for appendkv kernel
      
      * Allow transit tile_window to another page-block
      
      * Handle cross-page-block write
      
      * Donot perform write again if already in last page-block
      
      * Always add fmha_fwd() api
      
      * Add missing group mode argument
      
      * Remove debug macro usages
      
      * Rename option s_k_new to s_knew
      
      * Separate splitkv/non-splitkv args/traits
      
      * Remove fmha_fwd_dispatch()
      
      * Fix compilation errors
      
      * Remove dropout code in splitkv kernel
      
      * Allow problem types without define kHasDropout attr
      
      * Use generic lambda to init traits objects
      
      * Separate more non-splitkv & splitkv traits/args
      
      * Display more info for specific kernels
      
      * Show more detailed warning message
      
      * Rename 'max_num_blocks' to 'max_num_page_blocks'
      
      * Remove no-longer used pipeline files
      
      * Wrap code by #if directives
      
      * Move functors to the begining of validation code
      
      * Use generic lambda to init all the api traits/args
      
      * Fix wrong seqlen for kvcache
      
      * Add missing comment
      
      * Rename TileWindowNavigator to PageBlockNavigator
      
      * Only expose necessary methods (not attributes)
      
      * Re-order pipeline paremeters
      
      * Refine smoke_test_fwd.sh
      
      * Fix wrong arugment count
      
      * Make tile window directly via PageBlockNavigator
      
      * Remove unused template paremeter
      
      * Remove group mode from appendkv kernel
      
      * Fix skcheck logic
      
      * Fix wrong syntax in skcheck expr
      
      * Use meaningful options in smoke test
      
      * Remove options
      
      * Fix formatting
      
      * Fix more format
      
      * Re-organize bash functions
      
      * Pass cache_batch_idx to kernels
      
      * Support cache_batch_idx in example
      
      * Fix compilation error
      
      * Add more appendkv test
      
      * Add more case for appendkv
      
      * Fix unexisted attribute
      
      * Remove 0 < seqlen_knew constraint
      
      * Clarify the case in warning message
      
      * Remove macro checking
      
      * Force batch mode when invoking appendkv & splitkv apis
      
      * Fix mode overriding logics
      
      * Fix wrong parameter name
      
      * Randomize seqlen_k if use kvcache
      
      * Use randomized seqlen_k for kvcache
      
      * Avoid using too small rotary_cos & rotary_sin
      
      * Rename parameter
      
      * Add seqlen_q & seqlen_k rules
      
      * Add comment
      
      * Add more comments
      
      * Fix compilation errors
      
      * Fix typo in comment
      
      * Remove type argument
      
      * Avoid seqlen_k=0 for kvcache
      
      * Revert "Avoid seqlen_k=0 for kvcache"
      
      This reverts commit 21c4df89e416182e8e9bc78e67bd4b98dbb6c88d.
      
      * Fix wrong uneven split checking logics
      
      * Only randomize kvcache seqlen_k if 1 < batch
      
      * Return earlier if split is empty
      
      * Revert "Only randomize kvcache seqlen_k if 1 < batch"
      
      This reverts commit b9a4ab0d7e3c2beecc0fccafd2a13259dd06299c.
      
      * Re-order seqlen_k_start adjustment logics
      
      * Fix compilation errors
      
      * Re-format script
      
      * Find executable from folder automatically
      
      * Fix kvcache seqlen_k generating logic
      
      * Make comment more clear
      
      * Fix wrong knew/vew appending logic on host
      
      * Add s_barrier to sync threads
      
      * Revert "Add s_barrier to sync threads"
      
      This reverts commit d3f550f30c0a4d9df15c613015d5dff268d6746d.
      
      * Support only using 1 row of rotary_cos/rotary_sin
      
      * Rotate Q in different way
      
      * Unify tensor view creation logics
      
      * Fix wrong argument
      
      * Add mask to switch how we use the rotary_cos/sin
      
      * Move attr from traits to problem
      
      * Move has_mask to fmha_fwd_appendkv_args
      
      * Support use uint32_t as SAD operand in Alibi<>
      
      * Use sad_u32() in splitkv kernels
      
      * Store tensor views in PageBlockNavigator
      
      * Use stored tensor view to update tile windows
      
      * Enlarge tensor view size
      
      * Remove debug code
      
      * Fix wrong tensor view size
      
      * Wrap tensor view into PageBlockNavigator
      
      * Add DataType member to PageBlockNavigator
      
      * Remove unnecessary member functions
      
      * Refind macro use
      
      * Fix typo
      
      * Add blank line between directives and actual code
      
      * Re-format files
      
      * Remove type in comment
      
      ---------
      Co-authored-by: default avatarcarlushuang <carlus.huang@amd.com>
      Co-authored-by: default avatarrocking <ChunYu.Lai@amd.com>
      c1569892