1. 10 Oct, 2024 1 commit
    • Ck tile gemm cshuffle & CK Tile GEMM restructure (#1535) · 6f27bc98
      Thomas Ning authored
      
      
      * Make the cshuffle compilable
      
      * modify the reference on GPU and CPU; correct the access of cshuffle
      
      * fix the cpu reference code
      
      * Complete the in-tile shuffle logic
      
      * restructure the kernel template input
      
      * change the naming pattern of ck_tile gemm pipeline
      
      * Re-format files using remod.py
      
      * Solve the fmha conflict with gemm
      
      * Address comments from Carlus
      
      ---------
      Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
      6f27bc98
  2. 08 Oct, 2024 1 commit
  3. 07 Oct, 2024 1 commit
  4. 04 Oct, 2024 1 commit
    • Adding seed and offset pointer support to the philox random number generator. (#1523) · c24fae23
      kylasa authored
      
      
      * Adding seed and offset pointer support to the philox random number generator.
      
      * Separating seed and offset pointer checks with different condition statements.
      
      * Changes include adding support for device seed and offset pointers; a union is used to store seed/offset values and device pointers to minimize device SGPRs (see the sketch after this entry).
      
      * Correcting a typo in the readme file
      
      * Re-format files using remod.py
      
      * Use STL type for API parameters
      
      * Use simpler struct design for drop_seed & drop_offset
      
      * Undo unnecessary changes
      
      * Sync kargs style for fmha_fwd.hpp/.cpp
      
      * Use templated union to reduce code
      
      * Use structured binding to make code more readable
      
      ---------
      Co-authored-by: Sudhir Kylasa <sukylasa@amd.com>
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
      c24fae23
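      The union-based seed/offset handling above can be shown with a small, self-contained sketch. This is not the actual CK Tile definition; the name value_or_pointer and its members are hypothetical and only illustrate the "immediate value or device pointer in one argument slot" idea:

      ```cpp
      #include <cstdint>

      // Illustrative only: hold either an immediate value or a pointer to it in a
      // single kernel-argument slot, so the same registers serve both cases.
      template <typename T>
      struct value_or_pointer
      {
          bool is_pointer; // true -> dereference 'pointer' inside the kernel
          union
          {
              T value;          // host-provided immediate (e.g. philox seed or offset)
              const T* pointer; // device-side pointer filled in at launch time
          };

          T get() const { return is_pointer ? *pointer : value; }
      };

      int main()
      {
          value_or_pointer<uint64_t> seed;
          seed.is_pointer = false;
          seed.value      = 0x1234;
          return seed.get() == 0x1234 ? 0 : 1;
      }
      ```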
  5. 01 Oct, 2024 2 commits
  6. 27 Sep, 2024 1 commit
  7. 26 Sep, 2024 1 commit
  8. 25 Sep, 2024 1 commit
  9. 22 Sep, 2024 1 commit
  10. 18 Sep, 2024 1 commit
  11. 14 Sep, 2024 1 commit
  12. 10 Sep, 2024 1 commit
  13. 07 Sep, 2024 1 commit
    • Ck tile gemm example (#1488) · caacd388
      Thomas Ning authored
      
      
      * Checkpoint: Finished with the tile example & kernel verification, working on the different matrix layout
      
      * Finished the Matrix Layout feature set up. Note: Need to modify the inner block to solve the shuffle problem in the future.
      
      * Fix: Clang Format, API fixed from fmha
      
      * fix with better naming convention
      
      * revert back the pipeline code of fmha
      
      * Fixed: Addressed the comments and merged the GEMM shape of the GEMM operator and the FMHA operator into one (the shared shape/partitioner mapping is sketched after this entry).
      
      * clang format with the reference_gemm file
      
      * apply the clang-format conversion with remod.py
      
      * Changed the format and variable name of the kernel gemm_shape and partitioner
      
      ---------
      Co-authored-by: thomasning <thomasning@banff-cyxtera-s70-4.ctr.dcgpu>
      caacd388
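      The shared GEMM shape and tile partitioner referenced above reduce to mapping an M x N output onto a grid of MPerBlock x NPerBlock tiles. The sketch below is illustrative only; TileGemmShape and grid_size are hypothetical names, not the repository's types:

      ```cpp
      #include <cstdio>

      // Hypothetical tile-size bundle shared by GEMM-style kernels.
      struct TileGemmShape
      {
          int MPerBlock;
          int NPerBlock;
          int KPerBlock;
      };

      // Number of workgroups a 2D tile partitioner would launch for an M x N output.
      int grid_size(int M, int N, const TileGemmShape& shape)
      {
          const int m_blocks = (M + shape.MPerBlock - 1) / shape.MPerBlock; // ceil(M / MPerBlock)
          const int n_blocks = (N + shape.NPerBlock - 1) / shape.NPerBlock; // ceil(N / NPerBlock)
          return m_blocks * n_blocks;
      }

      int main()
      {
          const TileGemmShape shape{128, 128, 32};
          std::printf("grid = %d\n", grid_size(4096, 4096, shape)); // 32 * 32 = 1024 workgroups
          return 0;
      }
      ```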
  14. 28 Aug, 2024 1 commit
    • [CK_TILE] Add PagedAttention kernels (#1387) · c1569892
      Po Yen Chen authored
      
      
      * Use dictionary to config all the functions
      
      * Add init codegen logic for fmha fwd appendkv
      
      * Call HIP_CHECK_ERROR() macro to get real source info
      
      * Set up meaningful arguments
      
      * Sync kernel name with the codegen
      
      * Add knew/vnew tensors to the kernel argument
      
      * Fix wrong K values after appending
      
      * Fix vnew append error
      
      * Extract common logics
      
      * Fix Vnew tile dstr for row major case
      
      * Conditionally add fwd_splitkv API in fmha_fwd example
      
      * Conditionally add call to fmha_fwd_splitkv()
      
      * Remove "EXAMPLE_" prefix of cmake variables
      
      * Register API handlers automatically
      
      * Early return if 0 < s_k_new is not supported
      
      * Show message if we are ignoring option
      
      * Unify CMakeLists.txt coding style
      
      * Set num_splits=1 if split-kv is not supported
      
      * Add length/stride getters for HostTensor
      
      * Add RoPE example utilities
      
      * Add reference_rotary_position_embedding() (not implemented)
      
      * Finish reference_rotary_position_embedding() impl
      
      * Fix typo of HostTensor<>::get_length()
      
      * Fix compilation errors
      
      * Fix wrong answer when interleaved=false
      
      * Fix wrong answer when interleaved=true
      
      * Append K/V in the host verification code
      
      * Simplify K appending logics
      
      * Simplify v_host_ref definition
      
      * Reduce input/output dimensions
      
      * Rename function: add "batched" prefix
      
      * Apply RoPE on host side
      
      * Rename RoPE utility function
      
      * Fix wrong tensor size
      
      * Avoid invoking deprecated method 'find_module'
      
      * Pass RoPE kernel args
      
      * Create Rotary Cos/Sin tile windows in kernel
      
      * Add compute data type alias for RoPE
      
      * Randomly generate seqlen_knew if needed
      
      * Fix seqlen_knew enabling check logic
      
      * Add minimum seqlen_k to generate a compliant kvcache
      
      * Fix compilation error in debug mode
      
      * Fix wrong boundaries
      
      * Fix wrong seqlen_k for kvcache
      
      * Rename variables used in distribution encoding
      
      * Fix rotary cos/sin tensor/tile size
      
      * Add constraint to the rotary_dim option
      
      * Remove unused inner namespace
      
      * Add dram distribution for rotary_cos/rotary_sin (interleaved)
      
      * Only apply interleaved RoPE on Knew for now
      
      * Fix wrong thread starting offset
      
      * Instantiate multiple kernels for RoPE approaches
      
      * Clean-up pipeline
      
      * Fix error in RoPE host reference
      
      * Handle RoPE half-rotated logics
      
      * Support 8x rotary_dim under half-rotated RoPE
      
      * Add comment
      
      * Apply elementwise function to the loaded tiles
      
      * Unify parameter/variable naming style
      
      * Remove constness from q_ptr
      
      * Add code blocks for q_tile
      
      * Apply RoPE to q_tile
      
      * Remove debug print code in kernel
      
      * Fix wrong knew/vnew appending positions
      
      * Use better naming for tile indices
      
      * Add make_tile_window() for adding distribution only
      
      * Skip code if # of blocks is more than needed
      
      * Move thread locating logics into policy
      
      * Remove always true static_assert()
      
      * Rename header
      
      * Rename RotaryEmbeddingEnum
      
      * Extract rotary embedding logic out
      
      * Re-order parameters
      
      * Align naming of some tile size constants
      
      * Rename more tile size constants
      
      * Fix wrong grid size
      
      * Fix wrong shape of knew_host/vnew_host
      
      * Fix wrong index into knew_host/vnew_host
      
      * Fix wrong rotary_cos/rotary_sin memory size for Q
      
      * Extract Q/Knew vector size to helper methods
      
      * Use different rotary_cos/rotary_sin distr for Q/Knew
      
      * Update host/device specifiers
      
      * Fix wrong data type for Q rotary_cos/rotary_sin
      
      * Remove RoPEComputeDataType type alias
      
      * Shift rotary_cos/rotary_sin by cache_seqlen_k
      
      * Add comment for why I just use 't' for all padding flags
      
      * Align commit message to the real comment
      
      * Fix wrong pipeline
      
      * Rename utility function
      
      * Disable host verification if the API does not exist
      
      * Fix wrong rope key for fp8 pipeline
      
      * Allow applying RoPE only on Q (without appending KV)
      
      * Add append-kv smoke tests
      
      * Remove debug statements
      
      * Remove more debug statements
      
      * Re-arrange the 'set +x' command
      
      * Remove no-longer used method in pipeline
      
      * Add missing init code
      
      * Refine pipeline padding settings
      
      * Enlarge rotary_dim limit (8 -> 16)
      
      * Enlarge KPerThread for rotary_interleaved=false
      
      * Update rotary_dim range in smoke_test_fwd.sh
      
      * Add template argument 'kIsPagedKV' for splitkv kernels
      
      * Launch splitkv kernel if given page_block_size
      
      * Fix wrong kernel name
      
      * Fix seqlen_k_min for pre-fill case (1 -> 0)
      
      * Add copy_const<> type trait (a generic sketch follows this entry)
      
      * Add another make_tile_window()
      
      * Introduce 'TileWindowNavigator' types
      
      * Simplify TileWindowNavigator interfaces
      
      * Fix tile window navigation bugs
      
      * Disable calling fmha_fwd()
      
      * Remove unnecessary data members
      
      * Simplify more make_tile_window() overloads
      
      * Move V tile through TileWindowNavigator
      
      * Fix uneven split checking logic
      
      * Move code after decide seqlen_q/seqlen_k
      
      * Make sure we always start reading complete tile
      
      * Use 128 as minimum page_block_size
      
      * Fix wrong origin for bias
      
      * Add batch_stride_k/batch_stride_v in group mode
      
      * Unify origin
      
      * Add missing kernel arguments for group mode
      
      * Add paged-kv codegen logic for appendkv kernels
      
      * Add block_table kernel args for appendkv kernel
      
      * Add tile navigators to the appendkv kernel
      
      * Fix wrong tensor descriptor lengths
      
      * Pass re-created tile window to pipeline
      
      * Fix wrong strides for appendkv kernel
      
      * Allow transit tile_window to another page-block
      
      * Handle cross-page-block write
      
      * Do not perform write again if already in last page-block
      
      * Always add fmha_fwd() api
      
      * Add missing group mode argument
      
      * Remove debug macro usages
      
      * Rename option s_k_new to s_knew
      
      * Separate splitkv/non-splitkv args/traits
      
      * Remove fmha_fwd_dispatch()
      
      * Fix compilation errors
      
      * Remove dropout code in splitkv kernel
      
      * Allow problem types without defining the kHasDropout attr
      
      * Use generic lambda to init traits objects
      
      * Separate more non-splitkv & splitkv traits/args
      
      * Display more info for specific kernels
      
      * Show more detailed warning message
      
      * Rename 'max_num_blocks' to 'max_num_page_blocks'
      
      * Remove no-longer used pipeline files
      
      * Wrap code by #if directives
      
      * Move functors to the beginning of validation code
      
      * Use generic lambda to init all the api traits/args
      
      * Fix wrong seqlen for kvcache
      
      * Add missing comment
      
      * Rename TileWindowNavigator to PageBlockNavigator
      
      * Only expose necessary methods (not attributes)
      
      * Re-order pipeline parameters
      
      * Refine smoke_test_fwd.sh
      
      * Fix wrong argument count
      
      * Make tile window directly via PageBlockNavigator
      
      * Remove unused template parameter
      
      * Remove group mode from appendkv kernel
      
      * Fix skcheck logic
      
      * Fix wrong syntax in skcheck expr
      
      * Use meaningful options in smoke test
      
      * Remove options
      
      * Fix formatting
      
      * Fix more format
      
      * Re-organize bash functions
      
      * Pass cache_batch_idx to kernels
      
      * Support cache_batch_idx in example
      
      * Fix compilation error
      
      * Add more appendkv test
      
      * Add more case for appendkv
      
      * Fix nonexistent attribute
      
      * Remove 0 < seqlen_knew constraint
      
      * Clarify the case in warning message
      
      * Remove macro checking
      
      * Force batch mode when invoking appendkv & splitkv apis
      
      * Fix mode overriding logics
      
      * Fix wrong parameter name
      
      * Randomize seqlen_k if use kvcache
      
      * Use randomized seqlen_k for kvcache
      
      * Avoid using too small rotary_cos & rotary_sin
      
      * Rename parameter
      
      * Add seqlen_q & seqlen_k rules
      
      * Add comment
      
      * Add more comments
      
      * Fix compilation errors
      
      * Fix typo in comment
      
      * Remove type argument
      
      * Avoid seqlen_k=0 for kvcache
      
      * Revert "Avoid seqlen_k=0 for kvcache"
      
      This reverts commit 21c4df89e416182e8e9bc78e67bd4b98dbb6c88d.
      
      * Fix wrong uneven split checking logics
      
      * Only randomize kvcache seqlen_k if 1 < batch
      
      * Return earlier if split is empty
      
      * Revert "Only randomize kvcache seqlen_k if 1 < batch"
      
      This reverts commit b9a4ab0d7e3c2beecc0fccafd2a13259dd06299c.
      
      * Re-order seqlen_k_start adjustment logics
      
      * Fix compilation errors
      
      * Re-format script
      
      * Find executable from folder automatically
      
      * Fix kvcache seqlen_k generating logic
      
      * Make comment more clear
      
      * Fix wrong knew/vnew appending logic on host
      
      * Add s_barrier to sync threads
      
      * Revert "Add s_barrier to sync threads"
      
      This reverts commit d3f550f30c0a4d9df15c613015d5dff268d6746d.
      
      * Support only using 1 row of rotary_cos/rotary_sin
      
      * Rotate Q in different way
      
      * Unify tensor view creation logics
      
      * Fix wrong argument
      
      * Add mask to switch how we use the rotary_cos/sin
      
      * Move attr from traits to problem
      
      * Move has_mask to fmha_fwd_appendkv_args
      
      * Support use uint32_t as SAD operand in Alibi<>
      
      * Use sad_u32() in splitkv kernels
      
      * Store tensor views in PageBlockNavigator
      
      * Use stored tensor view to update tile windows
      
      * Enlarge tensor view size
      
      * Remove debug code
      
      * Fix wrong tensor view size
      
      * Wrap tensor view into PageBlockNavigator
      
      * Add DataType member to PageBlockNavigator
      
      * Remove unnecessary member functions
      
      * Refine macro use
      
      * Fix typo
      
      * Add blank line between directives and actual code
      
      * Re-format files
      
      * Remove type in comment
      
      ---------
      Co-authored-by: carlushuang <carlus.huang@amd.com>
      Co-authored-by: rocking <ChunYu.Lai@amd.com>
      c1569892
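      The copy_const<> trait mentioned in the commit list is a small metaprogramming helper. A generic sketch of the idea, propagating the const qualifier of one type onto another (this is not the CK Tile implementation):

      ```cpp
      #include <type_traits>

      // Generic sketch: copy the const qualifier of 'From' onto 'To'.
      template <typename From, typename To>
      struct copy_const
      {
          using type = std::conditional_t<std::is_const_v<From>, std::add_const_t<To>, To>;
      };

      template <typename From, typename To>
      using copy_const_t = typename copy_const<From, To>::type;

      static_assert(std::is_same_v<copy_const_t<const int, float>, const float>);
      static_assert(std::is_same_v<copy_const_t<int, float>, float>);

      int main() { return 0; }
      ```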
  15. 16 Aug, 2024 1 commit
    • [CK_TILE] FA bwd kernels optimization (#1397) · 79a5d9c1
      Dan Yao authored
      
      
      * tmp save
      
      * fix batch deterministic bugs
      
      * fix group deterministic bugs
      
      * codegen update
      
      * reorder files
      
      * bias support
      
      * hd256 bias support
      
      * bwd smoke test update
      
      * simplify convert dq
      
      * fix hd256 dropout scratch
      
      * do{}while() -> while(){}
      
      * comments
      
      * remove FmhaBwdTilePartitioner
      
      * save clear_tile
      
      * refactor dropout
      
      * code cleanup
      
      * code cleanup
      
      * comments
      
      * fix epilogue problem
      
      * fix fwd dropout
      
      * group convert_dq opt
      
      * fix dq alignment
      
      * Do not store storerandval in bwd for flash attention integration
      
      * fix hd32 error and boost performance
      
      * revert
      
      * Remove duplicated WarpGemm definitions in the policy file
      
      * dropout patch for mrepeat 16*16
      
      * code sync up
      
      * dq_acc stride
      
      * dq_acc stride stuff
      
      * codegen update
      
      * fwd dropout revert
      
      * fix hd128 scratches and boost performance
      
      * receipt 3 for simplified smoke test
      
      * more strides for fa integration
      
      * fix hd64 scratches and boost performance
      
      * non-iglp pipeline for headdim padding cases
      
      * dpad same as dvpad for flash attention integration
      
      * unpadded lse&d for group mode
      
      * Support unpad layout for group lse
      
      * Support unpad lse layout for splitkv
      
      * Fix stride for splitkv kernel
      
      * fix unpadded lse issue in fwd splitkv
      
      * comment
      
      * solve lds read&write conflicts
      
      * rename
      
      * bias rename
      
      * tile index revert
      
      ---------
      
      Co-authored-by: danyao12 <danyao12>
      Co-authored-by: rocking <ChunYu.Lai@amd.com>
      Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
      79a5d9c1
  16. 31 Jul, 2024 1 commit
  17. 08 Jul, 2024 1 commit
    • [CK_TILE] wa prec, remove sgpr offset for inline asm (#1356) · 8182976c
      carlushuang authored
      
      
      * wa prec, remove sgpr offset for inline asm
      
      * macro for set tile
      
      * ignore unused param if no kernel instances in host API
      
      * fix more prec issue
      
      * cache buffer resource
      
      * fix
      
      * support pre-nop
      
      * clear tile by vector type members
      
      * add workaround to reduce scratch memory
      
      * conditionally enable workaround code
      
      * enable workaround start from certain build version
      
      * fallback set_tile() implementation from certain build version
      
      * undo template argument changes
      
      * put dummy asm in load_raw() (the generic idiom is sketched after this entry)
      
      * fix comments, refactor s_nop inside buffer_load
      
      ---------
      Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
      8182976c
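      A "dummy asm" statement is a common way to keep the compiler from rescheduling memory accesses around a point in the code. The sketch below shows only the generic empty-asm/memory-clobber idiom; it is not the actual load_raw() workaround:

      ```cpp
      // Generic compiler-scheduling barrier: an empty asm statement with a "memory"
      // clobber that loads and stores may not be reordered across.
      inline void compiler_memory_barrier()
      {
      #if defined(__GNUC__) || defined(__clang__)
          __asm__ __volatile__("" ::: "memory");
      #endif
      }

      int main()
      {
          volatile int x = 1;
          int a = x;                 // this read must stay before the barrier
          compiler_memory_barrier();
          return a - 1;
      }
      ```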
  18. 26 Jun, 2024 1 commit
    • [CK_TILE] fmha forward split-kv + combine kernels (#1338) · 0cb2e06d
      Po Yen Chen authored
      
      
      * FA fwd dropout
      
      * FA bwd
      
      * epilogue reuse
      
      * CMakeLists update
      
      * [CK_TILE] support alibi (#1269)
      
      * add alibi support
      
      * fix code
      
      * update code based on comment
      
      * Support more hdim
      
      * fix fp8 bias
      
      * support seqlen_k=0 case
      
      * remove unused printf
      
      * fix format
      
      ---------
      Co-authored-by: rocking <ChunYu.Lai@amd.com>
      
      * now fwd/bwd can build
      
      * bwd alibi
      
      * add bwd validation stream_config
      
      * update generated filenames
      
      * update bwd kernel launch
      
      * CK_TILE_HOST_DEVICE in philox
      
      * Transpose -> transpose
      
      * format
      
      * format
      
      * format
      
      * Generate the instances required for FA
      
      * format
      
      * fix error in WarpGemm
      
      * Add num_splits option and dummy split-kv api method
      
      * Generate fmha_fwd_splitkv()
      
      * Add SplitKV kernel codegen logics
      
      * Add SplitKV combine kernel codegen logics (the combine math is sketched after this entry)
      
      * Fix mismatched return type
      
      * Clean-up code
      
      * Replace sentinel value before storing
      
      * Fix wrong layout of LSE/LSEacc/Oacc
      
      * Format codes
      
      * Fix o_acc memory error
      
      * Fix wrong kBlockSize used in policy
      
      * Reduce # of combine kernels
      
      * Fix split-kv combine kernel name
      
      * Fix wrong LDS indexing logics
      
      * Fix wrong loop counter step logic
      
      * Undo vector size changes
      
      * Remove no-longer used field
      
      * Remove inconsistent comment
      
      * Remove debug statements in example
      
      * Remove more debug statements
      
      * Add constness to local variables
      
      * Clean up generate.py
      
      * Fix unstable clang-format comment
      
      * Remove unused include directive
      
      * Use shorter template parameter name
      
      * Enable non-split-kv blobs
      
      * Update license date
      
      * Print num_splits conditionally
      
      * Undo disabling data types
      
      * Remove unnecessary tile size for fp8
      
      * Fix wrong pipeline args for fp8
      
      * Fix example output format
      
      * Remove more debug code in combine pipeline
      
      * Add stride kernel arguments for LSE/O acc workspace
      
      * Re-order split-kv pipeline call operator arguments
      
      * Pass LSE/O strides in kernel argument
      
      * Re-order pipeline call operator arguments
      
      * Use tensor_descriptor to locate LSEacc elements
      
      * Support providing invalid element for tensor view
      
      * Set invalid element value for LSEacc tensor view
      
      * Remove hand-written store_tile() code
      
      * Remove necessary value-overwrite logic
      
      * Add transposed lds descriptor
      
      * Support load_tile() for tile_window_with_static_lengths<>
      
      * Undo removing necessary value-overwrite logic
      
      * Use read descriptor to locate lds elements
      
      * Simplify pipeline source code
      
      * Add constraint to kMaxSplits
      
      * Default use kMaxSplits=64 in generate.py
      
      * Revert "Add constraint to kMaxSplits"
      
      This reverts commit 0a2132d758042e6fb0292f4e354909b8a4d1c118.
      
      * Revert "Default use kMaxSplits=64 in generate.py"
      
      This reverts commit c7d9c80b77320aec6559222bed7d47adcaefe4e3.
      
      * Decide alignment by the padding parameter
      
      * Remove no-longer used utility functions
      
      * Remove not-working code
      
      * Add comment & remove no-longer used code
      
      * Fix computation errors
      
      * Add heuristic to override num_splits option
      
      * Add constraint to kMaxSplits
      
      * Fix compilation error
      
      * Clean up pipeline code
      
      * Wrap pointer access as lambda function
      
      * Rename confusing methods
      
      * Use kLogMaxSplits as template parameter
      
      * Finish splitkv combine kernel codegen
      
      * Update kMaxSplits limit
      
      * Use smaller kM0 for splitkv combine kernel
      
      * Ignore dropout flag in splitkv pipeline
      
      * Unify flag usage
      
      * Add back flag kStoreLSE
      
      * Merge lambda calls in pipeline
      
      * Fix compilation errors
      
      * Avoid all empty splits
      
      * Always check for empty loop in splitkv pipelines
      
      * Re-order parameters
      
      * Remove redundant p_drop option check
      
      * Add traits/problem for fwd splitkv kernel
      
      * Conditionally enable uneven split boundary checks
      
      * Add comment for the splitkv traits field
      
      * Change even split criteria
      
      * Re-order statements
      
      * Refine occupancy value for hdim=128&256
      
      * Refine occupancy value for hdim=32&64
      
      * Remove redundant kernel argument
      
      * Separate fmha bwd codegen logics
      
      * Separate fmha fwd codegen logics
      
      * Remove redundant direction parameter in fwd&bwd codegen logics
      
      * Support generate multiple APIs for an example
      
      * Let 'api' an alias of 'direction' option
      
      * Remove choices for the 'direction' option
      
      * Use dictionary to config all the functions
      
      * Move fmha splitkv codegen logics to other file
      
      * Add fwd_splitkv api for tile_example_fmha_fwd
      
      ---------
      
      Co-authored-by: danyao12 <danyao12>
      Co-authored-by: carlushuang <carlus.huang@amd.com>
      Co-authored-by: rocking <ChunYu.Lai@amd.com>
      Co-authored-by: Jing Zhang <jizhan@amd.com>
      0cb2e06d
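      The split-kv combine step merges each split's partial output with its log-sum-exp. The following host-side, reference-style sketch shows the standard combine math; it is illustrative and not the CK Tile combine kernel:

      ```cpp
      #include <algorithm>
      #include <cmath>
      #include <cstddef>
      #include <vector>

      // Combine per-split attention results for one output row.
      // o_acc[i]   : partial output of split i (length head_dim)
      // lse_acc[i] : log-sum-exp of the scores handled by split i
      std::vector<float> combine_splits(const std::vector<std::vector<float>>& o_acc,
                                        const std::vector<float>& lse_acc)
      {
          // Global log-sum-exp across splits, stabilised by the running maximum.
          const float lse_max = *std::max_element(lse_acc.begin(), lse_acc.end());
          float sum = 0.f;
          for (float lse : lse_acc)
              sum += std::exp(lse - lse_max);
          const float lse_global = lse_max + std::log(sum);

          // Weight each partial output by exp(lse_i - lse_global) and accumulate.
          std::vector<float> o(o_acc[0].size(), 0.f);
          for (std::size_t i = 0; i < o_acc.size(); ++i)
          {
              const float w = std::exp(lse_acc[i] - lse_global);
              for (std::size_t d = 0; d < o.size(); ++d)
                  o[d] += w * o_acc[i][d];
          }
          return o;
      }

      int main()
      {
          // Two equal-weight splits of the same row should average their outputs.
          const std::vector<std::vector<float>> o_acc{{1.f, 0.f}, {0.f, 1.f}};
          const std::vector<float> lse_acc{0.f, 0.f};
          const auto o = combine_splits(o_acc, lse_acc); // ~{0.5f, 0.5f}
          return (std::fabs(o[0] - 0.5f) < 1e-6f && std::fabs(o[1] - 0.5f) < 1e-6f) ? 0 : 1;
      }
      ```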
  19. 24 Jun, 2024 1 commit
  20. 20 Jun, 2024 2 commits
  21. 19 Jun, 2024 1 commit
  22. 17 Jun, 2024 1 commit
  23. 04 Jun, 2024 1 commit
    • CK Tile FA Training kernels (#1286) · 2cab8d39
      Dan Yao authored
      
      
      * FA fwd dropout
      
      * FA bwd
      
      * epilogue reuse
      
      * CMakeLists update
      
      * [CK_TILE] support alibi (#1269)
      
      * add alibi support (the bias formula is sketched after this entry)
      
      * fix code
      
      * update code based on comment
      
      * Support more hdim
      
      * fix fp8 bias
      
      * support seqlen_k=0 case
      
      * remove unused printf
      
      * fix format
      
      ---------
      Co-authored-by: rocking <ChunYu.Lai@amd.com>
      
      * now fwd/bwd can build
      
      * bwd alibi
      
      * add bwd validation stream_config
      
      * update generated filenames
      
      * update bwd kernel launch
      
      * CK_TILE_HOST_DEVICE in philox
      
      * Transpose -> transpose
      
      * format
      
      * format
      
      * format
      
      * Generate the instances required for FA
      
      * format
      
      * fix error in WarpGemm
      
      ---------
      
      Co-authored-by: danyao12 <danyao12>
      Co-authored-by: carlushuang <carlus.huang@amd.com>
      Co-authored-by: rocking <ChunYu.Lai@amd.com>
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
      Co-authored-by: Jing Zhang <jizhan@amd.com>
      2cab8d39
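      ALiBi, added in the commits above, biases the attention scores with a per-head linear penalty on query/key distance before softmax. A reference-style sketch of the causal form, assuming equal query and key lengths (illustrative, not the CK Tile pipeline code):

      ```cpp
      #include <vector>

      // Causal ALiBi: score(i, j) += -slope * (i - j) for j <= i.
      // 'slope' is the per-head constant; masking and softmax follow afterwards.
      void apply_alibi(std::vector<float>& scores, int seqlen, float slope)
      {
          for (int i = 0; i < seqlen; ++i)
              for (int j = 0; j <= i; ++j)
                  scores[i * seqlen + j] -= slope * static_cast<float>(i - j);
      }

      int main()
      {
          std::vector<float> scores(4 * 4, 0.f);
          apply_alibi(scores, 4, 0.5f); // e.g. scores[3 * 4 + 0] becomes -1.5f
          return 0;
      }
      ```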
  24. 28 May, 2024 1 commit
    • [CK_TILE] support group from cmdline (#1295) · 5055b3bd
      carlushuang authored
      * support cmdline seqlen decode
      
      * silent print
      
      * update readme
      
      * update kernel launch 3d
      
      * update tile partitioner
      
      * fix spill for bf16
      
      * modify based on comment
      
      * modify payload_t
      
      * fix bug for alibi mode
      
      * fix alibi test err
      
      * refactor kernel launch, support select timer
      
      * add missing file
      
      * remove useless code
      
      * add some comments
      5055b3bd
  25. 20 May, 2024 1 commit
  26. 07 May, 2024 1 commit
  27. 22 Apr, 2024 1 commit
  28. 16 Apr, 2024 1 commit
    • introducing ck_tile! (#1216) · db376dd8
      carlushuang authored
      * enable gfx940
      
      * switch between intrinsic mfma routines on mi100/200 and mi300
      
      * fix mfma_int8 on MI300
      
      * disable 2 int8 examples on MI300
      
      * Update cmake-ck-dev.sh
      
      * restore gitignore file
      
      * modify Jenkinsfile to the internal repo
      
      * Bump rocm-docs-core from 0.24.0 to 0.29.0 in /docs/sphinx
      
      Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.0 to 0.29.0.
      - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
      - [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
      - [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.0...v0.29.0)
      
      ---
      updated-dependencies:
      - dependency-name: rocm-docs-core
        dependency-type: direct:production
        update-type: version-update:semver-minor
      ...
      Signed-off-by: dependabot[bot] <support@github.com>
      
      * initial enablement of gfx950
      
      * fix clang format
      
      * disable examples 31 and 41 int8 on gfx950
      
      * add code
      
      * fix build wip
      
      * fix xx
      
      * now can build
      
      * naming
      
      * minor fix
      
      * wip fix
      
      * fix macro for exp2; fix warpgemm a/b in transposedC
      
      * unify as tuple_array
      
      * Update the required Python version to 3.9
      
      * Update executable name in test scripts
      
      * re-structure tuple/array to avoid spill
      
      * Merge function templates
      
      * Fix format
      
      * Add constraint to array<> ctor
      
      * Re-use function
      
      * Some minor changes
      
      * remove wrong code in store_raw()
      
      * fix compile issue in transpose
      
      * Rename enum
      Rename 'cood_transform_enum' to 'coord_transform_enum'
      
      * let more integral_constant->constant, and formatting
      
      * make sure thread_buffer can be tuple/array
      
      * temp fix buffer_store spill
      
      * do not use custom data type by default; now we can have ISA-level identical code to opt_padding
      
      * fix compile error, fp8 not ready now
      
      * fix fp8 duplicated move/shift/and/or problem
      
      * Default use CK_TILE_FLOAT_TO_FP8_STOCHASTIC rounding mode
      
      * fix scratch in fp8 kernel
      
      * update some readme
      
      * fix merge from upstream
      
      * sync with upstream
      
      * sync upstream again
      
      * sync 22
      
      * remove unused
      
      * fix clang-format
      
      * update README of ck_tile example
      
      * fix several issue
      
      * set the minimal Python version to 3.8
      
      * remove ck_tile example from default cmake target like all/install/check
      
      * remove mistake
      
      * 1) support receipt in generate.py 2) use simplified mask type 3) change left/right to pass into karg
      
      * fix some bug in group-mode masking and codegen. update README
      
      * F8 quantization for FMHA forward (#1224)
      
      * Add SAccElementFunction, PComputeElementFunction, OAccElementFunction in pipeline
      
      * Add element function to fmha api
      
      * Adjust P elementwise function
      
      * Fix bug of elementwise op, our elementwise op is not inout
      
      * Add some elementwise ops, preparing for quantization
      
      * Let generate.py can generate different elementwise function
      
      * To prevent compiler issue, remove the elementwise function we have not used.
      
      * Remove f8 pipeline, we should share the same pipeline even in f8
      
      * Remove remove_cvref_t
      
      * Avoid warning
      
      * Fix wrong fp8 QK/KV block gemm setting
      
      * Check fp8 rounding error in check_err()
      
      * Set fp8 rounding error for check_err()
      
      * Use CK_TILE_FLOAT_TO_FP8_STANDARD as default fp8 rounding mode
      
      * 1. codegen the f8 api and kernel
      2. f8 host code
      
      * prevent warning in filter mode
      
      * Remove not-in-use elementwise function kargs
      
      * Remove more not-in-use elementwise function kargs
      
      * Small refinements in C++ source files
      
      * Use conditional_t<> to simplify code
      
      * Support heterogeneous argument for binary function types
      
      * Re-use already-existing scales<> functor template
      
      * Fix wrong value produced by saturating
      
      * Generalize the composes<> template
      
      * Unify saturates<> implementation
      
      * Fix type errors in composes<>
      
      * Extend less_equal<>
      
      * Reuse the existing template less_equal<> in check_err()
      
      * Add equal<float> & equal<double>
      
      * Rename check_err() parameter
      
      * Rename check_err() parameter
      
      * Add FIXME comment for adding new macro in future
      
      * Remove unnecessary cast to void
      
      * Eliminate duplicated code
      
      * Avoid dividing api pool into more than 2 groups
      
      * Use more clear variable names
      
      * Use affirmative condition in if stmt
      
      * Remove blank lines
      
      * Do not perfect-forward in composes<>
      
      * To fix compile error, revert generate.py back to 4439cc107dd90302d68a6494bdd33113318709f8
      
      * Fix bug of p element function
      
      * Add compute element op to host softmax
      
      * Remove element function in api interface
      
      * Extract user parameter
      
      * Rename pscale and oscale variable
      
      * rename f8 to fp8
      
      * rename more f8 to fp8
      
      * Add pipeline::operator() without element_functor
      
      * 1. Remove deprecated pipeline enum
      2. Refine host code parameter
      
      * Use quantization range as input (the scale/saturate arithmetic is sketched after this entry)
      
      * 1. Rename max_dtype to dtype_max.
      2. Rename scale to scale_s
      3. Add init description
      
      * Refine description
      
      * prevent early return
      
      * unify _squant kernel name in cpp, update README
      
      * Adjust the default range.
      
      * Refine error message and bias range
      
      * Add fp8 benchmark and smoke test
      
      * fix fp8 swizzle_factor=4 case
      
      ---------
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
      Co-authored-by: carlushuang <carlus.huang@amd.com>
      
      ---------
      Signed-off-by: dependabot[bot] <support@github.com>
      Co-authored-by: illsilin <Illia.Silin@amd.com>
      Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
      Co-authored-by: Jing Zhang <jizha@amd.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
      Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
      Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
      Co-authored-by: rocking <ChunYu.Lai@amd.com>
      db376dd8
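      For the FP8 quantization path, "Use quantization range as input" boils down to deriving a scale from the user-provided range and saturating at the FP8 maximum. A hedged host-side sketch of that arithmetic (dtype_max and scale_s follow the commit wording; the function itself is illustrative, not the repository's scales<>/saturates<> functors):

      ```cpp
      #include <algorithm>
      #include <cstdio>

      // Sketch: scale a value into FP8 range and saturate at the representable max.
      // 'dtype_max' is the largest finite value of the target FP8 format
      // (e.g. 448.0f for e4m3); 'range' is the expected |value| range from the user.
      float quantize_to_fp8_range(float value, float range, float dtype_max)
      {
          const float scale_s = dtype_max / range;          // derive scale from the range
          const float scaled  = value * scale_s;            // scales<> step
          return std::clamp(scaled, -dtype_max, dtype_max); // saturates<> step
      }

      int main()
      {
          std::printf("%f\n", quantize_to_fp8_range(3.0f, 2.0f, 448.0f)); // saturates at 448
          return 0;
      }
      ```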